Abstract
Purpose:
The Multilevel word Accuracy Composite Scale (MACS) is a novel whole-word measure of speech production accuracy designed to evaluate behaviors commonly targeted in motor-based intervention for childhood apraxia of speech (CAS). The MACS yields a composite score generated through ratings of segmental accuracy, word structure maintenance, prosody, and movement transition. This study examined the validity of the MACS through comparison to established measures of speech accuracy. Reliability was also examined within and between practicing speech-language pathologists (SLPs).
Method:
The MACS was used to rate 117 tokens produced by children with severe CAS. Ratings were performed in the laboratory setting by two expert raters and by practicing SLPs (N = 19). Concurrent validity was estimated by comparing expert MACS ratings (i.e., the MACS score and each component rating) to established measures of speech accuracy (percent phonemes correct and the 3-point rating scale) using correlational analyses. Reliability was examined using the intraclass correlation coefficient to assess interrater reliability of expert ratings, as well as inter- and intrarater reliability of SLP ratings.
Results:
Correlation analyses between MACS ratings (i.e., MACS score and component ratings) and existing measures of speech accuracy revealed small to large positive correlations between measures. Reliability analyses revealed moderate to excellent reliability for MACS ratings performed by expert raters and between (interrater) and within (intrarater) SLP raters.
Conclusions:
Analyses of concurrent validity indicate that the MACS aligns with established measures, yet contributes novel elements for rating speech accuracy. Results further support the MACS as a reliable measure for rating speech accuracy in children with severe speech impairment for ratings performed by expert raters and practicing clinicians.
Childhood apraxia of speech (CAS) is a complex, multivariate speech disorder that involves deficits in praxis (ability to plan, organize, and sequence movements of speech structures; Campbell et al., 2003; Davis et al., 1998; Forrest, 2003; Iuzzini & Forrest, 2010) in the absence of neuromuscular impairment (American Speech-Language-Hearing Association [ASHA], 2007). The underlying deficit of CAS lies within sensorimotor planning and programming of speech movements (Grigos et al., 2015; Shriberg et al., 2012; Terband et al., 2009), which results in a range of segmental and suprasegmental speech deficits (Shriberg et al., 2017). The impact of this disorder is widespread, as children with CAS have highly unintelligible speech, commonly experience comorbid language and literacy disorders (ASHA, 2007; Lewis et al., 2004, 2015; Miller et al., 2019), and can display challenges in speech production through later school-age years (Lewis et al., 2004), adolescence (Lewis et al., 2015), or adulthood (Cassar et al., 2022). Slow progress in treatment is often reported in children with CAS largely due to poor response to traditional forms of intervention for speech sound disorders (Strand, 2020). Intervention for CAS requires a specialized approach that addresses the underlying motor deficit and incorporates a motor-based theoretical framework in the design and implementation of intervention (Maas et al., 2014). One challenge, however, relates to the measurement of treatment outcomes in a way that reliably quantifies the segmental and suprasegmental speech elements targeted in intervention. The objective of this study is to present a novel measure of speech production accuracy that reflects key behaviors targeted in motor-based intervention for CAS, the Multilevel word Accuracy Composite Scale (MACS), and to explore the validity and reliability of this measure for ratings performed by expert laboratory raters and practicing speech-language pathologists (SLPs).
Measuring Treatment Gains in CAS: Probe Data
Motor-based intervention for CAS targets accuracy of the movement gesture across sound and syllable sequences and not the individual phoneme (Maas et al., 2014; Strand, 2020), which has important implications for how treatment gains should be measured. It is essential that treatment outcome measures reflect specific behaviors targeted in treatment. For instance, progress in treatment of a single-sound articulation disorder would be evaluated according to accuracy of the targeted phoneme and compared to a criterion level (e.g., 90% accuracy). Accuracy of the movement gesture targeted within treatment of CAS is not as easily measured. In general, the measurement of treatment gains in CAS is not as straightforward as other speech sound disorders given the range of segmental and suprasegmental features observed in CAS. These errors include consonant and vowel distortions, inconsistent voicing errors, difficulty achieving smooth movement transitions across sounds and syllables, articulatory groping, intrusive schwa, increased difficulty with multisyllabic words, syllable segregation, slow speech rate, and lexical stress errors (Shriberg et al., 2017). Measures of speech sound accuracy alone, such as percent phonemes correct (PPC; Shriberg et al., 1997), do not reflect this complex of symptoms. An outcome measure is needed that captures change with respect to the segment (consonant, vowel accuracy), syllable (ability to maintain targeted word shape), prosody (appropriate use of stress and intonation), and overall movement gesture (ability to produce a smooth co-articulatory transition across sounds and syllables).
Due to the complexity of capturing segmental and suprasegmental features of speech production in CAS, probe data are recommended as one approach for gathering data to quantify progress over an intervention period within the context of motor-based intervention or generalization to untreated words (Jing & Grigos, 2022; Maas et al., 2019; Murray et al., 2015; Namasivayam et al., 2021; Preston et al., 2017; Strand, 2020; Strand & Debertine, 2000; Strand et al., 2006), though there is much variation and lack of consensus in outcome measures used in clinical practice (Morgan et al., 2021). When using probe data to measure treatment gains, an individualized probe word list is designed based on abilities of the child (Strand & Debertine, 2000; Strand et al., 2006). Alternatively, other researchers have used a common set of probe words to evaluate generalization to words not included in treatment sessions (Murray et al., 2015; Namasivayam et al., 2021). Probe words are then presented to children before, during, and after treatment to determine overall changes in speech production accuracy. This approach avoids examining performance during practice, which may not reflect a child's degree of accuracy outside of treatment (Maas et al., 2008).
Primary outcome measures of probe data include the use of binary ratings (Murray et al., 2015; Namasivayam et al., 2021; Preston et al., 2017) and a 3-point scale (Maas et al., 2019; Strand & Debertine, 2000; Strand et al., 2006). These outcome ratings were designed to assess aspects of speech production commonly impacted in children with CAS, though using different types of measurement. Binary measures provide a 0/1 rating to indicate whether a word was produced accurately. In Rapid Syllable Transition treatment (ReST; Murray et al., 2015), accurate word production (“1”) is determined according to articulatory accuracy, smooth transitions across sounds and syllables, and prosody. When measuring treatment outcomes of Prompts for Restructuring Oral Muscular Phonetic Targets (Namasivayam et al., 2021) intervention, binary ratings reflect accuracy of speech motor control for jaw, labial, and lingual movements, and voicing transitions. The 3-point rating scale also assesses whole-word accuracy and uses a 0/1/2 rating to evaluate probe data (Maas et al., 2012, 2019; Maas & Farinella, 2012; Strand & Debertine, 2000; Strand et al., 2006). This method is commonly used in integral stimulation treatment approaches. For instance, Dynamic Temporal and Tactile Cueing intervention (Strand, 2020) measures word accuracy as follows: “0” = several segmental and suprasegmental errors; “1” = minor error impacting no more than one distinctive feature; “2” = accurate production (Jing & Grigos, 2022). A similar measurement scale was also applied by Maas et al. (2019), with accuracy evaluated by syllable rather than solely at the whole-word level. An advantage of the 3-point rating scale is that it includes an additional grade of measurement (“1”), which offers the ability to measure more nuanced changes in speech accuracy over a treatment period.
While these existing outcome measures have considered critical elements of speech accuracy commonly addressed in the treatment of CAS, they have all combined differing observations into one score.
There are benefits to using a binary accuracy rating or a 3-point rating scale for assessing probe data. For one, these are holistic measures that reflect accuracy at the whole-word level and evaluate segmental and suprasegmental features. Clinicians can evaluate features unique to CAS using one measure, rather than a combination of measures to reflect segmental accuracy (e.g., percent phonemes correct) or stress accuracy (e.g., lexical stress accuracy). In addition, judgments can be made online within the context of probe testing, which results in an efficient approach to measuring probe accuracy. Finally, use of a broader rating scale addresses difficulties related to reliability in the assessment of severely impaired speech. Traditional measures of speech accuracy, such as PPC, are reported to have poor interrater reliability (Allison, 2020), whereas rating scales offer promise as a more reliable approach. Jing and Grigos (2022) evaluated the reliability of the 3-point rating scale when judging speech accuracy of tokens produced by children with CAS and other speech sound disorders in a group of 30 SLP raters. SLP ratings were compared to consensus ratings, and reliability was calculated using the kappa statistic. Kappa values across SLP raters ranged from .53 (moderate agreement) to .81 (excellent agreement). On average, the SLPs displayed substantial agreement with the consensus rating (mean kappa = .69; SD = 0.08), supporting the 3-point rating scale as a reliable tool for measuring speech accuracy.
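As a rough illustration of how such an agreement statistic behaves, Cohen's kappa for two raters can be computed from scratch. This is a hypothetical sketch (Jing and Grigos's actual analysis compared each SLP to a consensus rating and used their own computation), intended only to show how chance-corrected agreement is derived:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same tokens (e.g., 0/1/2 ratings).

    Illustrative sketch of the agreement statistic; not the authors' code.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of agreement
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal category frequencies
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / (n * n)
    return (po - pe) / (1 - pe)
```

With identical rating vectors the statistic returns 1.0 (perfect agreement), and values shrink toward 0 as agreement approaches chance level.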
Binary measures and the 3-point rating scale combine segmental and suprasegmental measures within one rating scale. As a result, they do not reflect subtle improvements in speech accuracy. In addition, the high prevalence of speech sound distortions (Grigos et al., 2015; Jing & Grigos, 2022; Pollock & Hall, 1991; Shriberg et al., 2017) and prosodic errors (Shriberg et al., 2017; Strand et al., 2013) in CAS make clinical decision-making extremely challenging. As a result, clinicians often struggle to distinguish between productions that are accurate as compared to those that are close approximations of the target (Jing & Grigos, 2022). Rating scales offer promise as a means to systematically assess accuracy of the whole word instead of the individual sounds; however, they still do not provide a detailed assessment of which elements of speech production are making gains over the treatment period.
The MACS
The MACS was developed to address the need to better quantify segmental and suprasegmental elements of speech production in children with CAS while maintaining elements of scaled ratings, which have shown good reliability (Jing & Grigos, 2022). The MACS is a whole-word measure of accuracy that evaluates four components of speech production: segmental accuracy, word structure maintenance, prosody, and movement transition across sounds and syllables. These elements were selected as they reflect characteristic features of CAS that are commonly addressed over the course of treatment. While other measures have considered these elements (e.g., Sounds/Beats/Smoothness in ReST; Murray et al., 2015), there is no method to date that has rated each of these features individually and in a systematic manner. The MACS therefore allows for a more fine-grained and detailed assessment of speech features that can better inform treatment planning and measurement of treatment gains.
The Following Components Are Evaluated Within the MACS:
Segmental accuracy: Segmental errors occur frequently in CAS, including consonant/vowel substitutions, omissions, or distortions (Grigos & Case, 2018; Iuzzini-Seigel et al., 2017; Shriberg et al., 2017). The segmental accuracy component assesses combined accuracy of all consonants and vowels within a word, including distortions.
Word structure: Sound and/or syllable omissions and additions are prevalent in the speech output of children with CAS resulting in reduced ability to maintain the targeted word shape (Shriberg et al., 2012). The word structure component assesses the maintenance of the targeted word shape, regardless of segmental accuracy.
Prosody: Prosodic errors are a characteristic feature of CAS, including inaccurate lexical stress, equal and exaggerated stress, and syllable segregation (Kopera & Grigos, 2019; Shriberg et al., 2003, 2017). The prosody component assesses accuracy of the stress contour in words with more than one syllable.
Movement transition: Poor co-articulatory transitions characterize speech production in children with CAS (Grigos & Case, 2018; Iuzzini-Seigel et al., 2017; Shriberg et al., 2012, 2017). These may present as difficulty achieving the initial articulatory configuration for a phoneme, lengthened and disrupted transitioning of movements between sound sequences, or reduced ability to smoothly transition speech movements between syllables. Inaccurate prosody, consonant/vowel omissions or additions, effortful production of consonants/vowels, and shortened/lengthened vowels are also indicative of challenges related to movement transitioning. The movement transition component assesses smoothness and fluency of transitions between sounds and syllables.
Calculating the MACS Score per Production:
Segmental accuracy, word structure, prosody, and movement transition are each rated on a binary scale (0 = inaccurate, 1 = accurate). The MACS score reflects the average of these ratings for each word, resulting in a score that ranges from 0 (all inaccurate) to 1 (all accurate). The closer the MACS score is to 1, the more accurate the production of that token. Table 1 illustrates the scoring rubric for single and multisyllable words.
For single syllable words, the MACS score is the average of the segmental accuracy, word structure, and movement transition ratings: (segmental accuracy + word structure + movement transition)/3. Prosody is rated as “n/a” for single syllable words.
For multisyllable words, the MACS score is the average of all four ratings: (segmental accuracy + word structure + prosody + movement transition)/4.
Table 1.
Calculation of the Multilevel word Accuracy Composite Scale (MACS).
Component | “0” rating | “1” rating | Single syllable word: /pɑp/➔[bɑ] | Multisyllable word: /beɪbi/➔[bibi] |
---|---|---|---|---|
Segmental accuracy | Substitution, omission, distortion of consonants and/or vowels | Accurate consonants and vowels | C1 substitution; C2 omission = 0 | V1 error = 0 |
Word structure | Inaccurate/missing word structure (e.g., CVC ➔ CV) | Accurate word structure | CVC ➔ CV = 0 | CVCV ➔ CVCV = 1 |
Prosody | Segmentation; equal or inaccurate stress; syllable reduction | Accurate prosody | single syllable word = n/a | equal stress = 0 |
Movement transition | Lengthened, disrupted transitions between sounds and syllables; poor initial configuration; effortful production; inaccurate prosody; phoneme omissions or additions | Smooth and effortless transitions | no transition to final consonant due to consonant omission = 0 | inaccurate prosody = 0 |
MACS score | | | (0 + 0 + 0)/3 = 0 | (0 + 1 + 0 + 0)/4 = 0.25 |
Note. C = consonant; V = vowel; n/a = not applicable.
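The composite calculation illustrated in Table 1 can be expressed compactly in code. The following is a minimal sketch; the function name and interface are illustrative only and are not part of the published protocol:

```python
def macs_score(segmental, word_structure, movement, prosody=None):
    """Average the binary MACS component ratings (0 = inaccurate, 1 = accurate).

    prosody is None for single syllable words, where only three components
    are rated; for multisyllable words all four ratings are averaged.
    """
    ratings = [segmental, word_structure, movement]
    if prosody is not None:  # multisyllable word: prosody is rated
        ratings.append(prosody)
    if not all(r in (0, 1) for r in ratings):
        raise ValueError("component ratings must be 0 or 1")
    return sum(ratings) / len(ratings)

# Worked examples from Table 1:
# /pɑp/ -> [bɑ]: all three rated components inaccurate
print(macs_score(0, 0, 0))             # 0.0
# /beɪbi/ -> [bibi]: only word structure accurate
print(macs_score(0, 1, 0, prosody=0))  # 0.25
```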
There are many possible benefits to using this multilevel measure. It is extremely challenging to rate speech errors in children with severe speech impairment. The MACS provides clear guidelines for auditory perceptual judgments that will support more accurate ratings of speech errors. This scale also provides fine-grained information regarding the area of speech production that is responding to treatment. For instance, a child may struggle with segmental accuracy yet make progress in achieving the targeted word shape. Measuring each of these components of speech production can also highlight errors that are more likely to persist if not adequately addressed in treatment, such as inaccurate prosody or movement transitions. For instance, children may achieve segmental and word shape accuracy, yet continue to display prosodic deficits. Use of the MACS may ensure that clinicians consider all elements of speech production before moving on to novel targets.
An additional benefit of the MACS is that it captures the interactional nature of speech production. For instance, if a final consonant is omitted in a single syllable consonant–vowel–consonant (CVC) word, credit would not be given for any components of the MACS as (a) a segment was omitted, (b) the word structure was altered, and (c) the movement transition was impaired as closure into the word final position was not achieved. Similarly, if a syllable were omitted in a CVCV word, credit would not be received for any component of the MACS as (a) segments were omitted, (b) the word structure was altered, (c) the movement transition was impaired as the child did not sequence movements across two syllables, and (d) prosody as related to stress patterning was inaccurate as the targeted bisyllabic stress contour was not achieved.
Finally, as previously stated, it is challenging to reliably rate speech production errors in children with severe speech impairment. We believe that a potential benefit of a multicomponent measure such as the MACS could be high degrees of inter- and intrarater reliability due to the clearly delineated areas of speech production and clear guidelines for assessment.
Objectives and Hypotheses
To examine use of the MACS as a clinical rating scale, analyses were completed across two phases. First, laboratory ratings were used to evaluate reliability between expert raters and to estimate concurrent validity of this novel scale for rating segmental accuracy, word structure, prosody, and movement transition through comparison to existing measures of speech accuracy (i.e., percent phonemes correct, 3-point rating scale). High interrater reliability between laboratory ratings was anticipated. We hypothesized that MACS ratings would align with existing measures of speech accuracy (e.g., PPC, 3-point rating scale) while contributing novel information related to segmental and suprasegmental elements of speech production. Second, we examined the inter- and intrarater reliability of the MACS across SLP raters. We hypothesized that parsing out the four components of segmental accuracy, word structure, prosody, and movement transition would support listener judgments and result in good agreement within and between SLP raters.
Method
Participants
Laboratory Raters
Two expert raters (first and second author) performed MACS ratings in the laboratory setting. Expert Rater 1 (J.C.) is a clinical researcher and SLP with more than 15 years of clinical experience in pediatric speech sound disorders. Expert Rater 2 (E.W.) is a clinical researcher and SLP with more than 10 years of clinical experience. Both laboratory raters have extensive experience in narrow transcription and rating of speech produced by children with CAS in both clinical and research settings. They have no history of hearing loss, and English is their dominant language.
SLP Raters
A total of 29 SLPs were recruited for this experiment, and ratings from 19 of these SLPs were included in this study. Data from 10 of the SLP raters could not be included, primarily due to challenges associated with remote data collection and the strict inclusion protocol used to ensure the experiment was completed according to the procedure described below. SLP data were excluded for the following reasons: (a) the SLP did not complete the experimental task (n = 2), (b) English was not the SLP's first language (n = 1), (c) the SLP did not follow the experimental protocol to wear headphones (n = 2), (d) the SLP showed poor accuracy on initial training tasks (n = 2), and (e) the SLP did not pass catch trials embedded within the experiment to ensure attention to task (n = 3). Participating SLPs had a mean of 17.47 years of clinical practice (SD = 11.82), with a range of 4–43 years of experience. All SLPs had clinical experience working with children with CAS and were recruited via flier, word of mouth, and social media inquiries. Human subjects approval was obtained through the Hofstra University Institutional Review Board, and consent was obtained from all participants.
Stimuli
Stimuli included 90 probe words produced by five children with severe CAS ages 2;6 to 3;11 (years;months; M = 36.8 months; SD = 6.76) who participated in ongoing treatment research at New York University (Grigos et al., in press). To be eligible for the treatment study, children were required to meet criteria for a diagnosis of CAS, in addition to normal hearing, intact oral structure and functional integrity, and English as the child's primary language. Exclusion criteria included comorbid history of other neurodevelopmental disorders (e.g., autism spectrum disorder, genetic disorder, intellectual disability) and coexisting dysarthria. The diagnosis of CAS was determined following a motor speech evaluation and comprehensive speech-language assessment using criteria well defined in the literature (Case & Grigos, 2020; Grigos & Case, 2018; Grigos et al., 2015; Shriberg et al., 2017; Strand et al., 2013). To meet criteria for CAS, children were required to display at least four features from the Mayo Diagnostic checklist across more than two speaking contexts, including a dynamic motor speech evaluation, single word productions, syllable sequencing tasks, and/or connected speech (see Table 2). The diagnosis was made independently by two SLPs with extensive experience in differential diagnosis of CAS (first and third authors).
Table 2.
Characteristics of CAS of five children diagnosed with CAS represented in experimental stimuli.
Speakers | Age (months) | No. of Mayo features | Qualitative description of features observed across connected speech, dynamic assessment, and single word production |
---|---|---|---|
P1 | 30 months | 9 | Severe articulatory groping at the onset of words; effortful productions; vowel distortions; timing errors related to nasality and voicing; syllable segregation, and equal stress; inconsistent errors |
P2 | 47 months | 8 | Lengthened and disrupted transitions between sounds/syllables; pervasive intrusive schwa at word boundary; excessive aspiration at word boundary; severe vowel distortion; slow rate and segmented speech production; inconsistent errors |
P3 | 40 months | 8 | Lengthened and disrupted co-articulatory transitions between sounds; vowel distortions and distorted consonant substitutions; excess and inaccurate lexical stress; timing errors related to nasality; articulatory groping; inconsistent errors; intrusive schwa |
P4 | 33 months | 10 | Effortful speech production with disrupted transitions between sounds/syllables; vowel distortions resulting in shortened vowel and segmented diphthongs; distorted consonant substitutions; excess and equal lexical stress; inconsistent voicing errors; inconsistent errors; intrusive schwa |
P5 | 34 months | 7 | Lengthened and disrupted co-articulatory transitions between sounds; vowel distortions and distorted consonant substitutions; syllable segregation and inaccurate lexical stress; timing errors related to nasality and voicing; inconsistent errors |
Note. CAS = childhood apraxia of speech.
Experimental stimuli consisted of 90 probe words (presented in Appendix B) that were selected from a larger dataset collected over the course of Dynamic Temporal and Tactile Cueing (Strand, 2020) intervention as part of a larger treatment efficacy study (Grigos et al., in press). Tokens from five of the seven children were selected based on the quality of the acoustic files for the purposes of the analyses performed in this study. Probe data were collected before treatment (baseline), 2 weeks after treatment (posttreatment), and 6 weeks after treatment (maintenance). Stimuli were evenly distributed across session (30 probe words from each time point) and child (18 productions from each child, with six productions at each time point). Experimental tokens selected for the current work reflected significant gains in speech accuracy over the treatment period from baseline to posttreatment (β = 0.03, SE = 0.01, t = 2.71, p = .006) and from baseline to maintenance (β = 0.13, SE = 0.01, t = 12.36, p < .001). To provide a representative sample of speech performance in children with severe CAS over a treatment period, tokens were not otherwise balanced for accuracy.
To assess intrarater reliability, 20% of tokens were repeated, resulting in a total of 108 tokens produced by children with CAS. Repeated tokens were also balanced by participant and session. Nine catch trials (i.e., 10% of probe words) produced by a child with typical speech-language development (female, age 6;0) were also included to ensure that attention was maintained throughout the experiment (Fernández et al., 2019; Nightingale et al., 2020; Peterson et al., 2022). The final stimulus set therefore consisted of 117 tokens: 90 experimental probe words, 18 repeated tokens for intrarater reliability, and nine catch trials.
Probe data were audio-recorded using a Fostex digital recorder at a sampling rate of 44.1 kHz in a sound-treated room. A headset microphone was used for probe data collection in two of the children with a 5-cm mouth-to-microphone distance. A tabletop microphone was used for children who did not tolerate the headset microphone (P1, P5, P6). All tokens were reviewed by the first and second authors. A token was deemed acceptable if it was free of any background noise (e.g., clinician talk-over, chair scraping, child tapping the table while speaking), with the first acceptable token chosen to be included in the experiment. Acceptable tokens were amplitude normalized using a Praat script to ensure that all sound files were approximately the same volume. Amplitude normalization was necessary given use of a tabletop microphone for three children. To ensure that sound files did not sound “cut off” when played by the listeners, 50 ms of silence was added to the onset of each sound file using a Praat script.
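The study performed both preprocessing steps with Praat scripts. Purely as an illustration of the idea, an equivalent normalization and silence-padding routine can be sketched with NumPy; note that the RMS-based scaling and target level here are assumptions (Praat scales intensity in dB), not the study's actual script:

```python
import numpy as np

def normalize_and_pad(samples, sr=44100, target_rms=0.1, pad_ms=50):
    """Scale a mono waveform to a common RMS level and prepend silence.

    target_rms and the RMS-based scaling are illustrative assumptions;
    the study used Praat scripts for both steps.
    """
    samples = np.asarray(samples, dtype=np.float64)
    rms = np.sqrt(np.mean(samples ** 2))
    if rms > 0:
        samples = samples * (target_rms / rms)  # amplitude normalization
    silence = np.zeros(int(sr * pad_ms / 1000))  # 50 ms of leading silence
    return np.concatenate([silence, samples])
```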
Procedure
Laboratory Ratings
Laboratory ratings were performed (a) to evaluate reliability of the MACS between two expert raters and (b) to achieve consensus ratings for the purpose of comparing MACS ratings to existing measures of speech accuracy in CAS, including PPC (Shriberg et al., 1997) and the 3-point rating scale (Jing & Grigos, 2022). Expert raters independently completed MACS ratings using the online platform Gorilla Experiment Builder (Irvine et al., 2018). Ratings were conducted using high-quality headphones in a quiet setting. Training was not provided to laboratory raters given their roles in developing this measure. For all ratings, tokens were presented in a randomized order, and raters were allowed up to five opportunities to listen to each production. Following presentation of a token, all MACS components were rated per word before proceeding to the next trial. Audio files did not contain any identifying information (i.e., participant, treatment phase) to avoid the risk of potential bias. Completion of the experimental portion took an average of 25 min across raters (J.C. = 28 min; E.W. = 22 min).
Narrow transcription and 3-point ratings were conducted at a separate time point. Each rater independently rated and transcribed each of the experimental tokens using only the acoustic signal for each word. Consensus ratings were then conducted for all ratings (i.e., MACS, narrow transcription, 3-point rating). A modification of the Shriberg et al. (1984) consensus procedure was used to review tokens in disagreement. Laboratory raters simultaneously listened to these tokens up to three times each and came to an agreement across all ratings (MACS, 3-point rating, transcription). Upon achieving consensus on transcriptions and ratings, PPC was computed for each token to indicate the accuracy of consonants and vowels in each token. Consensus ratings across all measures were then used to examine the relationship between MACS ratings, PPC, and the 3-point rating scale.
Analyses
The absolute mean difference and standard deviation were calculated between the MACS ratings, 3-point ratings, and PPC. Interrater reliability between laboratory ratings was analyzed using intraclass correlation coefficients (ICCs). ICCs were generated using the “irr” package in R (Gamer et al., 2012; R Core Team, 2021), based on a two-way random effects model for agreement on average ratings for all variables (Hallgren, 2012). The ICC, 95% confidence interval (CI), and F tests of significance for interrater reliability were calculated for the MACS score (the average of the segmental accuracy, word structure, prosody, and movement transition ratings) and for each individual component rating. Qualitative interpretation of ICC values followed the standards outlined by Koo and Li (2016), as seen in Table 3.
Table 3.
Qualitative interpretation of ICC values (Koo & Li, 2016).
ICC value | Interpretation |
---|---|
< .50 | Poor |
Between .50 and .75 | Moderate |
Between .75 and .90 | Good |
> .90 | Excellent |
Note. ICC = intraclass correlation coefficient.
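The study computed ICCs with the R `irr` package. For readers who want to see the underlying model, the two-way random-effects, absolute-agreement, average-measures coefficient (ICC(2,k) in Shrout and Fleiss's notation) can be sketched from scratch; this is an illustrative re-derivation, not the analysis code:

```python
import numpy as np

def icc2k(ratings):
    """ICC(2,k): two-way random effects, absolute agreement, average of k raters.

    ratings: n_subjects x k_raters matrix. Follows the Shrout & Fleiss /
    McGraw & Wong mean-squares formulation; a sketch, not the R `irr` package.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ms_r = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)  # rows (subjects)
    ms_c = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)  # columns (raters)
    sse = np.sum((x - x.mean(axis=1, keepdims=True)
                    - x.mean(axis=0, keepdims=True) + grand) ** 2)
    ms_e = sse / ((n - 1) * (k - 1))                             # residual
    return (ms_r - ms_e) / (ms_r + (ms_c - ms_e) / n)

# Two raters in perfect agreement across four tokens
print(icc2k([[1, 1], [0, 0], [1, 1], [0, 0]]))  # 1.0
```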
Correlation analyses were performed using consensus ratings to examine the degree to which consensus MACS ratings and each of the individual components aligned with the 3-point rating scale and PPC for laboratory ratings. Given that both the MACS total score and the 3-point rating scale are ranked variables, Spearman rank correlations were used to measure correlations between the MACS total score and PPC, as well as between the MACS total score and the 3-point rating scale (McDonald, 2014). As the MACS component variables (e.g., segmental accuracy) are binary ratings, point-biserial correlations were performed when comparing these ratings with the 3-point rating scale and PPC (Brown, 2001). The size of correlations was interpreted using Cohen's (1988) effect size conventions: .50–1.00 = large; .3–.5 = medium; .1–.3 = small; < .1 = absent.
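These correlation analyses can be sketched with SciPy; the rating vectors below are made-up stand-ins for the consensus data, and the helper function simply encodes Cohen's (1988) size conventions:

```python
from scipy.stats import spearmanr, pointbiserialr

# Hypothetical consensus ratings for five tokens (illustrative data only)
macs_total = [0.0, 0.25, 0.5, 0.75, 1.0]     # ranked composite scores
ppc        = [10.0, 30.0, 55.0, 70.0, 95.0]  # percent phonemes correct
segmental  = [0, 0, 0, 1, 1]                 # binary component rating

# Ranked variables: Spearman rank correlation
rho, p = spearmanr(macs_total, ppc)

# Binary component vs. continuous measure: point-biserial correlation
r_pb, p_pb = pointbiserialr(segmental, ppc)

def effect_size(r):
    """Cohen's (1988) interpretation of correlation magnitude."""
    r = abs(r)
    if r >= .5:
        return "large"
    if r >= .3:
        return "medium"
    if r >= .1:
        return "small"
    return "absent"
```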
SLP Ratings
SLP raters first completed a training session and then the experimental task to evaluate the reliability of the MACS across practicing clinicians. The training outlined details of the MACS and offered guided practice in implementing the rating scale. Within 1 week of completing the MACS training, SLP raters completed the experimental task. Both tasks were completed remotely using the SLP raters' personal devices.
Training. All SLP raters completed a guided tutorial, designed by the first and third authors, prior to the experimental task. SLP raters were instructed to complete the tutorial in a quiet setting while wearing headphones. As additional support, SLP raters were given a MACS Information Sheet (Appendix A) with detailed information about each component of the rating scale in addition to scoring criteria. The tutorial was approximately 35 min in duration and is accessible online, including a template Excel sheet for inputting MACS ratings (http://www.casespeechlab.com).
The tutorial consisted of a narrated presentation with detailed explanations of each component of the MACS (segmental accuracy, word shape, prosody, movement transition) and what qualifies a “0” or “1” rating. Each component was presented individually and followed with sample audio clips where SLP raters practiced rating a word using that particular component. These tokens did not include any words or speakers included in the experimental task. Once all four components of the MACS were reviewed, SLP raters were then provided guided practice in rating words using all elements simultaneously. None of the practice tokens were included in the experimental task.
Following the tutorial, SLP raters completed three training blocks with five words in each block using Google Forms. In each block, the SLP raters played an audio clip and rated each production using the MACS. The first block contained five productions, both accurate and inaccurate, by a 6-year-old child with a speech sound disorder. The second and third blocks contained five tokens each, which were produced by children with severe CAS aged 2;6–3;11 (the same five speakers as those who produced experimental stimuli, resulting in two productions per child). Tokens used for training were not included in the experimental task. Once each block was completed, immediate feedback was provided regarding rating accuracy. Performance accuracy on training blocks was recorded in a central spreadsheet through Google Forms. At least 80% accuracy across training blocks was required to participate in the experiment. Overall, SLP raters displayed a mean accuracy of 86.9% (SD = 5.84) across all three training blocks (Block 1: 88% accuracy; Block 2: 84% accuracy; Block 3: 87% accuracy). Two SLP raters did not achieve the 80% cutoff and were not included in the experiment.
Experimental task. The experiment was built using Gorilla Experiment Builder, a validated and reliable website for conducting experimental research (Irvine et al., 2018). Once training was complete, SLP raters received a unique link directing them to the Gorilla website to complete the experiment. The experiment was completed within 1 week of the MACS training tutorial, with an average of 5–6 days elapsing between sessions. SLP raters were asked to complete the experiment in one sitting and while wearing headphones.
The experiment consisted of three parts: (a) an initial questionnaire and headphone check, (b) MACS rating blocks, and (c) a post-experiment questionnaire. The initial questionnaire was designed to ensure that SLP raters were wearing headphones, were completing the experiment in a quiet room, and had the MACS Information Sheet as a reference. Following the questionnaire, a headphone check task was administered in which three sound files varying in loudness were played. SLP raters were asked to identify “the softest sound,” which could be identified only while wearing headphones. Those who did not pass this task were not included in the experiment (n = 2); when asked, these individuals confirmed that they had not been wearing headphones while completing the task. Upon passing the headphone check, SLP raters were directed to the experiment.
The experiment consisted of three blocks, each containing 39 sound files (total sound files = 117). Participants were notified of the start and end of each block to keep track of their progress through the experiment and to offer an opportunity to take a brief break, if needed. Each block consisted of 30 experimental trials, six repeated trials (for intrarater reliability), and three catch trials (trials/block = 39), presented in a randomized order for each participant. Experimental tokens were produced by all five children across all three time points (baseline, posttreatment, and maintenance). Randomized ordering was used to reduce potential effects of listener bias (Hustad, 2020) or talker familiarity (Levi et al., 2011). SLP raters were not aware of any identifying information related to the children producing probe data or the point in treatment. SLP raters were allowed to play each sound file up to 5 times and were asked to rate all MACS components for each word at the same time. Responses for each of the rating scale components (segmental accuracy, word structure, prosody, movement transition) were required in order to move on to the next trial. SLP raters rated each MACS component using a binary rating of 0 (inaccurate) or 1 (accurate). The prosody category also offered an “n/a” option for single-syllable words, since prosody was not rated for these tokens.
Upon completion of the experiment, a final questionnaire was provided to SLP raters. SLP raters were asked to rate the loudness of the environment in which they completed the task (1 = noisy; 5 = quiet), as well as how challenging they found the experiment to be (1 = not challenging; 5 = difficult). They were also asked to indicate what type of headphones they were wearing. On average, SLP raters rated their environment as 4.5 (SD = 1.02), indicating a quiet listening environment, and difficulty as 3.52 (SD = 1.07), suggesting a moderate degree of difficulty across SLP raters. Most SLP raters reported wearing in-ear headphones (13 of 19 SLP raters) as compared to over-the-ear headphones (six of 19 SLP raters).
Analyses
Data were downloaded from Gorilla, and data reduction was completed using Excel pivot tables. The MACS score was calculated by averaging the binary ratings for the three or four rating scale components. Interrater reliability was calculated on two levels: (a) mean interrater reliability between each SLP rater and consensus laboratory ratings, and (b) interrater reliability across all SLP raters. The mean absolute difference and interrater reliability between each SLP rater and the consensus laboratory ratings were calculated. These values were then averaged to reflect the mean absolute difference and interrater reliability between SLP raters and the consensus laboratory ratings. Interrater reliability across all SLP raters was calculated using the ICC across all ratings. ICC values were calculated using the “irr” package in R Studio (R Core Team, 2021), based on a two-way random effects model for agreement on average ratings (Hallgren, 2012) for both consensus ratings and interrater reliability across all clinicians. The ICC, 95% CI, and F tests of significance for interrater reliability of the MACS (segment, word shape, prosody, movement transition) and all component ratings were calculated. Qualitative interpretation of ICC values was reported for the upper and lower limits of the 95% CI using Koo and Li (2016) guidelines (see Table 3).
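The ICC model described above was computed with the “irr” package in R. As an illustrative cross-check, the same two-way random-effects, average-measures agreement ICC (ICC(A,k) in McGraw and Wong's notation) can be sketched in Python with NumPy; the ratings matrix below is hypothetical, not study data:

```python
import numpy as np

def icc_2k(ratings):
    """Two-way random-effects ICC for absolute agreement on average
    ratings (ICC(A,k), McGraw & Wong, 1996), the model reported for
    irr::icc in the text. `ratings` is a subjects x raters matrix."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-subject (token) means
    col_means = x.mean(axis=0)   # per-rater means
    # Mean squares from a two-way ANOVA without replication.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)  # rows (subjects)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)  # columns (raters)
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)

# Hypothetical binary ratings: four tokens rated by two raters in
# perfect agreement yield an ICC of 1.0.
print(round(icc_2k([[1, 1], [0, 0], [1, 1], [0, 0]]), 2))  # 1.0
```

Average-measures (rather than single-measures) ICC is the appropriate form here because reliability is evaluated for ratings pooled across raters rather than for any one rater in isolation.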
Results
Laboratory Ratings
The MACS score is the average of all component ratings, and scores ranged from 0 (inaccurate across all components) to 1 (accurate across all components). The mean (SD) MACS score across all combined tokens was 0.34 (0.26) for Expert Rater 1 and 0.28 (0.22) for Expert Rater 2. The MACS component ratings were binary ratings of either 0 (inaccurate) or 1 (accurate). Expert Rater 1 displayed the following mean (SD) ratings for each component rating across all tokens: (a) segmental accuracy = 0.13 (0.34); (b) word structure = 0.77 (0.43); (c) prosody = 0.05 (0.24); (d) movement transition = 0.17 (0.37). Expert Rater 2 displayed the following mean (SD) ratings: (a) segmental accuracy = 0.11 (0.32); (b) word structure = 0.74 (0.44); (c) prosody = 0.05 (0.24); (d) movement transition = 0.03 (0.18). Interrater reliability was calculated between both laboratory raters using the ICC. Table 4 displays the absolute mean difference between laboratory ratings, in addition to the ICC, 95% CI, and F test for the MACS score and each of the component ratings. Analyses revealed moderate to excellent interrater agreement for the MACS score across the 95% CI (ICC = .82; CI [0.74, 0.87]; moderate–good) and component ratings of segmental accuracy (ICC = .77; CI [0.67, 0.84]; moderate–good), word structure (ICC = .86; CI [0.80, 0.90]; good–excellent), prosody (ICC = 1.0; CI [1.0, 1.0]; excellent), and movement transition (ICC = .69; CI [0.55, 0.79]; moderate–good).
Table 4.
Descriptive statistics for absolute differences in expert MACS ratings and two-way random-effects intraclass correlation coefficient for agreement.
Variable | Absolute mean difference (SD) | Intraclass correlation coefficient | 95% confidence interval (CI) | F test | p value | Qualitative interpretation for 95% CI a |
---|---|---|---|---|---|---|
Laboratory expert ratings | ||||||
MACS score | 0.16 (0.36) | .82 | [0.74, 0.87] | F(116, 116) = 5.48 | < .001 | Moderate–good |
Segmental accuracy | 0.11 (0.32) | .77 | [0.67, 0.84] | F(116, 116) = 4.43 | < .001 | Moderate–good |
Word structure | 0.08 (0.28) | .86 | [0.80, 0.90] | F(116, 116) = 7.11 | < .001 | Good–excellent |
Prosody | 0 (0) | 1 | [1, 1] | F(22, 22) = Inf | 0 | Excellent |
Movement transition | 0.15 (0.36) | .69 | [0.55, 0.79] | F(116, 116) = 3.22 | < .001 | Moderate–good |
SLP ratings: interrater reliability | ||||||
MACS score | 0.16 (0.16) | .95 | [0.93, 0.96] | F(79, 1422) = 18.7 | < .001 | Excellent |
Segmental accuracy | 0.22 (0.24) | .92 | [0.89, 0.94] | F(79, 1422) = 12.8 | < .001 | Good–excellent |
Word structure | 0.12 (0.14) | .95 | [0.93, 0.97] | F(79, 1422) = 20.8 | < .001 | Excellent |
Prosody | 0.25 (0.24) | .88 | [0.79, 0.95] | F(16, 288) = 8.74 | < .001 | Good–excellent |
Movement transition | 0.18 (0.14) | .9 | [0.87, 0.93] | F(79, 1422) = 10.6 | < .001 | Good–excellent |
SLP ratings: intrarater reliability | ||||||
MACS score | 0.11 (0.19) | .86 | [0.83, 0.89] | F(430, 430) = 7.3 | < .001 | Good |
Segmental accuracy | 0.08 (0.27) | .86 | [0.83, 0.88] | F(430, 430) = 6.97 | < .001 | Good |
Word structure | 0.10 (0.29) | .86 | [0.83, 0.88] | F(430, 430) = 7.02 | < .001 | Good |
Prosody | 0.18 (0.38) | .66 | [0.52, 0.76] | F(119, 119) = 2.96 | < .001 | Moderate–good |
Movement transition | 0.17 (0.40) | .73 | [0.67, 0.77] | F(430, 430) = 3.68 | < .001 | Moderate–good |
Note. MACS = Multilevel word Accuracy Composite Scale; Inf = infinity; SLP = speech-language pathologist.
a Qualitative interpretation based on the Koo and Li (2016) guidelines.
Concurrent validity of the MACS was estimated through examining correlations between consensus ratings of speech accuracy (3-point scale, PPC) and MACS ratings. Consensus ratings were achieved using independent judgments by Expert Rater 1 and Expert Rater 2. Across the 117 words, there were 36 words total where the MACS score differed between raters with the following differences noted in each component: segmental accuracy (n = 13), word structure (n = 10), prosody (n = 0), and movement transition (n = 18). For the 3-point rating, there were 34 points of disagreement between expert raters. Review of narrow transcription revealed 47 points of disagreement for consonant transcriptions and 42 disagreements for vowel transcriptions. The expert raters simultaneously listened to each token in disagreement and reached 100% consensus on disputed items for MACS ratings, 3-point ratings, and transcription analyses of consonant/vowel accuracy.
Following the consensus procedure, correlations were performed between (a) MACS ratings and the 3-point scale and (b) MACS ratings and PPC (see Table 5). Analyses revealed small to large positive correlations between the 3-point scale and (a) the total MACS score (r_s = .583, p < .001); (b) segmental accuracy (r_pb = .685, p < .001); (c) word structure (r_pb = .323, p = .0003); (d) prosody (r_pb = .16, p = .0001); and (e) movement transition (r_pb = .612, p < .001). Absent to medium–large positive correlations were also observed between PPC and (a) the MACS score (r_s = .41, p < .001); (b) segmental accuracy (r_pb = .562, p < .001); (c) word structure (r_pb = .290, p = .001); (d) prosody (r_pb = .075, p = .08); and (e) movement transition (r_pb = .477, p < .001).
Table 5.
Descriptive statistics for consensus expert MACS ratings, 3-point rating scale, and percent phoneme correct (PPC) and correlation coefficients for associations between ratings.
Variable | M (SD) | Correlation: 3-point rating scale | Qualitative interpretation a | Correlation: PPC | Qualitative interpretation a |
---|---|---|---|---|---|
Consensus expert ratings | |||||
MACS score | 0.34 (0.27) | r_s = .583** | Medium–large | r_s = .410** | Medium |
Segmental accuracy | 0.12 (0.33) | r_pb = .685** | Large | r_pb = .562** | Medium–large |
Word structure | 0.79 (0.41) | r_pb = .323** | Medium | r_pb = .290** | Small |
Prosody | 0 (0) | r_pb = .160** | Small | r_pb = .075 | — |
Movement transition | 0.14 (0.35) | r_pb = .612** | Large | r_pb = .477** | Medium |
3-point rating | 0.49 (0.70) | ||||
PPC | 0.38 (0.36) |
Note. The dash indicates no relationship. MACS = Multilevel word Accuracy Composite Scale.
a Qualitative interpretation based on Cohen's (1988) effect size guidelines.
**p < .001.
SLP Ratings
For SLP ratings, the MACS scores ranged from 0 (inaccurate across all components) to 1 (accurate across all components). The MACS component ratings were binary ratings of either 0 (inaccurate) or 1 (accurate). The following mean (SD) ratings were observed for all SLP raters across all combined tokens: (a) MACS score = 0.40 (0.32); (b) segmental accuracy = 0.19 (0.40); (c) word structure = 0.76 (0.42); (d) prosody = 0.21 (0.41); (e) movement transition = 0.28 (0.45). Interrater reliability across all SLP raters was calculated to evaluate agreement across clinicians (see Table 4). Analyses revealed good to excellent agreement across the 95% CI for the MACS score (ICC = .95; CI [0.93, 0.96]; excellent), segmental accuracy (ICC = .92; CI [0.89, 0.94]; good–excellent), word structure (ICC = .95; CI [0.93, 0.97]; excellent), prosody (ICC = .88; CI [0.79, 0.95]; good–excellent), and movement transition (ICC = .90; CI [0.87, 0.93]; good–excellent).
Intrarater reliability was measured with SLP raters to examine internal consistency for repeated tokens in the experimental set. Analyses revealed good agreement across the 95% CI for the MACS score (ICC = .86; CI [0.83, 0.89]; good) and moderate to good agreement for segmental accuracy (ICC = .86; CI [0.83, 0.88]; good), word structure (ICC = .86; CI [0.83, 0.88]; good), prosody (ICC = .66; CI [0.52, 0.76]; moderate–good), and movement transition (ICC = .73; CI [0.67, 0.77]; moderate–good).
Discussion
The MACS is a new rating scale designed to quantify multiple levels of speech production in children with severe CAS to capture segmental, suprasegmental, and motor behaviors. It yields a composite score based on measurements of segmental accuracy, word structure, prosody, and movement transition. This study examined the concurrent validity and reliability of this measure across ratings performed by expert raters in the laboratory setting and by 19 practicing SLPs. Moderate to excellent reliability was observed between expert raters and across/within SLP raters. Analyses of concurrent validity for ratings performed by expert raters indicated medium to large correlations between the MACS score and existing measures of speech accuracy (i.e., 3-point rating, PPC) with differences observed in the relationship between individual components of the MACS and these metrics. Large correlations were noted between existing measures of speech accuracy (i.e., 3-point rating, PPC) and segmental accuracy and small correlations with prosody. These findings illustrate that the MACS aligns with established measures, yet also contributes novel elements for rating speech accuracy not reflected in the 3-point rating scale or PPC. Taken together, results provide initial evidence of the MACS as a valid and reliable measure for rating speech accuracy in children with severe speech impairment in the research setting for ratings performed by both clinical researchers and practicing SLPs.
Support for a Novel Rating Scale
Differing correlations between each component of the MACS and existing measures of speech accuracy (PPC, 3-point ratings) shed light on aspects of speech production that are similarly measured by these approaches (segmental accuracy, movement transition), in addition to novel contributions offered by the MACS (word structure, prosody). Correlations between the MACS (both the MACS score and segmental accuracy) and PPC support the ability of the MACS to evaluate gains in consonant and vowel accuracy in the absence of time-intensive narrow transcription. We would not recommend the MACS as a replacement for narrow transcription; however, it is an efficient way to monitor gains in segmental accuracy over the course of a treatment period using semi-regular (e.g., weekly) probe data. In contrast, PPC might be more effective for measuring generalization to nontreated words (Murray et al., 2015) or within connected speech sample analyses (Barrett et al., 2020).
Given that much of the treatment literature for CAS uses binary or scaled ratings for measuring probe data (Maas et al., 2019; Murray et al., 2015; Strand & Debertine, 2000; Strand et al., 2006), it was important to examine correlations between the MACS and the 3-point rating scale. Overall, the medium–large correlation between the MACS score and the 3-point rating scale supports that these measures align with one another. This finding was important as the MACS was designed to rate probe data in a similar manner as the 3-point rating scale. However, unlike the 3-point rating scale, the individual component ratings of the MACS indicate which elements of speech production are accurate thereby providing more specific and informative clinical information than the 3-point rating scale. Future work is needed to measure clinical gains over a treatment period using the MACS to analyze patterns of progress as measured by the total MACS score and the individual component ratings.
Correlations between MACS component ratings and the 3-point rating scale/PPC further highlight how the MACS captures behaviors not explicitly measured by these existing measures. Segmental accuracy and movement transition ratings both displayed large correlation coefficients across all comparisons within this dataset, while word structure and prosody displayed smaller correlation coefficients when compared to the 3-point rating. Although other clinical outcome measures embed assessment of motoric elements of speech production within binary (Murray et al., 2015) or scaled ratings (Jing & Grigos, 2022), the MACS explicitly assesses movement transitions between sounds and syllables. The large correlation between movement transition ratings and the 3-point scale, an established measure for rating speech accuracy in children with CAS, supports the validity of this component of the MACS to assess speech movements in CAS. Furthermore, the medium correlation between movement transition and PPC may reflect an intersection between these measures, whereby the degree of fluency and effort of speech movements across sounds/syllables has a moderate relationship with improved segmental accuracy. Taken together, correlations with movement transition ratings reflect a promising contribution of the MACS for measuring motor-based speech deficits.
The word structure and prosody ratings reflect additional novel contributions of the MACS. The relationship between word structure and PPC (small) and the 3-point rating scale (medium) reflects how word structure captures elements of speech production that other measures cannot by design. Therefore, when used clinically, word structure ratings can reveal shifts in the degree to which speakers may move closer toward achieving a targeted word shape. There was little to no relationship between prosody and PPC or 3-point ratings. This finding was expected for PPC given that it is a measure of segmental accuracy; however, we anticipated a stronger relationship between prosody and the 3-point rating scale as the latter specifically guides the listener to judge prosody (i.e., “Prosodic error in bisyllabic words = inaccurate stress, equal stress, or word segmentation in bisyllabic words,” which yields a score of “1”; Jing & Grigos, 2022, p. 6). When making accuracy judgments at a whole-word level using a 3-point rating scale, listeners may overlook prosodic errors, particularly when other components of the words are accurate (e.g., segments and word structure). Parsing out prosody as a point of measurement with the MACS may result in more careful consideration of this core feature of CAS and thereby explain reduced agreement between prosody and the 3-point rating.
Reliability of Component Ratings
Researchers have acknowledged the challenges associated with reliably rating speech output in children with motor speech disorders (Allison, 2020; Jing & Grigos, 2022). For the current work, reliability analyses revealed excellent agreement for the MACS score between laboratory ratings, across 19 practicing SLPs (interrater), and within individual SLP ratings (intrarater) when rating tokens produced by children with severe CAS. Past research has reported substantial-to-high interrater reliability between two to three listeners (Maas et al., 2012; Maas & Farinella, 2012; Strand et al., 2006), as well as substantial reliability within 30 SLP raters (Jing & Grigos, 2022). Further analysis of reliability in Jing and Grigos (2022), however, indicated overall agreement to be lowest for “1” ratings (58.22% agreement), or words produced with a minor error (or one distinctive feature error), as compared to inaccurate “0” ratings (66.63% agreement) or accurate “2” ratings (79.2% agreement). Thus, use of binary ratings in this study is believed to increase reliability by simplifying clinical decision making to a binary choice of accurate (1) or inaccurate (0). Furthermore, breaking down ratings into specified components (segmental accuracy, word structure, prosody, movement transition) focuses the raters' attention on key elements of speech production.
Analysis of agreement across component ratings reveals several interesting patterns. Good to excellent reliability was seen for segmental accuracy and word structure across SLP ratings. Moderate to excellent inter- and intrarater reliability was achieved for ratings of movement transition, which is promising yet also reflects the relative challenge related to introducing this novel element to speech ratings. While existing measures consider the quality of movement transitions (i.e., ReST binary accuracy ratings; 3-point rating scale), movement transitions are not explicitly measured in current metrics of speech accuracy and are likely to be less familiar to clinicians. Adding to the challenge of rating movement transitions, all ratings were performed using only the acoustic signal. Listeners may have found it difficult to rate the smoothness and fluency of speech movements without also viewing video-recordings. Moderate to excellent agreement was observed for ratings of prosody, which is promising given that prosody is known to be challenging to reliably rate (Strand et al., 2013). Of note, prosody was rated in a smaller subset of tokens (i.e., 23 bisyllabic tokens out of 117 experimental tokens). While these initial findings are encouraging, future work should explore the reliability of prosody in a larger stimulus set.
A closer look at component ratings revealed a disparity between inter- and intrarater agreement for ratings of prosody and movement transition. Interrater agreement across all SLP raters was higher than intrarater agreement for ratings of prosody and movement. There may have been less internal consistency for prosody and movement transition because these elements can be challenging for clinicians to judge, particularly for SLPs who may not regularly evaluate these areas in their clinical practice. The size of the stimulus set may also play a role in these findings, as suggested by Hustad et al. (2015), who reported lower intrarater than interrater reliability in a task that required listeners to orthographically transcribe words, phrases, and sentences produced by children with speech motor impairment and typical development. The authors suggested that more consistent ratings across all listeners may reflect greater reliability when measuring a wider range of stimuli. Recall that in the present work, intrarater reliability was based on a smaller set of tokens (18/117) than the stimulus set used to measure interrater reliability (108/117). Thus, stimulus items used to measure reliability across raters presented many more opportunities to rate tokens and included tokens with varied segmental and suprasegmental features. Similar to Hustad et al. (2015), this wider range of stimuli may have supported interrater reliability but not intrarater reliability.
Factors that likely contribute to the high interrater reliability in this dataset are the comprehensive training and practice that all SLPs completed prior to conducting the experiment. We anticipated the need for detailed training given that most graduate programs do not provide the opportunity for students to refine their listening skills and practicing SLPs do not commonly use rating scales to form clinical judgments. The design of the current work allowed additional time, practice, and feedback to shape understanding of the ratings and to guide use of this tool. As a result, 93% of the SLPs who completed the training (27/29) achieved at least 80% accuracy across all three training blocks. We considered whether years of clinical experience and expertise with CAS influenced performance on the rating task, as these were two parameters that we did not control for. This was not the case, however, as the SLP raters widely varied in clinical experience (i.e., 4–41 years) with differing exposure to children with CAS (i.e., less than five to more than 10 clients with CAS over their career). Therefore, we attribute high reliability to careful training using this composite rating tool. It is not clear whether the same level of reliability would be achieved in the absence of such training. There is a reduced focus on the development of auditory perceptual skills within clinical preparation and continued education programs. We would therefore encourage completion of the developed training protocol prior to use of the MACS. The training paradigm used in this work can potentially serve as a foundation from which educators can build to better prepare graduate students and practicing clinicians to rate severely impaired speech in a clinical setting.
While not a research question within this study, it is interesting to note patterns of agreement between expert raters when using the MACS as compared to the 3-point rating scale and transcription analyses. For the MACS score, there were 36 of the 117 stimuli where scores differed between raters, with the greatest number of disagreements for movement transition (n = 18). A similar number of disagreements was observed for 3-point ratings (n = 34 words). In contrast, transcription analyses resulted in a greater number of disagreements between raters (consonant transcription = 47; vowel transcription = 42). Post hoc analyses were performed to measure the interrater reliability for these measures and revealed good reliability for the 3-point measure (ICC = .88, F(116, 116) = 8.71, p < .001) but moderate reliability for PPC (ICC = .66, F(116, 116) = 2.97, p < .001). These findings further support use of scaled ratings, as compared to transcription analyses, for reliably rating speech production in children with severe speech impairment (Allison, 2020).
Recall that tokens were produced over the course of a treatment period. Given that accuracy differed from tokens produced at baseline as compared to posttreatment and maintenance, we considered the possibility that reliability of MACS ratings could differ depending on the point in intervention. Post hoc analyses were performed to explore whether reliability differed at each point in treatment and revealed similar ICC values for the MACS score at baseline (ICC = .951, F(29, 522) = 20.3, p < .001), posttreatment (ICC = .926, F(32, 576) = 13.4, p < .001), and maintenance (ICC = .945, F(31, 558) = 18.2, p < .001). While analyses are needed on a complete dataset over the course of treatment, these findings do not indicate that reliability of the MACS differs based on treatment phase.
Clinical Application
The MACS has potential to be an efficient and informative approach for quantifying key elements of speech accuracy in both the clinical and research settings though continued research is needed to test the clinical utility of this metric in the context of intervention. Evidence from this work demonstrates that the total MACS score provides an overarching value for rating speech accuracy, which aligns with traditional measures. This rating scale is more time efficient than narrowly transcribing words to calculate PPC, which may not be feasible on a regular basis in a busy clinical setting. Furthermore, analyses of each segmental, suprasegmental, and motoric component can provide more detailed information about which aspects of speech performance are more/less accurate.
While this study demonstrates the application of the MACS within an online experiment, the MACS is currently being used to measure treatment outcomes in ongoing clinical research in CAS (Grigos et al., in press; Iuzzini-Seigel et al., 2023). Close examination of MACS component ratings from Grigos et al. (in press) has also shed light on the observation that children often display gains in certain aspects of probe word productions but not others, over time. For instance, word structure maintenance and movement transitions between sounds and syllables may improve while segmental errors persist. In such an example, PPC would remain low and words would be rated as a “0” on the 3-point scale given the persistence of segmental errors. The MACS has the benefit of being able to highlight clinical gains not captured by these measures. Furthermore, as motor planning and programming of speech movements occurs at the level of the syllable (Nijland et al., 2003), measurement of gains in word structure may capture the beginning of organizational changes in the programming of speech movements that other measures would not evaluate.
Categories evaluated within the MACS were designed to guide clinicians and researchers to attend to aspects of speech production that are most meaningful and relevant to core deficits in CAS, such as prosody or movement transitions. Use of a metric designed to evaluate these core features can shape how SLPs make clinical decisions within the context of treatment, such as determining whether a child has achieved a treatment goal. Traditionally, treatment progress is measured as the percentage of accurate productions within a treatment session, with the child's performance then compared to an established criterion level (e.g., 80% accuracy; Moore, 2018). Although this approach often has been the status quo, it may both underestimate and overestimate a child's performance. In addition, it fails to capture suprasegmental and motoric elements of speech production that may or may not be improving. Such information can heighten SLP awareness of areas in need of more attention during treatment and can inform the SLP's decisions about the types of cuing that best facilitate accurate word production. For instance, children with CAS commonly require extensive practice of movement gestures to establish and refine speech motor programs in order to facilitate maintenance and generalization of treatment gains. Importantly, however, movement transitions can be refined even when sound production is perceived to be accurate (Grigos & Case, 2018). A measure focused solely on segmental accuracy does not capture information about the movement gesture, word structure, or prosody, which can lead the SLP to conclude that a child has achieved a goal and prematurely move on to more challenging treatment targets. In these instances, children may require additional practice to achieve more fluent movement transitions or to refine prosody. As a result, there is a risk that words may be rated as accurate despite requiring additional practice on elements less commonly evaluated. The MACS addresses these issues and can guide clinicians toward a more holistic approach to determining target accuracy.
Limitations and Future Directions
Several limitations are associated with this study. For one, the current work was conducted in the context of a listening experiment. While the MACS has been used to quantify treatment outcomes in recent and ongoing treatment studies (Grigos et al., in press; Iuzzini-Seigel et al., 2023), continued work is needed to evaluate the utility of this metric in the clinical setting and whether sustained reliability of this measure would be achieved over a longer treatment period. Furthermore, in consideration of the International Classification of Functioning, Disability and Health (ICF) framework (World Health Organization, 2007), the MACS and other measures of speech accuracy evaluated in this study (PPC, 3-point rating) largely measure performance at the level of Body Structure & Function. Future work is needed to explore the relationship between MACS ratings and functional outcome measures at the Activities and Participation level of this ICF framework.
An additional limitation relates to the tokens and narrow sample of children included in this dataset. Tokens were balanced by child and session to ensure that each rating block contained the same number of tokens produced by each child at each point in intervention (baseline, posttreatment, maintenance). However, all children presented with severe CAS, and the number of accurate/inaccurate tokens was not evenly distributed. While we intentionally designed the experiment to reflect probe data that would be collected over the course of treatment to provide a “real-world” analysis of the MACS in this context, a next step is to test the reliability of the MACS in a stimulus set containing a more diverse cohort of speakers and tokens more systematically balanced by rating across each of the component measures. Furthermore, a subset of training tokens was produced by the same speakers as in the experimental task (i.e., two tokens per child). While more prolonged exposure would be needed to result in a processing advantage (e.g., Nygaard & Pisoni, 1998), future work will incorporate training items produced by speakers different from those in the experimental tokens. We anticipate that the MACS will maintain good reliability given that it uses binary ratings within each component, though additional research is needed to confirm reliability of MACS ratings for tokens produced by a more diverse group of speakers across both training and experimental tasks.
Challenges encountered throughout this experiment reflect difficulties associated with remote data collection, which was necessary due to restrictions related to the COVID-19 pandemic. We were not able to control all environmental conditions when SLP raters completed the experiment, nor were we able to precisely measure certain environmental variables (e.g., experiment duration). Three SLP raters were eliminated as they did not accurately rate the catch trials produced by a child with typical development. Catch trials have been used within remote experiments (Fernández et al., 2019; Nightingale et al., 2020; Peterson et al., 2022) to ensure attention to task and have in some cases resulted in high rates of participant elimination (e.g., 18% elimination; Fernández et al., 2019). Some SLP raters also began the experiment but could not complete all three blocks in one sitting due to unanticipated circumstances that arose while completing the task in their home setting. Ratings from two clinicians were excluded as they did not meet the 80% accuracy criterion across training tasks. This measure was necessary to demonstrate that all SLP raters achieved a similar level of proficiency in using an unfamiliar clinical rating scale. As a result of these combined circumstances, a higher-than-expected number of SLP raters were excluded from the study, a necessary measure to maintain experimental control in the context of a remote research study. Nonetheless, it is encouraging that many clinicians successfully completed this work in their own communities and achieved considerable reliability across the MACS ratings. Thus, elimination of participants is not interpreted as reflecting challenges in applying the MACS to clinical data, but rather as a measure taken to achieve as much experimental control as possible within a remote, self-paced experiment.
Furthermore, we believe these results suggest good ecological validity for use of this rating scale in the clinical setting.
Future work should examine the reliability of the MACS in graduate student clinicians. Recent work reported substantial agreement for ratings performed by graduate student clinicians using the 3-point rating scale (Jung et al., 2022). Research is needed to evaluate whether similar degrees of reliability among students would be found for the MACS. Traditionally, graduate students receive inadequate training in auditory perceptual analyses of speakers with severe speech impairment. Participation in this experiment could provide valuable training for rising clinicians while exploring the reliability of this measure in raters with limited experience.
Conclusions
The MACS is a novel rating scale that measures segmental, suprasegmental, and motoric aspects of speech performance not captured by existing measures, including segmental accuracy, word structure, prosody, and movement transition. These components reflect core speech behaviors that characterize CAS (ASHA, 2007; Shriberg et al., 2017) and are targeted within motor-based intervention (e.g., Strand, 2020). Our analyses revealed the MACS to be a valid and reliable rating metric for laboratory ratings of stimuli produced by children with severe CAS over the course of motor-based intervention when performed by clinical researchers and practicing SLPs. Future analyses will continue to explore the application of the MACS with a wider range of stimuli and populations. The MACS offers great promise to improve clinical management through more specific and reliable measurement of speech performance for children with CAS in the research and clinical settings. Finally, to support clinical application of this measure, all training materials and a template for MACS ratings are available online: http://www.casespeechlab.com.
Author Contributions
Julie Case: Conceptualization (Equal), Formal analysis (Lead), Investigation (Lead), Methodology (Equal), Writing – original draft (Equal), Writing – review & editing (Equal). Emily W. Wang: Formal analysis (Supporting), Investigation (Supporting), Resources (Lead), Writing – original draft (Supporting), Writing – review & editing (Supporting). Maria I. Grigos: Conceptualization (Equal), Formal analysis (Supporting), Investigation (Equal), Methodology (Equal), Writing – original draft (Equal), Writing – review & editing (Equal).
Data Availability Statement
Data are available upon reasonable request to the authors.
Acknowledgments
This research was supported by the Hofstra Faculty Research and Development Grant awarded to Julie Case and the National Institute on Deafness and Other Communication Disorders Grant R01DC018581 awarded to Maria I. Grigos. We would like to acknowledge Hailey Kopera, Lanie Jung, Gabriela Lovishuk, Abigail Brown, and Lacie Berkowicz for assistance with data collection and processing. We would also like to extend our gratitude to Nicole Kolenda and Kate Nealon for piloting earlier versions of this rating scale. We are especially grateful to all of the speech-language pathologists, the children, and their families who dedicated their time and participated in this work.
Appendix A
Multilevel word Accuracy Composite Scale (MACS)
The Multilevel word Accuracy Composite Scale (MACS) evaluates detailed elements of speech production through ratings used to measure treatment gains.
Whole-word accuracy will be measured using a composite score that reflects the following four elements:
(I) Segmental accuracy – whether the child produces the consonants and vowels accurately
(II) Word structure – whether the child maintains the word shape of the target word
(III) Prosody – accuracy of stress patterning and intonation
(IV) Movement transition – effort and fluency of production
Each component receives a binary rating of “0” or “1.” Component ratings are averaged to yield one composite score per production. Here are two examples:
MACS scoring rubric*

| Category | “0” rating | “1” rating | Example 1: “pop” /pap/ ➔ [ba] | Example 2: “baby” /'beɪbi/ ➔ ['bi'bi] |
|---|---|---|---|---|
| I. Segmental accuracy | Substitution, omission, distortion | Accurate consonants and vowels | C1 substitution; V accurate; C2 omission = 0 | C1 omission; V1 omission = 0 |
| II. Word structure | Inaccurate/missing word structure (e.g., CVC ➔ CV) | Accurate word structure | CVC ➔ CV = 0 | CVCV ➔ CVCV = 1 |
| III. Prosody | Segmentation; equal/inaccurate stress; syllable reduction | Accurate prosody | Single-syllable word = n/a | Equal stress = 0 |
| IV. Movement transition | Transition not smooth or fluid across word | Accurate transition | No transition to final consonant due to consonant omission = 0 | Inaccurate prosody = 0 |
| MACS score | | | (0 + 0 + 0)/3 = 0 | (1 + 0 + 0 + 0)/4 = 0.25 |
Category Details
- Segmental accuracy – provide a score of “0” or “1” to rate the accuracy of consonants and vowels.
• “1” rating: All consonants and vowels in the word are correct (e.g., pop produced as “pop”).
• “0” rating: One or more inaccurate consonants or vowels in the word (e.g., pop produced as “bop”). Includes consonant and/or vowel substitutions, omissions, and distortions. Vowels are rated as inaccurate if they are shortened, lengthened, produced with excess effort, or otherwise distorted. Consonants are rated as inaccurate if they are produced with excess effort or distorted. Segmental accuracy is also rated “0” if a consonant or vowel is inserted.
- Word structure – provide a score of 0 or 1 to indicate whether the child maintained the word shape of the targeted word. The word shape must be completely maintained to receive a rating of 1 (e.g., CVC ➔ CVC). Note that word shape can be accurate even when the child displays consonant or vowel errors (e.g., pop produced as “bob”: there are segmental errors, but the word maintains the CVC shape). Any omission or addition of segments or syllables would receive a rating of 0 for word structure (e.g., CVC ➔ CV; CV ➔ CVV; CVCV ➔ CV).
- Prosody – prosody is only rated on words with 2+ syllables. Words of 2+ syllables with appropriate stress patterning, rate, and intonation receive a rating of 1 for prosody. Words with equal stress, syllable segregation, lack of intonation, vowel insertions, syllable omissions, or otherwise inaccurate prosody receive a rating of 0. For words with one syllable, input “n/a,” as prosody is not scored in these words.
- Movement transition – provide a score of 0 or 1 to indicate whether movement transitions were accurate. Productions that are smooth, effortless, and fluent receive a rating of 1. If a word is produced with consonant/vowel errors but production remains smooth and effortless, it may still be rated as a 1. Words produced with excess effort or segmentation receive a rating of 0. Movement transitions are also rated as inaccurate if prosody is inaccurate; if consonants or vowels are deleted or inserted (e.g., an inserted schwa); if consonants or vowels are produced effortfully; or if vowels are shortened or lengthened.
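As a worked illustration of the composite arithmetic described above, the scoring can be sketched in a few lines of code. This is a minimal, hypothetical helper (not part of the published MACS protocol); the function name and the convention of passing `None` for a non-scored prosody component are our own:

```python
def macs_score(segmental, structure, prosody, transition):
    """Average the binary MACS component ratings into one composite score.

    Each argument is 0 or 1. Pass prosody=None for single-syllable words,
    where prosody is not scored ("n/a" in the rubric) and the composite
    is averaged over the remaining three components.
    """
    ratings = [segmental, structure, prosody, transition]
    scored = [r for r in ratings if r is not None]  # drop "n/a" components
    return sum(scored) / len(scored)

# Example 1: "pop" /pap/ -> [ba]; prosody n/a (single syllable)
print(macs_score(0, 0, None, 0))  # (0 + 0 + 0)/3 = 0.0

# Example 2: "baby" -> ['bi'bi]; word structure maintained
print(macs_score(0, 1, 0, 0))     # (1 + 0 + 0 + 0)/4 = 0.25
```

These two calls reproduce the composite scores shown in the scoring rubric for the “pop” and “baby” examples.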
Appendix B
Stimulus Set
| Block 1 | Block 2 | Block 3 | Timepoint | Child |
|---|---|---|---|---|
| happy | bike | up | Baseline | Child 1 |
| me | mom | bye | | |
| hi | do | out | Posttreatment | |
| uh-oh | happy | up | | |
| me | do | bye | Maintenance | |
| out | hi | uh-oh | | |
| me | eat | happy | Baseline | Child 2 |
| uh-oh | up | pop | | |
| me | bye | eat | Posttreatment | |
| pop | hi | uh-oh | | |
| eat | happy | mom | Maintenance | |
| me | hi | peep | | |
| daddy | down | do | Baseline | Child 3 |
| in | go | home | | |
| go | home | bed | Maintenance | |
| out | in | mine | | |
| bunny | daddy | down | Posttreatment | |
| mine | do | in | | |
| hi | peep | baby | Baseline | Child 4 |
| me | baby | be | | |
| eat | be | bye | Maintenance | |
| peep | me | hi | | |
| bye | baby | be | Posttreatment | |
| hi | me | puppy | | |
| bye | baby | eat | Baseline | Child 5 |
| peep | up | pop | | |
| eat | bye | mine | Maintenance | |
| home | hi | up | | |
| bye | pop | baby | Posttreatment | |
| peep | up | puppy | | |
Funding Statement
This research was supported by the Hofstra Faculty Research and Development Grant awarded to Julie Case and the National Institute on Deafness and Other Communication Disorders Grant R01DC018581 awarded to Maria I. Grigos.
References
- Allison, K. M. (2020). Measuring speech intelligibility in children with motor speech disorders. Perspectives of the ASHA Special Interest Groups, 5(4), 809–820. 10.1044/2020_PERSP-19-00110 [DOI] [Google Scholar]
- American Speech-Language-Hearing Association. (2007). Childhood apraxia of speech [Technical report] . http://www.asha.org/policy
- Barrett, C. , McCabe, P. , Masso, S. , & Preston, J. (2020). Protocol for the connected speech transcription of children with speech disorders: An example from childhood apraxia of speech. Folia Phoniatrica et Logopaedica, 72(2), 152–166. 10.1159/000500664 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Brown, J. D. (2001). Point-biserial correlation coefficients. Statistics, 5(3). [Google Scholar]
- Campbell, T. F. , Dollaghan, C. A. , Rockette, H. E. , Paradise, J. L. , Feldman, H. M. , Shriberg, L. D. , Sabo, D. L. , & Kurs-Lasky, M. (2003). Risk factors for speech delay of unknown origin in 3-year-old children. Child Development, 74(2), 346–357. 10.1111/1467-8624.7402002 [DOI] [PubMed] [Google Scholar]
- Case, J. , & Grigos, M. I. (2020). A framework of motoric complexity: An investigation in children with typical and impaired speech development. Journal of Speech, Language, and Hearing Research, 63(10), 3326–3348. 10.1044/2020_JSLHR-20-00020 [DOI] [PubMed] [Google Scholar]
- Cassar, C. , Mccabe, P. , & Cumming, S. (2022). “I still have issues with pronunciation of words”: A mixed methods investigation of the psychosocial and speech effects of childhood apraxia of speech in adults. International Journal of Speech-Language Pathology, 1–13. 10.1080/17549507.2021.2018496 [DOI] [PubMed] [Google Scholar]
- Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.) , Erlbaum. [Google Scholar]
- Davis, B. L. , Jakielski, K. J. , & Marquardt, T. P. (1998). Developmental apraxia of speech: Determiners of differential diagnosis. Clinical Linguistics & Phonetics, 12(1), 25–45. 10.3109/02699209808985211 [DOI] [Google Scholar]
- Fernández, D. , Harel, D. , Ipeirotis, P. , & McAllister, T. (2019). Statistical considerations for crowdsourced perceptual ratings of human speech productions. Journal of Applied Statistics, 46(8), 1364–1384. 10.1080/02664763.2018.1547692 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Forrest, K. (2003). Diagnostic criteria of developmental apraxia of speech used by clinical speech-language pathologists. American Journal of Speech-Language Pathology, 12(3), 376–380. 10.1044/1058-0360(2003/083) [DOI] [PubMed] [Google Scholar]
- Gamer, M. , Lemon, J. , Fellows, I. , & Singh, P. (2012). irr: Various coefficients of interrater reliability and agreement [R package]. [Google Scholar]
- Grigos, M. I. , & Case, J. (2018). Changes in movement transitions across a practice period in childhood apraxia of speech. Clinical Linguistics & Phonetics, 32(7), 661–687. 10.1080/02699206.2017.1419378 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grigos, M. I. , Case, J. , Lu, Y. , & Lyu, Z. (in press). Dynamic temporal and tactile cueing: Quantifying speech motor changes and individual factors that contribute to treatment gains in childhood apraxia of speech. Journal of Speech-Language Pathology and Audiology. [DOI] [PubMed]
- Grigos, M. I. , Moss, A. , & Lu, Y. (2015). Oral articulatory control in childhood apraxia of speech. Journal of Speech, Language, and Hearing Research, 58(4), 1103–1118. 10.1044/2015_JSLHR-S-13-0221 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8(1), 23–34. 10.20982/tqmp.08.1.p023 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hustad, K. C. (2020). Augmentative and alternative communication. In Neurologic and neurodegenerative diseases of the larynx (pp. 407–413). Springer. 10.1007/978-3-030-28852-5_34 [DOI] [Google Scholar]
- Hustad, K. C. , Oakes, A. , & Allison, K. (2015). Variability and diagnostic accuracy of speech intelligibility scores in children. Journal of Speech, Language, and Hearing Research, 58(6), 1695–1707. 10.1044/2015_JSLHR-S-14-0365 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Anwyl-Irvine, A. L. , Massonnié, J. , Flitton, A. , Kirkham, N. , & Evershed, J. K. (2018). Gorilla in our midst: An online behavioral experiment builder. [DOI] [PMC free article] [PubMed]
- Iuzzini, J. , & Forrest, K. (2010). Evaluation of a combined treatment approach for childhood apraxia of speech. Clinical Linguistics & Phonetics, 24(4–5), 335–345. 10.3109/02699200903581083 [DOI] [PubMed] [Google Scholar]
- Iuzzini-Seigel, J. , Case, J. , Grigos, M. , Velleman, S. , Thomas, D. , & Murray, E. (2023). Dose frequency randomized control trial for dynamic temporal and tactile cueing (DTTC) treatment for childhood apraxia of speech: Protocol paper. 10.21203/rs.3.rs-2407181/v1 [DOI] [PMC free article] [PubMed]
- Iuzzini-Seigel, J. , Hogan, T. P. , & Green, J. R. (2017). Speech inconsistency in children with childhood apraxia of speech, language impairment, and speech delay: Depends on the stimuli. Journal of Speech, Language, and Hearing Research, 60(5), 1194–1210. 10.1044/2016_JSLHR-S-15-0184 [DOI] [PubMed] [Google Scholar]
- Jing, L. , & Grigos, M. I. (2022). Speech-language pathologists' ratings of speech accuracy in children with speech sound disorders. American Journal of Speech-Language Pathology, 31(1), 419–430. 10.1044/2021_AJSLP-20-00381 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jung, S. , Jing, L. , & Grigos, M. I. (2022). Graduate student clinicians' perceptions of child speech sound errors. Perspectives of the ASHA Special Interest Groups, 7(4), 1275–1283. 10.1044/2022_PERSP-21-00332 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Koo, T. K. , & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163. 10.1016/j.jcm.2016.02.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kopera, H. C. , & Grigos, M. I. (2019). Lexical stress in childhood apraxia of speech: Acoustic and kinematic findings. International Journal of Speech-Language Pathology, 1–12. 10.1080/17549507.2019.1568571 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Levi, S. V. , Winters, S. J. , & Pisoni, D. B. (2011). Effects of cross-language voice training on speech perception: Whose familiar voices are more intelligible? The Journal of the Acoustical Society of America, 130(6), 4053–4062. 10.1121/1.3651816 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lewis, B. A. , Freebairn, L. , Tag, J. , Ciesla, A. A. , Iyengar, S. K. , Stein, C. M. , & Taylor, H. G. (2015). Adolescent outcomes of children with early speech sound disorders with and without language impairment. American Journal of Speech-Language Pathology, 24(2), 150–163. 10.1044/2014_AJSLP-14-0075 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lewis, B. A. , Freebairn, L. A. , Hansen, A. J. , Iyengar, S. K. , & Taylor, H. G. (2004). School-age follow-up of children with childhood apraxia of speech. Language, Speech, and Hearing Services in Schools, 35(2), 122–140. 10.1044/0161-1461(2004/014) [DOI] [PubMed] [Google Scholar]
- Maas, E. , Butalla, C. E. , & Farinella, K. A. (2012). Feedback frequency in treatment for childhood apraxia of speech. American Journal of Speech-Language Pathology, 21(3), 239–257. 10.1044/1058-0360(2012/11-0119) [DOI] [PubMed] [Google Scholar]
- Maas, E. , & Farinella, K. A. (2012). Random versus blocked practice in treatment for childhood apraxia of speech. Journal of Speech, Language, and Hearing Research, 55(2), 561–578. 10.1044/1092-4388(2011/11-0120) [DOI] [PubMed] [Google Scholar]
- Maas, E. , Gildersleeve-Neumann, C. , Jakielski, K. , Kovacs, N. , Stoeckel, R. , Vradelis, H. , & Welsh, M. (2019). Bang for your buck: A single-case experimental design study of practice amount and distribution in treatment for childhood apraxia of speech. Journal of Speech, Language, and Hearing Research, 62(9), 3160–3182. 10.1044/2019_JSLHR-S-18-0212 [DOI] [PubMed] [Google Scholar]
- Maas, E. , Gildersleeve-Neumann, C. , Jakielski, K. , & Stoeckel, R. (2014). Motor-based intervention protocols in treatment of childhood apraxia of speech (CAS). Current Developmental Disorders Reports, 1(3), 197–206. 10.1007/s40474-014-0016-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Maas, E. , Robin, D. A. , Hula, S. N. A. , Freedman, S. E. , Wulf, G. , Ballard, K. J. , & Schmidt, R. A. (2008). Principles of motor learning in treatment of motor speech disorders. American Journal of Speech-Language Pathology, 17(3), 277–298. 10.1044/1058-0360(2008/025) [DOI] [PubMed] [Google Scholar]
- McDonald, J. (2014). Handbook of biological statistics (3rd ed.). Sparky House Publishing. [Google Scholar]
- Miller, G. J. , Lewis, B. , Benchek, P. , Freebairn, L. , Tag, J. , Budge, K. , Iyengar, S. K. , Voss-Hoynes, H. , Taylor, H. G. , & Stein, C. (2019). Reading outcomes for individuals with histories of suspected childhood apraxia of speech. American Journal of Speech-Language Pathology, 28(4), 1432–1447. 10.1044/2019_AJSLP-18-0132 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Moore, R. (2018). Beyond 80-percent accuracy. The ASHA Leader, 23(5), 6–7. 10.1044/leader.FMP.23052018.6 [DOI] [Google Scholar]
- Morgan, L. , Overton, S. , Bates, S. , Titterington, J. , & Wren, Y. (2021). Making the case for the collection of a minimal dataset for children with speech sound disorder. International Journal of Language & Communication Disorders, 56(5), 1097–1107. 10.1111/1460-6984.12649 [DOI] [PubMed] [Google Scholar]
- Murray, E. , McCabe, P. , & Ballard, K. J. (2015). A randomized controlled trial for children with childhood apraxia of speech comparing rapid syllable transition treatment and the nuffield dyspraxia programme–Third edition. Journal of Speech, Language, and Hearing Research, 58(3), 669–686. 10.1044/2015_JSLHR-S-13-0179 [DOI] [PubMed] [Google Scholar]
- Namasivayam, A. K. , Huynh, A. , Granata, F. , Law, V. , & van Lieshout, P. (2021). Prompt intervention for children with severe speech motor delay: A randomized control trial. Pediatric Research, 89(3), 613–621. 10.1038/s41390-020-0924-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nightingale, C. , Swartz, M. , Ramig, L. O. , & McAllister, T. (2020). Using crowdsourced listeners' ratings to measure speech changes in hypokinetic dysarthria: A proof-of-concept study. American Journal of Speech-Language Pathology, 29(2), 873–882. 10.1044/2019_AJSLP-19-00162 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nijland, L. , Maassen, B. , Van Der Meulen, S. , Gabreëls, F. , Kraaimaat, F. W. , & Schreuder, R. (2003). Planning of syllables in children with developmental apraxia of speech. Clinical Linguistics & Phonetics, 17(1), 1–24. 10.1080/0269920021000050662 [DOI] [PubMed] [Google Scholar]
- Nygaard, L. C. , & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60(3), 355–376. 10.3758/BF03206860 [DOI] [PubMed] [Google Scholar]
- Peterson, L. , Savarese, C. , Campbell, T. , Ma, Z. , Simpson, K. O. , & McAllister, T. (2022). Telepractice treatment of residual rhotic errors using app-based biofeedback: A pilot study. Language, Speech, and Hearing Services in Schools, 53(2), 256–274. 10.1044/2021_LSHSS-21-00084 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pollock, K. , & Hall, P. (1991). An analysis of the vowel misarticulations of five children with developmental apraxia of speech. Clinical Linguistics & Phonetics, 5(3), 207–224. 10.3109/02699209108986112 [DOI] [Google Scholar]
- Preston, J. L. , Leece, M. C. , McNamara, K. , & Maas, E. (2017). Variable practice to enhance speech learning in ultrasound biofeedback treatment for childhood apraxia of speech: A single case experimental study. American Journal of Speech-Language Pathology, 26(3), 840–852. 10.1044/2017_AJSLP-16-0155 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shriberg, L. D. , Austin, D. , Lewis, B. A. , McSweeny, J. L. , & Wilson, D. L. (1997). The percentage of consonants correct (PCC) metric. Journal of Speech, Language, and Hearing Research, 40(4), 708–722. 10.1044/jslhr.4004.708 [DOI] [PubMed] [Google Scholar]
- Shriberg, L. D. , Campbell, T. F. , Karlsson, H. B. , Brown, R. L. , McSweeny, J. L. , & Nadler, C. J. (2003). A diagnostic marker for childhood apraxia of speech: The lexical stress ratio. Clinical Linguistics & Phonetics, 17(7), 549–574. 10.1080/0269920031000138123 [DOI] [PubMed] [Google Scholar]
- Shriberg, L. D. , Kwiatkowski, J. , & Hoffmann, K. (1984). A procedure for phonetic transcription by consensus. Journal of Speech and Hearing Research, 27(3), 456–465. 10.1044/jshr.2703.456 [DOI] [PubMed] [Google Scholar]
- Shriberg, L. D. , Lohmeier, H. L. , Strand, E. A. , & Jakielski, K. J. (2012). Encoding, memory, and transcoding deficits in childhood apraxia of speech. Clinical Linguistics & Phonetics, 26(5), 445–482. 10.3109/02699206.2012.655841 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shriberg, L. D. , Strand, E. A. , Fourakis, M. , Jakielski, K. J. , Hall, S. D. , Karlsson, H. B. , Mabie, H. L. , McSweeny, J. L. , Tilkens, C. M. , & Wilson, D. L. (2017). A diagnostic marker to discriminate childhood apraxia of speech from speech delay: Introduction. Journal of Speech, Language, and Hearing Research, 60(4), S1096–S1117. 10.1044/2016_JSLHR-S-16-0148 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Strand, E. A. (2020). Dynamic temporal and tactile cueing: A treatment strategy for childhood apraxia of speech. American Journal of Speech-Language Pathology, 29(1), 30–48. 10.1044/2019_AJSLP-19-0005 [DOI] [PubMed] [Google Scholar]
- Strand, E. A. , & Debertine, P. (2000). The efficacy of integral stimulation intervention with developmental apraxia of speech. Journal of Medical Speech-Language Pathology, 8(4), 295–300. [Google Scholar]
- Strand, E. A. , McCauley, R. J. , Weigand, S. D. , Stoeckel, R. E. , & Baas, B. S. (2013). A motor speech assessment for children with severe speech disorders: Reliability and validity evidence. Journal of Speech, Language, and Hearing Research, 56(2), 505–520. 10.1044/1092-4388(2012/12-0094) [DOI] [PubMed] [Google Scholar]
- Strand, E. A. , Stoeckel, R. , & Baas, B. (2006). Treatment of severe childhood apraxia of speech: A treatment efficacy study. Journal of Medical Speech-Language Pathology, 14(4), 297–308. [Google Scholar]
- R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/ [Google Scholar]
- Terband, H. , Maassen, B. , Guenther, F. H. , & Brumberg, J. (2009). Computational neural modeling of speech motor control in childhood apraxia of speech (CAS). Journal of Speech, Language, and Hearing Research, 52(6), 1595–1609. 10.1044/1092-4388(2009/07-0283) [DOI] [PMC free article] [PubMed] [Google Scholar]
- World Health Organization. (2007). International classification of functioning, disability, and health: Children & youth version: ICF-CY.