Abstract
Purpose:
Typically developing children assigned male at birth (AMAB) and children assigned female at birth (AFAB) produce the fricative /s/ differently: AFAB children produce /s/ with a higher spectral peak frequency. This study examined whether implicit knowledge of these differences affects speech-language pathologists’/speech and language therapists’ (SLPs’/SLTs’) ratings of /s/ accuracy, by comparing ratings made in conditions where SLPs/SLTs were blind to children’s sex assigned at birth (SAB) to conditions in which they were told this information.
Methods:
SLPs (n=95) varying in clinical experience rated the accuracy of word-initial /s/ productions (n=87) of eight children with speech sound disorder (SSD) in one of four conditions: one in which no information about the children was revealed, one in which children’s SAB was revealed, one in which children’s age was revealed, and one in which both were revealed.
Results:
Despite there being no statistically significant differences between AFAB and AMAB children’s /s/ production in researcher-determined accuracy or in one acoustic characteristic, spectral centroid, SLPs in all four conditions judged the /s/ productions of AFAB children as more accurate than those of AMAB children. Listeners were significantly less likely to judge the productions of AMAB children to be inaccurate in the conditions in which age, or age and SAB, were revealed. These effects were consistent across SLPs with greatly varying levels of clinical experience.
Conclusion:
Knowing or imputing children’s age and SAB can affect ratings of /s/ accuracy. Clinicians should be mindful of these potential effects. Future research should examine how expectations about sociolinguistic variation in speech affect appraisals of children’s speech and language.
Keywords: gender, speech sound disorder, assessment, bias, fricative
Introduction
Children with speech sound disorder (SSD) produce speech sounds substantially less accurately than their peers (Bowen, 2023). This definition invites an interrogation of what accuracy means. Definitions of accuracy often appeal to two different constructs: speaking intelligibly (i.e., in a way that others can understand) and speaking in an adult-like manner. These two constructs are not identical: some ways of speaking that are not adult-like, such as distortion errors, do not affect intelligibility (Shriberg et al., 1997). The focus of this study is on the latter of these two constructs. The determination of whether a child is speaking in an adult-like manner is complicated by the fact that every adult speaks differently, and that there are hence many different potential definitions of 'adult-like'. Understanding the causes and consequences of this variation, and how it affects assessments of children’s speech accuracy, is critical to the equitable practice of speech-language pathology/speech and language therapy (SLP/SLT).
The way people speak reflects, in part, their gender. For example, as a group, women produce /s/ with a higher peak frequency and a more compact spectrum than men (Jongman et al., 2000; Munson, McDonald et al., 2006). These differences, like many phonetic differences between men and women, are not solely the consequence of men having thicker vocal folds and longer larynxes than women. Such anatomical differences emerge during puberty in cisgender individuals. It was long assumed that phonetic differences between men and women were due entirely to sex dimorphism in the speech-production mechanism in adults (Lieberman, 1986). Social-constructivist models of gender and speech (Calder, 2019; Munson & Babel, 2019; Tripp & Munson, 2022; Zimman, 2017) argue that male-female speech differences, including differences in /s/, represent a combination of anatomical influences and socially and culturally learned, conventional ways of speaking to convey gender and related social meanings. One key piece of evidence for a social-constructivist account of gender and speech is the finding that children assigned male at birth (AMAB) and children assigned female at birth (AFAB) speak differently in ways that mirror adult men and women, respectively, as early as the third year of life (Fung et al., 2021; Munson et al., 2022; Perry et al., 2001). These differences appear far in advance of the sex dimorphism that occurs at puberty, and hence must represent learned ways of speaking.
One of the ways that AMAB and AFAB children differ is in their production of /s/: AFAB children as young as 3 years old produce /s/ with a higher peak frequency than AMAB children (Fox & Nissen, 2005; Li, 2017; Li et al., 2016; Nissen & Fox, 2005; Munson et al., 2022; Koeppe, 2022). This mirrors the differences between adult men and women (Jongman et al., 2000; Munson, McDonald et al., 2006): adult women produce /s/ with higher peak frequencies than men. There is ample evidence that these differences do not solely reflect sex dimorphism in the speech-production mechanism (Calder, 2019; Fuchs & Toda, 2010). There is some evidence to suggest that differences between AMAB and AFAB children’s /s/ production reflect active gender marking. Li et al. (2016) showed that the peak frequency of AMAB children’s /s/ was associated with measures of their adherence to cisnormative ‘male’ behaviors. Munson et al. (2015) showed that 5- to 10-year-old AMAB children’s /s/ productions differed between age-matched groups of cisgender boys and AMAB children who had been identified as gender nonconforming. These results are broadly consistent with a growing body of literature showing that children’s language acquisition reflects their differential social affiliation with the various individuals they encounter during language acquisition (Tripp et al., 2021).
Judgments of the gender of adults’ speech are influenced by the acoustic characteristics of /s/ (Munson, McDonald et al., 2006; Munson, 2007). The speech of people who produce a higher peak-frequency, more spectrally compact /s/ is rated as more-feminine and less-masculine sounding. This shows that people are aware, either implicitly or explicitly, of gender differences in /s/, and that they apply this knowledge when ascribing gender to a person based on their speech. There is also evidence that the perception of a talker’s gender affects expectations of the way that they produce fricatives like /s/. The identification of a synthetic /s/-/ʃ/ continuum differs depending on whether people believe the stimuli to have been produced by a man or a woman (Munson, 2011; Munson, Jefferson et al., 2006; Munson et al., 2017; Strand & Johnson, 1996; Winn et al., 2011). In these studies, people are more likely to identify a sound with a low peak frequency to be /s/ if they believe the person who produced it to be a man and /ʃ/ if they believe the person who produced it to be a woman. This is consistent with the production differences in /s/ between men and women. The peak frequency of /s/ is higher than the peak frequency of /ʃ/ for both men and women. Moreover, women produce both /s/ and /ʃ/ with higher peak frequencies than men. These effects interact, such that there is a frequency range that corresponds to /s/ produced by men and /ʃ/ produced by women. The fact that /s/-/ʃ/ identification differs as a function of the presumed gender of the talker shows that people are implicitly aware of gender differences in /s/, and that they apply this knowledge when they are deciding what sounds and words a person said. This finding is consistent with a variety of studies showing that social expectations influence speech perception pervasively (Babel & Russell, 2015; McGowan, 2015; Sumner et al., 2014).
Judgments of gender in children’s speech are also affected by phonetic detail of children’s /s/ (Munson, 2015; Munson et al., 2015; Munson et al., 2022): children with a high peak frequency /s/ are more likely to be labeled as female or as sounding girl-like by listeners. However, no study has yet examined whether identifying a child’s gender influences the perception of the quality of the sound they said, in a manner analogous to the /s/-/ʃ/ perception differences described in the previous paragraph. Differences in the production of /s/ between AMAB and AFAB children lead us to believe that such a relationship should exist. Tokens of /s/ with a relatively low peak frequency should be more likely to be identified as /s/ (i.e., as correct) when the talker is believed to be male than when the talker is believed to be female. The influence of perceived gender on ratings of /s/ accuracy might also reflect a belief that AFAB children produce speech more accurately than AMAB ones. This belief is consistent with studies on the higher incidence of SSD in AMAB children than in AFAB ones (Wren et al., 2016), and differences in speech accuracy between AMAB and AFAB children in the normative samples of standardized tests of articulation like the Goldman-Fristoe Test of Articulation-3 (Goldman & Fristoe, 2015).
Evidence for a child’s gender influencing perception would have important implications for the clinical practice of speech-language pathology and for our understanding of speech development. In principle, accounting for a child’s gender when evaluating tokens would represent a culturally responsive clinical process, as it would indicate SLPs would incorporate knowledge of sociolinguistic variation when appraising children’s speech. Such a practice would require a thorough assessment of gender that includes the participation of children themselves. As reviewed by Olezeski et al. (2020), there are many clinical and experimental methods for assessing children’s gender identity. Simply assuming that a child’s gender is the inevitable consequence of their sex assigned at birth would represent the imposition of cisnormative expectations.
Evidence that a child’s gender influences perception would also have strong implications for the development of automatized measures of children’s speech accuracy that are meant to replicate SLP/SLT perception (Benway et al., 2023a) and would provide further evidence of the need for acoustic measures to be normalized by gender in such schemes, as shown in the context of /ɹ/ errors by Benway et al. (2023b) and Campbell et al. (2018). In these schemes, acoustic measures of children’s productions are compared to those of their same-gender peers. Culturally responsible automated measures of /s/ accuracy would be especially useful because /s/ is one of the latest acquired sounds, is commonly misarticulated by children with speech sound disorder, and, along with /ɹ/ and /z/, accounts for approximately 90% of speech sound errors that persist into adolescence (Lewis et al., 2015; Shriberg et al., 1994; Smit et al., 1990). A finding that SLPs’/SLTs’ ratings for /s/ differ between AMAB and AFAB children, and that such a difference is exaggerated in tasks where the SLP/SLT knows the child’s SAB, would suggest that gender-normalization schemes are warranted in the development of automated measures of /s/ accuracy.
Previous studies have shown that ratings of children’s speech are indeed susceptible to bias. Munson et al. (2010) showed that continuous accuracy ratings of word-initial /s/ preceded by a carrier phrase were affected by whether the carrier phrase contained a developmental error (“I weawwy yike”) or not (“I really like”). Children’s productions of /s/ were more likely to be judged as more accurate in conditions where participants were led to believe that the child was younger than ones where they were led to believe the child was older. That study also showed that the likelihood of /s/ being judged as less accurate depended on whether the instructions for the experiment mentioned whether the study examined developmental misarticulations. These findings show that expectations of the age of the children being rated affected the likelihood of /s/ being rated as inaccurate.
Given the need for culturally responsive clinical practice, the purpose of this study is to examine whether SLPs’/SLTs’ perception of gender in children’s speech affects their perception of children’s speech accuracy. We examined whether SLPs rate /s/ productions differently depending on whether they are told the child’s sex assigned at birth, by comparing conditions where SLPs/SLTs were told the child’s sex assigned at birth with ones where they were not. We used /s/ productions of 4 AMAB children and 4 AFAB children who had a range of /s/ accuracies.
The children in the current study varied in age. Hence, we also included conditions in which people were or were not told the age of the child, which we refer to as priming conditions. The inclusion of age also allows us to examine whether providing any social label affects ratings of /s/ compared to a baseline condition where no information is given. Previous studies have shown that expectations about children’s age affect judgments of /s/ accuracy, as described above.
We predicted that the differences in ratings between AFAB and AMAB children’s /s/ productions would be greater in conditions in which people were told the children’s sex assigned at birth than in conditions in which they were not told that information. We also predicted that there would be a stronger influence of the child’s age on ratings of accuracy in conditions in which they were told the child’s age than in ones in which they were not. The design of the study allowed us to explore whether the amount or type of clinical experience mitigated the effects of priming age and SAB on ratings of /s/ accuracy. It also allowed us to explore whether the acoustic characteristics of fricatives predicted judgments differently for AFAB and AMAB children. More generally, this work contributes to an understanding of how the perception of children’s gender affects ratings of children’s speech accuracy and their language more broadly, as described by Shimko et al. (2020).
Methods
The procedures in this study were approved by the Institutional Review Boards at the authors’ institutions. Informed consent was provided by the participants in the listening study, and by the legal guardians of the children whose speech served as stimuli.
Stimuli.
The stimuli that we used were 87 productions of /s/-initial words. The words were chosen to be familiar to young children (as determined by age-of-acquisition norms), and to have /s/ in a variety of different vowel contexts. These comprised 11 words by 8 children, 4 AMAB and 4 AFAB, with one word lost due to a processing error. While the words were not identical across the eight children, the height and backness of the vowel following the /s/ were balanced across the eight individuals, so that any effects of coarticulatory context on accuracy would be balanced across talkers. There were 60 monosyllabic stimuli and 27 disyllabic stimuli.
The children ranged in age from 2;7 to 8;0 and consisted of four AMAB-AFAB pairs matched for age. The pairs broke down into two younger pairs (mean ages of 2;9 and 4;8) and two older pairs (mean ages of 7;0 and 8;0). The recordings were made as part of the children’s participation in studies in the first and fourth authors’ laboratories. More details of the recording sessions and the studies in which the children participated can be found in Munson et al. (2021), Preston et al. (2014), and Preston et al. (2016). In general, these studies elicited productions using tasks that resemble clinical assessments of children’s speech, like the Goldman-Fristoe Test of Articulation-3. Because the children had participated in previous studies where their speech production accuracy had been measured, the pairs were also matched for their overall speech production accuracy. The words that were selected for the study were free from extraneous background noise and represented a variety of error types, as described below. The tokens were also chosen because they had no frank deletion or substitution errors in sounds other than the initial /s/. Example words include scissors, sandwich, sad, and sick.
Authors NRB and CW coded the binary accuracy of these tokens (i.e., correct/incorrect), knowing the children’s age and SAB. These transcriptions were meant to be the figurative 'gold standard' against which the experimental participants' judgments would be compared. Hence, we refer to this as gold-coding and to the resulting judgments as gold-coded accuracy, following other similar recent studies (Nightengale et al., 2020). There was 92% (80/87) agreement on the initial accuracy judgments. The first author independently judged the seven tokens for which there was disagreement, and the majority judgment was used for the final accuracy calculation. The average accuracy for /s/ produced by AFAB children was 16.7%, and 15.9% for AMAB children. We examined whether this difference was statistically significant using a logit mixed-effects model, fit with the lme4 package in R (Bates et al., 2015), with significance tests computed using the lmerTest package (Kuznetsova et al., 2017) using Satterthwaite’s approximation for degrees of freedom. Accuracy was the dependent variable. Age and SAB (AMAB= −1, AFAB= 1) were the predictor variables. A random intercept for individual participants was included in the model. The model did not fit the data better than a baseline model with only the random intercept (χ2[df=2] = 4.003, p = .135). This shows gold-coded /s/ accuracy was not affected by age or SAB.
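The model comparison reported above is a likelihood-ratio test: the difference in deviance between the baseline and the fuller model is referred to a chi-square distribution whose degrees of freedom equal the number of added predictors (here two, age and SAB). As a rough arithmetic check on the reported p-value, the following Python sketch uses the closed-form chi-square upper-tail probability that holds for even degrees of freedom. The function name is hypothetical, and this is not the R code used for the analysis.

```python
import math

def chi2_sf_even_df(statistic, df):
    """Upper-tail probability of a chi-square variate, for even df only.

    For df = 2m the survival function has the closed form
    exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive, even df")
    half = statistic / 2.0
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms

# Chi-square statistic and df as reported for the gold-coded accuracy model.
p_accuracy = chi2_sf_even_df(4.003, 2)
print(round(p_accuracy, 3))
```

With two degrees of freedom the tail probability reduces to exp(-x/2), which for x = 4.003 agrees with the reported p = .135.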
To further describe the stimuli, a variety of acoustic measurements of the /s/ tokens was taken. The onset and offset of the /s/ were hand-marked in Praat. As Table 1 shows, there were six nonfricative realizations in the stimuli. The token with the stopping error was produced with a relatively intense and long aspiration interval, from which the acoustic measures were taken. The five tokens coded as gliding errors were produced with a sound perceptually intermediate between a true glide /j/ and a voiced palatal fricative /ʝ/, i.e., with an interval of aperiodicity. Acoustic measures were taken from this interval.
Table 1.
Stimulus Characteristics (N=11 for all talkers except s2002).
| ID | Age (years;months) | SAB | Gold-coded /s/ accuracy (% correct) | /s/ average spectral center of gravity (Hz) | /s/ average spectral skewness (Hz) | /s/ average duration (ms) | Most common error types (based on gold-coding transcription) |
|---|---|---|---|---|---|---|---|
| s0072 | 2;7 | f | 9.1% | 5927 | 1738 | 150 | Backing to /ʃ/ (n=9), stopping (n=1) |
| s0607 | 3;0 | m | 0% | 2794 | 1935 | 58 | Glide substitution (n=5), lateral misarticulation (n=6) |
| s0128 | 4;7 | f | 9.1% | 4828 | 2351 | 245 | Backing to /ʃ/ (n=1), lateral misarticulation (n=9) |
| s0076 | 4;9 | m | 18.2% | 6343 | 2051 | 188 | Frontal misarticulation (n=9) |
| s2002 | 7;0 | f | 0% | 4796 | 2301 | 277 | Frontal misarticulation (n=10) |
| s1029 | 7;0 | m | 27.3% | 6401 | 2102 | 162 | Lateral misarticulation (n=8) |
| s0028 | 8;0 | f | 54.5% | 7275 | 1947 | 193 | Frontal misarticulation (n=4) |
| s0035 | 8;0 | m | 18.2% | 7439 | 1778 | 130 | Frontal (n=2) and lateral (n=7) misarticulation |
Note. Talker s2002 contributed n=10 tokens.
The first two spectral moments (m1 and m2) were taken for the middle 40 ms interval of the fricatives, band-pass filtered with a lower cutoff frequency of 500 Hz (to remove any potential effects of voicing) and an upper cutoff of 10,000 Hz (to mitigate the effect of any differences in the high-frequency responses of the microphones used to make the recordings). The first spectral moment, spectral centroid, differentiates /s/ from /ʃ/. Higher centroids are associated with a greater likelihood of /s/ being judged as accurate (Holliday et al., 2015; Munson, 2015). The second spectral moment, spectral variance, differentiates /s/ from /θ/. Higher spectral variance is associated with /s/ being judged as inaccurate (Munson, 2015). Readers should be aware that recent research has critiqued the use of spectral moments to characterize fricatives (Shadle, 2023).
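To make the two measures concrete, the following Python sketch computes a power-weighted spectral centroid (m1) and spectral variance (m2) from the middle 40 ms of a signal. It is an illustration under stated assumptions, not the measurement procedure used for the stimuli: the band limits are applied by simply ignoring spectral components outside 500 to 10,000 Hz rather than by a true band-pass filter, the spectrum comes from a naive discrete Fourier transform with no tapering window, and the function name, sampling rate, and test signal are all hypothetical.

```python
import cmath
import math

def spectral_moments(samples, sample_rate, low_hz=500.0, high_hz=10000.0,
                     window_s=0.040):
    """Return (centroid in Hz, variance in Hz^2) for the middle window."""
    n_win = int(round(window_s * sample_rate))
    if len(samples) < n_win:
        raise ValueError("token is shorter than the analysis window")
    start = (len(samples) - n_win) // 2
    frame = samples[start:start + n_win]

    freqs, powers = [], []
    for k in range(n_win // 2 + 1):
        freq = k * sample_rate / n_win
        if not (low_hz <= freq <= high_hz):
            continue  # crude band limiting: skip bins outside the pass band
        # Naive DFT for one frequency bin; slow but dependency-free.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * n / n_win)
                    for n, x in enumerate(frame))
        freqs.append(freq)
        powers.append(abs(coeff) ** 2)

    total = sum(powers)
    centroid = sum(f * p for f, p in zip(freqs, powers)) / total
    variance = sum((f - centroid) ** 2 * p
                   for f, p in zip(freqs, powers)) / total
    return centroid, variance

# Synthetic 100 ms, 6 kHz tone sampled at 22,050 Hz (all values made up).
rate = 22050
tone = [math.sin(2 * math.pi * 6000 * n / rate) for n in range(rate // 10)]
centroid_hz, variance_hz2 = spectral_moments(tone, rate)
```

For a pure tone, nearly all of the energy falls in one frequency bin, so the centroid sits at the tone frequency and the variance is close to zero. A fricative, whose energy is spread across frequencies, yields a larger variance.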
A series of linear mixed-effects models examined whether spectral center of gravity, spectral variance, and fricative duration differed by SAB and age. Model-fitting paralleled that for accuracy: the fits of models including SAB and age were compared to that of baseline models with only a random intercept for participants. The fully specified models for spectral center of gravity and spectral variance did not fit the data better than baseline models (χ2[df=2] = 4.093, p = .129 for spectral center of gravity, χ2[df=2] = 0.985, p = 0.619 for spectral variance). That indicates that the spectral centroids and spectral variance did not differ across age or SAB. In contrast, the fully specified model for /s/ duration did fit the data better than the baseline model, χ2[df=2] = 6.302, p = 0.043. The fricatives produced by AFAB children were longer than those produced by AMAB children. Notably, none of the coefficients in the duration model were significant at the α=0.05 level. These values are presented in Table 1.
In sum, the /s/ productions used as stimuli did not differ statistically significantly in accuracy, or in their acoustic characteristics, between AMAB and AFAB children, or across the age range we studied. Hence, any influence of SAB and age on ratings of accuracy in conditions in which listeners were told the child’s age versus ones in which listeners were not would reflect listeners’ perceptual biases when judging /s/ accuracy, rather than characteristics of the fricatives themselves. This is not to say that there were no cues whatsoever to gender and age in these stimuli. Formant-frequency spacing, fundamental frequency, and voice quality likely reflected age and gender. This speculation is supported by findings that children’s gender can be detected from words that do not contain fricatives (Perry et al., 2001).
Participants.
The listeners were 95 speech-language pathologists. They were recruited through emails shared to two listservs hosted by the American Speech-Language-Hearing Association, SIG 1 and SIG 14. The limited recruitment venue was intended to decrease the likelihood that people who did not meet the inclusionary criteria would participate in bad faith, a phenomenon described by Roehl and Harland (2022), among others. The recruitment materials stated that participants should have a current SLP license in one of the 50 US states, should be between 18 and 64 years of age, and should have a minimum of one year of experience post clinical fellowship. Because this study took place in the US, where the term SLP is used exclusively, these participants are referred to henceforth as SLPs rather than SLPs/SLTs. People indicated their interest by filling out a Google Form verifying they met the inclusionary criteria. Each person was then sent an individualized link to participate. This two-step procedure was intended to further minimize the likelihood that people would participate in bad faith. It also decreased the completion rate: 155 people indicated interest through the Google Form, but only 95 completed the experiment. Participants were compensated with a gift card for $20 from an on-line retailer for participating. Participant demographics are presented in Table 2. The specific wording of the demographic questions is presented in the supplemental materials. Gender, racial identity, ethnic identity, and state were provided as fixed options.
Table 2.
Participant Characteristics, Separated by Condition
| Measure | No Priming (n=23): Mean | No Priming: SD | No Priming: Range | Sex Priming (n=24): Mean | Sex Priming: SD | Sex Priming: Range | Age Priming (n=23): Mean | Age Priming: SD | Age Priming: Range | Age and Sex Priming (n=25): Mean | Age and Sex Priming: SD | Age and Sex Priming: Range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Years SLP experience | 11.0 | 9.3 | 2-36 | 13.1 | 7.6 | 2-33 | 9.9 | 6.8 | 1-26 | 10.5 | 6.5 | 2-25 |
| Years SLP experience with children | 10.8 | 8.6 | 2-30 | 12 | 7.4 | 2-33 | 9.7 | 6.9 | 1-26 | 10.4 | 6.7 | 0-25 |
| Avg. hours SLP work/week | 36.1 | 11.3 | 4-60 | 34.1 | 12.4 | 6-50 | 34.8 | 11.5 | 2-50 | 32.4 | 12.7 | 2-48 |
| Avg. hours SLP work with children/week | 16.7 | 13.4 | 0-40 | 14.3 | 12.5 | 0-40 | 10.8 | 12 | 0-40 | 8.7 | 7.1 | 0.5-25 |
| Avg. hours with children under 8/week | 31.3 | 20.6 | 0-100 | 65.5 | 59.3 | 0-168 | 34.6 | 36.1 | 0-125 | 46.6 | 35.4 | 2-100 |

| Characteristic | No Priming (n=23) | Sex Priming (n=24) | Age Priming (n=23) | Age and Sex Priming (n=25) |
|---|---|---|---|---|
| %Peds | 78% | 83% | 61% | 68% |
| %SSD | 65% | 71% | 61% | 72% |
| Gender | 96% F, 4% Nonbinary | 83% F, 8% M, 4% Nonbinary, 4% No response | 91% F, 9% M | 100% F |
| Racial and Ethnic Identity | 13% Asian, 17% Black, 4% Multiracial, 4% No response, 61% white and non-Latine | 8% Black, 4% No response, 88% white and non-Latine | 4% Black, 4% Multiracial, 4% Latine of any race, 87% white and non-Latine | 4% Asian, 4% Black, 4% Multiracial, 12% Latine of any race, 76% white and non-Latine |
We asked a series of open-ended questions about clinical experience and practice patterns. Participants reported the number of years of experience in SLP (regardless of whether they worked full- or part-time, and regardless of whether they worked the entire year), the number of years they have worked as an SLP with children under 17, the number of hours a week they currently spend providing SLP services, the number of hours per week spent providing SLP services to children under 17, and the number of hours a week they spend with children under age 8 (including time spent with children outside of providing SLP services). Participants provided answers to the open-ended question “In your own words, how would you describe the facility you currently work in?” Table 2 distills these into pediatric settings (e.g., school, pediatric clinic) and non-pediatric settings (e.g., hospital, skilled nursing facility).
Consistent with the demographics of the SLP profession in the US (ASHA, 2022), there were more women (93%) than men (4%), people whose gender was neither exclusively male nor exclusively female (2%), or people who did not respond (1%). A chi-square test showed that this distribution did not differ significantly across conditions (χ2[df=9] = 9.49, p = .39). SLP listeners claimed a variety of racial and ethnic identities. Consistent with the demographics of the SLP profession in the US, most participants (78%) were white and non-Latine. The remaining racial identities were Asian and non-Latine (4%), Black and non-Latine (10%), Multiracial and non-Latine (2%), white and Latine (4%), or no response (2%). These did not differ significantly across the conditions, as determined by a chi-square test (χ2[df=15] = 19.51, p = .19). The percentage of SLPs who work primarily with children or primarily on SSD is presented in Table 2, separated by condition. These did not differ across conditions by a chi-square test (χ2[df=3] = 3.62, p = .31). Table 2 shows that the Ns for each condition were not identical, which was due to the fully random and automatic assignment of participants to condition. Table 2 also presents the average years of experience as an SLP, average hours per week working as an SLP, average hours per week working with children, and the average time per week that the participants interact with children under 8 (including at home). Only the last measure differed significantly across the groups, as shown by a between-factor analysis of variance (F[3,91] = 3.393, p = 0.02). This is because the sex priming condition included multiple people who reported spending 168 hours a week (i.e., their entire week) with children under 8, presumably as parents or primary caregivers of children under 8. This number inflates the time spent actively with children under 8, as it includes time spent sleeping.
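The chi-square tests of homogeneity reported above can be illustrated with the pediatric-setting percentages in Table 2. The Python sketch below is a hedged reconstruction rather than the original analysis: the counts are inferred by rounding the published percentages against the group sizes, and the function names are hypothetical. The p-value uses the closed-form upper-tail probability that applies specifically to three degrees of freedom.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-square statistic and df for a rows x columns count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

def chi2_sf_df3(statistic):
    """Upper-tail chi-square probability for df = 3 (closed form)."""
    y = statistic / 2.0
    return (math.erfc(math.sqrt(y))
            + 2.0 / math.sqrt(math.pi) * math.sqrt(y) * math.exp(-y))

# Group sizes and %Peds from Table 2. The counts are reconstructed from
# rounded percentages, which is an assumption.
group_sizes = [23, 24, 23, 25]
percent_peds = [78, 83, 61, 68]
peds_counts = [round(p * n / 100) for p, n in zip(percent_peds, group_sizes)]
table = [[c, n - c] for c, n in zip(peds_counts, group_sizes)]

chi_square, df = pearson_chi_square(table)
p_value = chi2_sf_df3(chi_square)
print(round(chi_square, 2), df, round(p_value, 2))
```

Under these assumptions the reconstructed counts are 18, 20, 14, and 17, and the statistic and p-value come out at about 3.62 and .31 on three degrees of freedom, in line with the values reported for the practice-setting comparison.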
Procedures.
SLPs were instructed to wear headphones during the experiment, and to participate in a quiet location using either a laptop or a tablet. The experiment was implemented with the Qualtrics software. Qualtrics was configured not to accept IP addresses outside of the US, to enforce the inclusion criterion that participants come from the US. Prior to the full experiment, participants were instructed to adjust the volume of their computer to a comfortable level. They were allowed to play a speech token not used in the main experiment as many times as needed while adjusting their computer’s output volume.
The experiment had four conditions using a between-groups design, with SLPs assigned to a single condition. In the first condition (no priming), SLPs were told neither the child’s age nor the child’s SAB. In the second condition (sex priming), SLPs were told only the child’s SAB. In the third condition (age priming), SLPs were told only the child’s age. In the fourth condition (age and sex priming), SLPs were told both the child’s age and SAB. Participants were assigned to condition randomly by Qualtrics, which resulted in an uneven number of participants across conditions.
On each trial, a single token was played. In the priming conditions, the child’s age and/or SAB appeared prominently above the button that participants had to press to play the stimuli. There were two versions of the age and sex priming condition, one in which age was presented above SAB, and one in which the SAB was presented above age. Approximately half of the participants in this condition were assigned to each version. After hearing each stimulus, SLPs were first asked whether the word-initial /s/ was correct. We refer to these responses as binary accuracy judgments. If they answered that it was incorrect, they were asked two additional questions. First, they were asked how incorrect it was, with four responses: Slightly Incorrect, Moderately Incorrect, Mostly Incorrect, and Completely Incorrect. This question was motivated by the results of studies showing that people are sensitive to degrees of accuracy in children’s productions (McAllister Byun et al., 2016; Munson et al., 2010). We refer to these as accuracy ratings. Finally, we asked people the type of error that the child produced, with four responses: frontal misarticulation, lateral misarticulation, stopping error, or a different type of error (with an option to specify the specific error type).
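Why fully random assignment tends to produce unequal group sizes can be seen in a small simulation. The Python sketch below assigns each of 95 participants to one of the four conditions independently and with equal probability. It illustrates simple randomization in general and makes no claim about how Qualtrics implements its randomizer; the function name and seed are hypothetical.

```python
import random
from collections import Counter

CONDITIONS = ["no priming", "sex priming", "age priming", "age and sex priming"]

def assign_conditions(n_participants, seed):
    """Assign each participant to one condition independently and uniformly."""
    rng = random.Random(seed)
    return [rng.choice(CONDITIONS) for _ in range(n_participants)]

assignments = assign_conditions(95, seed=1)  # the seed is arbitrary
counts = Counter(assignments)
print(counts)
```

Because each assignment is independent, the group sizes vary from run to run and are rarely exactly equal. A blocked or counterbalanced scheme would be needed to force equal Ns.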
After all the sounds had been rated, SLPs in the no priming, age priming, and sex priming conditions were asked an additional set of questions. Participants in the no priming and sex priming conditions were played a single word produced by each child and were asked “What is your best guess of the child’s age in years and months?” They answered by typing the age in years and months. SLPs in the no priming and age priming conditions were presented with a single word by each child and were asked “What is your best guess of the child’s gender?” They answered by selecting from the list female, male, or a gender that is neither exclusively male nor exclusively female. After both questions, individuals reported their confidence in their response, using the choices Not at All Confident, Somewhat Confident, Mostly Confident, or Completely Confident. These confidence ratings are not analyzed further.
Results
Accuracy Ratings
The first set of analyses examined whether children’s age and SAB affected ratings of accuracy, and whether these effects differed across the four conditions. Prior to conducting statistical analyses, we examined the distribution of responses across the 95 listeners across the four conditions. Recall that the average gold-coded accuracy was 16.3%. Figure 1 is a scatter plot of the proportion of items reported to be correct for each of the listeners, plotted against the number of years of experience as an SLP. This figure shows that there was a very wide range of accuracy ratings, all of which were higher than the gold-coded accuracy ratings (shown by the dashed line). Figure 1 shows that the proportion of items reported to be accurate was not associated with years of experience as an SLP (r = 0.05, p = .66). This finding is important because some previous research has shown that the duration of clinical experience predicts continuous accuracy ratings of children’s speech (Meyer & Munson, 2021; Munson et al., 2012). It also did not correlate with years of experience as an SLP working primarily with children, hours per week spent working with children, hours per week spent around children under 8, or whether or not they work in a pediatric setting (all r’s < 0.14, all p’s > 0.05).
Figure 1.

Scatterplot relating the total proportion of items rated as correct by individual participants by their years of experience working as a speech-language pathologist. The dashed line represents the accuracy as determined by the authors NRB and CW.
The second set of analyses examined whether binary accuracy judgments differed as a function of the child’s SAB and age, and whether these effects varied across the four conditions. We predicted that there would be an effect of SAB on ratings, with AFAB children receiving higher ratings than AMAB children. We also predicted that older children would receive more 'accurate' judgments than younger children. There were no differences in the acoustic characteristics of the AMAB and AFAB children’s /s/ productions, and no effect of age on the acoustic characteristics. Hence, any effects of age or SAB should reflect perceptual biases rather than differences in the productions themselves.
Logistic mixed-effects regression was used to examine predictors of binary accuracy judgments. The R packages lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017) were used to fit models and assess the significance of individual coefficients, using Satterthwaite’s approximation for degrees of freedom. The dependent measure was whether the item was judged to be accurate or inaccurate (coded as 1 and 0, respectively). Participant was a random effect. A series of analyses also included item as a random effect, but more-complex models including the random effect of item did not converge, and hence it was removed as a random effect from all models. The baseline model contained only the random effect for participant. The next model included a variable coding SAB (AFAB=1, AMAB=−1) and age (z-transformed to improve model fit). This model fit the data better than the baseline model, χ²[df=2] = 1373.2, p < 0.001. The next model contained a factor for condition in interaction with both SAB and age. For this factor, the no priming condition served as the reference level. This model fit the data better than the model without condition, χ²[df=9] = 19.73, p = 0.02. The coefficients for this model are presented in Table 3.
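The model comparisons above are likelihood-ratio tests, in which the χ² statistic is referred to a chi-square distribution whose degrees of freedom equal the number of added parameters. The following is a minimal sketch of that final step only, written in Python rather than the R used in the study; the function name `chi2_sf` is our own, and the sketch does not refit any model.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{i=0}^{df/2 - 1} y^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) plus a finite sum over half-integer gamma values
    total = sum(
        y ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(y)) + math.exp(-y) * total


# Test statistics and degrees of freedom as reported in the text.
p_sab_age = chi2_sf(1373.2, 2)    # adding SAB and age to the baseline model
p_condition = chi2_sf(19.73, 9)   # adding condition and its two interactions

print(f"SAB + age vs. baseline: p = {p_sab_age:.3g}")
print(f"condition vs. SAB + age: p = {p_condition:.3f}")
```

Under these assumptions, the second p-value should come out close to the reported 0.02, and the first is far below 0.001.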
Table 3.
Coefficients for the fixed effects in logistic mixed-effects model predicting binary judgments from Sex Assigned at Birth (SAB), Age, and condition. (Variance for the random effect of participant = 0.44, total log likelihood = −4701.7). The reference level for condition was no priming.
| Factor | Estimate | Standard Error | z-score | Pr(>∣z∣) |
|---|---|---|---|---|
| (Intercept) | 0.109 | 0.150 | 0.7 | 0.47 |
| SAB (AFAB¹ = 1, AMAB² = −1) | 0.147 | 0.052 | 2.8 | 0.01 |
| Condition: Sex Priming | 0.033 | 0.208 | 0.2 | 0.87 |
| Condition: Age Priming | 0.021 | 0.210 | 0.1 | 0.92 |
| Condition: Sex and Age Priming | 0.302 | 0.207 | 1.5 | 0.14 |
| Age in months (z-transformed) | 0.897 | 0.055 | 16.4 | <0.01 |
| SAB*Condition: Sex Priming | 0.093 | 0.072 | 1.3 | 0.20 |
| SAB*Condition: Age Priming | 0.174 | 0.073 | 2.4 | 0.02 |
| SAB*Condition: Sex and Age Priming | 0.256 | 0.073 | 3.5 | <0.01 |
| Age*Condition: Sex Priming | −0.040 | 0.076 | −0.5 | 0.60 |
| Age*Condition: Age Priming | −0.035 | 0.077 | −0.5 | 0.65 |
| Age*Condition: Sex and Age Priming | 0.107 | 0.077 | 1.4 | 0.16 |
¹ Assigned female at birth
² Assigned male at birth
Table 3 shows there was a significant main effect of age on binary accuracy judgments: older children’s productions were more likely to be judged to be accurate than younger children’s productions. This did not interact with condition. There was also a main effect of SAB on binary accuracy judgments: productions by AFAB children were more likely to be judged to be accurate than productions by AMAB children. This interacted with condition: the effect of SAB was larger in the age priming and sex and age priming conditions than the effect in the no priming and sex priming conditions. This is illustrated in Figure 2. Figure 2 plots individual SLPs’ proportion of ‘accurate’ responses, separated by SAB and condition.
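To make the size of these effects more concrete, the coefficients in Table 3 can be converted from the log-odds scale to odds ratios and probabilities. The sketch below is illustrative only: it uses the fixed-effect estimates from Table 3, assumes a child of mean age (z = 0) and a participant whose random intercept is zero, and uses function names of our own choosing. Because SAB was coded AFAB = 1 and AMAB = −1, the AFAB-versus-AMAB difference in log-odds is twice the SAB coefficient.

```python
import math

# Fixed-effect estimates copied from Table 3 (log-odds scale).
INTERCEPT = 0.109
SAB = 0.147  # effects-coded: AFAB = +1, AMAB = -1
CONDITION = {
    "no priming": 0.0,  # reference level
    "sex priming": 0.033,
    "age priming": 0.021,
    "sex and age priming": 0.302,
}
SAB_BY_CONDITION = {
    "no priming": 0.0,  # reference level
    "sex priming": 0.093,
    "age priming": 0.174,
    "sex and age priming": 0.256,
}


def logistic(x: float) -> float:
    """Inverse logit: map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_accurate(sab_code: int, condition: str) -> float:
    """Probability of an 'accurate' judgment at mean age (z = 0) for a
    participant with a random intercept of zero (illustrative assumption)."""
    slope = SAB + SAB_BY_CONDITION[condition]
    return logistic(INTERCEPT + CONDITION[condition] + slope * sab_code)


def sab_odds_ratio(condition: str) -> float:
    """Odds of 'accurate' for AFAB relative to AMAB; the factor of 2
    reflects the +1/-1 coding of SAB."""
    return math.exp(2.0 * (SAB + SAB_BY_CONDITION[condition]))


for cond in CONDITION:
    print(
        f"{cond}: OR = {sab_odds_ratio(cond):.2f}, "
        f"P(AFAB) = {prob_accurate(1, cond):.3f}, "
        f"P(AMAB) = {prob_accurate(-1, cond):.3f}"
    )
```

On this reading, the odds ratio grows from the no priming condition to the sex and age priming condition, which is the pattern the interaction terms describe.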
Figure 2.

Bar plot and strip-chart showing the proportion of items rated as correct by individual participants, separated by children’s sex assigned at birth (AFAB=assigned female at birth, AMAB=assigned male at birth) and condition.
Figure 2 shows that the difference in binary accuracy judgments between AFAB and AMAB children was smallest in the no priming condition, and larger in the other three conditions. Interestingly, the mean difference was largest, and most variable across participants, in the sex priming condition. The increased effect of SAB on ratings appeared to be driven by the ratings for AMAB children being lower in the three priming conditions than in the no priming condition. The ratings for AFAB children were more stable across conditions, with a slightly higher mean rating in the age and sex priming condition.
Recall that people who provided judgments of ‘incorrect’ were asked to indicate the level of accuracy from a choice of four. Hence, for each stimulus, there were five degrees of accuracy ratings: correct, slightly incorrect, moderately incorrect, mostly incorrect, and completely incorrect. The proportion of responses falling into these five categories is shown in Table 4. A further set of analyses examined predictors of these gradient accuracy ratings. Because these data were nominal, we used Multinomial Mixed-Effects Logistic Regression (henceforth multinomial mixed-effects models) to predict these measures. Multinomial mixed-effects models predict the likelihood of the outcome of three or more options, in this case, the five levels of accuracy ratings. In these models, one of these levels of accuracy (in this case, correct) served as the reference level. The model then compared the likelihood of each of the other responses to the likelihood of the reference response.
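The comparison to a reference level can be illustrated with a short sketch. In a baseline-category logit model, each non-reference response has its own linear predictor giving its log-odds relative to the reference response, and the five probabilities follow from normalizing the exponentiated predictors. The example below uses only the four intercepts reported in Appendix A, and assumes they are log-odds of each ‘incorrect’ category relative to ‘correct’ for a child of mean age in the no priming condition, with the effects-coded SAB term at zero and a participant random effect of zero. The function name is ours, not part of any package.

```python
import math

CATEGORIES = [
    "correct",  # reference level
    "slightly incorrect",
    "moderately incorrect",
    "mostly incorrect",
    "completely incorrect",
]


def category_probabilities(log_odds_vs_correct):
    """Convert baseline-category logits (each non-reference category
    versus 'correct') into probabilities for all five categories."""
    # The reference category has a linear predictor fixed at 0.
    etas = [0.0] + list(log_odds_vs_correct)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(etas)
    weights = [math.exp(eta - top) for eta in etas]
    total = sum(weights)
    return {cat: w / total for cat, w in zip(CATEGORIES, weights)}


# Intercepts of the four sub-models in Appendix A.
intercepts = [-0.998, -1.545, -2.639, -3.942]
probs = category_probabilities(intercepts)

for cat, p in probs.items():
    print(f"{cat}: {p:.3f}")
```

Adding a coefficient (for example, the SAB term) to one of the four linear predictors shifts probability between that category and the others, which is why each sub-model in Appendix A has its own set of estimates.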
Table 4.
Proportion of gradient accuracy rating responses.
| Condition | Accuracy | Sex Assigned at Birth: AFAB | Sex Assigned at Birth: AMAB |
|---|---|---|---|
| No Priming | Correct | 54.1% | 50.5% |
| | Slightly Incorrect | 21.2% | 16.3% |
| | Moderately Incorrect | 14.1% | 8.9% |
| | Mostly Incorrect | 6.2% | 4.1% |
| | Completely Incorrect | 4.4% | 20.3% |
| Sex Priming | Correct | 56.2% | 49.1% |
| | Slightly Incorrect | 19.8% | 16.1% |
| | Moderately Incorrect | 13.0% | 10.3% |
| | Mostly Incorrect | 7.1% | 4.2% |
| | Completely Incorrect | 4.0% | 20.2% |
| Age Priming | Correct | 57.9% | 48.0% |
| | Slightly Incorrect | 18.7% | 17.8% |
| | Moderately Incorrect | 12.8% | 9.1% |
| | Mostly Incorrect | 6.9% | 4.2% |
| | Completely Incorrect | 3.7% | 20.9% |
| Sex and Age Priming | Correct | 63.7% | 51.0% |
| | Slightly Incorrect | 18.3% | 16.5% |
| | Moderately Incorrect | 9.9% | 7.1% |
| | Mostly Incorrect | 5.1% | 5.0% |
| | Completely Incorrect | 3.0% | 20.4% |
The multinomial mixed-effects models were fitted with the mblogit function in the R package mclogit (Elff, 2022). The model-fitting procedure paralleled that for the logistic mixed-effects models described earlier. The coefficients for the most-complex model, analogous to the logit mixed-effects model presented in Table 3, are shown in Appendix A. The model converged after seven iterations, with a final deviance of 16408.75 and a criterion < 0.001. As Appendix A shows, the patterns mirrored those for the logit mixed-effects model described in Table 3. In particular, the most robust interaction between condition and sex assigned at birth was for the comparison between the no priming and the sex and age priming conditions. Moreover, the effect of SAB was most pronounced for items that were rated as completely incorrect. We return to these points in the discussion.
A series of analyses explored whether the five clinical-practice related variables in Table 2 (years of SLP experience, years of experience working with children, hours per week of SLP work, hours per week spent around children under 8, whether the primary workplace is a pediatric facility) predicted the magnitude of the effects of age and SAB on binary accuracy judgments. These analyses compared models that included each of these measures and their interaction with child’s SAB and child’s age and assessed whether they fit the data better than models without them. None of the models that included practice-related variables fit the data better than models without them.
Participants were asked to list the error type for items that were judged to be incorrect. The error types that were identified varied considerably across the participants, and a formal analysis of these differences is outside of the scope of this paper. One notable finding was that the response option indicating a type of error other than the ones listed in the question (with an option to describe the error) was chosen considerably more often for AMAB children (38.6% of all responses to this question, which excludes tokens identified as correct) than for AFAB children (12.4%). This finding suggests that the tokens by AMAB children were treated with more scrutiny than those produced by AFAB children.
Next, we examined how the acoustic variables predicted ratings. Though the purpose of this study was not to examine how acoustic features predict binary accuracy judgments, this analysis was included to ensure that the effect of SAB on ratings, and the interaction between that effect and condition, could not be attributed solely to the acoustic characteristics of the fricatives being rated. To assess this, we examined models predicting binary accuracy judgments from SAB, age, and one acoustic measure, m1. M1 was chosen as the acoustic measure because it best discriminates among places of articulation of English fricatives (Jongman et al., 2000). Moreover, recall that we predicted that the effect of SAB on ratings would mean that similar m1 values would elicit a different likelihood of being rated as accurate depending on whether the child was AFAB or AMAB, based on studies of the perception of adults’ speech. To examine this, we fit a logit mixed-effects model in which binary accuracy judgments were the dependent variable. The fixed effects were age (z-transformed to improve model fit), SAB (AFAB=1, AMAB=−1), and stimulus m1 (z-transformed to improve model fit). The model included interactions between SAB and m1, and between age and m1. Participant was a random effect. The model fit the data considerably better than a model without m1, χ²[df=3] = 1221.1, p < 0.001. The coefficients for the model showed significant effects of age and SAB, as well as a significant effect of m1 (β = 1.274, SEM = 0.044, z = 28.8, p < 0.001) and, as predicted, a significant interaction between m1 and SAB (β = −0.362, SEM = 0.043, z = −8.4, p < 0.001). This interaction is illustrated in Figure 3. Figure 3 is a scatterplot of the average accuracy ratings by stimulus m1, separated by SAB. The lines represent the logistic functions illustrating the likelihood of a token being rated as accurate across the range of m1 frequencies, calculated separately for AFAB and AMAB children.
Figure 3 shows that our prediction was somewhat supported: for m1 frequencies below 7000 Hz, the AFAB children’s productions were more likely to be rated as accurate than AMAB children’s productions. The figure also shows a more categorical relationship between m1 and accuracy for AMAB children than for AFAB children.
Figure 3.

Scatterplot showing the proportion of listeners who judged individual items to be correct by that item’s first spectral moment (m1, i.e., the spectral centroid). Children assigned female at birth (AFAB) and children assigned male at birth (AMAB) plotted separately. Lines represent the results of logistic mixed-effects regressions predicting ratings of accuracy for individual items from m1, with participant as a random effect.
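For readers less familiar with spectral moments, m1 is the power-weighted mean frequency of a spectrum. The sketch below illustrates only that definition on a hypothetical synthetic signal. It is not the measurement procedure used for the stimuli, whose windowing, frequency range, and spectral estimation method are not described in this section, and the function names are our own.

```python
import cmath
import math


def power_spectrum(samples, sample_rate):
    """Naive DFT power spectrum for bins from 0 Hz up to the Nyquist frequency."""
    n = len(samples)
    freqs, powers = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples)
        )
        freqs.append(k * sample_rate / n)
        powers.append(abs(coeff) ** 2)
    return freqs, powers


def spectral_centroid(freqs, powers):
    """First spectral moment (m1): the power-weighted mean frequency."""
    return sum(f * p for f, p in zip(freqs, powers)) / sum(powers)


# Hypothetical test signal: a 7000 Hz tone plus a weaker 3000 Hz tone.
sample_rate = 44100
n_samples = 441  # 10 ms of signal, so DFT bins fall every 100 Hz
signal = [
    math.sin(2 * math.pi * 7000 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 3000 * t / sample_rate)
    for t in range(n_samples)
]

freqs, powers = power_spectrum(signal, sample_rate)
m1 = spectral_centroid(freqs, powers)
print(f"m1 = {m1:.0f} Hz")
```

Because the weaker tone carries one quarter of the power of the stronger one, the centroid falls at about 6200 Hz, between the two components and closer to the stronger one. In the same way, energy concentrated at lower frequencies pulls m1 down.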
Recall that m1 did not differ statistically significantly between AFAB and AMAB children. However, Figure 3 illustrates that AMAB children produced more tokens with an atypically low m1 than AFAB children, as would be expected given the findings of Koeppe (2021). An additional statistical model was fit to examine whether these non-significant differences in m1 accounted for the effect of SAB and its interaction with condition. This model included the same terms as the model described in Table 3, with the addition of a fixed effect for m1 (z-transformed to improve model fit). This model differed from that in the previous paragraph by including condition as a predictor. The coefficients for that model broadly resembled those in Table 3: the coefficient for SAB, and its interactions with the age priming and sex and age priming conditions, were significant. This indicates that the different distributions of m1 did not explain the different ratings given to AFAB and AMAB children.
Ratings of Children’s Age and Sex
Recall that participants in the no priming and age priming conditions were asked what they believed the gender of the eight children to be at the end of the experiment. Participants in the no priming and sex priming conditions were asked what they believed the age of the eight children to be at the end of the experiment. The final analysis examined these ratings, to determine how robustly children’s gender and age were conveyed through the stimuli presented in the experiment.
Gender ratings were examined using logit mixed-effects models. The dependent measure was the gender rating. The predictors were the child’s SAB and condition (with age priming as the reference level) and their interaction, and participant was included as a random effect. Recall that gender ratings were one of three responses: male, female, or a gender that is neither exclusively male nor exclusively female. Only two responses for the latter were given, which was not a large enough N to justify including this response category. Hence, these two responses were excluded from the analysis. Again, the model was fit with the R package lme4. In the final model, condition and its interaction with SAB were not significant. The coefficient for SAB was significant (β = −0.760, SEM = 0.173, z = −4.386, p < 0.001). Overall, the children’s SAB was reported accurately in 65% of responses.
Age ratings were examined with linear mixed-effects regression. The dependent measure was the age guess in months. The predictors were the child’s age in months (z-transformed to improve model fit) and condition (with no priming as the reference level), and their interaction. Participant was a random effect. The R package lmerTest was used to generate significance tests, using Satterthwaite’s approximation for degrees of freedom. The coefficient for child’s age was statistically significant (β = 19.23, SEM = 0.879, t[327.2] = 21.876, p < 0.001). This shows that, somewhat unsurprisingly, the child’s age strongly predicted the age guess. The coefficient for the interaction between child’s age and condition was also significant (β = −3.3857, SEM = 1.2597, t[327.2] = −2.688, p = 0.008). Because no priming was the reference level, this negative coefficient shows that the relationship between child’s age and age ratings was stronger in the no priming condition than in the sex priming condition.
Together, these findings show that children’s age and sex could be perceived through these stimuli at greater-than-chance levels. This supports the possibility that the influence of children’s age and sex on ratings could be due to the perception of these variables through phonetic characteristics of the stimuli.
Discussion
The results in this paper provide an important 'proof of concept' that social variables can affect judgments of children's speech accuracy. The most noteworthy empirical findings of this study are as follows. The SLP listeners were considerably more likely than the gold-coders to rate the 87 tokens of /s/ to be accurate. Moreover, there was substantial variation across the 95 participants in the propensity to rate /s/ tokens as accurate. These two findings are interesting because they suggest substantial variation in ideologies about variation in /s/. One of the motivations for the current study is the observation that /s/ productions by adults vary substantially across social categories. Knowledge of such variation might make the task of assessing /s/ accuracy in children challenging if SLPs were to expect children’s productions to show similar gender-related variation. The variation across raters might indicate a range of ideologies about whether /s/ variation is due to normal sociolinguistic variation or to the presence or absence of a communication impairment. We presume that these differences also affect the gold-coded accuracy, made by authors NRB and CW. While these differences between the gold-coders and the participants do not relate to gender or age per se, they might reflect differences overall in ideologies about speech correctness or maturity.
The 95 participants judged the /s/ productions of AMAB children to be less accurate than those of AFAB children. Recall that there was a small, statistically non-significant difference between the AMAB and AFAB children’s accuracy as assessed by the gold-coders, with AMAB children being less accurate. However, there were much larger differences between AFAB and AMAB children for the 95 participants. That effect was robustly statistically significant in our logit mixed-effects model. These differences were not due to simple acoustic differences between AMAB and AFAB children’s productions, as the interaction was significant in a model where the principal acoustic measure differentiating incorrect and correct productions was included as a covariate. Recall that by 2 ½ years of age, AMAB and AFAB children speak in perceptibly different ways. One reasonable interpretation of this effect is that participants implicitly identified children’s gender, and then judged the accuracy of their /s/ productions consistent with their beliefs about gender and speech. Evidence for this interpretation comes from the finding that participants in the two conditions where SAB was not revealed identified the children’s SAB at greater-than-chance levels. It is important to emphasize here that these findings are based on a small number of children. It remains to be seen whether they would generalize to the entire population of AMAB and AFAB children.
Finally, there was a significant interaction between children’s sex assigned at birth and condition. Specifically, the AMAB children were rated to be less accurate in the two conditions in which participants were told the children’s age, either alone or in combination with their SAB, when compared to a condition where they were told neither the age nor the SAB. Contrary to predictions, we did not find any evidence that telling people whether a child was AMAB or AFAB changed the ratings relative to a condition in which they were not told this information. A further analysis showed that this difference was driven largely by tokens that were identified as completely incorrect, rather than as slightly, moderately, or mostly incorrect. This finding was somewhat unexpected, as previous research by Munson et al. (2010) found biasing to be strongest for productions that were neither clearly correct nor clearly incorrect. This difference may relate to the types of responses used in different studies. Munson et al. (2010) used a single continuous rating scale to elicit responses, while the current study used a two-step process in which a binary accuracy judgment was followed by a nominal accuracy rating.
While we did find that the suggestion of a social label affected ratings of accuracy, the relationship between the label and the ratings was not straightforward. We predicted that priming sex would lead to a bigger difference in accuracy ratings for AFAB and AMAB children. Instead, the difference was greater in conditions where the child’s age was revealed to participants. Both age and gender are highly salient variables, so this difference is unlikely to be due to the salience of these labels. Instead, it may relate to social desirability bias, that is, the tendency for experimental participants to respond in ways they perceive to be socially desirable rather than in ways that reflect their true beliefs (Krumpal, 2013). The sole mention of gender in the sex priming condition may have prompted participants to actively monitor and suppress a tendency to rate AFAB and AMAB children differently, under the assumption that doing so would represent socially undesirable gender discrimination. In contrast, age might not elicit social desirability effects because it is not associated with overt discrimination in children. Indeed, SLPs/SLTs pay close attention to age in their clinical practice, as many communication disorders are defined by performance relative to children’s same-age peers. This explanation is bolstered by the general social salience of /s/ variation and gender, as evidenced by, for example, online discourse on sexuality and speech (Mulliner, 2019). This speculation could be examined in future work assessing the overt and covert beliefs that SLPs/SLTs have about gender and speech.
Regardless of why age and sex priming behaved differently, the results of this study show clearly that invoking social attributes, either age alone or age and sex, led people to rate the /s/ productions of AMAB children to be less accurate than in a condition where neither was primed. One of the motivations for this study was to determine whether automatic feedback systems for /s/ productions should consider a child’s gender when associating acoustic variables with binary accuracy judgments. The results of the study do not provide a straightforward answer to that question. On the surface, the findings in the context of the previous literature would seem to suggest the answer is a clear yes. Previous studies have shown that AMAB and AFAB children produce /s/ with different acoustic characteristics. The results of this study show that people rate AMAB and AFAB children’s /s/ productions differently, even when acoustic characteristics of these productions are controlled statistically. Both findings argue that automatic feedback systems should have knowledge of whether a child is a boy or a girl, both to reflect production differences and to replicate SLP/SLT ratings.
A closer look at the data, however, suggests that the answer to the question is probably not. The primary reason for this is that SLPs/SLTs should be simultaneously sensitive both to clinical variation and to sociolinguistic variation. That is, clinicians should be focused on improving communication skills (i.e., sensitive to clinical variation) in ways that center clients’ socially agentive use of language to convey their social affiliations (i.e., sensitive to sociolinguistic variation). In the introduction, we argued that accuracy judgments reflect an SLP/SLT’s assessment of whether children are speaking in an adult-like manner, and that every adult speaks differently. It is also true that adults speak differently in different contexts. For example, Maniwa et al. (2009) showed that productions of /s/ in clear speech styles were systematically different from those in conversational speech styles. Calder (2019) showed that productions of /s/ varied within individuals (in this case, drag performers) as part of their performances of different genders. We believe that SLPs/SLTs are responsible for facilitating different ways of speaking that reflect the demands and expectations of different environments. However, every clinician’s knowledge of the range of variation in speech production is necessarily limited by their own knowledge of and experience with different ways of speaking. Upon reflection, the gold-coded judgments reported in this study were likely made with the most formal and official speaking styles in mind, ones analogous to the clear-speech styles examined by Maniwa et al. It is not surprising, then, that they did not differ between AMAB and AFAB children. The participants, in contrast, likely invoked a variety of socially imbued expectations when making binary accuracy judgments. These resulted in a wide range of binary accuracy judgments, and evidence of social influences on those judgments.
Ultimately, clients receiving services for /s/ misarticulation must command the full range of ways of producing /s/, including the more-formal ways reflected in the gold-coded judgments, and the socially imbued ways reflected in the participants’ judgments. Given the variability of the latter, we believe that any automated system should default to using the more uniform, stricter gold-coded judgments from individuals trained in sociophonetic expectation and variation. Such a system should be careful to highlight that these judgments reflect expectations about specific ways of speaking, and not the full range of speaking styles that clients would ultimately hope to command. To be clear, just because the gold-coded judgments in the current study weren’t affected by social variables does not mean that they are socially neutral. Clear speech styles arguably reflect the expression of social power (Tripp & Munson, 2022), and hence reflect societal imbalances in access to power. Ultimately, speech-language clinicians, along with other scholarly communities and professions, should lead broader societal discussions of what it means to speak accurately, acknowledging that what is accurate in one social context is not necessarily accurate in another. One way that SLP/SLT can contribute to that is to focus more on ear training, using productions from a diverse set of talkers.
An additional reason to take a more cautious approach is that we do not yet fully understand why AFAB and AMAB children produce /s/ differently. In contrast, the reasons for the formant-frequency normalization for /ɹ/ described by Benway et al. were arguably due to sex dimorphism in the vocal tract. If the reason for /s/ differences between AMAB and AFAB children is that they, like adults, use /s/ variation to mark gender, then the relevant differences should not be based on SAB, but on gender. Sociolinguistic studies of gender and language show that differences in language forms among genders are often related to other social variables (Munson & Babel, 2019; Tripp & Munson, 2022). Here it is important to emphasize that our use of the labels AFAB and AMAB is not because we believe /s/ variation to be driven solely by SAB, but because the children’s gender was not measured in the studies from which the stimuli were taken. As with adults, children’s gender does not follow automatically from SAB. A recent study by Kidd et al. (2021) found 1.8% of adolescents did not have a cisgender identity. Moreover, previous studies have demonstrated that variation in AMAB children’s gender predicts variation in phonetic variables, including characteristics of /s/ (Li et al., 2016; Munson et al., 2015). Measures of children’s gender should arguably be part of SLPs’/SLTs’ assessments, given the strong evidence that children’s gender is reflected in the ways they use language.
What this Paper Adds.
What is already known on the subject?
Adult men and women produce /s/ differently. A consensus is that these differences reflect sociolinguistic gender marking, rather than being the passive consequence of vocal-tract differences. Recent studies have shown that children assigned female at birth (AFAB) and those assigned male at birth (AMAB) produce /s/ differently in ways that mirror the differences between adult men and women, and which presumably reflect gender marking.
What this paper adds to existing knowledge
We asked whether US-based Speech-Language Pathologists' (SLPs) ratings of the accuracy of /s/ differ depending on whether they are rating an AFAB or an AMAB child, and whether these differences are greater in conditions in which people are told the sex assigned at birth (SAB) of the child being rated. We found that SLPs were more likely to judge AFAB children’s /s/ productions to be accurate than AMAB children’s, even though the productions from the AMAB and AFAB children that were used as stimuli were matched for accuracy as determined by trained researchers.
What are the potential or actual clinical implications of this work?
SLPs/SLTs should be sensitive to the influence of social variables when assessing /s/. SLPs/SLTs might rate children’s productions differently depending on whether they believe they are rating an AFAB or an AMAB child.
Acknowledgments
We thank Abby Hammell for programming the perception experiment, and for consulting on the design of the study. The collection of productions used as stimuli in this study was funded by NIH grant R01 DC02932 to Jan Edwards (lead PI), Munson (MPI), and Mary E. Beckman (MPI), and grants R03 DC012152 and R15 DC016426 to Jonathan Preston. Participant payment was funded by a University of Minnesota Undergraduate Opportunities Research Program grant to Chloe Wruck. Manuscript preparation was supported by NIH Grant R01 DC020959 to Jonathan Preston and NIH Grant T32 DC000046 to Matthew Goupell and Catherine Carr (Nina R. Benway, trainee). We thank the two anonymous reviewers for their thoughtful, thorough, and spirited comments on an earlier version of this manuscript.
Appendix A. Results of a Multinomial Logistic Mixed Model predicting gradient accuracy ratings (summarized in Table 4) from condition, sex assigned at birth (SAB), and age.
Model for correct vs. slightly incorrect
| Factor | Estimate | Standard Error | z-score | Pr(>∣z∣) |
|---|---|---|---|---|
| (Intercept) | −0.998 | 0.181 | −5.5 | <0.001 |
| SAB (AFAB¹ = 1, AMAB² = −1) | 0.048 | 0.066 | 0.7 | 0.467 |
| Condition: Sex Priming | −0.102 | 0.252 | −0.4 | 0.685 |
| Condition: Age Priming | −0.056 | 0.254 | −0.2 | 0.827 |
| Condition: Sex and Age Priming | −0.237 | 0.25 | −0.9 | 0.343 |
| Age in months (z-transformed) | −0.416 | 0.07 | −6.0 | <0.001 |
| SAB*Condition: Sex Priming | −0.069 | 0.092 | −0.8 | 0.449 |
| SAB*Condition: Age Priming | −0.197 | 0.093 | −2.1 | 0.034 |
| SAB*Condition: Sex and Age Priming | −0.198 | 0.091 | −2.2 | 0.030 |
| Age*Condition: Sex Priming | 0.061 | 0.098 | 0.6 | 0.531 |
| Age*Condition: Age Priming | −0.036 | 0.099 | −0.4 | 0.712 |
| Age*Condition: Sex and Age Priming | −0.045 | 0.097 | −0.5 | 0.645 |
¹ Assigned female at birth
² Assigned male at birth
Model for correct vs. moderately incorrect
| Factor | Estimate | Standard Error | z-score | Pr(>∣z∣) |
|---|---|---|---|---|
| (Intercept) | −1.545 | 0.193 | −8.0 | <0.001 |
| SAB (AFAB¹ = 1, AMAB² = −1) | 0.107 | 0.081 | 1.3 | 0.188 |
| Condition: Sex Priming | −0.001 | 0.267 | 0.0 | 0.997 |
| Condition: Age Priming | −0.063 | 0.271 | −0.2 | 0.817 |
| Condition: Sex and Age Priming | −0.446 | 0.269 | −1.7 | 0.097 |
| Age in months (z-transformed) | −0.625 | 0.084 | −7.5 | <0.001 |
| SAB*Condition: Sex Priming | −0.139 | 0.112 | −1.2 | 0.213 |
| SAB*Condition: Age Priming | −0.114 | 0.115 | −1.0 | 0.321 |
| SAB*Condition: Sex and Age Priming | −0.25 | 0.119 | −2.1 | 0.036 |
| Age*Condition: Sex Priming | 0.14 | 0.116 | 1.2 | 0.228 |
| Age*Condition: Age Priming | 0.164 | 0.119 | 1.4 | 0.166 |
| Age*Condition: Sex and Age Priming | −0.251 | 0.124 | −2.0 | 0.043 |
¹ Assigned female at birth
² Assigned male at birth
Model for correct vs. mostly incorrect
| Factor | Estimate | Standard Error | z-score | Pr(>∣z∣) |
|---|---|---|---|---|
| (Intercept) | −2.639 | 0.24 | −11.0 | <0.001 |
| SAB (AFAB¹ = 1, AMAB² = −1) | −0.078 | 0.119 | −0.7 | 0.511 |
| Condition: Sex Priming | 0.042 | 0.332 | 0.1 | 0.899 |
| Condition: Age Priming | 0.166 | 0.331 | 0.5 | 0.616 |
| Condition: Sex and Age Priming | −0.191 | 0.332 | −0.6 | 0.565 |
| Age in months (z-transformed) | −1.215 | 0.132 | −9.2 | <0.001 |
| SAB*Condition: Sex Priming | −0.028 | 0.164 | −0.2 | 0.862 |
| SAB*Condition: Age Priming | 0.000 | 0.164 | 0.0 | 0.998 |
| SAB*Condition: Sex and Age Priming | −0.387 | 0.165 | −2.3 | 0.019 |
| Age*Condition: Sex Priming | −0.046 | 0.183 | −0.3 | 0.802 |
| Age*Condition: Age Priming | 0.239 | 0.178 | 1.3 | 0.180 |
| Age*Condition: Sex and Age Priming | −0.107 | 0.187 | −0.6 | 0.568 |
Note. AFAB = assigned female at birth; AMAB = assigned male at birth.
Model for correct vs. completely incorrect
| Factor | Estimate | Standard Error | z-score | Pr(>∣z∣) |
|---|---|---|---|---|
| (Intercept) | −3.942 | 0.308 | −12.8 | <0.001 |
| SAB (AFAB = 1, AMAB = −1) | −1.499 | 0.121 | −12.4 | <0.001 |
| Condition: Sex Priming | −1.583 | 0.512 | −3.1 | 0.002 |
| Condition: Age Priming | −0.617 | 0.464 | −1.3 | 0.183 |
| Condition: Sex and Age Priming | −1.139 | 0.467 | −2.4 | 0.015 |
| Age in months (z-transformed) | −3.145 | 0.208 | −15.1 | <0.001 |
| SAB*Condition: Sex Priming | −0.379 | 0.177 | −2.1 | 0.032 |
| SAB*Condition: Age Priming | −0.310 | 0.176 | −1.8 | 0.078 |
| SAB*Condition: Sex and Age Priming | −0.522 | 0.179 | −2.9 | 0.004 |
| Age*Condition: Sex Priming | −1.259 | 0.358 | −3.5 | <0.001 |
| Age*Condition: Age Priming | −0.494 | 0.319 | −1.5 | 0.121 |
| Age*Condition: Sex and Age Priming | −0.704 | 0.321 | −2.2 | 0.028 |
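As a reading aid for the model tables, the following minimal sketch shows how a single row can be checked and interpreted. It assumes that the tabulated z-scores are Wald statistics (estimate divided by standard error), that the p-values are two-sided and based on the standard normal distribution, and that estimates are on the log-odds scale; none of this is stated explicitly in the tables. The function name `wald_summary` is hypothetical, and the sketch is not the analysis code used in the study.

```python
import math


def wald_summary(estimate, std_error):
    """Return (z, two-sided p, odds ratio) for one logit coefficient.

    Assumes a Wald test: z = estimate / standard error.
    """
    z = estimate / std_error
    # Two-sided p-value under the standard normal distribution:
    # p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    # Exponentiating a log-odds estimate gives an odds ratio.
    odds_ratio = math.exp(estimate)
    return z, p, odds_ratio


# Row "SAB*Condition: Sex and Age Priming" from the
# "correct vs. completely incorrect" model: estimate -0.522, SE 0.179.
z, p, odds_ratio = wald_summary(-0.522, 0.179)
print(f"z = {z:.1f}, p = {p:.3f}, odds ratio = {odds_ratio:.2f}")
```

Under these assumptions, the computed values agree with the tabulated z-score (−2.9) and p-value (0.004) for that row. Because the inputs are already rounded to three decimal places, small discrepancies in the last digit are possible for other rows.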
Note. AFAB = assigned female at birth; AMAB = assigned male at birth.
Footnotes
We use the terms AMAB and AFAB throughout this paper instead of male and female or boys and girls. We only use the terms boys and girls when describing studies in which (a) children’s gender was assessed directly, including by asking children themselves, and (b) children and parents were given response options other than boy/male, girl/female, and the option not to respond (i.e., studies which gave participants the opportunity to assert a gender other than one that is exclusively male or exclusively female).
References
- American Speech-Language-Hearing Association. (2022). Membership and Affiliation Profile. https://www.asha.org/research/memberdata/.
- Babel M, & Russell J (2015). Expectations and speech intelligibility. Journal of the Acoustical Society of America, 137, 2823–2833.
- Bates D, Maechler M, Bolker B, & Walker S (2015). lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1-9, https://CRAN.R-project.org/package=lme4.
- Benway NR, & Preston JL (2023a). Prospective validation of motor-based intervention with automated mispronunciation detection of rhotics in residual speech sound disorders. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Dublin, Ireland.
- Benway NR, Preston JL, Salekin A, Xiao Y, Sharma H, & McAllister T (2023b). Classifying Rhoticity of /r/ in Speech Sound Disorder using Age-and-Sex Normalized Formants. arXiv preprint arXiv:2305.16111.
- Bowen C (2023). Children's Speech Sound Disorders. John Wiley & Sons.
- Calder J (2019). From sissy to sickening: The indexical landscape of /s/ in SoMa, San Francisco. Journal of Linguistic Anthropology, 29, 332–358.
- Campbell H, Harel D, Hitchcock E, & McAllister Byun T (2017). Selecting an acoustic correlate for automated measurement of /r/ production in children. International Journal of Speech-Language Pathology, 20, 635–643.
- Elff M (2022). Package ‘mclogit’. https://cran.r-project.org/web/packages/mclogit/index.html.
- Fox RA, & Nissen SL (2005). Sex-Related Acoustic Changes in Voiceless English Fricatives. Journal of Speech, Language, and Hearing Research, 48, 753–765.
- Fuchs S, & Toda M (2010). Do differences in male versus female /s/ reflect biological or sociophonetic factors? In Turbulent Sounds: An Interdisciplinary Guide (pp. 281–302). Mouton de Gruyter. doi: 10.1515/9783110226584
- Fung P, Schertz J, & Johnson EK (2021). The development of gendered speech in children: Insights from adult L1 and L2 perceptions. Journal of the Acoustical Society of America Express Letters, 1, 014407. doi: 10.1121/10.0003322
- Goldman R, & Fristoe M (2015). Goldman-Fristoe Test of Articulation–Third Edition [Assessment instrument]. Pearson Assessments.
- Holliday JJ, Reidy PF, Beckman ME, & Edwards J (2015). Quantifying the robustness of the English sibilant fricative contrast in children. Journal of Speech, Language, and Hearing Research, 58, 622–637.
- Jongman A, Wayland R, & Wong S (2000). Acoustic characteristics of English fricatives. The Journal of the Acoustical Society of America, 108, 1252–1263.
- Kidd KM, Sequeira GM, Douglas C, Paglisotti T, Inwards-Breland DJ, Miller E, & Coulter RW (2021). Prevalence of gender-diverse youth in an urban school district. Pediatrics, 147, e2020049823. doi: 10.1542/peds.2020-049823
- Koeppe K (2021). The Emergence of Gendered Phonetic Variation in Preschool Children: Findings from a Longitudinal Study [Unpublished doctoral dissertation]. University of Minnesota.
- Krumpal I (2013). Determinants of social desirability bias in sensitive surveys: a literature review. Quality & Quantity, 47, 2025–2047.
- Kuznetsova A, Brockhoff PB, & Christensen RHB (2017). lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software, 82, 1–26. doi: 10.18637/jss.v082.i13
- Lewis BA, Freebairn L, Tag J, Ciesla AA, Iyengar SK, Stein CM, & Taylor HG (2015). Adolescent outcomes of children with early speech sound disorders with and without language impairment. American Journal of Speech-Language Pathology, 24, 150–163. doi: 10.1044/2014_AJSLP-14-0075
- Li F (2017). The development of gender-specific patterns in the production of voiceless sibilant fricatives in Mandarin Chinese. Linguistics, 55, 1021–1044. doi: 10.1515/ling-2017-0019
- Li F, Rendall D, Vasey PL, Kinsman M, Ward-Sutherland A, & Diano G (2016). The development of sex/gender-specific /s/ and its relationship to gender identity in children and adolescents. Journal of Phonetics, 57, 59–70. doi: 10.1016/j.wocn.2016.05.004
- Lieberman P (1986). Some aspects of dimorphism and human speech. Human Evolution, 1, 67–75. doi: 10.1007/BF02437286
- Maniwa K, Jongman A, & Wade T (2009). Acoustic characteristics of clearly spoken English fricatives. Journal of the Acoustical Society of America, 125, 3962–3973.
- McAllister Byun TM, Harel D, Halpin PF, & Szeredi D (2016). Deriving gradient measures of child speech from crowdsourced ratings. Journal of Communication Disorders, 64, 91–102.
- McGowan KB (2015). Social expectation improves speech perception in noise. Language and Speech, 58, 502–521.
- Meyer MK, & Munson B (2021). Clinical experience and categorical perception of children's speech. International Journal of Language & Communication Disorders, 56, 374–388. doi: 10.1111/1460-6984.12610
- Mulliner S (2019). A Mixed Methods Analysis of Corpus Data from Reddit Discussions of “Gay Voice” (Master's thesis, Portland State University).
- Munson B (2007). The acoustic correlates of perceived sexual orientation, perceived masculinity, and perceived femininity. Language and Speech, 50, 125–142.
- Munson B (2011). The Influence of Actual and Imputed Talker Gender on Fricative Perception, Revisited. The Journal of the Acoustical Society of America, 130, 2631–2634.
- Munson B (2015). The influence of /s/ variation on perceived gender typicality in children’s speech. Proceedings of the International Congress on Phonetic Sciences. Glasgow, Scotland: University of Glasgow.
- Munson B, & Babel M (2019). The phonetics of sex and gender. In Katz W & Assmann P (Eds.), Routledge Handbook of Phonetics (pp. 499–525). doi: 10.4324/9780429056253
- Munson B, Crocker L, Pierrehumbert J, Owen-Anderson A, & Zucker K (2015). Gender Typicality in Children's Speech: A comparison of the Speech of Boys with and without Gender Identity Disorder. The Journal of the Acoustical Society of America, 137, 1995–2003. doi: 10.1121/1.4916202
- Munson B, Edwards J, Schellinger SK, Beckman ME, & Meyer MK (2010). Deconstructing phonetic transcription: Covert contrast, perceptual bias, and an extraterrestrial view of Vox Humana. Clinical Linguistics and Phonetics, 24, 245–260.
- Munson B, Jefferson SV, & McDonald EC (2006). The influence of perceived sexual orientation on fricative identification. The Journal of the Acoustical Society of America, 119, 2427–2437.
- Munson B, Johnson JM, & Edwards J (2012). The role of experience in the perception of phonetic detail in children's speech: a comparison between speech-language pathologists and clinically untrained listeners. American Journal of Speech-Language Pathology, 21, 124–139. doi: 10.1044/1058-0360(2011/11-0009)
- Munson B, Lackas N, & Koeppe K (2022). Individual differences in the development of gendered speech in preschool children: evidence from a longitudinal study. Journal of Speech, Language, and Hearing Research, 65, 1311–1330.
- Munson B, Logerquist M, Kim H, Martell A, & Edwards J (2021). Does early phonetic differentiation predict later phonetic development? Evidence from a longitudinal study of /ɹ/ development in preschool children. Journal of Speech, Language, and Hearing Research, 64, 2417–2437.
- Munson B, McDonald EC, DeBoe NL, & White AR (2006). Acoustic and perceptual bases of judgments of women and men's sexual orientation from read speech. Journal of Phonetics, 34, 202–240.
- Munson B, Ryherd K, & Kemper S (2017). Implicit and explicit gender priming in English lingual sibilant fricative perception. Linguistics, 55, 1073–1107.
- Nightingale C, Swartz M, Ramig LO, & McAllister T (2020). Using crowdsourced listeners' ratings to measure speech changes in hypokinetic dysarthria: A proof-of-concept study. American Journal of Speech-Language Pathology, 29, 873–882.
- Nissen SL, & Fox RA (2005). Acoustic and spectral characteristics of young children’s fricative productions: A developmental perspective. The Journal of the Acoustical Society of America, 118, 2570–2578.
- Olezeski CL, Pariseau EM, Bamatter WP, & Tishelman AC (2020). Assessing gender in young children: Constructs and considerations. Psychology of Sexual Orientation and Gender Diversity, 7, 293.
- Perry TL, Ohde RN, & Ashmead DH (2001). The acoustic bases for gender identification from children's voices. The Journal of the Acoustical Society of America, 109, 2988–2998. doi: 10.1121/1.1370525
- Preston JL, Leece MC, & Maas E (2016). Intensive treatment with ultrasound visual feedback for speech sound errors in childhood apraxia. Frontiers in Human Neuroscience, 10, 1–9. doi: 10.3389/fnhum.2016.00440
- Preston JL, McCabe P, Rivera-Campos A, Whittle JL, Landry E, & Maas E (2014). Ultrasound visual feedback treatment and practice variability for residual speech sound errors. Journal of Speech, Language, and Hearing Research, 57, 2102–2115. doi: 10.1044/2014_JSLHR-S-14-0031
- Roehl JM, & Harland DJ (2022). Imposter participants: overcoming methodological challenges related to balancing participant privacy with data quality when using online recruitment and data collection. The Qualitative Report, 27, 2469–2485.
- Shadle CH (2023). Alternatives to moments for characterizing fricatives: Reconsidering Forrest et al. (1988). Journal of the Acoustical Society of America, 153, 1412–1426.
- Shimko A, Redmond S, Ludlow A, & Ash A (2020). Exploring gender as a potential source of bias in adult judgments of children with specific language impairment and attention-deficit/hyperactivity disorder. Journal of Communication Disorders, 85, 105910.
- Shriberg LD, Gruber FA, & Kwiatkowski J (1994). Developmental phonological disorders III: Long-term speech-sound normalization. Journal of Speech, Language, and Hearing Research, 37, 1151–1177.
- Shriberg L, Austin D, Lewis B, McSweeny J, & Wilson D (1997). The percentage of consonants correct (PCC) metric: Extensions and reliability data. Journal of Speech, Language, and Hearing Research, 40, 708–722.
- Smit AB, Hand L, Freilinger JJ, Bernthal JE, & Bird A (1990). The Iowa articulation norms project and its Nebraska replication. Journal of Speech and Hearing Disorders, 55, 779–798.
- Strand E, & Johnson K (1996). Gradient and visual speaker normalization in the perception of fricatives. In Gibbon D (Ed.) Natural Language Processing and Speech Technology. Results of the 3rd KOVENS Conference, Bielefeld, October, 1996. Berlin: Mouton de Gruyter.
- Sumner M, Kim SK, King E, & McGowan KB (2014). The socially weighted encoding of spoken words: A dual-route approach to speech perception. Frontiers in Psychology, 4, 1015.
- Tripp A, Feldman NH, & Idsardi WJ (2021). Social inference may guide early lexical learning. Frontiers in Psychology, 12, 645247.
- Tripp A, & Munson B (2022). Perceiving gender while perceiving language: Integrating psycholinguistics and gender theory. Wiley Interdisciplinary Reviews: Cognitive Science, e1583. doi: 10.1002/wcs.1583
- Tripp A, & Munson B (2023). Acknowledging language variation and its power: Keys to justice and equity in applied psycholinguistics. Applied Psycholinguistics, 44, 495–513.
- Winn MB, Rhone AE, Chatterjee M, & Idsardi WJ (2013). The use of auditory and visual context in speech perception by listeners with normal hearing and listeners with cochlear implants. Frontiers in Psychology, 4, 824.
- Wren Y, Miller LL, Peters TJ, Emond A, & Roulstone S (2016). Prevalence and predictors of persistent speech sound disorder at eight years old: Findings from a population cohort study. Journal of Speech, Language, and Hearing Research, 59, 647–673.
- Zimman L (2017). Variability in /s/ among transgender speakers: Evidence for a socially grounded account of gender and sibilants. Linguistics, 55, 993–1019. doi: 10.1515/ling-2017-0018
