Gradient Perception of Children’s Productions of /s/ and /θ/: A Comparative Study of Rating Methods

Sarah K Schellinger; Benjamin Munson; Jan Edwards

doi:10.1080/02699206.2016.1205665

Author manuscript; available in PMC: 2017 Jan 1.

Published in final edited form as: Clin Linguist Phon. 2016 Aug 23;31(1):80–103. doi: 10.1080/02699206.2016.1205665

PMCID: PMC5200952 NIHMSID: NIHMS833009 PMID: 27552446

Abstract

Past studies have shown incontrovertible evidence for the existence of covert contrasts in children’s speech, that is, differences between target productions that are nonetheless transcribed with the same phonetic symbol. Moreover, there is evidence that these are relevant to forming prognoses and tracking progress in children with speech sound disorder. A challenge remains to find the most efficient and reliable methods for assessing covert contrasts. This study investigates how readily listeners can identify covert contrasts in children’s speech when using a continuous rating scale in the form of a visual analog scale (VAS) to denote children’s productions. Individual listeners’ VAS responses were found to correlate statistically significantly with a variety of continuous measures of children’s production accuracy, including judgments of binary accuracy pooled over a large set of listeners. These findings reinforce the growing body of evidence that VAS judgments are potentially useful clinical measures of covert contrast.

Keywords: Fricative, Children, Speech Perception, Assessment, Covert Contrast

Children learn to speak like adults in a remarkably short period of time. Numerous cross-sectional and longitudinal studies have shown that by the age of only 5 or 6 years, children produce most or all of the sounds of their language correctly, as judged by phonetic transcriptions made by experienced transcribers (e.g., Smit, Hand, Freilinger, Bernthal, & Bird, 1990). From the first vocalizations to the point at which children’s productions are transcribed as completely accurate, children are learning contrasts among linguistically relevant units. The earliest contrasts that children learn may be as simple as the contrast between different syllable shapes. As children later produce more adult-like speech, they are able to produce even such fine-grained contrasts as the contrast between two highly similar sounds.

Considerable within- and between-child variability exists in the production of sounds, from the earliest transcribable vocalizations to the achievement of adult-like speech. This variability can be seen in phonetic transcriptions of children’s speech. For example, while each speech sound in adult speech would likely be transcribed as a correct production, transcriptions of children’s speech commonly contain notations of substitutions, distorted productions, and deletions. Furthermore, transcriptions may reveal that children’s productions of a given speech sound vary across productions (e.g., Macrae, Tyler, & Lewis, 2014; Macrae & Sosa, 2015). For example, a child might be transcribed as correctly producing a given speech sound on 50% of attempts, while the remainder of attempts are transcribed as errors.

However, there is much variability in children’s speech that is less easily captured using phonetic transcription. Children’s speech sound productions vary considerably in terms of their fine-grained, articulatory-acoustic properties. Even if a child is perceived as producing a clear example of a given sound, the production may still vary acoustically from a typical adult production. This kind of variability can be viewed as existing at the phonetic or subphonemic level, because it encompasses variability that can exist within phoneme categories. Research on covert contrast documents this articulatory-acoustic variability in child speech. Covert contrast occurs when significant articulatory differences are present between two phonemes in a child’s speech, but both phonemes are transcribed with the same symbol. Because both variants fall within a single adult phonemic category, transcribers denote the two variants with the same phoneme symbol. Covert contrast has been found in the speech of typically developing children and children with phonological disorders for contrasts involving voicing and place of articulation for both stops and fricatives (e.g., Baum & McNutt, 1990; Forrest, Weismer, Elbert, & Dinnsen, 1994; Forrest, Weismer, Hodge, Dinnsen, & Elbert, 1990; Gierut & Dinnsen, 1986; Hewlett, 1988; Li, Edwards & Beckman, 2009; Macken & Barton, 1980; Maxwell & Weismer, 1982; Scobbie, Gibbon, Hardcastle, & Fletcher, 2000). For example, Baum and McNutt (1990) compared children’s correct productions of /θ/ with correct productions of /s/ and frontal misarticulations of /s/. Although frontal misarticulations of /s/ are commonly described as substitutions of /θ/, acoustic analyses revealed significant differences between frontal misarticulations and correct productions of both /s/ and /θ/.

Given that the role of the Speech Language Pathologist (SLP) is to accurately and thoroughly characterize children’s speech sound productions, it is clear that SLPs would benefit from assessment methods that would allow them to describe variability at the phoneme level, as well as differences in phonetic detail within phonemes. Most of the standardized assessments of children’s speech use phonetic transcription. Phonetic transcription involves denoting a clinician’s auditory-perceptual judgment of a child’s speech sound production with a finite set of symbols. This allows for a coarse-grained description of the production using phonetic symbols, which are then used to make binary judgments of correct or incorrect. However, as discussed above, this method fails to characterize more subtle phonetic variability within children’s speech.

Baum and McNutt’s (1990) findings on the acoustics of frontal misarticulations illustrate this point. Using transcription alone, children’s frontal misarticulations of /s/ might be transcribed as a substitution error and denoted with the phonetic symbol [θ]. Similarly, correct productions of /θ/ would also be transcribed with [θ]. However, Baum and McNutt found that there were systematic acoustic differences between these two types of [θ]. Labeling the frontal misarticulation as a substitution error does provide a coarse-grained description of the child’s production. In addition, it highlights the potential linguistic consequences of such errors (e.g., an inability to convey meaning for words containing these sounds resulting in reduced speech intelligibility and, consequently, less-effective communication). However, more fine-grained information is lost—namely the fact that the children were, in fact, making a systematic contrast between target /s/ and target /θ/. This highlights the limitations that are imposed when phonetic transcription is used as the sole method for denoting children’s speech productions. When children’s productions vary subtly in acoustic-phonetic properties from a prototypical adult form, the use of transcription may obscure potentially important information, such as that the children are capable of producing a contrast between two phonemes. In other words, children may perceive that two phonemes are different and have begun to produce them differently at a sub-phonemic level, but the difference between them is not large enough to be transcribed as such when coarse-grained systems are used. This distinction has important implications for the assessment and treatment of children with disorders in speech-sound production, as it has been shown that children who exhibit covert contrast progress more quickly in therapy than children without covert contrast (Tyler, Figurski & Langdale, 1993). 
Moreover, a fine-grained system might allow individuals to document within-category improvement in phoneme production in children receiving speech therapy in a way that is impossible with phonetic transcription.

Given this limitation, an obvious question that arises is how we can improve upon current transcription methods to better characterize children’s speech, such that information on fine phonetic detail is not lost. Some researchers have suggested that one way to improve transcription is to distinguish between intermediate productions (which are perceived as in between two phonemes) and correct productions or clear substitutions (e.g., Stoel-Gammon, 2001; Edwards & Beckman, 2008). To the extent that listeners can perceive these intermediate sounds, this would allow coding of subtle distinctions that may be lost using the standard transcription process. While this approach is promising, it is not in wide use, and little research exists on the reliability and validity of using an intermediate category during transcription. Moreover, this method only increases the number of categories that can be denoted from two to three.

A second possibility is to pool perceptual judgments of a large set of naïve listeners together to examine fine phonetic detail. This possibility was explored by Li, Munson, Edwards, Yoneyama, and Hall (2011) and by Munson and Urberg Carlson (2015). Munson and Urberg Carlson (2015) examined adults’ perception of children’s productions of target /s/ and /ʃ/ in three different experiments. They found that the proportion of listeners who judged a token to be /s/ was well predicted by acoustic characteristics of the fricatives being rated. Figure 1 plots data reported by Munson and Urberg Carlson. It shows that the relationship between the percentage of listeners who judged a token to be /s/ and centroid frequency (an acoustic measure that distinguishes between /s/ and /ʃ/, Jongman, Wayland, and Wong, 2000) was relatively linear. Average judgments can therefore potentially be used to measure within-category differences in children’s productions. Average judgments of naïve listeners have two additional benefits. First, they are ecologically valid, as they presumably predict the cumulative feedback that children receive from members of the language community during social interactions, a claim that is developed in more detail in Julien and Munson (2012). Second, they can be used as continuous measures when a suitable acoustic or articulatory measure does not exist. The acoustic characteristics of obstruent consonants are still actively debated, and for some sound contrasts no one acoustic measure differentiates between endpoints. For example, the contrast between /s/ and /θ/ is reflected in the spectral centroid, the spread of energy in the spectrum, and the intensity of the fricative (Jongman et al., 2000). However, average judgments have the obvious disadvantage of requiring a large group of individuals to provide judgments, rather than a single listener.

Figure 1. Scatterplot showing the association between the proportion of listeners who judge a sound to be /s/ and the centroid frequency for a 40 ms interval of frication centered at the fricative midpoint. These data are a re-analysis of responses reported in Munson and Urberg Carlson (2015).

The final possibility is to use an assessment tool that is not categorical in nature and which can be made by individual listeners. While transcription requires a listener to listen to continuous acoustic-phonetic signals and then assign the resulting percept to one of a finite number of categories (e.g. the “s” sound versus the “th” sound), other tasks allow for a graded response. Using measures that allow for continuous responses, a number of experimental studies have shown that listeners can indeed perceive subtle within-category acoustic differences for obstruents in certain tasks (e.g., Massaro & Cohen, 1983; McMurray, Tanenhaus, & Aslin, 2002; Carney, Widin, & Viemeister, 1977; Pisoni & Tash, 1974). Indeed, work by Toscano, McMurray, Dennhardt and Luck (2010) shows that these graded responses are reflected in electrophysiological responses both of auditory encoding and phonemic categorization.

One type of tool that allows for continuous responses for the assessment of children’s speech sounds is visual analog scaling (VAS). VAS is often used in the assessment of complex, multidimensional percepts, such as the perception of pain in clinical medical settings. There is considerable research on the reliability and validity of this measure in the pain literature (e.g., Price, McGrath, Rafii & Buckingham, 1983; Bijur, Silver, & Gallagher, 2001; Gallagher, Liebman & Bijur, 2001). It is also used widely in the study of voice disorders and is part of one standardized voice assessment, the CAPE-V (Kempster, Gerratt, Verdolini Abbott, Barkmeier-Kraemer, & Hillman, 2009). VAS has also been used to study adults’ perception of children’s speech. In one such procedure, Munson, Johnson, and Edwards (2012) presented participants with a horizontal line with endpoints anchored with the text “the ‘s’ sound” at one end and “the ‘th’ sound” at the other end. Listeners were presented with children’s productions of target /s/ and target /θ/ and were asked to use a mouse to click a point on the line where they perceived that a given sound production fell along the continuum. Munson et al. found a strong correlation between the click location for individual stimuli and acoustic measures of the stimuli. This suggests that not only were listeners able to perceive subphonemic variation in the productions, but also that their perceptions were related to the actual physical properties of the speech signal. Subsequent research by Munson and Urberg Carlson (2015) has shown that VAS is psychometrically superior to two other continuous measures of within-category variation, Likert scale judgments and direct magnitude estimates of category goodness. Julien and Munson (2012) showed that VAS ratings of children’s /s/ and /ʃ/ productions are correlated with the centroid frequency of the fricative. Centroid frequency distinguishes between productions of /s/ and /ʃ/.

Further evidence of the utility of VAS in assessing within-category variation comes from Munson, Edwards, Schellinger, Beckman, and Meyer (2010). Munson et al. examined the relationship between VAS ratings and transcription categories. Children’s productions of /s/ and /θ/ were first transcribed by a trained phonetician using Stoel-Gammon’s (2001) suggestion to code intermediate productions. Six transcription categories were used: correct /s/, clear substitution of [s] for /θ/, intermediate but closer to /s/, intermediate but closer to /θ/, clear substitution of [θ] for /s/, and correct /θ/. VAS judgments were then elicited from a set of listeners using a line with endpoints representing /s/ and /θ/. Participants listened to children’s speech sounds and rated where they perceived the sound to fall along the line. The authors reported that mean VAS ratings were significantly different for each of the transcription categories. Furthermore, the pattern was exactly as expected: correct /s/ transcriptions had the highest mean click location (i.e., were closest to the “s” end of the line), followed in order by substitutions of [s] for /θ/, intermediate productions closer to /s/, intermediate productions closer to /θ/, and substitutions of [θ] for /s/. Finally, correct productions of /θ/ had the lowest mean VAS rating (i.e., were closest to the “th” end of the line). Subsequent research has used VAS to evaluate a variety of other speech contrasts, most notably the contrast between correct and incorrect /r/ productions (McAllister Byun, Halpin, & Harrell, 2015; McAllister Byun, Halpin, & Szeredi, 2015) and the contrast between /t/ and /k/ (Strombergsson, Salvi, & House, 2015).

The results of Munson et al. (2010) suggest that VAS ratings differ as a function of subtle subphonemic differences across productions. However, these results are based on VAS ratings averaged across listeners for all the tokens in a given transcription category. Individual tokens within a transcription category vary for meaningful reasons (i.e., because of specific characteristics of the talkers who produced them, or the words in which they were produced). Hence, the analyses used by Munson et al. reflect what Clark (1973) referred to as the “language as fixed-effect fallacy,” as they do not model the effects of token-level variation on performance. Moreover, these results do not tell us whether individual listeners’ VAS judgments for specific items also serve as a continuous measure of subtle subphonemic differences. In clinical practice, it is generally a single clinician who assesses a child’s speech during the assessment process. Therefore, if a task such as VAS is to have clinical utility, it is necessary to determine whether individual listeners’ ratings of individual tokens, rather than just averages across transcription categories, continuously track subphonemic detail.

The current study was designed as a follow-up to Munson et al. (2010) to address this limitation. Specifically, we wanted to determine whether individual listeners’ VAS ratings of children’s productions of individual speech tokens reflected continuous variation in category goodness. To accomplish this goal, we developed two primary research questions.

First, we asked whether individual listeners used the entire VAS line to report their perceptions, or whether they instead tended to simply respond by clicking at discrete locations along the line, such as endpoints. To evaluate this, we first visually examined plots depicting the distributions of individual listeners’ VAS ratings. Next, we conducted two statistical analyses. The first examined the extent to which individual listeners’ ratings differentiated among six types of transcriptions made using the very fine-grained transcription system described by Munson et al. (2010). Next, we used mixture models to decompose the listeners’ distribution of responses into different underlying distributions.

Our second primary research question relates to the concurrent validity of VAS. Specifically, we asked whether individual listeners’ VAS ratings could be predicted by a continuous measure of how /s/-like or /θ/-like each stimulus was, beyond how well they could be predicted by a binary categorization of the sound as /s/ or /θ/. In this study, the continuous measure was binary judgments of whether the sound was /s/ or /θ/, averaged across multiple listeners, to which we refer henceforth as community identification judgments. If community judgments are more predictive of VAS judgments than are binary judgments provided by a single transcriber, it would suggest that use of VAS may be a valid way to obtain information on fine phonetic detail in children’s speech.
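The community identification measure described above can be illustrated with a minimal sketch. It assumes each binary judgment is stored as a (token, label) pair with the labels coded as "s" and "th"; that coding, the function name, and the example data are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def community_identification(judgments):
    """Return, for each token, the proportion of listeners who labeled it 's'.

    `judgments` is an iterable of (token_id, label) pairs, where label is
    's' or 'th'. This coding is an assumption made for illustration only.
    """
    counts = defaultdict(lambda: [0, 0])  # token_id -> [n_s, n_total]
    for token_id, label in judgments:
        if label not in ("s", "th"):
            raise ValueError(f"unexpected label: {label!r}")
        counts[token_id][1] += 1
        if label == "s":
            counts[token_id][0] += 1
    return {tok: n_s / n_total for tok, (n_s, n_total) in counts.items()}


# Hypothetical judgments: four listeners for tok1, two for tok2.
example = [
    ("tok1", "s"), ("tok1", "s"), ("tok1", "th"), ("tok1", "s"),
    ("tok2", "th"), ("tok2", "th"),
]
scores = community_identification(example)
```

Pooling in this way turns a set of categorical responses into a value between 0 and 1 for each token, which is what allows it to serve as a continuous predictor alongside a single transcriber's binary label.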

To answer these research questions, we conducted two experiments on adult listeners’ perception of children’s productions of two speech sounds: /s/ and /θ/. We elicited perceptual judgments from one group of listeners using VAS and from another group of listeners using binary identification judgments. We chose the /s/ and /θ/ sounds for several reasons. First, both are typically mastered relatively late in development (e.g., Sander, 1972; Fudala & Reynolds, 1986; Smit et al., 1990). Additionally, children are often observed to produce /θ/-like substitutions for /s/ (McGlone & Proffitt, 1973; Smit et al., 1990). Furthermore, as discussed earlier, Baum and McNutt (1990) documented covert contrast in children’s productions of these sounds. Therefore, we felt confident that using these speech sounds would ensure variability in terms of the fine phonetic detail contained in the speech productions. Furthermore, we believed that because of this phonetic variability, it was likely that some of these productions might be perceived as more “ambiguous” or “intermediate.” Finally, there is no one acoustic measure that differentiates between /s/ and /θ/ in English. As described in Jongman et al. (2000) and below, a combination of acoustic measures is needed to distinguish productions of these sounds. A given sound may be intermediate because of intermediacy in any of these parameters. Moreover, individual listeners might differ in their weighting of these parameters. These two facts mean that a study of individual differences in the relationship between acoustics and perception of the /s/-/θ/ contrast would be methodologically very difficult. This makes the community identification judgments of /s/ and /θ/ a particularly appropriate measure of continuous variation.

Methods

VAS Task

This study used the same set of VAS data described in Munson et al. (2010). Given that Munson et al. only provided a cursory explanation of the methods used to obtain these data, a more detailed description will be presented here.

Participants

Twenty-one adult listeners participated in the VAS rating task. All were living in Minneapolis, MN, were native speakers of North American English, and were between the ages of 18 and 45. Participants were recruited by referral or by postings at the University of Minnesota and in the surrounding community. According to self-report, none of the participants had a history of speech, language, or hearing disorders. Each participant provided informed consent and was compensated for his or her time.

Stimuli

For this experiment, 200 word-initial CV syllables beginning with /s/ and /θ/ were excised from single-word productions of familiar words (such as sofa) and non-words (such as /sʌpʰoʊn/), which were elicited from typically developing two- to five-year-old native English speakers using a word repetition task. These words came from a larger study (Edwards & Beckman, 2008) on obstruent development across several languages. Full details of the elicitation protocol, as well as a description of the effects of lexicality, word length, and prosodic structure on consonant accuracy, can be found in Edwards and Beckman (2008). Briefly, each child participated in a word repetition task that was conducted in a quiet room at his or her preschool. Each word or non-word stimulus began with a single obstruent consonant, followed by a monophthong vowel, and was between one and four syllables long. Words and non-words were presented using a laptop computer, and a corresponding picture was shown on the screen. For non-words, pictures of unfamiliar objects without commonly known names were used. Children were asked to repeat each word or non-word, and their productions were recorded using a head-mounted microphone.

All of the words and non-words were transcribed by a native speaker of English (the first author). During the transcription process, the transcriber first made a binary judgment as to the accuracy of the /s/ or /θ/ production. The transcriber then broadly transcribed it using IPA phonetic symbols. Additionally, using Stoel-Gammon’s (2001) suggestion, the transcriber identified and coded productions that she perceived as intermediate between /s/ and /θ/, differentiating between those that were intermediate but closer to /s/, and intermediate but closer to /θ/. The CV syllables that were selected for this experiment were those transcribed as containing one of the following: a correct /s/, a correct /θ/, an [s] for /θ/ substitution, a [θ] for /s/ substitution, or a sound that was intermediate between /s/ and /θ/. The latter category was subdivided into intermediate productions deemed to be closer to /s/ (henceforth [s:θ]) and ones deemed to be closer to /θ/ (henceforth [θ:s]). The intention behind using these specific transcription categories was to develop a set of stimuli that included a great deal of natural variation along the continuum from /s/ to /θ/.

To select the specific CVs to use in this study, we first identified every instance of a child’s production in the corpus created in the Edwards and Beckman (2008) study that fell into one of these transcription categories. Next, we eliminated those productions for which the vowel was transcribed as being produced incorrectly. The first author then listened to each of the remaining CV stimuli. Productions for which the presence of background noise in the recording was felt to obscure either the consonant or vowel sound were eliminated. Additionally, if the first author judged that a production was not a good exemplar of its transcription category, it was omitted. Finally, the stimuli were balanced such that approximately half were transcribed as /s/ and half were transcribed as /θ/. This was done to avoid the ‘set effects’ that are commonly observed in speech perception experiments (Keating, Mikos, & Ganong, 1981). In addition, for each transcription category, vowel context and the speaker’s age were balanced as well as possible. Productions were excluded if they would upset this balance. Following these steps, 200 CV productions were included, produced by a total of 43 children (10 two-year-olds, 11 three-year-olds, 13 four-year-olds, and 9 five-year-olds). Of these children, 21 were female and 22 were male. Each CV syllable was normalized for amplitude. The CV syllables contained the initial fricative and a 150 ms vocalic portion. Descriptions of the stimuli can be found in Tables 1 and 2.

Table 1.

Stimuli Inventory: Total number of Consonant-Vowel syllables by age, vowel context, and transcription category

Following Vowel	[θ] substitutions for /s/				Correct /θ/				Intermediate Tokens (but slightly closer to /θ/)				Total
Following Vowel	2;0- 2;11	3;0- 3;11	4;0- 4;11	5;0- 5;11	2;0- 2;11	3;0- 3;11	4;0- 4;11	5;0- 5;11	2;0- 2;11	3;0- 3;11	4;0- 4;11	5;0- 5;11	Total
(/i/ and /ɪ/)	1	1	2	0	0	4	13	14	2	2	4	1	44
(/e/ and /ɛ/)	0	4	1	0	0	0	0	0	1	2	1	1	10
(/ɑ/)	4	4	0	1	0	1	1	2	1	2	2	2	20
(/o/)	2	1	0	0	0	0	0	0	1	1	2	0	7
(/u/ and /ʊ/)	0	3	0	0	0	2	5	4	1	1	2	1	19

	Total: 24				Total: 46				Total: 30				100

Note: This table displays the 50 percent of the CV stimuli that were transcribed as /θ/ or “more /θ/-like” (for intermediate tokens).

Table 2.

Stimuli Inventory: Total number of Consonant-Vowel syllables by age, vowel context, and transcription category

Following Vowel	[s] substitutions for /θ/				Correct /s/				Intermediate Tokens (but slightly closer to /s/)				Total
Following Vowel	2;0- 2;11	3;0- 3;11	4;0- 4;11	5;0- 5;11	2;0- 2;11	3;0- 3;11	4;0- 4;11	5;0- 5;11	2;0- 2;11	3;0- 3;11	4;0- 4;11	5;0- 5;11	Total
(/i/ and /ɪ/)	2	3	4	3	3	2	4	5	1	2	4	1	34
(/e/ and /ɛ/)	0	0	0	0	2	2	4	2	1	1	1	2	15
(/ɑ/)	1	2	1	0	0	2	2	4	0	2	2	0	16
(/o/)	0	0	1	0	2	1	3	2	2	2	2	0	15
(/u/ and /ʊ/)	1	2	4	0	1	2	4	3	1	2	0	0	20

	Total: 24				Total: 50				Total: 26				100

Note: This table displays the 50 percent of the CV stimuli that were transcribed as /s/ or “more /s/-like” (for intermediate tokens).

The stimuli were subjected to a set of five acoustic measures to ensure that they contained ample variation in the acoustic features that are relevant for perception of /s/ and /θ/. The first (m1) and second (m2) spectral moments from a 40 ms interval of frication centered at fricative midpoint were calculated. The first of these distinguishes /s/ from /ʃ/, and also distinguishes between tokens of /s/ that listeners rate as more accurate and those they rate as less accurate (Holliday, Reidy, Edwards, & Beckman, 2015). The second of these distinguishes between /s/ and /θ/ (Jongman, Wayland, & Wong, 2000). The frequency of the second formant of the following vowel at onset (onset F2) was also calculated. This has been shown to differentiate between /s/ and /ʃ/. The duration of the fricatives was also logged, because duration of frication noise has been shown to vary with place of articulation (You, 1979) and to influence perceptual identification judgments (Jongman, 1989). Finally, we logged the intensity ratio between the fricative and the vowel. This was calculated by taking the intensity at fricative midpoint in dB IL and subtracting the intensity at vowel midpoint in dB IL (relative intensity). Because dB is a logarithmic scale, the ratio is calculated as a difference score. The intensity of /θ/ is lower than that of /s/.
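To make the first two spectral moments and the relative intensity measure concrete, the following is a minimal sketch using only the Python standard library. Several details are assumptions that the text does not specify: the Hamming window, the use of the power spectrum as the weighting, the naive DFT, and all function names. The synthetic sine wave stands in for a 40 ms frication interval purely as a sanity check.

```python
import cmath
import math


def spectral_moments(samples, sample_rate):
    """Return (m1, m2): spectral centroid and standard deviation, in Hz.

    Assumptions (not from the source): Hamming window, power-spectrum
    weighting, and a naive DFT over the non-negative frequencies.
    """
    n = len(samples)
    windowed = [
        s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(samples)
    ]
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        freqs.append(k * sample_rate / n)
        power.append(abs(coeff) ** 2)
    total = sum(power)
    m1 = sum(f * p for f, p in zip(freqs, power)) / total
    variance = sum((f - m1) ** 2 * p for f, p in zip(freqs, power)) / total
    return m1, math.sqrt(variance)


def relative_intensity(fricative_db, vowel_db):
    """Intensity ratio expressed as a difference of two dB values."""
    return fricative_db - vowel_db


# Sanity check on a synthetic 40 ms, 2000 Hz sine sampled at 8000 Hz.
# A pure tone should give a centroid near its own frequency.
rate = 8000
window = [math.sin(2 * math.pi * 2000 * i / rate) for i in range(320)]
m1, m2 = spectral_moments(window, rate)
```

Because decibels are logarithmic, subtracting the vowel value from the fricative value is equivalent to taking the ratio of the underlying intensities, which is why `relative_intensity` is a simple difference.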

The five acoustic measures were used as predictors in two stepwise linear discriminant function analyses (DFA). The first DFA predicted whether the sound was transcribed as [θ] (including [θ:s] productions) or [s] (including [s:θ] productions). The second DFA predicted membership in one of the six transcription categories. In both of these DFAs, three variables significantly improved categorization rates: m1, m2, and relative intensity. Figures 2 and 3 plot the 200 stimuli in these three dimensions. The symbol size in these figures corresponds to the results of the binary categorization study, described in detail below. As these figures show, the sounds varied substantially in the three relevant acoustic dimensions, in a direction predicted by previous research: sounds were more likely to be labeled as /s/ if they had a higher m1, a lower m2, or lower relative intensity (i.e., a more-intense fricative). These figures suggest that the target stimuli are likely to vary widely in terms of how good an example of target /s/ or target /θ/ they are.

Figure 2. Scatterplot of the relationship between centroid (m1) frequency in Hertz and standard deviation (m2) of frequency in Hertz for a 40 ms interval of frication centered at fricative midpoint for the stimuli used in this study. The shading reflects the proportion of listeners in this study who identified the sound as /s/.

Figure 3. Scatterplot of the relationship between relative intensity (relInt, in dB: the peak intensity in the fricative relative to the peak intensity in the following vowel) and standard deviation (m2) of frequency in Hertz for a 40 ms interval of frication centered at fricative midpoint for the stimuli used in this study. The shading reflects the proportion of listeners in this study who identified the sound as /s/.

Procedure

Each participant was tested individually in a sound-proof booth, seated in front of a computer monitor. Each of the 200 CV stimuli was played over headphones in random order using E-Prime software (Schneider, Eschmann, & Zuccolotto, 2002). Listeners were informed that they would hear consonant-vowel syllables taken from words that were supposed to start with “s” or “th.” Instructions gave examples of words beginning with /θ/ to cue them that they were to listen for the voiceless variant, and not for /ð/. The listeners were asked to rate the consonant in each CV syllable using a visual analog scale (shown in Figure 4) that was presented on the computer monitor. Listeners were explicitly instructed to click the location along the line that corresponded with the percept of ‘proximity’ to “s” or “th” and were encouraged to use the entire line. To assess intra-rater reliability, 20 items were repeated. The second ratings of these items were used in the analysis of reliability only.

Figure 4. Visual Analog Scale used in the VAS rating task.

Data Analysis

The click location for each stimulus trial was analyzed in terms of the number of pixels along the x-dimension of the visual analog line. The left end of the VAS line (corresponding to “the ‘s’ sound”) was denoted as the zero point, and the right end of the VAS line corresponded to 535 pixels. Any clicks that fell off the line in the horizontal dimension were assigned these minimum and maximum values (i.e., clicks left of the line were assigned 0 and clicks right of the line were assigned 535). Such clicks comprised 1.4% of the total number of tokens (n=60). All responses were within ±25 pixels of the line in the y-dimension. The location of the click on the y-dimension was not systematically related to transcription category.

For ease of interpretation, click locations for each trial were then transformed into a metric indicating their location as a proportion of the line. The click location for each trial was divided by the maximum value of 535, resulting in click location values that ranged from zero to one. These were then inverted so that click locations closer to zero correspond to percepts more like “the ‘th’ sound” and click locations closer to one correspond to percepts more like “the ‘s’ sound”. The inversion was done so that the VAS judgments would be positively correlated with the community judgment scores. A click location of .5 indicates that the listener perceived the sound as exactly between /s/ and /θ/.
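The scoring steps described in this section (clamping off-line clicks, dividing by the line length, and inverting) can be sketched as follows. The 535-pixel line length comes from the text; the function and constant names are hypothetical.

```python
LINE_LENGTH_PX = 535  # length of the VAS line in pixels, as reported above


def vas_score(click_x_px, line_length_px=LINE_LENGTH_PX):
    """Convert a raw x-coordinate (0 = left, 's' end) to a 0-1 score.

    Clicks off either end of the line are clamped to the nearest endpoint.
    The result is inverted so that 1 means most 's'-like and 0 means most
    'th'-like. Names here are illustrative, not the authors' own.
    """
    clamped = min(max(click_x_px, 0), line_length_px)
    proportion = clamped / line_length_px
    return 1.0 - proportion


# A click at the far left ('s' end) scores 1; a click at the far right scores 0.
left_end = vas_score(0)
right_end = vas_score(535)
off_line = vas_score(-12)  # clamped to the left endpoint
```

Keeping the clamp and the inversion in one function makes the direction of the scale explicit, which matters when the scores are later compared with the proportion of /s/ responses.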