Abstract
Previous research indicates that talkers differ in phonetically relevant properties of speech, including voice-onset-time (VOT) in word-initial stop consonants; some talkers have characteristically shorter VOTs than others. Previous research also indicates that VOT is robustly affected by contextual influences, including speaking rate and place of articulation. This paper examines whether these contextual influences on VOT are themselves talker-specific. Many tokens of alveolar ∕ti∕ (experiment 1) or labial ∕pi∕ and velar ∕ki∕ (experiment 2) were elicited from talkers across a range of rates. VOT and vowel duration (a metric of rate) were measured for each token. Hierarchical linear modeling analyses showed that (1) VOT increased as rate decreased for all talkers, but the magnitude of the increase varied significantly across talkers; thus the effect of rate on VOT was talker-specific; (2) the talker-specific effect of rate was stable across a change in place of articulation; and (3) for all talkers VOTs were shorter for labial than velar stops, and there was no significant variability in the magnitude of this displacement across talkers; thus the effect of place on VOT was not talker-specific. The implications of these findings for how listeners might accommodate talker differences in VOT during speech perception are discussed.
INTRODUCTION
The past 50 years of research in speech acoustics have yielded substantial information on the acoustic parameters that specify individual speech segments. One consistent finding in this domain is that there is considerable variability in the acoustic-phonetic information produced for individual consonants and vowels, such that there is no one-to-one mapping between the acoustic signal and speech segment. Many sources of acoustic-phonetic variability have been examined, including variability that results from differences in pronunciation across individual talkers. Talker differences have been observed for a host of speech sound classes including vowels (Hillenbrand et al., 1995; Peterson and Barney, 1952), fricatives (Newman et al., 2001), stops (Allen et al., 2003; Byrd, 1992; Zue and Laferriere, 1979), and liquids (Espy-Wilson et al., 2000; Hashi et al., 2003). The goal of the current work is to further characterize such talker differences.
Our primary motivation for examining talker differences in detail stems from recent findings in the speech perception literature indicating that listeners retain fine-grained information about how a talker implements speech segments (Goldinger, 1996; Goldinger, 1998; Palmeri et al., 1993) and that this information can be used to facilitate subsequent processing (Bradlow and Bent, 2008; Bradlow and Pisoni, 1999; Clarke and Garrett, 2004; Nygaard and Pisoni, 1998; Nygaard et al., 1994). In order to provide a theoretical account of speech perception that describes the role of talker-specific phonetic detail, comprehensive data on the acoustic-phonetic consequences of talker differences in speech production are necessary.
In this paper we examine talker differences for one phonetically relevant property of speech, voice-onset-time (VOT). VOT is a primary cue marking the linguistic contrast of voicing in word-initial English stops. In word-initial position, English voiced stops (∕b∕, ∕d∕, and ∕g∕) are typically produced with short VOTs (or, in some cases, with prevoicing), and English voiceless stops (∕p∕, ∕t∕, and ∕k∕), which are aspirated, are produced with longer VOTs (Lisker and Abramson, 1964). Recent research has shown that this property is subject to individual talker differences (Allen et al., 2003). Focusing on voiceless stops, Allen et al. (2003) compared word-initial VOTs for many monosyllabic words across eight talkers. Their results showed that even after statistically controlling for contextual factors such as speaking rate (using both syllable duration and, in separate analyses, vowel duration as metrics of speaking rate), a statistically significant amount of variability in VOT was accounted for by stable differences across individual talkers. In other words, the talkers differed in their characteristic VOTs, with some talkers producing longer VOTs compared to other talkers. Here we build on this finding by examining the role of contextual influences on VOT at the level of individual talkers.
Two contextual factors that have been examined extensively with respect to VOT, and that are the focus of the current research, are place of articulation and speaking rate (e.g., Klatt, 1975; Lisker and Abramson, 1967; Picheny et al., 1986; Robb et al., 2005). First consider place of articulation. It is well established that, in general, VOT increases as place moves from an anterior to posterior point of constriction in the vocal tract (e.g., Cho and Ladefoged, 1999; Lisker and Abramson, 1964; Volaitis and Miller, 1992). In the current paper we examine whether talkers systematically vary in the magnitude of this effect.
Next consider speaking rate. At a global level, rate is a complex variable that encompasses the rate at which speech itself is produced as well as the number and duration of pauses and aspects of higher-level prosodic structure. There is evidence that the specific way in which a change in speaking rate is implemented may vary in numerous respects across individual talkers (e.g., Crystal and House, 1982; Kuehn and Moll, 1976; Matthies et al., 2001; McClean, 2000). Nonetheless, it appears that for all talkers a change in overall rate involves a change in the rate of speech itself (and is not solely due to a change in pausing) (Miller et al., 1984), and this is the focus of the current study. Specifically, we examine how the rate at which a given syllable was produced (defined in terms of its syllable or vowel duration) influences VOT. We know from previous research that, in general, VOT systematically increases as speaking rate decreases (and syllables become longer), especially for voiceless aspirated stops such as English ∕p∕, ∕t∕, and ∕k∕ (e.g., Kessinger and Blumstein, 1997; Miller et al., 1986; Nagao and de Jong, 2007). In the current paper we examine whether talkers exhibit systematic variability in the extent to which changes in rate affect VOT.
We report two experiments. Experiment 1 is centered on the effect of speaking rate on VOT in the context of the alveolar voiceless stop ∕t∕. In experiment 2, we extend the findings of experiment 1 to the labial (∕p∕) and velar (∕k∕) voiceless stops, as well as examine the effect of place of articulation per se on VOT. To preview our results, we find evidence that the magnitude of the speaking rate effect, but not of the place effect, is talker-specific. The implications of these distinct patterns of results for accounts of speech perception are considered in the Discussion.
EXPERIMENT 1
The primary goal of experiment 1 was to extend the investigation of talker differences in word-initial VOT for voiceless stop consonants by examining the effect of speaking rate on VOT at the level of individual talkers. Specifically, we examined whether the magnitude of the increase in VOT as speaking rate decreases systematically differs across talkers. A secondary goal of experiment 1 was to replicate the findings of Allen et al. (2003) using a different methodology. They observed talker differences in VOT when speaking rate was statistically controlled; we examined whether such differences are also observed when comparing syllables produced at the same rate of speech.
Method
Subjects
Ten talkers (five males, E1M1–E1M5; five females, E1F1–E1F5) were recruited from the Northeastern University community for this experiment. The talkers were native speakers of American English between 18 and 31 years of age with no history of speech or language disorders, and were either paid or received partial course credit for their participation.
Recordings
A magnitude-production procedure (e.g., Adams et al., 1993; Lane and Grosjean, 1973; Miller et al., 1986; Volaitis and Miller, 1992) was used to elicit multiple repetitions of the syllable ∕ti∕ that span a wide range of syllable durations. The alveolar stop was recorded in a constrained phonetic environment in order to control for factors that can influence VOT and syllable duration (e.g., vowel identity and final consonant, Port and Rotunno, 1979; Weismer, 1979); such factors could introduce extraneous variability, making it difficult to isolate talker-specific effects of rate on VOT. In the magnitude-production procedure, talkers were directed to produce clear tokens of the syllable ∕ti∕ at their normal speaking rate and at rates relative to their normal speaking rate. Each talker was recorded producing eight “runs” of syllables. A run consisted of six repetitions of ∕ti∕ at each of the following speaking rates: normal, twice as fast, four times as fast, as fast as possible, normal, twice as slow, four times as slow, as slow as possible. Thus, each run yielded syllables produced at eight speaking rates—seven unique speaking rates and two blocks of repetitions produced at a “normal” speaking rate. Note that this procedure was used as a tool for acquiring syllables that exhibited variation in overall duration and not as a means to compare the duration of syllables across individuals produced, for example, at a normal speaking rate. The extreme rate prompts (e.g., “as fast as possible”) were provided to encourage duration variation, and talkers were told that these prompts should reflect the variation found in natural speech and not, for example, the direction to speak as fast as humanly possible. Talkers were given a practice run prior to the recording session and were also given a short break after the first four runs. All recordings took place in a sound-attenuated booth. Speech was recorded via microphone (AKG C460B) onto digital audiotape.
In total, 3840 syllables (6 repetitions×8 speaking rates×8 runs×10 talkers) were recorded. All recordings were digitized at a sampling rate of 20 kHz using the CSL system (KayPENTAX). Syllables produced in the first block of the normal speaking rate for each run were excluded from further analyses to help ensure that, at least to a first approximation, tokens were evenly distributed across the measured range of syllable (and vowel) duration. In addition, the final repetition at each speaking rate was excluded from further analyses because this token may have been subject to a phrase-final lengthening effect (Klatt, 1976). Excluding these tokens left 2800 possible syllables (5 repetitions×7 speaking rates×8 runs×10 talkers) for acoustic analysis.
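The token counts above follow directly from the design. As a minimal sketch of the arithmetic (the variable names are illustrative and do not come from the original study):

```python
# Design factors for experiment 1, taken from the text.
repetitions_per_rate = 6
rates_per_run = 8   # eight blocks per run, including two "normal" blocks
runs = 8
talkers = 10

# Every recorded syllable.
recorded = repetitions_per_rate * rates_per_run * runs * talkers

# Dropping the first "normal" block leaves 7 blocks per run, and dropping
# the final repetition at each rate leaves 5 repetitions per block.
retained = (repetitions_per_rate - 1) * (rates_per_run - 1) * runs * talkers

print(recorded, retained)  # 3840 2800
```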
Acoustic measurements
The PRAAT speech analysis software (Boersma, 2001) was used to generate a waveform for each syllable. On each waveform, three points in time were located: the onset of the release burst, marked by the onset of low-amplitude, aperiodic noise; voicing onset, marked by the onset of high-amplitude, periodic energy; and voicing offset, marked by the offset of the last visible glottal pulse. From these three points in time, three durations were calculated. VOT was calculated as the latency between the onset of the release burst and voicing onset. Vowel duration was calculated as the latency between voicing onset and voicing offset. Syllable duration was calculated as the latency between the onset of the release burst and voicing offset. In line with numerous studies examining the effect of speaking rate at the segmental level, vowel duration and syllable duration were used as metrics of rate (e.g., Allen et al., 2003; Kessinger and Blumstein, 1997; Nagao and de Jong, 2007; Port, 1981). Vowel duration was used as the primary metric because the statistical analyses used in the current research require that the metric of speaking rate and VOT be mathematically independent. (Because the syllable duration measurement for a particular token includes VOT for that token, syllable duration is not mathematically independent of VOT.) However, syllable duration was also considered, as a secondary metric, in accord with the traditional definition of speaking rate as the number of syllables produced per unit time. For all analyses presented in this paper, two versions were conducted: one that used vowel duration as the metric of rate and one that used syllable duration as the metric of rate. Analogous results were found in all cases. For ease of explication, we describe all analyses and results using only the vowel duration metric.
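The relation among the three durations can be made concrete with a short sketch. The function name, the field names, and the example landmark times below are hypothetical; only the three definitions come from the text.

```python
from typing import NamedTuple


class Durations(NamedTuple):
    vot_ms: float
    vowel_ms: float
    syllable_ms: float


def durations_from_landmarks(burst_onset_ms, voicing_onset_ms, voicing_offset_ms):
    """Derive VOT, vowel duration, and syllable duration from three time points (ms)."""
    if not burst_onset_ms <= voicing_onset_ms <= voicing_offset_ms:
        raise ValueError("landmarks must be in temporal order")
    vot = voicing_onset_ms - burst_onset_ms          # burst onset to voicing onset
    vowel = voicing_offset_ms - voicing_onset_ms     # voicing onset to voicing offset
    syllable = voicing_offset_ms - burst_onset_ms    # burst onset to voicing offset
    return Durations(vot, vowel, syllable)


# Hypothetical token: burst at 12 ms, voicing from 90 ms to 410 ms.
d = durations_from_landmarks(12.0, 90.0, 410.0)
print(d)
```

Because syllable duration is by construction the sum of VOT and vowel duration, it cannot serve as a mathematically independent predictor of VOT, which is the reason given above for treating vowel duration as the primary metric.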
For the 2800 syllables measured, two exclusionary criteria were used to select a final set for statistical analysis. First, a token was excluded if there were production anomalies or if a clear burst onset and vowel offset could not be determined; 2.4% of the tokens were excluded on this basis. Second, a token was excluded if its syllable duration was greater than 799 ms. This criterion, which was established through informal listening, was intended to exclude tokens that were perceived as unnaturally long; 2.8% of the tokens were excluded on this basis. As a result of this selection process, 2654 syllables that spanned durations from 125 to 798 ms were used in subsequent analyses.
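The two exclusionary criteria amount to a simple filter over the measured tokens. The sketch below assumes a hypothetical token record with an `anomalous` flag and a `syllable_ms` field (set to `None` when the landmarks could not be determined); these names and the example tokens are illustrative only.

```python
MAX_SYLLABLE_MS = 799  # cutoff stated in the text


def keep_token(token):
    """Return True if a token passes both exclusionary criteria."""
    # Criterion 1: no production anomaly, and the landmarks were measurable.
    if token["anomalous"] or token["syllable_ms"] is None:
        return False
    # Criterion 2: syllable duration must not be greater than the cutoff.
    return token["syllable_ms"] <= MAX_SYLLABLE_MS


# Hypothetical tokens for illustration only.
tokens = [
    {"id": 1, "anomalous": False, "syllable_ms": 412.0},
    {"id": 2, "anomalous": True, "syllable_ms": 388.0},
    {"id": 3, "anomalous": False, "syllable_ms": 845.0},
    {"id": 4, "anomalous": False, "syllable_ms": None},
]
kept = [t["id"] for t in tokens if keep_token(t)]
print(kept)
```

The two reported exclusion rates (2.4% and 2.8% of 2800 tokens, about 146 tokens in total) are consistent with the 2654 tokens retained.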
Reliability
One trained experimenter conducted all acoustic measurements. In order to determine cross-experimenter reliability, a different trained experimenter measured approximately 13% of the syllables (one randomly determined run from each talker). Correlations (Pearson’s r) between the two experimenters’ measurements were 0.99 for both VOT and vowel duration. The mean absolute differences between the experimenters’ measurements were 2 ms (SD=2) for VOT and 12 ms (SD=16) for vowel duration.
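The two reliability indices reported here, Pearson's r and the mean absolute difference, can be computed as in the following sketch. The measurement values are invented for illustration and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_abs_diff(x, y):
    """Mean absolute difference between paired measurements."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical VOT measurements (ms) of the same five tokens by two experimenters.
first = [62.0, 75.0, 88.0, 101.0, 70.0]
second = [64.0, 74.0, 90.0, 100.0, 73.0]

r = pearson_r(first, second)
mad = mean_abs_diff(first, second)
print(round(r, 3), round(mad, 1))
```

The two indices are complementary: a high correlation shows that the experimenters rank tokens alike, whereas the mean absolute difference shows how far apart their measurements are in milliseconds.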
Results
For each of the ten talkers, a linear function relating VOT to vowel duration was calculated using a least squares prediction method. To illustrate, Fig. 1 shows VOT (ms) as a function of vowel duration (ms) for two of the ten talkers; in this figure, each filled circle represents a single token of ∕ti∕ and the solid lines represent the linear functions relating VOT to vowel duration. For both talkers, the tokens span a wide range of vowel durations, and VOT systematically increases as speaking rate decreases.1
Table 1 shows the slopes and intercepts of the ten individual talker functions, as well as the correlations (Pearson’s r) between the functions and observed values as an index of goodness-of-fit. Slopes are shown as the increase in VOT (ms) per 100 ms increase in vowel duration and the intercepts are shown as VOT at the mean vowel duration produced across all talkers, which was 319 ms. The slopes of the individual talker functions measure the effect of speaking rate on VOT. The intercepts of the individual talker functions represent VOT at a given vowel duration; in other words, the intercepts of the individual talker functions measure VOT at a single speaking rate.
Table 1.
Talker | Alveolar slope (ms per 100 ms) | Alveolar intercept (ms) | r |
---|---|---|---|
E1M1 | 21 | 91 | 0.78 |
E1M2 | 23 | 79 | 0.81 |
E1M3 | 14 | 62 | 0.67 |
E1M4 | 8 | 78 | 0.69 |
E1M5 | 7 | 62 | 0.50 |
E1F1 | 16 | 77 | 0.68 |
E1F2 | 10 | 82 | 0.71 |
E1F3 | 22 | 86 | 0.70 |
E1F4 | 14 | 71 | 0.71 |
E1F5 | 12 | 87 | 0.74 |
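The slope and intercept conventions used in Table 1 can be illustrated with a minimal ordinary least-squares sketch. The function names and the synthetic tokens are assumptions made for illustration; the study's own fitting software is not specified in this section.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def table_parameters(vowel_ms, vot_ms, reference_ms=319.0):
    """Express a fit in the table's units.

    Slope: change in VOT (ms) per 100 ms of vowel duration.
    Intercept: predicted VOT (ms) at the reference vowel duration.
    """
    a, b = fit_line(vowel_ms, vot_ms)
    return 100.0 * b, a + b * reference_ms


# Synthetic tokens lying exactly on VOT = 24 + 0.21 * vowel duration.
vowel = [150.0, 250.0, 350.0, 450.0, 550.0]
vot = [24.0 + 0.21 * v for v in vowel]

slope, intercept = table_parameters(vowel, vot)
print(round(slope), round(intercept))
```

Evaluating the intercept at the group mean vowel duration, rather than at 0 ms, keeps it within the measured range, so that it can be read as VOT at a single shared speaking rate.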
Consider first the slopes of the individual talker functions. Across the ten talkers, the slopes show wide variability. For example, given a 100 ms change in vowel duration, VOT for talker E1M2 increases approximately three times as much as VOT for talker E1M4 (also shown in Fig. 1). Turning to the intercepts of the individual talker functions, VOT also varies considerably, spanning values from 62 to 91 ms. Inspection of these parameters suggests that the magnitude of the effect of rate on VOT does vary across talkers, and that talker differences in VOT are present for syllables produced at the same speaking rate.
A hierarchical linear modeling (HLM) analysis (Bryk and Raudenbush, 1992) was used in order to test the statistical significance of the variability in talkers’ slopes and intercepts. One benefit of using an HLM analysis is that it allows us to compare the slope and intercept parameters across talkers while taking into account the entire set of data, which consisted of 2654 tokens. (A complete description of the HLM structure for all models presented in this paper is provided in the Appendix.) In terms of the talkers’ slopes, results showed that the mean slope across talkers was non-zero [t(9)=8.11, p<0.001], confirming that VOT systematically increased as vowel duration increased (i.e., rate decreased) across the group of talkers. Critically, the results also showed that there was significant variability in the talkers’ slopes [χ2(9)=374.78, p<0.001], indicating that how much VOT increased as rate decreased was not the same for all talkers. In terms of the talkers’ intercepts, results confirmed that the mean intercept across talkers was non-zero [t(9)=25.43, p<0.001], and showed that there was significant variability in the talkers’ intercepts [χ2(9)=776.23, p<0.001]. This finding indicates that talkers differed in their characteristic VOTs for utterances produced at the same speaking rate.
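For orientation, a generic two-level random-coefficients model in the style of Bryk and Raudenbush (1992) is sketched below. This is an assumed formulation; the exact specification used in the paper is the one given in its Appendix and may differ. Here $i$ indexes tokens, $j$ indexes talkers, $\mathrm{VD}$ is vowel duration, and $\overline{\mathrm{VD}}$ is the mean vowel duration across talkers.

```latex
\begin{aligned}
\text{Level 1 (tokens within talker } j\text{):}\quad
  & \mathrm{VOT}_{ij} = \beta_{0j} + \beta_{1j}\,\bigl(\mathrm{VD}_{ij} - \overline{\mathrm{VD}}\bigr) + r_{ij} \\
\text{Level 2 (talkers):}\quad
  & \beta_{0j} = \gamma_{00} + u_{0j} \\
  & \beta_{1j} = \gamma_{10} + u_{1j}
\end{aligned}
```

Under this reading, the t tests reported above concern the fixed effects $\gamma_{00}$ (mean intercept) and $\gamma_{10}$ (mean slope), and the χ2 tests concern the variances of the talker-level deviations $u_{0j}$ and $u_{1j}$. With ten talkers, both kinds of test have 9 degrees of freedom, which matches the values reported.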
An additional set of analyses was performed in order to examine whether talker differences in VOT would be observed across a range of vowel durations, and not solely at the mean vowel duration produced across all talkers. The motivation for these analyses stems from the finding that there was significant variability in the slopes of the individual talker functions, with some functions intersecting within the measured range of vowel duration. As a consequence, even though talker differences in VOT were observed at the mean vowel duration, they will not necessarily be observed across a range of vowel durations. For these analyses, four intercepts (shown in Table 2) were calculated for each talker corresponding to VOT (ms) at 200, 300, 400, and 500 ms vowel durations; these values span the range of greatest intersection among the individual functions. HLM analyses (see Appendix) confirmed that there was significant variability in talkers’ intercepts at each vowel duration [in all cases; χ2(9)>311.00, p<0.001], indicating that the presence of talker differences in VOT is not contingent on speaking rate.
Table 2. Alveolar intercepts (VOT in ms) at four vowel durations.
Talker | 200 ms | 300 ms | 400 ms | 500 ms |
---|---|---|---|---|
E1M1 | 66 | 87 | 108 | 129 |
E1M2 | 51 | 74 | 97 | 120 |
E1M3 | 45 | 59 | 73 | 87 |
E1M4 | 69 | 77 | 85 | 93 |
E1M5 | 54 | 61 | 68 | 75 |
E1F1 | 58 | 74 | 90 | 106 |
E1F2 | 71 | 81 | 91 | 101 |
E1F3 | 60 | 82 | 104 | 126 |
E1F4 | 54 | 68 | 82 | 96 |
E1F5 | 73 | 85 | 97 | 109 |
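The intercepts in Table 2 are points on each talker's linear function, so they can be approximated from the Table 1 parameters. The sketch below does this for talker E1M1; the function name is illustrative, and because the published parameters are rounded, values for some other talkers may differ from Table 2 by about 1 ms.

```python
REFERENCE_MS = 319  # mean vowel duration across talkers in experiment 1


def vot_at(duration_ms, slope_per_100, intercept_at_ref):
    """Predicted VOT (ms) at a vowel duration, from the Table 1 parameters."""
    return intercept_at_ref + slope_per_100 * (duration_ms - REFERENCE_MS) / 100


# Talker E1M1 in Table 1: slope 21 ms per 100 ms, intercept 91 ms at 319 ms.
predicted = [round(vot_at(d, 21, 91)) for d in (200, 300, 400, 500)]
print(predicted)  # [66, 87, 108, 129]
```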
EXPERIMENT 2
Experiment 2 extends the findings from experiment 1 in three ways. First, we attempt to replicate the findings from the first experiment for the other two voiceless stops in English, labial ∕p∕ and velar ∕k∕. Second, we examine whether the effect of rate for a particular talker is stable across a change in place of articulation by comparing the slopes of the functions relating VOT to vowel duration for the labial and velar voiceless stops. Third, experiment 2 examines whether the contextual influence of place of articulation on VOT is itself talker-specific. As noted earlier, previous research has shown that, in general, VOT increases as place moves from front to back in the vocal tract (e.g., Lisker and Abramson, 1964). In the current experiment we examine whether the magnitude of the difference in VOT between ∕p∕ and ∕k∕ varies across individual talkers.
Method
Subjects
Ten talkers (five males, E2M1–E2M5; five females, E2F1–E2F5) who did not participate in experiment 1 were recruited from the Northeastern University community for this experiment. The talkers were native speakers of American English between 18 and 22 years of age with no history of speech or language disorders, and were either paid or received partial course credit for their participation.
Recordings
The magnitude-production procedure described in experiment 1 was used to elicit multiple repetitions of the syllables ∕pi∕ and ∕ki∕ across a range of syllable durations. As in experiment 1, talkers produced eight runs of each syllable, with each run consisting of six repetitions at eight speaking rates. The order of the labial and velar syllables was counter-balanced across talkers. All recordings followed the procedure outlined for experiment 1.
In total, 7680 syllables (6 repetitions×8 speaking rates×8 runs×10 talkers×2 places of articulation) were recorded. All recordings were digitized at a sampling rate of 20 kHz using the CSL system. As in experiment 1, all syllables produced in the first block of the normal speaking rate for each run and the final repetition at each speaking rate were excluded from further analyses. Excluding these tokens left 5600 possible syllables (5 repetitions×7 speaking rates×8 runs×10 talkers×2 places of articulation) for acoustic analysis.
Acoustic measurements
The PRAAT speech analysis software was used to generate a waveform for each of the 5600 syllables. As in experiment 1, VOT, vowel duration, and syllable duration were calculated for each waveform, and two exclusionary criteria were used to select a final set of syllables for statistical analysis. First, a token was excluded if there were production anomalies or if a clear burst onset and vowel offset could not be determined; 7.4% of the tokens were excluded on this basis. Second, a token was excluded if its syllable duration was greater than 799 ms; 9.6% of the tokens were excluded on this basis. As a result of this selection process, 4646 syllables that spanned durations from 115 to 799 ms were used in subsequent analyses.2
Reliability
Two trained experimenters, who each measured a subset of the recorded tokens, conducted all acoustic measurements. To determine cross-experimenter reliability, a third trained experimenter measured one randomly determined run of ∕pi∕ and ∕ki∕ for each talker (approximately 13% of the tokens). Correlations (Pearson’s r) between the original measurements and the third experimenter’s measurements were 0.98 for VOT and 0.99 for vowel duration. The mean absolute differences between the two sets of measurements were 4 ms (SD=6) for VOT and 29 ms (SD=27) for vowel duration.
Results
For each of the ten talkers, two linear functions relating VOT to vowel duration were calculated using a least squares prediction method, one for the labial syllables and one for the velar syllables.3 Table 3 shows the slopes and intercepts of the individual talker functions, as well as the correlations (Pearson’s r) between the functions and observed values as an index of goodness-of-fit. Slopes are shown as the increase in VOT (ms) per 100 ms increase in vowel duration and the intercepts are shown as VOT at the mean vowel duration produced across all talkers for the labial and velar tokens, which was 374 ms. Three sets of analyses were performed on the parameters specifying the individual talker functions. In the first set of analyses, we attempted to extend findings from experiment 1 to labial and velar voiceless stops. The second set of analyses examined whether, for a given talker, the magnitude of the effect of speaking rate on VOT is stable across place of articulation. The third set of analyses examined whether the contextual influence of place of articulation itself is talker-specific.
Table 3.
Talker | Labial slope (ms per 100 ms) | Labial intercept (ms) | Labial r | Velar slope (ms per 100 ms) | Velar intercept (ms) | Velar r |
---|---|---|---|---|---|---|
E2M1 | 9 | 60 | 0.55 | 6 | 103 | 0.32 |
E2M2 | 5 | 37 | 0.48 | 10 | 91 | 0.55 |
E2M3 | 25 | 83 | 0.80 | 20 | 99 | 0.81 |
E2M4 | 10 | 55 | 0.74 | 7 | 95 | 0.53 |
E2M5 | 3 | 30 | 0.42 | 4 | 86 | 0.48 |
E2F1 | 10 | 78 | 0.51 | 13 | 111 | 0.44 |
E2F2 | 10 | 64 | 0.77 | 9 | 93 | 0.71 |
E2F3 | 12 | 77 | 0.48 | 13 | 126 | 0.61 |
E2F4 | 3 | 57 | 0.30 | 6 | 81 | 0.58 |
E2F5 | 16 | 81 | 0.41 | 23 | 113 | 0.59 |
Replication
Following the structure used for experiment 1, HLM analyses were applied to the labial data (2481 tokens) and, separately, to the velar data (2165 tokens). As expected, the results showed that for both the labial and velar functions, the mean slope across talkers was significantly different from zero [t(9)=5.05, p<0.001 and t(9)=5.74, p<0.001; respectively] and the mean intercept across talkers was significantly different from zero [t(9)=10.94, p<0.001 and t(9)=22.92, p<0.001; respectively]. Moreover, there was significant variability in the slopes [χ2(9)=481.47, p<0.001 and χ2(9)=332.37, p<0.001; respectively] and intercepts [χ2(9)=1559.82, p<0.001 and χ2(9)=957.19, p<0.001; respectively] of the individual talker functions. These results extend the findings from experiment 1 to labial and velar voiceless stops, confirming not only the presence of talker differences in VOT at a single speaking rate, but also that the effect of speaking rate on VOT varied significantly across talkers. As in experiment 1, we also tested for talker differences in VOT across a range of vowel durations (i.e., speaking rates) for both the labial and velar stops (see Table 4). HLM analyses showed that there was significant variability in talkers’ intercepts at each vowel duration [in all cases; χ2(9)>274.00, p<0.001].
Table 4. Labial and velar intercepts (VOT in ms) at four vowel durations (ms).
Talker | Labial 200 | Labial 300 | Labial 400 | Labial 500 | Velar 200 | Velar 300 | Velar 400 | Velar 500 |
---|---|---|---|---|---|---|---|---|
E2M1 | 44 | 53 | 62 | 71 | 93 | 99 | 105 | 111 |
E2M2 | 28 | 33 | 38 | 43 | 74 | 84 | 94 | 104 |
E2M3 | 40 | 65 | 90 | 115 | 64 | 84 | 104 | 124 |
E2M4 | 37 | 47 | 57 | 67 | 83 | 90 | 97 | 104 |
E2M5 | 24 | 27 | 30 | 33 | 79 | 83 | 87 | 91 |
E2F1 | 60 | 70 | 80 | 90 | 88 | 101 | 114 | 127 |
E2F2 | 47 | 57 | 67 | 77 | 77 | 86 | 95 | 104 |
E2F3 | 57 | 69 | 81 | 93 | 103 | 116 | 129 | 142 |
E2F4 | 51 | 54 | 57 | 60 | 71 | 77 | 83 | 89 |
E2F5 | 53 | 69 | 85 | 101 | 73 | 96 | 119 | 142 |
Stability of the effect of speaking rate on VOT for individual talkers
Results reported above indicate that the magnitude of the effect of speaking rate on VOT varies across talkers for a given voiceless stop. This finding highlights a source of systematic variability in the speech signal in that how much VOT increases as rate decreases can vary from talker to talker. Here we examine a potential source of stability in the speech signal by comparing the effect of rate on VOT for a given talker across a change in place of articulation. In this analysis, we considered the slopes of the labial and velar functions for individual talkers. If, for a given talker, the effect of rate on VOT is stable across a change in place of articulation, then the slopes of the labial and velar functions will be approximately the same. Inspection of the labial and velar slopes, shown in Table 3, suggests that this may be the case in that the difference between the labial and velar slopes for any given talker is quite small. To illustrate, Fig. 2 displays VOT (ms) as a function of vowel duration (ms) at both places of articulation for one of the ten talkers. VOT increases as speaking rate decreases for both the labial and velar tokens, and does so to approximately the same degree.
In order to examine the statistical significance of the difference between the labial and velar slopes for individual talkers, we conducted an additional HLM analysis nesting the labial and velar slopes within talkers. Results from this analysis revealed that there was no significant variability across talkers in the difference between the labial and velar slopes [χ2(9)=1.60, p>0.50], which indicates that the effect of speaking rate on VOT for a given talker is the same for labial and velar voiceless stops.
Effect of place of articulation on VOT for individual talkers
In this set of analyses, we examined whether the magnitude of the difference between labial and velar VOTs varies significantly across talkers. To quantify the effect of place of articulation on VOT for each talker, we used the difference between the labial and velar intercepts, with the intercept defined as VOT at 374 ms vowel duration (shown in Table 3). Because the results reported above indicate that the slopes of the labial and velar functions within a given talker are not statistically different (and thus the functions are approximately parallel), using a single point on each function as the basis of comparison is valid in that the difference between labial and velar VOTs will be the same for any value along the x-axis.
As expected, the labial intercept was located at a shorter VOT than the velar intercept for each talker, resulting in a reliable effect of place of articulation on VOT for the group of talkers [mean difference=37.60 ms; t(9)=9.06, p<0.001]. To examine the central question of whether the magnitude of the difference in labial and velar intercepts varied significantly across talkers, an HLM analysis was used to nest the labial and velar intercepts within talkers. The HLM results showed that there was no significant variability in the difference between the labial and velar intercepts across individual talkers [χ2(9)=2.97, p>0.50]. These results indicate that the effect of place of articulation on VOT does not vary across individual talkers.
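The group-level place effect can be checked against the rounded intercepts in Table 3. The sketch below assumes that the reported statistic is a paired t test on the ten velar-minus-labial differences; that is an assumption, although the 9 degrees of freedom are consistent with it.

```python
import math

# Intercepts (ms) at 374 ms vowel duration, copied from Table 3.
labial = {"E2M1": 60, "E2M2": 37, "E2M3": 83, "E2M4": 55, "E2M5": 30,
          "E2F1": 78, "E2F2": 64, "E2F3": 77, "E2F4": 57, "E2F5": 81}
velar = {"E2M1": 103, "E2M2": 91, "E2M3": 99, "E2M4": 95, "E2M5": 86,
         "E2F1": 111, "E2F2": 93, "E2F3": 126, "E2F4": 81, "E2F5": 113}

# Velar minus labial difference for each talker.
diffs = [velar[t] - labial[t] for t in labial]
n = len(diffs)

mean_diff = sum(diffs) / n
# Sample standard deviation of the differences (n - 1 in the denominator).
sd = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
# Paired t statistic with n - 1 degrees of freedom.
t_stat = mean_diff / (sd / math.sqrt(n))

print(round(mean_diff, 2), round(t_stat, 2))  # 37.6 9.06
```

Every one of the ten differences is positive, which is the pattern described above: each talker's labial intercept lies at a shorter VOT than that talker's velar intercept.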
DISCUSSION
Previous research has provided evidence for talker-specific variability in the acoustic-phonetic information used to convey individual speech segments (e.g., Espy-Wilson et al., 2000; Newman et al., 2001; Peterson and Barney, 1952; Zue and Laferriere, 1979). As a case in point, recent findings indicate that talkers differ in VOTs produced for voiceless stop consonants; some talkers produce characteristically shorter VOTs than other talkers (Allen et al., 2003). The results from the current research confirm this finding and, most importantly, extend it by examining potential talker specificity in how two contextual variables, speaking rate and place of articulation, influence VOT.
In terms of speaking rate, previous research has shown that as speaking rate decreases (and syllables become longer), VOT systematically increases (e.g., Kessinger and Blumstein, 1997; Volaitis and Miller, 1992). The current results replicated this finding for all three voiceless stops, ∕p∕, ∕t∕, and ∕k∕. However, the results also showed that for each stop, the magnitude of the increase in VOT for a given change in speaking rate varied significantly across talkers. This finding, which indicates that the effect of speaking rate on VOT is talker-specific, highlights a source of systematic variability in the speech signal. Further, the results from experiment 2, which compared ∕p∕ and ∕k∕, showed that for a given talker, the magnitude of the rate effect on VOT remained constant across a change in place of articulation. This finding highlights a source of stability in the speech signal at the individual talker level, in that how rate influences VOT for one voiceless stop is the same for a different voiceless stop.
In terms of place of articulation, the results from experiment 2 showed that for each talker VOTs for ∕p∕ were shorter than VOTs for ∕k∕, in line with previous research (e.g., Lisker and Abramson, 1964). Critically, the results also indicated that the magnitude of displacement between VOTs for ∕p∕ and ∕k∕ did not vary significantly across talkers. Thus, unlike speaking rate, the contextual influence of place of articulation on VOT appears not to be talker-specific.
Taken together, these findings have implications for theoretical accounts of speech perception. It has been established that listeners retain talker-specific acoustic-phonetic information in memory (e.g., Goldinger, 1998) and that familiarity with a particular talker’s speech can facilitate word recognition (e.g., Nygaard et al., 1994). Furthermore, findings from literature on perceptual learning in speech suggest that the benefits of talker familiarity observed at the word level might result, at least in part, from talker-specific effects at a prelexical level of representation (e.g., Eisner and McQueen, 2005; Kraljic and Samuel, 2007). Of particular relevance to the current research, Allen and Miller (2004) showed that listeners could learn that one talker produces a particular voiceless stop with characteristically short VOTs and a different talker produces the same stop with characteristically long VOTs.
This finding raises the possibility that listeners may customize stop voicing categories based on individual talkers’ characteristic VOTs. However, given contextual influences on VOT, listeners would need to consider a talker’s characteristic VOTs not in an absolute manner, but with respect to context. Indeed, it is well established that at a general level, listeners do process VOT in relation to numerous contextual factors, including both speaking rate and place of articulation. These contextual influences systematically affect both the boundaries between phonetic categories and the best exemplars of a given phonetic category (e.g., Lisker and Abramson, 1970; Miller and Volaitis, 1989; Summerfield, 1981; Volaitis and Miller, 1992). We do not yet know whether such context-dependent processing is tuned to the speech of individual talkers, but the results of the current experiments place constraints on the type of exposure listeners might require for such perceptual tuning.
Specifically, the current data suggest that exposure to a talker’s VOTs for a voiceless stop at one speaking rate would not optimally inform the listener as to that talker’s VOTs for the stop at a novel speaking rate. Because the magnitude of the rate effect systematically varies across talkers, in order to accommodate the contextual influence of rate on VOT listeners would need to learn, for a given talker, how much VOT changes as a function of speaking rate; that is, ascertain the slope of the function relating VOT to rate. However, because the magnitude of the rate effect on VOT for a given talker is stable across a change in place of articulation, tracking the contextual influence of rate in the context of one voiceless stop could potentially inform the listener as to how this contextual influence operates for other voiceless stops in similar phonetic environments.
The current data also suggest that listeners might not need to track the contextual influence of place of articulation per se on VOT at the level of individual talkers. Because the magnitude of the place effect does not systematically differ across individual talkers, listeners could rely on more general knowledge, perhaps specific to their language (e.g., Cho and Ladefoged, 1999), to inform them as to how VOT shifts as a function of place of articulation. As a consequence, for a given speaking rate and a similar phonetic environment, learning a particular talker’s characteristic VOTs for one voiceless stop may inform the listener as to that talker’s VOTs for voiceless stops with a different place of articulation.
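The reasoning in the two preceding paragraphs can be illustrated with a small numerical sketch. It assumes a linear VOT-by-vowel-duration function per talker and a single place displacement shared by all talkers; the function `predict_vot`, the talker parameters, and the 20 ms velar offset are hypothetical values chosen for illustration, not estimates from the experiments.

```python
def predict_vot(intercept_ms, slope, vowel_duration_ms, place_offset_ms=0.0):
    """Predicted VOT (ms) from a talker's linear rate function plus a place offset."""
    return intercept_ms + slope * vowel_duration_ms + place_offset_ms

# Hypothetical talkers as (intercept in ms, slope in ms of VOT per ms of vowel).
# The values are invented so that both talkers produce the same labial VOT
# when the vowel is 200 ms long.
TALKERS = {"A": (50.0, 0.05), "B": (30.0, 0.15)}

# Hypothetical labial-to-velar displacement, assumed identical for every talker.
VELAR_OFFSET_MS = 20.0

predictions = {}
for name, (intercept, slope) in TALKERS.items():
    predictions[name] = {
        "labial_200": predict_vot(intercept, slope, 200.0),
        "labial_400": predict_vot(intercept, slope, 400.0),
        "velar_200": predict_vot(intercept, slope, 200.0, VELAR_OFFSET_MS),
    }
    print(name, predictions[name])
```

In this sketch the two talkers are indistinguishable at the 200 ms vowel duration but diverge at 400 ms, so exposure at a single rate does not determine VOT at a novel rate. The velar prediction, in contrast, follows from the labial one by adding the shared offset.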
In sum, the present data provide basic information on how two contextual factors influence VOT at a talker-specific level and, in so doing, point to constraints on how listeners might accommodate such contextual variation when customizing phonetic categories for an individual talker’s speech. Future research is aimed at examining the nature and extent of such perceptual fine-tuning.
ACKNOWLEDGMENTS
This research was supported by NIH Grant No. R01 DC000130 to J.L.M. and by NIH Grant No. F31 DC009114 (Ruth L. Kirschstein NRSA for Individual Predoctoral Fellows) to R.M.T., and formed the basis for part of the doctoral dissertation of R.M.T. at Northeastern University. We thank Eliza Floyd, Katrina Smith, and Janelle LaMarche for assistance with the acoustic measurements, and we thank the reviewers for their helpful comments and suggestions.
APPENDIX
HLM analyses (Bryk and Raudenbush, 1992) were used in the current research because they allow examination of stable individual differences around group level patterns. HLM analyses are based on linear regression techniques; however, unlike standard regression models, HLM analyses are well suited for examination of data from repeated-measures designs. All of the analyses reported in this paper are based on two HLM structures. The first HLM structure was used to compare slope and intercept parameters within a single place of articulation. This model was used in experiment 1 to compare the slopes and intercepts of the alveolar functions and in experiment 2 to compare the slopes and intercepts of the labial functions and, separately, the velar functions. The second HLM structure was used to compare the slope and intercept parameters across place of articulation. This structure was used in experiment 2 to compare the slopes of the labial functions to the slopes of the velar functions, and, separately, to compare the intercepts of the labial functions to the intercepts of the velar functions. The details of each type of model are presented in turn.
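In order to test the statistical significance of the variability in talkers’ slopes and intercepts within a single place of articulation, all of the tokens for the particular analysis were nested within each of the ten talkers as follows. For the level-1 model,
$$\mathrm{VOT}_{ij} = \beta_{0j} + \beta_{1j}\,(\text{vowel duration})_{ij} + r_{ij},$$
where i indexes tokens, j indexes talkers, and r_{ij} is the token-level residual.
For the level-2 model,
$$\beta_{0j} = \gamma_{00} + u_{0j}, \qquad \beta_{1j} = \gamma_{10} + u_{1j}.$$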
With this structure, VOT is specified as a function of vowel duration, while incorporating the fact that sets of individual tokens are associated with specific talkers. Importantly, the level-2 model allows the intercepts (β0j) and slopes (β1j) of the level-1 model to vary across talkers. That is, the level-2 model estimates the mean intercept (γ00) and mean slope (γ10) values across talkers while also testing if significant variability exists in these parameters (u0j and u1j, respectively) as a function of stable talker differences.
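The logic of this two-level structure can be sketched in code. The example below is a simplified two-stage approximation: it fits an ordinary least-squares line to each talker's tokens and then summarizes the fitted intercepts and slopes across talkers. It is not the HLM estimation used in the paper, which estimates both levels jointly and tests the variance components. All data are simulated, and the population values `GAMMA00` and `GAMMA10`, the talker spread, and the noise level are invented for illustration.

```python
import random
from statistics import mean, stdev

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def simulate_talker(rng, b0, b1, n_tokens=200, noise_sd=5.0):
    """Simulated tokens for one talker: vowel durations (ms) and VOTs (ms)."""
    durations = [rng.uniform(100.0, 500.0) for _ in range(n_tokens)]
    vots = [b0 + b1 * d + rng.gauss(0.0, noise_sd) for d in durations]
    return durations, vots

rng = random.Random(0)

# Invented population mean intercept and slope; not the paper's estimates.
GAMMA00, GAMMA10 = 40.0, 0.10

# Each simulated talker deviates from the population values (the role of u0j, u1j).
true_params = [
    (GAMMA00 + rng.gauss(0.0, 8.0), GAMMA10 + rng.gauss(0.0, 0.03))
    for _ in range(10)
]

# Stage 1: one regression line per talker (the level-1 model, fit separately).
fits = [fit_line(*simulate_talker(rng, b0, b1)) for b0, b1 in true_params]
intercepts = [b0 for b0, _ in fits]
slopes = [b1 for _, b1 in fits]

# Stage 2: summarize across talkers (rough analogues of the level-2 quantities).
mean_intercept, mean_slope = mean(intercepts), mean(slopes)
sd_intercept, sd_slope = stdev(intercepts), stdev(slopes)

print(f"mean intercept {mean_intercept:.1f} ms, mean slope {mean_slope:.3f}")
print(f"between-talker SD: intercept {sd_intercept:.1f} ms, slope {sd_slope:.3f}")
```

Note that the standard deviation of the fitted slopes in this sketch mixes true talker variability with sampling error. Separating the two is what the variance-component tests in a full HLM analysis are designed to do.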
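In order to examine the slopes (or intercepts) across place of articulation, the labial and velar slopes (or intercepts) were nested within talkers as follows. For the level-1 model,
$$\mathrm{slope}_{ij} = \beta_{0j} + \beta_{1j}\,(\text{place})_{ij} + r_{ij},$$
with the intercept of the VOT function taking the place of the slope as the dependent variable in the intercept analyses.
For the level-2 model,
$$\beta_{0j} = \gamma_{00} + u_{0j}, \qquad \beta_{1j} = \gamma_{10} + u_{1j}.$$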
With this structure, slope (or intercept) is specified as a function of place of articulation, while incorporating the fact that pairs of individual values are associated with specific talkers. In order to allow place of articulation to be examined as a linear variable, labial and velar were coded as 0 and 1, respectively. Using this method, the slope parameter of the HLM (β1j) does not indicate the absolute slope (or intercept) for either the labial or velar functions; rather, it represents the difference between the labial and velar slopes (or intercepts). The level-2 model allows the slope (β1j) of the level-1 model to vary across talkers; accordingly, the model estimates the mean difference between the labial and velar slopes (or intercepts) across talkers (γ10) while also testing if significant variability exists in this parameter (u1j).
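The effect of the 0/1 coding can be checked with a short sketch. With labial coded 0 and velar coded 1, a line through a talker's two values has an intercept equal to the labial value and a slope equal to the velar-minus-labial difference. The helper `fit_two_point_line` and the per-talker values below are hypothetical and serve only to illustrate the coding.

```python
def fit_two_point_line(labial_value, velar_value):
    """Fit value = b0 + b1 * place, with labial coded 0 and velar coded 1."""
    place = [0, 1]
    values = [labial_value, velar_value]
    b1 = (values[1] - values[0]) / (place[1] - place[0])
    b0 = values[0] - b1 * place[0]
    return b0, b1

# Invented per-talker intercepts (ms) of the labial and velar VOT functions.
talker_intercepts = {"T1": (55.0, 75.0), "T2": (48.0, 69.0), "T3": (62.0, 81.0)}

differences = {}
for talker, (labial, velar) in talker_intercepts.items():
    b0, b1 = fit_two_point_line(labial, velar)
    differences[talker] = b1  # b1 is the velar-minus-labial difference
    print(talker, "labial value:", b0, "velar - labial:", b1)

# Analogue of the mean labial-velar difference across talkers.
mean_difference = sum(differences.values()) / len(differences)
print("mean difference:", mean_difference)
```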
Portions of this work were presented at the 152nd meeting of the Acoustical Society of America, Honolulu, HI, December 2006; the 153rd meeting of the Acoustical Society of America, Salt Lake City, UT, June 2007; and the XVIth International Congress of Phonetic Sciences, Saarbrücken, Germany, August 2007.
Footnotes
As described in the main text, one assumption of the statistical analyses used in the current research is that VOT and the metric of speaking rate (e.g., vowel duration) are mathematically independent. An additional assumption is that the relationship between VOT and the metric of speaking rate can be adequately described as linear. For the range of speaking rates that occur in typical speech, there is no established theoretical relationship between VOT and speaking rate. To ensure that a linear function would adequately describe the relationship between VOT and speaking rate for each of the ten talkers in the current study, we compared three different functions using both vowel duration and syllable duration as the metric of speaking rate: a linear function, an exponential function (with VOT on a linear scale and speaking rate on a log scale), and a power function (with both VOT and speaking rate on a log scale). In all 20 cases (10 talkers×2 metrics of speaking rate), the correlation coefficient (Pearson’s r) associated with the linear function was statistically significant, and, critically, was greater than or statistically equal to the correlation coefficient of the exponential and power functions.
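The comparison of the three functional forms can be sketched as follows, computing Pearson’s r on the untransformed and log-transformed variables as described above. The sketch only computes the three coefficients; it does not test whether two coefficients differ statistically, which the comparison in this footnote also required. The token values and the helper names `pearson_r` and `compare_forms` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def compare_forms(rates, vots):
    """Correlation for each functional form relating VOT to a rate metric."""
    log_rates = [math.log(x) for x in rates]
    log_vots = [math.log(y) for y in vots]
    return {
        "linear": pearson_r(rates, vots),           # VOT vs. rate
        "exponential": pearson_r(log_rates, vots),  # VOT vs. log rate
        "power": pearson_r(log_rates, log_vots),    # log VOT vs. log rate
    }

# Invented tokens that lie exactly on a line, so the linear form should fit best.
vowel_durations = [100.0, 150.0, 200.0, 300.0, 400.0, 500.0]
vots = [40.0 + 0.1 * d for d in vowel_durations]

r_by_form = compare_forms(vowel_durations, vots)
for form, r in r_by_form.items():
    print(f"{form}: r = {r:.4f}")
```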
As is apparent, a larger percentage of tokens was excluded from statistical analysis in experiment 2 than in experiment 1, due both to an increased proportion of anomalous∕immeasurable tokens and to an increased proportion of extremely long tokens. The underlying reason for the difference across experiments is not known. Importantly, even with the exclusion, the number of tokens available for statistical analysis in both experiments was very large.
As in experiment 1, we confirmed that the relationship between VOT and the metric of speaking rate could be adequately described as linear for the ten talkers examined here. For each place of articulation, we examined the correlation coefficient (Pearson’s r) of three different functions (linear, exponential, and power) using both vowel duration and syllable duration as the metric of speaking rate. In all 40 cases (10 talkers×2 places of articulation×2 metrics of speaking rate), the correlation coefficient of the linear function was statistically significant and, critically, was better than or statistically equal to the correlation coefficient of the exponential and power functions.
References
- Adams, S. G., Weismer, G., and Kent, R. (1993). “Speaking rate and speech movement velocity profiles,” J. Speech Hear. Res. 36, 41–54.
- Allen, J. S., and Miller, J. L. (2004). “Listener sensitivity to individual talker differences in voice-onset-time,” J. Acoust. Soc. Am. 115, 3171–3183. doi:10.1121/1.1701898
- Allen, J. S., Miller, J. L., and DeSteno, D. (2003). “Individual talker differences in voice-onset-time,” J. Acoust. Soc. Am. 113, 544–552. doi:10.1121/1.1528172
- Boersma, P. (2001). “Praat, a system for doing phonetics by computer,” Glot International 5, 341–345.
- Bradlow, A. R., and Bent, T. (2008). “Perceptual adaptation to non-native speech,” Cognition 106, 707–729. doi:10.1016/j.cognition.2007.04.005
- Bradlow, A. R., and Pisoni, D. B. (1999). “Recognition of spoken words by native and non-native listeners: Talker-, listener-, and item-related factors,” J. Acoust. Soc. Am. 106, 2074–2085. doi:10.1121/1.427952
- Bryk, A. S., and Raudenbush, S. W. (1992). Hierarchical Linear Models: Applications and Data Analysis Methods (Sage, Newbury Park, CA).
- Byrd, D. (1992). “Preliminary results on speaker-dependent variation in the TIMIT database,” J. Acoust. Soc. Am. 92, 593–596. doi:10.1121/1.404271
- Cho, T., and Ladefoged, P. (1999). “Variations and universals in VOT: Evidence from 18 languages,” J. Phonetics 27, 207–229. doi:10.1006/jpho.1999.0094
- Clarke, C. M., and Garrett, M. F. (2004). “Rapid adaptation to foreign-accented English,” J. Acoust. Soc. Am. 116, 3647–3658. doi:10.1121/1.1815131
- Crystal, T. H., and House, A. S. (1982). “Segmental durations in connected speech signals: Preliminary results,” J. Acoust. Soc. Am. 72, 705–716. doi:10.1121/1.388251
- Eisner, F., and McQueen, J. M. (2005). “The specificity of perceptual learning in speech processing,” Percept. Psychophys. 67, 224–238.
- Espy-Wilson, C. Y., Boyce, S. E., Jackson, M., Narayanan, S., and Alwan, A. (2000). “Acoustic modeling of American English ∕r∕,” J. Acoust. Soc. Am. 108, 343–356. doi:10.1121/1.429469
- Goldinger, S. D. (1996). “Words and voices: Episodic traces in spoken word identification and recognition memory,” J. Exp. Psychol. Learn. Mem. Cogn. 22, 1166–1183. doi:10.1037/0278-7393.22.5.1166
- Goldinger, S. D. (1998). “Echoes of echoes? An episodic theory of lexical access,” Psychol. Rev. 105, 251–279. doi:10.1037/0033-295X.105.2.251
- Hashi, M., Honda, K., and Westbury, J. R. (2003). “Time-varying acoustic and articulatory characteristics of American English [ɹ]: A cross-speaker study,” J. Phonetics 31, 3–22. doi:10.1016/S0095-4470(02)00062-1
- Hillenbrand, J., Getty, L. A., Clark, M. J., and Wheeler, K. (1995). “Acoustic characteristics of American English vowels,” J. Acoust. Soc. Am. 97, 3099–3111. doi:10.1121/1.411872
- Kessinger, R. H., and Blumstein, S. E. (1997). “Effects of speaking rate on voice-onset time in Thai, French, and English,” J. Phonetics 25, 143–168. doi:10.1006/jpho.1996.0039
- Klatt, D. H. (1975). “Voice onset time, frication, and aspiration in word-initial consonant clusters,” J. Speech Hear. Res. 18, 686–706.
- Klatt, D. H. (1976). “Linguistic uses of segmental duration in English: Acoustic and perceptual evidence,” J. Acoust. Soc. Am. 59, 1208–1221. doi:10.1121/1.380986
- Kraljic, T., and Samuel, A. G. (2007). “Perceptual adjustments to multiple speakers,” J. Mem. Lang. 56, 1–15. doi:10.1016/j.jml.2006.07.010
- Kuehn, D. P., and Moll, K. L. (1976). “A cineradiographic study of VC and CV articulatory velocities,” J. Phonetics 4, 303–320.
- Lane, H., and Grosjean, F. (1973). “Perception of reading rate by speakers and listeners,” J. Exp. Psychol. 97, 141–147. doi:10.1037/h0033869
- Lisker, L., and Abramson, A. S. (1964). “A cross-language study of voicing in initial stops: Acoustical measurements,” Word 20, 384–422.
- Lisker, L., and Abramson, A. S. (1967). “Some effects of context on voice onset time in English stops,” Lang. Speech 10, 1–28.
- Lisker, L., and Abramson, A. S. (1970). “The voicing dimension: Some experiments in comparative phonetics,” in Proceedings of the Sixth International Congress of Phonetic Sciences (Academia, Prague), pp. 563–567.
- Matthies, M., Perrier, P., Perkell, J. S., and Zandipour, M. (2001). “Variation in anticipatory coarticulation with changes in clarity and rate,” J. Speech Lang. Hear. Res. 44, 340–353. doi:10.1044/1092-4388(2001/028)
- McClean, M. D. (2000). “Patterns of orofacial movement velocity across variations in speech rate,” J. Speech Lang. Hear. Res. 43, 205–216.
- Miller, J. L., and Volaitis, L. E. (1989). “Effect of speaking rate on the perceptual structure of a phonetic category,” Percept. Psychophys. 46, 505–512.
- Miller, J. L., Green, K. P., and Reeves, A. (1986). “Speaking rate and segments: A look at the relation between speech production and speech perception for the voicing contrast,” Phonetica 43, 106–115.
- Miller, J. L., Grosjean, F., and Lomanto, C. (1984). “Articulation rate and its variability in spontaneous speech: A reanalysis and some implications,” Phonetica 41, 215–225.
- Nagao, K., and de Jong, K. (2007). “Perceptual rate normalization in naturally produced rate-varied speech,” J. Acoust. Soc. Am. 121, 2882–2898. doi:10.1121/1.2713680
- Newman, R. S., Clouse, S. A., and Burnham, J. L. (2001). “The perceptual consequences of within-talker variability in fricative production,” J. Acoust. Soc. Am. 109, 1181–1196. doi:10.1121/1.1348009
- Nygaard, L. C., and Pisoni, D. B. (1998). “Talker-specific learning in speech perception,” Percept. Psychophys. 60, 355–376.
- Nygaard, L. C., Sommers, M. S., and Pisoni, D. B. (1994). “Speech perception as a talker-contingent process,” Psychol. Sci. 5, 42–46. doi:10.1111/j.1467-9280.1994.tb00612.x
- Palmeri, T. J., Goldinger, S. D., and Pisoni, D. B. (1993). “Episodic encoding of voice attributes and recognition memory for spoken words,” J. Exp. Psychol. Learn. Mem. Cogn. 19, 309–328. doi:10.1037/0278-7393.19.2.309
- Peterson, G. E., and Barney, H. L. (1952). “Control methods used in a study of the vowels,” J. Acoust. Soc. Am. 24, 175–184. doi:10.1121/1.1906875
- Picheny, M. A., Durlach, N. I., and Braida, L. D. (1986). “Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech,” J. Speech Hear. Res. 29, 434–446.
- Port, R. F. (1981). “Linguistic timing factors in combination,” J. Acoust. Soc. Am. 69, 262–274. doi:10.1121/1.385347
- Port, R. F., and Rotunno, R. (1979). “Relation between voice-onset time and vowel duration,” J. Acoust. Soc. Am. 66, 654–662. doi:10.1121/1.383692
- Robb, M., Gilbert, H., and Lerman, J. (2005). “Influence of gender and environmental setting on voice onset time,” Folia Phoniatr. Logop. 57, 125–133. doi:10.1159/000084133
- Summerfield, Q. (1981). “Articulatory rate and perceptual constancy in phonetic perception,” J. Exp. Psychol. Hum. Percept. Perform. 7, 1074–1095. doi:10.1037/0096-1523.7.5.1074
- Volaitis, L. E., and Miller, J. L. (1992). “Phonetic prototypes: Influence of place of articulation and speaking rate on the internal structure of voicing categories,” J. Acoust. Soc. Am. 92, 723–735. doi:10.1121/1.403997
- Weismer, G. (1979). “Sensitivity of voice-onset time (VOT) measures to certain segmental features in speech production,” J. Phonetics 7, 197–204.
- Zue, V. W., and Laferriere, M. (1979). “Acoustic study of medial ∕t,d∕ in American English,” J. Acoust. Soc. Am. 66, 1039–1050. doi:10.1121/1.383323