J. Acoust. Soc. Am. 2011 Dec; 130(6): 4053–4062. doi: 10.1121/1.3651816

Effects of cross-language voice training on speech perception: Whose familiar voices are more intelligible?

Susannah V Levi 1,a), Stephen J Winters 2, David B Pisoni 3
PMCID: PMC3253604  PMID: 22225059

Abstract

Previous research has shown that familiarity with a talker’s voice can improve linguistic processing (herein, “Familiar Talker Advantage”), but this benefit is constrained by the context in which the talker’s voice is familiar. The current study examined how familiarity affects intelligibility by manipulating the type of talker information available to listeners. One group of listeners learned to identify bilingual talkers’ voices from English words, where they learned language-specific talker information. A second group of listeners learned the same talkers from German words, and thus only learned language-independent talker information. After voice training, both groups of listeners completed a word recognition task with English words produced by both familiar and unfamiliar talkers. Results revealed that English-trained listeners perceived more phonemes correct for familiar than unfamiliar talkers, while German-trained listeners did not show improved intelligibility for familiar talkers. The absence of a processing advantage in speech intelligibility for the German-trained listeners demonstrates limitations on the Familiar Talker Advantage, which crucially depends on the language context in which the talkers’ voices were learned; knowledge of how a talker produces linguistically relevant contrasts in a particular language is necessary to increase speech intelligibility for words produced by familiar talkers.

INTRODUCTION

Speech perception—the process of interpreting the linguistic content of an acoustic signal—is a remarkable feat given the vast amount of variability in the acoustic signal. One of the major sources of variability in the speech signal comes from talker-specific attributes, where differences across talkers can result in vastly different acoustic signals (Peterson and Barney, 1952). A growing body of literature has shown that listeners are highly sensitive to this aspect of the speech signal (sometimes referred to as the talker dimension) when processing its linguistic information. It is well-established that known or inferred information about a talker alters segmental perception (Allen and Miller, 2004; Eisner and McQueen, 2005; Johnson, 1990; Johnson et al., 1999; Kraljic et al., 2008; Kraljic and Samuel, 2005, 2006, 2007; Ladefoged, 1978; Ladefoged and Broadbent, 1957), alters tone perception (Leather, 1983), and can generalize to novel linguistic contexts (Dahan et al., 2008; Kraljic and Samuel, 2007; Theodore et al., 2009). In addition to affecting the identification of individual segments, the talker dimension interacts with linguistic processing at higher cognitive levels. List recall at short interstimulus intervals is improved when the talker is held constant across an experiment (Goldinger et al., 1991). Furthermore, memory for previously presented words is improved (Palmeri et al., 1993) and priming is increased (Schacter and Church, 1992) when words are produced by the same talker within an experiment compared to conditions where different talkers produce the words.

Careful manipulation of the talker dimension has even been shown to improve linguistic performance in some cases. In adult speech perception, improved word recognition has been found for familiar talkers (Magnuson et al., 1995; Nygaard and Pisoni, 1998; Nygaard et al., 1994). This Familiar Talker Advantage—increased intelligibility for familiar talkers (Nygaard and Pisoni, 1998; Nygaard et al., 1994)—is generated by first familiarizing naive listeners with a set of novel talkers. Once listeners are familiar with the talkers’ voices, they perform a linguistic processing task such as word recognition or sentence transcription. These studies have shown that listeners’ performance on the linguistic tasks is consistently better for familiar talkers than for unfamiliar, novel talkers. A similar advantage has been found in cross-modal speech perception. Rosenblum et al. (2007) first had participants transcribe sentences from visual-only speech stimuli and then asked them to transcribe novel sentences from auditory-only stimuli produced by either the same or a different talker. Although participants were not explicitly directed to attend to talker characteristics during the initial visual-only transcription task, they nonetheless perceived more words correctly when the same talker was used in both the visual-only and auditory-only tasks. Familiarity with talkers seems to be especially useful when listening to speech in the context of a competing background talker. Newman and Evers (2007) found that listeners who were familiar with a talker performed better on a shadowing task that included a background talker, but only when they were explicitly told the identity of the talker.

Understanding how the talker dimension interacts with linguistic processing is important for current theoretical accounts of speech perception (see Pisoni, 1997). In the case of the Familiar Talker Advantage, it remains unclear exactly what it is that listeners know about familiar talkers—that they do not know about unfamiliar talkers—that improves speech perception and results in an increase in intelligibility for familiar talkers. Further examination of the Familiar Talker Advantage reveals that mere exposure to a talker’s voice is not sufficient to produce improvements in speech recognition. Nygaard and Pisoni (1998) examined the role of stimulus type on the Familiar Talker Advantage. They found that learning to identify talkers from isolated words improved single-word recognition and that learning to identify talkers from sentence-length utterances improved sentence comprehension. However, they also found that learning to identify talkers’ voices from sentences did not result in higher accuracy for isolated single words. Nygaard and Pisoni argued that the talker information that is salient in sentence-length utterances is qualitatively different from the information available in single words. In particular, learning to identify talkers from sentence-length utterances provides additional acoustic information linked to the talker (e.g., coarticulation across words and phrases, prosodic/rhythmic patterns) that is not present in the production of isolated words and thus is not helpful when performing a speech perception task with novel words. Listeners may be able to rely on these higher-level prosodic cues to perform the talker identification task, without attending to the fine-grained acoustic-phonetic detail in the individual words. The results reported by Nygaard and Pisoni indicate that the information needed to improve intelligibility is highly context-dependent.

Similarly, Sommers et al. (1994) found that changing speaking rate or changing the talker within a block disrupted speech perception, but changing amplitude did not. This may have occurred because both speaking rate and talker variability affect the degree of coarticulation and prosodic patterns of speech, whereas amplitude does not. Additional support for the context-dependent nature of speech perception comes from an earlier study on the perception of synthetic speech (Greenspan et al., 1988). Greenspan et al. trained listeners to recognize synthetic speech from either words or sentences. Listeners trained on sentences showed increased intelligibility for both words and sentences. Importantly, the synthesized sentences were generated by simply concatenating individual words; thus, listening to sentences was the same as listening to a string of isolated words, lacking coarticulation and sentential prosody. Therefore, listeners had experience with the identical type of information needed for the perception of isolated words. It is also worth noting here that the tasks used during training and testing were the same, namely word recognition.

Given the findings that mere exposure to a talker’s voice does not automatically result in a Familiar Talker Advantage, we further investigated this phenomenon by manipulating the type of talker information available to listeners. In a previous study, we found that two types of talker information were present in the speech signal, namely language-independent and language-specific talker information (Winters et al., 2008). In the current study, this was manipulated by training listeners with either language-independent talker information (German stimuli) or language-specific (and also language-independent) talker information (English stimuli). Language-independent talker information may include information such as gender, age, voice quality, and characteristic F0 range and formant values—namely, acoustic properties related to physical characteristics of the talker. The language-specific talker information includes dialect affiliation and idiolectal articulations of the talker which may not be present in utterances of another language.

Our previous study on cross-language talker familiarity showed no difference between listeners during the initial learning phase; English-trained and German-trained listeners showed similar rates of talker identification improvement, and also exhibited effectively equivalent accuracy on the last day of training. Listeners were also able to generalize talker-voice knowledge from one language to another, although this depended on the training language. German-trained listeners showed no significant decline in identification accuracy when generalizing to English stimuli, whereas English-trained listeners performed significantly worse when generalizing to German stimuli than when generalizing to English stimuli.

These results provide evidence for both language-specific and language-independent talker information in the speech signal. They also suggest that when learning to identify a voice in English (a known language), listeners attend to both language-specific and language-independent talker information, whereas when learning to identify a voice in German (an unknown language) they learn only language-independent talker information. Previous studies of the Familiar Talker Advantage have only investigated talker familiarity within a single, known language. In these cases, listeners have access to language-specific talker information in the speech signal. Given that previous work has shown that the Familiar Talker Advantage is context-dependent (Nygaard and Pisoni, 1998; Sommers et al., 1994), we examined the context in which the voices were learned and thus controlled the type of talker information that listeners had access to in the initial learning phase. Specifically, one group of listeners was trained to identify bilingual talkers from English words and then tested on their ability to recognize English words produced by both these familiar talkers and other, unfamiliar talkers, as in previous studies. A second group of listeners was trained to identify the same talkers from German stimuli and then tested on their ability to recognize English words produced by both familiar and unfamiliar talkers. Listeners trained with English words were thus exposed to language-specific (here, English-specific) talker information, while listeners trained on German words only had access to language-independent talker information when transfer was tested in a speech intelligibility test with English stimuli.

Based on previous studies, we expected that English-trained listeners would exhibit a Familiar Talker Advantage. It was unclear whether the German-trained listeners would also exhibit the same advantage. If knowledge of more general (language-independent) talker information related to physical characteristics of the talker (referred to as “structural information” in Perrachione et al., 2009) supports improved speech perception, then listeners trained to identify the talkers’ voices from German stimuli should show the Familiar Talker Advantage. Instead, if it is necessary to know English-specific talker information—such as dialect and idiolectal articulations—then the German-trained listeners should show no Familiar Talker Advantage when recognizing words in English because they lack this specific knowledge.

EXPERIMENT

Methods

Stimulus materials

Twelve female German L1/English L2 speakers aged 21–33 living in Bloomington, IN were recorded in a sound-attenuated IAC booth at the Speech Research Laboratory at Indiana University. Speech samples were recorded using a SHURE SM98 head-mounted unidirectional (cardioid) condenser microphone with a flat frequency response from 40 to 20 000 Hz. Utterances were digitized into 16-bit stereo recordings via Tucker-Davis Technologies System II hardware at 22 050 Hz and saved directly to an IBM-PC. A single repetition of 360 English and 360 German words was produced by each speaker. Each word was of the form consonant-vowel-consonant (CVC) and was selected from the CELEX English and German databases (Baayen et al., 1995). Stimulus materials were presented visually to speakers in random order and blocked by language. [See Levi et al. (2007) for additional details about the recording methods.] German was selected as the non-English language for the following reasons: (1) the syllable structure and phonemic inventory of German allow for the use of enough CVC words, (2) frequency counts were available from the same database, (3) German is a less-studied L2 allowing for more potential listeners with no knowledge of German, and (4) the two languages are similar in terms of segmental and rhythmic properties, allowing for the most favorable learning environment in which to see a Familiar Talker Advantage.

Bilingual speakers were paid ten dollars an hour for their time. Two speakers were eliminated (speech disorder, N=1; large age difference, N=1), yielding ten bilingual speakers. Based on data collected in a pilot word-recognition study, talkers were divided into two groups of roughly equal intelligibility (hereinafter referred to as Group One Talkers and Group Two Talkers). Average intelligibility scores, as well as other demographic data, are provided in Table I.

Table I.

Demographic variables for the bilingual speakers. “Years of English” refers to the number of years speakers have been learning/using English (current age - age of acquisition). “Proficiency” is a self-reported measure of English proficiency (1=poor, 5=fluent). The final column provides a measure of each speaker’s intelligibility in terms of proportion of words correctly perceived under four signal-to-noise ratios by a set of untrained listeners.

Talker Group   Speaker     Age of Acquisition   Years of English   Length of Residence   Proficiency   Intelligibility
1              F3          10                   14                 1                     5             49.0
               F4          13                   13                 3                     4.5           43.8
               F7          9                    12                 1                     5             33.5
               F9          9                    16                 2                     4             48.4
               F10         13                   11                 5                     5             38.5
               Mean (SD)   10.8 (2.1)           13.2 (1.9)         2.4 (1.7)             4.7 (0.4)     42.7 (6.6)
2              F2          12                   9                  1                     4             37.1
               F5          10                   14                 5                     3             41.1
               F8          13                   16                 4                     5             54.8
               F11         n/a                  n/a                2                     4.5           41.3
               F12         7                    26                 5                     5             54.8
               Mean (SD)   10.5 (2.6)           16.2 (7.1)         3.4 (1.8)             4.3 (0.8)     45.8 (8.3)

Note: Age of acquisition and years of English were not reported for speaker F11; the Group 2 means for those columns are computed over the remaining four speakers.

Five female native speakers of American English were also recorded producing the list of English words under the same conditions as the bilingual speakers. All speakers were between the ages of 18 and 25 and reported no history of speech or hearing impairments. These speakers received partial course credit for their participation.

Participants

Forty-two listeners participated in the German-training condition and 41 in the English-training condition. In each language condition, half of the listeners were trained on Group One Talkers (“Group One Listeners”) and half on Group Two Talkers (“Group Two Listeners”). All listeners were monolingual native speakers of American English who were attending Indiana University and provided written informed consent. Following Winters et al. (2008), the performance criterion for inclusion was at least 40% correct identification in at least three (half) of the six testing phases during training. In the German-training condition, ten listeners were eliminated (did not reach performance criterion, N=5; did not complete the experiment, N=3; nonnative speaker of American English, N=1; lived in Germany, N=1), yielding 32 listeners. Nine listeners were eliminated in the English-training condition (did not complete the experiment, N=4; nonnative speaker of American English, N=2; German-speaking parent, N=1; last participants to complete the experiment, N=2), yielding 32 listeners for analysis. None of the remaining 64 listeners reported any knowledge of German, had ever lived in Germany, or had any German-speaking friends or family members. All were between the ages of 18 and 25, and all reported no history of speech or hearing impairments. Listeners were paid ten dollars an hour for their participation.

Listeners who reached the performance criterion were further divided into “good voice-learners” and “poor voice-learners” following Nygaard and Pisoni (1998), who found that listeners who did not reach 70% accuracy in voice identification did not show the Familiar Talker Advantage. In the German-training condition, 9 out of 16 Group One Listeners and 7 out of 16 Group Two Listeners were classified as good voice learners. In the English-training condition, 8 out of 16 Group One Listeners and 12 out of 16 Group Two Listeners were good voice learners. Across both language groups, 16 out of 32 listeners were classified as good voice learners in the German-training condition and 20 out of 32 in the English-training condition. See the Results section for further discussion and motivation for this division.

Procedure

During the four days of the study, participants were seated at individual testing stations in a quiet room. All stimuli were presented to participants over Beyer Dynamic DT-100 headphones on PowerMac G4 computers running a customized SuperCard (version 4.1.1) stack. Listeners in the German-training condition were informed that they would be hearing words in a language other than English.

Participants were trained to identify one of two sets of five different bilingual voices by name in six training sessions spanning three days, with two training sessions completed on each day. No more than two days intervened between any of the four days of testing. Each training session consisted of seven distinct phases, summarized in Table II: two training blocks followed by one test block. Each training block began with two brief familiarization phases in which listeners simply heard the same words produced by all five talkers. After familiarization, listeners completed an identification phase in which they heard novel words from the five talkers and identified the talker by clicking an on-screen button next to the appropriate talker’s name. Feedback was provided immediately by playing the word again while the correct talker’s name appeared on the screen. In these identification phases, five different words for each of the five talkers were presented twice, in random order. After the two training blocks, listeners completed a test block similar to the identification phases, but without any feedback. The test block consisted of ten novel words produced by each of the five talkers, presented in random order.

Table II.

Training procedure used for each session. Two training sessions were completed each day.

Block               Phase                             Stimuli                                                                      Task
Training Block I    Familiarization A                 Same five words produced by each talker (500 ms ISI)                        Listen and attend to talker-name pair
                    Familiarization B                 Same one word produced by each talker                                        Listen and attend to talker-name pair
                    Identification with feedback      Five novel words produced by each talker, presented twice in random order    Identify talker (feedback)
Training Block II   Familiarization A                 Same as above                                                                Same as above
                    Familiarization B                 Same as above                                                                Same as above
                    Identification with feedback      Same as above                                                                Same as above
Test                Identification without feedback   Ten novel words produced by each talker, presented once in random order      Identify talker (no feedback)

For both German and English training sessions, separate sets of words were selected at random for the familiarization, training, and test phases for each participant. In English training, only English words were presented to listeners, while in German training, only German words were presented. Each training session took approximately 20 minutes to complete. Participants completed two training sessions per day for three days and were given a short five-minute break between consecutive sessions on each day of training.

On the fourth day of the experiment, listeners carried out a word recognition task in which they heard novel English words, presented in the clear and at three signal-to-noise ratios (+10, +5, and 0 dB SNR), and were asked to type what they heard. These words were selected from the same bilingual talker database as the words presented in training; for the English-trained listeners, none of the words presented in the word recognition task had been presented during training. Each word was mixed with white noise that included a 200 ms linear ramp from silence up to the appropriate noise level at the beginning of the stimulus and a similar 200 ms ramp down of the noise at the end. One quarter of the stimuli were presented at each SNR.
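For readers who want the mixing step in concrete terms, the sketch below shows one way to implement it. This is a minimal illustration under stated assumptions, not the software actually used in the study: the function name is hypothetical, SNR is assumed to be defined as the ratio of signal RMS to noise RMS in dB, and the sampling rate matches the 22 050 Hz recordings.

    import numpy as np

    def mix_with_noise(signal, snr_db, fs=22050, ramp_ms=200.0):
        """Add white noise to a speech signal at a target SNR (dB), with
        linear onset/offset ramps on the noise, as described above."""
        rms_signal = np.sqrt(np.mean(signal ** 2))
        noise = np.random.randn(len(signal))
        rms_noise = np.sqrt(np.mean(noise ** 2))
        # Scale the noise so that 20*log10(rms_signal / rms_noise) == snr_db.
        noise *= rms_signal / (rms_noise * 10 ** (snr_db / 20.0))
        # 200 ms linear ramps: noise fades in at onset and out at offset
        # (assumes the stimulus is longer than two ramps, i.e., > 400 ms).
        n_ramp = int(fs * ramp_ms / 1000.0)
        envelope = np.ones(len(signal))
        envelope[:n_ramp] = np.linspace(0.0, 1.0, n_ramp)
        envelope[-n_ramp:] = np.linspace(1.0, 0.0, n_ramp)
        return signal + noise * envelope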

German-trained listeners heard all 360 English words during the word recognition task. The English-trained listeners heard only the 180 words during word recognition that had not been presented during training. These 180 words were randomly selected for each listener. For both English- and German-trained listeners, the words used during the word recognition task were evenly divided among three groups of talkers: Group One bilingual talkers, Group Two bilingual talkers, and native speakers of English who were used as controls. Thus, in the word recognition task, all listeners were presented with words produced by both familiar and unfamiliar talkers.

Scoring

In addition to whole word accuracy, we will report results for the proportion of phonemes correct. Given that whole word accuracy was relatively low (ranging from an average of approximately 20% in the 0 SNR condition to an average of approximately 50% in the + 10 SNR condition), participants’ responses were also scored at the more fine-grained level of number of phonemes correct. This method of scoring phonemes correct has been used in studies of nonword repetition by listeners whose whole nonword accuracy is low (Dillon et al., 2004). This metric allows for a more fine-grained measure of perceptual accuracy on the word recognition task and allows listeners to receive partial credit in responses such as “cap” → “cat,” where some acoustic information (e.g., a release burst) may be more susceptible to the type of noise used here than other information. Scoring at this level of granularity can reveal important differences across conditions that may be obscured by more coarse-grained metrics. Critically, the phonemes that are scored in this study are produced in a CVC context, as are the stimuli used during training.
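As an illustration of this scoring scheme, the sketch below assigns partial credit position by position; because both targets and responses are CVC syllables, no alignment step is needed. The phonemic transcriptions and function name are illustrative assumptions (actual responses were typed orthographically and would first need to be mapped to phonemes).

    def phonemes_correct(target, response):
        """Proportion of target phonemes matched, position by position,
        for CVC transcriptions of equal length."""
        matches = sum(t == r for t, r in zip(target, response))
        return matches / len(target)

    # "cap" heard as "cat": credit for /k/ and the vowel, 2 of 3 phonemes.
    print(phonemes_correct(["k", "ae", "p"], ["k", "ae", "t"]))  # 0.667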

Although studies have found that accuracy at one level of linguistic processing may not necessarily correlate with accuracy at other levels, the two perceptual tasks in those studies differed in linguistic context. In Sommers et al. (2005), the phoneme perception task examined consonant accuracy in an /iCi/ context, whereas the word recognition task examined whole-word recognition of one- to three-syllable words spoken at the end of a carrier phrase. Likewise, learning novel talkers from sentence-length utterances and then testing word recognition for isolated CVC words in Nygaard and Pisoni (1998) required listeners to perform the testing task in a different context. It should be noted that their finding that learning talkers from sentence-length utterances improves “sentence recognition” accuracy relies on coding sentence accuracy by number of keywords correct.

Here, we will report results for both phoneme accuracy and whole word accuracy.

Results

Data on native talkers

Native talkers who were unfamiliar to all listeners were included in the word recognition task to ensure that the four listening groups (trained on English Group One talkers, trained on English Group Two talkers, trained on German Group One talkers, trained on German Group Two talkers) did not differ significantly in their baseline word recognition skills. An ANOVA (analysis of variance) with Listener Group as a between-subjects factor and SNR as a within-subjects factor revealed no significant differences across the four listener groups (F(3,62) = 0.217, p = 0.885), establishing that the groups had similar word recognition ability when listening to words in noise.

Talker voice training

An ANOVA was conducted on the response data from the test phases of the six training sessions. This ANOVA assessed the effects that Training Session (1, 2, 3, 4, 5, 6) and Training Language (English, German) had on the percentage of talkers correctly identified in each testing phase. The ANOVA revealed a significant main effect of training session (F(5,62) = 84.34, p < 0.001), indicating that performance improved across the six training sessions. Consistent with our earlier study (Winters et al., 2008), there was no effect of training language (p = 0.26) and no interaction between training session and training language (p = 0.44), suggesting that listeners learned to identify talkers at similar rates regardless of the Training Language. These results are illustrated in Fig. 1. Thus, any differences between English- and German-trained listeners cannot be attributed to different levels of voice learning.

Figure 1. Talker identification accuracy during the six training sessions for both German-trained and English-trained listeners, separated by good and poor voice learners. Two training sessions were completed on each day of training.

Novel-word recognition

Regression analyses.

The responses obtained in the word recognition task were analyzed using mixed-effects logistic regression analyses carried out in R (Baayen, 2008; Gelman and Hill, 2006; Jaeger, 2008). Proportion phonemes (and proportion of whole words) correct served as the dependent variable, and Talker Familiarity (familiar/unfamiliar), SNR (3 levels), Training Language (English/German), and Learning Accuracy (continuous measure of percent correct accuracy on the talker training task for the last day of training) were treated as fixed effects, with random intercepts by subject. Only the three conditions with added noise were included in the statistical analyses because of possible ceiling effects in the clear condition. Statistical significance was assessed using likelihood ratio tests (Baayen, 2008, p. 276). Likelihood ratio tests evaluate the change in goodness-of-fit when terms are added to a linear model. They compare the log likelihood, a measure of goodness-of-fit, of a linear model with the term of interest, to the log likelihood of a model without that term. The difference in log likelihoods can be evaluated for statistical significance against the chi-squared distribution with degrees of freedom based on the difference in the number of parameters in each model. The logic behind these tests is similar to that of a traditional ANOVA analysis, which tests the change in mean squared error (a measure of goodness-of-fit) that can be attributed to a given factor.
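The following sketch illustrates the likelihood ratio test on simulated data. It is not the study’s R code: to keep the example self-contained it uses ordinary logistic regressions in Python (statsmodels) rather than mixed-effects models with random intercepts by subject, and the column names and effect sizes are hypothetical.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy.stats import chi2

    # Hypothetical trial-level data: 1/0 phoneme accuracy with predictors.
    rng = np.random.default_rng(1)
    n = 2000
    df = pd.DataFrame({
        "snr": rng.choice([0, 5, 10], n),
        "familiar": rng.choice([0, 1], n),
    })
    p = 1 / (1 + np.exp(-(-1.0 + 0.15 * df.snr + 0.3 * df.familiar)))
    df["correct"] = rng.binomial(1, p)

    # Nested models: with and without the term of interest (Talker Familiarity).
    reduced = smf.logit("correct ~ C(snr)", data=df).fit(disp=0)
    full = smf.logit("correct ~ C(snr) + familiar", data=df).fit(disp=0)

    # Likelihood ratio test: 2 * (difference in log likelihoods) is evaluated
    # against chi-squared with df equal to the difference in parameter counts.
    lr = 2 * (full.llf - reduced.llf)
    ddf = full.df_model - reduced.df_model
    print(f"chi2({int(ddf)}) = {lr:.2f}, p = {chi2.sf(lr, ddf):.4f}")

In R, the analogous comparison would be made by fitting nested glmer models (lme4) and comparing them with anova(), which reports the same chi-squared statistic.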

Considering the simple main effects of Learning Accuracy, SNR, Talker Familiarity, and Training Language, we observed significant effects on proportion of phonemes correct for Learning Accuracy (χ2(1) = 9.56, p = 0.001), SNR (χ2(2) = 422.81, p < 0.001), and Talker Familiarity (χ2(1) = 7.74, p = 0.005). A significant effect of Training Language was not observed (p = 0.68). The significant effect of Learning Accuracy indicates that the better listeners were at identifying the talkers’ voices on the final day of training, the more accurate they were on the word recognition task overall. The significant effect of SNR confirms that listeners performed better at the more favorable SNRs, as is visible in Fig. 2. The significant effect of Familiarity indicates that listeners perceived more phonemes correctly when listening to familiar talkers than unfamiliar talkers. Similar results were found for whole word accuracy, with significant main effects for Learning Accuracy (χ2(1) = 5.58, p = 0.018) and SNR (χ2(2) = 336.1, p < 0.001), but not for Talker Familiarity (p = 0.20) or Training Language (p = 0.69). The implications of the Learning Accuracy and Training Language results will be discussed further below.

Figure 2. Proportion phonemes correct by SNR for English-trained and German-trained learners, divided by learning ability (good and poor). Gray bars represent familiar bilingual talkers and white bars represent unfamiliar bilingual talkers.

To examine the significance of the interactions, an omnibus model was constructed with SNR, Training Language, Talker Familiarity, Learning Accuracy and all interactions as fixed effects, with random intercepts by subject. This omnibus model revealed no significant three- or four-way interactions (all F < 1.5). Thus, a revised omnibus model with only the main effects and two-way interactions as fixed effects served as our full model for testing the significance of the two-way interactions.

Examining the interaction terms in the omnibus model, we observed significant effects on proportion of phonemes correct for Talker Familiarity x Learning Accuracy (χ2(1) = 4.79, p = 0.028) and Talker Familiarity x SNR (χ2(2) = 8.26, p = 0.016). The Talker Familiarity x Learning Accuracy interaction indicates that individuals who initially learned to identify the talkers’ voices well showed increased accuracy for familiar talkers, while those who did not learn to identify the talkers’ voices well did not show improved word recognition accuracy. The interaction of Talker Familiarity x SNR indicates that the difference between familiar and unfamiliar talkers varied as a function of SNR. For example, mean phoneme accuracy across all subjects differed less in the +10 dB SNR condition (Familiar: 0.774; Unfamiliar: 0.782) than in the +5 dB SNR condition (Familiar: 0.695; Unfamiliar: 0.657). When analyzed for whole word accuracy, these interactions approached significance: Talker Familiarity x Learning Accuracy (χ2(1) = 2.78, p = 0.094) and Talker Familiarity x SNR (χ2(2) = 5.77, p = 0.055).

The interaction between Talker Familiarity x Learning Accuracy and the main effect of Learning Accuracy on phoneme accuracy suggest that listeners who were more accurate on talker identification during training performed differently than those who were less accurate. This particular finding is consistent with previous work which has found a familiarity effect only for listeners who attained 70% talker identification accuracy at the end of training (Nygaard and Pisoni, 1998). Thus, we simplified our analyses by focusing only on the “good voice learners” who attained 70% accuracy on the final day of training. This division at 70% conforms well to our current data, where median learning accuracy was 73%. To examine performance of the subset of good voice learners, mixed-effects logistic regression analyses were carried out on the proportion of phonemes correct with Talker Familiarity, SNR, Training Language, and the two-way interactions as fixed effects, with random intercepts by subject. As expected, we observed a significant effect of Talker Familiarity (χ2(6) = 14.5, p = 0.023), indicating that familiar talkers were more intelligible than unfamiliar talkers for these good voice learners. A significant effect of Training Language was not observed (p = 0.89). For the whole word analysis, the main effect of Talker Familiarity approached significance (χ2(6) = 11.61, p = 0.071).

The primary focus of the current investigation was to examine whether learning to identify talkers from non-English stimuli would result in a Familiar Talker Advantage. Because of the null effect of Training Language in the previous analyses, we further probed the contribution of Training Language by examining phoneme accuracy for good German-trained and good English-trained listeners separately. For these planned comparisons, SNR and Talker Familiarity were treated as fixed effects with random intercepts by subject. For the good English-trained listeners, we observed a significant effect of Talker Familiarity on proportion phonemes correct (χ2(3) = 8.77, p = 0.032), as has been found in previous work. In contrast, we failed to observe a significant effect of Talker Familiarity for the good German-trained listeners (p = 0.132). Similar results were found when the data were analyzed by whole-word accuracy; good English-trained listeners showed a significant effect of Talker Familiarity (χ2(3) = 8.212, p = 0.041), but good German-trained listeners did not (p = 0.518).

The results of the planned comparisons for the two language training conditions thus reveal an effect of training language. As in previous studies, listeners trained on voices from English words showed improved phoneme recognition for familiar talkers in a word recognition task. When examined in their own right, the German-trained listeners did not show a significant effect of Talker Familiarity, indicating that familiar talkers were not reliably more intelligible than unfamiliar talkers for these German-trained listeners. The lack of an effect of Talker Familiarity for these German-trained listeners suggests that the Familiar Talker Advantage is a language-dependent phenomenon. That is, the language context present during the learning phase must be the same as the language context during word recognition for a Familiar Talker Advantage to emerge.

Correlational analyses.

Bivariate correlations were computed between each listener’s Learning Accuracy (the percentage of talkers correctly identified on the last day of training, a measure which indicates degree of familiarity with the talkers) and the Intelligibility Gain (the difference between word recognition scores for familiar versus unfamiliar talkers). All listeners were included in these analyses. For the English-trained listeners, a significant correlation between Learning Accuracy and Intelligibility Gain was found for the average intelligibility gain across the three SNRs [r = 0.394, p = 0.010; Fig. 3a]. In contrast, for the German-trained listeners, only a marginally significant correlation was found [r = 0.287, p = 0.055; Fig. 3b].
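For clarity, this analysis reduces to a Pearson correlation over per-listener pairs of scores. A minimal sketch, with hypothetical values standing in for the actual per-listener measures:

    from scipy.stats import pearsonr

    # Per-listener measures (hypothetical values): talker identification
    # accuracy on the last training day, and familiar-minus-unfamiliar
    # word recognition accuracy averaged over the three SNRs.
    learning_accuracy = [0.55, 0.62, 0.71, 0.80, 0.88, 0.93]
    intelligibility_gain = [-0.01, 0.00, 0.02, 0.03, 0.05, 0.06]

    r, p = pearsonr(learning_accuracy, intelligibility_gain)
    print(f"r = {r:.3f}, p = {p:.3f}")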

Figure 3. Scatter plots of Learning Accuracy by Intelligibility Gain for English-trained listeners (a) and German-trained listeners (b).

The stronger correlation found between degree of talker familiarity and intelligibility gain for the English-trained listeners indicates that the better listeners were at identifying the talkers’ voices, the greater advantage they showed in recognizing words produced by those talkers in the novel-word recognition task. That German-trained listeners did not exhibit this relationship provides additional support for the finding that German-based familiarity with talkers’ voices did not lead to improved English word recognition performance for the same talkers. Taken together, these results suggest that the Familiar Talker Advantage is dependent upon the language in which the voices are familiar.

GENERAL DISCUSSION

The present study investigated whether acquired knowledge of a talker’s voice in one language can lead to improved speech perception (measured on a word recognition task) when the talker is familiar in a different language. The results of this study provide new data on what type of information listeners must learn about familiar talkers to show improved performance on speech perception tasks. The omnibus analyses over all listeners and over the subset of good voice learners did not show a significant contribution of Training Language, suggesting that the pattern of performance on the word recognition task was similar for listeners trained on the two different languages. However, the direct analysis of the contribution of Training Language revealed that familiar talkers were more intelligible only to the good English-trained listeners, and not to the good German-trained listeners. Taken together, these results suggest that the training language is an important factor in generating the Familiar Talker Advantage.

The implication of these results is that learning to identify talkers from isolated English words provides more cues to the listener that are relevant for performing the word recognition task than does learning to identify voices from German words. This finding is consistent with the “context-dependence” effects found in previous studies of novel voice learning in a laboratory setting, where the talker-specific information that facilitates word recognition is in the same linguistic context across both the talker identification training and speech perception tasks (Nygaard and Pisoni, 1998; Sommers et al., 1994). The larger question that arises from our current study and from previous work is how talker-voice knowledge transfers to improved processing on a different task (in this case, word recognition).

Improved performance on word recognition is less surprising when considering what the initial learning phase actually entailed. When native-English listeners perform a talker/voice learning task with English stimuli where their attention is directed towards information about who is talking, they not only perceive and attend to the talker information in the speech signal, but also process the linguistic information in the words they hear. Previous studies support this dual-processing strategy and show that listeners are unable to fully suppress or inhibit information about the talker during speech perception (e.g., Mullennix and Pisoni, 1990), similar to the Stroop effect (Stroop, 1935) where readers have difficulty suppressing the semantic content of the word itself. Thus, when learning to identify the voice of a novel talker in English, native-English listeners hear the voices, but also automatically recognize and encode the words that they hear. Research on memory encoding and retrieval processes reveals that performance in a test phase depends on the similarity between how information was initially encoded and how it is retrieved at test (Morris et al., 1977; Tulving and Thomson, 1973). Thus, when English-trained listeners are asked to perform a word-recognition task using stimuli from English-familiar talkers, they have the potential to perform well because they also encoded, processed, and retrieved lexical information from these talkers during the learning phase.

As discussed in the Introduction, mere familiarity with or exposure to a talker’s voice is not sufficient to generate a Familiar Talker Advantage. Instead, it is critical that the talker information with which a listener is familiar is the same type of talker information that is needed for the transfer (retrieval) task. The lack of a Familiar Talker Advantage for listeners trained with sentences and tested with words in Nygaard and Pisoni (1998) suggests that the operations involved in processing linguistic information about words in sentences—with additional top-down, coarticulatory, and prosodic/rhythmic information—are not necessarily useful when processing isolated single words. Thus, listeners encoded the voices from qualitatively different information than was needed to perform the transfer task.

Similarly, the type of talker information that listeners heard during German-training in the current study was different from that heard during English word recognition. The German-trained listeners were exposed to fine-grained, talker-specific phonetic information from isolated words, but these words were in an unknown language and were therefore not automatically processed at a semantic level. Thus, when presented with words produced by the same talkers in English, the listeners could only make use of articulatory characteristics that were language-independent. Additional evidence for why familiarization in a different language does not improve speech intelligibility comes from speech production studies with bilingual speakers which have shown that languages that contain the “same” phonological contrast do not necessarily make use of the same fine-grained acoustic-phonetic cues or the same category boundaries to differentiate segments (Caramazza et al., 1973; MacLeod and Stoel-Gammon, 2005). Thus, knowledge of how a talker produces a phonological contrast in one language will not necessarily help a listener to perceive the “same” contrast in another language.

Our previous work has shown that listeners who learn to identify novel talkers’ voices from German stimuli (an unfamiliar language) acquire language-independent talker information that can generalize to accurate identification of the same talkers in a different language. The current study shows that this knowledge of language-independent talker information does not transfer to improved performance on a word recognition task. That is, it is not knowledge of a talker’s voice in general that facilitates word recognition for familiar voices; rather, it is knowledge of a talker’s voice characteristics within a specific language that results in improved intelligibility for familiar talkers. In this study, this advantage was exhibited by the listeners who had learned to identify talkers’ voices from English stimuli (the listeners’ native language) and had thus acquired knowledge of both language-independent and language-specific (here, English-specific) talker information (Winters et al., 2008). The current study therefore reveals an important limitation of the Familiar Talker Advantage: The type of information that is learned—and how it is encoded—determines whether speech perception can be improved in transfer tasks such as recognizing isolated words in noise.

The results of the present study also confirm an additional limitation on the Familiar Talker Advantage. An increase in intelligibility for familiar talkers’ voices depends on how well listeners learned to identify those voices within the same language. Exposure to a particular set of stimuli does not automatically guarantee that listeners will process words spoken by familiar talkers more accurately; what seems to be crucial is the strength of the learned association between the linguistic and talker information in the familiar talker’s speech. The finding that the intelligibility gain correlates with how well the talkers were initially learned supports this proposal; the poor voice learners in the English-training condition, who were exposed to the same number of tokens as the good voice learners, showed little or no intelligibility increase, suggesting that they did not sufficiently learn the talker-voice information or the links between it and the linguistic information.

These limitations on the Familiar Talker Advantage thus provide important boundary constraints on any theory of speech perception that attempts to account for the influence of talker information on speech processing. In general, the integration of talker and linguistic information is highly context-dependent, and in this case it is language-dependent. From the perspective of exemplar models of speech perception (Goldinger, 1998; Hintzman, 1986, 1988; Johnson, 1990, 1997), the German-trained listeners were not exposed to the specific links between talker information and English lexical items and thus were unable to use their (other-language) knowledge of the talker’s voice to improve word recognition. The data presented here show that an exemplar-based account of the Familiar Talker Advantage must rely on the acoustic-phonetic links between talker and linguistic information cultivated during training, and that the context in which these links are formed (sentence versus word stimuli, or the language of training) must be the same at the time of retrieval. In normalization (or “analytical”) theories of speech perception (Magnuson and Nusbaum, 2007; Magnuson et al., 1994; Nusbaum and Magnuson, 1997; Nusbaum and Morin, 1992), the German-trained listeners were unable to fine-tune their acoustic-to-linguistic mappings because this mapping never took place during initial learning.

The second limitation on the Familiar Talker Advantage—that the Familiar Talker Advantage depends on the degree of familiarity—confirms that it is crucial for listeners to actually learn the connections between linguistic and talker information in the signal and that these links are encoded and stored by the listener. Listeners who have not sufficiently learned the relationships between who is talking and what is being said process the linguistic information as though they are presented with an unfamiliar talker.

CONCLUSIONS

The results of this study confirm previous findings that the Familiar Talker Advantage is not a result of voice familiarity in general, but instead is closely tied to the context in which the talker has become familiar. German-trained listeners did not process English-specific information during initial training and thus did not become familiar with the talkers’ voices in a linguistic context that could transfer directly to the word recognition task. English-trained listeners, however, could not ignore the linguistic information present in the talker identification training stimuli. By obligatorily processing this information during initial training, they gained perceptual experience which was directly applicable to the interpretation of the linguistic information in the word recognition task. Furthermore, the more familiar the talkers’ voices became to the listeners, the more likely the listeners were to use their training experiences and processing operations to aid linguistic processing during the transfer task.

The finding that familiarity with a specific talker’s voice facilitated the comprehension of words produced by that talker inspired the notion that linguistic and talker information are inextricably bound, not only in the acoustic record, but also during perception, and are stored together in an integrated fashion in memory (Palmeri et al., 1993). The results of this study show that this integration may not be that simple. Previous research has shown that processing of the linguistic content of speech is performed in a “talker-contingent” manner, but the current findings show that this processing is also “language-contingent.” In other words, listeners must have knowledge of talker-specific articulatory information within the same language in order to integrate talker information during a linguistic processing task. Integrated representations of talker and linguistic information in the speech signal may therefore only emerge when listeners attend to words produced in a language that they understand.

The lack of a consistent transfer effect of familiarity for the German-trained listeners in this study confirms that increased intelligibility stems from the processing operations carried out on the speech signal and is not due just to familiarity or exposure to the talker’s voice. The observed transfer effects depend on knowing how a specific talker produces linguistically relevant contrasts in a particular language.

ACKNOWLEDGMENTS

This work was supported by grants from the National Institutes of Health to Indiana University (Grant Nos. NIH-NIDCD T32 Training Grant DC-00012 and NIH-NIDCD Research Grant R01 DC-00111). We would like to thank Melissa Troyer for her help with data collection, and Jonathan Brennan for statistical consultations.

References

  1. Allen, J. S., and Miller, J. L. (2004). “Listener sensitivity to individual talker differences in voice-onset-time,” J. Acoust. Soc. Am. 115, 3171–3183. doi:10.1121/1.1701898
  2. Baayen, R. H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics (Cambridge University Press, Cambridge), 390 pp.
  3. Baayen, R. H., Piepenbrock, R., and Gulikers, L. (1995). The CELEX Lexical Database (University of Pennsylvania, Philadelphia).
  4. Caramazza, A., Yeni-Komshian, G. H., Zurif, E. B., and Carbone, E. (1973). “The acquisition of a new phonological contrast: The case of stop consonants in French-English bilinguals,” J. Acoust. Soc. Am. 54, 421–428. doi:10.1121/1.1913594
  5. Dahan, D., Drucker, S. J., and Scarborough, R. A. (2008). “Talker adaptation in speech perception: Adjusting the signal or the representations?” Cognition 108, 710–718. doi:10.1016/j.cognition.2008.06.003
  6. Dillon, C. M., Cleary, M., Pisoni, D. B., and Carter, A. K. (2004). “Imitation of nonwords by hearing-impaired children with cochlear implants: Segmental analyses,” Clin. Linguist. Phon. 18(1), 39–55. doi:10.1080/0269920031000151669
  7. Eisner, F., and McQueen, J. M. (2005). “The specificity of perceptual learning in speech processing,” Percept. Psychophys. 67, 224–238. doi:10.3758/BF03206487
  8. Gelman, A., and Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (Cambridge University Press, Cambridge), 648 pp.
  9. Goldinger, S. D. (1998). “Echoes of echoes? An episodic theory of lexical access,” Psychol. Rev. 105(2), 251–279.
  10. Goldinger, S. D., Pisoni, D. B., and Logan, J. S. (1991). “On the nature of talker variability effects in recall of spoken word lists,” J. Exp. Psychol. Learn. Mem. Cogn. 17(1), 152–162. doi:10.1037/0278-7393.17.1.152
  11. Greenspan, S. L., Nusbaum, H. C., and Pisoni, D. B. (1988). “Perceptual learning of synthetic speech produced by rule,” J. Exp. Psychol. Learn. Mem. Cogn. 14(3), 421–433.
  12. Hintzman, D. (1986). “Schema abstraction in a multiple-trace memory model,” Psychol. Rev. 93, 411–428.
  13. Hintzman, D. (1988). “Judgments of frequency and recognition memory in a multiple-trace memory model,” Psychol. Rev. 95, 528–551.
  14. Jaeger, T. F. (2008). “Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models,” J. Mem. Lang. 59, 434–446. doi:10.1016/j.jml.2007.11.007
  15. Johnson, K. (1990). “The role of perceived speaker identity in F0 normalization of vowels,” J. Acoust. Soc. Am. 88, 642–654.
  16. Johnson, K. (1997). “Speech perception without speaker normalization: An exemplar model,” in Talker Variability in Speech Processing, edited by K. Johnson and J. W. Mullennix (Academic Press, San Diego), pp. 145–164.
  17. Johnson, K., Strand, E. A., and D’Imperio, M. (1999). “Auditory-visual integration of talker gender in vowel perception,” J. Phonetics 27, 359–384. doi:10.1006/jpho.1999.0100
  18. Kraljic, T., Brennan, S. E., and Samuel, A. G. (2008). “Accommodating variation: Dialects, idiolects, and speech processing,” Cognition 107, 54–81. doi:10.1016/j.cognition.2007.07.013
  19. Kraljic, T., and Samuel, A. G. (2005). “Perceptual learning for speech: Is there a return to normal?” Cogn. Psychol. 51, 141–178. doi:10.1016/j.cogpsych.2005.05.001
  20. Kraljic, T., and Samuel, A. G. (2006). “Generalization in perceptual learning for speech,” Psychon. Bull. Rev. 13(2), 262–268. doi:10.3758/BF03193841
  21. Kraljic, T., and Samuel, A. G. (2007). “Perceptual adjustments to multiple talkers,” J. Mem. Lang. 56, 1–15. doi:10.1016/j.jml.2006.07.010
  22. Ladefoged, P. (1978). “Expectation affects identification by listening,” Lang. Speech 21(4), 373–374.
  23. Ladefoged, P., and Broadbent, D. E. (1957). “Information conveyed by vowels,” J. Acoust. Soc. Am. 29, 98–104. doi:10.1121/1.1908694
  24. Leather, J. (1983). “Speaker normalization in the perception of lexical tone,” J. Phonetics 11(4), 373–382.
  25. Levi, S. V., Winters, S. J., and Pisoni, D. B. (2007). “Speaker-independent factors affecting the perception of foreign accent in a second language,” J. Acoust. Soc. Am. 121, 2327–2338. doi:10.1121/1.2537345
  26. MacLeod, A., and Stoel-Gammon, C. (2005). “Are bilinguals different? What VOT tells us about simultaneous bilinguals,” J. Multiling. Commun. Disord. 3, 118–127. doi:10.1080/14769670500066313
  27. Magnuson, J. S., and Nusbaum, H. C. (2007). “Acoustic differences, listener expectations, and the perceptual accommodation of talker variability,” J. Exp. Psychol. Hum. Percept. Perform. 33(2), 391–409. doi:10.1037/0096-1523.33.2.391
  28. Magnuson, J. S., Yamada, R. A., and Nusbaum, H. C. (1994). “Variability in familiar and novel talkers: Effects on mora perception and talker identification,” in Proceedings of the Acoustical Society of Japan Technical Committee on Psychological and Physiological Acoustics, Kanazawa, Japan, Vol. H-94-44, pp. 1–8.
  29. Magnuson, J. S., Yamada, R. A., and Nusbaum, H. C. (1995). “The effects of familiarity with a voice on speech perception,” in Proceedings of the 1995 Spring Meeting of the Acoustical Society of Japan, pp. 391–392.
  30. Morris, C. D., Bransford, J. D., and Franks, J. J. (1977). “Levels of processing versus transfer appropriate processing,” J. Verbal Learn. Verbal Behav. 16, 519–533. doi:10.1016/S0022-5371(77)80016-9
  31. Mullennix, J. W., and Pisoni, D. B. (1990). “Stimulus variability and processing dependencies in speech perception,” Percept. Psychophys. 47(4), 379–390. doi:10.3758/BF03210878
  32. Newman, R. S., and Evers, S. (2007). “The effect of talker familiarity on stream segregation,” J. Phonetics 35, 85–103. doi:10.1016/j.wocn.2005.10.004
  33. Nusbaum, H. C., and Magnuson, J. S. (1997). “Talker normalization: Phonetic constancy as a cognitive process,” in Talker Variability in Speech Processing, edited by K. Johnson and J. W. Mullennix (Academic Press, San Diego), pp. 109–132.
  34. Nusbaum, H. C., and Morin, T. M. (1992). “Paying attention to differences among talkers,” in Speech Perception, Production, and Linguistic Structure, edited by Y. Tohkura, Y. Sagisaka, and E. Vatikiotis-Bateson (Ohmsha Publishing, Tokyo), pp. 113–134.
  35. Nygaard, L. C., and Pisoni, D. B. (1998). “Talker-specific learning in speech perception,” Percept. Psychophys. 60(3), 355–376. doi:10.3758/BF03206860
  36. Nygaard, L. C., Sommers, M. S., and Pisoni, D. B. (1994). “Speech perception as a talker-contingent process,” Psychol. Sci. 5(1), 42–46. doi:10.1111/j.1467-9280.1994.tb00612.x
  37. Palmeri, T. J., Goldinger, S. D., and Pisoni, D. B. (1993). “Episodic encoding of voice attributes and recognition memory for spoken words,” J. Exp. Psychol. Learn. Mem. Cogn. 19(2), 309–328. doi:10.1037/0278-7393.19.2.309
  38. Perrachione, T. K., Pierrehumbert, J. B., and Wong, P. C. M. (2009). “Differential neural contributions to native- and foreign-language talker identification,” J. Exp. Psychol. Hum. Percept. Perform. 35(6), 1950–1960. doi:10.1037/a0015869
  39. Peterson, G. E., and Barney, H. L. (1952). “Control methods used in the study of the vowels,” J. Acoust. Soc. Am. 24(2), 175–184. doi:10.1121/1.1906875
  40. Pisoni, D. B. (1997). “Some thoughts on ‘normalization’ in speech perception,” in Talker Variability in Speech Processing, edited by K. Johnson and J. W. Mullennix (Academic Press, San Diego), pp. 9–32.
  41. Rosenblum, L. D., Miller, R. M., and Sanchez, K. (2007). “Lip-read me now, hear me better later,” Psychol. Sci. 18, 392–396. doi:10.1111/j.1467-9280.2007.01911.x
  42. Schacter, D. L., and Church, B. A. (1992). “Auditory priming: Implicit and explicit memory for words and voices,” J. Exp. Psychol. Learn. Mem. Cogn. 18(5), 915–930. doi:10.1037/0278-7393.18.5.915
  43. Sommers, M. S., Nygaard, L. C., and Pisoni, D. B. (1994). “Stimulus variability and spoken word recognition. I. Effects of variability in speaking rate and overall amplitude,” J. Acoust. Soc. Am. 96(3), 1314–1324. doi:10.1121/1.411453
  44. Sommers, M. S., Tye-Murray, N., and Spehar, B. (2005). “Auditory-visual speech perception and auditory-visual enhancement in normal-hearing younger and older adults,” Ear Hear. 26(3), 263–275. doi:10.1097/00003446-200506000-00003
  45. Stroop, J. R. (1935). “Studies of interference in serial verbal reactions,” J. Exp. Psychol. 18, 643–662. doi:10.1037/h0054651
  46. Theodore, R. M., Miller, J. L., and DeSteno, D. (2009). “Individual talker differences in voice-onset time: Contextual influences,” J. Acoust. Soc. Am. 125(6), 3974–3982. doi:10.1121/1.3106131
  47. Tulving, E., and Thomson, D. M. (1973). “Encoding specificity and retrieval processes in episodic memory,” Psychol. Rev. 80(5), 352–373. doi:10.1037/h0020071
  48. Winters, S. J., Levi, S. V., and Pisoni, D. B. (2008). “Identification and discrimination of bilingual talkers across languages,” J. Acoust. Soc. Am. 123(6), 4524–4538. doi:10.1121/1.2913046
