Author manuscript; available in PMC: 2023 Jun 8.
Published in final edited form as: J Exp Psychol Hum Percept Perform. 2021 Dec;47(12):1673–1680. doi: 10.1037/xhp0000963

Attention, task demands, and multi-talker processing costs in speech perception

David Saltzman 1, Sahil Luthra 1, Emily B Myers 1, James S Magnuson 1
PMCID: PMC10249717  NIHMSID: NIHMS1900541  PMID: 34881952

Abstract

Determining how human listeners achieve phonetic constancy despite a variable mapping between the acoustics of speech and phonemic categories is the longest-standing challenge in speech perception. A clue comes from studies where the talker changes randomly between stimuli, which slows processing compared to a single-talker baseline. These multi-talker processing costs have been observed most often in speeded monitoring paradigms, where participants respond whenever a specific item occurs. Notably, the conventional paradigm imposes attentional demands via two forms of varied mapping in mixed-talker conditions. First, target recycling (allowing items to serve as targets on some trials but as distractors on others) potentially prevents the development of task automaticity. Second, the use of multiple target tokens means that in mixed trials participants must respond to two unique stimuli (one target produced by each talker), whereas in blocked conditions they need only respond to one token. We seek to understand how attentional demands influence talker normalization, as measured by multi-talker processing costs. Across four experiments, multi-talker processing costs persisted when target recycling was not allowed but diminished when only one stimulus served as the target on mixed trials. We discuss the logic of using varied mapping to elicit attentional effects and implications for theories of speech perception.

Keywords: Talker normalization, word monitoring, automaticity, phonetic constancy


The mapping from the acoustic details of the speech signal to phonemes can vary tremendously depending on factors such as phonetic context, speaking rate, or ambient acoustic context; how listeners routinely perceive a talker’s intended utterance despite this lack of invariance between the acoustic signal and perceptual categories is one of the oldest problems in speech perception (Liberman, Harris, Hoffman, & Griffith, 1957), and it remains unsolved today. Critically, the lack of invariance problem is exacerbated by the fact that individual talkers may produce their speech sounds in substantially different ways (with acoustic consequences), both for vowels (Peterson & Barney, 1952) and consonants (Dorman, Studdert-Kennedy, & Raphael, 1977). Nonetheless, listeners typically perceive the content of the speech signal with ease, achieving phonetic constancy in spite of talker variability.

Researchers have proposed that in order to accommodate talker variability, listeners must adjust the mapping between acoustic details and phonetic categories on the basis of talker information (e.g., Joos, 1948; Ladefoged & Broadbent, 1957; Nearey, 1989; Nusbaum & Magnuson, 1997). In a classic monograph, Joos (1948) suggested a talker accommodation process by which listeners might make the necessary mapping adjustments. Joos proposed that listeners might use an initial sample of a talker’s speech (e.g., a conventional greeting, such as how do you do) to map the talker’s speech onto phonological (perceptual) categories, and then ‘shift or distort’ either the incoming speech or their internal representations to bring the two into registration. This perspective is consistent with a large body of literature suggesting that listeners’ interpretation of speech is modulated by acoustic information encountered in preceding auditory contexts (Bosker, 2018; Ladefoged & Broadbent, 1957; Laing, Liu, Lotto, & Holt, 2012; Sjerps, Fox, Johnson, & Chang, 2018; Stilp, 2019; Zhang, Peng, & Wang, 2013).¹

A number of speech perception studies show that listeners are slower and/or less accurate in identifying words when the talker varies from word to word compared to when all the words are spoken by a single talker (Carter, Lim, & Perrachione, 2019; Choi, Hu, & Perrachione, 2018; Choi & Perrachione, 2019a, 2019b; Heald & Nusbaum, 2014; Kapadia & Perrachione, 2020; Magnuson & Nusbaum, 2007; Mullennix, Pisoni, & Martin, 1989; Nusbaum & Morin, 1992; Verbrugge, Strange, Shankweiler, & Edman, 1976; Wong, Nusbaum, & Small, 2004). Some have interpreted these multi-talker processing costs as being a consequence of talker normalization or talker accommodation.² On such a view, each time a new talker is encountered, listeners must re-engage the normalization/accommodation mechanism, and a processing cost is incurred as a result.

Much of our understanding of the processing costs associated with talker variability comes from studies that have used a speeded monitoring task (e.g., Antoniou, Wong, & Wang, 2015; Heald & Nusbaum, 2014; Magnuson & Nusbaum, 2007; Magnuson et al., 2021; Nusbaum & Morin, 1992; Wong et al., 2004). In this paradigm, listeners hear a series of stimuli (e.g., jolt, depth, ball, romp…) and must press a button whenever they hear a target item, indicated visually (e.g., BALL). In blocked-talker trials, one talker produces both the target and distractor items, whereas in mixed-talker trials, two different talkers produce both the target and distractor items and the talker alternates pseudorandomly from item to item (the total number of items in a mixed-talker trial is identical to a blocked-talker trial, as each talker produces half of the items). As expected by normalization/accommodation accounts, listeners are slower to identify the target word in mixed-talker trials compared to blocked-talker trials.

Nusbaum and his colleagues (Francis & Nusbaum, 1996; Heald & Nusbaum, 2014; Magnuson & Nusbaum, 2007; Nusbaum & Magnuson, 1997; Nusbaum & Morin, 1992; Nusbaum & Schwab, 1986) have proposed that achieving phonetic constancy despite the apparent lack of invariance between acoustics and percepts requires active, attention- and resource-demanding processes. Thus, when Nusbaum and Morin (1992, p. 122) described the features of the speeded monitoring task that they applied to the challenge of talker normalization, they pointed out that, by design, the blocked- and mixed-talker conditions differ in that blocked-talker conditions are amenable to automaticity (Schneider & Shiffrin, 1977; Shiffrin & Schneider, 1977) but mixed-talker conditions are not. They pointed to the fact that in a blocked-talker trial, participants must make a response to a single target item, while in mixed trials, the target items are produced by two talkers, and therefore participants must respond to two distinct stimuli (one produced by each talker): they noted, “…from a cognitive perspective, recognition in the mixed-talker condition should require more effort and attention than recognition in the blocked-talker condition.”

On this logic, the mixed-talker condition is designed to reveal increased attentional demands induced by talker normalization/accommodation. If speech perception is normally a highly automatized, efficient process, detecting subtle differences in attentional demands induced by a talker change may require stressing the system. Crucially, Nusbaum and Morin (1992) proposed that the computations required to adjust acoustic-perceptual mappings after a talker change would require attention. If this were the case, a simple attentional manipulation like digit load should produce an interaction with talker condition, exacerbating the multi-talker processing cost. This is precisely what they observed in their third experiment. With a 1-digit preload, they observed larger-than-normal mixed-talker processing costs (~30 ms, vs. ~20 ms in previous studies). With a 3-digit preload, there was virtually no change in response times in blocked-talker conditions (if anything, there was a slight numerical decline), but the multi-talker cost increased to nearly 60 ms. This significant interaction is consistent with the logic that the added attentional demands of the mixed-talker condition would stress the (normally automatic, efficient) processes of speech perception detectably.

Previous evidence supports the conclusion that talker normalization is influenced by attentional demands, and the mixed-talker trials in the speeded monitoring paradigm were intentionally designed to allow this influence to be observed. However, the speeded monitoring task as conventionally implemented includes another deviation from the preconditions for automaticity: targets are recycled. That is, a word that appears as a target on one trial may appear on subsequent trials as a distractor. (In Figure 1 we schematize both deviations from consistent mapping.) In the classic visual search studies of Schneider and Shiffrin (1977), target recycling prevented the development of automaticity, as it violates the principle of consistent mapping. For example, Schneider and Shiffrin (1977) presented participants with displays with one or more target symbols. Participants then had to indicate whether any targets were present in a subsequent display with few or many distractors. Initially, reaction time increased with the number of distractors. However, if targets were never recycled as distractors, reaction time flattened out (with little increase with number of distractors), as though participants could search the display in a parallel fashion. This change did not occur if targets were recycled, identifying one of the key preconditions for the development of automaticity.

Figure 1.


Example of two abbreviated mixed-talker trials in the standard speeded monitoring paradigm (9 items are shown here instead of the full 16 to conserve space). The rectangles represent the visual display the participant will see (which always has the target listed on screen) with the auditory stimulus they hear to the right. Different talkers are indicated with colored text. In the standard design, items that serve as a target (underlined in this schematic) on one trial can serve as a distractor on subsequent trials; in this example, BALL is the target for the first trial but a distractor for the second. Furthermore, each talker produces the target item on every mixed-talker trial (both the “green” talker and the “purple” talker produce the target items), meaning that participants must respond to two unique productions; by contrast, they need only respond to one unique production on blocked trials.

Unlike the “multiple talkers producing targets” deviation from consistent mapping we have already discussed, this design detail is not constrained to mixed-talker trials; target recycling also occurs for blocked-talker trials. However, it could be that target recycling interacts with talker mixing, as Nusbaum and Morin (1992) found for digit load. That is, it may generate difficulty for blocked- or mixed-talker trials but interact such that its impact is amplified by the attentional and/or resource demands imposed by talker mixing. This led us to ask whether either or both forms of attentional demand (target recycling or multiple talker tokens) disrupt the normalization process such that multi-talker processing costs can be observed. We confirmed that Nusbaum (personal communication, August 21, 2020) predicted that removing either attentional demand (varied mapping or multiple target tokens) could damp or wipe out mixed-talker effects, but that whether either or both are crucial for observing talker variability effects had not been explicitly tested.

If we were to remove the two sources of attentional demand in speeded monitoring – having two talkers produce the target items in the mixed-talker condition (doubling the number of unique tokens that need to be monitored for), and target recycling – at least four outcomes are possible. First, it is possible that talker changes have sufficient impact that we would still observe increased processing difficulty in mixed-talker trials relative to blocked-talker trials. Second, on the logic proposed by Nusbaum and Morin (1992), some degree of varied mapping may be required to induce sufficient demands on attention to induce detectable mixed-talker effects, and either form of varied mapping may suffice. Third, it may be that only one of the two aspects of varied mapping matters. Finally, it may be that both are required to induce sufficient attentional demands.

In the present study, we tested these possibilities. We first attempted to replicate previous studies that have shown a multi-talker processing cost with the speeded monitoring paradigm, following the approach that has been used in previous work (Experiment 1); critically, this approach recycles targets as distractors and necessitates monitoring for multiple target tokens in mixed-talker trials (1 per talker), but only one target token in blocked-talker trials. In subsequent experiments (Experiments 2–4), we modified the paradigm to eliminate target recycling and/or to control for the number of talkers producing target tokens for blocked-talker trials and mixed-talker trials. The 2 × 2 design for the experiments in this study is summarized in Table 1.

Table 1.

Overview of the designs for the four experiments.

                       Number of talkers producing targets on mixed trials
                       Two             One

Target recycling       Experiment 1    Experiment 4
No target recycling    Experiment 3    Experiment 2

Note. In the standard design (Experiment 1), two talkers produce the target items on mixed-talker trials, and an item that serves as a target on one trial can serve as a distractor on a subsequent trial. The other experiments remove one or both of these design features (Experiment 2 removes both, while Experiment 3 isolates the impact of multiple talkers producing mixed-trial targets and Experiment 4 isolates the impact of recycling targets as distractors).

General Methods

We pre-registered our experimental design and analysis plans on the Open Science Framework (https://osf.io/wx4kd) prior to data collection. For expository clarity, we have revised the order of the experiments in this paper. All stimuli and analysis scripts are available at https://github.com/disaltzman/TalkerTeam-Mapping.

Stimuli

Stimuli were produced by four native speakers of American English (two males, two females), who were recorded in a sound-attenuated booth using a RØDE NT-1 condenser microphone with a Focusrite Scarlett 6i6 digital audio interface. Each talker produced three repetitions of each of 19 phonetically distinct words from the word monitoring study of Nusbaum and Morin (1992). Productions from two talkers (one male, one female) were selected for the word monitoring experiments described in this study. We selected the best tokens from each talker’s repetitions and edited them to remove leading and trailing silence. All stimuli were scaled to an RMS amplitude of 70 dB SPL in Praat (Boersma & Weenink, 2017). The stimuli were otherwise unmodified. We note that the durations of the female talker’s stimuli (M = 606 ms) were significantly longer than those of the male talker (M = 568 ms), as indicated by a paired t-test, t(18) = 2.20, p = 0.04; however, we do not believe that this difference has any theoretical or functional implications, and so we did not modify the original stimuli. Stimuli were delivered via OpenSesame v3.2.4 through Sony MDR-7506 or Sennheiser HD-595 headphones.

Participants

We analyzed data from 176 participants (47 male, 126 female, 3 no report). Across all four experiments, 183 participants were recruited in total, and seven were excluded on the basis of poor accuracy. For all experiments, participants were recruited through the University of Connecticut Psychological Sciences participant pool. All participants indicated that they were monolingual English speakers with normal or corrected-to-normal vision and hearing and no history of speech, language, or neurological impairments. Written informed consent was obtained from every participant in accordance with the guidelines of the University of Connecticut IRB. Participants received course credit for their participation.

Given that accuracy tends to be high in word monitoring experiments (e.g., Heald & Nusbaum, 2014; Magnuson & Nusbaum, 2007), we decided a priori to exclude participants with accuracy levels below 90% (collapsing across mixed-talker and blocked-talker trials). This criterion has been used in previous studies on talker normalization (e.g., Choi & Perrachione, 2019b). For each experiment, we recruited until we had 44 participants who met the 90% accuracy criterion. Our sample size was based on a different word monitoring study conducted in our lab where we considered how multi-talker penalties (measured within subject) might be modulated by a between-subjects factor (Luthra et al., 2021). For that study, a power analysis of previous data (Magnuson & Nusbaum, 2007) demonstrated that 42 participants per level of the between-subjects factor were necessary for power of 0.90 at an α of 0.05 given an estimated mean effect size of approximately partial η² = 0.114 (the effect size for the critical significant interaction in Magnuson & Nusbaum, 2007). In this study, there are no between-subjects factors, so 42 participants per experiment should be adequate for statistical power. We rounded this up to 44 so that our number is divisible by four (for counterbalancing whether subjects receive mixed/blocked-talker trials first and whether they receive male/female blocked trials first).

Procedure

Participants first went through the informed consent process and then were seated at a testing computer. They were instructed that in each trial they would hear a series of words and should press the spacebar on the keyboard as quickly as possible any time they heard the target word, which would be identified on-screen shortly before the trial began.

Each subject received 48 mixed-talker trials and 48 blocked-talker trials; we counterbalanced whether participants received all their mixed trials first or all their blocked trials first. In a given blocked trial, the stimuli were either all spoken by the male speaker or all by the female speaker. Within the blocked-talker trials, we counterbalanced whether participants received all of the male or female blocked-talker trials first.
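The two counterbalancing factors described above form four order cells, which is why a sample of 44 divides evenly. The following is a minimal sketch of one way to fill those cells; the rotation scheme and all names (ORDER_CELLS, assign_counterbalancing) are illustrative assumptions, since the text does not say how participants were allocated to orders.

```python
from collections import Counter
from itertools import product

# Two binary order factors from the Procedure: which condition comes first,
# and which talker's blocked trials come first.
ORDER_CELLS = list(product(("mixed_first", "blocked_first"),
                           ("male_first", "female_first")))


def assign_counterbalancing(n_participants=44):
    """Rotate through the four order cells (assumed scheme, not the authors')."""
    return [ORDER_CELLS[i % len(ORDER_CELLS)] for i in range(n_participants)]


schedule = assign_counterbalancing(44)
cell_counts = Counter(schedule)  # participants per order cell
```

With 44 participants, simple rotation places 11 participants in each of the four cells; 42 would leave two cells one participant short.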

Each trial contained 16 auditory tokens, and the target appeared four times in each trial. The target did not appear in positions 1 or 16, and there was always at least one distractor between two targets (i.e., targets did not appear consecutively). A unique randomization was generated for every subject. Following Heald and Nusbaum (2014), we set an inter-trial interval (ITI) of 2500 ms. This ITI consisted of a fixation cross for the first 1000 ms, a blank screen for the next 250 ms, and then the visual presentation of the target word for the upcoming trial. Immediately following the ITI, the stimulus train for the trial began, with a stimulus-onset asynchrony of 750 ms. The target word remained on screen for the duration of the trial. The outcome of interest was the reaction time (RT) to target items. Following Magnuson and Nusbaum (2007), RT was measured from stimulus onset, and RTs that occurred within 150 ms of stimulus onset were considered as a response to the previous item.
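The trial constraints and the response-scoring rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' OpenSesame implementation: rejection sampling is an assumed way to satisfy the position constraints, "within 150 ms" is read as strictly less than 150 ms, press times are assumed to be measured from the onset of the first item, and the function names are hypothetical.

```python
import random

N_ITEMS = 16      # auditory tokens per trial
N_TARGETS = 4     # target occurrences per trial
SOA_MS = 750      # stimulus-onset asynchrony
EARLY_MS = 150    # presses this soon after an onset count for the previous item


def sample_target_positions(rng):
    """Return 4 sorted, 1-indexed target positions that avoid positions 1
    and 16 and are never adjacent (assumed method: rejection sampling)."""
    candidates = range(2, N_ITEMS)  # positions 2..15
    while True:
        positions = sorted(rng.sample(candidates, N_TARGETS))
        gaps = [later - earlier for earlier, later in zip(positions, positions[1:])]
        if all(gap >= 2 for gap in gaps):
            return positions


def attribute_press(press_ms):
    """Map a key press (ms from the onset of item 1) to a tuple of
    (1-indexed item position, RT measured from that item's onset)."""
    index = press_ms // SOA_MS       # 0-based item whose window holds the press
    rt = press_ms - index * SOA_MS
    if index > 0 and rt < EARLY_MS:  # too fast to be a response to this item
        index -= 1
        rt += SOA_MS
    return index + 1, rt


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
positions = sample_target_positions(rng)
```

Under these assumptions, a press 800 ms into the trial falls 50 ms after the onset of item 2 and is therefore scored as a response to item 1 with an RT of 800 ms.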

In Experiment 1, we sought to replicate the finding that multi-talker processing costs can be elicited in the standard word monitoring paradigm, as has been previously found (Heald & Nusbaum, 2014; Magnuson & Nusbaum, 2007; Magnuson et al., 2021; Nusbaum & Magnuson, 1997; Nusbaum & Morin, 1992). In keeping with the previous studies, items that served as targets could be used as distractors on subsequent trials. Furthermore, in every mixed-talker trial, the target was produced twice by the male talker and twice by the female talker. Thus, Experiment 1 included both sources of attentional demand: targets were recycled on blocked and mixed trials, and targets were produced by a single talker on blocked trials but by two talkers on mixed trials.

In Experiment 2, we modified the speeded monitoring paradigm to address two features of the standard paradigm that prevent conditions for consistent mapping. First, we ensured that target items would never be recycled as distractors in other trials. Second, we modified mixed-talker trials such that only one talker produced the target item, although both talkers produced distractor items. This maintains the same level of acoustic variability in the mixed-talker trials as in Experiment 1 but reduces the potential working memory load, as subjects only need to monitor for one unique production.

In Experiment 3, we did not allow target recycling in the speeded monitoring paradigm to ensure that the mapping between targets and responses was fully consistent (i.e., items that served as a target on one trial could not serve as a distractor on another). However, as in the conventional design, targets in mixed-talker trials were produced by each talker. Thus, Experiment 3 tests whether multi-talker processing costs in the monitoring paradigm can be driven solely by the need to respond to multiple target tokens.

In Experiment 4, we tested the possibility that target recycling might be a sufficient condition for multi-talker processing costs. That is, we asked whether multi-talker processing costs persist even when target items are spoken by only one talker but target recycling is allowed.
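The four designs differ only in two binary features, so they can be summarized as a small lookup together with a rule for which talker produces each of the four targets on a mixed-talker trial. This is a sketch under assumptions: the text states that two-talker designs use two male and two female target productions per mixed trial and that the single target talker varied from trial to trial, but the random ordering and random per-trial choice below, like the names DESIGNS and mixed_trial_target_talkers, are illustrative.

```python
import random

# Design features per experiment, taken from Table 1.
DESIGNS = {
    1: {"target_recycling": True, "target_talkers_mixed": 2},
    2: {"target_recycling": False, "target_talkers_mixed": 1},
    3: {"target_recycling": False, "target_talkers_mixed": 2},
    4: {"target_recycling": True, "target_talkers_mixed": 1},
}


def mixed_trial_target_talkers(experiment, rng):
    """Return the talker for each of the 4 targets on one mixed-talker trial."""
    if DESIGNS[experiment]["target_talkers_mixed"] == 2:
        talkers = ["male", "male", "female", "female"]
        rng.shuffle(talkers)  # assumed: order of talkers across targets is random
        return talkers
    # Assumed: the single target talker is drawn at random on each trial.
    return [rng.choice(["male", "female"])] * 4


rng = random.Random(1)  # fixed seed only so the sketch is repeatable
exp1_targets = mixed_trial_target_talkers(1, rng)
exp2_targets = mixed_trial_target_talkers(2, rng)
```

Distractor items would still be split between both talkers in every mixed-talker trial, which is what keeps acoustic variability matched across the four designs.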

Analysis

RT data from trials with correct responses were submitted to a generalized linear mixed-effects model that was implemented in R (R Core Team, 2019) with the packages lme4 (Bates, Mächler, Bolker, & Walker, 2015) and afex (Singmann, Bolker, Westfall, Aust, & Ben-Shachar, 2020). No responses to target items were filtered or removed based upon their RT. Lo and Andrews (2015) have argued that RT transformations may obscure meaningful differences between conditions and therefore that raw RTs are a more theoretically justified dependent variable. We therefore used generalized linear mixed models for analyzing RTs; such an approach allows for the use of raw RTs as the dependent variable while allowing the user to specify a statistical distribution that reflects the actual distribution of RT. As suggested by Lo and Andrews, we specified a gamma distribution with an identity link. For all experiments, chi-square tests indicated that this approach yielded significantly better model fit than equivalent linear mixed-effects models with either the raw RT data or log-transformed RT data.

As outlined in our pre-registered analysis plan, we identified the most parsimonious random effects structure using a backwards-stepping procedure (Matuschek, Kliegl, Vasishth, Baayen, & Bates, 2017). Likelihood ratio tests were implemented using the ‘mixed’ function in the R afex package to test for effects of our fixed factors; we report chi-squared values and associated p values from these tests.

Results

Data from all four experiments were submitted to an omnibus analysis that used a generalized linear mixed model with fixed factors of Condition (Blocked vs. Mixed, sum-coded), Target Recycling (Present vs. Absent, sum-coded), Number of Talkers Producing Targets in Mixed Trials (One vs. Two, sum-coded), and the accompanying two- and three-way interactions. The model with by-subject random slopes for Condition was estimated to have the best fit. There was a significant main effect of Condition (χ² = 19.63, p < 0.001), indicating that across the experiments, responses were slower on mixed-talker trials than on blocked-talker trials, and a significant main effect of Target Recycling (χ² = 4.55, p = 0.03), indicating that responses were slower for experiments where Target Recycling was absent (Exp. 2 & 3) compared to those where it was present (Exp. 1 & 4). Of the interactions, only the two-way interaction between Condition and Number of Talkers Producing Targets in Mixed Trials was significant (χ² = 6.94, p = 0.008).

This interaction was explored using the R package emmeans (Lenth, 2020) to compare estimated marginal means (EMM) for the effect of Condition at each level of Number of Talkers Producing Targets in Mixed Trials. There was a significant difference between Blocked and Mixed trials when one talker produced the target items in mixed-talker trials (EMM = −5.85, p = 0.01), though this difference was much larger when two talkers produced the target items in mixed-talker trials (EMM = −21.56, p < 0.0001), indicating that the multi-talker processing cost was smaller (though still present) when only one talker produced the target items in mixed-talker trials.

General Discussion

Over the course of four experiments, we investigated the possibility that either, both, or neither of the attention-demanding features in conventional speeded monitoring paradigms might be crucial for observing multi-talker processing costs. Specifically, we tested whether detecting this processing cost requires (1) the recycling of target items as distractor items on subsequent trials and/or (2) two talkers producing the target items in mixed-talker trials, thereby requiring the listener to monitor for twice as many unique items as in the blocked-talker trials. We found evidence for the latter, as multi-talker processing costs were elicited when the mixed-talker condition required responses to two unique tokens (Experiments 1 and 3) but were substantially reduced when responses were made to a single target token in both mixed- and blocked-talker trials (Experiments 2 and 4).

While previous work by Schneider and Shiffrin (1977) led us to hypothesize that the recycling of targets might also be a critical factor governing the emergence of multi-talker processing costs³, we did not find evidence to support this hypothesis. This may be because the visual search task used by Schneider and Shiffrin differs too much from the word monitoring paradigm. The difference in modality (visual versus auditory) notwithstanding, a key difference between the auditory monitoring task and their visual search task is the amount of practice participants had with the task; participants in Schneider and Shiffrin’s studies had substantial exposure to repeated targets and distractors before the crucial test data were collected (on the order of thousands of trials), while participants in our study had fewer trials and no prior training with items before data were collected. Schneider and Shiffrin posited that the two criteria for achieving automaticity in processing are consistent mapping and practice to reinforce that mapping; consistent mappings without substantial practice are not sufficient to develop automaticity. Thus, even when targets were not recycled (as in Experiments 2 and 3), participants may have been engaging in controlled processing, as the mapping between certain words and their status as a “target” had little reinforcement, and the length of the paradigm used in the reported experiments was unlikely to provide sufficient practice to reinforce it. To further investigate this possibility, future work might test whether multi-talker processing costs dissipate if targets are not recycled and participants receive considerable practice with the task.

The present results fit into a broader literature suggesting that it is difficult (and perhaps impossible) to consider the problem of talker normalization without considering other aspects of cognitive processing, including the mapping between stimuli and responses, an individual’s level of practice with the experimental task, and the degree of cognitive load.

Rather than finding evidence that target recycling was the key factor for eliciting multi-talker costs, our results suggest that in the speeded monitoring paradigm, the presence of a multi-talker processing cost depends on how many talkers produce the target stimuli in mixed-talker trials. As the speeded monitoring paradigm does not require participants to make responses to most items, and because participants in Experiments 2 and 4 only needed to respond to one talker’s productions for a given mixed-talker trial, it is possible that listeners were able to effectively ignore the second talker, who was only producing task-irrelevant distractors. As such, performance on mixed-talker trials may have been similar to performance on blocked-talker trials insofar as there was only a single target to monitor for on a given trial. This observation is consistent with the well-attested “cocktail party” effect (see Shinn-Cunningham, 2008, for a review) – the ability of listeners to attend to and segment one stream of speech from competing, irrelevant information (Cherry, 1953). When only a single talker produces the target items, the selective attention required for mixed-talker trials changes – the target consists of only a single combination of talker and item, which reduces the cognitive demands and perhaps allows normalization to occur automatically and nearly undetectably. That said, it is important to note that the identity of the talker producing the target items on mixed-talker trials varied from trial to trial, so subjects could not have known in advance which talker they needed to attend to (at least prior to the first target on a given mixed-talker trial).

In other words, our results suggest that the key factor governing the emergence of multi-talker processing costs in the speeded word monitoring paradigm is whether both talkers are behaviorally relevant with regard to participants’ responses. Specifically, when all the target items are produced by one talker (i.e., only one talker is behaviorally relevant), the costs involved in talker normalization are dramatically reduced. This position – namely, that talker normalization is a highly automatized process that is only observable when listeners must engage in highly controlled processing – is consistent with the stance taken by Nusbaum and Morin (1992).

Our findings suggest that to elicit measurable multi-talker processing costs in speeded monitoring paradigms, researchers should ensure that both talkers are behaviorally relevant (i.e., that listeners must make behavioral responses to both talkers). However, while both talkers are indeed behaviorally relevant in the standard speeded monitoring paradigm (Experiment 1), the standard design has inherent asymmetries between mixed-talker and blocked-talker trials with regard to the number of tokens (i.e., unique stimuli) to which listeners must respond. This makes it difficult to determine whether the observed multi-talker processing costs are truly a result of talker normalization per se or a result of general acoustic variation. While previous work by Magnuson and Nusbaum (2007) suggests that not all acoustic variation (e.g., changes in amplitude) elicits a processing cost, we suggest that additional studies are needed to distinguish whether the processing costs in this paradigm are specifically due to talker variation.

It is important to acknowledge that multi-talker processing costs have also been observed in other paradigms, and thus are unlikely to be an artifact of the monitoring paradigm. For example, Mullennix et al. (1989) assigned participants to either a blocked-talker group or multi-talker group and asked them to identify what words were spoken. Across a range of signal-to-noise ratios, and regardless of whether they were asked to type the word or speak it aloud, participants in the multi-talker group were reliably slower to respond and less accurate than those in the blocked-talker group. Multi-talker processing costs have also been repeatedly observed in the speeded classification paradigm (Carter et al., 2019; Choi et al., 2018; Choi & Perrachione, 2019a, 2019b; Kapadia & Perrachione, 2020; Lim, Tin, Qu, & Perrachione, 2019), where listeners hear a single item (e.g., boot) on each trial and must indicate what they heard from a limited set of response options (e.g., boot or boat). Notably, in both of these tasks, listeners must make a behavioral response on every item, meaning that both talkers are behaviorally relevant. This again suggests that normalization may only occur (or that multi-talker processing costs may only be measurable) when changes in talker are kept in the attentional focus.

More generally, in considering the utility of multi-talker processing costs as a tool for studying talker normalization, it is worth noting that some researchers have suggested that multi-talker processing costs may emerge simply because there is a break in low-level acoustic information that disrupts auditory streaming (Choi & Perrachione, 2019b; Lim, Shinn-Cunningham, & Perrachione, 2019), rather than reflecting talker normalization per se. Specifically, when listeners hear speech from one talker, they can attribute ongoing variation in the auditory signal to a single physical source with relative ease – that is, they can group the relevant auditory input into a single auditory object. By contrast, when the speech signal alternates between two talkers, the formation of one auditory object (for the first talker) may be disrupted by the need to form a second auditory object (for the second talker). On this view, the disruption makes it harder to attend to – and thus harder to perceptually analyze – the speech signal, yielding multi-talker processing costs. However, our results appear inconsistent with this notion. The streaming account should predict multi-talker processing costs even when mixed-talker trial targets are produced by only one talker (since there is still talker variability within the trial, with equivalent numbers of talker changes), which was not the case in Experiments 2 and 4. Such a result suggests that multi-talker processing costs indeed reflect a process of talker normalization/accommodation, rather than emerging simply because of disruptions in auditory object formation.

Conclusions

In closing, it is worth underscoring that the lack of invariance problem remains a critical issue for research on speech perception, and despite decades of concerted effort, as a field, we are still far from understanding how listeners accommodate sources of variance, including variation between talkers. For proponents of the view that phonetic constancy results from active, controlled processing, our results identify the potentially crucial attentional aspect of speeded monitoring for detecting the operation of talker normalization. These findings also call into question the automaticity of talker normalization, suggesting that processing penalties may only emerge (or may only be observable) when the talker change is in attentional focus. Future work will be required to further elucidate the nature of the attention-demanding processing mechanisms that appear to be associated with maintaining phonetic constancy, and to fully equate sources of variability between blocked- and mixed-talker trials.

Table 2.

Demographic information for each of the four experiments.

| Experiment | N before exclusion | Excluded for low accuracy | Gender |
|---|---|---|---|
| Experiment 1 | 44 | 0 | 29 F, 13 M, 2 NR |
| Experiment 2 | 45 | 1 | 36 F, 8 M |
| Experiment 3 | 47 | 3 | 30 F, 13 M, 1 NR |
| Experiment 4 | 47 | 3 | 31 F, 13 M |

Note. F = Female, M = Male, NR = No Report

Table 3.

Generalized linear mixed-effects model output using the R package afex for the omnibus analysis.

| Effect | df | χ2 | p value |
|---|---|---|---|
| Condition (Blocked/Mixed) | 1 | 19.63 *** | <0.001 |
| Target Recycling | 1 | 4.55 * | 0.033 |
| Number of Talkers Producing Targets in Mixed Trials | 1 | 0.46 | 0.499 |
| Condition * Target Recycling | 1 | 0 | >0.999 |
| Condition * Number of Talkers Producing Targets in Mixed Trials | 1 | 6.94 ** | 0.008 |
| Target Recycling * Number of Talkers Producing Targets in Mixed Trials | 1 | 2.59 | 0.108 |
| Condition * Target Recycling * Number of Talkers Producing Targets in Mixed Trials | 1 | 0 | 0.969 |
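
As a reading aid for Table 3, the sketch below recomputes p values from the printed χ2 statistics. It assumes that each p value was obtained by referring the χ2 statistic to a chi-square distribution with the listed df (here always 1); the table itself does not state the test method, so this is an assumption rather than a description of the authors' analysis code. The function name chi2_sf_df1 and the shortened effect labels are illustrative only. For 1 df, the upper-tail probability equals erfc(sqrt(x / 2)), so only the Python standard library is needed.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 df.

    If Z is standard normal, Z**2 is chi-square with 1 df, so
    P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# (shortened effect label, chi-square as printed, p value as printed)
# Rows whose chi-square is printed as 0 are omitted: the rounded
# statistic carries too little precision to recover the p value.
rows = [
    ("Target Recycling", 4.55, 0.033),
    ("Number of Talkers", 0.46, 0.499),
    ("Condition * Number of Talkers", 6.94, 0.008),
    ("Target Recycling * Number of Talkers", 2.59, 0.108),
]

for label, chi2, reported in rows:
    p = chi2_sf_df1(chi2)
    print(f"{label}: chi2 = {chi2}, recomputed p = {p:.3f}, reported p = {reported}")
```

Small discrepancies in the third decimal place are expected, because the χ2 values in the table are themselves rounded to two decimals.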

Public Significance Statement:

This study highlights the importance of attention to the process of accommodating the unique way each individual speaks, which may not occur automatically unless the talker is relevant to the current situation.

Author Note

This research was supported by NIH grant R01 H14-001 (PI: EBM) and NSF grants NRT 1747486 and PAC 1754284 (PI: JSM). SL was supported by an NSF Graduate Research Fellowship. This work was presented at the 61st Annual Meeting of the Psychonomic Society. As noted in the manuscript, the preregistration plan for this study is available on the Open Science Framework (https://osf.io/wx4kd), and all stimuli and analysis code can be found at https://github.com/disaltzman/TalkerTeam-Mapping.

Footnotes

1. In the present work, we focus on normalization based on preceding speech, often termed extrinsic normalization. In contrast, most proposals for intrinsic normalization hold that each speech sample contains sufficient information to map acoustics to perceptual categories (Ainsworth, 1975; Lobanov, 1971; Syrdal & Gopal, 1986), and thus do not predict that talker changes should induce processing costs. While extrinsic and intrinsic normalization could be complementary mechanisms that promote phonetic constancy (Nearey, 1989), we focus on contextual tuning theories of extrinsic normalization (Magnuson & Nusbaum, 2007; Magnuson, Nusbaum, Akahane-Yamada, & Saltzman, 2021) which explicitly predict processing costs due to talker changes (and subsequent re-computation of the acoustics-to-percepts mapping).

2. Because “normalization” is often associated with the notion of destructive abstraction, whereby speech is stripped of surface details and mapped to abstract phonological and/or lexical categories, Magnuson and Nusbaum (2007) proposed that a better term might be “talker accommodation.” (They also discussed the fact that most proposals for talker normalization do not explicitly or implicitly propose destructive abstraction.)

3. As we noted earlier, while target recycling occurs in both blocked- and mixed-talker conditions in the monitoring paradigm, it could have interacted with talker mixing by contributing additional attentional demand to allow multi-talker processing costs to be observed.

References

  1. Ainsworth W (1975). Intrinsic and extrinsic factors in vowel judgments. In Auditory Analysis and Perception of Speech (pp. 103–113).
  2. Antoniou M, Wong PCM, & Wang S (2015). The effect of intensified language exposure on accommodating talker variability. Journal of Speech, Language, and Hearing Research, 58(3), 722–727.
  3. Bates D, Maechler M, Bolker B, & Walker S (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1–48. doi: 10.18637/jss.v067.i01.
  4. Boersma P, & Weenink D (2017). Praat: Doing phonetics by computer.
  5. Bosker HR (2018). Putting Laurel and Yanny in context. The Journal of the Acoustical Society of America, 144(6), EL503–EL508.
  6. Carter YD, Lim S, & Perrachione TK (2019). Talker continuity facilitates speech processing independent of listeners’ expectations. In 19th International Congress of Phonetic Sciences.
  7. Cherry EC (1953). Some Experiments on the Recognition of Speech, with One and with Two Ears. The Journal of the Acoustical Society of America, 25(5), 975–979. 10.1121/1.1907229
  8. Choi JY, Hu ER, & Perrachione TK (2018). Varying acoustic-phonemic ambiguity reveals that talker normalization is obligatory in speech processing. Attention, Perception, and Psychophysics, 80(3), 784–797.
  9. Choi JY, & Perrachione TK (2019a). Noninvasive neurostimulation of left temporal lobe disrupts rapid talker adaptation in speech processing. Brain and Language, 196, 104655, 1–7.
  10. Choi JY, & Perrachione TK (2019b). Time and information in perceptual adaptation to speech. Cognition, 192, 103982.
  11. Dorman MF, Studdert-Kennedy M, & Raphael LJ (1977). Stop-consonant recognition: Release bursts and formant transitions as functionally equivalent, context-dependent cues. Perception & Psychophysics, 22(2), 109–122.
  12. Francis AL, & Nusbaum HC (1996). Paying attention to speaking rate. In Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP’96 (Vol. 3, pp. 1537–1540).
  13. Heald SLM, & Nusbaum HC (2014). Talker variability in audio-visual speech perception. Frontiers in Psychology, 5, 1–9.
  14. Joos M (1948). Acoustic phonetics. Language, 24(2), 5–136.
  15. Kapadia AM, & Perrachione TK (2020). Selecting among competing models of talker adaptation: Attention, cognition, and memory in speech processing efficiency. Cognition, 204.
  16. Ladefoged P, & Broadbent DE (1957). Information conveyed by vowels. Journal of the Acoustical Society of America, 29(1), 98–104.
  17. Laing EJC, Liu R, Lotto AJ, & Holt LL (2012). Tuned with a tune: Talker normalization via general auditory processes. Frontiers in Psychology, 3, 1–9.
  18. Liberman AM, Harris KS, Hoffman HS, & Griffith BC (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54(5), 358–368.
  19. Lim S-J, Shinn-Cunningham BG, & Perrachione TK (2019). Effects of talker continuity and speech rate on auditory working memory. Attention, Perception, & Psychophysics, 81, 1167–1177.
  20. Lim S-J, Tin JAA, Qu A, & Perrachione TK (2019). Attentional reorientation explains processing costs associated with talker variability. In 19th International Congress of Phonetic Sciences.
  21. Lo S, & Andrews S (2015). To transform or not to transform: using generalized linear mixed models to analyse reaction time data. Frontiers in Psychology, 6(August), 1–16.
  22. Lobanov BM (1971). Classification of Russian vowels spoken by different speakers. The Journal of the Acoustical Society of America, 49(2B), 606–608.
  23. Luthra S, Saltzman D, Myers EB, & Magnuson JS (2021). Listener expectations and the perceptual accommodation of talker variability: A pre-registered replication. Attention, Perception, & Psychophysics. 10.3758/s13414-021-02317-x
  24. Magnuson JS, & Nusbaum HC (2007). Acoustic differences, listener expectations, and the perceptual accommodation of talker variability. Journal of Experimental Psychology: Human Perception and Performance, 33(2), 391–409.
  25. Magnuson JS, Nusbaum HC, Akahane-Yamada R, & Saltzman D (2021). Talker familiarity and the accommodation of talker variability. Attention, Perception, and Psychophysics. 10.3758/s13414-020-02203-y.
  26. Mullennix JW, Pisoni DB, & Martin CS (1989). Some effects of talker variability on spoken word recognition. The Journal of the Acoustical Society of America, 85(1), 365–378.
  27. Nearey TM (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85(5), 2088–2113.
  28. Nusbaum HC, & Magnuson JS (1997). Talker normalization: Phonetic constancy as a cognitive process. Talker Variability and Speech Processing, 109–132.
  29. Nusbaum HC, & Morin TM (1992). Paying attention to differences among talkers. In Speech Perception, Production and Linguistic Structure (pp. 133–134).
  30. Nusbaum HC, & Schwab EC (1986). The role of attention and active processing in speech perception. In Schwab EC & Nusbaum HC (Eds.), Pattern Recognition by Humans and Machines, Volume 1: Speech Perception (1st ed., pp. 113–157). Academic Press.
  31. Peterson GE, & Barney HL (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175–184.
  32. R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/.
  33. Schneider W, & Shiffrin RM (1977). Controlled and automatic human information processing: I. Detection, search, and attention. Psychological Review, 84(1), 1–66.
  34. Shiffrin RM, & Schneider W (1977). Controlled and automatic human information processing: II. Perceptual learning, automatic attending and a general theory. Psychological Review, 84(2), 127–190.
  35. Shinn-Cunningham BG (2008). Object-based auditory and visual attention. Trends in Cognitive Sciences, 12(5), 182–186. 10.1016/j.tics.2008.02.003
  36. Singmann H, Bolker B, Westfall J, Aust F, & Ben-Shachar MS (2020). afex: Analysis of Factorial Experiments. R package version 0.27–2. https://CRAN.R-project.org/package=afex
  37. Sjerps MJ, Fox NP, Johnson KA, & Chang EF (2018). Speaker-normalized vowel representations in the human auditory cortex. Nature Communications, (2019), 1–38.
  38. Stilp CE (2019). Auditory enhancement and spectral contrast effects in speech perception. The Journal of the Acoustical Society of America, 146(2), 1503–1517.
  39. Syrdal AK, & Gopal HS (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. The Journal of the Acoustical Society of America, 79(4), 1086–1100.
  40. Verbrugge RR, Strange W, Shankweiler DP, & Edman TR (1976). What information enables a listener to map a talker’s vowel space? The Journal of the Acoustical Society of America, 60(1), 198–212.
  41. Wong PCM, Nusbaum HC, & Small SL (2004). The neural basis of talker normalization. Journal of Cognitive Neuroscience, 16, 1173–1184.
  42. Zhang C, Peng G, & Wang WSY (2013). Achieving constancy in spoken word identification: Time course of talker normalization. Brain and Language, 126(2), 193–202.
