Abstract
The identity of a speaker influences language comprehension by modulating perception and expectation. This review explores speaker effects and proposes an integrative model of language and speaker processing that reconciles distinct mechanistic perspectives. We argue that speaker effects arise from the interplay between bottom-up perception-based processes, driven by acoustic-episodic memory, and top-down expectation-based processes, driven by a speaker model. We show that language and speaker processing are functionally integrated through multi-level probabilistic processing: prior beliefs about a speaker modulate language processing at the phonetic, lexical, and semantic levels, while the unfolding speech and message continuously update the speaker model, refining broad demographic priors into precise individualized representations. Within this framework, we distinguish between speaker-idiosyncrasy effects arising from familiarity with an individual and speaker-demographics effects arising from social group expectations. We discuss how speaker effects serve as indices for assessing language development and social cognition, and we encourage future research to extend these findings to the emerging domain of artificial intelligence (AI) speakers, as AI agents represent a new class of social interlocutors that are transforming the way we engage in communication.
Keywords: Language comprehension, Speaker effect, Acoustic episode, Speaker model, Probabilistic processing, Social cognition
Introduction
Despite being widely used in the psycholinguistic literature, the term speaker effect, also known as talker effect, is rarely formally defined. It refers to how language comprehension is influenced by the identity of the speaker. For example, when a common name like “Kevin” is mentioned by a colleague, a listener might think of a middle-aged workmate named Kevin, whereas if the same name is mentioned by their school-age son, they are more likely to think of a boy from his class (Barr et al., 2014). Similarly, while it seems natural for a little girl to say she cannot sleep without her teddy bear, hearing the same sentence from an adult man may be unexpected (Van Berkum et al., 2008). These examples illustrate that language is understood in a context that includes the identity of the speaker.
However, using speaker effect as an umbrella term often obscures the distinct mechanisms at play in different scenarios. In the “Kevin” example, the effect may arise from the activation of acoustic-episodic memory, linking the name with the voice of a specific speaker (such as the workmate or the school-age son). In contrast, the “teddy bear” example may illustrate the influence of the listener’s mental model regarding demographic stereotypes. The field currently lacks a theoretical framework in which various types of speaker effects can be mechanistically explained and integrated.
To address this issue, we propose a theoretical framework for understanding speaker effects in language comprehension. We begin by noting that the physical basis of speaker effects is the variability in voices across speakers, and that a speaker’s voice provides rich information that allows listeners to perceive and identify the speaker. We then consider the interplay between voice and linguistic content by contrasting a one-system view, which assumes voice and language processing are integrated, and a two-system view, which regards them as independent processes. These two perspectives give rise to two accounts of speaker effects. The acoustic-episode account, which aligns with the one-system view, emphasizes the role of acoustic-episodic memory in modulating language comprehension. In contrast, the speaker-model account, which aligns with the two-system view, focuses on the influence of the listener’s mental model of the speaker on language comprehension. To reconcile these two accounts, we propose an integrative model of language and speaker processing that incorporates the roles of both acoustic-episodic memory and the speaker model. We also illustrate how the speaker model may modulate language comprehension through multiple levels of probabilistic processing. Building on this integrative model, we differentiate between speaker-idiosyncrasy effects and speaker-demographics effects. The former arise from the listener’s familiarity with specific individual speakers, while the latter stem from the listener’s accumulated experience interacting with a demographic population. Acoustic-episodic memory and the speaker model contribute to these two types of effects to varying degrees depending on the requirements of different comprehension tasks. 
Finally, we discuss the potential for using speaker effects as measures for assessing linguistic and socio-cognitive abilities, and suggest extending future research to artificial intelligence (AI) agents as humanlike speakers, as the increasing prevalence of voice-based human-AI interaction may give rise to new types of speaker effects.
Voice as the physical basis of speaker effects
Throughout evolutionary history, communication systems in humans and other species have shared a common purpose: to convey the vocalizer’s identity and physiological characteristics (Creel & Bregman, 2011). The phenomenon of “speaker” effects is not only prevalent in human communication but is also evident in the animal kingdom. Non-human primates, who share a common ancestor with humans, exhibit vocal recognition systems that enable them to identify individual members within their social groups (Bergman et al., 2003) and perceive cues regarding physical characteristics such as sex, age, and body size (Ey et al., 2007).
Similarly, humans can extract social and biological information from the voice, which is a cornerstone of social cognition (Belin et al., 2004). For example, males with lower-pitched voices are often perceived by other males as more physically and socially dominant, reflecting the importance of voice in male intrasexual competition and mating success (Puts et al., 2006). People rapidly form personality judgments from hearing a new voice, even from just a brief utterance like the word “hello” that lasts less than a second (McAleer et al., 2014). In this section, we explore how human speakers vary in vocal features and how listeners identify speakers on the basis of these features.
The source of speaker effects: Speaker variability
In spoken language comprehension, the cognitive processing of a speaker’s identity begins with the physical signal. In most cases, speaker effects arise from differences in how acoustic speech signals are produced by different individuals. These differences are known as speaker variability, which is often linked to the speaker’s unique physiological characteristics and learned behaviors.
Voice production results from interactions between the laryngeal source and the vocal tract filter (Ghazanfar & Rendall, 2008). Source properties include the vibration of the vocal folds, which determines a speaker’s fundamental frequency (f0), a primary cue listeners use to track pitch. Filter properties include the dynamic changes in shape and size of the vocal tract, which acts as a resonator to reinforce certain frequencies. These resonant frequencies of the vocal tract are known as formants. The largest acoustic differences in voice among speakers are observed between men and women, and between adults and children (Fitch & Giedd, 1999; Johnson & Sjerps, 2021; Kreiman & Sidtis, 2011). These differences primarily result from physiological factors. For example, men typically have larger larynges with longer and thicker vocal folds compared to women (Hammond et al., 2000), leading to lower rates of vibration and, consequently, lower pitch during speech production (Stevens, 1998). Men also have longer vocal tracts and proportionally longer pharyngeal cavities (Simpson, 2001), resulting in generally lower vowel formant frequencies (Gelfer & Bennett, 2013). Sex-based differences in voice quality, such as breathiness, may arise from variations in subglottal pressure and laryngeal adjustments (Klatt & Klatt, 1990).
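The link between vocal tract length and formant frequencies described above can be illustrated with a textbook idealization that is not part of the studies cited here: treating the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / (4L). The Python sketch below uses this simplification; the speed of sound and the two tube lengths are illustrative assumptions, not measurements from the cited work.

```python
# Assumed speed of sound in the warm, humid air of the vocal tract (m/s).
SPEED_OF_SOUND_M_PER_S = 350.0


def tube_formants(vocal_tract_length_m, n_formants=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Quarter-wavelength resonator: F_n = (2n - 1) * c / (4 * L).
    This is an idealization; real vocal tracts are not uniform tubes.
    """
    if vocal_tract_length_m <= 0:
        raise ValueError("vocal tract length must be positive")
    return [
        (2 * n - 1) * SPEED_OF_SOUND_M_PER_S / (4 * vocal_tract_length_m)
        for n in range(1, n_formants + 1)
    ]


# Hypothetical lengths chosen only to contrast a longer and a shorter tract.
longer_tract = tube_formants(0.175)
shorter_tract = tube_formants(0.145)

for n, (low, high) in enumerate(zip(longer_tract, shorter_tract), start=1):
    print(f"F{n}: longer tract {low:.0f} Hz, shorter tract {high:.0f} Hz")
```

Under these assumptions every resonance of the shorter tube lies above the corresponding resonance of the longer tube, which is the direction of the formant difference reported for speakers with shorter versus longer vocal tracts.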
However, some differences in voice cannot be explained solely by anatomical differences; instead, they are influenced by social and cultural factors (Munson & Babel, 2019). For example, the acoustic difference between male and female children’s speech may be partly attributed to sex-specific articulatory behaviors (Bennett, 1981; Perry et al., 2001). Disparities in fundamental frequencies between males and females throughout their lifespan may arise from culturally ingrained, gender-based pronunciation practices (Whiteside, 2001). A study examining the voice of boys diagnosed with gender identity disorder (now typically referred to as gender dysphoria) showed that they sounded less like control boys, likely due to subtle, learned speech behaviors rather than variations in vocal tract size or vocal cord shape (Munson et al., 2015). Transgender speakers can modify vocal characteristics to align with their identity. For example, transgender women can raise their pitch and adjust resonance despite anatomical constraints, and transgender men can adopt articulation patterns that correlate more with their gender identity (Zimman, 2018). These findings show that the voice can be a social construct mediated by identity rather than strictly a biological outcome.
Beyond gender differences, the acoustic properties of the human voice change with age, marking significant divergence between adults and children. The development of children’s vocalization is primarily due to anatomical maturation of the vocal tract, such as the increase in vocal tract length, which leads to a decrease in formant frequencies. By puberty, notable sex-based differences in vocal tract length emerge, with distinct patterns for males and females (Vorperian et al., 2009). Compared to adults, the speech of children is characterized by elevated pitch and formant frequencies, extended speech segment durations, and increased variability in both timing and frequency spectra (Lee et al., 1999). These distinct patterns enable listeners to identify the age of a speaker upon hearing their voice.
Speaker identification by voice
Humans can easily identify a person through their voice, and this ability develops very early in life. Human fetuses show increased heart rates in response to their mother’s voice and decreased heart rates to a stranger’s voice (Kisilevsky et al., 2003). Similarly, newborn babies less than 3 days old can distinguish and show a preference for their mother’s voice over that of a female stranger, as indicated by their sucking behavior (DeCasper & Fifer, 1980). For adults, a speaker’s voice is a primary acoustic signal that provides rich information about the speaker (Schweinberger et al., 2014). When encountering familiar speakers, listeners can often identify the individual using acoustic cues (Schweinberger et al., 1997). With unfamiliar speakers, listeners can extract demographic attributes from their voice, including their sex (Leung et al., 2018), age (Mulac & Giles, 1996), physical characteristics (Krauss et al., 2002), region of origin (Clopper & Pisoni, 2004b), socio-economic status (Labov, 1973), perceived competence (Rakić et al., 2011; Ko et al., 2009), and sexual orientation (Pierrehumbert et al., 2004).
Listeners can identify a person through their voice remarkably quickly. They show rapid responses differentiating voices from other sounds. In a magnetoencephalography (MEG) study, Capilla et al. (2013) found that listeners begin to show distinct brain responses to vocal and non-vocal sounds as early as 150 ms after stimulus onset. These voice-preferential responses are localized to bilateral mid-superior temporal sulci (mid-STS) and mid-superior temporal gyri (mid-STG), overlapping with the brain regions known as the temporal voice areas. Beyond differentiating voice from non-voice, it takes around 200–300 ms to identify a voice as familiar. An electroencephalogram (EEG) study by Beauchemin et al. (2006) found that familiar voices elicited greater mismatch negativity (MMN) and P3a components, which peaked around 200 ms and 300 ms after stimulus onset, respectively. Furthermore, the categorization of a speaker’s social group based on voice also occurs rapidly and interacts with the processing of other vocal cues. Jiang et al. (2020) demonstrated that listeners differentiated vocally expressed confidence as early as approximately 100–200 ms after voice onset for in-group speakers; however, these early differentiation effects were altered or absent for out-group speakers. For newly learned voices, Zäske et al. (2014) found that successful identification was associated with beta-band (16–17 Hz) neural oscillations in central and right temporal regions, starting around 290 ms after stimulus onset. These oscillations appeared to be elicited independently of linguistic content, suggesting a possible dissociation between voice and language processing.
Regarding how voices are represented in memory, some models suggest prototype-based processing as a potential mechanism (e.g., Lavner et al., 2001). In these models, voices are represented in a multidimensional voice space (Petkov & Vuong, 2013). Each dimension represents a vocal feature (e.g., the vocal tract length). The central point of this space represents the prototype voice (Latinus et al., 2013), an average voice formed through prior exposure to different voices. Direct support for this mechanism comes from Lavan et al. (2019b), who found that listeners, after having learned a voice from a specific set of acoustic exemplars, subsequently recognized the untrained mathematical average of that voice better than the specific exemplars they had actually heard. This suggests that listeners automatically construct norm-based prototypes, rather than relying solely on the storage of specific exemplars.
Each voice is represented in the voice space by its deviation from the averaged prototype. Listeners estimate the similarity of an incoming voice to a reference voice based on these deviation patterns (Maguinness et al., 2018). These deviations are compared to stored reference patterns, which may represent a specific speaker (for familiar voices) or broader templates of a demographic group, such as a “young Glaswegian male” (Lavan et al., 2019a).
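The prototype-and-deviation idea can be made concrete with a minimal Python sketch. It assumes a hypothetical two-dimensional voice space (mean f0 in Hz and vocal tract length in cm), invented feature values, and unweighted Euclidean distance as the similarity measure; none of these choices comes from the cited models, which may use more dimensions and perceptually scaled distances.

```python
import math


def prototype(voices):
    """Average voice: the mean of all previously heard voices on each dimension."""
    n = len(voices)
    return tuple(sum(v[d] for v in voices) / n for d in range(len(voices[0])))


def deviation(voice, proto):
    """Represent a voice by its offset from the prototype."""
    return tuple(x - p for x, p in zip(voice, proto))


def closest_reference(voice, proto, references):
    """Return the stored reference pattern whose deviation is nearest to the input's."""
    dev = deviation(voice, proto)
    return min(references, key=lambda name: math.dist(dev, references[name]))


# Invented (f0 in Hz, vocal tract length in cm) values for previously heard voices.
heard_voices = [(110.0, 17.5), (125.0, 17.0), (210.0, 14.5), (225.0, 14.0)]
proto = prototype(heard_voices)

# Reference patterns may stand for one familiar speaker or for a group template.
references = {
    "familiar speaker A": deviation((112.0, 17.4), proto),
    "group template B": deviation((215.0, 14.3), proto),
}

match = closest_reference((118.0, 17.2), proto, references)
print(proto, match)
```

Because the two dimensions here are on different numerical scales, f0 dominates the distance; a more careful sketch would standardize each dimension before comparing deviations.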
Furthermore, voice-identity processing is often likened to face-identity processing (Yovel & Belin, 2013), with a person’s voice sometimes referred to as their “auditory face” (Belin et al., 2004, 2011; Young et al., 2020). This account, originally developed based on the model of face perception (Bruce & Young, 1986), emphasizes the similarity between voice and face processing (Schirmer, 2018; Young et al., 2020). It suggests that incoming acoustic signals undergo general low-level auditory analysis before being processed in a structural analysis stage where three essential aspects (linguistic, voice, and affective information) are processed through dissociable but interacting pathways. The voice information pathway connects to higher-order semantic nodes of the speaker’s identity, which in turn link to other modalities such as the visual system.
The interplay between voice and linguistic content
Voice is not only a medium for personal identity but also a vehicle for linguistic content (Ladefoged & Broadbent, 1957; Scott, 2019). These dual functions give rise to an interplay between voice and language processing. Over the years, this interplay has been approached from different perspectives, which can be broadly characterized by their focus. Some theories, often grouped as the two-system view, suggest that voice and language are processed independently. In contrast, the one-system view proposes that voice and language are processed within a single cognitive system from the very beginning. As much of the literature suggests a middle ground where these processes are separable but not wholly independent (e.g., Mullennix & Pisoni, 1990), these two views are perhaps best seen as theoretical endpoints on a continuum.
The two-system view
Models of voice processing such as the “auditory face” model assume that voice is processed independently from linguistic content. Similarly, abstractionist theories of speech processing (e.g., Liberman & Mattingly, 1985; McQueen et al., 2006) suggest that language comprehension is independent of the processing of paralinguistic information like the speaker’s identity. This framework is known as the two-system view. In this view, linguistic and paralinguistic (e.g., speaker-related) information are processed in different systems, meaning that indexical information is retained separately (but not completely discarded) (Magnuson & Nusbaum, 2007). To cope with the variability in speech signals from different speakers, the linguistic system engages in normalization, an active control process that resolves the many-to-many mapping between acoustics and phonetic categories (Choi et al., 2018; Magnuson et al., 2021) by tuning the speech processing system to speaker-specific acoustic properties (Sjerps et al., 2019).
The hypothesis of speaker normalization has been supported by studies demonstrating performance costs, such as reduced accuracy and slower processing speed, when listeners perceive speech from multiple speakers compared to a single speaker (Clopper & Pisoni, 2004a; Mullennix et al., 1989). Listeners who are told to expect two different speakers experience these performance costs, while listeners who expect a single speaker do not (Magnuson & Nusbaum, 2007; but see Luthra et al., 2021). Furthermore, neuroimaging evidence shows that perceiving speech under a mixed-speaker condition results in greater activity in the middle/superior temporal and superior parietal regions, compared to a blocked-speaker condition. This increased neural activity in the temporal-parietal network was considered to reflect the heightened demand for selective attention required in resolving acoustic–phonetic ambiguities introduced by multiple speakers (Wong et al., 2004). Traditionally, these behavioral costs and increased neural activity have been interpreted as the cognitive load associated with the active renormalization of phonetic categories. However, more recent research suggests that these effects may not be driven solely by normalization. An alternative “auditory attention” hypothesis proposes that talker switches act as salient stimulus discontinuities that disrupt auditory streaming, triggering an involuntary reorienting of attention (Lim et al., 2019; Luthra, 2024). Importantly, these views are likely complementary rather than mutually exclusive: recent evidence suggests that multitalker processing costs may be driven by both attentional disruptions over short time scales and phonetic normalization over longer time scales (Luthra, 2024).
The two-system view is further supported by neuroimaging evidence indicating that voice and language are processed in separate regions in the brain. The left STG is sensitive to linguistic content, processing phonetic (Yi et al., 2019) and syntactic information (Friederici et al., 2010). This area exhibits flexibility in adapting to different listening environments (Evans et al., 2016). In contrast, the right temporal regions focus more on voice-specific information (Lattner et al., 2005), which is often associated with the speaker’s identity. This reflects a hemispheric asymmetry: the left hemisphere is more involved in language processing, while the right is more attuned to nonlinguistic vocal features (González & McLennan, 2007; Schall et al., 2015; Scott, 2019).
Some researchers suggest that this asymmetry arises from differences in the temporal scale of acoustic information processed by the two hemispheres (Creel & Bregman, 2011; Creel & Tumlin, 2011; Poeppel, 2003). The left hemisphere focuses on rapid temporal events, aligning with linguistic elements necessary for speech perception. Conversely, the right hemisphere is sensitive to slower temporal events, which often correspond to the nonlinguistic features that indicate the speaker’s identity. However, this purely acoustic account has been challenged by more recent evidence. For example, Myers and Theodore (2017) found that the right hemisphere is recruited to process voice-onset time (a classic rapid temporal cue) when that cue serves as a marker of speaker identity. This suggests that hemispheric specialization might not be driven solely by the physical temporal scale of the stimulus, but also by its functional significance – whether the cue signals linguistic content or vocal identity.
Despite this hemispheric specialization, the two systems must interact to accommodate speaker-specific phonetic variability. Neurobiological evidence shows that the integration of speaker information during speech perception is achieved through coordinated activity of neural networks. Functional connectivity analyses reveal that access to speaker-specific phonetic patterns relies on interactions between the left-lateralized phonetic processing system and the right-lateralized voice processing system, with the right posterior temporal cortex serving as a crucial interface for this integration (Luthra, 2021; Luthra et al., 2023).
It is important to note that although the two-system view suggests separate processing of voice and linguistic content, it does not dismiss the influence of voice on language processing. Instead, it posits that the influence is at most indirect, in contrast to the direct influence proposed by the one-system view.
The one-system view
At the other endpoint of the theoretical continuum, the one-system view posits that voice and language processing are interdependent and use the same set of representations. According to this view, people learn to distinguish between elements in speech signals that convey meaning and those that identify speakers. Some research also emphasizes that both voice and language processing share an evolutionary root in humans’ early ability to recognize individuals from vocal cues (e.g., Creel & Bregman, 2011). The one-system view is represented by exemplar-based theories, which propose that the human memory system, including the mental lexicon, stores detailed records of prior experiences with various stimuli (Medin & Schaffer, 1978; Nosofsky, 1986). When new stimuli are encountered, they are compared with stored exemplars for classification. If a new stimulus matches a stored exemplar, the memory of that exemplar is reinforced; otherwise, a new exemplar is created and stored (Gradoville, 2023).
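The classify-then-store cycle described above can be sketched in Python as follows. The sketch borrows the exponentially decaying similarity function of the generalized context model (Nosofsky, 1986); the feature vectors, the sensitivity value, and the threshold that decides whether an incoming stimulus "matches" a stored exemplar are illustrative assumptions rather than parameters from any cited study. In an exemplar framework, the feature vector would carry speaker-related as well as linguistically relevant acoustic detail.

```python
import math
from dataclasses import dataclass


@dataclass
class Exemplar:
    features: tuple        # acoustic detail of one stored episode (hypothetical dimensions)
    label: str             # category the episode was stored under
    strength: float = 1.0  # grows when a later stimulus matches this exemplar


class ExemplarMemory:
    def __init__(self, sensitivity=1.0, match_threshold=0.9):
        self.sensitivity = sensitivity          # assumed similarity decay rate
        self.match_threshold = match_threshold  # assumed criterion for a "match"
        self.exemplars = []

    def similarity(self, a, b):
        """Similarity decays exponentially with distance in feature space."""
        return math.exp(-self.sensitivity * math.dist(a, b))

    def classify(self, features):
        """Probability of each label from strength-weighted summed similarity."""
        evidence = {}
        for ex in self.exemplars:
            s = ex.strength * self.similarity(features, ex.features)
            evidence[ex.label] = evidence.get(ex.label, 0.0) + s
        total = sum(evidence.values())
        if total == 0.0:
            return {}  # empty memory, or nothing remotely similar
        return {label: e / total for label, e in evidence.items()}

    def store(self, features, label):
        """Reinforce a matching exemplar, otherwise create and store a new one."""
        for ex in self.exemplars:
            close_enough = self.similarity(features, ex.features) >= self.match_threshold
            if ex.label == label and close_enough:
                ex.strength += 1.0
                return ex
        new = Exemplar(tuple(features), label)
        self.exemplars.append(new)
        return new


memory = ExemplarMemory()
memory.store((1.0, 1.0), "word A")
memory.store((1.05, 1.0), "word A")  # close to the first trace, so it reinforces it
memory.store((3.0, 3.0), "word B")
probabilities = memory.classify((1.2, 1.1))
print(probabilities)
```

In this toy memory the second "word A" token strengthens the existing trace instead of adding a new one, and a new stimulus near that trace is classified as "word A" with high probability.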
In a radical version of exemplar-based theories (e.g., Goldinger, 1996, 1998), memory systems (e.g., the mental lexicon) store intact episodes with detailed acoustic traces. Incoming speech signals activate similar acoustic traces in episodic memory, leading to identification. From this standpoint, there is no distinction, as far as speech perception is concerned, in the nature of representations between linguistic units and voice: both are encoded as unified records in the memory system. This comprehensive record-keeping allows for the emergence of various information clusters, such as words or speakers (Werker & Curtin, 2005).
Listeners direct their attention towards different clusters depending on the task, such as speech perception or voice identification. For example, listeners can flexibly allocate attention to speaker identity or phonemic information depending on the utility of that information for the task at hand (Creel & Tumlin, 2011). This task-dependent processing is also supported by neuroimaging evidence showing that the neural encoding of speech sounds is dynamically reshaped by behavioral goals. Cortical response patterns and the phase of cortical oscillations realign to reflect the specific dimension (speaker vs. vowel) to which the listener is attending (Bonte et al., 2009, 2014).
While radical exemplar-based theories are theoretically appealing, they are often criticized for assuming a memory system that stores vast amounts of information and a comprehension system that requires high computational speed, which may not be economical. Additionally, neurophysiological evidence shows that the brain does encode speech by phonetic categories (Chang et al., 2010) and features such as places and manners of articulation (Mesgarani et al., 2014), as well as specific acoustic features like vowel formants (Oganian et al., 2023). A softer version of exemplar-based theories allows for certain degrees of abstraction (e.g., Ambridge, 2020; Goldinger, 2007; Johnson, 2006), suggesting that both detailed episodic traces and abstract linguistic representations can coexist in the mental lexicon. Nonetheless, the core assumption of exemplar-based theories, and the one-system view in general, is that language processing is directly influenced by the acoustic characteristics of the speaker’s voice. This is because, on this view, phonemic and paralinguistic acoustic information are essentially the same in kind and are represented together as one system in the brain.
Why do speaker effects occur during language comprehension?
The distinct perspectives of the two-system and one-system views give rise to accounts of speaker effects with different theoretical focuses. The two-system view, by separating speaker characteristics from linguistic content, gives rise to a top-down speaker-model account, which includes how listeners form phonetic, syntactic, semantic, and pragmatic expectations about the speaker. In contrast, the one-system view, with its focus on holistic memory traces, is most clearly embodied in an acoustic-episode account, which highlights the direct episodic influence of the speaker’s voice on language comprehension.
The speaker-model account
Under the two-system view, the information about a speaker’s identity carried by acoustic signals is processed separately from linguistic content. This information enters the voice-processing system and connects to abstract representations related to the speaker, forming a speaker model. This model includes the listener’s beliefs and knowledge about the speaker, such as their sex, age, socio-economic status, and region of origin. Listeners use this model to form expectations and interpret meaning by integrating the linguistic content with speaker characteristics.
The existence of the speaker model is supported by evidence showing that speaker characteristics can influence language comprehension independently of acoustic variations. For example, Cai et al. (2017) investigated how listeners comprehend cross-dialectally ambiguous English words such as “flat” and “gas.” They showed that listeners had more access to the American meaning when these words were spoken by a speaker with an American accent than by one with a British accent. Critically, such speaker effects do not arise from accent details in a word but instead from a mental model listeners have constructed for the speaker (e.g., a British vs. American English speaker): listeners still had more access to the American meaning of word tokens morphed to be accent-neutral as long as they believed the word tokens were produced by an American English speaker (see also Cai, 2022; King & Sumner, 2015).
The speaker model influences comprehension across various modalities, and speaker effects can occur even when acoustic cues are absent. Geiselman and Bellezza (1977) discovered that listeners confused the gender of the speaker with the gender of the agent during a sentence-memorization task. For example, listeners were more likely to remember the speaker being female for the sentence “The queen spent the money” and being male for the sentence “The gentleman entered the house.” Fairchild and Papafragou (2018) showed that readers judged under-informative written sentences (e.g., “Some people have noses with two nostrils”) as more plausible when they believed the sentences were from a non-native speaker compared to a native speaker. This indicates that expectations about a speaker’s linguistic competence modulate pragmatic interpretation even without acoustic input (see also Gibson et al., 2017; Hanulíková et al., 2012). Similarly, Foucart et al. (2019) demonstrated that a brief prior exposure to a speaker’s foreign accent modulated the neural processing (N400) of subsequent written sentences attributed to that speaker, suggesting that the reduced reliability associated with the accent was integrated into the speaker model and affected comprehension even when the voice was not heard. More recently, Rao et al. (2025a) extended this to AI “speakers,” showing that neural responses to text-based semantic and syntactic anomalies differed significantly depending on whether readers believed the text was generated by a human or a large language model (LLM) (see also Rao et al., 2025b, c). These findings suggest that speaker properties are represented as higher-level abstract features that interact with other domains such as text, and that the speaker model emerges from these combined features.
In addition to these semantic and pragmatic expectations, a speaker model also includes expectations about the speaker’s phonetic characteristics. For example, Johnson et al. (1999) demonstrated that participants who were exposed to a gender-neutral voice perceived vowel boundaries differently based on whether they believed the speaker was male or female. This effect occurred when they saw the video clips of a male or female speaker and persisted even when they were simply instructed to imagine a male or female speaker during the task. Similar speaker model effects on speech perception have also been observed regarding a speaker’s nationality (Niedzielski, 1999), ethnicity (Staum Casasanto, 2008), and age (Hay et al., 2006). An explanation for this is the “ideal adapter” framework (Kleinschmidt, 2019; Kleinschmidt & Jaeger, 2015). In this framework, listeners solve the “lack of invariance” problem (i.e., the fact that one speaker’s acoustic cues for a phoneme, like /s/, differ from another’s) by learning a speaker’s “generative model.” This generative model is a set of statistical distributions for that speaker’s phonetic categories. This framework accounts for how listeners recognize a familiar speaker by deploying a stored, speaker-specific generative model, and generalize to a group of similar speakers by using a group-level model (e.g., based on accent or gender) as a starting point for adaptation. The influence of these implicit speaker-phonetic beliefs extends beyond early speech perception; they can also modulate the dynamics of lexical access, such as restricting competition from words that are phonologically incompatible with a speaker’s accent (Trude & Brown-Schmidt, 2012), and influence recognition of words (Luthra et al., 2018).
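The logic of starting from a group-level model and adapting to an individual speaker can be illustrated with a deliberately simplified Python sketch. It is not the implementation used by Kleinschmidt and Jaeger (2015): it tracks only the mean of a single acoustic cue for one phonetic category, assumes Gaussian distributions with a known within-category variance, and uses invented numbers. The update itself is the standard conjugate (normal-normal) rule for a Gaussian mean.

```python
def update_category_mean(prior_mean, prior_var, observations, obs_var):
    """Posterior belief about a speaker's category mean after hearing some tokens.

    prior_mean, prior_var: group-level belief about where this category's cue lies.
    observations: cue values of tokens heard from the new speaker.
    obs_var: assumed (known) token-to-token variance within the category.
    """
    n = len(observations)
    if n == 0:
        return prior_mean, prior_var  # nothing heard yet: rely on the group-level prior
    posterior_precision = 1.0 / prior_var + n / obs_var
    posterior_mean = (
        prior_mean / prior_var + sum(observations) / obs_var
    ) / posterior_precision
    return posterior_mean, 1.0 / posterior_precision


# Hypothetical cue values (Hz) for one category; all numbers are invented.
group_mean, group_var = 6000.0, 500.0 ** 2   # broad demographic prior
tokens = [5200.0, 5300.0, 5100.0]            # tokens heard from the new speaker
token_var = 300.0 ** 2

speaker_mean, speaker_var = update_category_mean(
    group_mean, group_var, tokens, token_var
)
print(round(speaker_mean, 1), round(speaker_var ** 0.5, 1))
```

With each additional token, the estimated mean moves from the group-level expectation toward the speaker's own productions and the uncertainty shrinks, which mirrors the refinement of a broad group-level prior into a speaker-specific model.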
The acoustic-episode account
Under the one-system view, the intertwined nature of linguistic and speaker representations provides an intuitive explanation for speaker effects. The acoustic-episode account, most closely associated with exemplar-based theories, suggests that a speaker’s identity influences speech processing by providing a greater or lesser acoustic match to listeners’ previous encounters with specific speech episodes (Goldinger, 1996, 1998; Kapnoula & Samuel, 2019; Pufahl & Samuel, 2014). When a word is produced by a familiar speaker, the acoustic details match the listener’s episodic memory better than when it is produced by a new speaker (Creel & Tumlin, 2011), leading to speaker effects in speech perception.
In a study by Goldinger (1996), participants were exposed to a list of words spoken by various speakers in a study phase. Later in a test phase, they were presented with another list of words and asked to determine whether each word had been previously heard. The results indicated that they were more accurate in identifying words as previously heard when words were spoken by the same speaker between the study phase and the test phase, compared to when the words were spoken by different speakers. Further research showed that recognition was even better in cases where word tokens were identical (i.e., the same recording), compared to cases where word tokens were not identical (i.e., different recordings) even if uttered by the same speaker (Clapp, Vaughn, Todd et al., 2023b). On the other hand, when learning novel words with similar pronunciations, participants distinguished the words faster when spoken by different speakers during the study phase than by the same speaker (Creel et al., 2008; Creel & Tumlin, 2011); this effect could be detected even when the study phase and the test phase were 24 h apart, suggesting that the speaker’s voice may be encoded as part of the mental lexicon (Kapnoula & Samuel, 2019). These findings support the notion that detailed acoustic information, including speaker-specific characteristics, is stored in memory and directly influences speech processing.
Interestingly, the influence of acoustic episodes extends to other acoustic information beyond the speaker’s voice, suggesting a highly episodic mechanism in speech perception. In a study by Pufahl and Samuel (2014), participants listened to spoken words accompanied by environmental sounds (e.g., a phone ringing or a dog barking), and made an animacy decision for each word. Later in a test phase, participants’ ability to identify acoustically filtered versions of those words was impaired to a similar degree either when the voice changed (e.g., test words were accompanied by the same environmental sound but spoken by a different speaker) or when the environmental sound changed (e.g., test words were spoken by the same speaker but accompanied by a different environmental sound). Similar effects have been observed with background white noise and sine-wave noise (Cooper et al., 2015; Cooper & Bradlow, 2017; Creel et al., 2012; Strori et al., 2018). These findings suggest that lexical and sound representations are deeply integrated, with acoustic-episodic memory directly impacting speech processing.
An integrative model of language and speaker processing
The acoustic-episode account and the speaker-model account offer distinct perspectives on the locus and nature of speaker effects (see Creel, 2014 for a similar discussion). The acoustic-episode account assumes that speaker effects arise from bottom-up perceptual processes. In this view, listeners search their memories for the best episodic match to incoming speech signals to determine the word and meaning of a speech token. The speaker’s voice, along with other acoustic details, is considered an integral part of the mental representation of spoken words, and these detailed representations directly influence language comprehension. Conversely, the speaker-model account assumes that speaker effects occur in top-down expectation-based processes. According to this account, listeners construct a comprehensive model of the speaker, which includes their beliefs and knowledge about the speaker’s characteristics. Listeners then use this model to form expectations and interpret the message by integrating the speaker’s characteristics.
While these two accounts may seem contradictory at first glance, they are not mutually exclusive. Speaker effects can take place at multiple representational levels simultaneously (Creel & Tumlin, 2011). Each mechanism can contribute to a speaker effect to varying degrees depending on task requirements. To reconcile these two accounts, we propose an integrative model of language and speaker processing that incorporates both bottom-up influences of acoustic episodes and top-down influences of the speaker model on language comprehension.
As illustrated in Fig. 1, incoming sound signals are perceived and form acoustic representations. These acoustic representations are considered unified records of acoustics that do not distinguish between types of information, such as linguistic content or speaker identity. Instead, the acoustic representations capture the complete range of acoustic details present in the speech signal, including both linguistic and paralinguistic information. Listeners can allocate their attention to different aspects of the acoustic representations depending on the context and task requirements, allowing for the emergence of different clusters of acoustics. For example, in a speech perception task, listeners may allocate their attention to distinguishing acoustic clusters between different phonemes and words; in a speaker identification task, listeners may focus on the difference between clusters that represent different speakers.
Fig. 1. Schematic representation of an integrative model of language and speaker processing. Solid arrows indicate the primary feedforward flow of information involved in constructing the message, moving from acoustic-episodic representations to the formation of the speaker model and linguistic representations. Dashed arrows represent modulatory or feedback influences. Specifically, the speaker model modulates speech perception and meaning access, while linguistic features also inform and modify the speaker model. Additionally, the constructed message can trigger reanalysis of linguistic information and the updating of the speaker model.
These acoustic representations proceed through two pathways: one for processing linguistic information and the other for processing speaker information. In the language comprehension pathway, the relevant acoustic features map onto linguistic categories, including smaller units such as phonemes and syllables, and larger units such as words and phrases, ultimately accessing the linguistic meaning. In the speaker perception pathway, the relevant acoustic features map onto representations related to the speaker’s characteristics, constructing a model that incorporates information about a specific individual (individual speaker model), or a template model about a social group (demographic speaker model).
An individual speaker model refers to the listener’s mental representation of a specific, familiar speaker, encompassing a wide range of information such as the speaker’s unique voice characteristics, speaking style, personality traits, background knowledge, and shared experiences with the listener. When a listener encounters a familiar speaker, the acoustic features of the speaker’s voice activate the corresponding individual speaker model, which then influences language comprehension by providing a rich context for interpreting the speaker’s utterances. On the other hand, a demographic speaker model refers to the listener’s mental representation of a social group or category to which a speaker belongs, based on the listener’s general knowledge, beliefs, and stereotypes about the characteristics typically associated with members of that group. When a listener encounters an unfamiliar speaker, they may rely on demographic models to make inferences about the speaker’s characteristics and to guide their expectations.
Individual and demographic speaker models are not entirely separate; rather, they exist on a continuum and can influence each other. On the one hand, the construction of individual models is usually based on initial demographic models, as a listener’s prior experiences with speakers from a particular social group may shape their expectations and biases when encountering a new speaker from that same group; on the other hand, as a listener gains more experience with a particular speaker, they may begin to develop an individual model of that speaker that gradually overrides or modifies the initial demographic model. The relationship between an individual model and a demographic model also aligns with the distinction made in the “ideal adapter” framework between speaker-specific generative models and more general, group-level priors (Kleinschmidt, 2019; Kleinschmidt & Jaeger, 2015).
The speaker model modulates the language comprehension pathway at multiple levels from the top down. At the level of speech perception, the speaker model biases phonetic and lexical processing by applying different prior probabilities to linguistic units. For example, if the speaker model indicates that the speaker might be from a particular dialect region (a demographic model) or is a specific person known to produce /s/ with a low-frequency spectrum (an individual model), it may assign higher probabilities to phonetic and lexical variants associated with that speaker (Kleinschmidt, 2019; Kleinschmidt & Jaeger, 2015; Sumner et al., 2014). At the level of meaning access, the speaker model influences meaning interpretation by creating a context that biases dominant word meanings and pragmatic inferences for sentences. For example, if the speaker model suggests that the speaker might be an American English speaker, it may bias the interpretation of ambiguous words or phrases towards meanings more commonly used in American English. Finally, the message is interpreted by integrating the linguistic information with speaker information provided by the speaker model.
It should be noted that the modulation between language and speaker processing is bidirectional. For example, structured phonetic variation in the input also facilitates speaker identification (Ganugapati & Theodore, 2019). Specific linguistic features, such as accent, inform listeners about speaker attributes like region of origin (e.g., identifying a speaker as British vs. American; Cai et al., 2017; Martin et al., 2016), and the speaker model can also be informed by the speaker’s lexical and syntactic choices (Porter et al., 2016) and linguistic style (Bradac et al., 1976).
In summary, the proposed model is “integrative” in two senses. In one sense, during language and speaker processing, the bottom-up perception and top-down expectation are integrated, driven by the interaction between detailed acoustic-episodic memory and a more abstract speaker model. In another sense, language and speaker processing are functionally integrated, where the construction of linguistic meaning and the perception of speaker characteristics are not resolved in isolation, but are intertwined throughout comprehension.
Probabilistic processing in the integrative model
The integrative model highlights a dynamic, probabilistic interaction between the speaker model and language processing. This dynamic nature can be formalized using a Bayesian framework (see also Kleinschmidt, 2019; Kleinschmidt & Jaeger, 2015), which describes how listeners integrate prior beliefs about a speaker with incoming evidence. This probabilistic processing occurs at multiple levels, including the modulation of speech perception, the modulation of linguistic meaning access, speaker-contextualized message construction, and the updating of the speaker model by the message.
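The speaker model modulates speech perception. Formally, the probability of identifying a linguistic form (e.g., a phoneme) given the acoustic input and the perceived speaker identity can be expressed as:
$$p(\text{form} \mid \text{acoustics}, \text{speaker}) \propto p(\text{acoustics} \mid \text{form}, \text{speaker}) \, p(\text{form} \mid \text{speaker})$$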
Here, the term p (acoustics | form, speaker) is the likelihood of encountering specific acoustic patterns given that a speaker (with a perceived identity) produces a certain linguistic form. The term p (form | speaker) represents the prior probability of the linguistic form given the speaker. This aligns with the “ideal adapter” framework (Kleinschmidt & Jaeger, 2015), where listeners utilize stored statistical distributions associated with that identity to bias lower-level phonetic perception. For example, upon hearing an ambiguous fricative sound, a listener’s perception of it as /s/ or /ʃ/ is based not only on the population-level distribution of the linguistic form but also on the specific phonetic habits of that speaker.
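The computation described here can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each fricative category is summarized by a single Gaussian-distributed cue (spectral centre of gravity), and all numerical values, as well as the names `posterior_over_forms`, `population`, and `low_s_speaker`, are hypothetical and chosen only for illustration.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_over_forms(cue, likelihoods, priors):
    """Return p(form | acoustics, speaker) for each candidate form.

    likelihoods: form -> (mean, sd) of the cue distribution for this speaker
    priors:      form -> p(form | speaker)
    """
    unnormalized = {
        form: gaussian_pdf(cue, *likelihoods[form]) * priors[form]
        for form in priors
    }
    total = sum(unnormalized.values())
    return {form: value / total for form, value in unnormalized.items()}


# Hypothetical spectral centre-of-gravity values in Hz (illustrative only).
population = {"s": (6500.0, 600.0), "sh": (4500.0, 600.0)}
# Assumed individual model: a speaker who produces /s/ with a lower spectrum.
low_s_speaker = {"s": (5500.0, 600.0), "sh": (4500.0, 600.0)}
flat_prior = {"s": 0.5, "sh": 0.5}

ambiguous_cue = 5200.0
generic = posterior_over_forms(ambiguous_cue, population, flat_prior)
adapted = posterior_over_forms(ambiguous_cue, low_s_speaker, flat_prior)
print(f"population-level p(/s/) = {generic['s']:.2f}")
print(f"speaker-specific p(/s/) = {adapted['s']:.2f}")
```

Under these assumed values, the same ambiguous token is more likely to be categorized as /ʃ/ when only population-level statistics are available, and more likely to be categorized as /s/ once the speaker-specific distribution is applied.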
Crucially, this flow of information is not strictly bottom-up but involves a dynamic recalibration process. Listeners use disambiguating lexical information to update their beliefs and retune prelexical phonetic categories, as when an ambiguous sound (e.g., midway between /s/ and /f/) occurs in a context where one interpretation yields a valid word (e.g., “giraffe”) and the other a nonword (e.g., “girasse”) (Eisner & McQueen, 2005; Norris et al., 2003). This recalibration involves variable spectral cues like fricatives (Kraljic & Samuel, 2005, 2007) and stable temporal cues like stop consonants (Kraljic & Samuel, 2006). This process tracks cumulative input statistics of a speaker’s speech over time to iteratively update their phonetic categories (Myers & Mesite, 2014; Tzeng et al., 2021).
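The iterative tracking of a speaker's input statistics can be sketched as a sequential update of a category mean. This is a simplified illustration, not the procedure used in the cited studies: it assumes that lexical context has already labelled each ambiguous token as /s/, that the cue is normally distributed with known noise variance, and that the values and the function name `update_category_mean` are hypothetical.

```python
def update_category_mean(prior_mean, prior_var, observations, noise_var):
    """Sequentially update a belief about a speaker's category mean.

    Standard conjugate update for a normal mean with known noise variance.
    Returns the final mean, the final variance, and the mean after each token.
    """
    mean, var = prior_mean, prior_var
    trajectory = [mean]
    for x in observations:
        gain = var / (var + noise_var)  # weight given to the new token
        mean = mean + gain * (x - mean)
        var = (1.0 - gain) * var
        trajectory.append(mean)
    return mean, var, trajectory


# Hypothetical values in Hz: a population-level prior for /s/ and five
# lexically disambiguated tokens from a speaker with a lower-spectrum /s/.
prior_mean, prior_var = 6500.0, 400.0 ** 2
noise_var = 600.0 ** 2
tokens = [5500.0] * 5

final_mean, final_var, trajectory = update_category_mean(
    prior_mean, prior_var, tokens, noise_var
)
for n, m in enumerate(trajectory):
    print(f"after {n} tokens: estimated /s/ mean = {m:.0f} Hz")
```

Each additional token moves the estimated category mean further from the population-level prior towards the speaker's own productions, while the uncertainty about that mean shrinks.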
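The speaker model modulates linguistic meaning access. This involves evaluating the probability of a certain meaning given the linguistic form and the speaker’s identity, formalized as:
$$p(\text{meaning} \mid \text{form}, \text{speaker}) \propto p(\text{form} \mid \text{meaning}, \text{speaker}) \, p(\text{meaning} \mid \text{speaker})$$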
This process is demonstrated in the comprehension of cross-dialectal ambiguous words. Cai et al. (2017) showed that for a word like “bonnet,” listeners were more likely to interpret it as a car part (rather than a type of hat) when the speaker was British than when the speaker was American. In the current framework, the term p (meaning | speaker) represents the prior probability of the speaker expressing a specific concept. In this case, the prior probability of referring to a car part or a hat may be similar across English speakers (i.e., Americans and British people are equally likely to talk about cars or hats). Consequently, the access to the meaning is largely determined by the likelihood p (form | meaning, speaker). If the speaker is British, the likelihood p (form = “bonnet” | meaning = car part, speaker = British) is high; if the speaker is American, the likelihood shifts: p (form = “bonnet” | meaning = hat, speaker = American) is now high while the likelihood that an American uses “bonnet” for a car part is low (as they would use “hood”). This shift in the likelihood driven by the speaker identity boosts the accessibility of the hat meaning when the listener perceives an American accent.
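The “bonnet” example can be worked through numerically. The sketch below is only an illustration under stated assumptions: the likelihood values are invented rather than estimated from any corpus or from Cai et al. (2017), the priors are set equal following the reasoning in the text, and the function name `meaning_posterior` is hypothetical.

```python
MEANINGS = ("car part", "hat")


def meaning_posterior(form, speaker, likelihood, prior):
    """p(meaning | form, speaker) via Bayes' rule over candidate meanings."""
    scores = {
        meaning: likelihood[(form, meaning, speaker)] * prior[(meaning, speaker)]
        for meaning in MEANINGS
    }
    total = sum(scores.values())
    return {meaning: score / total for meaning, score in scores.items()}


# Invented values for p(form = "bonnet" | meaning, speaker).
likelihood = {
    ("bonnet", "car part", "British"): 0.90,
    ("bonnet", "hat", "British"): 0.50,
    ("bonnet", "car part", "American"): 0.02,  # Americans would say "hood"
    ("bonnet", "hat", "American"): 0.50,
}
# Equal priors p(meaning | speaker): both groups talk about cars and hats.
prior = {(m, s): 0.5 for m in MEANINGS for s in ("British", "American")}

british = meaning_posterior("bonnet", "British", likelihood, prior)
american = meaning_posterior("bonnet", "American", likelihood, prior)
print("British speaker: ", {m: round(p, 2) for m, p in british.items()})
print("American speaker:", {m: round(p, 2) for m, p in american.items()})
```

Because the priors are equal, the difference between the two posteriors is carried entirely by the speaker-dependent likelihood term, which is the point made in the text.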
The listener constructs the final message (i.e., the speaker-contextualized meaning) by integrating the linguistic meaning with the speaker information. This involves a rational evaluation of the joint probability of these two components:
For example, Van Berkum et al. (2008) found that hearing the meaning “Every evening I drink some wine” from a child speaker creates a conflict, triggering an N400 effect, because the joint probability p (meaning = drink wine, speaker = child) is low. Wu and Cai (2026) further showed that this joint probability serves as a cue for selecting the appropriate processing strategy. If the joint probability is low but still within a reasonable range (e.g., a social-stereotype violation such as a man talking about himself regularly getting a manicure), the listener engages in effortful integration of social stereotypes and the speaker’s identity, reflected as an N400 effect. However, if the joint probability is very low (e.g., a perceived biological violation like a man talking about himself getting pregnant),3 the listener treats the input as an error and engages in correction/reanalysis, reflected as a P600 effect. As discussed in Wu and Cai (2026), this P600 “error correction” process may itself involve a new probabilistic inference, such as re-evaluating the perceived speaker identity (e.g., misinterpretation of speaker gender based on the voice) or the perceived linguistic content (e.g., misperception of words or inferring a metaphorical interpretation).
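The idea that the joint probability acts as a cue for strategy selection can be sketched as follows. This is a deliberately crude illustration: the chain-rule factorization is standard probability, but the two cut-off values, the probabilities assigned to each example, and the function names are assumptions introduced here, and the cited studies do not propose specific numerical thresholds.

```python
def joint_probability(p_meaning_given_speaker, p_speaker):
    """Chain rule: p(meaning, speaker) = p(meaning | speaker) * p(speaker)."""
    return p_meaning_given_speaker * p_speaker


def processing_strategy(joint, low=0.05, very_low=0.001):
    """Map a joint probability onto a processing strategy.

    The thresholds `low` and `very_low` are hypothetical placeholders.
    """
    if joint < very_low:
        return "error correction / reanalysis (P600-like)"
    if joint < low:
        return "effortful integration (N400-like)"
    return "routine integration"


# Invented probabilities; the speaker identity is taken as certain (p = 1.0).
examples = {
    "adult says they drink wine": joint_probability(0.30, 1.0),
    "man says he gets a manicure": joint_probability(0.01, 1.0),
    "man says he is pregnant": joint_probability(0.000001, 1.0),
}
strategies = {label: processing_strategy(p) for label, p in examples.items()}
for label, strategy in strategies.items():
    print(f"{label}: {strategy}")
```

A graded mapping would probably be more realistic than hard thresholds; the step function is used here only to keep the two regimes described in the text visually distinct.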
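Finally, the speaker model is updated in light of the message. This updating process allows the model to evolve from demographic stereotypes to individualized representations. This can be formalized as a belief update:
$$p(\text{speaker model} \mid \text{message}) \propto p(\text{message} \mid \text{speaker model}) \, p(\text{speaker model})$$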
Here, the posterior belief about the speaker, p (speaker model | message), is updated based on the likelihood of observing the current message, p (message | speaker model), and the listener’s prior beliefs about that speaker, p (speaker model). For example, Wu et al. (2025) showed that listeners track the frequency of a speaker making stereotype-incongruent statements. Hearing a child say “I drink whisky every night” for the first time might be a surprising message. However, if the child keeps talking about leading a stereotypically adult lifestyle, the listener updates their speaker model. Wu et al. found that listeners exhibited different neural oscillatory responses depending on the frequency of stereotype-incongruent statements made by a speaker. This indicates that listeners dynamically update their prior speaker model based on the cumulative evidence provided by the message.
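The cumulative updating described here can be sketched as repeated application of Bayes' rule over a small set of candidate speaker models. The sketch is illustrative only: it assumes just two candidate models, the probabilities are invented, and the labels and the function name `update_speaker_model` are hypothetical rather than taken from Wu et al. (2025).

```python
import math


def update_speaker_model(belief, likelihoods, message):
    """One Bayesian step over candidate speaker models.

    The posterior p(model | message) is proportional to
    p(message | model) * p(model). Also returns the surprisal (in bits)
    of the message under the belief held before the update.
    """
    unnormalized = {m: likelihoods[m][message] * belief[m] for m in belief}
    total = sum(unnormalized.values())  # p(message) under the current belief
    posterior = {m: value / total for m, value in unnormalized.items()}
    return posterior, -math.log2(total)


# Invented values: a demographic prior strongly favouring a typical child.
belief = {"typical child": 0.95, "atypical child": 0.05}
likelihoods = {
    "typical child": {"incongruent": 0.02, "congruent": 0.98},
    "atypical child": {"incongruent": 0.60, "congruent": 0.40},
}

surprisals = []
for _ in range(3):  # three stereotype-incongruent statements in a row
    belief, surprisal = update_speaker_model(belief, likelihoods, "incongruent")
    surprisals.append(surprisal)
    print(f"p(atypical child) = {belief['atypical child']:.2f}, "
          f"surprisal = {surprisal:.2f} bits")
```

With these assumed values, each further incongruent statement is less surprising than the last, because the belief shifts from the demographic prior towards an individualized model of this particular speaker.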
The temporal dynamics of the integrative model
In the integrative model, the interplay between bottom-up acoustic episodes and top-down speaker models occurs rapidly and incrementally as speech unfolds. The influence of acoustic episodes on spoken language processing emerges very early. For example, Creel and Tumlin (2011) demonstrated that listeners use talker-specific acoustic details to distinguish competing words as early as 200 ms after word onset. This suggests that the retrieval of acoustic-episodic traces occurs almost simultaneously with initial phonetic analysis, rapidly constraining lexical selection before the word is fully articulated. This aligns with the general time course of acoustic processing in spoken language comprehension, where acoustic-phonetic analysis occurs within the first 80–200 ms (Tezcan et al., 2023).
The integration of the speaker model with linguistic content does not wait until the end of a sentence; rather, it occurs incrementally as a sentence unfolds. For example, Van Berkum et al. (2008) showed that when a specific word in a sentence mismatches the speaker’s identity in terms of social stereotypes, the brain detects this conflict within 200–300 ms of the word’s onset, eliciting an N400 effect (similar results were reported in Pélissier & Ferragne, 2022, van den Brink et al., 2012, and Wu & Cai, 2026). This timing suggests that the speaker model is continuously active and integrates dynamically with linguistic content. Although some studies report a P600 effect instead of an N400 in response to speaker-content mismatch, indicating a later-stage integration (e.g., Foucart et al., 2015; Lattner & Friederici, 2003), the integrative model interprets these results as a subsequent inference process involving error correction or reanalysis (Wu & Cai, 2026), as discussed in the previous section. Thus, within the integrative model, speaker effects are dynamic: they can manifest as early perceptual biases, concurrent semantic integration, or later error correction, depending on the nature of the input and the listener’s rational inference.
Speaker-idiosyncrasy effects and speaker-demographics effects in language comprehension
The term identity can refer to the idiosyncratic characteristics of an individual speaker, that is, the unique traits and perspectives that distinguish one person from another. Speaker effects occurring at this level are defined as speaker-idiosyncrasy effects. Alternatively, identity can refer to the collective attributes of a demographic group, reflecting shared characteristics typical of a specific social, ethnic, gender, or age group. Speaker effects at this level are defined as speaker-demographics effects.
However, it is important to note that this distinction is not binary. As similarly proposed by Kleinschmidt (2019), speakers can be conceptualized within a hierarchy of group membership. Listeners’ beliefs about a speaker become more specific as the grouping becomes more precise, moving along a continuum from broad demographic categories (e.g., gender, ethnicity) to highly specific idiosyncratic traits. In this view, demographic representations emerge from the experience of interacting with individuals within a demographic group, while these demographic features can, in turn, serve as a basis or prior for forming expectations about a specific individual. In this section, we review studies that examine speaker-idiosyncrasy effects and those that explore speaker-demographics effects.
Speaker-idiosyncrasy effects
The speaker-idiosyncrasy effect refers to how a speaker’s unique characteristics, along with the listener’s prior experience with that speaker, can influence language comprehension. Research shows that speech is more intelligible from a familiar speaker than from an unfamiliar speaker, a phenomenon known as the familiar talker advantage (Domingo et al., 2020; Souza et al., 2013). Evidence suggests this advantage relies on precise, linguistically specific knowledge, as listeners trained on a voice in one language did not show improved intelligibility for that speaker in another language (Levi et al., 2011). Furthermore, this advantage appears not to require explicit recognition of the speaker’s identity, as listeners retained the intelligibility advantage even when acoustic manipulations (e.g., of vocal tract length) prevented them from consciously identifying the voice (Holmes et al., 2018). The advantage was also found to be context-dependent, manifesting most strongly when a competing masker is linguistically similar (e.g., speech) rather than dissimilar (e.g., noise), and to correlate with the listener’s accuracy in learning the voice (Levi et al., 2019). Neuroimaging evidence further shows that familiar voices elicit more robust neural representations in the posterior STG and MTG (Holmes & Johnsrude, 2021), regions known to represent phonetic categories. These results highlight the influence of both acoustic-episodic memory and the speaker model. The fact that the intelligibility benefit survives without explicit speaker identification (Holmes et al., 2018) suggests that low-level acoustic-episodic traces can directly facilitate processing. However, the linguistic specificity of the effect (Levi et al., 2011) indicates that the speaker model must also provide precise, speaker-specific priors for phonetic categories to resolve ambiguity.
Beyond speech intelligibility, research shows that word recognition is faster and more accurate when words are spoken by the same speaker during both learning and test. Craik and Kirsner (1974) had participants listen to a string of words and decide if each word had appeared earlier in the sequence (i.e., whether the word was repeated). Repeated words were spoken either by the same speaker or by a different speaker. They found that participants’ responses to the words repeated by the same speaker were more accurate than responses to those repeated by a different speaker. This result was replicated in further studies (Clapp, Vaughn, Sumner et al., 2023a; Clapp, Vaughn, Todd et al., 2023b; Goh, 2005; Goldinger, 1996; Palmeri et al., 1993). In this case, the acoustic match between the initial and repeated tokens of the same word is better when spoken by the same speaker than by different speakers, which leads to more efficient word recognition.
However, the influence of these speaker-specific acoustics on word recognition is not unconditional. Research suggests that these effects are often time-dependent, emerging primarily when processing is relatively slow or difficult. For example, McLennan and Luce (2005) found that these effects emerged in slow responses but were reduced in fast ones. This suggests that abstract phonological representations dominate early, rapid processing (e.g., quickly identifying a clearly spoken word based solely on its phonemes), while specific indexical details emerge only during later, slower processing (e.g., relying on memory of a specific speaker’s voice to help identify a difficult word). This idea was further supported by evidence showing that speaker effects were stronger for the speech of dysarthric individuals than for control individuals, as the increased processing time required to decode degraded speech signals enhanced the retrieval of detailed acoustic-episodic traces (Mattys & Liss, 2008). Theodore et al. (2015) further showed that even when processing is fast, speaker effects emerge if listeners explicitly attend to speaker details during encoding (e.g., actively focusing on who is speaking rather than just what is being said).
Speaker-idiosyncrasy effects also occur in higher-level comprehension tasks, such as referent label processing. Typically, listeners expect speakers to consistently use the same label when referring to the same object (Brennan & Clark, 1996; Shintel & Keysar, 2007). For example, if a speaker initially refers to a piece of furniture as a “couch,” listeners anticipate that the speaker will continue using this label, rather than switching to an alternative label like “sofa.” When speakers occasionally switch to a different label, comprehension can be disrupted (Barr & Keysar, 2002). Experiments in referent label processing usually involve two phases: initially, a speaker uses a label for an object; later, either the same or a different speaker uses the same or an alternative label for that object – a process known as label switching. Evidence shows that the disruptive effect of label switching is modulated by the speaker’s identity. Metzing and Brennan (2003) found that when hearing a new referent label, listeners were slower to find the object when the new label was uttered by the original speaker than by a different speaker. This finding has been replicated in further studies using behavioral (Brown-Schmidt, 2009; Horton & Slaten, 2012; Kronmüller & Barr, 2007, 2015) and neurophysiological measures (Bögels et al., 2015). In the context of the integrative model, these studies suggest that listeners develop an individual speaker model based on their experience with a specific speaker’s language use. This model encompasses information about the speaker’s prior label usage. When the same speaker switches to a new label, it violates the listener’s expectations based on their mental model of that speaker, leading to a disruption in comprehension. In contrast, when a different speaker uses a new label, the listener may not have a well-established model for that speaker, resulting in less disruption.
Another example where individual speaker models influence language comprehension is perspective modeling (also known as perspective taking). Listeners actively consider what the speaker can physically see when interpreting messages. In a scenario described by Brown-Schmidt et al. (2015), a speaker and a listener sit at a table on which there are two red triangles and one blue triangle. One of the red triangles is blocked from the speaker’s view but is visible to the listener. When the speaker instructs the listener to move the “red one,” it would not be ambiguous if the listener models the speaker’s perspective, considering what the speaker can see. Studies have shown that perspective modeling significantly affects language comprehension, especially in referent disambiguation (Brown-Schmidt, 2012; Brown-Schmidt et al., 2008; Hanna et al., 2003). This perspective modeling can be considered part of an individual speaker model that encompasses the listener’s understanding of what the speaker knows and does not know (Clark, 1996; Heller et al., 2012; Wu & Keysar, 2007). This aspect of the individual speaker model is constructed through the listener’s experience with the specific speaker and their shared context, and likely shares cognitive mechanisms with the spontaneous modeling of co-listeners (Jouravlev et al., 2019; Rueschemeyer et al., 2015). When the listener encounters an ambiguous referent, they can use their individual speaker model to infer the speaker’s intended meaning based on their knowledge of the speaker’s perspective.
In a proper name comprehension study by Barr et al. (2014), pairs of friends played a communication game in which one friend (addressee) identified a target person from four photos based on a name spoken by their friend or a stranger. The addressee was informed whether the name was chosen by their friend or the stranger. Results showed that addressees identified the target more quickly when the name was spoken by their friend, possibly due to a better match of acoustic details, as they were more familiar with their friend’s voice. Meanwhile, responses were slower when addressees were told that the name was chosen not by the speaker but by the other person (e.g., a friend speaking a name chosen by a stranger), reflecting the addressee’s effort to verify the speaker model regarding whether the speaker knows the target person or not. In this case, the listener’s mental model of their friend includes knowledge about the friend’s social network and familiarity with specific individuals. When the friend speaks a name chosen by the stranger, it conflicts with the listener’s mental model of the friend, prompting them to engage in additional processing to verify the speaker’s knowledge of the target person.
Speaker-demographics effects
The speaker-demographics effect refers to how language comprehension is influenced by the collective attributes of a group of speakers who share characteristics typical of a specific social, ethnic, gender, or age population. In the referent label processing studies discussed in the previous section, researchers manipulated the speaker’s identity by contrasting whether the speaker who switched referent labels was the one who established the original label in the first place. This involves comparing specific individuals within the same demographic group (e.g., adult speaker A vs. adult speaker B). In contrast, studying speaker-demographics effects involves comparing speakers from different demographic backgrounds (e.g., an adult vs. a child) to examine how group-level expectations influence processing.
Wu et al. (2024) used event-related potentials (ERPs) to explore whether listeners expect a child speaker to be less likely to switch labels compared to an adult speaker, based on the common belief that children are less flexible in language use. They used pictures with alternative labels (e.g., a piece of furniture can be labeled either as a “couch” or as a “sofa”). Each picture was shown twice across two phases. In the establishment phase, participants heard either an adult or a child label a picture and judged whether the label matched the picture. In the test phase, the same speaker either repeated the original label or switched to an alternative label, and participants again judged the label’s match to the picture. ERP results showed that switched labels elicited an N400 effect compared to repeated labels. Importantly, the N400 effect was larger with a child speaker than with an adult speaker, indicating greater difficulty in comprehending switched labels from children than from adults.
In this case, although the possibility that the acoustic difference between the original label and the alternative label might be larger for a child speaker than for an adult speaker cannot be entirely ruled out, the speaker-demographics effect here is likely driven primarily by listeners’ modeling of the speaker’s linguistic flexibility. Specifically, it can be attributed to the listener’s demographic speaker model, which incorporates general beliefs and expectations about the linguistic flexibility of different age groups. When listeners encounter a child speaker, their demographic model suggests that children are less likely to switch labels compared to adults. This top-down influence of the demographic speaker model leads to greater processing difficulty when a child speaker violates this expectation by switching labels.
ERP studies also show that the speaker demographics modulate sentence comprehension. In an early study, Lattner and Friederici (2003) asked participants to listen to self-referential sentences that expressed a stereotypically gendered idea, including stereotypically masculine sentences such as “I like to play soccer” or stereotypically feminine ones such as “I like to wear lipstick.” Each sentence was spoken by both male and female speakers. They found that the mismatch between the speaker’s biological sex (as inferred from their voice) and the stereotypically gendered sentence elicited a P600 effect at the critical words at the end of sentences (e.g., “soccer” spoken by a female speaker and “lipstick” spoken by a male speaker). Van Berkum et al. (2008) used a similar paradigm and tested more demographic attributes, including age and social status. They contrasted sentences such as “Every evening I drink some wine before I go to sleep” spoken by an adult speaker versus by a child speaker. Their results showed that the mismatch between speaker demographics and the linguistic content elicited an N400 effect at the critical word “wine,” similar to the classic N400 effects elicited by semantic anomalies (Kutas & Hillyard, 1980; Van Berkum et al., 1999) and world knowledge violations (Hagoort et al., 2004). These speaker-demographics effects on sentence comprehension have been replicated by further studies using similar paradigms (Foucart et al., 2015; Martin et al., 2016; Pélissier & Ferragne, 2022; Tesink, Petersson et al., 2009b; van den Brink et al., 2012; Wu & Cai, 2026).
These findings can be explained by considering the role of the demographic speaker model in sentence comprehension. As the sentence unfolds, listeners incrementally integrate the sentence meaning with their knowledge about the speaker’s demographic background, which is captured by the demographic speaker model. When the critical word in the sentence conflicts with the expectations generated by the demographic model (e.g., a child speaker talking about drinking wine), it elicits an N400 effect. This effect has been interpreted by most authors as reflecting increased difficulty in integrating the unexpected word into the current speaker context. However, others might interpret the N400 as a lexico-semantic prediction error (DeLong et al., 2005; Nour Eddine et al., 2024). In this view, the speaker model generates probabilistic predictions about likely upcoming words, and the N400 amplitude indexes the degree of mismatch or “surprise” arising when the bottom-up input conflicts with these top-down predictions.
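To make the prediction-error reading concrete, the following is a minimal sketch, assuming that the speaker model can be approximated as a set of next-word probabilities conditioned on the speaker's demographic category. The probability values and the names NEXT_WORD_PROBS and surprisal are hypothetical illustrations and are not taken from any of the cited studies. Surprisal, the negative log probability of a word given its context, is used here only as a common stand-in for the degree of mismatch between input and prediction.

```python
import math

# Hypothetical next-word probabilities after "Every evening I drink some ...",
# conditioned on an assumed demographic speaker model (illustrative values only).
NEXT_WORD_PROBS = {
    "adult": {"wine": 0.20, "milk": 0.10, "water": 0.70},
    "child": {"wine": 0.01, "milk": 0.49, "water": 0.50},
}


def surprisal(word, speaker):
    """Return surprisal in bits: -log2 P(word | context, speaker)."""
    return -math.log2(NEXT_WORD_PROBS[speaker][word])


# The same word is more surprising under the child speaker model.
for speaker in ("adult", "child"):
    print(speaker, round(surprisal("wine", speaker), 2))
```

Under these assumed values, “wine” carries higher surprisal when the speaker model is that of a child than when it is that of an adult, which is the direction of the mismatch effect described above. The sketch does not adjudicate between the integration and prediction-error interpretations.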
Another example of such population-specific word frequency effects is demonstrated in the study by Walker and Hay (2011), in which participants completed an auditory lexical decision task where they listened to words that were more prevalent among older people (e.g., “knitting”) and words that were more prevalent among younger people (e.g., “lifestyle”). All words were presented in the voices of both older and younger speakers. They found that participants responded faster and more accurately when the age of the voice matched the typical age of the word (see Kim, 2016, for a similar finding). The authors interpreted this finding within an exemplar framework, suggesting that lexical access is facilitated when the specific phonetic detail of the input matches the generalized acoustic detail of the listener’s stored exemplars. They argued against a top-down semantic priming account (e.g., the speaker model) by demonstrating that the effect was predicted by objective corpus frequency ratios but not by explicit post hoc ratings of “word age.” However, without specific information about the time course of the effect, it remains possible that the speaker model exerts an implicit top-down influence not captured by explicit ratings.
Speaker effects as indices of language ability and socio-cognitive traits
Despite not directly focusing on the underlying mechanisms of speaker effects, some studies utilize speaker effects as indices for assessing other cognitive abilities. One such ability is language ability, where the influence of acoustic details can reflect the development of an individual’s mental lexicon. Another is socio-cognitive ability, which is often linked to the robustness of the speaker model during communication.
Acoustic-detail effects in phonetic learning
An essential component in language acquisition is learning what elements of speech signals (e.g., phonemes) differentiate meanings. Theoretically, a fully abstract linguistic system would normalize variability that does not distinguish one linguistic unit from another. The presence and magnitude of speaker effects, especially sensitivity to acoustic details during speech perception, can indicate whether language learners have achieved linguistic abstraction. It also reflects whether they can efficiently process linguistically relevant information without being overly influenced by extralinguistic factors like speaker variability. In this sense, attenuated speaker effects (e.g., less disruption caused by speaker changes) may indicate more successful generalization.
During the initial stages of language acquisition, spoken word representations are highly acoustic. This makes it challenging for infants, the primary language learners, to generalize beyond specific acoustic details of their language input. To assess infants’ ability to generalize words across different speakers, Houston and Jusczyk (2000) familiarized infants with isolated words (learning materials) spoken by one speaker and then tested them with passages (test materials) containing those words spoken by another speaker. They discovered that at 7.5 months, infants paid more attention to test materials containing familiar words only when both the learning and test materials were produced by speakers of the same sex. By 10.5 months, the speaker-sex effect was no longer observed, indicating that infants’ word-form representations become more abstract with age (for similar findings, see Schmale & Seidl, 2009).
In a study focusing on young children, Ryalls and Pisoni (1997) used a word-recognition task where children aged 3–5 years were asked to identify words from a list by pointing to corresponding pictures. The words were spoken by either a single speaker or multiple speakers. Results indicated that children’s word recognition was adversely affected by an increased number of speakers. However, as children aged, their ability to process words from multiple speakers improved. Additionally, when asked to repeat the words, younger children matched the duration of the words more closely than older children and adults, suggesting that they retain more acoustic details in their speech representation. These findings imply that infants and young children are more sensitive to acoustic details in speech, with this sensitivity gradually decreasing as they develop (Creel & Tumlin, 2011).
Furthermore, the ability to move beyond these specific acoustic details to generalize across speakers appears to be directly linked to language ability. Levi et al. (2019) investigated the familiar talker advantage in children with varying language abilities. They found that while all children benefitted from familiarity (i.e., successfully mapping specific acoustic details to linguistic units), only those with higher language scores could generalize this knowledge to recognize words spoken by unfamiliar speakers with the same accent. This suggests that while the ability to use speaker-specific acoustic cues is robust even in children with lower language skills, the ability to abstract these patterns to new speakers can indicate higher language ability.
On the other hand, training with multiple speakers can aid speech learning for both first language (Quam & Creel, 2021) and second language (Zhang et al., 2021) acquisition. In an early study, Lively et al. (1993) trained Japanese listeners to distinguish between English /r/ and /l/ sounds, using either multiple speakers or a single speaker. Those trained with multiple speakers successfully generalized their learning to new words spoken by new speakers, whereas those trained with a single speaker did not. This suggests that exposure to multiple speakers fosters more robust and abstract linguistic representations, which can facilitate the development of phonetic categories and the generalization of speech perception ability. Rost and McMurray (2009, 2010) further explored this idea by showing that acoustic variability aids infants in developing phonetic categories, such as /b/ and /p/. Their studies revealed that infants’ phonetic learning could be improved by presenting words produced by multiple speakers, compared to presenting words produced by a single speaker. These findings suggest that speaker variability, although irrelevant to the phonetic contrast itself, can help young language learners acquire those phonetic units (see also Quam et al., 2017). By exposing learners to a wide range of acoustic variations, multi-speaker training may help them extract the invariant features that define phonetic categories, leading to more successful generalization across speakers and contexts.
Speaker model modulated by a listener’s socio-cognitive traits
Language communication is a primary form of social interaction. Consequently, individual differences in social cognition are often reflected in how people process language. Specifically, a listener’s socio-cognitive traits may influence their ability to construct a mental model that accurately captures the features of a specific individual or the general attributes of a demographic group.
This link between social cognition and language processing emerges early in life. Kinzler et al. (2007) demonstrated that infants and young children use acoustic cues (e.g., accent) to form social preferences that guide interaction. They found that 5-month-old infants prefer to look at native-language speakers, 10-month-olds prefer to accept toys from native speakers, and 5-year-olds choose to be friends with children who speak with a native accent rather than a foreign accent. For adults, Dragojevic and Giles (2016) showed that processing fluency (i.e., the ease with which speech is processed) acts as a mechanism for social evaluation. They found that when listeners encountered speech that was difficult to process (e.g., due to an unfamiliar accent), they showed a negative affective reaction. This negative affect, in turn, led listeners to evaluate the speaker more negatively.
While the ability to extract social identity from voice and speech is a hallmark of typical development, disruptions in voice processing are frequently observed in clinical and neurodiverse populations. Along with phonagnosia (also known as pure voice processing deficit, Hailstone et al., 2010; Van Lancker & Canter, 1982), difficulties in voice processing are observed among populations with schizophrenia, dyslexia, and autism (Stevenage, 2018). Individuals with schizophrenia, particularly those experiencing auditory hallucinations, often struggle to recognize a speaker’s identity through voice (Alba-Ferrara et al., 2012; Badcock & Chhabra, 2013; Chhabra et al., 2012). This difficulty is linked to reduced activation in the right STG (Zhang et al., 2008), a region crucial for voice perception (Lattner et al., 2005). Dyslexic individuals generally retain normal facial recognition abilities (Brachacki et al., 1994) but encounter challenges in voice identification (Perea et al., 2014; Perrachione et al., 2011). For autistic individuals, research indicates that challenges in vocal-identity processing often coincide with difficulties in face-identity processing (Boucher et al., 1998), and similar findings are also observed in relation to autistic traits in the general population (Skuk et al., 2019). Individuals with higher autistic traits show reduced activation in the right STS/STG when processing vocal sounds, compared to control individuals (Schelinski et al., 2016).
In the integrative model, deficits in vocal-identity processing may impair the construction and robustness of the speaker model during language comprehension. As the speaker model relies on the listener’s ability to extract and process relevant speaker characteristics from the acoustic signal, difficulties in voice processing may lead to a less accurate representation of the speaker. This, in turn, can affect the top-down influence of the speaker model on language comprehension, potentially leading to impairments in the integration of speaker information with linguistic content.
As a direct investigation of this idea, Tesink, Buitelaar et al. (2009a) used fMRI to explore whether autistic individuals differ from non-autistic controls in how they integrate speaker demographics (inferred from speaker voice) with linguistic content during spoken language comprehension. They found that, compared to control participants, autistic participants showed increased activation in the right inferior frontal gyrus (IFG) for utterances where speaker demographics mismatched the linguistic content, such as “I cannot sleep without my teddy bear in my arms” spoken by an adult speaker. Given their comparable behavioral performance, the authors concluded that it was more difficult for autistic individuals to process speaker properties during language comprehension, and that the heightened IFG activity reflected a cognitive compensation due to increased task demands.
In the general population, speaker effects in language comprehension are influenced by personal traits such as empathy and openness. Using EEG, van den Brink et al. (2012) discovered that individuals with greater empathy showed an increased N400 effect and gamma band oscillatory power when comprehending messages that violated stereotypical expectations associated with the speaker’s population. This suggests that more empathetic individuals may have a more detailed or more readily activated demographic speaker model, leading to greater sensitivity to mismatches between the speaker’s characteristics and linguistic content. Similarly, Wu and Cai (2026) showed that the magnitude of speaker effects elicited by social stereotypes decreased as a function of the participants’ openness trait for both EEG and behavioral measures, as more open-minded people tend to have fewer stereotypical views. Wu et al. (2025) showed that the neural oscillatory response to stereotype-incongruent statements was also modulated by the listener’s openness, specifically within the theta frequency band (4–6 Hz). They found that while participants with lower openness scores tended to exhibit increased theta power when encountering incongruent statements (interpreted as reflecting the effortful maintenance of their initial stereotype-based model), those with higher openness scores tended to show a decrease in theta power, suggesting a flexible deployment of attention to updating the speaker model based on the new input. These findings imply that individuals with higher openness not only rely less on fixed demographic stereotypes as priors but also possess greater cognitive flexibility to dynamically update their speaker models when presented with conflicting evidence.
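The idea that listeners treat demographic stereotypes as priors and revise them in the face of conflicting evidence can be illustrated with a single Bayesian update rule. The following is a minimal sketch, assuming only two hypotheses about the speaker (“typical” versus “atypical” of their demographic group) and hypothetical probability values; the function update_speaker_model and all numbers are illustrative assumptions, not estimates from the cited studies. The sketch captures only differences in prior strength, not differences in how flexibly listeners update.

```python
def update_speaker_model(prior, likelihoods):
    """One Bayesian update: posterior(h) is proportional to prior(h) * P(utterance | h)."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical likelihood of a stereotype-incongruent statement under each
# hypothesis about the speaker (illustrative values only).
LIKELIHOOD_INCONGRUENT = {"typical": 0.05, "atypical": 0.60}

# Two assumed listeners who differ only in how strongly they hold the stereotype.
strong_prior = {"typical": 0.95, "atypical": 0.05}
weak_prior = {"typical": 0.70, "atypical": 0.30}

for name, belief in (("strong prior", strong_prior), ("weak prior", weak_prior)):
    for _ in range(3):  # three incongruent statements in a row
        belief = update_speaker_model(belief, LIKELIHOOD_INCONGRUENT)
    print(name, round(belief["atypical"], 3))
```

With these assumed values, both listeners shift toward the “atypical” hypothesis after hearing incongruent statements, and the listener with the weaker stereotype-based prior shifts further after each statement.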
Future directions: Artificial agents as speakers
Thus far, we have reviewed how language comprehension is influenced by speaker characteristics, ranging from acoustic-episodic memory and shared experience to expectations derived from human demographic categories such as age, gender, and region of origin. However, the rapidly evolving landscape of communication presents a test for the universality of this framework: the emergence of the artificial agent as an interlocutor. As voice-based AI technology transitions from novelty to ambient infrastructure, artificial agents are establishing themselves as a new, synthetic “demographic” group. AI speakers are now ubiquitous in daily life, functioning in various communicative roles, such as virtual assistants (Hoy, 2018), customer service agents (Adam et al., 2021), news anchors (Fitria, 2024), language teachers (Schmidt & Strassner, 2022), navigators (Kun et al., 2007), and even psychotherapists (Fiske et al., 2019). This trend calls for an expansion of research on language comprehension to include artificial agents as a type of speaker and to consider their unique features when studying how AI-produced language is comprehended.
Research suggests that people often attribute human-like qualities to artificial systems, interacting with them as if they were humans (Nass & Moon, 2000; Reeves & Nass, 1996). This interaction involves applying social norms and behaviors such as politeness (Nass et al., 1999), gender stereotypes (Nass et al., 1997), and reciprocity (Fogg & Nass, 1997). In this sense, artificial agents can be viewed as a particular demographic population of “digital humans,” contrasting with the “real human” population. This perspective raises questions about how demographic representations, which are typically based on human social categories, may be adapted or extended to accommodate artificial agents as a unique demographic group.
Studies show that awareness that a speaker is artificial changes the way people interact with them. People tend to control and simplify their language (Amalberti et al., 1993; Kennedy, 1988), exhibit less politeness (Hill et al., 2015), feel less social pressure (Vollmer et al., 2018), and show less desire to establish relationships (Shechtman & Horowitz, 2003) when interacting with artificial agents compared to humans. In the psycholinguistic literature, studies show that people are more likely to reuse lexical expressions previously used by artificial interlocutors than those used by human ones (Branigan et al., 2011; Shen & Wang, 2023). This tendency for lexical repetition is stronger when interacting with basic artificial systems than with advanced ones, possibly in an attempt to enhance understanding with a linguistically limited agent (Branigan et al., 2011; Cai et al., 2021; Pearson et al., 2006).
Despite significant efforts to understand how people interact with artificial agents, limited attention has been paid to how people comprehend their language. Historically, artificial agents were seen as limited in world knowledge (Broussard, 2018) and linguistic capabilities (Kennedy, 1988). However, the development of generative AI has significantly changed the landscape, demonstrating impressive capabilities akin to human creativity (Haase & Hanel, 2023) and language use (Cai et al., 2024). Understanding how AI development influences language comprehension becomes increasingly important, as it may challenge existing assumptions about the limitations of artificial agents.
In one such attempt, Yin et al. (2024) explored whether AI-generated language could make people “feel heard” and whether the “AI label” could influence this feeling. They found that AI-generated messages made participants feel heard to a larger extent than human-generated ones, suggesting that AI was better at detecting emotions in that specific context. However, when participants were told that the messages were from an AI, they felt heard to a lesser extent. This suggests that the “AI identity” affects perceived emotional support in language comprehension. In a direct investigation of how knowing that language is AI-generated influences comprehension, Rao et al. (2025a) used ERPs to test participants’ brain responses when encountering semantic and syntactic anomalies perceived as being produced by a large language model (LLM) versus a human. They found that while participants showed overall N400 effects for semantic anomalies and P600 effects for syntactic anomalies, the semantic N400 effects were smaller and the syntactic P600 effects were larger when participants were informed that the anomalies were produced by an LLM rather than by a human (see also Rao et al., 2025b, c).
Aside from the influence of artificial agents’ non-human nature, a further question is whether this non-human identity interacts with the demographic personas assigned to them. People often attribute traits such as gender, age, and linguistic background to artificial systems. For example, they perceive humanoid artificial agents as male or female based on their appearance (Eyssel & Hegel, 2012) or synthesized voice (Nass et al., 1997). People perceive female agents to be more knowledgeable about dating and use fewer words when explaining dating norms to them than to male agents (Powers et al., 2005). Similarly, people perceive artificial agents as having a certain age based on their facial features (Powers & Kiesler, 2006) or synthesized voice (Sandygulova & O’Hare, 2015). People are more compliant with requests from agents with a baby face than from those with an adult face (Powers & Kiesler, 2006) and prefer a child voice for home companion agents but an adult voice for educational agents (Dou et al., 2021).
These findings align with the idea that people construct an anthropomorphic model of an artificial agent. This anthropomorphic model can be considered a specific type of demographic speaker model, which incorporates expectations about the artificial agent’s characteristics and capabilities based on the attributed demographic features (e.g., gender, age). This model can then influence language comprehension in a similar way to the demographic speaker model for human speakers, by biasing the processing of linguistic content and generating expectations about the speaker’s knowledge, perspectives, and communicative goals.
However, the extent to which the anthropomorphic model of an artificial agent overlaps with or differs from the demographic speaker model for a human speaker remains an open question. It is possible that people have distinct expectations and biases for artificial agents compared to human speakers, even when they are attributed the same demographic features. For example, people may expect a female artificial agent to have different knowledge and capabilities compared to a female human speaker, due to the perceived differences in their underlying nature and origins. This raises the question of whether findings from human language comprehension can be generalized to AI language comprehension, a research area that remains to be explored.
Conclusion
In this review, we propose an integrative model of language and speaker processing to account for speaker effects in language comprehension. We argue that the influence of a speaker’s identity results from the interplay between lower-level acoustic-episodic memory and a higher-level speaker model. We formalize the interaction between language and speaker processing as a bidirectional probabilistic process: prior beliefs about a speaker modulate language comprehension, while the unfolding speech and message continuously update the speaker model. Within this integrative framework, we define speaker-idiosyncrasy effects and speaker-demographics effects, and show how bottom-up and top-down processes interact at various levels depending on the task and context. We suggest that for studies beyond the psycholinguistic domain, speaker effects can be useful indices of language development and socio-cognitive traits. We encourage future research to explore the applicability of these findings to AI speakers, investigating whether the effects observed with human speakers can be generalized to non-human entities.
Authors’ contributions
Hanlin Wu: Conceptualization, visualization, writing – original draft; Zhenguang G. Cai: Writing – review and editing.
Funding
This work was supported by the General Research Fund (grant number: 14600220), University Grants Committee, Hong Kong.
Data availability
Not applicable.
Code availability
Not applicable.
Declarations
Conflicts of interest
The authors have no conflicts of interest to disclose.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Footnotes
In this review, we define comprehension as any process that maps lower- to higher-level linguistic representations (Pickering & Garrod, 2013). This includes lower-level processing such as speech perception, as well as higher-level processing like understanding the meaning of a sentence or the speaker’s intention.
There are cases where speaker effects arise even in the absence of the speaker’s voice. We discuss such cases in The speaker-model account section.
It should be noted that while this example is categorized as a biological violation in the literature, it represents an event of very low probability rather than a strict impossibility, as it is suggested that transgender men can become pregnant (Yoshida et al., 2022).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- Adam, M., Wessel, M., & Benlian, A. (2021). AI-based chatbots in customer service and their effects on user compliance. Electronic Markets, 31(2), 427–445. https://doi.org/10.1007/s12525-020-00414-7
- Alba-Ferrara, L., Weis, S., Damjanovic, L., Rowett, M., & Hausmann, M. (2012). Voice identity recognition failure in patients with schizophrenia. The Journal of Nervous and Mental Disease, 200(9), 784–790.
- Amalberti, R., Carbonell, N., & Falzon, P. (1993). User representations of computer systems in human-computer speech interaction. International Journal of Man-Machine Studies, 38(4), 547–566. https://doi.org/10.1006/imms.1993.1026
- Ambridge, B. (2020). Abstractions made of exemplars or ‘You’re all right, and I’ve changed my mind’: Response to commentators. First Language, 40(5–6), 640–659. https://doi.org/10.1177/0142723720949723
- Badcock, J. C., & Chhabra, A. (2013). Voices to reckon with: Perceptions of voice identity in clinical and non-clinical voice hearers. Frontiers in Human Neuroscience, 7, 114. https://doi.org/10.3389/fnhum.2013.00114
- Barr, D. J., & Keysar, B. (2002). Anchoring comprehension in linguistic precedents. Journal of Memory and Language, 46(2), 391–418. https://doi.org/10.1006/JMLA.2001.2815
- Barr, D. J., Jackson, L., & Phillips, I. (2014). Using a voice to put a name to a face: The psycholinguistics of proper name comprehension. Journal of Experimental Psychology: General, 143(1), 404–413. https://doi.org/10.1037/A0031813
- Beauchemin, M., De Beaumont, L., Vannasing, P., Turcotte, A., Arcand, C., Belin, P., & Lassonde, M. (2006). Electrophysiological markers of voice familiarity. European Journal of Neuroscience, 23(11), 3081–3086. https://doi.org/10.1111/j.1460-9568.2006.04856.x
- Belin, P., Fecteau, S., & Bédard, C. (2004). Thinking the voice: Neural correlates of voice perception. Trends in Cognitive Sciences, 8(3), 129–135. https://doi.org/10.1016/j.tics.2004.01.008
- Belin, P., Bestelmeyer, P. E. G., Latinus, M., & Watson, R. (2011). Understanding voice perception. British Journal of Psychology, 102(4), 711–725. https://doi.org/10.1111/j.2044-8295.2011.02041.x
- Bennett, S. (1981). Vowel formant frequency characteristics of preadolescent males and females. The Journal of the Acoustical Society of America, 69(1), 231–238. https://doi.org/10.1121/1.385343
- Bergman, T. J., Beehner, J. C., Cheney, D. L., & Seyfarth, R. M. (2003). Hierarchical classification by rank and kinship in baboons. Science, 302(5648), 1234–1236. https://doi.org/10.1126/science.1087513
- Bögels, S., Barr, D. J., Garrod, S., & Kessler, K. (2015). Conversational interaction in the scanner: Mentalizing during language processing as revealed by MEG. Cerebral Cortex, 25(9), 3219–3234. https://doi.org/10.1093/cercor/bhu116
- Bonte, M., Valente, G., & Formisano, E. (2009). Dynamic and task-dependent encoding of speech and voice by phase reorganization of cortical oscillations. Journal of Neuroscience, 29(6), 1699–1706.
- Bonte, M., Hausfeld, L., Scharke, W., Valente, G., & Formisano, E. (2014). Task-dependent decoding of speaker and vowel identity from auditory cortical response patterns. Journal of Neuroscience, 34(13), 4548–4557.
- Boucher, J., Lewis, V., & Collis, G. (1998). Familiar face and voice matching and recognition in children with autism. Journal of Child Psychology and Psychiatry and Allied Disciplines, 39(2), 171–181. https://doi.org/10.1017/S0021963097001820
- Brachacki, G. W., Fawcett, A. J., & Nicolson, R. I. (1994). Adults with dyslexia have a deficit in voice recognition. Perceptual and Motor Skills, 78(1), 304–306. https://doi.org/10.2466/pms.1994.78.1.304
- Bradac, J. J., Konsky, C. W., & Davies, R. A. (1976). Two studies of the effects of linguistic diversity upon judgments of communicator attributes and message effectiveness. Communication Monographs, 43(1), 70–79.
- Branigan, H. P., Pickering, M. J., Pearson, J., McLean, J. F., & Brown, A. (2011). The role of beliefs in lexical alignment: Evidence from dialogs with humans and computers. Cognition, 121(1), 41–57. https://doi.org/10.1016/j.cognition.2011.05.011
- Brennan, S. E., & Clark, H. H. (1996). Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(6), 1482–1493. https://doi.org/10.1037/0278-7393.22.6.1482
- Broussard, M. (2018). Artificial unintelligence: How computers misunderstand the world. The MIT Press.
- Brown-Schmidt, S. (2009). Partner-specific interpretation of maintained referential precedents during interactive dialog. Journal of Memory and Language, 61(2), 171–190. https://doi.org/10.1016/J.JML.2009.04.003
- Brown-Schmidt, S. (2012). Beyond common and privileged: Gradient representations of common ground in real-time language use. Language and Cognitive Processes, 27(1), 62–89. https://doi.org/10.1080/01690965.2010.543363
- Brown-Schmidt, S., Gunlogson, C., & Tanenhaus, M. K. (2008). Addressees distinguish shared from private information when interpreting questions during interactive conversation. Cognition, 107(3), 1122–1134. https://doi.org/10.1016/j.cognition.2007.11.005
- Brown-Schmidt, S., Yoon, S. O., & Ryskin, R. A. (2015). People as contexts in conversation. In Psychology of Learning and Motivation: Advances in Research and Theory (Vol. 62). Elsevier. https://doi.org/10.1016/bs.plm.2014.09.003
- Bruce, V., & Young, A. (1986). Understanding face recognition. British Journal of Psychology, 77(3), 305–327. https://doi.org/10.1111/j.2044-8295.1986.tb02199.x
- Cai, Z. G. (2022). Interlocutor modelling in comprehending speech from interleaved interlocutors of different dialectic backgrounds. Psychonomic Bulletin & Review, 29(3), 1026–1034. https://doi.org/10.3758/s13423-022-02055-7
- Cai, Z. G., Gilbert, R. A., Davis, M. H., Gaskell, M. G., Farrar, L., Adler, S., & Rodd, J. M. (2017). Accent modulates access to word meaning: Evidence for a speaker-model account of spoken word recognition. Cognitive Psychology, 98, 73–101. https://doi.org/10.1016/j.cogpsych.2017.08.003
- Cai, Z. G., Sun, Z., & Zhao, N. (2021). Interlocutor modelling in lexical alignment: The role of linguistic competence. Journal of Memory and Language, 121, 104278. https://doi.org/10.1016/j.jml.2021.104278
- Cai, Z., Duan, X., Haslett, D., Wang, S., & Pickering, M. (2024). Do large language models resemble humans in language use? In T. Kuribayashi, G. Rambelli, E. Takmaz, P. Wicke, & Y. Oseki (Eds.), Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics (pp. 37–56). Association for Computational Linguistics.
- Capilla, A., Belin, P., & Gross, J. (2013). The early spatio-temporal correlates and task independence of cerebral voice processing studied with MEG. Cerebral Cortex, 23(6), 1388–1395. https://doi.org/10.1093/cercor/bhs119
- Chang, E. F., Rieger, J. W., Johnson, K., Berger, M. S., Barbaro, N. M., & Knight, R. T. (2010). Categorical speech representation in human superior temporal gyrus. Nature Neuroscience, 13(11), 1428–1432. https://doi.org/10.1038/nn.2641
- Chhabra, S., Badcock, J. C., Maybery, M. T., & Leung, D. (2012). Voice identity discrimination in schizophrenia. Neuropsychologia, 50(12), 2730–2735. https://doi.org/10.1016/j.neuropsychologia.2012.08.006
- Choi, J. Y., Hu, E. R., & Perrachione, T. K. (2018). Varying acoustic-phonemic ambiguity reveals that talker normalization is obligatory in speech processing. Attention, Perception, & Psychophysics, 80(3), 784–797. https://doi.org/10.3758/s13414-017-1395-5
- Clapp, W., Vaughn, C., & Sumner, M. (2023). The episodic encoding of talker voice attributes across diverse voices. Journal of Memory and Language, 128, 104376. https://doi.org/10.1016/j.jml.2022.104376
- Clapp, W., Vaughn, C., Todd, S., & Sumner, M. (2023). Talker-specificity and token-specificity in recognition memory. Cognition, 237, 105450. https://doi.org/10.1016/j.cognition.2023.105450
- Clark, H. (1996). Using language. Cambridge University Press.
- Clopper, C. G., & Pisoni, D. B. (2004). Effects of talker variability on perceptual learning of dialects. Language and Speech, 47(3), 207–239. https://doi.org/10.1177/00238309040470030101
- Clopper, C. G., & Pisoni, D. B. (2004). Some acoustic cues for the perceptual categorization of American English regional dialects. Journal of Phonetics, 32(1), 111–140. https://doi.org/10.1016/S0095-4470(03)00009-3
- Cooper, A., & Bradlow, A. R. (2017). Talker and background noise specificity in spoken word recognition memory. Laboratory Phonology, 8(1), Article 29. https://doi.org/10.5334/labphon.99
- Cooper, A., Brouwer, S., & Bradlow, A. R. (2015). Interdependent processing and encoding of speech and concurrent background noise. Attention, Perception, & Psychophysics, 77(4), 1342–1357. https://doi.org/10.3758/s13414-015-0855-z
- Craik, F. I. M., & Kirsner, K. (1974). The effect of speaker’s voice on word recognition. Quarterly Journal of Experimental Psychology, 26(2), 274–284. https://doi.org/10.1080/14640747408400413
- Creel, S. C. (2014). Preschoolers’ flexible use of talker information during word learning. Journal of Memory and Language, 73(1), 81–98. https://doi.org/10.1016/j.jml.2014.03.001
- Creel, S. C., & Bregman, M. R. (2011). How talker identity relates to language processing. Language and Linguistics Compass, 5(5), 190–204. https://doi.org/10.1111/j.1749-818X.2011.00276.x
- Creel, S. C., & Tumlin, M. A. (2011). On-line acoustic and semantic interpretation of talker information. Journal of Memory and Language, 65(3), 264–285. https://doi.org/10.1016/j.jml.2011.06.005
- Creel, S. C., Aslin, R. N., & Tanenhaus, M. K. (2008). Heeding the voice of experience: The role of talker variation in lexical access. Cognition, 106(2), 633–664. https://doi.org/10.1016/j.cognition.2007.03.013
- Creel, S. C., Aslin, R. N., & Tanenhaus, M. K. (2012). Word learning under adverse listening conditions: Context-specific recognition. Language and Cognitive Processes, 27(7–8), 1021–1038. https://doi.org/10.1080/01690965.2011.610597
- DeCasper, A. J., & Fifer, W. P. (1980). Of human bonding: Newborns prefer their mothers’ voices. Science, 208(4448), 1174–1176. https://doi.org/10.1126/science.7375928
- DeLong, K. A., Urbach, T. P., & Kutas, M. (2005). Probabilistic word pre-activation during language comprehension inferred from electrical brain activity. Nature Neuroscience,8(8), 1117–1121. [DOI] [PubMed] [Google Scholar]
- Domingo, Y., Holmes, E., & Johnsrude, I. S. (2020). The benefit to speech intelligibility of hearing a familiar voice. Journal of Experimental Psychology: Applied, 26(2), 236.
- Dou, X., Wu, C. F., Lin, K. C., Gan, S., & Tseng, T. M. (2021). Effects of different types of social robot voices on affective evaluations in different application fields. International Journal of Social Robotics, 13(4), 615–628. 10.1007/s12369-020-00654-9
- Dragojevic, M., & Giles, H. (2016). I don’t like you because you’re hard to understand: The role of processing fluency in the language attitudes process. Human Communication Research, 42, 396–420.
- Eisner, F., & McQueen, J. M. (2005). The specificity of perceptual learning in speech processing. Perception & Psychophysics, 67(2), 224–238.
- Evans, S., McGettigan, C., Agnew, Z. K., Rosen, S., & Scott, S. K. (2016). Getting the cocktail party started: Masking effects in speech perception. Journal of Cognitive Neuroscience, 28(3), 483–500. 10.1162/jocn_a_00913
- Ey, E., Pfefferle, D., & Fischer, J. (2007). Do age- and sex-related variations reliably reflect body size in non-human primate vocalizations? A review. Primates, 48(4), 253–267. 10.1007/s10329-006-0033-y
- Eyssel, F., & Hegel, F. (2012). (S)he’s got the look: Gender stereotyping of robots. Journal of Applied Social Psychology, 42(9), 2213–2230. 10.1111/j.1559-1816.2012.00937.x
- Fairchild, S., & Papafragou, A. (2018). Sins of omission are more likely to be forgiven in non-native speakers. Cognition, 181, 80–92.
- Fiske, A., Henningsen, P., & Buyx, A. (2019). Your robot therapist will see you now: Ethical implications of embodied artificial intelligence in psychiatry, psychology, and psychotherapy. Journal of Medical Internet Research, 21(5), 1–12. 10.2196/13216
- Fitch, W. T., & Giedd, J. (1999). Morphology and development of the human vocal tract: A study using magnetic resonance imaging. The Journal of the Acoustical Society of America, 106(3), 1511–1522. 10.1121/1.427148
- Fitria, T. N. (2024). Artificial Intelligence (AI) news anchors: How do they perform in the journalistic sector? Indonesia Technology-Enhanced Language Learning (ITELL) Journal, 1(1), 29–42.
- Fogg, B. J., & Nass, C. (1997). How users reciprocate to computers: An experiment that demonstrates behavior change. In CHI’97 Extended Abstracts on Human Factors in Computing Systems (pp. 331–332). Association for Computing Machinery. 10.1145/1120212.1120419
- Foucart, A., Garcia, X., Ayguasanosa, M., Thierry, G., Martin, C., & Costa, A. (2015). Does the speaker matter? Online processing of semantic and pragmatic information in L2 speech comprehension. Neuropsychologia, 75, 291–303. 10.1016/j.neuropsychologia.2015.06.027
- Foucart, A., Santamaría-García, H., & Hartsuiker, R. J. (2019). Short exposure to a foreign accent impacts subsequent cognitive processes. Neuropsychologia, 129, 1–9. 10.1016/j.neuropsychologia.2019.02.021
- Friederici, A. D., Kotz, S. A., Scott, S. K., & Obleser, J. (2010). Disentangling syntax and intelligibility in auditory language comprehension. Human Brain Mapping, 31(3), 448–457. 10.1002/hbm.20878
- Ganugapati, D., & Theodore, R. M. (2019). Structured phonetic variation facilitates talker identification. The Journal of the Acoustical Society of America, 145(6), EL469–EL475.
- Gelfer, M. P., & Bennett, Q. E. (2013). Speaking fundamental frequency and vowel formant frequencies: Effects on perception of gender. Journal of Voice, 27(5), 556–566. 10.1016/j.jvoice.2012.11.008
- Geiselman, R. E., & Bellezza, F. S. (1977). Incidental retention of speaker’s voice. Memory & Cognition, 5(6), 658–665.
- Ghazanfar, A. A., & Rendall, D. (2008). Evolution of human vocal production. Current Biology, 18(11), R457–R460. 10.1016/j.cub.2008.03.030
- Gibson, E., Tan, C., Futrell, R., Mahowald, K., Konieczny, L., Hemforth, B., & Fedorenko, E. (2017). Don’t underestimate the benefits of being misunderstood. Psychological Science, 28(6), 703–712.
- Goh, W. D. (2005). Talker variability and recognition memory: Instance-specific and voice-specific effects. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(1), 40–53. 10.1037/0278-7393.31.1.40
- Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(5), 1166–1183. 10.1037/0278-7393.22.5.1166
- Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105(2), 251–279. 10.1037/0033-295X.105.2.251
- Goldinger, S. D. (2007). A complementary-systems approach to abstract and episodic speech perception. In 16th International Congress of Phonetic Sciences (pp. 49–54).
- González, J., & McLennan, C. T. (2007). Hemispheric differences in indexical specificity effects in spoken word recognition. Journal of Experimental Psychology: Human Perception and Performance, 33(2), 410–424. 10.1037/0096-1523.33.2.410
- Gradoville, M. (2023). The future of exemplar theory. In The Handbook of Usage-Based Linguistics (pp. 527–544). Wiley. 10.1002/9781119839859.ch29
- Haase, J., & Hanel, P. H. P. (2023). Artificial muses: Generative artificial intelligence chatbots have risen to human-level creativity. Journal of Creativity, 33(3), 100066. 10.1016/j.yjoc.2023.100066
- Hagoort, P., Hald, L., Bastiaansen, M., & Petersson, K. M. (2004). Integration of word meaning and world knowledge in language comprehension. Science, 304(5669), 438–441. 10.1126/SCIENCE.1095455
- Hailstone, J. C., Crutch, S. J., Vestergaard, M. D., Patterson, R. D., & Warren, J. D. (2010). Progressive associative phonagnosia: A neuropsychological analysis. Neuropsychologia, 48(4), 1104–1114. 10.1016/j.neuropsychologia.2009.12.011
- Hammond, T. H., Gray, S. D., & Butler, J. E. (2000). Age- and gender-related collagen distribution in human vocal folds. Annals of Otology, Rhinology & Laryngology, 109(10 I), 913–920. 10.1177/000348940010901004
- Hanna, J. E., Tanenhaus, M. K., & Trueswell, J. C. (2003). The effects of common ground and perspective on domains of referential interpretation. Journal of Memory and Language, 49(1), 43–61. 10.1016/S0749-596X(03)00022-6
- Hanulíková, A., Van Alphen, P. M., Van Goch, M. M., & Weber, A. (2012). When one person’s mistake is another’s standard usage: The effect of foreign accent on syntactic processing. Journal of Cognitive Neuroscience, 24(4), 878–887.
- Hay, J., Warren, P., & Drager, K. (2006). Factors influencing speech perception in the context of a merger-in-progress. Journal of Phonetics, 34(4), 458–484. 10.1016/j.wocn.2005.10.001
- Heller, D., Gorman, K. S., & Tanenhaus, M. K. (2012). To name or to describe: Shared knowledge affects referential form. Topics in Cognitive Science, 4(2), 290–305. 10.1111/j.1756-8765.2012.01182.x
- Hill, J., Randolph Ford, W., & Farreras, I. G. (2015). Real conversations with artificial intelligence: A comparison between human-human online conversations and human-chatbot conversations. Computers in Human Behavior, 49, 245–250. 10.1016/j.chb.2015.02.026
- Holmes, E., & Johnsrude, I. S. (2021). Speech-evoked brain activity is more robust to competing speech when it is spoken by someone familiar. NeuroImage, 237, 118107.
- Holmes, E., Domingo, Y., & Johnsrude, I. S. (2018). Familiar voices are more intelligible, even if they are not recognized as familiar. Psychological Science, 29(10), 1575–1583.
- Horton, W. S., & Slaten, D. G. (2012). Anticipating who will say what: The influence of speaker-specific memory associations on reference resolution. Memory & Cognition, 40(1), 113–126. 10.3758/s13421-011-0135-7
- Houston, D. M., & Jusczyk, P. W. (2000). The role of talker-specific information in word segmentation by infants. Journal of Experimental Psychology: Human Perception and Performance, 26(5), 1570–1582. 10.1037/0096-1523.26.5.1570
- Hoy, M. B. (2018). Alexa, Siri, Cortana, and more: An introduction to voice assistants. Medical Reference Services Quarterly, 37(1), 81–88. 10.1080/02763869.2018.1404391
- Jiang, X., Gossack-Keenan, K., & Pell, M. D. (2020). To believe or not to believe? How voice and accent information in speech alter listener impressions of trust. Quarterly Journal of Experimental Psychology, 73(1), 55–79.
- Johnson, K. (2006). Resonance in an exemplar-based lexicon: The emergence of social identity and phonology. Journal of Phonetics, 34(4), 485–499. 10.1016/j.wocn.2005.08.004
- Johnson, K., & Sjerps, M. J. (2021). Speaker normalization in speech perception. In The Handbook of Speech Perception (pp. 145–176). John Wiley & Sons Inc. 10.1002/9781119184096.ch6
- Johnson, K., Strand, E. A., & D’Imperio, M. (1999). Auditory-visual integration of talker gender in vowel perception. Journal of Phonetics, 27(4), 359–384. 10.1006/jpho.1999.0100
- Jouravlev, O., Schwartz, R., Ayyash, D., Mineroff, Z., Gibson, E., & Fedorenko, E. (2019). Tracking colisteners’ knowledge states during language comprehension. Psychological Science, 30(1), 3–19.
- Kapnoula, E. C., & Samuel, A. G. (2019). Voices in the mental lexicon: Words carry indexical information that can affect access to their meaning. Journal of Memory and Language, 107, 111–127. 10.1016/J.JML.2019.05.001
- Kennedy, A. (1988). Dialogue with machines. Cognition, 30(1), 37–72. 10.1016/0010-0277(88)90003-0
- Kim, J. (2016). Perceptual associations between words and speaker age. Laboratory Phonology, 7(1), Article 18. 10.5334/labphon.33
- King, E., & Sumner, M. (2015). Voice-specific effects in semantic association. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society (pp. 1111–1116). Cognitive Science Society.
- Kinzler, K. D., Dupoux, E., & Spelke, E. S. (2007). The native language of social cognition. Proceedings of the National Academy of Sciences, 104(30), 12577–12580.
- Kisilevsky, B. S., Hains, S. M. J., Lee, K., Xie, X., Huang, H., Ye, H. H., ... & Wang, Z. (2003). Effects of experience on fetal voice recognition. Psychological Science, 14(3), 220–224. 10.1111/1467-9280.02435
- Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87(2), 820–857. 10.1121/1.398894
- Kleinschmidt, D. F. (2019). Structure in talker variability: How much is there and how much can it help? Language, Cognition and Neuroscience, 34(1), 43–68.
- Kleinschmidt, D. F., & Jaeger, T. F. (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2), 148.
- Ko, S. J., Judd, C. M., & Stapel, D. A. (2009). Stereotyping based on voice in the presence of individuating information: Vocal femininity affects perceived competence but not warmth. Personality and Social Psychology Bulletin, 35(2), 198–211. 10.1177/0146167208326477
- Kraljic, T., & Samuel, A. G. (2005). Perceptual learning for speech: Is there a return to normal? Cognitive Psychology, 51(2), 141–178.
- Kraljic, T., & Samuel, A. G. (2006). Generalization in perceptual learning for speech. Psychonomic Bulletin & Review, 13(2), 262–268.
- Kraljic, T., & Samuel, A. G. (2007). Perceptual adjustments to multiple speakers. Journal of Memory and Language, 56(1), 1–15.
- Krauss, R. M., Freyberg, R., & Morsella, E. (2002). Inferring speakers’ physical attributes from their voices. Journal of Experimental Social Psychology, 38(6), 618–625. 10.1016/S0022-1031(02)00510-3
- Kreiman, J., & Sidtis, D. (2011). Physical characteristics and the voice: Can we hear what a speaker looks like? In Foundations of Voice Studies (pp. 110–155). Wiley. 10.1002/9781444395068.ch4
- Kronmüller, E., & Barr, D. J. (2007). Perspective-free pragmatics: Broken precedents and the recovery-from-preemption hypothesis. Journal of Memory and Language, 56(3), 436–455. 10.1016/j.jml.2006.05.002
- Kronmüller, E., & Barr, D. J. (2015). Referential precedents in spoken language comprehension: A review and meta-analysis. Journal of Memory and Language, 83, 1–19. 10.1016/J.JML.2015.03.008
- Kun, A., Paek, T., & Medenica, Z. (2007). The effect of speech interface accuracy on driving performance. In 8th Annual Conference of the International Speech Communication Association, Interspeech (Vol. 4, pp. 2332–2335). 10.21437/interspeech.2007-406
- Kutas, M., & Hillyard, S. A. (1980). Reading senseless sentences: Brain potentials reflect semantic incongruity. Science, 207(4427), 203–205. 10.1126/science.7350657
- Labov, W. (1973). Sociolinguistic patterns. University of Pennsylvania Press.
- Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. The Journal of the Acoustical Society of America, 29(1), 98–104. 10.1121/1.1908694
- Latinus, M., McAleer, P., Bestelmeyer, P. E. G., & Belin, P. (2013). Norm-based coding of voice identity in human auditory cortex. Current Biology, 23(12), 1075–1080. 10.1016/j.cub.2013.04.055
- Lattner, S., & Friederici, A. D. (2003). Talker’s voice and gender stereotype in human auditory sentence processing - Evidence from event-related brain potentials. Neuroscience Letters, 339(3), 191–194. 10.1016/S0304-3940(03)00027-2
- Lattner, S., Meyer, M. E., & Friederici, A. D. (2005). Voice perception: Sex, pitch, and the right hemisphere. Human Brain Mapping, 24(1), 11–20. 10.1002/hbm.20065
- Lavan, N., Burton, A. M., Scott, S. K., & McGettigan, C. (2019a). Flexible voices: Identity perception from variable vocal signals. Psychonomic Bulletin & Review, 26(1), 90–102. 10.3758/s13423-018-1497-7
- Lavan, N., Knight, S., & McGettigan, C. (2019b). Listeners form average-based representations of individual voice identities. Nature Communications, 10(1), 2404.
- Lavner, Y., Rosenhouse, J., & Gath, I. (2001). The prototype model in speaker identification. International Journal of Speech Technology, 4, 63–74. 10.1023/A:1009656816383
- Lee, S., Potamianos, A., & Narayanan, S. (1999). Acoustics of children’s speech: Developmental changes of temporal and spectral parameters. The Journal of the Acoustical Society of America, 105(3), 1455–1468. 10.1121/1.426686
- Leung, Y., Oates, J., & Chan, S. P. (2018). Voice, articulation, and prosody contribute to listener perceptions of speaker gender: A systematic review and meta-analysis. Journal of Speech, Language, and Hearing Research, 61(2), 266–297. 10.1044/2017_JSLHR-S-17-0067
- Levi, S. V., Harel, D., & Schwartz, R. G. (2019). Language ability and the familiar talker advantage: Generalizing to unfamiliar talkers is what matters. Journal of Speech, Language, and Hearing Research, 62(5), 1427–1436.
- Levi, S. V., Winters, S. J., & Pisoni, D. B. (2011). Effects of cross-language voice training on speech perception: Whose familiar voices are more intelligible? The Journal of the Acoustical Society of America, 130(6), 4053–4062.
- Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21(1), 1–36. 10.1016/0010-0277(85)90021-6
- Lim, S. J., Shinn-Cunningham, B. G., & Perrachione, T. K. (2019). Effects of talker continuity and speech rate on auditory working memory. Attention, Perception, & Psychophysics, 81(4), 1167–1177.
- Lively, S. E., Logan, J. S., & Pisoni, D. B. (1993). Training Japanese listeners to identify English /r/ and /l/. II: The role of phonetic environment and talker variability in learning new perceptual categories. The Journal of the Acoustical Society of America, 94(3), 1242–1255. 10.1121/1.408177
- Luthra, S. (2021). The role of the right hemisphere in processing phonetic variability between talkers. Neurobiology of Language, 2(1), 138–151.
- Luthra, S. (2024). Why are listeners hindered by talker variability? Psychonomic Bulletin & Review, 31(1), 104–121.
- Luthra, S., Fox, N. P., & Blumstein, S. E. (2018). Speaker information affects false recognition of unstudied lexical semantic associates. Attention, Perception, & Psychophysics, 80, 894–912.
- Luthra, S., Saltzman, D., Myers, E. B., & Magnuson, J. S. (2021). Listener expectations and the perceptual accommodation of talker variability: A pre-registered replication. Attention, Perception, & Psychophysics, 83(6), 2367–2376.
- Luthra, S., Magnuson, J. S., & Myers, E. B. (2023). Right posterior temporal cortex supports integration of phonetic and talker information. Neurobiology of Language, 4(1), 145–177.
- Magnuson, J. S., & Nusbaum, H. C. (2007). Acoustic differences, listener expectations, and the perceptual accommodation of talker variability. Journal of Experimental Psychology: Human Perception and Performance, 33(2), 391–409. 10.1037/0096-1523.33.2.391
- Magnuson, J. S., Nusbaum, H. C., Akahane-Yamada, R., & Saltzman, D. (2021). Talker familiarity and the accommodation of talker variability. Attention, Perception, & Psychophysics, 83(4), 1842–1860. 10.3758/s13414-020-02203-y
- Maguinness, C., Roswandowitz, C., & von Kriegstein, K. (2018). Understanding the mechanisms of familiar voice-identity recognition in the human brain. Neuropsychologia, 116, 179–193. 10.1016/j.neuropsychologia.2018.03.039
- Martin, C. D., Garcia, X., Potter, D., Melinger, A., & Costa, A. (2016). Holiday or vacation? The processing of variation in vocabulary across dialects. Language, Cognition and Neuroscience, 31(3), 375–390. 10.1080/23273798.2015.1100750
- Mattys, S. L., & Liss, J. M. (2008). On building models of spoken-word recognition: When there is as much to learn from natural “oddities” as artificial normality. Perception & Psychophysics, 70(7), 1235–1242.
- McAleer, P., Todorov, A., & Belin, P. (2014). How do you say “hello”? Personality impressions from brief novel voices. PLoS ONE, 9(3), 1–9. 10.1371/journal.pone.0090779
- McLennan, C. T., & Luce, P. A. (2005). Examining the time course of indexical specificity effects in spoken word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(2), 306–321.
- McQueen, J. M., Cutler, A., & Norris, D. (2006). Phonological abstraction in the mental lexicon. Cognitive Science, 30(6), 1113–1126. 10.1207/s15516709cog0000_79
- Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85(3), 207–238. 10.1037/0033-295X.85.3.207
- Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343(6174), 1006–1010. 10.1126/science.1245994
- Metzing, C., & Brennan, S. E. (2003). When conceptual pacts are broken: Partner-specific effects on the comprehension of referring expressions. Journal of Memory and Language, 49(2), 201–213. 10.1016/S0749-596X(03)00028-7
- Mulac, A., & Giles, H. (1996). “You’re only as old as you sound”: Perceived vocal age and social meanings. Health Communication, 8(3), 199–215. 10.1207/s15327027hc0803_2
- Mullennix, J. W., & Pisoni, D. B. (1990). Stimulus variability and processing dependencies in speech perception. Perception & Psychophysics, 47(4), 379–390.
- Mullennix, J. W., Pisoni, D. B., & Martin, C. S. (1989). Some effects of talker variability on spoken word recognition. The Journal of the Acoustical Society of America, 85(1), 365–378. 10.1121/1.397688
- Munson, B., & Babel, M. (2019). The phonetics of sex and gender. In The Routledge Handbook of Phonetics (pp. 499–525). Taylor and Francis. 10.4324/9780429056253-19
- Munson, B., Crocker, L., Pierrehumbert, J. B., Owen-Anderson, A., & Zucker, K. J. (2015). Gender typicality in children’s speech: A comparison of boys with and without gender identity disorder. The Journal of the Acoustical Society of America, 137(4), 1995–2003. 10.1121/1.4916202
- Myers, E. B., & Mesite, L. M. (2014). Neural systems underlying perceptual adjustment to non-standard speech tokens. Journal of Memory and Language, 76, 80–93.
- Myers, E. B., & Theodore, R. M. (2017). Voice-sensitive brain networks encode talker-specific phonetic detail. Brain and Language, 165, 33–44.
- Nass, C., & Moon, Y. (2000). Machines and mindlessness: Social responses to computers. Journal of Social Issues, 56(1), 81–103. 10.1111/0022-4537.00153
- Nass, C., Moon, Y., & Green, N. (1997). Are machines gender neutral? Gender-stereotypic responses to computers with voices. Journal of Applied Social Psychology, 27(10), 864–876. 10.1111/j.1559-1816.1997.tb00275.x
- Nass, C., Moon, Y., & Carney, P. (1999). Are respondents polite to computers? Social responses to computers. Journal of Applied Social Psychology, 29(5), 1093–1110. http://www.bellpub.com
- Niedzielski, N. (1999). The effect of social information on the perception of sociolinguistic variables. Journal of Language and Social Psychology, 18(1), 62–85. 10.1177/0261927X99018001005
- Norris, D., McQueen, J. M., & Cutler, A. (2003). Perceptual learning in speech. Cognitive Psychology, 47(2), 204–238.
- Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115(1), 39–57. 10.1037/0096-3445.115.1.39
- NourEddine, S. N., Brothers, T., Wang, L., Spratling, M., & Kuperberg, G. R. (2024). A predictive coding model of the N400. Cognition, 246, 105755.
- Oganian, Y., Bhaya-Grossman, I., Johnson, K., & Chang, E. F. (2023). Vowel and formant representation in the human auditory speech cortex. Neuron, 111(13), 2105–2118.
- Palmeri, T. J., Goldinger, S. D., & Pisoni, D. B. (1993). Episodic encoding of voice attributes and recognition memory for spoken words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19(2), 309–328. 10.1037/0278-7393.19.2.309
- Pearson, J., Hu, J., Branigan, H. P., Pickering, M. J., & Nass, C. I. (2006). Adaptive language behavior in HCI: How expectations and beliefs about a system affect users’ word choice. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1177–1180).
- Pélissier, M., & Ferragne, E. (2022). The N400 reveals implicit accent-induced prejudice. Speech Communication, 137, 114–126. 10.1016/J.SPECOM.2021.10.004
- Perea, M., Jiménez, M., Suárez-Coalla, P., Fernández, N., Viña, C., & Cuetos, F. (2014). Ability for voice recognition is a marker for dyslexia in children. Experimental Psychology, 61(6), 480–487. 10.1027/1618-3169/a000265
- Perrachione, T. K., Del Tufo, S. N., & Gabrieli, J. D. E. (2011). Human voice recognition depends on language ability. Science, 333(6042), 595. 10.1126/science.1207327
- Perry, T. L., Ohde, R. N., & Ashmead, D. H. (2001). The acoustic bases for gender identification from children’s voices. The Journal of the Acoustical Society of America, 109(6), 2988–2998. 10.1121/1.1370525
- Petkov, C. I., & Vuong, Q. C. (2013). Neuronal coding: The value in having an average voice. Current Biology, 23(12), R521–R523. 10.1016/j.cub.2013.04.077
- Pickering, M. J., & Garrod, S. (2013). An integrated theory of language production and comprehension. Behavioral and Brain Sciences, 36(4), 329–347. 10.1017/S0140525X12001495
- Pierrehumbert, J. B., Bent, T., Munson, B., Bradlow, A. R., & Bailey, J. M. (2004). The influence of sexual orientation on vowel production (L). The Journal of the Acoustical Society of America, 116(4), 1905–1908. 10.1121/1.1788729
- Poeppel, D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as “asymmetric sampling in time.” Speech Communication, 41(1), 245–255. 10.1016/S0167-6393(02)00107-3
- Porter, S. C., Rheinschmidt-Same, M., & Richeson, J. A. (2016). Inferring identity from language: Linguistic intergroup bias informs social categorization. Psychological Science, 27(1), 94–102.
- Powers, A., & Kiesler, S. (2006). The advisor robot: Tracing people’s mental model from a robot’s physical attributes. In HRI 2006: Proceedings of the 2006 ACM Conference on Human-Robot Interaction (pp. 218–225).
- Powers, A., Kramer, A. D. I., Lim, S., Kuo, J., Lee, S. L., & Kiesler, S. (2005). Eliciting information from people with a gendered humanoid robot. In Proceedings of the IEEE International Workshop on Robot and Human Interactive Communication (pp. 158–163). 10.1109/ROMAN.2005.1513773
- Pufahl, A., & Samuel, A. G. (2014). How lexical is the lexicon? Evidence for integrated auditory memory representations. Cognitive Psychology, 70, 1–30. 10.1016/J.COGPSYCH.2014.01.001
- Puts, D. A., Gaulin, S. J. C., & Verdolini, K. (2006). Dominance and the evolution of sexual dimorphism in human voice pitch. Evolution and Human Behavior, 27(4), 283–296. 10.1016/j.evolhumbehav.2005.11.003
- Quam, C., & Creel, S. C. (2021). Impacts of acoustic-phonetic variability on perceptual development for spoken language: A review. Wiley Interdisciplinary Reviews: Cognitive Science, 12(5), 1–21. 10.1002/wcs.1558
- Quam, C., Knight, S., & Gerken, L. (2017). The distribution of talker variability impacts infants’ word learning. Laboratory Phonology, 8(1), 1–31. 10.5334/labphon.25
- Rakić, T., Steffens, M. C., & Mummendey, A. (2011). When it matters how you pronounce it: The influence of regional accents on job interview outcome. British Journal of Psychology, 102(4), 868–883. 10.1111/j.2044-8295.2011.02051.x
- Rao, X., Wu, H., & Cai, Z. G. (2025). Comprehending semantic and syntactic anomalies in text attributed to an LLM versus a human: An ERP study. Applied Psycholinguistics, 46(e51), 1–27. 10.1017/S0142716425100404
- Rao, X., Wu, H., & Cai, Z. G. (2025b). A funny companion: Distinct neural responses to perceived AI- versus human-generated humor. arXiv preprint arXiv:2509.10847. 10.48550/arXiv.2509.10847
- Rao, X., Wu, H., & Cai, Z. G. (2025c). When AI companions become witty: Can human brain recognize AI-generated irony?. arXiv preprint arXiv:2510.17168. 10.48550/arXiv.2510.17168
- Reeves, B., & Nass, C. (1996). Media equation: How people treat computers, television, and new media like real people and places. Collection Management,24(3–4), 310–311. 10.1300/j105v24n03_14 [Google Scholar]
- Rost, G. C., & McMurray, B. (2009). Speaker variability augments phonological processing in early word learning. Developmental Science, 12(2), 339–349. 10.1111/j.1467-7687.2008.00786.x
- Rost, G. C., & McMurray, B. (2010). Finding the signal by adding noise: The role of noncontrastive phonetic variability in early word learning. Infancy, 15(6), 608–635. 10.1111/j.1532-7078.2010.00033.x
- Rueschemeyer, S. A., Gardner, T., & Stoner, C. (2015). The Social N400 effect: How the presence of other listeners affects language comprehension. Psychonomic Bulletin & Review, 22, 128–134.
- Ryalls, B. O., & Pisoni, D. B. (1997). The effect of talker variability on word recognition in preschool children. Developmental Psychology, 33(3), 441–452. 10.1037/0012-1649.33.3.441
- Sandygulova, A., & O’Hare, G. M. (2015). Children’s perception of synthesized voice: Robot’s gender, age and accent. In International Conference on Social Robotics (pp. 594–602).
- Schall, S., Kiebel, S. J., Maess, B., & von Kriegstein, K. (2015). Voice identity recognition: Functional division of the right STS and its behavioral relevance. Journal of Cognitive Neuroscience, 27(2), 280–291. 10.1162/jocn_a_00707
- Schelinski, S., Borowiak, K., & von Kriegstein, K. (2016). Temporal voice areas exist in autism spectrum disorder but are dysfunctional for voice identity recognition. Social Cognitive and Affective Neuroscience, 11(11), 1812–1822. 10.1093/scan/nsw089
- Schirmer, A. (2018). Is the voice an auditory face? An ALE meta-analysis comparing vocal and facial emotion processing. Social Cognitive and Affective Neuroscience, 13(1), 1–13. 10.1093/scan/nsx142
- Schmale, R., & Seidl, A. (2009). Accommodating variability in voice and foreign accent: Flexibility of early word representations. Developmental Science, 12(4), 583–601. 10.1111/j.1467-7687.2009.00809.x
- Schmidt, T., & Strassner, T. (2022). Artificial intelligence in foreign language learning and teaching. Anglistik, 33(1), 165–184. 10.33675/angl/2022/1/14
- Schweinberger, S. R., Herholz, A., & Sommer, W. (1997). Recognizing famous voices: Influence of stimulus duration and different types of retrieval cues. Journal of Speech, Language, and Hearing Research, 40(2), 453–463. 10.1044/jslhr.4002.453
- Schweinberger, S. R., Kawahara, H., Simpson, A. P., Skuk, V. G., & Zäske, R. (2014). Speaker perception. Wiley Interdisciplinary Reviews: Cognitive Science, 5(1), 15–25. 10.1002/wcs.1261
- Scott, S. K. (2019). From speech and talkers to the social world: The neural processing of human spoken language. Science, 366(6461), 58–62. 10.1126/science.aax028
- Shechtman, N., & Horowitz, L. M. (2003). Media inequality in conversation: How people behave differently when interacting with computers and people. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 281–288). 10.1145/642659.642661
- Shen, H., & Wang, M. (2023). Effects of social skills on lexical alignment in human-human interaction and human-computer interaction. Computers in Human Behavior, 143, 107718. 10.1016/j.chb.2023.107718
- Shintel, H., & Keysar, B. (2007). You said it before and you’ll say it again: Expectations of consistency in communication. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(2), 357–369. 10.1037/0278-7393.33.2.357
- Simpson, A. P. (2001). Dynamic consequences of differences in male and female vocal tract dimensions. The Journal of the Acoustical Society of America, 109(5), 2153–2164. 10.1121/1.1356020
- Sjerps, M. J., Fox, N. P., Johnson, K., & Chang, E. F. (2019). Speaker-normalized sound representations in the human auditory cortex. Nature Communications, 10(1), 2465.
- Skuk, V. G., Palermo, R., Broemer, L., & Schweinberger, S. R. (2019). Autistic traits are linked to individual differences in familiar voice identification. Journal of Autism and Developmental Disorders, 49(7), 2747–2767. 10.1007/s10803-017-3039-y
- Souza, P., Gehani, N., Wright, R., & McCloy, D. (2013). The advantage of knowing the talker. Journal of the American Academy of Audiology, 24(08), 689–700.
- Staum Casasanto, L. (2008). Does social information influence sentence processing? In Proceedings of the Annual Meeting of the Cognitive Science Society, 30(30), 799–804.
- Stevenage, S. V. (2018). Drawing a distinction between familiar and unfamiliar voice processing: A review of neuropsychological, clinical and empirical findings. Neuropsychologia, 116, 162–178. 10.1016/j.neuropsychologia.2017.07.005
- Stevens, K. N. (1998). Acoustic phonetics. MIT Press.
- Strori, D., Zaar, J., Cooke, M., & Mattys, S. L. (2018). Sound specificity effects in spoken word recognition: The effect of integrality between words and sounds. Attention, Perception & Psychophysics, 80(1), 222–241. 10.3758/s13414-017-1425-3
- Sumner, M., Kim, S. K., King, E., & McGowan, K. B. (2014). The socially weighted encoding of spoken words: A dual-route approach to speech perception. Frontiers in Psychology, 4, 1015. 10.3389/FPSYG.2013.01015
- Tesink, C. M. J. Y., Buitelaar, J. K., Petersson, K. M., Van Der Gaag, R. J., Kan, C. C., Tendolkar, I., & Hagoort, P. (2009). Neural correlates of pragmatic language comprehension in autism spectrum disorders. Brain, 132(7), 1941–1952. 10.1093/brain/awp103
- Tesink, C. M. J. Y., Petersson, K. M., Van Berkum, J. J. A., Van Den Brink, D., Buitelaar, J. K., & Hagoort, P. (2009). Unification of speaker and meaning in language comprehension: An fMRI study. Journal of Cognitive Neuroscience, 21(11), 2085–2099. 10.1162/jocn.2008.21161
- Tezcan, F., Weissbart, H., & Martin, A. E. (2023). A tradeoff between acoustic and linguistic feature encoding in spoken language comprehension. eLife, 12, e82386.
- Theodore, R. M., Blumstein, S. E., & Luthra, S. (2015). Attention modulates specificity effects in spoken word recognition: Challenges to the time-course hypothesis. Attention, Perception & Psychophysics, 77, 1674–1684.
- Tzeng, C. Y., Nygaard, L. C., & Theodore, R. M. (2021). A second chance for a first impression: Sensitivity to cumulative input statistics for lexically guided perceptual learning. Psychonomic Bulletin & Review, 28(3), 1003–1014.
- Van Berkum, J. J. A., Hagoort, P., & Brown, C. M. (1999). Semantic integration in sentences and discourse: Evidence from the N400. Journal of Cognitive Neuroscience, 11(6), 657–671. 10.1162/089892999563724
- Van Berkum, J. J. A., Van Den Brink, D., Tesink, C. M. J. Y., Kos, M., & Hagoort, P. (2008). The neural integration of speaker and message. Journal of Cognitive Neuroscience, 20(4), 580–591. 10.1162/jocn.2008.20054
- van den Brink, D., Van Berkum, J. J. A., Bastiaansen, M. C. M., Tesink, C. M. J. Y., Kos, M., Buitelaar, J. K., & Hagoort, P. (2012). Empathy matters: ERP evidence for inter-individual differences in social language processing. Social Cognitive and Affective Neuroscience, 7(2), 173–183. 10.1093/scan/nsq094
- Van Lancker, D. R., & Canter, G. J. (1982). Impairment of voice and face recognition in patients with hemispheric damage. Brain and Cognition, 1(2), 185–195. 10.1016/0278-2626(82)90016-1
- Vollmer, A. L., Read, R., Trippas, D., & Belpaeme, T. (2018). Children conform, adults resist: A robot group induced peer pressure on normative social conformity. Science Robotics, 3(21), 1–8. 10.1126/scirobotics.aat7111
- Vorperian, H. K., Wang, S., Chung, M. K., Schimek, E. M., Durtschi, R. B., Kent, R. D., ... & Gentry, L. R. (2009). Anatomic development of the oral and pharyngeal portions of the vocal tract: An imaging study. The Journal of the Acoustical Society of America, 125(3), 1666–1678. 10.1121/1.3075589
- Walker, A., & Hay, J. (2011). Congruence between ‘word age’ and ‘voice age’ facilitates lexical access. Laboratory Phonology, 2(1), 219–237. 10.1515/LABPHON.2011.007
- Werker, J. F., & Curtin, S. (2005). PRIMIR: A developmental framework of infant speech processing. Language Learning and Development, 1(2), 197–234. 10.1080/15475441.2005.9684216
- Whiteside, S. P. (2001). Sex-specific fundamental and formant frequency patterns in a cross-sectional study. The Journal of the Acoustical Society of America, 110(1), 464–478. 10.1121/1.1379087
- Wong, P. C. M., Nusbaum, H. C., & Small, S. L. (2004). Neural bases of talker normalization. Journal of Cognitive Neuroscience, 16(7), 1173–1184. 10.1162/0898929041920522
- Wu, H., & Cai, Z. G. (2026). When a man says he is pregnant: Event-related potential evidence for a rational account of speaker-contextualized language comprehension. Journal of Cognitive Neuroscience, 38(3), 545–560. 10.1162/JOCN.a.102
- Wu, S., & Keysar, B. (2007). The effect of information overlap on communication effectiveness. Cognitive Science, 31(1), 169–181. 10.1080/03640210709336989
- Wu, H., Duan, X., & Cai, Z. G. (2024). Speaker demographics modulate listeners’ neural correlates of spoken word processing. Journal of Cognitive Neuroscience, 36(10), 2208–2226. 10.1162/jocn_a_02225
- Wu, H., Rao, X., & Cai, Z. G. (2025). Probabilistic adaptation of language comprehension for individual speakers: Evidence from neural oscillations. Social Cognitive and Affective Neuroscience, 20(1), nsaf085. 10.1093/scan/nsaf085
- Yi, H. G., Leonard, M. K., & Chang, E. F. (2019). The encoding of speech sounds in the superior temporal gyrus. Neuron, 102(6), 1096–1110. 10.1016/j.neuron.2019.04.023
- Yin, Y., Jia, N., & Wakslak, C. J. (2024). AI can help people feel heard, but an AI label diminishes this impact. Proceedings of the National Academy of Sciences, 121(14), 2017. 10.1073/pnas.2319112121
- Yoshida, A., Kaji, T., Imaizumi, J., Shirakawa, A., Suga, K., Nakagawa, R., & Iwasa, T. (2022). Transgender man receiving testosterone treatment became pregnant and delivered a girl: A case report. Journal of Obstetrics and Gynaecology Research, 48(3), 866–868.
- Young, A. W., Frühholz, S., & Schweinberger, S. R. (2020). Face and voice perception: Understanding commonalities and differences. Trends in Cognitive Sciences, 24(5), 398–410. 10.1016/j.tics.2020.02.001
- Yovel, G., & Belin, P. (2013). A unified coding strategy for processing faces and voices. Trends in Cognitive Sciences, 17(6), 263–271. 10.1016/j.tics.2013.04.004
- Zäske, R., Volberg, G., Kovács, G., & Schweinberger, S. R. (2014). Electrophysiological correlates of voice learning and recognition. Journal of Neuroscience, 34(33), 10821–10831. 10.1523/JNEUROSCI.0581-14.2014
- Zhang, Z. J., Hao, G. F., Shi, J. B., Mou, X. D., Yao, Z. J., & Chen, N. (2008). Investigation of the neural substrates of voice recognition in Chinese schizophrenic patients with auditory verbal hallucinations: An event-related functional MRI study. Acta Psychiatrica Scandinavica, 118(4), 272–280. 10.1111/j.1600-0447.2008.01243.x
- Zhang, X., Cheng, B., & Zhang, Y. (2021). The role of talker variability in nonnative phonetic learning: A systematic review and meta-analysis. Journal of Speech, Language, and Hearing Research, 64(12), 4802–4825. 10.1044/2021_JSLHR-21-00181
- Zimman, L. (2018). Transgender voices: Insights on identity, embodiment, and the gender of the voice. Language and Linguistics Compass, 12(8), e12284.
Data Availability Statement
Not applicable.
