Abstract
Speech signals contain information about both linguistic content and a talker’s voice. Conventionally, linguistic and talker processing have been thought to be mediated by distinct neural systems in the left and right hemispheres respectively, but there is growing evidence that linguistic and talker processing interact in many ways. Previous studies suggest that talker-related vocal tract changes are processed integrally with phonetic changes in the bilateral posterior superior temporal gyrus/superior temporal sulcus (STG/STS), because the vocal tract parameter influences the perception of phonetic information. It remains unclear whether the bilateral STG are also activated by the integral processing of another parameter, pitch, which influences the perception of lexical tone information and is related to talker differences in tone languages. In this study, we conducted separate functional magnetic resonance imaging (fMRI) and event-related potential (ERP) experiments to examine the spatial and temporal loci of interactions between lexical tone and talker-related pitch processing in Cantonese. We found that the STG was activated bilaterally during the processing of talker changes when listeners attended to lexical tone changes in the stimuli, and during the processing of lexical tone changes when listeners attended to talker changes, suggesting that lexical tone and talker processing are functionally integrated in the bilateral STG. This extends previous work, providing evidence for a general neural mechanism of integral phonetic and talker processing in the bilateral STG. The ERP results show interactions of lexical tone and talker processing 500–800 ms after auditory word onset (a simultaneous posterior P3b and a frontal negativity). Moreover, there is some asymmetry in the interaction, such that unattended talker changes affect linguistic processing more than vice versa, which may be related to the ambiguity that talker changes cause in speech perception and/or an attentional bias toward talker changes. Our findings have implications for understanding the neural encoding of linguistic and talker information.
Keywords: Neural bases, linguistic processing, talker processing, lexical tones, fMRI, ERP
Introduction
Speech signals contain two sources of information: linguistic content and the talker’s voice. Understanding the linguistic message and recognizing the talker have important evolutionary and social implications for guiding an individual’s behavior in interaction and communication with others (Hockett, 1960; Theunissen and Elie, 2014). An important and unresolved question is how these two sources of information are encoded from a holistic speech signal in which linguistic and talker information are mixed. Traditionally, linguistic information and talker information have been believed to be processed via different neural networks: linguistic information predominantly in the left hemisphere (e.g., Frost et al., 1999; Johnsrude et al., 1997) and talker information mainly in the right hemisphere (e.g., Lattner et al., 2005). Nevertheless, there is growing evidence that linguistic and talker processing interact in many ways. On the one hand, talker information facilitates the identification of linguistic information. Speech sounds from a familiar or learned talker are recognized more accurately than speech sounds from an unfamiliar talker (Nygaard and Pisoni, 1998; von Kriegstein and Giraud, 2004), which suggests that a talker’s voice, once learned, assists the processing and retrieval of linguistic information. On the other hand, linguistic information also facilitates talker recognition. Listeners are more accurate at identifying talkers if they are familiar with the language being spoken (Perrachione et al., 2009, 2011; Perrachione and Wong, 2007), suggesting that knowledge of a familiar language facilitates talker processing. Listeners can also use familiarity with a talker's idiosyncratic phonetic patterns to identify familiar talkers when primary cues to talker identity (pitch, timbre, etc.) are absent (with sinewave speech stimuli; Remez et al., 1997).
Mullennix and Pisoni (1990) provided critical behavioral evidence for the inter-dependence of linguistic and talker processing using the Garner selective attention paradigm (Garner, 1974; Garner and Felfoldy, 1970). The logic of the Garner paradigm is that if two dimensions are processed integrally (e.g., using the same sensory or cortical pathways), random changes in an unattended dimension will impede processing of the attended dimension, whereas if the two dimensions are separable, random changes in the unattended dimension can be ignored. The authors found that listeners could not ignore random talker changes when their task was to attend to phonetic information, and vice versa, as indexed by longer reaction times for the orthogonal set, where the unattended dimension varies randomly, than for the control set, where the unattended dimension is fixed (see Table 1). This indicates that the phonetic and talker dimensions are processed integrally (Garner, 1974). Moreover, the relationship is asymmetrical: talker variability interferes more with phonetic processing than vice versa. The authors referred to such asymmetrical integral processing as a parallel-contingent relationship, i.e., linguistic and talker processing proceed in parallel, but linguistic processing suffers more interference from talker processing than the reverse (cf. Turvey, 1973). Integral processing is also sometimes identified under the Garner paradigm as changes in an unattended dimension facilitating processing of the attended dimension, when changes in the unattended dimension are correlated with changes in the attended dimension (see the correlated condition in Table 1); in other words, changes in the attended dimension are predictable from changes in the unattended dimension. But the correlated condition is best thought of as additional evidence; the comparison of the orthogonal condition against the control condition is the most important (see Mullennix and Pisoni, 1990).
Table 1. Stimulus sets in the Garner paradigm (after Mullennix and Pisoni, 1990). The subscript-style notation bad_talker1 denotes the word “bad” produced by talker 1.

| Stimulus set | Linguistic task (initial consonant classification) | Talker task (gender classification) |
|---|---|---|
| Control 1 | bad_talker1, pad_talker1 | bad_talker1, bad_talker2 |
| Control 2 | bad_talker2, pad_talker2 | pad_talker1, pad_talker2 |
| Orthogonal | bad_talker1, pad_talker1, bad_talker2, pad_talker2 | bad_talker1, pad_talker1, bad_talker2, pad_talker2 |
| Correlated 1 | bad_talker1, pad_talker2 | bad_talker1, pad_talker2 |
| Correlated 2 | bad_talker2, pad_talker1 | bad_talker2, pad_talker1 |
However, it remains unclear what neural mechanism underlies the aforementioned interactions of linguistic and talker processing. Within neuroimaging studies, three main lines of work have emerged, claiming to find evidence for interactions at levels ranging from lower to higher: auditory processing, phonological processing or categorization, and lexical/semantic processing.
A first line of work implies that the interaction of linguistic and talker processing may be detected as early as primary auditory cortex. Kaganovich et al. (2006) found that the interference from random changes in the unattended dimension (i.e., the orthogonal set) elicited a greater negativity 100–300 ms after the onset of auditory stimuli compared to the control condition without random changes in the unattended dimension. The authors interpreted the early onset of the interaction in the N1 time-window as indicating increased cognitive effort to extract information from the attended dimension during auditory processing when the unattended dimension varies randomly. However, this finding is possibly confounded by habituation/neuronal refractoriness effects, due to unmatched stimulus probabilities in the orthogonal and control sets. In the orthogonal set, four stimuli were presented in a block at equal probabilities of 25% each, whereas in the control set two stimuli were presented in a block at probabilities of 50% each. More frequent presentation of the two stimuli in a control block could have habituated the neural responses more, reducing the N1 amplitude in the control set (cf. Budd et al., 1998). Thus, whether the inter-dependencies of linguistic and talker processing arise early in auditory processing remains unclear.
A second line of work suggests that the interaction of linguistic and talker processing occurs in the bilateral posterior superior temporal gyrus/superior temporal sulcus (STG/STS). In a functional magnetic resonance imaging (fMRI) study, von Kriegstein et al. (2010) identified a neural network that integrates linguistic and talker processing in the bilateral posterior STG/STS, regions that play a role in higher-level phonological processing beyond the processing in Heschl’s gyrus (Hickok and Poeppel, 2000, 2004, 2007). This network is also adjacent to voice-selective areas in the upper bank of the bilateral STS (Belin et al., 2000, 2004). Von Kriegstein and colleagues compared two parameters, vocal tract length and pitch, both of which are related to talker differences, but only vocal tract length is related to linguistic information in English (e.g., a talker’s vocal tract length influences the location of amplitude peaks in the speech spectrum, i.e., formant frequencies, which affect the perception of vowels and sonorants). Speech recognition regions in the left posterior STG/STS responded more to talker-related changes in vocal tract length than to talker-related changes in pitch; the right posterior STG/STS responded more to vocal tract length changes than to pitch changes specifically in the speech recognition task. Furthermore, the left and right posterior STG/STS were functionally connected. In summary, processing of talker-related changes in vocal tract length, which influences the encoding of phonetic categories in English, is detected in the bilateral posterior STG/STS, whereas processing of talker-related pitch changes is detected in areas adjacent to Heschl’s gyrus, earlier in the auditory hierarchy than the posterior STG/STS.
It should be noted that it is unlikely that pitch changes have no linguistic significance at all in English. Particularly, pitch contours at the sentence level, or intonation, can indicate whether a sentence is a statement or a question. Kreitewolf et al. (2014) examined this question and found that talker-related pitch processing is integrated with linguistic intonation processing in the right Heschl’s gyrus, when listeners’ attention was directed to the intonation pattern of the stimuli in a question/statement classification task. Specifically, talker-related changes in pitch activated the right Heschl’s gyrus more in the intonation classification task than in the talker classification task. Moreover, the functional connectivity between right and left Heschl’s gyri was stronger for talker-related pitch changes than for vocal tract length changes in the intonation task.
The above findings suggest a general neural mechanism involving both hemispheres in linguistic and talker processing: talker processing is more integrated with linguistic processing when a parameter indexing talker changes is also linguistically significant. However, it is noteworthy that phonological and intonation processing differ in many ways. Firstly, phonological changes occur over rather short temporal intervals (milliseconds), whereas intonation changes occur over much longer intervals (seconds). Furthermore, intonation is processed predominantly in the right hemisphere (Blumstein and Cooper, 1974; Tong et al., 2005), whereas phonemes are processed predominantly in the left hemisphere (Frost et al., 1999; Gu et al., 2013; Gandour et al., 2003; Johnsrude et al., 1997; Mäkelä et al., 2003; Liebenthal et al., 2005; Shestakova et al., 2002). Probably because of these differences, previous studies have found that the vocal tract length parameter, which is related to phonemic differences, activates the bilateral posterior STG/STS, whereas the pitch parameter, which is related to intonation differences, activates the right Heschl’s gyrus.
In this regard, tone languages are useful for further examining the neural mechanism of integral linguistic and talker processing. Pitch changes in tone languages are phonemic; moreover, pitch plays a significant role in characterizing talker and gender differences (e.g., Smith and Patterson, 2005).
A third line of work suggests that the interaction of linguistic and talker processing can be detected at the lexical or semantic level (Chandrasekaran et al., 2011; von Kriegstein et al., 2003). Exemplar theory assumes that each heard token of a word leaves a trace in memory, such that the representation of auditory words comprises exemplars from different talkers (e.g., Craik and Kirsner, 1974; Goldinger, 1991, 1996, 1998; Hintzman et al., 1972; Palmeri et al., 1993). Craik and Kirsner (1974) found that listeners were more accurate at detecting whether a word was repeated when it was repeated in the original talker’s voice than when the "repetition" was produced by a different talker. This same-voice advantage suggests that talker information is implicitly preserved within the representation of words. In an fMRI study, Chandrasekaran et al. (2011) found that repeated real words attenuated the Blood Oxygenation Level Dependent (BOLD) signal in the left middle temporal gyrus (MTG) less when the words were "repeated" by multiple talkers than by a single talker. The reduced attenuation cannot simply be attributed to greater acoustic variability in the multi-talker condition, because pseudowords produced by multiple talkers vs. a single talker activated the left MTG equivalently. The authors interpreted this effect as indicating an integral neural representation of lexical and talker information in the left MTG, such that lexical representations contain talker-specific exemplars, reducing the repetition attenuation effect. Pseudowords, which have no lexical representations in the left MTG, therefore show no such effect.
Current Study
In this study, we conducted separate fMRI and event-related potential (ERP) experiments to examine the spatial and temporal loci of the interaction of phonetic and talker processing in a tone language. As mentioned above, lexical tones are ideal for examining the neural mechanisms associated with inter-dependencies of phonetic and talker processing, because pitch differences are phonemic and correlated with talker differences in tone languages. We focus on testing whether the integral processing of the pitch parameter activates the bilateral posterior STG/STS, as has been shown for the vocal tract parameter (von Kriegstein et al., 2010). Moreover, the temporal loci of integrated phonetic and talker processing remain unclear, due to the possible confounding habituation/neuronal refractoriness effects caused by unmatched stimulus probabilities in the Garner paradigm. Although there have been several fMRI studies on the interaction of linguistic and talker processing, ERP studies are relatively scarce.
In this study, we followed the Garner paradigm with a modified design that critically controlled for the unmatched stimulus probabilities discussed above. According to the rationale of the Garner paradigm, if two dimensions are processed integrally (e.g., using the same sensory or cortical pathways), changes in an unattended dimension will impede processing of the attended dimension (Mullennix and Pisoni, 1990). We reasoned that trials with unattended changes, compared to trials with attended changes presented in the same block at equal probabilities, might also show an interference effect on the processing of the attended dimension. Such interference effects would reveal differences in processing as a consequence of integration. To this end, we adopted a task (phonetic change detection, talker change detection) by trial type (no change, talker change, phonetic change, phonetic+talker change) design. Each block comprised trials with no change, talker changes only, phonetic changes only, and phonetic+talker changes, and listeners’ attention was directed to either the phonetic or the talker dimension of the stimuli by the task. In the phonetic task, where participants were required to detect phonetic changes while ignoring talker changes, the phonetic change trial/deviant serves as the relevant condition and the talker change trial/deviant as the interference condition; in the talker task, where participants were required to detect talker changes while ignoring phonetic changes, the talker change trial/deviant serves as the relevant condition and the phonetic change trial/deviant as the interference condition. Moreover, we included trials with no changes as a control condition, and trials with both attended and unattended changes as a coupled condition. The coupled condition might facilitate processing, reducing the cognitive effort needed to detect changes in the attended dimension, because the unattended dimension changes synchronously with the attended dimension. Note that our conditions do not map directly onto the Garner paradigm, though there are analogous conditions.
We define the spatial loci of integral phonetic and talker processing as brain regions that respond more to the implicit processing of unattended changes, comparing the interference condition against the relevant condition in both the phonetic and talker tasks. If the bilateral STG/STS are activated beyond Heschl’s gyrus, this would support a general neural mechanism of integral phonetic and talker processing in the bilateral STG/STS. If the right Heschl’s gyrus is activated, it may suggest that the processing of pitch changes in a tone language is similar to the processing of intonation changes in English. We infer the temporal loci from the time-windows in which the ERPs are differentially modulated by the interference and relevant conditions, focusing on whether the interaction can be detected as early as the N1 time-window when stimulus probabilities are matched. Lastly, if the coupled condition facilitates processing as predicted, it might reduce the BOLD signal and ERP amplitude compared to the relevant condition. But the coupled condition is not the most crucial condition for investigating integral processing, as mentioned above.
The same group of subjects participated in the fMRI and ERP experiments. The same task by trial type design was adopted for both experiments, though the stimulus presentation differed slightly to suit the analysis needs of each imaging method. For the fMRI experiment, we adopted an adaptation paradigm (see Figure 1A; see the Materials and Methods sections below for details). Each trial consisted of four stimuli, and all four trial types (no change, talker change, phonetic change, and phonetic+talker change) were presented pseudo-randomly within blocks at equal probabilities to allow for event-related analysis. For the ERP experiment, we adopted an active oddball paradigm (see Figure 1B; see the Materials and Methods sections below for details). Each individual stimulus constituted a trial, and the three deviants (talker change, phonetic change, and phonetic+talker change) were presented pseudo-randomly at equal probabilities in a stream of highly repetitive standards within a block. This design controls for the habituation/refractoriness effects discussed above.
fMRI experiment
Material and methods
Participants
Nineteen native speakers of Hong Kong Cantonese (12 female, 7 male; mean age = 21.4 years, SD = 1.1, range 19.6 to 24.4 years) were paid to participate in the experiment. All participants were right-handed university students with normal hearing and no reported musical training or history of neurological illness. One male subject’s data were excluded from analysis due to excessive head movement (percentage of TRs censored: 25%; see fMRI Data Acquisition and Analysis below). The experimental procedures were approved by the Shenzhen Institutes of Advanced Technology Institutional Review Board. Informed written consent was obtained from each participant in compliance with the experimental protocols.
Stimuli
The stimuli were two meaningful Cantonese words – /ji/ carrying the high level tone (/ji55/ 醫 “a doctor”) and /ji/ carrying the high rising tone (/ji25/ 椅 “a chair”) – produced by one female and one male native Cantonese speaker (neither of whom participated in the experiment). These four naturally produced syllables (female Tone 55, female Tone 25, male Tone 55, male Tone 25) were normalized in duration to 350 ms and in average intensity to 80 dB in Praat (Boersma and Weenink, 2012). Figure 2 shows the fundamental frequency (F0) trajectories of the stimuli, and Table 3 shows the mean frequencies of F0 and the first and second formants (F1 and F2).
Table 3. Mean (SD) F0, F1 and F2 of the four stimuli. All values in Hz.

| Stimulus | F0 (SD) | F1 (SD) | F2 (SD) |
|---|---|---|---|
| Female Tone 55 | 279 (2) | 343 (39) | 2781 (25) |
| Female Tone 25 | 212 (28) | 372 (24) | 2605 (90) |
| Male Tone 55 | 167 (6) | 318 (22) | 2322 (36) |
| Male Tone 25 | 122 (22) | 252 (43) | 2186 (29) |
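For concreteness, the normalization described above can be scripted. Below is a minimal sketch using parselmouth, a Python interface to Praat (the original normalization was performed in Praat itself); the file names and the 75–600 Hz pitch analysis range are hypothetical assumptions.

```python
# Sketch of duration and intensity normalization, assuming the parselmouth
# Python interface to Praat; file names are hypothetical placeholders.
import parselmouth
from parselmouth.praat import call

TARGET_DUR_S = 0.350  # normalize duration to 350 ms
TARGET_DB = 80.0      # normalize average intensity to 80 dB

for fname in ["female_T55.wav", "female_T25.wav", "male_T55.wav", "male_T25.wav"]:
    snd = parselmouth.Sound(fname)
    # Time-scale the sound to the target duration via PSOLA-style overlap-add
    # (75-600 Hz is a generic pitch analysis range, an assumption here).
    factor = TARGET_DUR_S / call(snd, "Get total duration")
    snd = call(snd, "Lengthen (overlap-add)", 75, 600, factor)
    call(snd, "Scale intensity", TARGET_DB)  # set average intensity to 80 dB
    call(snd, "Save as WAV file", fname.replace(".wav", "_norm.wav"))
```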
Procedure
We used an adaptation paradigm for the fMRI experiment (Celsis et al., 1999; Chandrasekaran et al., 2011; Joanisse et al., 2007; Salvata et al., 2012). The four speech stimuli were combined to form four trial types. Each trial consisted of four stimuli, the first three being identical standards and the fourth being identical to the standards (no change), different from the standards in tone category but identical in talker’s voice (phonetic change), different from the standards in talker’s voice but identical in tone category (talker change), or different in both tone category and talker (phonetic+talker change) (see Figure 1A). Each trial was 1550 ms long, containing four 350-ms stimuli separated by 50-ms silent intervals. Repeated presentation of the standards is expected to habituate the BOLD signal; the subsequent presentation of a stimulus differing from the standards in the linguistic or talker dimension should produce a release from adaptation, i.e., an increased BOLD signal, in regions sensitive to the processing of that dimension.
There were four blocks in total, with each of the four speech stimuli serving as standards in one block. Collapsed across the four blocks, all four trial types contain acoustically identical stimuli. Within a block, all four trial types were presented twelve times in pseudorandom order at jittered trial durations of 4, 5, 6 and 7 seconds to allow for event-related analysis. Occasional longer durations (i.e., null trials) were included to provide a better estimate of the baseline response.
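As an illustration of this block structure, the sketch below assembles one block's trial list; the number of null trials and the simple shuffle are illustrative assumptions, not the exact randomization procedure used.

```python
# Minimal sketch of one block's trial list: 12 trials per type, jittered
# trial durations of 4-7 s, plus null (baseline) trials. The null-trial
# count and plain shuffling are assumptions for illustration.
import random

TRIAL_TYPES = ["no_change", "talker_change", "phonetic_change", "phonetic_talker_change"]
JITTERS_S = [4, 5, 6, 7]

def make_block(n_per_type=12, n_null=6, seed=0):
    rng = random.Random(seed)
    trials = [(t, rng.choice(JITTERS_S)) for t in TRIAL_TYPES for _ in range(n_per_type)]
    trials += [("null", rng.choice(JITTERS_S)) for _ in range(n_null)]
    rng.shuffle(trials)
    return trials  # list of (trial_type, trial_duration_s) pairs
```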
The same four blocks were presented twice, once in a phonetic change detection task and once in a talker change detection task. In the phonetic task, participants were instructed to press one button when there was no change in tone category in the fourth stimulus of a trial (“no change” response: no change and talker change trials), and to press the other button when there was a change in tone category (“change” response: phonetic change and phonetic+talker change trials). Correspondingly, in the talker task, participants were instructed to press one button when there was no change in talker’s voice in the fourth stimulus of a trial (“no change” response: no change and phonetic change trials), and to press the other button when there was a change in talker’s voice (“change” response: talker change and phonetic+talker change trials). Participants were given two seconds to respond after each trial. Button assignments were counterbalanced: half of the participants made “no change” responses with the left thumb and “change” responses with the right thumb, and the assignment was reversed for the other half. In the phonetic task, seven crosses in a row (“+++++++”) were shown in the center of the screen throughout a block to remind participants of the task; in the talker task, seven hyphens in a row (“−−−−−−−”) were shown instead. Simple visual symbols were used to minimize visual processing and avoid interference with the experimental tasks.
For each task, the presentation order of four blocks was counterbalanced across the participants. Two consecutive blocks alternated between phonetic and talker tasks, in order to reduce adaptation for a particular task. Prior to the fMRI experiment, each participant was given six practice trials for each task (taken from the beginning of an experimental block) to familiarize them with the procedures.
fMRI data acquisition and analysis
fMRI data were acquired using a 3T Magnetom TRIO scanner (Siemens, Erlangen, Germany) equipped with a 12-channel phased-array receive-only head coil at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. A 3D MPRAGE sequence was used to obtain high-resolution T1-weighted anatomical images (scan repetition time (TR) = 2530 ms; echo time (TE) = 2.4 ms; inversion time (TI) = 900 ms; flip angle = 7°; field of view (FOV) = 256 mm; voxel size = 1.0 mm × 1.0 mm × 1.0 mm; 176 slices in total). Functional gradient-echo planar images (EPI) were acquired (TR = 2000 ms; TE = 30 ms; flip angle = 80°; FOV = 220 mm; 4 mm slice thickness, no gap; 64 × 64 matrix; 32 slices) in ascending interleaved axial slices.
Eight imaging runs, each containing 146 TRs, were obtained for each participant. Data analysis was performed using AFNI (Cox, 1996). The first six TRs of each run were discarded. Images were corrected for slice acquisition time, aligned to the first volume, motion corrected using a six-parameter rigid-body transform, and spatially smoothed with an 8 mm Gaussian filter. TRs exceeding 3 mm displacement or 3° rotation in TR-to-TR change were censored, as were TRs with more than 10% of voxels flagged as outliers. The mean percentage of TRs censored across all subjects was 3%. The high-resolution anatomical scan for each subject was normalized to Talairach and Tournoux stereotaxic space using the Colin27 template; all data were transformed to this space using a single concatenated transform from EPI to high-resolution anatomical to the Colin27 template.
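As a rough illustration of the censoring rule, the sketch below flags TRs whose TR-to-TR motion change exceeds the thresholds; the layout of the motion-parameter array (three translations in mm followed by three rotations in degrees) is an assumption about the input format.

```python
# Sketch of TR censoring: drop TRs whose TR-to-TR motion change exceeds
# 3 mm translation or 3 degrees rotation. Assumes `motion` is an array of
# shape (n_trs, 6): columns 0-2 translations (mm), columns 3-5 rotations (deg).
import numpy as np

def censor_mask(motion, max_shift_mm=3.0, max_rot_deg=3.0):
    delta = np.abs(np.diff(motion, axis=0))  # TR-to-TR change
    bad = (delta[:, :3] > max_shift_mm).any(axis=1) | \
          (delta[:, 3:] > max_rot_deg).any(axis=1)
    keep = np.ones(len(motion), dtype=bool)
    keep[1:][bad] = False  # censor the TR at the end of each offending change
    return keep
```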
Single-subject BOLD signals were scaled and submitted to a voxel-wise regression analysis with idealized hemodynamic responses as regressors, created by convolving the timing of each trial type with a gamma function. The six parameters from the motion-correction process were included as nuisance regressors, as were baseline, linear, and quadratic trend terms. Regression coefficients from the single-subject level were entered into a group-level analysis. A mixed-factor ANOVA was conducted using AFNI's 3dANOVA3, with task (phonetic task, talker task) and trial type (no change, talker change, phonetic change, phonetic+talker change) as fixed factors and subject as a random factor. Contrast maps were obtained for comparisons of interest (see Activation Results for details). Statistic images were assessed for cluster-wise significance using a cluster-defining threshold of p = 0.001; the 0.01 FWE-corrected critical cluster size was 43.7.
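The idealized regressors referred to above can be sketched as a stick function of trial onsets convolved with a gamma-variate hemodynamic response; this mirrors what AFNI's regression does internally, but the gamma parameters below are illustrative rather than AFNI's exact defaults.

```python
# Sketch of one idealized regressor: trial onsets convolved with a gamma HRF.
# Gamma shape/scale values are illustrative assumptions.
import numpy as np
from scipy.stats import gamma

TR_S = 2.0     # repetition time, as acquired
N_TRS = 140    # TRs per run after discarding the first six

def gamma_hrf(tr=TR_S, duration_s=20.0, shape=6.0, scale=0.9):
    t = np.arange(0.0, duration_s, tr)
    h = gamma.pdf(t, a=shape, scale=scale)
    return h / h.max()

def make_regressor(onsets_s, tr=TR_S, n_trs=N_TRS):
    stick = np.zeros(n_trs)
    stick[(np.asarray(onsets_s) / tr).astype(int)] = 1.0  # one event per trial onset
    return np.convolve(stick, gamma_hrf(tr))[:n_trs]
```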
Results
In-scanner behavioral results
Figure 3 shows the accuracy and reaction time of the in-scanner behavioral performance of the 18 participants. Accuracy was calculated as the percentage of trials of each of the four trial types that were correctly classified as containing or not containing the stimulus change that the task required participants to detect. Trials with no response within the time limit were excluded from the accuracy analysis; the percentage of trials with missing responses varied between 2.8% and 4.7% across trial types. An arcsine transformation was then applied to the percentage data. For reaction time, incorrect responses were excluded from the analysis, as were trials with reaction times exceeding three SDs from the mean of each task (1.4% of correct trials). Two-way repeated measures ANOVAs were conducted on the transformed accuracy data and on reaction time separately, with task (phonetic task, talker task) and trial type (no change, talker change, phonetic change, phonetic+talker change) as factors. The Greenhouse-Geisser method was used to correct for violations of sphericity where appropriate.
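The arcsine transformation here is presumably the standard variance-stabilizing form arcsin(sqrt(p)), which matches the range of the values reported below (proportions near 0.99 map to about 1.47):

```python
# Arcsine transform of proportions, assuming the common arcsin(sqrt(p)) form.
import numpy as np

def arcsine_transform(p):
    """p: proportion correct in [0, 1]; returns radians in [0, pi/2]."""
    return np.arcsin(np.sqrt(np.asarray(p)))

arcsine_transform([0.90, 0.99])  # -> array([1.249, 1.471])
```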
For arcsine transformed accuracy, there were significant main effects of task (phonetic task = 1.45; talker task = 1.5; F(1, 17) = 6.015, p = 0.025) and trial type (no change = 1.5; talker change = 1.43; phonetic change = 1.47; phonetic+talker change = 1.5; F(3, 51) = 4.931, p = 0.004), and a significant task by trial type interaction (F(3, 51) = 4.707, p = 0.006). Note that the focus of this study is on the interference condition. One-way ANOVA was conducted to examine the effect of trial type in each task and revealed a main effect of trial type in the phonetic task only (F(3, 68) = 6.277, p < 0.001). Pair-wise comparisons with Bonferroni correction for multiple comparisons show that the interference condition (talker change) was classified less accurately than the other three conditions – the control condition (no change) (1.36 vs. 1.51; p < 0.001), the relevant condition (phonetic change) (1.36 vs. 1.46; p = 0.035) and the coupled condition (phonetic+talker change) (1.36 vs. 1.47; p = 0.023) in the phonetic task. The difference between the coupled condition and the relevant condition was not significant (1.47 vs. 1.46; p = 0.999). In the talker task, the effect of trial type was not significant (F(3, 68) = 1.229, p = 0.306). Paired-samples t-tests were conducted to examine the effect of task in each trial type. The effect of task was only significant in the talker change trial (t(17) = −3.99, p < 0.001), which was classified less accurately in the phonetic task than in the talker task (1.36 vs. 1.5). The results indicate asymmetrical interference effects – unattended talker changes interfered with the accuracy of phonetic change detection (a decrease of arcsine transformed accuracy by 0.157 compared to the control condition), whereas the interference effect of unattended phonetic changes showed a non-significant trend (a decrease of arcsine transformed accuracy by 0.015 compared to the control condition).
For reaction time, there was a significant main effect of trial type (no change = 343 ms; talker change = 403 ms; phonetic change = 439 ms; phonetic+talker change = 422 ms; F(1.864, 31.696) = 24.977, p < 0.001) and a significant task by trial type interaction (F(1.556, 26.457) = 6.47, p = 0.008). One-way ANOVAs examining the effect of trial type in each task found no significant effect in either task. Despite the lack of significant effects, in the talker task the interference condition showed a non-significant trend toward longer reaction times than the relevant condition (421 ms vs. 372 ms; p = 1.0), whereas no such trend was present in the phonetic task (interference condition = 434 ms; relevant condition = 456 ms). Paired-samples t-tests were conducted to examine the effect of task in each trial type. The only significant effect of task was found in the phonetic+talker change trial (t(17) = 2.802, p = 0.012), which was classified more slowly in the phonetic task than in the talker task (481 ms vs. 363 ms).
In summary, accuracy shows asymmetrical interference effects – a significant interference effect of unattended talker changes on phonetic change detection was found, whereas the effect of unattended phonetic change on talker change detection was not significant. Reaction time shows a trend of an interference effect of unattended phonetic changes on talker processing but the effect was not significant. There is no evidence that the coupled condition facilitates the processing. The coupled condition does not differ significantly from the relevant condition in either accuracy or reaction time.
Activation results
Contrast maps were obtained for main and interaction effects of task and trial type, and for comparisons of interest involving the interference condition, i.e., interference condition vs. relevant condition (phonetic change vs. talker change in the talker task, talker change vs. phonetic change in the phonetic task), and interference condition vs. control condition (phonetic change vs. no change in the talker task, talker change vs. no change in the phonetic task). Furthermore, contrast maps were obtained for the following comparisons: coupled condition vs. relevant condition (phonetic+talker change vs. phonetic change in the phonetic task, phonetic+talker change vs. talker change in the talker task), relevant condition vs. control condition (phonetic change vs. no change in the phonetic task, talker change vs. no change in the talker task), and coupled condition vs. control condition (phonetic+talker change vs. no change in the phonetic task, phonetic+talker change vs. no change in the talker task). For each comparison, significant clusters (FWE corrected p = 0.01, uncorrected p = 0.001) are reported in Table 4. Figure 4 shows the significant activation of contrasts involving the interference condition.
Figure 4. Panels: interference > relevant (talker change > phonetic change in the phonetic task); interference > control (phonetic change > no change in the talker task); interference > control (talker change > no change in the phonetic task).
Table 4. Significant clusters for each contrast (peak Talairach coordinates; P = phonetic change, T = talker change; / = no significant cluster).

| Condition | Region | x | y | z | Size (cm³) |
|---|---|---|---|---|---|
| Main effect of task | / | | | | |
| Main effect of trial type | L superior temporal gyrus | −65 | −32 | 7 | 12.096 |
| | R superior temporal gyrus | 66 | −14 | 11 | 16.605 |
| | L precentral gyrus | −37 | 3 | 35 | 1.755 |
| | R insula | 35 | 17 | 3 | 1.188 |
| Interaction of task by trial type | / | | | | |
| Interference condition vs. relevant condition | | | | | |
| P vs. T change in T task | / | | | | |
| T vs. P change in P task | L superior temporal gyrus | −62 | −35 | 7 | 2.754 |
| | R inferior frontal gyrus | 38 | 27 | −3 | 2.133 |
| | R middle & superior temporal gyrus | 60 | −48 | 3 | 5.697 |
| | R cerebellum | 32 | −40 | −39 | 1.62 |
| Interference condition vs. control condition | | | | | |
| P vs. no change in T task | L inferior frontal gyrus | −40 | 7 | 32 | 3.375 |
| | L Heschl’s gyrus | −62 | −17 | 11 | 1.296 |
| | L parahippocampal gyrus | −13 | −32 | −3 | 3.267 |
| | R inferior frontal gyrus | 44 | 10 | 32 | 1.62 |
| | R superior temporal gyrus | 60 | −20 | 8 | 2.7 |
| | R middle & superior temporal gyrus | 60 | −1 | −5 | 1.188 |
| T vs. no change in P task | L superior temporal gyrus | −65 | −32 | 7 | 2.079 |
| | R Heschl’s gyrus | 66 | −11 | 11 | 1.701 |
| Coupled condition vs. relevant condition | | | | | |
| P+T vs. T change in T task | / | | | | |
| P+T vs. P change in P task | / | | | | |
| Relevant condition vs. control condition | | | | | |
| P vs. no change in P task | / | | | | |
| T vs. no change in T task | L superior temporal gyrus | −65 | −32 | 7 | 3.267 |
| | L cerebellum | −13 | −65 | −16 | 1.404 |
| | R superior temporal gyrus | 66 | −23 | 7 | 7.587 |
| Coupled condition vs. control condition | | | | | |
| P+T vs. no change in P task | L superior temporal gyrus | −65 | −32 | 7 | 1.512 |
| P+T vs. no change in T task | R Heschl’s gyrus | 54 | −20 | 11 | 1.404 |
Main effect of trial type
Four clusters were significantly activated, which were primarily located in the bilateral STG, left precentral gyrus and right insula.
Interference condition vs. relevant condition
For the phonetic change vs. talker change in the talker task, no significant activation was found. For the talker change vs. phonetic change in the phonetic task, four clusters were significantly activated by the interference condition, which were mostly located in the left STG, the right inferior frontal gyrus (IFG), the right MTG which extended into the right STG, and the right cerebellum.
Interference condition vs. control condition
For the phonetic change vs. no change in the talker task, six clusters were significantly activated by the interference condition: one cluster in the left IFG, one cluster with peak activation in the left Heschl’s gyrus extending into the STG, one cluster with peak activation in the left parahippocampal gyrus extending into the right thalamus, one cluster in the right IFG, one cluster in the right STG, and one cluster with peak activation in the right MTG extending into the anterior STG. For the talker change vs. no change in the phonetic task, two clusters were found for the interference condition, one in the left STG and the other with the peak activation in the right Heschl’s gyrus extending into the right STG.
Relevant condition vs. control condition
For the phonetic change vs. no change in the phonetic task, no significant activation was found. For the talker change vs. no change in the talker task, three clusters were significantly activated by the relevant condition, which were mainly located in the left STG, the left cerebellum, and the right STG.
Coupled condition vs. control condition
In the phonetic task, the coupled condition significantly activated one cluster in the left STG. In the talker task, the coupled condition activated one cluster with peak activation in the right Heschl’s gyrus, which extended into the right STG.
Discussion
Interference condition vs. relevant condition and interference condition vs. control condition
The main finding is that the interference condition (talker change) activated the left STG and the right STG (extending into the right MTG) more than the relevant condition (phonetic change) in the phonetic task. That is, when listeners attended to phonetic changes in the stimuli, unattended talker changes activated the bilateral STG more than attended phonetic changes did. Involvement of the bilateral STG in integral phonetic and talker processing is further shown by the contrast of the interference condition vs. the control condition. Unattended talker changes in the phonetic task significantly activated the left STG and the right Heschl’s gyrus extending into the right STG; unattended phonetic changes in the talker task significantly activated the right STG and the left Heschl’s gyrus extending into the left STG. Thus, the bilateral STG were sensitive to the processing of unattended phonetic and talker changes. These findings are largely consistent with previous work (von Kriegstein et al., 2010).
In addition, the right IFG and the right cerebellum were activated more by the talker change than by the phonetic change in the phonetic task, which warrants explanation. Previous studies have found that the right IFG is activated when inhibiting responses to irrelevant trials in go/no-go tasks, associating the right IFG with response inhibition (Aron et al., 2014; Chikazoe et al., 2007; Hampshire et al., 2010; Lenartowicz et al., 2011). In this study, participants had to ignore irrelevant talker changes and avoid making a “change” response; it is likely that inhibiting such responses to irrelevant changes activated the right IFG. As for the activation of the cerebellum, it may indicate that the automatic recognition or learning subserved by the cerebellum (e.g., Nicolson et al., 2001; Ito, 2000) is more sensitive to talker changes than to phonetic changes. Because the acoustic changes were larger for talker changes (absolute difference: F0 = 101 Hz; F1 = 73 Hz; F2 = 440 Hz) than for phonetic changes (absolute difference: F0 = 56 Hz; F1 = 47 Hz; F2 = 156 Hz), it may be more difficult to suppress the automatic detection of talker changes, even though selective attention was directed to phonetic changes by the task.
For the contrast of phonetic change vs. no change in the talker task (interference condition vs. control), a few more brain regions were significantly activated, including the left IFG, the left parahippocampal gyrus (extending into the right thalamus), the right IFG and the right MTG (extending into the right anterior STG). The left IFG, which is often activated in the processing of speech sounds (e.g., Salvata et al., 2012), was likely involved in processing the phonetic changes in the speech stimuli. The left parahippocampal gyrus, which plays an important role in memory encoding (Wagner et al., 1998), likely mediated the encoding of speech stimuli with phonetic changes. The right IFG likely mediated the inhibition of responses to irrelevant phonetic changes in the talker task, as discussed earlier. The right MTG and anterior STG were likely involved in processing phonetic changes in the stimuli.
Relevant condition vs. control condition
For the relevant condition vs. the control condition, different activation patterns were found for the two contrasts. For the phonetic change vs. no change in the phonetic task, no brain region was significantly activated, whereas the bilateral STG and the left cerebellum were significantly activated for the talker change vs. no change in the talker task. This may reflect some differences between phonetic and talker processing. Firstly, the acoustic changes were larger for talker changes than for phonetic changes; significant activation of the bilateral STG (which extends into the bilateral Heschl’s gyri to some extent) may therefore have been found in the talker change condition but not in the phonetic change condition (cf. Zevin et al., 2010). Secondly, the automatic recognition or learning subserved by the cerebellum may be more sensitive to talker changes, as discussed earlier. Thirdly, talker changes carry paralinguistic information. The parahippocampal gyrus has been found to be involved in the processing of paralinguistic elements of verbal communication such as sarcasm (Rankin et al., 2009), and in the current study the right parahippocampal gyrus was activated for the talker change vs. no change in the talker task at a more lenient statistical threshold (FWE corrected p = 0.05, uncorrected p = 0.001).
Coupled condition vs. control condition
For the coupled condition vs. the control condition, the left STG was activated in the phonetic task, and the right Heschl’s gyrus (extending into the right STG) was activated in the talker task. This suggests that the activation was modulated by the top-down influence of the task, showing differential weighting of the left and right hemispheres in linguistic and non-linguistic tasks. It is likely that the phonetic+talker change condition is encoded more strongly in the left STG in the phonetic task, and more strongly in the right Heschl’s gyrus (extending into the right STG) in the talker task.
ERP experiment
Materials and methods
Participants
The same eighteen subjects (12 female, 6 male; mean age = 21.41 years, SD = 1.13, range 19.58 to 24.42 years) participated in the ERP experiment about one month after the fMRI experiment. Informed written consent was obtained from each subject in compliance with protocols approved by the Joint Chinese University of Hong Kong-New Territories East Cluster Clinical Research Ethics Committee.
Stimuli and procedure
The same four stimuli used in the fMRI experiment were used in the ERP experiment in an oddball paradigm. One of the four stimuli (female Tone 55, female Tone 25, male Tone 55, male Tone 25) was presented as the standard in a block, and the other three stimuli were presented infrequently as three types of deviant (talker change, phonetic change, phonetic+talker change) (see Figure 1B). There were four blocks, with each of the four speech stimuli serving as the standards in one block. Collapsed across the four blocks, all three deviant types consisted of acoustically identical stimuli. The total number of the four speech stimuli presented in the fMRI and ERP experiments was identical. The difference is that three deviants were presented in a stream of standards in the ERP experiment whereas they were combined with the standards to form four trial types in the fMRI experiment. Similar to the fMRI adaptation paradigm, repetition of standards is expected to habituate neural responses and a deviant in the linguistic or talker dimension is expected to elicit a large N1-P2 complex and P300 (e.g., Budd et al., 1998; Donchin, 1981).
Each block comprised 156 standards, 12 phonetic deviants, 12 talker deviants, and 12 phonetic+talker deviants. All deviants were presented at equal probabilities within a block (probability of standard = 81.25%; probability of each deviant = 6.25%), in order to avoid the possible confound of habituation/refractoriness effects discussed earlier. The standards and the three types of deviants were presented pseudo-randomly within a block: two consecutive deviants were separated by at least three standards, and the first eight stimuli of a block were always standards. The inter-stimulus interval (offset to onset) was jittered between 800 and 1200 ms, in order to avoid expectation of hearing a stimulus at fixed time intervals.
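To make the sequencing constraints concrete, the sketch below builds one block constructively, guaranteeing eight leading standards and at least three standards between consecutive deviants; the specific randomization scheme is an illustrative assumption.

```python
# Sketch of one oddball block: 156 standards and 12 of each of three deviants,
# with >= 3 standards between deviants, 8 leading standards, and ISIs jittered
# between 800 and 1200 ms. The construction scheme is an assumption.
import random

def make_oddball_block(n_std=156, n_per_dev=12, seed=0):
    rng = random.Random(seed)
    deviants = ["talker", "phonetic", "phonetic_talker"] * n_per_dev
    rng.shuffle(deviants)
    n_dev = len(deviants)
    # Mandatory standards: 8 leading, plus 3 after every deviant but the last.
    extra = n_std - 8 - 3 * (n_dev - 1)
    gaps = [0] * (n_dev + 1)
    for _ in range(extra):                       # spread leftover standards randomly
        gaps[rng.randrange(n_dev + 1)] += 1
    seq = ["standard"] * (8 + gaps[0])
    for i, dev in enumerate(deviants):
        seq.append(dev)
        n_after = (3 if i < n_dev - 1 else 0) + gaps[i + 1]
        seq += ["standard"] * n_after
    isis_ms = [rng.uniform(800, 1200) for _ in seq]  # offset-to-onset ISI jitter
    return list(zip(seq, isis_ms))
```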
The same four blocks were presented twice, once in a phonetic change detection task, and once in a talker change detection task. In the phonetic task, the participants were instructed to silently count upon hearing a stimulus different from the standards in tone category (phonetic deviant and phonetic+talker deviant); in the talker task, the participants were instructed to silently count upon hearing a stimulus different from the standards in talker’s voice (talker deviant and phonetic+talker deviant). At the end of a block, the participants were asked to provide the total number of stimuli they had counted. For each task, the presentation order of four blocks was counterbalanced among the subjects. Two consecutive blocks alternated between phonetic and talker tasks, in order to reduce adaptation for a particular task.
Prior to the ERP experiment, participants were given two practice blocks for each task, each containing one third of the stimuli of an experimental block (52 standards, 4 phonetic deviants, 4 talker deviants, and 4 phonetic+talker deviants). To ensure that participants were able to detect deviants as required, during the practice they were instructed to make a manual response by pressing a button (“downward arrow”) upon hearing the required changes. Participants repeated the practice until they reached at least 90% accuracy (five participants practiced the phonetic task twice to reach this criterion).
EEG data acquisition and analysis
Electroencephalographic (EEG) signals were recorded using a SynAmps 2 amplifier (NeuroScan, Charlotte, NC, U.S.) with a cap carrying 128 Ag/AgCl electrodes placed on the scalp at standard locations according to the international 10–20 system. Vertical electrooculography (EOG) was recorded using a bipolar channel with electrodes placed above and below the left eye, and horizontal EOG was recorded using a bipolar channel with electrodes placed lateral to the outer canthi of both eyes. The online reference was the linked mastoids. Impedance between the online reference and any recording electrode was kept below 10 kΩ for all subjects. Alternating current signals (0.15–200 Hz) were continuously recorded and digitized at a sampling rate of 500 Hz.
Preprocessing of EEG signals was conducted using BESA Version 5. The EEG recordings were re-filtered with a 0.5–30 Hz band-pass zero-phase-shift digital filter (slope 12 dB/Oct). Epochs ranging from −100 to 800 ms relative to the onset of each deviant, and of the standard immediately preceding each deviant, were analyzed. Baseline correction was performed using the 100 ms pre-stimulus interval. Epochs with potentials exceeding ±120 µV at any electrode were rejected from analysis (mean rejection rate: 7.79%). ERP waveforms for the standard and the three deviants were averaged across the remaining epochs and all subjects, and are shown in Figure 5.
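For readers reproducing this step outside BESA, an equivalent epoching and artifact-rejection pipeline might look like the sketch below in MNE-Python; the file name and event codes are hypothetical.

```python
# Hedged sketch of filtering, epoching, baseline correction and artifact
# rejection in MNE-Python (the original analysis used BESA 5). File name
# and event codes are hypothetical.
import mne

raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)
raw.filter(0.5, 30.0)  # 0.5-30 Hz band-pass, zero-phase by default

events = mne.find_events(raw)
event_id = {"standard": 1, "talker_dev": 2, "phonetic_dev": 3, "phon_talker_dev": 4}
epochs = mne.Epochs(
    raw, events, event_id,
    tmin=-0.1, tmax=0.8,      # -100 to 800 ms around stimulus onset
    baseline=(None, 0),       # correct using the 100 ms pre-stimulus interval
    reject=dict(eeg=120e-6),  # drop epochs exceeding +/-120 microvolts
)
evokeds = {cond: epochs[cond].average() for cond in event_id}
```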
Two sets of ERP analyses were conducted. For the first set of analyses, we used a traditional peak identification analysis and identified three time-windows according to the global field power (see Figure 6A): N1 (120~220 ms), P2 (220~300 ms) and P3a (300~500 ms). A cluster of central electrodes was selected for the N1 (FFC4h, FC1, FC2, FC4, FCC3h, FCCz, FCC4h, FCC1h, FCC2h, C3, C1, Cz, C2, C4, CCP1h, CCP2h, CCP3h, CCP4h, CP3, CP1, CP2, CPPz, CPP3h, CPP1h, CPP2h, CPP4h, PPOz), a cluster of frontal electrodes was selected for the P2 (FPz, FP1, FP2, AF7, AF3, AFz, AF4, AF8, AFF5h, AFF6h, F1), and a cluster of centro-posterior electrodes was selected for the P3a (CCP4h, CP1, CP2, CPPz, CPP3h, CPP1h, CPP2h, CPP4h, CPP6h, TPP8h, PPOz, TP8, P3, P4, P6, P8, PPO1, PPO2, POz, PO3, PO1, PO2, PO4, POOz, PPO8, POO3, POO4), according to their topographic distribution (see Figure 6B) and confirmed by the literature.
For the second set of analyses, we used principal components analysis (PCA) to separate two co-occurring ERP components in the 500~800 ms time-window. The whole segment of ERP data (−100~800 ms) was input to a two-step PCA (Dien, 2010). In the first step, a temporal PCA yielded 21 components accounting for 96.42% of the variance in the ERP data; in the second step, a spatial PCA yielded five components accounting for 84.71% of the variance. The first temporal factor spanned the 500–800 ms time-window (peak time 656 ms), which is the target time-window of interest. Its first spatial factor consisted of a negative-going wave in this time-window over anterior electrode sites, referred to as a frontal negativity (FN) hereafter; its second spatial factor consisted of a positive-going wave over posterior electrode sites, which is consistent with the temporal and spatial distribution of the P3b (Courchesne et al., 1978; Grillon et al., 1990; Isreal et al., 1980; Johnson, 1986; Kok, 2001; Polich, 2007; Polich and Criado, 2006; Squires et al., 1975). For the FN, 38 fronto-central electrodes identified by the PCA were selected for statistical analysis (AF7, FP1, FPz, FP2, AF8, F8, AFF5h, AF3, AFz, AF4, AFF6h, F3, F1, Fz, F2, F4, F6, FFC5h, FFC3h, FFC1h, FFC2h, FFC4h, FFC6h, FC3, FC1, FCz, FC2, FC4, FC6, FCC3h, FCC1h, FCCz, FCC2h, FCC4h, FCC6h, C1, C2, C4); for the P3b, 34 centro-posterior electrodes identified by the PCA were selected for statistical analysis (CCP5h, TPP5h, C3, CP3, CPP5h, P5, C1, CCP3h, CPP3h, P3, PO3, FCC1h, CCP1h, CP1, CPP1h, PPO1, PO1, POO3, CPPz, PPOz, POz, POOz, CCP2h, CP2, CPP2h, PPO2, PO2, C2, CCP4h, CPP4h, P4, PO4, CP4, CPP6h). Previous studies suggest that the P3a and P3b are dissociable components with different temporal and topographical distributions (Polich, 2007; Polich and Criado, 2006; Squires et al., 1975). The temporal and spatial distributions of the P3a and P3b found in this study are largely consistent with those studies.
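A minimal sketch of the two-step (temporal, then spatial) PCA is given below, assuming scikit-learn and an input matrix whose rows are subject × condition × channel observations and whose columns are timepoints; factor rotation (commonly applied in this procedure, e.g., Promax) is omitted for brevity.

```python
# Sketch of a two-step temporospatial PCA (cf. Dien, 2010), using scikit-learn.
# Assumes `erp` has shape (n_subjects * n_conditions * n_channels, n_times),
# with channel as the fastest-varying row index. Rotation steps are omitted.
import numpy as np
from sklearn.decomposition import PCA

def two_step_pca(erp, n_channels, n_temporal=21, n_spatial=5):
    # Step 1: temporal PCA -- variables are timepoints.
    t_pca = PCA(n_components=n_temporal)
    t_scores = t_pca.fit_transform(erp)  # (n_obs, n_temporal)
    # Step 2: spatial PCA on each temporal factor -- variables are channels.
    spatial = []
    for k in range(n_temporal):
        scores_k = t_scores[:, k].reshape(-1, n_channels)  # (subj*cond, channels)
        s_pca = PCA(n_components=n_spatial)
        spatial.append((s_pca.fit_transform(scores_k), s_pca.components_))
    return t_pca, spatial
```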
Mean amplitude and peak latency of the N1, P2, P3a, P3b and FN were obtained for each deviant type and each subject. Two-way repeated measures ANOVAs were conducted on the peak latency and mean amplitude of each ERP component separately, with task (phonetic task, talker task) and deviant type (phonetic deviant, talker deviant, phonetic+talker deviant) as factors. The standard was not included in the statistical analysis. The Greenhouse-Geisser method was used to correct for violations of sphericity where appropriate.
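The per-component measurements can be illustrated with a short sketch; the function and its arguments are hypothetical.

```python
# Sketch of mean-amplitude and peak-latency extraction for one component
# window (e.g., N1: 120-220 ms) over an electrode cluster. Illustrative only.
import numpy as np

def measure_component(evoked, times_s, t_win_s, electrode_idx, polarity=-1):
    """evoked: (n_channels, n_times) average; polarity=-1 for negative peaks."""
    sel = (times_s >= t_win_s[0]) & (times_s <= t_win_s[1])
    roi = evoked[electrode_idx][:, sel].mean(axis=0)  # cluster-average waveform
    mean_amp = roi.mean()
    peak_latency = times_s[sel][np.argmax(polarity * roi)]
    return mean_amp, peak_latency
```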
Results
Figure 7 shows the latency and amplitude of the N1, P2, P3a, P3b and FN where there were significant effects.
N1
No effect was significant for the peak latency or mean amplitude of the N1.
P2
For the P2 peak latency, only the effect of deviant type was significant (phonetic deviant = 268 ms; talker deviant = 249 ms; phonetic+talker deviant = 257 ms; F(2, 34) = 8.832, p < 0.001). Pair-wise comparisons with Bonferroni correction show that the phonetic deviant peaked significantly later than the talker deviant (p = 0.003) and the phonetic+talker deviant (p = 0.028).
For the P2 amplitude, the effect of deviant type was also significant (phonetic deviant = 0.31 µV; talker deviant = 1.27 µV; phonetic+talker deviant = 1.35 µV; F(2, 34) = 6.676, p = 0.004). Pair-wise comparisons with Bonferroni correction suggest that the phonetic deviant elicited smaller P2 amplitude than the talker deviant (p = 0.043) and the phonetic+talker deviant (p = 0.015). No other effects were significant.
P3a
For the P3a peak latency, no effect was significant.
For the P3a amplitude, there were significant main effects of task (phonetic task = 1.32 µV; talker task = 0.99 µV; F(1, 17) = 6.202, p = 0.023) and deviant type (phonetic deviant = 0.65 µV; talker deviant = 1.25 µV; phonetic+talker deviant = 1.56 µV; F(2, 34) = 18.15, p < 0.001). Pair-wise comparisons with Bonferroni correction suggest that the phonetic deviant elicited smaller P3a amplitude than the talker deviant (p = 0.002) and the phonetic+talker deviant (p < 0.001). No other effects were significant.
P3b
No effects were significant for the peak latency of the P3b.
For the amplitude of the P3b, there was only a significant interaction of task by deviant (F(2, 34) = 12.709, p < 0.001). One-way ANOVA was conducted to examine the effect of deviant type in each task. The only significant effect was found in the phonetic task (F(2, 51) = 3.699, p = 0.032). Pair-wise comparisons with Bonferroni correction revealed a marginally significant difference between the interference condition (talker change) and the relevant condition (phonetic change) (0.38 µV vs. 1.22 µV, p = 0.077). The interference condition also elicited marginally significantly smaller P3b amplitude than the coupled condition (0.38 µV vs. 1.26 µV, p = 0.059). In the talker task, the interference condition (phonetic change) showed a non-significant trend of eliciting smaller P3b amplitude than the relevant condition (talker change) (0.46 µV vs. 1.04 µV, p = 0.412). The results indicate asymmetrical interference effects – unattended talker changes had a marginally significant effect on the P3b amplitude, whereas the interference effect of unattended phonetic changes only showed a non-significant trend. Paired-samples t-tests were conducted to examine the effect of task in each deviant type. The phonetic deviant elicited larger P3b amplitude in the phonetic task than in the talker task (1.22 µV vs. 0.46 µV; t(17) = 3.97, p < 0.001), and the talker deviant elicited larger P3b amplitude in the talker task than in the phonetic task (1.04 µV vs. 0.38 µV; t(17) = −2.932, p = 0.009). Similar to the phonetic deviant, the phonetic+talker deviant elicited larger P3b amplitude in the phonetic task than in the talker task (1.26 µV vs. 0.83 µV; t(17) = 2.763, p = 0.013).
Frontal negativity
For the peak latency of the FN, there were significant main effects of task (phonetic task = 662 ms; talker task = 641 ms; F(1, 17) = 4.869, p = 0.041) and deviant type (phonetic deviant = 668 ms; talker deviant = 663 ms; phonetic+talker deviant = 625 ms; F(1.471, 25.011) = 4.494, p = 0.031). For the main effect of deviant type, pairwise comparisons with Bonferroni correction for multiple comparisons show that the phonetic deviant peaked later than the phonetic+talker deviant (p = 0.002). No other comparisons were significant.
For the amplitude of the FN, there was only a significant task by deviant type interaction (F(2, 34) = 9.73, p < 0.001). One-way ANOVAs were conducted to examine the effect of deviant type in each task; no effects were significant. Despite the lack of significant effects, the interference condition showed a non-significant trend toward smaller FN amplitude than the relevant condition in both the phonetic task (talker deviant = −1.7 µV; phonetic deviant = −2.23 µV; p = 0.674) and the talker task (phonetic deviant = −1.54 µV; talker deviant = −2.45 µV; p = 0.1). Paired-samples t-tests were conducted to examine the effect of task for each deviant type. The phonetic deviant elicited larger FN amplitude in the phonetic task than in the talker task (−2.23 µV vs. −1.54 µV; t(17) = −3.06, p = 0.007), and the talker deviant elicited larger FN amplitude in the talker task than in the phonetic task (−2.45 µV vs. −1.7 µV; t(17) = 3.173, p = 0.006). No other effects were significant.
In summary, in earlier time-windows of the P2 and P3a, the main effect of deviant type was found, where the phonetic deviant elicited a later-peaking P2, and smaller P2 and P3a amplitude than the talker deviant and phonetic+talker deviant. In the time-window of the P3b and FN, interaction effects were found. For the P3b amplitude, asymmetrical interference effects were found, in which unattended talker changes reduced the P3b amplitude in the phonetic task more than unattended phonetic changes did in the talker task. Moreover, the phonetic deviant elicited larger P3b and FN amplitude in the phonetic task than in the talker task, and the talker deviant elicited larger P3b and FN amplitude in the talker task than in the phonetic task.
Discussion
P3b and frontal negativity effects
According to the categorization difficulty hypothesis, the P3b is sensitive to stimulus categorization difficulty, such that stimuli that are easier to categorize elicit larger P3b amplitude. It has been found that increasing the difficulty of stimulus categorization by adding noise (e.g., random letters) in a visual task lengthens the P3b latency (McCarthy and Donchin, 1981). Moreover, in a dual-task paradigm, increasing the difficulty of a primary task reduces the cognitive resources available to a secondary task, reducing the P3b amplitude elicited by the secondary task (Isreal et al., 1980; Kok, 2001). According to this account, the talker deviant may be harder to categorize in the phonetic task than in the talker task, because the phonetic task directs attention to phonetic changes in the stimuli; likewise, the phonetic deviant may be harder to categorize in the talker task than in the phonetic task, thereby eliciting reduced P3b amplitude. Irrelevant changes in the unattended dimension likely received fewer cognitive resources, increasing the difficulty of stimulus categorization.
Similar to the P3b, the frontal negativity also showed an interaction effect. The FN might be related to the N2c, an N2 subcomponent that has a fronto-central distribution in the auditory modality and often co-occurs with the P3b (Folstein and van Petten, 2008; Pritchard et al., 1991; Ritter et al., 1979; Ritter et al., 1982). Likewise, the FN results may be explained by stimulus categorization difficulty, i.e., greater difficulty categorizing the talker deviant in the phonetic task than in the talker task, and greater difficulty categorizing the phonetic deviant in the talker task than in the phonetic task.
Other effects: P2 and P3a
Previous studies suggest that the P2 is sensitive to basic auditory processing and phonological processing (Crowley and Colrain, 2004; Landi et al., 2012; Tremblay et al., 2001; Woldorff and Hillyard, 1991). Our result can be explained by the acoustic distance between the deviants and the standard, but not by their phonological distance. Acoustically, the phonetic deviant differed less from the standard (absolute difference: F0 = 56 Hz; F1 = 47 Hz; F2 = 156 Hz) than the talker deviant and phonetic+talker deviant did (absolute difference: F0 = 101 Hz; F1 = 73 Hz; F2 = 440 Hz). Phonologically, the phonetic deviant and phonetic+talker deviant, which carried a different tone category, were more different from the standard than the talker deviant was. We found that the phonetic deviant elicited smaller P2 amplitude than the talker and phonetic+talker deviants, indicating that the smaller acoustic changes in the phonetic deviant may have required less auditory processing, reducing the P2 amplitude.
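To make the comparison concrete, the sketch below computes one possible summary of acoustic distance, the unweighted Euclidean distance over the reported mean absolute F0/F1/F2 differences. The choice of metric is our illustrative assumption; perceptually weighted or bark-scaled distances would be equally plausible.

```python
import numpy as np

# Reported mean absolute differences from the standard (Hz): F0, F1, F2.
phonetic_dev = np.array([56.0, 47.0, 156.0])
talker_dev = np.array([101.0, 73.0, 440.0])  # also the phonetic+talker deviant

# Unweighted Euclidean distance in (F0, F1, F2) space.
print(np.linalg.norm(phonetic_dev))  # ~172 Hz: closer to the standard
print(np.linalg.norm(talker_dev))    # ~457 Hz: farther from the standard
```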
The P3a results can also be explained by the acoustic distance between the deviants and the standard. According to previous studies, the P3a is associated with stimulus novelty and involuntary attentional shifts to changes in the environment (e.g., Courchesne et al., 1978; Grillon et al., 1990; Squires et al., 1975). It is likely harder to shift attention to the phonetic deviant than to the talker deviant and the phonetic+talker deviant, because the acoustic changes of the phonetic deviant are less salient. Therefore, the phonetic deviant elicited smaller P3a amplitude than the talker deviant and phonetic+talker deviant.
General discussion
Neural loci of the interaction of linguistic and talker processing
An important and unresolved question is how linguistic and talker information are encoded from a single speech signal after it reaches the auditory system. Previous neuroimaging studies offer three main lines of evidence for interactions: in the N1 time-window, in the bilateral STG/STS, and in the left MTG. In this study, we conducted separate fMRI and ERP experiments in a tone language. We discuss our findings in connection with these three lines of work below.
Our ERP results show interactions of linguistic and talker processing in a simultaneous posterior P3b and FN, which indicates that irrelevant changes in the unattended dimension may increase the difficulty of stimulus categorization. We did not find early interference effects in the N1 time-window, which differs from the findings of a previous study (Kaganovich et al., 2006). It is possible that the early interference effects were confounded by the habituation/refractoriness effect, as discussed earlier. Alternatively, it is possible that the discrepancy was due to neural differences between vowel processing (Kaganovich et al., 2006) and tone processing (this study). Yet another possibility is that the paradigm used in this study is not sensitive enough to detect the early interference effects. Presenting the interference and relevant conditions in one block may have reduced the interference effect in the present study, compared to the Garner paradigm where the interference and relevant conditions were presented in separate blocks, as in Kaganovich et al. (2006). Nevertheless, presenting interference and relevant conditions in one block is necessary to control for the confounding habituation/refractoriness effect. That said, more studies are needed to ascertain whether interactions of linguistic and talker processing may be detected in the N1 time-window.
Our fMRI results show that pitch changes that are phonemic and talker-related in Cantonese activate the bilateral STG, beyond Heschl’s gyrus. The activated regions are also adjacent to areas in the upper bank of the bilateral STS that respond selectively to human voices (Belin et al., 2000, 2004). Our result extends the previous finding that the bilateral STG/STS mediate the processing of the vocal tract length parameter, which both indexes talker differences and influences phoneme perception (von Kriegstein et al., 2010). It provides evidence for a general neural mechanism of integral phonetic and talker processing in the bilateral STG, irrespective of the specific parameter (vocal tract length or pitch). Talker-related parameters may in general be processed integrally with linguistic information in the bilateral STG, as long as the parameter influences the categorization of phonological categories in a language, which applies to pitch in tone languages and vocal tract length in tone and non-tone languages.
Chandrasekaran et al. (2011) found that repeated words produced by multiple talkers showed a weaker adaptation effect in the left MTG than words repeated by a single talker, a finding attributed to the integral neural representation of lexical and talker information in the left MTG. Unlike Chandrasekaran et al. (2011), we did not find activation in the left MTG for multi-talker productions vs. single-talker productions (i.e., talker change vs. no change). A possible explanation is that in the present study the phonetic and talker tasks could be accomplished by comparing acoustic features (such as pitch) in the stimuli, so access to lexical or semantic information was not mandatory. Another possibility is that our stimuli did not include enough talker variation for activation in the left MTG to be detected: in a talker change trial, the first three stimuli were repetitions from the same talker and only the fourth stimulus was from a different talker. More studies are needed to examine whether an interaction between linguistic and talker processing might be observed in regions supporting lexical/semantic processing using different tasks/stimuli.
Implications for the neural encoding of linguistic and talker information
Findings of this study have implications for understanding the neural encoding of linguistic and talker information. In early auditory processing, auditory cues in the acoustic signal indexing tone category and talker information probably undergo spectro-temporal analysis. For lexical tones, the two most important cues, pitch contour (e.g., level/rising/falling) and pitch height (e.g., high/mid/low), are known to be processed at the sub-cortical level (Krishnan et al., 2009, 2010) and the early cortical level (Chandrasekaran et al., 2007). As for the encoding of talker information, pitch and vocal tract length, two important cues indexing talker gender differences (Peterson and Barney, 1952; Smith and Patterson, 2005), may also be analyzed in early auditory processing. Given the lack of early interference effects in this study, there is no evidence that changes in one dimension increase the cognitive effort required to analyze auditory cues in the other dimension during processing mediated by the primary auditory cortex. After processing in the primary auditory cortex, neural representations of phonetic and talker information may be further processed in the bilateral STG. The current evidence suggests that linguistic and talker information may be encoded integrally in the bilateral STG, such that changes in one dimension increase the difficulty of stimulus categorization in the other. It is yet unclear whether the interaction of linguistic and talker processing persists into the lexical/semantic level. If it does, it would give rise to the neural representation of talker-specific exemplars of lexical words in the left MTG claimed by Chandrasekaran et al. (2011). More studies are needed to address this question.
Asymmetry in the inter-dependencies of linguistic and talker processing
Mullennix and Pisoni (1990) found that the inter-dependency of linguistic and talker processing is asymmetrical, such that linguistic processing is disrupted more by random talker changes than vice versa. This led the authors to suggest that linguistic and talker processing are parallel, but the encoding of linguistic information is also contingent on the output of talker processing (cf. Turvey, 1973). In a similar vein, Kaganovich et al. (2006) found that the interference of random talker changes in vowel classification reduced P3 amplitudes more than the interference of random vowel changes in talker classification, indicating that random talker changes are more detrimental to vowel classification than vice versa.
There are similar asymmetries in our data. Firstly, the interference of unattended talker changes reduced the P3b amplitude in phonetic change detection more than vice versa. Specifically, unattended talker changes elicited smaller P3b amplitude than attended phonetic changes in the phonetic task (a decrease of 0.84 µV; marginally significant), whereas the difference between unattended phonetic changes and attended talker changes in the talker task was not significant (a decrease of 0.58 µV). This may indicate that unattended talker changes interfere with phonetic categorization more than vice versa. Secondly, the bilateral STG were activated by talker changes vs. phonetic changes in the phonetic task, but not by phonetic changes vs. talker changes in the talker task. It seems that the bilateral STG are more sensitive to the interference effect of unattended talker changes on linguistic processing than vice versa.
Why are there such asymmetries? As far as lexical tone perception is concerned, the asymmetry might be partly due to the ambiguity that talker changes cause in speech perception. As mentioned earlier, pitch contour and pitch height determine the perception of tones (Gandour, 1983; Gandour and Harshman, 1978). However, it is hard to determine pitch height without information about a talker’s speaking F0. Previous studies found that the interpretation of a tone’s F0 is complicated by variability in talkers’ speaking F0 (Peng et al., 2012; Zhang et al., 2012, 2013). The influence of talker variability is especially detrimental in tone languages like Cantonese, which have multiple level tones, such that a word carrying one level tone can be confused with another word carrying a different level tone if the two are produced by talkers with different F0 ranges (Zhang et al., 2012, 2013). It is therefore critical to analyze a talker’s voice in order to accurately estimate pitch height and categorize the tone. In other words, the ambiguity caused by talker variability in tone perception may have led to the greater dependency of tone processing on talker processing. It remains to be determined to what extent talker variability creates ambiguity in vowel classification (/ε/-/æ/, Kaganovich et al., 2006) and consonant classification (/b/-/p/, Mullennix and Pisoni, 1990), where the asymmetries are also present.
The asymmetry may also be related to an attention bias toward talker variability. In a noisy environment like a cocktail party, listeners have to attend to one particular talker’s speech while filtering out the speech of other, unattended talkers (Mesgarani and Chang, 2012). Listeners much less often attend to a particular phoneme while filtering out other phonemes, and doing so would typically have no conversational utility. In other words, a change of talker may be attention-grabbing, whereas a change of phoneme is less so, because phoneme changes are expected in speech. It has been found that changing talkers requires additional attention to a particular talker’s vocal characteristics, increasing the effort needed to compute the acoustic-to-phonetic mapping from talker to talker in speech comprehension (Green et al., 1997; Magnuson and Nusbaum, 2007; Mullennix and Pisoni, 1990; Mullennix et al., 1989; Nusbaum and Magnuson, 1997; Nusbaum and Morin, 1992; Wong and Diehl, 2003; Wong et al., 2004). Moreover, neural responses have been shown to tune to the temporal and spectral structure of an attended talker’s speech while suppressing the speech of other, unattended talkers (see Zion Golumbic et al., 2012, 2013 for a discussion of the entrainment model, and Mesgarani and Chang, 2012 for similar findings). In sum, talker variability may require additional attentional resources, leading to a more detrimental effect on phonetic processing than vice versa.
Conclusion
To conclude, this study examined the integral processing of lexical tone and talker information in a tone language. Our findings extend previous work (von Kriegstein et al., 2010), providing neuroimaging evidence for a general neural mechanism of integral phonetic and talker processing in the bilateral STG, irrespective of the specific parameter (vocal tract length or pitch) or language (English or Cantonese). Moreover, interactions of phonetic and talker processing occur in a simultaneous posterior P3b and FN, which indicates that changes in an unattended dimension may increase the difficulty of stimulus categorization in the attended dimension.
Table 2.

| | Phonetic task (Phonetic change detection) | Talker task (Talker change detection) | Correct response |
| --- | --- | --- | --- |
| Control condition | No change trial | No change trial | No change |
| Relevant condition | Phonetic change trial | Talker change trial | Change |
| Interference condition | Talker change trial | Phonetic change trial | No change |
| Coupled condition | Phonetic+talker change trial | Phonetic+talker change trial | Change |
Highlights.
We examine neural bases of integral phonetic and talker processing in Cantonese.
We conduct fMRI and ERP experiments on the same group of participants.
Lexical tone and talker processing are functionally integrated in bilateral STG.
Lexical tone and talker processing interact in a P3b and a frontal negativity.
Lexical tone and talker information are encoded integrally until categorization.
Acknowledgments
This study was supported in part by grants from the National Basic Research Program of the Ministry of Science and Technology of China (973 Grant: 2012CB720700), the National Natural Science Foundation of China (NSFC: 61135003), and the Research Grants Council of Hong Kong (GRF: 448413). We thank the anonymous reviewers for their constructive suggestions. We also thank Ms. Qian WAN for help with MRI image acquisition, Mr. Ivan Zou for help with the collection of fMRI data, Ms. Guo LI for help with the collection of ERP data, and all members of the CUHK-PKU-UST Joint Research Centre for Language and Human Complexity for useful discussions.
References
- Aron AR, Robbins TW, Poldrack RA. Inhibition and the right inferior frontal cortex. Trends Cogn. Sci. 2014;18(4):177–185. doi: 10.1016/j.tics.2013.12.003.
- Belin P, Fecteau S, Bédard C. Thinking the voice: Neural correlates of voice perception. Trends Cogn. Sci. 2004;8(3):129–135. doi: 10.1016/j.tics.2004.01.008.
- Belin P, Zatorre RJ, Lafaille P, Ahad P, Pike B. Voice-selective areas in human auditory cortex. Nature. 2000;403(6767):309–312. doi: 10.1038/35002078.
- Blumstein S, Cooper WE. Hemispheric processing of intonation contours. Cortex. 1974;10(2):146–158. doi: 10.1016/s0010-9452(74)80005-5.
- Boersma P, Weenink D. Praat: Doing phonetics by computer (Version 5.3.23) [Computer program]. 2012. http://www.praat.org (last viewed August 7, 2012).
- Budd TW, Barry RJ, Gordon E, Rennie C, Michie PT. Decrement of the N1 auditory event-related potential with stimulus repetition: Habituation vs. refractoriness. Int. J. Psychophysiol. 1998;31(1):51–68. doi: 10.1016/s0167-8760(98)00040-3.
- Celsis P, Boulanouar K, Doyon B, Ranjeva JP, Berry I, Nespoulous JL, Chollet F. Differential fMRI responses in the left posterior superior temporal gyrus and left supramarginal gyrus to habituation and change detection in syllables and tones. Neuroimage. 1999;9(1):135–144. doi: 10.1006/nimg.1998.0389.
- Chandrasekaran B, Chan AHD, Wong PCM. Neural processing of what and who information in speech. J. Cogn. Neurosci. 2011;23(10):2690–2700. doi: 10.1162/jocn.2011.21631.
- Chandrasekaran B, Gandour JT, Krishnan A. Neuroplasticity in the processing of pitch dimensions: A multidimensional scaling analysis of the mismatch negativity. Restor. Neurol. Neurosci. 2007;25:195–210.
- Chikazoe J, Konishi S, Asari T, Jimura K, Miyashita Y. Activation of right inferior frontal gyrus during response inhibition across response modalities. J. Cogn. Neurosci. 2007;19(1):69–80. doi: 10.1162/jocn.2007.19.1.69.
- Courchesne E, Courchesne RY, Hillyard SA. The effect of stimulus deviation on P3 waves to easily recognized stimuli. Neuropsychologia. 1978;16(2):189–199. doi: 10.1016/0028-3932(78)90106-9.
- Cox RW. AFNI: Software for analysis and visualization of functional magnetic resonance neuroimages. Comput. Biomed. Res. 1996;29(3):162–173. doi: 10.1006/cbmr.1996.0014.
- Craik F, Kirsner K. The effect of speaker’s voice on word recognition. Quarterly J. Exp. Psychol. 1974;26:274–284.
- Crowley KE, Colrain IM. A review of the evidence for P2 being an independent component process: Age, sleep and modality. Clin. Neurophysiol. 2004;115(4):732–744. doi: 10.1016/j.clinph.2003.11.021.
- Dien J. The ERP PCA Toolkit: An open source program for advanced statistical analysis of event-related potential data. J. Neurosci. Methods. 2010;187(1):138–145. doi: 10.1016/j.jneumeth.2009.12.009.
- Donchin E. Presidential address, 1980. Surprise!…Surprise? Psychophysiology. 1981;18(5):494–513. doi: 10.1111/j.1469-8986.1981.tb01815.x.
- Folstein JR, van Petten C. Influence of cognitive control and mismatch on the N2 component of the ERP: A review. Psychophysiology. 2008;45(1):152–170. doi: 10.1111/j.1469-8986.2007.00602.x.
- Frost JA, Binder JR, Springer JA, Hammeke TA, Bellgowan PSF, Rao SM, Cox RW. Language processing is strongly left lateralized in both sexes: Evidence from functional MRI. Brain. 1999;122(2):199–208. doi: 10.1093/brain/122.2.199.
- Gandour J, Dzemidzic M, Wong D, Lowe M, Tong Y, Hsieh L, Satthamnuwong N, Lurito J. Temporal integration of speech prosody is shaped by language experience: An fMRI study. Brain Lang. 2003;84(3):318–336. doi: 10.1016/s0093-934x(02)00505-9.
- Gandour JT. Tone perception in Far Eastern languages. J. Phon. 1983;11:149–175.
- Gandour JT, Harshman RA. Cross-language differences in tone perception: A multidimensional scaling investigation. Lang. Speech. 1978;21(1):1–33. doi: 10.1177/002383097802100101.
- Garner WR. The Processing of Information and Structure. Potomac, MD: Lawrence Erlbaum Associates; 1974.
- Garner WR, Felfoldy GL. Integrality of stimulus dimensions in various types of information processing. Cogn. Psychol. 1970;1(3):225–241.
- Goldinger SD. On the nature of talker variability effects on serial recall of spoken word lists. J. Exp. Psychol. Learn. Mem. Cogn. 1991;17:152–162. doi: 10.1037//0278-7393.17.1.152.
- Goldinger SD. Words and voices: Episodic traces in spoken word identification and recognition memory. J. Exp. Psychol. Learn. Mem. Cogn. 1996;22:1166–1183. doi: 10.1037//0278-7393.22.5.1166.
- Goldinger SD. Echoes of echoes? An episodic theory of lexical access. Psychol. Rev. 1998;105:251–279. doi: 10.1037/0033-295x.105.2.251.
- Green KP, Tomiak GR, Kuhl PK. The encoding of rate and talker information during phonetic perception. Percept. Psychophys. 1997;59(5):675–692. doi: 10.3758/bf03206015.
- Grillon C, Courchesne E, Ameli R, Elmasian R, Braff D. Effects of rare non-target stimuli on brain electrophysiological activity and performance. Int. J. Psychophysiol. 1990;9(3):257–267. doi: 10.1016/0167-8760(90)90058-l.
- Gu F, Zhang C, Hu A, Zhao G. Left hemisphere lateralization for lexical and acoustic pitch processing in Cantonese speakers as revealed by mismatch negativity. Neuroimage. 2013;83:637–645. doi: 10.1016/j.neuroimage.2013.02.080.
- Hampshire A, Chamberlain SR, Monti MM, Duncan J, Owen AM. The role of the right inferior frontal gyrus: Inhibition and attentional control. Neuroimage. 2010;50(3):1313–1319. doi: 10.1016/j.neuroimage.2009.12.109.
- Hickok G, Poeppel D. The cortical organization of speech processing. Nat. Rev. Neurosci. 2007;8:393–402. doi: 10.1038/nrn2113.
- Hickok G, Poeppel D. Towards a functional neuroanatomy of speech perception. Trends Cogn. Sci. 2000;4(4):131–138. doi: 10.1016/s1364-6613(00)01463-7.
- Hickok G, Poeppel D. Dorsal and ventral streams: A framework for understanding aspects of the functional anatomy of language. Cognition. 2004;92(1–2):67–99. doi: 10.1016/j.cognition.2003.10.011.
- Hintzman DL, Block R, Inskeep N. Memory for mode of input. J. Verbal Learning Verbal Behav. 1972;11:741–749.
- Hockett C. The origin of speech. Sci. Am. 1960;203:89–97.
- Isreal JB, Chesney GL, Wickens CD, Donchin E. P300 and tracking difficulty: Evidence for multiple resources in dual-task performance. Psychophysiology. 1980;17(3):259–273. doi: 10.1111/j.1469-8986.1980.tb00146.x.
- Ito M. Mechanisms of motor learning in the cerebellum. Brain Res. 2000;886(1–2):237–245. doi: 10.1016/s0006-8993(00)03142-5.
- Joanisse MF, Zevin JD, McCandliss BD. Brain mechanisms implicated in the preattentive categorization of speech sounds revealed using fMRI and a short-interval habituation trial paradigm. Cereb. Cortex. 2007;17(9):2084–2093. doi: 10.1093/cercor/bhl124.
- Johnson RJ. A triarchic model of P300 amplitude. Psychophysiology. 1986;23(4):367–384. doi: 10.1111/j.1469-8986.1986.tb00649.x.
- Johnsrude IS, Zatorre RJ, Milner BA, Evans AC. Left-hemisphere specialization for the processing of acoustic transients. Neuroreport. 1997;8(7):1761–1765. doi: 10.1097/00001756-199705060-00038.
- Kaganovich N, Francis AL, Melara RD. Electrophysiological evidence for early interaction between talker and linguistic information during speech perception. Brain Res. 2006;1114(1):161–172. doi: 10.1016/j.brainres.2006.07.049.
- Kok A. On the utility of P3 amplitude as a measure of processing capacity. Psychophysiology. 2001;38:557–577. doi: 10.1017/s0048577201990559.
- Kreitewolf J, Gaudrain E, von Kriegstein K. A neural mechanism for recognizing speech spoken by different speakers. Neuroimage. 2014;91:375–385. doi: 10.1016/j.neuroimage.2014.01.005.
- Krishnan A, Bidelman GM, Gandour JT. Neural representation of pitch salience in the human brainstem revealed by psychological and electrophysiological indices. Hear. Res. 2010;268(1–2):60–66. doi: 10.1016/j.heares.2010.04.016.
- Krishnan A, Swaminathan J, Gandour JT. Experience dependent enhancement of linguistic pitch representation in the brainstem is not specific to a speech context. J. Cogn. Neurosci. 2009;21(6):1092–1105. doi: 10.1162/jocn.2009.21077.
- Landi N, Crowley MJ, Wu J, Bailey CA, Mayes LC. Deviant ERP response to spoken non-words among adolescents exposed to cocaine in utero. Brain Lang. 2012;120(3):209–216. doi: 10.1016/j.bandl.2011.09.002.
- Lattner S, Meyer ME, Friederici AD. Voice perception: Sex, pitch, and the right hemisphere. Hum. Brain Mapp. 2005;24(1):11–20. doi: 10.1002/hbm.20065.
- Lenartowicz A, Verbruggen F, Logan GD, Poldrack RA. Inhibition-related activation in the right inferior frontal gyrus in the absence of inhibitory cues. J. Cogn. Neurosci. 2011;23(11):3388–3399. doi: 10.1162/jocn_a_00031.
- Liebenthal E, Binder JR, Spitzer SM, Possing ET, Medler DA. Neural substrates of phonemic perception. Cereb. Cortex. 2005;15(10):1621–1631. doi: 10.1093/cercor/bhi040.
- Magnuson JS, Nusbaum HC. Acoustic differences, listener expectations, and the perceptual accommodation of talker variability. J. Exp. Psychol. Hum. Percept. Perform. 2007;33:391–409. doi: 10.1037/0096-1523.33.2.391.
- Mäkelä AM, Alku P, Tiitinen H. The auditory N1m reveals the left-hemispheric representation of vowel identity in humans. Neurosci. Lett. 2003;353(2):111–114. doi: 10.1016/j.neulet.2003.09.021.
- McCarthy G, Donchin E. A metric for thought: A comparison of P300 latency and reaction time. Science. 1981;211(4477):77–80. doi: 10.1126/science.7444452.
- Mesgarani N, Chang EF. Selective cortical representation of attended speaker in multi-talker speech perception. Nature. 2012;485(7397):233–236. doi: 10.1038/nature11020.
- Mullennix JW, Pisoni DB. Stimulus variability and processing dependencies in speech perception. Percept. Psychophys. 1990;47:379–390. doi: 10.3758/bf03210878.
- Mullennix JW, Pisoni DB, Martin CS. Some effects of talker variability on spoken word recognition. J. Acoust. Soc. Am. 1989;85(1):365–378. doi: 10.1121/1.397688.
- Nicolson RI, Fawcett AJ, Dean P. Developmental dyslexia: The cerebellar deficit hypothesis. Trends Neurosci. 2001;24(9):508–511. doi: 10.1016/s0166-2236(00)01896-8.
- Nusbaum HC, Magnuson JS. Talker normalization: Phonetic constancy as a cognitive process. In: Johnson K, Mullennix JW, editors. Talker Variability in Speech Processing. San Diego: Academic Press; 1997. pp. 109–132.
- Nusbaum HC, Morin TM. Paying attention to differences among talkers. In: Tohkura Y, Vatikiotis-Bateson E, Sagisaka Y, editors. Speech Perception, Speech Production, and Linguistic Structure. Amsterdam: IOS Press; 1992. pp. 113–134.
- Nygaard LC, Pisoni DB. Talker-specific learning in speech perception. Percept. Psychophys. 1998;60:355–376. doi: 10.3758/bf03206860.
- Palmeri TJ, Goldinger SD, Pisoni DB. Episodic encoding of voice attributes and recognition memory for spoken words. J. Exp. Psychol. Learn. Mem. Cogn. 1993;19:309–328. doi: 10.1037//0278-7393.19.2.309.
- Peng G, Zhang C, Zheng H-Y, Minett JW, Wang WS-Y. The effect of inter-talker variations on acoustic-perceptual mapping in Cantonese and Mandarin tone systems. J. Speech Lang. Hear. Res. 2012;55(2):579–595. doi: 10.1044/1092-4388(2011/11-0025).
- Perrachione TK, Del Tufo SN, Gabrieli JDE. Human voice recognition depends on language ability. Science. 2011;333(6042):595. doi: 10.1126/science.1207327.
- Perrachione TK, Pierrehumbert JB, Wong PCM. Differential neural contributions to native- and foreign-language talker identification. J. Exp. Psychol. Hum. Percept. Perform. 2009;35(6):1950–1960. doi: 10.1037/a0015869.
- Perrachione TK, Wong PCM. Learning to recognize speakers of a non-native language: Implications for the functional organization of human auditory cortex. Neuropsychologia. 2007;45(8):1899–1910. doi: 10.1016/j.neuropsychologia.2006.11.015.
- Peterson GE, Barney HL. Control methods used in a study of the vowels. J. Acoust. Soc. Am. 1952;24(2):175–184.
- Polich J. Updating P300: An integrative theory of P3a and P300. Clin. Neurophysiol. 2007;118(10):2128–2148. doi: 10.1016/j.clinph.2007.04.019.
- Polich J, Criado JR. Neuropsychology and neuropharmacology of P3a and P300. Int. J. Psychophysiol. 2006;60(2):172–185. doi: 10.1016/j.ijpsycho.2005.12.012.
- Pritchard WS, Shappell SA, Brandt ME. Psychophysiology of N200/N400: A review and classification scheme. In: Ackles PK, Jennings JR, editors. Advances in Psychophysiology: A Research Annual. Vol. 4. London: Jessica Kingsley; 1991. pp. 43–106.
- Rankin KP, Salazar A, Gorno-Tempini ML, Sollberger M, Wilson SM, Pavlic D, Stanley CM, Glenn S, Weiner MW, Miller BL. Detecting sarcasm from paralinguistic cues: Anatomic and cognitive correlates in neurodegenerative disease. Neuroimage. 2009;47(4):2005–2015. doi: 10.1016/j.neuroimage.2009.05.077.
- Remez RE, Fellowes JM, Rubin PE. Talker identification based on phonetic information. J. Exp. Psychol. Hum. Percept. Perform. 1997;23(3):651–666. doi: 10.1037//0096-1523.23.3.651.
- Ritter W, Simson R, Vaughan HG, Friedman D. A brain event related to the making of a sensory discrimination. Science. 1979;203(4387):1358–1361. doi: 10.1126/science.424760.
- Ritter W, Simson R, Vaughan HG, Macht M. Manipulation of event-related potential manifestations of information processing stages. Science. 1982;218(4575):909–911. doi: 10.1126/science.7134983.
- Salvata C, Blumstein SE, Myers EB. Speaker invariance for phonetic information: An fMRI investigation. Lang. Cogn. Process. 2012;27(2):210–230. doi: 10.1080/01690965.2011.594372.
- Shestakova A, Brattico E, Huotilainen M, Galunov V, Soloviev A, Sams M, Ilmoniemi RJ, Näätänen R. Abstract phoneme representations in the left temporal cortex: Magnetic mismatch negativity study. Neuroreport. 2002;13(14):1813–1816. doi: 10.1097/00001756-200210070-00025.
- Smith DRR, Patterson RD. The interaction of glottal-pulse rate and vocal-tract length in judgements of speaker size, sex, and age. J. Acoust. Soc. Am. 2005;118(5):3177–3186. doi: 10.1121/1.2047107.
- Squires NK, Squires KC, Hillyard SA. Two varieties of long-latency positive waves evoked by unpredictable auditory stimuli in man. Electroencephalogr. Clin. Neurophysiol. 1975;38(4):387–401. doi: 10.1016/0013-4694(75)90263-1.
- Theunissen FE, Elie JE. Neural processing of natural sounds. Nat. Rev. Neurosci. 2014;15(6):355–366. doi: 10.1038/nrn3731.
- Tong Y, Gandour JT, Talavage T, Wong D, Dzemidzic M, Xu Y, Li X, Lowe M. Neural circuitry underlying sentence-level linguistic prosody. Neuroimage. 2005;28(2):417–428. doi: 10.1016/j.neuroimage.2005.06.002.
- Tremblay K, Kraus N, McGee T, Ponton C, Otis B. Central auditory plasticity: Changes in the N1-P2 complex after speech-sound training. Ear Hear. 2001;22(2):79–90. doi: 10.1097/00003446-200104000-00001.
- Turvey MT. On peripheral and central processes in vision: Inferences from an information-processing analysis of masking with patterned stimuli. Psychol. Rev. 1973;80(1):1–52. doi: 10.1037/h0033872.
- von Kriegstein K, Eger E, Kleinschmidt A, Giraud AL. Modulation of neural responses to speech by directing attention to voices or verbal content. Cogn. Brain Res. 2003;17:48–55. doi: 10.1016/s0926-6410(03)00079-x.
- von Kriegstein K, Giraud A-L. Distinct functional substrates along the right superior temporal sulcus for the processing of voices. Neuroimage. 2004;22:948–955. doi: 10.1016/j.neuroimage.2004.02.020.
- von Kriegstein K, Smith DRR, Patterson RD, Kiebel SJ, Griffiths TD. How the human brain recognizes speech in the context of changing speakers. J. Neurosci. 2010;30(2):629–638. doi: 10.1523/JNEUROSCI.2742-09.2010.
- Wagner AD, Schacter DL, Rotte M, Koutstaal W, Maril A, Dale AM, Rosen BR, Buckner RL. Building memories: Remembering and forgetting of verbal experiences as predicted by brain activity. Science. 1998;281(5380):1188–1191. doi: 10.1126/science.281.5380.1188.
- Woldorff MG, Hillyard SA. Modulation of early auditory processing during selective listening to rapidly presented tones. Electroencephalogr. Clin. Neurophysiol. 1991;79(3):170–191. doi: 10.1016/0013-4694(91)90136-r.
- Wong PCM, Diehl RL. Perceptual normalization for inter- and intratalker variation in Cantonese level tones. J. Speech Lang. Hear. Res. 2003;46:413–421. doi: 10.1044/1092-4388(2003/034).
- Wong PCM, Nusbaum HC, Small SL. Neural bases of talker normalization. J. Cogn. Neurosci. 2004;16:1173–1184. doi: 10.1162/0898929041920522.
- Zevin JD, Yang J, Skipper JI, McCandliss BD. Domain general change detection accounts for “dishabituation” effects in temporal–parietal regions in functional magnetic resonance imaging studies of speech perception. J. Neurosci. 2010;30(3):1110–1117. doi: 10.1523/JNEUROSCI.4599-09.2010.
- Zhang C, Peng G, Wang WS-Y. Unequal effects of speech and nonspeech contexts on the perceptual normalization of Cantonese level tones. J. Acoust. Soc. Am. 2012;132(2):1088–1099. doi: 10.1121/1.4731470.
- Zhang C, Peng G, Wang WS-Y. Achieving constancy in spoken word identification: Time-course of talker normalization. Brain Lang. 2013;126:193–202. doi: 10.1016/j.bandl.2013.05.010.
- Zion Golumbic EM, Ding N, Bickel S, Lakatos P, Schevon CA, McKhann GM, Goodman RR, Emerson R, Mehta AD, Simon JZ, Poeppel D, Schroeder CE. Mechanisms underlying selective neuronal tracking of attended speech at a “cocktail party”. Neuron. 2013;77(5):980–991. doi: 10.1016/j.neuron.2012.12.037.
- Zion Golumbic EM, Poeppel D, Schroeder CE. Temporal context in speech processing and attentional stream selection: A behavioral and neural perspective. Brain Lang. 2012;122(3):151–161. doi: 10.1016/j.bandl.2011.12.010.