Revisiting the left ear advantage for phonetic cues to talker identification

Lee Drown; Betsy Philip; Alexander L Francis; Rachel M Theodore

doi:10.1121/10.0015093

2022 Nov 30;152(5):3107–3123. doi: 10.1121/10.0015093


PMCID: PMC9715276 PMID: 36456295

Abstract

Previous research suggests that learning to use a phonetic property [e.g., voice-onset-time, (VOT)] for talker identity supports a left ear processing advantage. Specifically, listeners trained to identify two “talkers” who only differed in characteristic VOTs showed faster talker identification for stimuli presented to the left ear compared to that presented to the right ear, which is interpreted as evidence of hemispheric lateralization consistent with task demands. Experiment 1 (n = 97) aimed to replicate this finding and identify predictors of performance; experiment 2 (n = 79) aimed to replicate this finding under conditions that better facilitate observation of laterality effects. Listeners completed a talker identification task during pretest, training, and posttest phases. Inhibition, category identification, and auditory acuity were also assessed in experiment 1. Listeners learned to use VOT for talker identity, which was positively associated with auditory acuity. Talker identification was not influenced by ear of presentation, and Bayes factors indicated strong support for the null. These results suggest that talker-specific phonetic variation is not sufficient to induce a left ear advantage for talker identification; together with the extant literature, this instead suggests that hemispheric lateralization for talker-specific phonetic variation requires phonetic variation to be conditioned on talker differences in source characteristics.

I. INTRODUCTION

The acoustic speech signal simultaneously conveys information regarding who is speaking and what is being said. Traditionally, these two functions were considered to be supported by different aspects of the acoustic signal with indexical cues (e.g., fundamental frequency) used to support voice recognition and phonetic cues [e.g., voice-onset-time (VOT) and formant patterns] used to support linguistic processing. We now know that a strict functional delineation between phonetic and indexical cues is not possible. For example, talkers show stable individual differences in how they implement phonetic cues (e.g., Allen et al., 2003; Chodroff and Wilson, 2017; Hillenbrand et al., 1995; Theodore et al., 2009), and listeners are sensitive to these differences (e.g., Allen and Miller, 2004; Ganugapati and Theodore, 2019; Myers and Theodore, 2017; Theodore et al., 2015; Theodore and Miller, 2010). Experience with a talker's voice supports advantages for linguistic processing (e.g., Nygaard et al., 1994; Nygaard and Pisoni, 1998), and experience with a given language supports advantages for voice processing (e.g., Goggin et al., 1991; Orena et al., 2015; Perrachione et al., 2011), providing further evidence that the processing of phonetic and indexical aspects of the speech stream are intertwined.

Although behavioral evidence points to a tight integration between phonetic and indexical cues, the extant neuroimaging literature suggests dissociable hemispheric dominance for these two aspects of speech processing, where left temporal regions are dominant for processing phonetic identity and right temporal regions are dominant for processing voice identity (e.g., Formisano et al., 2008; Bonte et al., 2014; Chang et al., 2010; Liebenthal et al., 2003; Myers, 2007; Belin and Zatorre, 2003; van Lancker et al., 1989). This may reflect different time scales for phonetic and indexical cues (e.g., Poeppel, 2003) and/or different functional tasks (e.g., von Kriegstein et al., 2003). Training studies have shown that perceptual learning of talker-specific phonetic detail can alter hemispheric processing of phonetic cues. For example, Myers and Theodore (2017) exposed listeners to two talkers who differed in their characteristic VOTs. Following exposure, neural activation was measured using functional magnetic resonance imaging (fMRI) as listeners completed a phonetic identification task for VOT variants that were typical or atypical of each talker. Their results showed that right temporoparietal regions, including the right middle temporal gyrus (rMTG), implicated in voice processing were sensitive to talker typicality. Moreover, a functional connectivity analysis showed greater connectivity between the rMTG and three regions in the left-hemisphere phonetic network (left postcentral gyrus, left middle temporal gyrus, and left superior temporal sulcus) for talker typical compared to talker atypical VOT variants.

In addition, Francis and Driscoll (2006) reported evidence indicating that short-term perceptual training could induce a left ear advantage for using VOT as a cue to talker identification that results in faster talker identification decisions for stimuli presented to the left compared to the right ear, which is consistent with right hemisphere dominance for talker processing. The transmission of sound from the peripheral to central nervous system consists of contralateral auditory pathways; that is, auditory fibers that carry sound from the ear to the brain decussate such that monaural stimulation results in relatively strong activation in the contralateral hemisphere with relatively weaker activation of the ipsilateral hemisphere (e.g., Pickles, 2008; Jancke et al., 2002). This is not to say that sound detected by the left ear is only processed by the right hemisphere but that the contralateral nature of the auditory pathway has been successfully exploited to measure hemispheric dominance using behavioral as opposed to neuroimaging methods (e.g., Kimura, 1967), as we describe further below. During the training phase of the study by Francis and Driscoll (2006), listeners heard two sets of tokens, one with characteristically short VOTs (30 ms) and one with characteristically longer VOTs (50 ms). On each trial, a token was presented binaurally and listeners were asked to identify which of two talkers produced that token. Feedback was provided to train listeners to associate the short VOTs as one talker and the longer VOTs as the other talker. Both sets of tokens were based on the speech of a single talker and, thus, indexical cues (e.g., fundamental frequency and vocal quality) were held constant between the two “talkers.” Consequently, the two talkers only differed in their characteristic VOTs. 
During the pre- and posttest phases, listeners completed the same talker identification task with two key exceptions: (1) no feedback was provided and (2) stimuli were presented monaurally (i.e., either to the left or right ear on a given trial). Francis and Driscoll (2006) hypothesized that a left ear (i.e., right hemisphere) processing advantage would emerge at posttest for listeners who learned to process the phonetic property (i.e., VOT) as a cue to talker identification. Consistent with this hypothesis, reaction times (RTs) to correct responses were, on average, 92 ms faster for stimuli presented to the left ear compared to that for the right ear at posttest. This left ear advantage was not present at pretest, suggesting that it emerged as a consequence of learning during the training phase.

Although broadly consistent with the extant neuroimaging literature, the finding from Francis and Driscoll (2006) is striking in the context of the dichotic listening literature. Specifically, they observed a laterality effect (i.e., a left ear advantage) for stimuli that were presented monaurally. Hemispheric dominance for processing different types of acoustic signals has been measured through behavioral dichotic listening paradigms. In a traditional dichotic listening task, relative contributions of the left and right hemispheres are segregated by presenting a target stimulus to either the left or right ear in conjunction with a competing stimulus to the ear that does not receive the target stimulus (Kimura, 1967). Therefore, during a dichotic listening task, there is simultaneous presentation of different stimuli to each ear with one ear receiving the target stimulus and the other ear receiving a competing stimulus. As reviewed in Hugdahl (2011), over 50 yrs of research using the dichotic listening paradigm has established its utility as a behavioral method to measure brain laterality effects, as well as a clear understanding of the importance of presenting dichotic stimuli (that is, a target and a competitor to opposite ears) to elicit laterality effects. Indeed, the latter point was established from the introduction of this paradigm (Kimura, 1967). Because Francis and Driscoll (2006) did not present a competing stimulus in the ear contralateral to the ear receiving the target stimulus, their task was not dichotic in nature and, thus, it is perhaps surprising that the left ear processing advantage for talker identification was observed using a monaural listening task. Their finding may suggest that a competing stimulus in the contralateral ear of interest is extraneous for a task of this nature. Consistent with this interpretation, González et al. (2010) demonstrated that a left ear processing advantage for repetition-priming effects in a talker identification task was strengthened when noise was presented in the contralateral ear, but the competing stimulus was not necessary to induce the left ear advantage.

In addition, the finding from Francis and Driscoll (2006) bears revisiting due to several methodological and empirical points. First, the left ear advantage was observed in a very small sample of participants (n = 8). Small sample sizes alone are not a determinant of either research quality or reproducibility. Indeed, the “small-N” design, in which an extremely large number of observations are made on only a few participants, has a rich precedent in the psychophysics domain (Smith and Little, 2018). In some cases, small-N designs may even promote better power and inferential validity compared to large-N designs (Smith and Little, 2018). However, for traditional designs, such as that used in Francis and Driscoll (2006), small sample sizes can increase the likelihood of false positives in the literature just as they can decrease the ability to detect true effects (e.g., Button et al., 2013). Second, these participants reflected those who met a learning criterion, defined as a minimum improvement in talker identification accuracy of 5% from pre- to posttest. While it is sensible to limit analyses to those who learned to associate VOT as a cue to talker identification given the nature of the hypothesis, no justification for the specific learning criterion was provided. Third, the statistical evidence for the key interaction between test phase and ear of presentation, while statistically significant, was only marginally so (p = 0.04). Fourth, the small sample (n = 8) who met the learning criterion reflected fewer than half of the total participants tested (n = 18). That is, most participants were not able to learn to associate VOT as a cue to talker identification, and this study did not reveal which factors may influence whether a given listener can learn to use phonetic properties to support talker identification.

For these reasons, the goal of the current work is twofold. First, in each of two experiments, we conducted a high-powered replication of the work of Francis and Driscoll (2006) to examine whether the left ear processing advantage for phonetic cues to talker identification would generalize to a larger sample. Second, all participants in experiment 1 completed four individual differences measures, in addition to the primary talker identification task, to identify potential predictors of talker identification performance. The talker identification task was modeled after the paradigm used in Francis and Driscoll (2006). In experiment 1, test stimuli were presented monaurally to either the left or right ear. In experiment 2, test stimuli were presented dichotically with noise presented to the contralateral ear of the target stimulus. The four individual differences measures consisted of a flanker task, a pitch perception task, a category identification task, and a within-category discrimination task. We assessed inhibitory control (using a flanker task) and pitch perception given previous evidence linking both of these constructs to talker identification ability (Theodore and Flanagan, 2020; Xie and Myers, 2015). For example, increased inhibitory control has been positively associated with talker identification accuracy (Theodore and Flanagan, 2020) and invoked as an explanatory mechanism for heightened talker identification abilities in bilingual compared to monolingual children (Levi, 2018). On this view, heightened talker identification may reflect a stronger ability to inhibit irrelevant information (e.g., phonetic or other linguistic content) to instead focus on other aspects of the signal (e.g., fundamental frequency) for the purposes of talker identification. Pitch perception has also been positively associated with talker identification and talker discrimination (Theodore and Flanagan, 2020; Xie and Myers, 2015). 
Using a flanker and pitch perception task to assess inhibitory control and auditory acuity, respectively, supports the examination of individual differences in nonspeech abilities as potential predictors of performance in the current speech perception task. In contrast, the category identification task assessed listeners' VOT voicing boundaries and identification slopes (the latter as a measure of how categorically listeners perceived the voicing contrast). The within-category discrimination task assessed listeners' perceptual acuity for VOT specifically, which is a logical precursor to learning to use VOT as a cue to talker identification.

If the left ear advantage for phonetic cues to talker identification generalizes beyond the original sample (Francis and Driscoll, 2006), then we predict that listeners who learn to associate VOT as a cue to talker identification will show faster RTs for stimuli presented to the left ear compared to stimuli presented to the right ear during the talker identification posttest. If increased inhibitory control and auditory acuity are associated with enhanced talker identification, then we predict a positive relationship between performance on the talker identification task and performance on the flanker, pitch perception, and within-category discrimination tasks. The relationship between talker identification and categorical perception was exploratory in the current work. Listeners who show late VOT voicing boundaries may not have perceived the VOT variants in the talker identification task as belonging to the same category, which would result in improved talker identification accuracy given that the two talkers would be perceived as saying different words (instead of saying the same word with different VOTs). Listeners who have shallower identification slopes may be more sensitive to within-category variation compared to listeners who have more categorical slopes; if so, then those with shallower identification slopes would show improved performance on the talker identification task.

II. EXPERIMENT 1

A. Methods

1. Participants

For session one, 140 participants were recruited from the Prolific participant pool (Palan and Schitter, 2018).¹ All of the participants were monolingual English speakers between 18 and 35 years of age currently residing in the United States (U.S.) with no history of language-related disorders. Forty-three participants were excluded due to failure to pass all three headphone screens (n = 28) or failure to meet the training accuracy criterion (n = 15), described in detail below. The final sample (n = 97) included 42 women and 55 men [mean age = 27 years of age, standard deviation (SD) = 4 yrs]. All of these participants were invited to participate in session two with 59 participants choosing to do so. The mean time between the two sessions was 11 days (SD = 12 days; range = 1–35 days).

2. Power analysis

The sample size was determined based on a priori power analyses using the simr package (Green and MacLeod, 2016) in R. First, trial-level data from Francis and Driscoll (2006) for the effect we aimed to replicate (i.e., the data underlying the interaction shown in their Fig. 1) were fit to a linear mixed effects model using the lmer( ) function from the lme4 package (Bates et al., 2015) in R. The dependent variable was log-transformed RT. The fixed effects were test (pretest = −0.5, posttest = 0.5), ear (left ear = –0.5, right ear = 0.5), and their interaction. The random effects structure consisted of random intercepts by subject and random slopes by subject for test and ear. Second, we created a data frame to reflect the structure of our design, given that power for the mixed effects model is linked to number of observations (in addition to sample size). As described below, each participant completed 80 trials in each test phase, and we only analyzed RTs for correct responses (as in Francis and Driscoll, 2006). We conservatively simulated accuracy at 60% correct, resulting in a simulated data set that consisted of 48 trials/participant at each test session. Third, the parameters of the original model were simulated in our data structure 500 times using the powerCurve( ) function in simr, which showed that 55 participants were required to achieve 80% power to detect the test by ear interaction observed in Francis and Driscoll (2006). Thus, we aimed to test 55 participants who met the learning criterion from Francis and Driscoll (2006) to perform an adequately powered replication. The recruited sample (n = 140) was based on estimated attrition rates (e.g., failure to pass headphone screens, failure to pass training criterion, failure to meet learning criterion) and resulted in 58 participants who met the learning criterion from Francis and Driscoll (2006), as we describe further below.
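The structure of the simulated data set can be sketched as follows. This is a minimal Python illustration, not the authors' simr script: it reproduces only the contrast coding and the trial arithmetic stated above (80 trials per phase at 60% accuracy gives 48 retained trials), and it assumes that retained trials are split evenly between the two ears. The names `build_design` and `correct_trials_per_phase` are hypothetical.

```python
import itertools

# Contrast codes as stated in the text.
TEST_CODES = {"pretest": -0.5, "posttest": 0.5}
EAR_CODES = {"left": -0.5, "right": 0.5}

TRIALS_PER_PHASE = 80     # trials completed in each test phase
ASSUMED_ACCURACY = 0.60   # conservative accuracy used for the simulation


def correct_trials_per_phase(n_trials=TRIALS_PER_PHASE, accuracy=ASSUMED_ACCURACY):
    """Number of correct-response trials retained for the RT analysis."""
    return round(n_trials * accuracy)


def build_design(n_participants):
    """Return one row per retained trial.

    Assumption: correct trials are divided evenly across the two ears.
    """
    per_cell = correct_trials_per_phase() // len(EAR_CODES)
    rows = []
    for subject, test_code, ear_code in itertools.product(
        range(1, n_participants + 1), TEST_CODES.values(), EAR_CODES.values()
    ):
        for _ in range(per_cell):
            rows.append(
                {
                    "subject": subject,
                    "test": test_code,
                    "ear": ear_code,
                    "test_x_ear": test_code * ear_code,  # interaction term
                }
            )
    return rows


design = build_design(55)
print(correct_trials_per_phase(), len(design))
```

A power simulation would then repeatedly generate log RTs on this grid from the fitted model parameters and refit the model, which is the step that simr automates in R.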

FIG. 1. (Color online) The results of the talker identification task in experiment 1. (A) shows performance during the talker identification task for all participants (n = 97), and (B) shows performance during the talker identification task for those who met the learning criterion (n = 58). In (A) and (B), the distributions of participants' accuracy scores (mean proportion correct) for each test are shown at left, and the distribution of participants' mean response times to correct responses by test and ear of stimulus presentation are shown at right.

3. Stimuli

a. Talker identification.

Stimuli for the talker identification task were drawn from two VOT continua, one that perceptually ranged from gain to cane and one that perceptually ranged from goal to coal. Both of the continua were created by applying a linear predictive coding (LPC) synthesis procedure to natural productions of the voiced end points (i.e., gain, goal) elicited from a single female, monolingual speaker of American English. These stimuli are a subset of those used in Theodore and Miller (2010), to which the reader is referred for comprehensive details on stimulus construction.

From each of these continua, we selected two unique tokens for each of the two talkers, who were fictitiously referred to as Joanne and Sheila. The selected tokens were in the unambiguous, voiceless region of the original continua and, thus, perceptually cued the words cane and coal. The tokens were selected so that Joanne had characteristically shorter VOTs than Sheila. Specifically, the selected VOTs ranged between 84 and 89 ms for Joanne and between 165 and 170 ms for Sheila. With this procedure, the only difference between the two talkers' voices was their characteristic VOTs. These stimuli were used for the training and test phases because the left ear advantage in Francis and Driscoll (2006) only emerged for trained items. As described in Sec. II A 4, stimuli were presented binaurally during training and monaurally at test.

b. Individual differences measures.

Separate stimulus sets were used in each of the four individual differences tasks. Stimuli for the flanker task consisted of linear arrays of five arrows in which the middle arrow was either congruent (e.g., < < < < <) or incongruent (e.g., < < > < <) with the flanking arrows. There were 80 arrays in total, 20 congruent and 20 incongruent arrays for each of 2 arrow directions (i.e., left vs right). Stimuli for the pitch perception task consisted of a subset of the local pitch perception stimuli from Xie and Myers (2015), to which the reader is referred for comprehensive details on stimulus construction. In brief, each stimulus consisted of a pair of six-tone sequences separated by 1000 ms of silence. There were 32 pairs in total, 16 of which contained 2 identical tone sequences (i.e., same trials) and 16 of which contained tone sequences that differed in pitch for 1 of the 6 tones of the sequence (i.e., different trials).

Stimuli for the category identification and within-category discrimination tasks consisted of tokens drawn from gain-cane and goal-coal VOT continua, which were created using the methods described for the talker identification stimuli. Critically, the stimuli used for the category identification and within-category discrimination tasks were produced by a different talker than was used in the main talker identification learning task to minimize potential transfer of learning between the two sessions. Stimuli for the category identification task consisted of ten tokens from each continuum consisting of VOTs that ranged between 21 and 99 ms; as a consequence, the selected VOTs perceptually cued both end points for each continuum (i.e., gain, cane, goal, coal). Stimuli for the within-category discrimination task consisted of 15 tokens from each continuum consisting of VOTs that ranged between 79 and 208 ms; accordingly, all of the selected tokens cued the voiceless end point (i.e., cane, coal). The selected tokens were arranged into same and different pairs; the word was held constant on a given pair. For each word, there were 12 unique same pairs that sampled the range of selected VOTs. For each word, there were 36 different pairs, reflecting 6 unique pairs for each of 3 step distances between pair members (reflecting a difference in VOT of 28, 54, or 80 ms, respectively) and 2 pair orders. As we describe in Sec. II A 4, we presented three repetitions of the same pairs and one repetition of the different pairs during the within-category discrimination task to equate the number of same and different trials.

4. Procedure

a. Session 1.

All of the testing was completed online using Gorilla Experiment Builder² (Anwyl-Irvine et al., 2020). After providing informed consent, participants completed a series of headphone screens. These included two existing protocols that use dichotic listening tasks to screen for headphone compliance on web-based platforms (Milne et al., 2021; Woods et al., 2017). The third screen was a custom channel detection task in which listeners heard a tone presented to either the left or right ear and were asked to indicate via a button press in which ear they heard the tone. The channel detection task was used in the headphone screen battery because although the Woods et al. (2017) and Milne et al. (2021) dichotic listening tasks assess use of stereo headphones, they do not assess whether a participant has placed the left channel on the left ear (and, thus, the right channel on the right ear), which is a critical requirement for the present study. Participants who did not pass on all three screens were excluded from analyses; pass was defined as ≥5 correct responses (of six total trials) on the Woods et al. (2017) and Milne et al. (2021) tasks and ≥7 correct responses (of eight total trials) on the custom channel detection task.
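The exclusion rule for the headphone screens can be written as a single predicate. The sketch below is an illustration in Python rather than the code used on the testing platform; the function and argument names are hypothetical, and only the thresholds come from the text.

```python
# Pass thresholds stated in the text (correct responses required per screen).
WOODS_MIN_CORRECT = 5      # of six trials
MILNE_MIN_CORRECT = 5      # of six trials
CHANNEL_MIN_CORRECT = 7    # of eight trials


def passes_headphone_screens(woods_correct, milne_correct, channel_correct):
    """Return True only if the participant passes all three screens."""
    return (
        woods_correct >= WOODS_MIN_CORRECT
        and milne_correct >= MILNE_MIN_CORRECT
        and channel_correct >= CHANNEL_MIN_CORRECT
    )


print(passes_headphone_screens(6, 5, 8))  # passes every screen
print(passes_headphone_screens(6, 6, 6))  # fails the channel detection screen
```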

After completing the headphone screens, participants completed the talker identification task. The talker identification task consisted of familiarization, pretest, training, and posttest phases. During familiarization, listeners heard two repetitions of the four tokens for each talker while seeing the name of the talker displayed on the screen. Stimuli during familiarization were presented binaurally and blocked by word and talker (i.e., they heard Joanne's two cane tokens and then Sheila's two cane tokens, followed by Joanne's two coal tokens and then Sheila's two coal tokens). Listeners were directed to listen to each word, view the talker's name, and try to learn each talker's voice; no responses were collected during familiarization.

Following familiarization, listeners completed the pretest phase. On a given trial, stimuli were presented monaurally to either the left or right channel. The pretest consisted of 80 trials (2 tokens × 2 words × 2 talkers × 2 channels × 5 repetitions) presented in a different randomized order for each participant. Participants were asked to indicate whether the word was produced by Joanne or Sheila as quickly as possible without sacrificing accuracy. Participants made their responses using the “a” and “l” keys and were instructed to keep their index fingers on these keys throughout the experiment to facilitate faster response times; a visual diagram was provided to demonstrate correct finger placement during the instructions. The instructions explicitly noted that this was not a sound localization task to help ensure that participants understood that they should be indicating which talker they heard on each trial and not ear of stimulus presentation. No feedback was provided during pretest.

After the pretest, participants completed the training phase. The training phase consisted of 400 trials (2 tokens × 2 words × 2 talkers × 50 repetitions) of talker identification following the task instructions described for pretest; trials were randomized separately for each participant. Stimuli were presented binaurally and feedback was provided on every trial in the form of a green checkmark (for correct responses) or a red “x” (for incorrect responses). Session one concluded with the posttest phase, which was identical to the pretest phase.
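The trial counts for the test and training phases follow from fully crossing the design factors. The sketch below builds such lists in Python under the assumption that every cell is repeated equally often and the full list is shuffled once per participant; the token labels and function names are placeholders, not identifiers from the experiment.

```python
import itertools
import random

TOKENS = (1, 2)                  # placeholder labels for the two tokens per word
WORDS = ("cane", "coal")
TALKERS = ("Joanne", "Sheila")
CHANNELS = ("left", "right")     # monaural presentation at test


def make_test_trials(rng, repetitions=5):
    """Pretest or posttest list: 2 tokens x 2 words x 2 talkers x 2 channels x 5."""
    cells = list(itertools.product(TOKENS, WORDS, TALKERS, CHANNELS))
    trials = cells * repetitions
    rng.shuffle(trials)          # separate random order per participant
    return trials


def make_training_trials(rng, repetitions=50):
    """Training list: 2 tokens x 2 words x 2 talkers x 50, presented binaurally."""
    cells = list(itertools.product(TOKENS, WORDS, TALKERS))
    trials = cells * repetitions
    rng.shuffle(trials)
    return trials


rng = random.Random(1)           # seed chosen only to make the example repeatable
print(len(make_test_trials(rng)), len(make_training_trials(rng)))
```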

A progress bar was displayed on the bottom center of the screen throughout the entirety of the experiment and the interstimulus interval (ISI) was constant at 1000 ms (measured from the participant's response on each trial to the onset of the next stimulus). The entire procedure lasted approximately 35 min, and participants were compensated $5.83 for their participation.

b. Session 2.

All of the testing was completed online using Gorilla Experiment Builder (Anwyl-Irvine et al., 2020). After providing informed consent, listeners completed the headphone screens of Woods et al. (2017) and Milne et al. (2021). All of the participants who returned for session two (n = 59) passed all of the headphone screens at session two. After completing the headphone screens, participants completed the within-category discrimination, flanker, category identification, and pitch perception tasks in this fixed order. The within-category discrimination task consisted of 72 same trials (2 words × 12 VOTs × 3 repetitions) and 72 different trials (2 words × 3 distances × 6 unique pairs × 2 pair orders). On each trial, participants were directed to indicate whether the two members of the pair were the same or different by clicking one of two appropriately labeled buttons. The flanker task consisted of 1 randomization of the 80 linear arrays (2 trial types × 2 directions × 20 repetitions). On each trial, participants were directed to indicate the direction of the central arrow as quickly as possible without sacrificing accuracy. Participants were asked to keep their index fingers on top of the response keys throughout the task and a visual diagram illustrating correct finger placement was provided during the instructions.

The category identification task consisted of 80 trials (2 continua × 10 VOTs × 4 repetitions). Participants were asked to indicate whether the word began with a “g” as in gain and goal or “k” as in cane and coal by clicking on an appropriately labeled button. The pitch perception task consisted of 64 trials (2 trial types × 16 unique pairs × 2 repetitions). On each trial, participants indicated whether the two members of the pair were the same or different by clicking on an appropriately labeled button. For all of the tasks, trials were presented in a separate randomized order for each participant and the ISI was 1000 ms. The entire procedure lasted approximately 35 min; participants were compensated $5.83 for their participation.
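The session two trial counts can be checked by multiplying the design factors reported for each task. The following Python sketch does only that arithmetic; the dictionary keys are descriptive labels chosen for this illustration.

```python
from math import prod

# Design factors for each session two task, as reported in the text.
SESSION_TWO_DESIGN = {
    "discrimination_same": {"words": 2, "vots": 12, "repetitions": 3},
    "discrimination_different": {
        "words": 2, "distances": 3, "unique_pairs": 6, "pair_orders": 2,
    },
    "flanker": {"trial_types": 2, "directions": 2, "repetitions": 20},
    "category_identification": {"continua": 2, "vots": 10, "repetitions": 4},
    "pitch_perception": {"trial_types": 2, "unique_pairs": 16, "repetitions": 2},
}

# Total trials per task is the product of its factor levels.
trial_counts = {
    task: prod(factors.values()) for task, factors in SESSION_TWO_DESIGN.items()
}

for task, count in trial_counts.items():
    print(f"{task}: {count}")
```

The equal counts for same and different discrimination trials reflect the balancing of repetitions described for that task.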

B. Results

1. Talker identification

a. Training.

Trial-level data (for all of the tasks) and a script (in R) are available on the Open Science Framework.³ Executing the script will reproduce all statistics reported in this manuscript in addition to generating all of the figures. For the training phase, the accuracy for each participant was calculated in terms of proportion correct responses across all of the training trials. We excluded 15 participants because they failed to meet the inclusion criterion for training accuracy (≥0.60). Mean accuracy across included participants (0.83, SD = 0.10, range = 0.61–0.98) was significantly above chance as confirmed by a one-sample t-test [t(96) = 32.495, p < 0.001], which was expected based on the inclusion criterion.
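As a rough check, the one-sample t statistic can be approximated from the reported summary values. The Python sketch below assumes that chance is 0.50 for the two-alternative task and uses the rounded mean and SD, so it will agree with the reported statistic only approximately; it is not the authors' R script.

```python
import math


def one_sample_t(mean, sd, n, mu0):
    """Return (t, df) for a one-sample t-test computed from summary statistics."""
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1


# Reported training accuracy: mean = 0.83, SD = 0.10, n = 97.
# Assumption: chance performance is 0.50 with two response options.
t_value, df = one_sample_t(mean=0.83, sd=0.10, n=97, mu0=0.50)
print(f"t({df}) = {t_value:.2f}")
```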

b. Test.

Accuracy and RT during the test phases were analyzed separately. For accuracy, trial-level responses (0 = incorrect, 1 = correct) were submitted to a generalized linear mixed effects model with the binomial response family as implemented with the glmer( ) function of the lme4 package (Bates et al., 2015) in R. The model included fixed effects of test (pretest = −0.5, posttest = 0.5), ear (left = −0.5, right = 0.5), and their interaction. The random effects structure included random intercepts by subject and random slopes by subject for test and ear. The model revealed a main effect of test [ $\hat{β}$ = 0.469, standard error (SE) = 0.069, z = 6.829, p < 0.001], indicating that accuracy improved from pretest (0.71, SD = 0.14) to posttest (0.79, SD = 0.12). There was no main effect of ear ( $\hat{β}$ = 0.034, SE = 0.043, z = 0.780, p = 0.435) nor an interaction between test and ear ( $\hat{β}$ = –0.090, SE = 0.078, z = –1.150, p = 0.250). The main effect of test is visualized in Fig. 1.
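Because the model is fit on the log-odds scale and test is coded as −0.5 and 0.5, the test coefficient is the posttest minus pretest difference in log-odds, averaged over ears. The Python sketch below converts the reported coefficient to an odds ratio; it is an interpretive aid rather than part of the reported analysis, and a Wald z recomputed from the rounded coefficient and SE may differ slightly from the reported value.

```python
import math

BETA_TEST = 0.469   # reported fixed effect of test (log-odds)
SE_TEST = 0.069     # reported standard error


def odds_ratio(beta):
    """Odds of a correct response at posttest relative to pretest."""
    return math.exp(beta)


def wald_z(beta, se):
    """Wald statistic from a coefficient and its standard error."""
    return beta / se


print(f"odds ratio = {odds_ratio(BETA_TEST):.2f}")
print(f"z from rounded values = {wald_z(BETA_TEST, SE_TEST):.2f}")
```

Under these values the odds of a correct response are roughly 1.6 times higher at posttest than at pretest for a typical participant.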

Trial-level RTs for correct responses during test were analyzed in a linear mixed effects model using the lmer( ) function of the lme4 package (Bates et al., 2015). The Satterthwaite approximation of degrees of freedom was used to evaluate statistical significance using the t distribution (Kuznetsova et al., 2017). RTs were log-transformed and trials exceeding 2.5 SDs of a participant's mean log RT were excluded (3.2% of correct RTs). The fixed and random effects structure was identical to that described for the accuracy analysis. The results of the model showed a main effect of test ( $\hat{β}$ = −0.073, SE = 0.020, t = −3.620, p < 0.001), with RTs decreasing from pretest (mean = 1051 ms, SD = 268) to posttest (mean = 958 ms, SD = 220). There was no main effect of ear ( $\hat{β}$ = 0.008, SE = 0.005, t = 1.489, p = 0.140) nor an interaction between test and ear ( $\hat{β}$ = –0.011, SE = 0.010, t = –1.075, p = 0.283). Figure 1 shows the distribution of participants' mean RTs by test session and ear.
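The RT trimming step can be sketched in Python as follows. The text does not state the base of the logarithm or whether the SD was computed over all of a participant's correct trials, so this illustration assumes natural logs, the sample SD, and a single mean per participant; `trim_log_rts` is a hypothetical name.

```python
import math
import statistics


def trim_log_rts(rts_ms, criterion=2.5):
    """Log-transform one participant's correct-trial RTs and drop outliers.

    Trials farther than `criterion` SDs from the participant's mean log RT
    are excluded. Assumptions: natural log and sample SD.
    """
    log_rts = [math.log(rt) for rt in rts_ms]
    mean_log = statistics.fmean(log_rts)
    sd_log = statistics.stdev(log_rts)
    return [x for x in log_rts if abs(x - mean_log) <= criterion * sd_log]


# Made-up RTs (ms) for one participant, including one very slow response.
example_rts = [800, 850, 900, 950, 1000, 1050, 1100, 900, 950, 1000, 1000, 950, 6000]
kept = trim_log_rts(example_rts)
print(len(example_rts), len(kept))
```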

Recall that in Francis and Driscoll (2006), the left ear advantage at posttest emerged only for listeners who met their learning criterion, defined as ≥ 5% improvement in talker identification accuracy from pretest to posttest. A parallel RT analysis was performed and limited to listeners in the current study who met this criterion (n = 58). The results converged with the full sample; RT decreased from pretest to posttest ( $\hat{β}$ = –0.083, SE = 0.031, t = –2.671, p = 0.010), but there was no main effect of ear ( $\hat{β}$ = 0.008, SE = 0.007, t = 1.152, p = 0.255) nor an interaction between test and ear ( $\hat{β}$ = –0.020, SE = 0.013, t = –1.541, p = 0.123).
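The learning criterion can be expressed as a simple filter on pretest and posttest accuracy. The Python sketch below assumes that "5% improvement" means a gain of at least five percentage points and works with counts of correct trials (80 per phase) to avoid floating-point edge cases at the threshold; both choices are assumptions of this illustration.

```python
def met_learning_criterion(pre_correct, post_correct, n_trials=80, min_gain_pct=5):
    """True if accuracy rose by at least `min_gain_pct` percentage points.

    Uses integer arithmetic: (post - pre) / n_trials >= min_gain_pct / 100.
    """
    return (post_correct - pre_correct) * 100 >= min_gain_pct * n_trials


print(met_learning_criterion(56, 60))  # gain of 4 of 80 trials, exactly 5 points
print(met_learning_criterion(56, 59))  # gain of 3 of 80 trials, below criterion
```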

2. Individual differences measures

a. Flanker.

Mean accuracy (proportion correct) across participants was near ceiling (0.97, SD = 0.03, range = 0.85–1.00). To ensure that the expected inhibition effect was observed across participants in the aggregate, trial-level log RTs for correct responses were analyzed in a linear mixed effects model following the methods outlined previously. RTs were log-transformed and trials exceeding 2.5 SDs of a participant's mean log RT were excluded (2.8% of correct RTs). Congruency was entered as a fixed effect (congruent = −0.5, incongruent = 0.5); the random effects structure consisted of random intercepts by participant and random slopes for congruency by participant. The results of the model confirmed a main effect of congruency ($\hat{β}$ = 0.049, SE = 0.006, t = 7.648, p < 0.001) with RTs faster for congruent (mean = 434 ms, SD = 63) compared to incongruent trials (mean = 455 ms, SD = 61). For each participant, inhibition was calculated as the difference in RT between congruent and incongruent trials (congruent RT − incongruent RT); thus, more negative scores indicate weaker inhibition. Performance on the four individual differences measures is shown in Fig. 2.

FIG. 2. — (Color online) The performance on the four individual differences measures in experiment 1. (A) shows the distribution of participants' mean response times by trial type (left) and the distribution of interference scores (right) for the flanker task. (B) shows the distribution of accuracy scores for same and different trials (left) and the distribution of sensitivity scores (right) for the pitch perception task. (C) shows the relationship between /k/ responses and VOT for each participant as determined by logistic regression (left) and the distribution of identification slopes (right) for the category identification task. (D) shows the distribution of accuracy scores for same and different trials (left) and the distribution of sensitivity scores (right) for the within-category discrimination task.

b. Pitch perception.

To quantify performance on the pitch perception task, we calculated sensitivity (d′) separately for each participant. A hit was defined as responding “same” on same tone sequence trials; a false alarm was defined as responding “same” on different tone sequence trials. The mean sensitivity (d′) across participants was 1.96 (SD = 1.01), which was significantly greater than zero [t(58) = 14.847, p < 0.001].
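The sensitivity index follows the standard signal detection definition, d′ = z(hit rate) − z(false alarm rate). The sketch below is a hedged illustration using only the Python standard library. The function name `d_prime` and the trial counts are hypothetical, and the log-linear correction for hit or false alarm rates of 0 or 1 is an assumption, because the source does not report how extreme rates were handled.

```python
from statistics import NormalDist


def d_prime(hits, n_same, false_alarms, n_different):
    """Sensitivity for a same/different task.

    hits: number of "same" responses on same trials
    false_alarms: number of "same" responses on different trials
    The log-linear correction (add 0.5 to each count and 1 to each trial
    total) is an assumption; it keeps both rates strictly between 0 and 1
    so that the inverse normal CDF stays finite.
    """
    hit_rate = (hits + 0.5) / (n_same + 1)
    fa_rate = (false_alarms + 0.5) / (n_different + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts: 40 same trials and 40 different trials.
print(d_prime(hits=30, n_same=40, false_alarms=10, n_different=40))
```

A participant who responds “same” equally often on both trial types has d′ = 0, which is the chance baseline that the group mean was tested against.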

c. Category identification.

Trial-level identification responses were fit to a logistic regression separately for each participant; in each regression, VOT was the independent variable and binary response (0 = /g/, 1 = /k/) was the dependent variable. Two parameters were derived from each regression: (1) the slope of the identification function and (2) the category boundary, defined as the VOT corresponding to 0.50 proportion /k/ responses. To derive the slope, we used the beta estimate for VOT from the regression model; with this metric, higher values indicate steeper identification slopes. The category boundary was derived using the model intercept and beta estimate for VOT from the regression model according to Eq. (1), where $\hat{β}_{0}$ is the intercept, $\hat{β}_{1}$ is the slope, and x is the category boundary:

$$\hat{β}_{0} + \hat{β}_{1} x = \log\left(\frac{0.5}{1 - 0.5}\right); \quad x = -\frac{\hat{β}_{0}}{\hat{β}_{1}}. \tag{1}$$

Two participants were excluded from subsequent category identification analyses because they did not show a statistically significant relationship between VOT and phonetic decisions; instead, their response functions were flat, suggesting that they did not perform the task as directed. Across participants, the mean slope of the identification function was 0.146 (SD = 0.076), and the mean category identification boundary was 54 ms (SD = 9 ms).
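As a worked illustration of Eq. (1), the sketch below computes the category boundary from a logistic regression intercept and slope and confirms that the predicted proportion of /k/ responses at that VOT is 0.50. The function names are hypothetical. The example coefficients are not taken from any participant: the slope is set to the reported group mean (0.146) and the intercept is chosen so that the boundary falls at the reported group mean of 54 ms.

```python
import math


def category_boundary(intercept, slope):
    """VOT (ms) at which the fitted logistic function predicts 0.50 /k/ responses.

    Implements x = -intercept / slope. A slope of zero corresponds to a flat
    identification function, for which no boundary is defined.
    """
    if slope == 0:
        raise ValueError("flat identification function: boundary is undefined")
    return -intercept / slope


def proportion_k(vot_ms, intercept, slope):
    """Predicted proportion of /k/ responses at a given VOT."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * vot_ms)))


# Illustrative coefficients only (see the lead-in for how they were chosen).
example_slope = 0.146
example_intercept = -0.146 * 54

boundary = category_boundary(example_intercept, example_slope)
print(boundary, proportion_k(boundary, example_intercept, example_slope))
```

The division by the slope is also why a flat response function is problematic: as the slope approaches zero the boundary estimate becomes arbitrarily large or undefined.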

d. Within-category discrimination.

To quantify performance on the within-category discrimination task, we calculated sensitivity (d′) separately for each participant. A hit was defined as responding “same” on same trials; a false alarm was defined as responding “same” on different trials. The mean sensitivity (d′) across participants was 1.20 (SD = 0.56), which was significantly greater than zero [t(58) = 16.348, p < 0.001].

3. Relationship between talker identification and individual differences measures

A series of correlations was performed to examine whether performance on the individual differences measures predicted performance on the talker identification task. Five measures of individual differences were considered: inhibition (congruent RT − incongruent RT), sensitivity (d′) for the pitch perception task, identification slope and category boundary for the category identification task, and sensitivity (d′) for the within-category discrimination task. Each of these five measures was correlated with four measures from the talker identification task: accuracy during training, accuracy at pretest, accuracy at posttest, and learning (accuracy at posttest − accuracy at pretest). These correlations are shown in Table I and visualized in Fig. 3.
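For readers who want to see the computation behind each cell of the correlation analysis, the following Python sketch computes a learning score, Pearson's r, and the associated t statistic with n − 2 degrees of freedom. It is illustrative only: the function names and the four-participant data set are hypothetical, and the p-value lookup is omitted because it requires a t distribution that the standard library does not provide.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, df):
    """t statistic for testing r against zero, with df = n - 2."""
    return r * math.sqrt(df / (1.0 - r * r))


# Hypothetical data for four participants.
pretest = [0.60, 0.70, 0.75, 0.80]
posttest = [0.70, 0.74, 0.85, 0.86]
discrimination = [0.4, 1.0, 1.3, 1.9]

# Learning is posttest accuracy minus pretest accuracy, as defined in the text.
learning = [post - pre for pre, post in zip(pretest, posttest)]

print(pearson_r(discrimination, posttest))
print(pearson_r(discrimination, learning))
print(t_from_r(0.41, 57))  # t for a correlation of 0.41 with 57 degrees of freedom
```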

TABLE I.

Pearson's correlation coefficient (r) and p-value (in parentheses) relating the five individual differences measures to each of four talker identification measures. Cells in bold indicate statistical significance with $α$ = 0.05. Cells with underline indicate statistical significance after applying the conservative Bonferroni correction to account for family-wise error rate (as described in the main text). The degrees of freedom were 57 for all of the measures except for the category identification measures, which had degrees of freedom equal to 55 (given that two participants were excluded due to failure to complete the task as directed). Parallel analyses were performed using Spearman's rank-order correlations and the results converged in all of the cases; these correlations can be viewed by executing the script provided in the OSF repository for this manuscript (footnote 3).

	Talker identification measure
Individual differences measure	Training	Pretest	Posttest	Learning
Inhibition	0.00 (0.971)	−0.15 (0.267)	−0.17 (0.209)	0.00 (0.971)
Pitch perception	0.41 (0.001)	0.28 (0.029)	0.40 (0.002)	0.09 (0.500)
Identification: Boundary	0.20 (0.132)	−0.09 (0.487)	0.25 (0.065)	0.36 (0.007)
Identification: Slope	0.19 (0.165)	0.25 (0.058)	0.16 (0.240)	−0.12 (0.363)
Discrimination	0.52 (<0.001)	0.32 (0.013)	0.45 (<0.001)	0.09 (0.477)

FIG. 3. — (Color online) Scatterplots illustrating the relationship between the five individual differences measures (by row) and the four measures of talker identification (by column) in experiment 1. Each point reflects an individual participant. The regression line indicates a linear model; the shaded region marks the 95% confidence interval.

There was no significant relationship between inhibition and any measure of talker identification. Pitch perception was positively associated with talker identification accuracy during training, pretest, and posttest; however, pitch perception was not related to the degree of learning. The location of the VOT voicing boundary was significantly associated with the degree of learning from pre- to posttest with longer category boundaries associated with better performance. Within-category discrimination was positively associated with talker identification accuracy during training, pretest, and posttest but was not associated with learning.

We note that when the conservative Bonferroni correction to account for family-wise error rate is applied (resulting in corrected $α$ = 0.0025 given $α$ = 0.05 and 20 comparisons), the only relationships that survive are the associations between the two measures of auditory acuity (i.e., pitch perception and within-category discrimination) and talker identification accuracy during training and posttest.⁴
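The Bonferroni step can be checked directly against the p-values in Table I. The sketch below is a minimal Python illustration with a hypothetical function name; the p-values are transcribed from the table, and entries reported as “<0.001” are stored as their upper bound of 0.001, which is a simplifying assumption that does not change which cells pass the threshold.

```python
ALPHA = 0.05
TALKER_MEASURES = ["Training", "Pretest", "Posttest", "Learning"]

# p-values transcribed from Table I, one row per individual differences measure.
# Entries reported as "<0.001" are stored as 0.001 (their upper bound).
TABLE_I_P = {
    "Inhibition": [0.971, 0.267, 0.209, 0.971],
    "Pitch perception": [0.001, 0.029, 0.002, 0.500],
    "Identification: Boundary": [0.132, 0.487, 0.065, 0.007],
    "Identification: Slope": [0.165, 0.058, 0.240, 0.363],
    "Discrimination": [0.001, 0.013, 0.001, 0.477],
}


def bonferroni_survivors(p_table, alpha=ALPHA):
    """Return the corrected alpha and the cells whose p-value falls below it."""
    n_comparisons = sum(len(row) for row in p_table.values())
    corrected_alpha = alpha / n_comparisons
    survivors = [
        (measure, TALKER_MEASURES[i])
        for measure, row in p_table.items()
        for i, p in enumerate(row)
        if p < corrected_alpha
    ]
    return corrected_alpha, survivors


corrected_alpha, survivors = bonferroni_survivors(TABLE_I_P)
print(corrected_alpha)
for cell in survivors:
    print(cell)
```

With 5 × 4 = 20 comparisons the threshold is 0.05 / 20 = 0.0025, so the boundary and learning correlation (p = 0.007) is significant at the uncorrected level but does not survive correction.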

III. EXPERIMENT 2

The results of experiment 1 revealed two primary findings. First, most listeners learned to use VOT as a cue to talker identification, which is consistent with the results of Francis and Driscoll (2006). That is, following a brief training phase, listeners improved in their ability to use a phonetic cue as an indicant of talker identity even in the absence of traditional indexical cues to voice identity (e.g., fundamental frequency). Second, auditory acuity was positively associated with talker identification, suggesting that heightened sensitivity to fine-grained acoustic information facilitated performance in the current task. Of note, we did not observe any evidence to suggest that ear of stimulus presentation influenced performance in the talker identification task. The goal of experiment 2 is to examine whether a left ear advantage is observed under conditions that are known to better facilitate behavioral observation of laterality effects. Following the conclusion of experiment 2, we present Bayes factors analyses to inform interpretation of null effects reported in this manuscript.

As reviewed in the Introduction, hemispheric laterality effects are more readily observed in behavioral tasks under conditions in which a competing stimulus is presented to the ear contralateral to the target stimulus (Behne et al., 2005; Behne et al., 2006; Bless et al., 2015; González et al., 2010; Hugdahl and Anderson, 1984; Studdert-Kennedy and Shankweiler, 1970; Westerhausen, 2019). This is in contrast to the manipulation used in Francis and Driscoll (2006) and experiment 1, in which silence was presented to the contralateral ear. Dichotic stimulus presentation facilitates the observation of laterality effects because ipsilateral auditory pathways are suppressed when the two ears are presented with competing stimuli. Most of the literature on laterality effects for auditory verbal processing differs from the focus of the current investigation in that previous research has primarily examined laterality effects when processing the linguistic content of the stimuli, that is, the “what” of a talker's message. For example, the pioneering work of Kimura (1967) presented verbal productions of different digits to each ear and asked listeners to identify which digit(s) they heard. Likewise, the now classic consonant-vowel (CV) dichotic listening paradigm presents different CV syllables to each ear and requires listeners to identify which syllable(s) they hear (Hugdahl and Anderson, 1984; Studdert-Kennedy and Shankweiler, 1970). The extensive literature on linguistic processing of dichotic signals, thus, supports a cumulative science that can inform optimal design decisions for eliciting and measuring laterality effects for auditory verbal processing (e.g., Bless et al., 2013; Bless et al., 2015; Parker et al., 2021; Westerhausen, 2019).

In contrast, studies using behavioral tasks to assess hemispheric laterality for talker processing (that is, the “who” of a linguistic message) are relatively sparse, reflecting an emerging line of inquiry. To our knowledge, only three studies have provided behavioral evidence of a left ear advantage for talker identification. The first is Francis and Driscoll (2006), which directly motivates the current work and, as described previously, consisted of a monaural task (i.e., stimuli presented to either the left or right ear) instead of a dichotic listening task. The second comes from Perrachione et al. (2009). In their study, native English and native Mandarin listeners completed a talker identification task (with feedback) for voices speaking English and Mandarin during a training phase. On each trial, listeners heard two talkers produce the same sentence and were asked to identify the talker in the left ear on some trials and the talker in the right ear on other trials. Trials were blocked by ear and stimulus language; that is, listeners completed four training blocks formed by crossing monitoring ear (left vs right) and stimulus language (English vs Mandarin). Analysis of talker identification accuracy during training revealed a left ear benefit for both listener groups only when identifying talkers producing English sentences, which the authors speculate may reflect differences in the temporal modulation of frequency information between the two languages.

The third study comes from González et al. (2010). In their study, listeners completed a talker identification task with target stimuli presented to either the left or right ear. The construct of interest in this study was long-term repetition priming; accordingly, talker identification accuracy was compared between same sentence (i.e., a talker's repeated sentence) and different sentence (i.e., a talker's novel sentence) trials. Pink noise was presented in the contralateral ear to the target stimulus in their first experiment, whereas silence was presented in the contralateral ear in their second experiment. The results of the two experiments converged to show a left ear advantage for recognition memory in the talker identification task. Specifically, talker identification accuracy was higher for same compared to different sentence trials when stimuli were presented in the left ear, and no such benefit was observed for stimuli presented in the right ear. The laterality effect was observed in both experiments; however, it was stronger in the first compared to the second experiment, consistent with noise in the contralateral ear serving to suppress the influence of ipsilateral auditory pathways (Behne et al., 2005; Behne et al., 2006).

Drawing from these three studies, the specific dichotic manipulation used in experiment 2 was to present pink noise in the contralateral ear to the target stimulus, as in González et al. (2010). This manipulation allowed us to use otherwise identical procedures between experiments 1 and 2 and, hence, better isolate the influence of a dichotic listening environment on any observed differences between the two experiments (in contrast to, for example, adopting the blocked ear design used in Perrachione et al., 2009).