Abstract
Speech perception depends on the ability to generalize previously experienced input effectively across talkers. How such cross-talker generalization is achieved has remained an open question. In a seminal study, Bradlow & Bent (2008, henceforth BB08) found that exposure to just five minutes of accented speech can elicit improved recognition that generalizes to an unfamiliar talker of the same accent (N=70 participants). Cross-talker generalization was, however, only observed after exposure to multiple talkers of the accent, not after exposure to a single accented talker. This contrast between single- and multi-talker exposure has been highly influential beyond research on speech perception, suggesting a critical role of exposure variability in learning and generalization. We assess the replicability of BB08’s findings in two large-scale perception experiments (total N=640) including 20 unique combinations of exposure and test talkers. Like BB08, we find robust evidence for cross-talker generalization after multi-talker exposure. Unlike BB08, we also find evidence for generalization after single-talker exposure. The degree of cross-talker generalization depends on the specific combination of exposure and test talker. This and other recent findings suggest that exposure to cross-talker variability is not necessary for cross-talker generalization. Variability during exposure might affect generalization only indirectly, mediated through the informativeness of exposure about subsequent speech during test: similarity-based inferences can explain both the original BB08 and the present findings. We present Bayesian data analysis, including Bayesian meta-analyses and replication tests for generalized linear mixed models. All data, stimuli, and reproducible literate (R markdown) code are shared via OSF.
Keywords: speech perception, foreign-accented speech, adaptation, cross-talker generalization, replication, Bayesian analyses
Introduction
Talkers differ in the acoustic realization of the same word due to factors such as physiology, language background, and talker identity (e.g., Allen, Miller & DeSteno, 2003; Newman, Clouse & Burnham, 2001; Peterson & Barney, 1952). These differences are an important part of the speech signal, in that they encode information about social identity (for reviews, see Eckert, 2012; Foulkes & Hay, 2015). At the same time, inter-talker variability causes substantial computational challenges to speech perception. How human listeners achieve robust speech recognition despite this variability, and how they utilize previous experience to understand unfamiliar talkers, has been a perennial puzzle for language researchers (for review, see Kleinschmidt & Jaeger, 2015).
In this context, non-native (L2-accented) speech presents a case that is of both social and theoretical relevance. Of social relevance, because negative attitudes towards L2-accented speech can result in discrimination, with sometimes substantial social and socioeconomic consequences for the speaker (e.g., Fuertes, Gottdiener, Martin, Gilbert & Giles, 2012; Lippi-Green, 2012; Munro, 2003). Of theoretical relevance, because L2-accented speech deviates from native speech (henceforth L1-accented speech) in systematic ways, depending on the talker’s language background: talkers of the same first language (e.g., Mandarin) tend to share characteristics in their pronunciations of a second language (e.g., English). Research in speech perception has drawn on this property in order to understand the conditions under which listeners generalize previous experience to subsequently encountered talkers with the same L2 accent (for review, see Baese-Berk, 2018).
When exposed to a talker with an unfamiliar L2 accent, native listeners might initially experience substantial processing difficulty (e.g., Munro & Derwing, 1995; Schmale & Seidl, 2009). This initial difficulty can dissipate quickly with exposure to the talker (e.g., Clarke & Garrett, 2004; Xie, Weatherholtz, Bainton, Rowe, Burchill, Liu & Jaeger, 2018; Weil, 2001). Such talker-specific adaptation is now well-documented (for review, see Weatherholtz & Jaeger, 2016), and similar adaptation has been observed for native dialects and regional varieties (Best, Shaw, Mulak, Docherty, Evans, Foulkes, Hay, Al-Tamimi, Mair & Wood, 2015; Smith, Holmes-Elliott, Pettinato & Knight, 2014), as well as idiosyncratic sound-specific deviations from canonical pronunciations (e.g., Kraljic & Samuel, 2005; Norris, McQueen & Cutler, 2003; Sumner, 2011).
Listeners also seem to be able to generalize previous experience to other unfamiliar talkers of the same accent. Research on the perception of L2 accents has played a key role in showcasing this ability: because deviations from L1-accented speech are in part systematic across talkers of the same L2 accent, implicit knowledge of this cross-talker variability is predicted to facilitate speech perception (Foulkes & Hay, 2015; Kleinschmidt & Jaeger, 2015). There is evidence that this is indeed the case for automatic speech recognition systems, which benefit from training on speech from different dialects and accents (e.g., Soto, Siohan, Elfeky & Moreno, 2016; Tatman, 2016). Similarly, human listeners who are familiar with a regional or L2 accent via long-term experience tend to show better comprehension of that accent than listeners who are not (Porretta, Tucker & Järvikivi, 2016; Stuart-Smith, 2008; Witteman, Weber & McQueen, 2013). Long-term cumulative experience with an accent thus seems to facilitate generalization to unfamiliar talkers of the same accent and perhaps to L2-accented speech more generally (Bent & Bradlow 2003).
How this ability to generalize is gained via individual encounters with L2-accented talkers—i.e., how listeners incrementally and cumulatively come to improve their comprehension of an L2 accent—is, however, still largely unknown. This gap in our knowledge includes some of the basic conditions for successful cross-talker generalization, and we hope to fill this gap with the current investigation. Here, we seek to replicate a seminal study on this question (Bradlow & Bent, 2008). Both within research on speech perception and within research on learning more generally, this study is often cited as evidence for an important constraint on generalization. This constraint is variably characterized as a categorical limitation—the inability to generalize to unfamiliar talkers of an L2 accent following exposure to only a single talker of that accent—or a matter of degree—the relative advantage in generalization following exposure to multiple talkers of an accent (although these two interpretations differ, they are not always distinguished in practice). Both interpretations have been highly influential in shaping theories of speech perception and effective learning.
There are, however, reasons to revisit Bradlow and Bent’s study. This includes limitations of the experimental design employed in the original study—limitations that were clearly acknowledged by the authors but have since been forgotten. These limitations are particularly important in light of a small number of more recent studies with potentially conflicting results. Despite these new findings, the original findings of Bradlow and Bent’s experiment continue to be taken as ground truth (for a recent review, see Baese-Berk, 2018, p. 16–18). We thus present two large-scale replications of Bradlow and Bent’s experiment while removing the confounds of the original study. We address both the categorical question—whether generalization can be found after exposure to a single talker—and the gradient question—whether multi-talker exposure facilitates generalization over and above single-talker exposure. Beyond speech perception, the analysis approach and findings we present are of relevance to research on effective training, therapy, and teaching.
Bradlow and Bent (2008)
Bradlow and Bent (2008; henceforth BB08) made several important contributions to the field of speech perception. Here we focus on their Experiment 2, which investigates cross-talker generalization. In this pioneering experiment, BB08 examined listeners’ ability to generalize following initial exposure to unfamiliar L2 accents. In particular, BB08 asked how the type of exposure—multiple talkers or a single talker—affects listeners’ comprehension of unfamiliar talkers of the same L2 accent.
L1-English listeners were exposed to 160 spoken sentences from either L1 talkers of American English or Mandarin-accented talkers of English—about five minutes worth of speech, spread over two exposure sessions on two consecutive days. Both during exposure and the subsequent test, the participants’ task was to transcribe the sentences they heard. Performance was measured by the number of correctly transcribed words. Between-participants, BB08 manipulated the speech heard during exposure. In the control condition, participants heard five L1-accented talkers of English and were tested on the speech of a Mandarin-accented talker. In the talker-specific condition, participants heard the same Mandarin-accented talker during exposure and test. In the single-talker condition, participants heard one Mandarin-accented talker during exposure and a different Mandarin-accented talker during test. Finally, in the multi-talker condition, participants heard five different Mandarin-accented talkers, followed by a test against a different (sixth) Mandarin-accented talker.
BB08 found that exposure to speech from one accented talker leads to improved transcription performance on novel sentences from the same talker during test (talker-specific adaptation): participants in the talker-specific condition achieved about 10% better performance during test (92 RAU) than participants in the control condition (82 RAU; RAU are rationalized arcsine-transformed proportions; for the present purpose, they closely approximate percent correct). BB08 further found that listeners generalized this learning to unfamiliar talkers of the same L2 accent (cross-talker generalization). Critically, listeners only did so following exposure to multiple talkers (90 RAU, +8 above the control condition). No generalization was observed following exposure to only a single talker (82 RAU, ±0 above the control condition).1
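As an aside for readers unfamiliar with this unit: the rationalized arcsine transform (Studebaker, 1985) maps proportions correct onto a scale that is approximately linear in percent correct in the mid-range while stabilizing variance near floor and ceiling. A minimal sketch (shown in Python for illustration; the paper's own analysis code is the R Markdown shared on OSF, and the constants below follow the commonly cited formulation of the transform):

```python
import math

def rau(correct, total):
    # Rationalized arcsine transform (Studebaker, 1985): theta combines
    # two arcsine-square-root terms to stabilize variance; the linear
    # rescaling makes mid-range values approximately equal to percent
    # correct (the scale runs slightly below 0 and above 100 at the ends).
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146 / math.pi) * theta - 23

# At 50% correct, RAU coincides with percent correct:
print(round(rau(50, 100), 2))  # → 50.0
```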
Both the existence of cross-talker generalization after multi-talker exposure and the complete lack of cross-talker generalization after single-talker exposure are striking findings. Together, these findings seem to suggest clear limits on generalization during speech perception. BB08 proposed that the cross-talker variability observed during multi-talker exposure allows listeners to distinguish between accent- and talker-specific characteristics, essentially learning the structure of the accent. The findings also seem to rule out the alternative explanations in terms of similarity-based generalization from episodic (Goldinger, 1996, 1998) or exemplar accounts of speech perception (Johnson, 1997; Pierrehumbert, 2001). These accounts—which we return to in the general discussion—would predict that cross-talker generalization can occur after exposure to a single talker, provided that the exposure and test talker are sufficiently similar (e.g., Eisner & McQueen, 2005; Goldinger, 1996; Reinisch & Holt, 2014; Xie & Myers, 2017). In short, listeners’ ability to generalize to unfamiliar talkers after single-talker exposure speaks directly to the mechanism supporting cross-talker generalization and the types of representations listeners maintain about previously experienced input. It is thus not surprising that this particular set of findings—generalization across talkers, but only after exposure to multiple talkers—has been influential in theoretical work on speech perception (for reviews, Kleinschmidt & Jaeger, 2015; Sumner, 2011), second language learning (see references in Pajak, Fine, Kleinschmidt & Jaeger, 2016) and research on learning more generally (e.g., Potter & Saffran, 2017; Schmale & Seidl, 2009).
There are, however, reasons to call for caution in interpreting the findings of BB08: while the test talker for the single- and multi-talker exposure conditions in BB08 was identical, the exposure talkers differed between the two conditions. Specifically, BB08 used four different exposure talkers across four different (between-participant) single-talker conditions. The multi-talker condition employed five exposure talkers (for all participants in that condition). Critically, only two of the exposure talkers in the multi-talker condition were also used in the single-talker condition. Thus, if generalization to the test talker depends at least in part on similarities between the exposure and test talkers—as predicted by, for example, exemplar and episodic accounts—this could explain the results of BB08. The results of BB08 are thus ambiguous with regard to whether multi-talker exposure is necessary for, or even facilitates, cross-talker generalization. Yet, this alternative interpretation of BB08’s findings has not been investigated in subsequent studies.
A second, more general, reason to revisit BB08 is the use of only one test talker. This is not uncommon for studies on speech perception, including the perception of L2-accented speech (e.g., Clarke & Garrett 2004; Reinisch, Wozny, Mitterer & Holt 2014; Janse & Adank 2012; including our own work, e.g., Xie et al. 2018; Xie, Earle & Myers 2018). It raises questions, however, about the extent to which the results of studies with a single test talker generalize across the population of talkers. Indeed, a few more recent studies have returned mixed evidence for cross-talker generalization after single-talker exposure, though it is worth pointing out that these studies, too, were based on only one test talker (Xie & Myers 2017; Xie et al. 2018; as well as unpublished thesis work in Clarke 2003; Weil 2001; we return to these and related studies in the general discussion).2 Reliance on only one test talker is arguably particularly problematic for investigations of cross-talker generalization because some accounts of speech perception predict that generalization depends on the similarity between the exposure and test talker (e.g., Goldinger, 1998; Johnson, 1997; Kleinschmidt & Jaeger, 2015; Kraljic & Samuel, 2007). In short, there is to this day no study with multiple test talkers that tests, and compares, cross-talker generalization after both single- and multi-talker exposure within the same experiment.3 The present study seeks to address this gap in the literature.
The need for such a replication is concisely exemplified by a recent review of the field (Baese-Berk, 2018), summarizing BB08: “[. . . ] listeners who heard a different single talker at test and training do not perform better than the individuals who were trained on the task but with native English speakers. However, [. . . ] with increased variation in number of talkers, listeners demonstrate talker-independent adaptation for Mandarin. That is, listeners exposed to multiple talkers during training were able to generalize to a novel talker from the same language background at test.” (Baese-Berk, 2018, p. 16). This is, of course, a correct summary of BB08. The question we ask here is whether these results replicate when multiple test talkers are used, all combinations of exposure and test talkers are adequately counter-balanced across conditions, and response data from a large number of participants is collected.
The present study
We present two large-scale replications of BB08 with 320 participants each. These are presented as Experiments 1a and 1b. We seek to contribute to three questions about cross-talker generalization, summarized in Table 1.
Table 1.
Three research questions about cross-talker generalization.
| | Generalization | Question |
|---|---|---|
| (1) | Multi-talker | Is generalization to an unfamiliar L2 talker possible following exposure to multiple L2 talkers of the same accent? |
| (2) | Single-talker | Is generalization to a novel L2 talker possible following exposure to a single L2 talker of the same accent? |
| (3) | Multi- vs. single-talker | Does exposure to multiple L2 talkers facilitate generalization to a novel talker of the same accent beyond exposure to a single L2 talker? |
Our replication closely follows BB08, with some exceptions, while removing the aforementioned confounds. Following BB08, we employ the same exposure-test paradigm. We compare listeners’ transcription accuracy for an L2-accented talker presented during test following exposure to L1-accented English (control condition), exposure to the same L2 talker (talker-specific condition), exposure to another L2 talker of the same accent (single-talker condition), and exposure to multiple L2 talkers of the same accent (multi-talker condition). By comparing transcription accuracy during test for the multi-talker condition against the control, we address Question 1. Of the three questions we ask here, this is the one for which there is by now the most convincing evidence (Sidaras et al., 2009; Tzeng et al., 2016; Alexander & Nygaard, 2019). By comparing accuracy for the single-talker condition against the control, we address Question 2. For this question, some studies have returned affirmative answers while other studies—including BB08—have returned a negative answer. However, all of these studies have employed a single test talker. Finally, by comparing accuracy for the multi-talker against the single-talker condition, we address Question 3.
To address the potential confounds of the original BB08 study, our experiments compare generalization of adaptation for multiple test talkers (four in each experiment) and multiple exposure talkers (six in each experiment). This results in a large number of exposure-test talker combinations—the single-talker condition in Experiments 1a and 1b contains 20 unique combinations of exposure and test talkers (compared to four in BB08; see also Table 3). This contrasts with BB08 and other previous work, which has generally assessed adaptation or generalization for only one test talker (e.g., Gass & Varonis 1984; Wade, Jongman & Sereno 2007; Weil 2001; but see Sidaras et al. 2009; Tzeng et al. 2016). Unlike BB08, we also counterbalance all exposure-test talker combinations so as to hold talker identity constant across the single- and multi-talker conditions. This removes the potential confound—and thereby the alternative interpretation—of the results observed in BB08. The design of our experiments is visualized in Figure 1 and described in more detail under Procedure.
Table 3.
Number of participants and talker-pairs in each exposure condition remaining for analysis (Experiment 1a and 1b). For comparison, we list the same information for Bradlow and Bent (2008, Experiment 1). For example, in the single-talker condition, BB08 paired four possible exposure talkers with one test talker for a total of 4 talker pairs. By comparison, we paired 5 exposure talkers with 4 test talkers to form 20 talker pairs for each of Experiments 1a and 1b.
| Condition | BB08: exposure-test talker combinations | BB08: participants | Current study (Exp 1a and 1b): exposure-test talker combinations | Current study (Exp 1a and 1b): participants |
|---|---|---|---|---|
| Control | 1 | 10 | 4 | 80 (each) |
| Single-talker | 4 (1 test) | 40 | 20 (4 test) | 80 (each) |
| Multi-talker | 1 | 10 | 4 | 80 (each) |
| Talker-specific | 1 | 10 | 4 | 80 (each) |
| Total | | 70 | | 320 (each) |
Figure 1.
Design balancing both exposure and test talkers across participants. Unlike in BB08, the exposure talkers that occurred across participants within the single-talker condition were identical to the exposure talkers in the multi-talker condition. Each shape represents a different talker—i.e., the square is always the same talker (only two out of four test talkers are shown). The exposure talkers in the control condition are the only L1-accented talkers of American English (and are therefore represented by shapes different from those representing Mandarin-accented talkers in the other three conditions). Background colors indicate the exposure condition. The resulting large number of lists (see Procedure) and participants (320 each in Experiments 1a and 1b) motivates the use of a web-based crowd-sourcing paradigm.
We also halve the amount and duration of exposure, compared to BB08 (80 instead of 160 exposure sentences). Beyond simply saving costs, this deviation was motivated by theoretical considerations. Specifically, it allows us to rule out alternative explanations for cross-talker generalization after single-talker exposure, if observed. Bradlow and Bent propose that generalization from single talkers might be possible after prolonged exposure to one talker because of increased exposure to within-talker variability (BB08, p. 723). This, Bradlow and Bent point out, would explain an unpublished result by Weil (2001) finding generalization after three days of single-talker exposure. By using substantially less exposure than in BB08’s study, we aim to rule out a similar explanation should we observe cross-talker generalization following single-talker exposure.
Finally, we address concerns about spurious significance (Type I error) and power (Type II error) by increasing the number of participants per condition well above the minimum recommended in the statistical literature (from 10 in BB08 to 80 participants per condition in each of Experiments 1a and 1b; Simmons, Nelson & Simonsohn 2011 recommend a minimum of 20 participants per between-participant condition). Lack of power not only increases the rate of Type II errors (failure to detect an effect); it also makes it more likely that an experiment yields effect estimates that are zero or point in the opposite direction from the true effect. It is therefore possible that lack of power contributed to the observed ±0 effect of single-talker exposure in BB08 (Question 2), and—as a consequence—the observed benefit of multi-talker over single-talker exposure (Question 3).
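The risk of sign errors under low power can be illustrated with a short Monte Carlo sketch. The effect size and trial variability below are illustrative assumptions, not estimates from BB08 or the present study; only the group size of 10 matches BB08's per-condition sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative (hypothetical) values: a small positive true benefit of
# exposure, unit-SD noise, and n = 10 participants per condition.
true_effect, sd, n, n_sims = 0.2, 1.0, 10, 20000

control = rng.normal(0.0, sd, size=(n_sims, n))
exposure = rng.normal(true_effect, sd, size=(n_sims, n))
diff = exposure.mean(axis=1) - control.mean(axis=1)

# Fraction of simulated experiments whose estimated effect is zero or
# negative despite the positive true effect (roughly one in three here):
print(np.mean(diff <= 0))
```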
The 320 participants we recruit in each experiment are the minimal number required to fully counter-balance the design of Experiments 1a and 1b, due to the large number of between-participant experimental lists (up to 80 per exposure condition, for a total of 128 unique lists in each experiment; see Procedure). Data collection for this large number of participants is facilitated through a web-based crowdsourcing paradigm. We have previously replicated similar lab-based speech perception paradigms over the web (Kleinschmidt & Jaeger, 2012; Kleinschmidt, Raizada & Jaeger, 2015; Liu & Jaeger, 2018; Xie et al., 2018), including paradigms that employed transcription to detect talker-specific adaptation to L2 accents (e.g., two experiments with multiple between-participant conditions in Burchill, Liu & Jaeger, 2018).4 Power analyses presented in Appendix C confirm that the present study had substantially higher statistical power than BB08.
We analyze our findings with Bayesian Generalized Linear Mixed Models (GLMMs). While frequentist analyses returned the same results, Bayesian analyses are particularly well-suited for the present purpose: Bayesian hypothesis tests provide coherent gradient measures of the strength of evidence for each of the three questions we seek to address (Raftery, 1995; Wagenmakers, 2007). This reduces the temptation to think about findings in categorical terms (significant vs. not; for helpful discussion, see Vasishth, Mertzen, Jäger & Gelman, 2018), a feature that we find particularly helpful in the context of replication. In addition to separate analyses of Experiments 1a and 1b, we present both a Bayesian meta-analysis and a Bayesian replication test for GLMMs (extending Verhagen & Wagenmakers, 2014). The choice of Bayesian analyses has further practical advantages, such as the ability to fit GLMMs with full random effect structures, avoiding the need for ad-hoc recipes that have become the standard in some parts of the field (see discussion in Baayen, Vasishth, Bates & Kliegl, 2016). The ability to model rich random effect structures also lets us assess the generalizability of our findings not only beyond the particular sample of stimuli and subjects, but beyond the particular sample of exposure and test talkers (following a recent call by Yarkoni, 2019).
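For readers unfamiliar with Bayesian hypothesis tests: one simple route to such gradient evidence is the BIC approximation to the Bayes factor described by Wagenmakers (2007). The sketch below is generic and uses made-up BIC values; it is not the GLMM-based analysis reported in the results.

```python
import math

def bf10_from_bic(bic_null, bic_alt):
    # BIC approximation to the Bayes factor in favor of the alternative
    # hypothesis (Wagenmakers, 2007): BF10 ≈ exp((BIC_null - BIC_alt) / 2).
    return math.exp((bic_null - bic_alt) / 2)

# Hypothetical BIC values: the alternative model fits better by 6 points,
# which corresponds to a Bayes factor of about 20 in its favor.
print(round(bf10_from_bic(1000.0, 994.0), 1))  # → 20.1
```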
All analyses as well as additional visualizations and tables are available on OSF (Xie, Liu & Jaeger, 2020) at https://osf.io/brwx5/ (DOI 10.17605/OSF.IO/BRWX5). This includes all data as well as all R code in the form of an R Markdown document (henceforth supplementary information or SI).
Why two replications?
We present two replication experiments of identical design, both addressing Questions 1–3. Experiment 1a was conducted in 2016, and Experiment 1b was conducted in 2018, following feedback on earlier presentations of this work. Specifically, some procedural decisions—most notably, the fact that recruitment for some between-participants conditions was completed before recruitment for other conditions began—made some results of Experiment 1a vulnerable to inflated Type I errors. We thus pre-registered Experiment 1b as an exact replication of Experiment 1a and all of its analyses (OSF https://osf.io/u74vr/register/5771ca429ad5a1020de2872e). Below we present the two experiments side by side and discuss when their results differ from each other or BB08.
Methods
Participants
Participants were recruited on Amazon Mechanical Turk, an online crowdsourcing platform. All participants were self-reported L1 speakers of American English. Participants were paid $1.50 ($6 hourly rate) for the experiment, which was estimated to take about 15 minutes.
Experiments 1a and 1b each recruited 80 successful participants after exclusions in each of the four exposure conditions. This number allowed us to balance exposure conditions, exposure and test talkers, and stimulus order across lists (see Procedure). For Experiment 1a, we include 32 participants each for the control and talker-specific conditions from a pilot experiment that only included those two conditions (reported in Appendix B). All remaining participants were recruited in parallel. Since we found talker-specific adaptation in the pilot experiment (Appendix B), this makes the comparison of the talker-specific against the control condition anti-conservative for Experiment 1a.5 This was one of the motivations for Experiment 1b: all 320 participants in Experiment 1b were recruited in parallel across the four conditions, and no previously collected data was included. All procedures were performed in compliance with the guidelines of the University of Rochester Research Subjects Review Board.
Aggregate demographics.
The demographic distributions for Experiments 1a and 1b were comparable. Of those participants who volunteered demographic information, 42% and 48% reported to be female in Experiments 1a and 1b, respectively (0% declined to report); 8% and 7% reported to be of “Hispanic” ethnicity (0% declined to report); 6% and 6% reported their race as “Black or African American”, 3% and 2% reported as “Asian”, 0% and 2% as “American Indian/Alaska Native”, 5% and 4% as “More than one race”, and 83% and 83% reported as “White” (2% and 3% declined to report). The mean age of participants was 34.4 (SD = 10.0) and 35.0 (SD = 10.0) in Experiments 1a and 1b, respectively (2% and 3% declined to report). Demographic categories were determined by NIH reporting requirements. (We conducted no analyses for effects of demographic variables. Any potential confounds of such variables in our results are, however, avoided through a control predictor introduced below.)
Exclusions.
To achieve the targeted number of 320 participants each, we recruited 343 participants for Experiment 1a (6.7% exclusion rate) and 379 participants for Experiment 1b (15.6% exclusion rate). As we seek to test how participants learn regularities from an unfamiliar accent, we excluded participants from analysis who reported a high degree of familiarity with Chinese or Chinese-accented English in the post-experiment questionnaire. These participants reported that a close family member or friend spoke with a Chinese or similar sounding accent, that they heard that accent all of the time, and/or that they spoke Chinese themselves. Table 2 summarizes the exclusions.
Table 2.
Total number of participants in each exposure condition before and after exclusions. We observe a higher rate of exclusions in Experiment 1b than in 1a because more participants report having familiarity with a similar accent to the one used in our study.
| Exposure condition | Participants analyzed | Participants recruited | Participants excluded |
|---|---|---|---|
| Experiment 1a | | | |
| Control | 80 | 86 | 6 (7.0%) |
| Single-talker | 80 | 86 | 6 (7.0%) |
| Multi-talker | 80 | 90 | 10 (11.1%) |
| Talker-specific | 80 | 81 | 1 (1.2%) |
| Experiment 1b | | | |
| Control | 80 | 94 | 14 (14.9%) |
| Single-talker | 80 | 95 | 15 (15.6%) |
| Multi-talker | 80 | 95 | 15 (15.6%) |
| Talker-specific | 80 | 95 | 15 (15.6%) |
We observe consistently higher rates of exclusions in Experiment 1b than in 1a because more participants reported familiarity with an accent similar to the one used in our study. We do not know the reason for this increase, but note that almost 3 years passed between the end of recruiting for Experiment 1a and the start of recruiting for Experiment 1b.
Materials
The materials from BB08 were not available to us because of the permissions under which they were elicited. Speech recordings were thus taken from Northwestern University’s Archive of L1 and L2 Scripted and Spontaneous Transcripts and Recordings (ALLSSTAR, Bradlow, Ackerman, Burchfield, Hesterberg, Luque & Mok, 2010). ALLSSTAR contains recordings from talkers of various L1 backgrounds performing speech production tasks in English, along with intelligibility ratings for certain talkers. This includes 11 male Mandarin-accented talkers.
As a fully counter-balanced design for all 11 talkers would have required thousands of participants, we sought to select test talkers suitable for our purpose. Based on theoretical considerations, we expected the two cross-talker generalization conditions (single-talker, multi-talker) to yield performance that would fall between the control and talker-specific conditions (this is confirmed by our results). In order to adequately power our experiments, it was thus important to select test talkers for which we could reliably detect talker-specific adaptation—otherwise a failure to detect cross-talker generalization for these test talkers would be uninformative. For example, for test talkers who are too easy to understand, it would be hard to detect benefits of talker-specific adaptation or cross-talker generalization.
We initially selected 6 talkers based on their intelligibility scores (provided by the ALLSSTAR database). Following Bradlow and Bent (2008), we aimed to select test talkers with mid-range intelligibility scores (between 71% and 86% transcription accuracy according to the ALLSSTAR database). Next, we conducted a pilot experiment (reported in Appendix B) to assess which of the six talkers yielded the clearest evidence for talker-specific adaptation (i.e., a benefit of talker-specific exposure compared to control exposure). We then selected the four talkers for which the pilot experiment found significant talker-specific adaptation as the four test talkers for Experiments 1a and 1b. This approach resulted in talker-specific adaptation of similar average magnitude to that in BB08. Specifically, transcription accuracy for the only test talker in BB08 was about 10% higher in the talker-specific condition compared to the control condition (92 vs. 82 RAU). For the four test talkers selected for the present study, transcription accuracy in the pilot experiment’s talker-specific condition was on average 11.75% higher than in the control condition (range: 7.5%−16%; see Figure B3). We note that such non-random selection of test talkers—the standard in the field—can inflate the effect of talker-specific adaptation, compared to randomly selecting test talkers from the population of L2-accented speakers. Here, it serves our purpose of increasing the statistical power for our key questions about the relative benefit of single- and multi-talker exposure (Questions 1–3).
For each of the four test talkers, the five other L2 talkers out of the original set of six were used as exposure talkers in the single-talker and multi-talker conditions (i.e., across participants, all six talkers used in the pilot experiment occurred during the exposure phase of Experiments 1a and 1b). Following BB08, we also selected five male L1-accented talkers of American English to serve as the exposure talkers for the control condition. 120 Hearing in Noise Test (HINT) sentence recordings were available for each of these talkers. All sentence recordings in the database had been leveled to 65 dB. HINT sentences are simple declarative sentences containing 2–4 keywords (e.g., A boy fell from the window), making them similar in complexity and structure to the sentences used in Bradlow and Bent (2008).
We selected two sets of 16 sentences (32 total) from the HINT sentences for each talker to serve as exposure and test stimuli. Following BB08, these sentences were selected to avoid obvious disfluencies or errors (e.g., false starts, incorrect readings) in the recordings for all selected talkers. The two sets of 16 sentences contained a total of 51 and 52 keywords, respectively. One additional sentence not included in either set was selected for use as a practice sentence. This practice sentence was produced by a female L1-accented speaker of American English and served to familiarize participants with the task. Following BB08, all sentence stimuli were mixed with white noise at a +5 dB signal-to-noise ratio to avoid ceiling effects (speech signal: 65 dB; noise: 60 dB).
Procedure
We implemented a novel web-based paradigm that otherwise closely followed the procedure of BB08. Participants were told that they should complete the experiment in a quiet room while wearing headphones. Participants then transcribed two words produced by a male L1 speaker of American English in order to set their volume to a comfortable level, which they were asked to not adjust for the remainder of the experiment. Following that, participants transcribed the L1-accented practice sentence.
The main part of the experiment consisted of an exposure phase and a test phase, during both of which participants listened to and transcribed sentences. During the exposure phase, participants listened to and transcribed 80 sentences, one per trial. This constituted 50% of the exposure in BB08, and occurred over the course of one day rather than two. Participants were asked to transcribe the sentence that they heard to the best of their abilities. Each sentence was played only once. During test, participants transcribed 16 sentences, as in BB08. At the end of the experiment, participants completed a short questionnaire that assessed their audio equipment type and familiarity with Mandarin-accented English. The full questionnaire is provided in Appendix A.
The 80 sentence recordings during the exposure phase consisted of 5 repetitions of the 16 sentences from one of the two sentence sets. Presentation of the exposure stimuli was blocked by repetition of the sentences: participants heard all 16 sentences (k − 1) times before they heard any sentence for the kth time (where k ranged from 1 to 5). Unintended by us, this structure differs from BB08, who blocked the exposure sentences by talker (Bradlow, email communication, 2/6/2017). Available evidence suggests that this departure from BB08’s design is expected to facilitate cross-talker generalization: ordering exposure stimuli in a way that increases trial-to-trial variability across stimuli has been found to lead to increased cross-talker generalization (Tzeng et al., 2016), increasing the statistical power of our design.
In the control and multi-talker conditions, participants heard all five exposure talkers within each bin of 16 sentences (four talkers three times and one talker four times). The order of the talkers was pseudo-randomized such that the same talker never produced two consecutive sentences in the same trial bin. Across all 80 sentence recordings, participants heard all 16 sentences produced by all five talkers so that each sentence-talker combination occurred exactly once. In the talker-specific and single-talker conditions, each of the six Mandarin-accented talkers served as the exposure talker across participants. This means that each participant heard 16 recordings from the same Mandarin-accented talker, each repeated five times. The ordering of the sentences was the same as in the control and multi-talker conditions.
During the test phase, participants were asked to transcribe the other set of 16 sentences, produced by one Mandarin-accented talker. The task during the test phase was identical to the task during the exposure phase. Each of the four Mandarin-accented test talkers served as the test talker equally often (counterbalanced across participants) within each combination of exposure condition and list order. This resulted in 20 participants for each of the four test talkers in each of the four exposure conditions in each of the two experiments.
The fully counter-balanced assignment of exposure and test talkers within and across each of the four exposure conditions is shown in Figure 1. For participants in the control condition, the test talker was the only Mandarin-accented talker they heard in the experiment. For participants in the talker-specific condition, the test talker was the same Mandarin-accented talker as the exposure talker. For participants in the single-talker condition, the test talker was Mandarin-accented but different from the exposure talker. Across participants in the single-talker condition, all four test talkers occurred equally often with each of the remaining five Mandarin-accented talkers as exposure talker. Thus, each of the 20 unique combinations of exposure and test talker was seen by four participants in the single-talker condition (for a total of 80 participants in that condition), equally distributed over the four test talkers (as in all exposure conditions). For participants in the multi-talker condition, the five exposure talkers were the five Mandarin-accented talkers that were not the test talker. By making sure that each exposure talker occurred equally often with each test talker, including across the single- and multi-talker conditions, we remove the design flaw present in BB08.
The assignment of the two sentence sets across exposure and test was counterbalanced across participants: half of all participants heard one sentence set during exposure, and half heard the other sentence set during exposure. One pseudo-random presentation order was created for exposure. This presentation order was reversed to create one additional list for each condition. The same orderings of sentences were heard in the talker-specific and control conditions: only the talkers producing the sentences changed across the two conditions. This resulted in a total of 80 lists in the single-talker condition (5 Mandarin-accented exposure talker pairs × 4 Mandarin-accented test talkers × 2 assignments of sentence sets to exposure vs. test × 2 presentation orders within blocks) and 16 lists each in the multi-talker, talker-specific, and control conditions (4 Mandarin-accented test talkers × 2 assignments of sentence sets to exposure vs. test × 2 presentation orders within blocks), for a total of 128 between-participant lists. One successful participant per list in the single-talker condition, and five participants per list in the other conditions, resulted in the 80 participants per condition for the fully balanced design shown in Table 3.
Analysis approach
We first present an overview of the analysis approach we used to address Questions 1–3 from Table 1. We then present separate analyses of Experiments 1a and 1b. Additional analyses assess the extent to which the results of the two experiments support the same conclusions, and how much results depend on the different test talkers.
Bayesian mixed-effects logistic regression.
We employed Bayesian generalized linear mixed models with a Bernoulli (logit) link, predicting proportion of keywords correct (correct = 1 vs. incorrect = 0) as a function of exposure condition as well as the maximal random effect structure justified by the design: by-participant random intercepts and by-item random intercepts and slopes for condition (for introductions to mixed-effects logistic regression, see Jaeger, 2008; Johnson, 2009). We used sliding difference coding for the exposure condition, an orthogonal coding scheme that compares the talker-specific against the multi-talker condition, the multi-talker against the single-talker condition, and the single-talker against the control condition. We then used Bayesian hypothesis testing to ask our three research questions (for further details and code, see SI, §4–5).
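To make this coding scheme concrete, the following sketch (ours, not the analysis code shared via OSF; condition ordering and means are hypothetical) shows how sliding difference coding maps condition means onto adjacent-condition differences:

```python
import numpy as np

# Hypothetical sketch of sliding (successive) difference coding for a
# four-level exposure factor, ordered here as:
# control, single-talker, multi-talker, talker-specific.
# Each fixed-effect coefficient then estimates the difference between
# two adjacent conditions, matching the comparisons for Questions 1-3.
contrasts = np.array([
    [-3/4, -1/2, -1/4],   # control
    [ 1/4, -1/2, -1/4],   # single-talker
    [ 1/4,  1/2, -1/4],   # multi-talker
    [ 1/4,  1/2,  3/4],   # talker-specific
])

# Hypothetical condition means (in log-odds of a correct transcription).
means = np.array([0.0, 0.30, 0.51, 0.81])

# Solving means = intercept + contrasts @ beta recovers the
# adjacent-condition differences directly.
X = np.column_stack([np.ones(4), contrasts])
intercept, *beta = np.linalg.lstsq(X, means, rcond=None)[0]
# beta[k] equals means[k + 1] - means[k]; intercept is the grand mean.
```

Because the contrast columns sum to zero, the intercept estimates the grand mean over conditions rather than any single reference level.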
Model fitting.
All analyses were fit using the library brms (Bürkner, 2017), which provides a Stan interface for Bayesian generalized multilevel models in R (R Core Team, 2018, version 3.5.2). Stan (Carpenter, Gelman, Hoffman, Lee, Goodrich, Betancourt, Brubaker, Guo, Li, Riddell & others, 2016) allows the efficient implementation of Bayesian data analysis through No-U-Turn Hamiltonian Monte Carlo sampling. We follow common practice and use weakly regularizing priors to facilitate model convergence. For fixed effect parameters, we use Student priors centered around zero with a scale of 2.5 units (following Gelman et al., 2008) and 3 degrees of freedom.6 For random effect standard deviations, we use a Cauchy prior with location 0 and scale 2, and for random effect correlations, we use an uninformative LKJ-Correlation prior with its only parameter set to 1 (Lewandowski, Kurowicka & Joe, 2009), describing a uniform prior over correlation matrices.
To achieve robust estimates of Bayes Factors for the Bayesian hypothesis tests (in particular, the replication test, presented below), each model was fit using eight chains with 10,000 post-warmup samples per chain, for a total of 80,000 posterior samples for each analysis. Each chain used 2,000 warmup samples to calibrate Stan’s No-U-Turn Sampler. All analyses reported here converged (e.g., all R̂ ≈ 1). Bayesian hypothesis testing was conducted via the hypothesis function of brms (Bürkner, 2017).
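For readers unfamiliar with this convergence diagnostic, the following simplified sketch (a basic potential-scale-reduction computation, not Stan's exact split-chain implementation) illustrates why R̂ values near 1 indicate that chains have mixed:

```python
import numpy as np

def r_hat(chains):
    """Basic potential scale reduction factor over an array of shape
    (n_chains, n_samples); values near 1 indicate convergence."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (n - 1) / n * within + between / n     # pooled variance estimate
    return float(np.sqrt(var_hat / within))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(8, 10_000))     # eight well-mixed chains
stuck = mixed + np.arange(8)[:, None]    # chains centered at different values
# r_hat(mixed) is close to 1; r_hat(stuck) is far above 1.
```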
How we report results.
Rather than p-values, we report Bayes Factors and posterior probabilities for each hypothesis. The Bayes Factor (henceforth BF; Jeffreys, 1961; Kass & Raftery, 1995; Wagenmakers, 2007) quantifies the odds of the hypothesis tested (e.g., that the difference between talker-specific and control exposure is > 0) compared to the alternative hypothesis (e.g., that the difference between talker-specific and control exposure is < 0). BFs of 1 to 3 are considered “weak” evidence, BFs > 3 “positive” evidence, BFs > 20 “strong” evidence, and BFs > 150 “very strong” evidence (Raftery, 1995). Bayes Factors provide a coherent measure of support: if the Bayes Factor for a hypothesis is x, then the Bayes Factor against the hypothesis (and for its alternative) is 1/x. A perhaps even more intuitive measure of support is the posterior probability of the hypothesis tested (henceforth pposterior): if we assume that both hypotheses are equally likely a priori, then pposterior = BF / (1 + BF). Posterior probabilities of > .95 can thus be considered the closest equivalent to the conventional significance criterion in null hypothesis significance testing in the psychological sciences (though this interpretation might be seen as not in the Bayesian spirit).
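The conversion between the two measures is a one-liner; the sketch below (ours, for illustration only) shows the correspondence for a few representative BF values:

```python
# Posterior probability of a hypothesis, assuming both hypotheses are
# equally likely a priori: p_posterior = BF / (1 + BF).
def posterior_prob(bf):
    return bf / (1 + bf)

# A BF of 3 ("positive" evidence) corresponds to p_posterior = 0.75;
# a BF of 19 corresponds to p_posterior = 0.95, the closest analogue of
# the conventional significance criterion. The BF against a hypothesis
# is the reciprocal of the BF for it, so the two posterior
# probabilities sum to 1.
p_positive = posterior_prob(3)
p_criterion = posterior_prob(19)
```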
Controlling for individual differences.
All analyses further included an offset term to remove possible confounds due to differences in audio equipment, prior accent experience, task engagement, proficiency with transcription tasks, or other participant-specific differences. For any of these reasons, a participant may achieve higher transcription accuracy independent of the exposure condition. We thus estimated individual differences in performance at the onset of exposure and corrected for those differences in our analysis of the test responses. Specifically, we fit a Bayesian mixed-effects logistic regression to the exposure data from both Experiments 1a and 1b, predicting the proportion of keywords correct during exposure. Predictors included exposure condition, Experiment, trial block, and all interactions. The model further included the maximal random effect structure: random intercepts by participant, by item, and by the current talker, as well as by-item random slopes for exposure condition. The by-talker random intercept captures inter-talker differences in intelligibility, as the stimuli in the first trial bin during exposure can come from different talkers, depending both on the condition and the specific list. Further details, visualization, and the full results of the exposure analysis are presented in the SI (§3). The SI also contains additional analyses that relax the assumption of linear changes in performance across trial blocks during exposure (using monotonic effects instead, §3.3.2). All results presented here replicate under those relaxed assumptions.
We coded trial block as 0 for the first block, so that the random by-participant intercepts reflect differences in performance at the beginning of the experiment (relative to the average performance in the respective exposure condition). These by-participant intercepts were indeed highly predictive of participants’ performance during test (see SI, §3.4): participants who achieved high transcription accuracy during exposure relative to other participants in the same exposure condition also achieved comparatively higher transcription accuracy during test relative to other participants in the same condition (BFs for all exposure conditions > 150, all pposterior > .9999). The magnitude of this effect was highly similar across all four exposure conditions, and across both experiments. This suggests that individual differences in audio equipment, task engagement, or other factors strongly affected performance during test. For the analysis of test responses reported next, we thus include the by-participant random intercepts from the exposure analysis as an offset term (i.e., we set the coefficient for this individual difference term to be the constant 1).7
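Conceptually, the offset enters the test-phase model as a participant-specific shift in log-odds whose coefficient is fixed at 1 rather than estimated. A minimal sketch (hypothetical numbers; the actual analysis estimates these intercepts within the full Bayesian model):

```python
import math

def p_correct(condition_logodds, participant_offset):
    """Predicted probability of a correct transcription: condition effects
    plus a participant-specific adjustment with its coefficient fixed at 1."""
    logit = condition_logodds + 1.0 * participant_offset
    return 1 / (1 + math.exp(-logit))

# A participant who was 0.5 log-odds better than average during exposure
# (e.g., due to better audio equipment) is credited that advantage during
# test, so condition comparisons are not confounded with such differences.
average_participant = p_correct(1.2, 0.0)
strong_participant = p_correct(1.2, 0.5)
```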
Results
The full summaries of the mixed-effects logistic regressions are provided in the SI (§4–5). Figure 2 shows the estimated difference between conditions (in log-odds) for each comparison relevant to Questions 1–3. Table 4 summarizes the corresponding Bayesian hypothesis tests. A positive estimate in Table 4 means that participants in the first exposure condition transcribed more words correctly than participants in the second condition. We illustrate this with the comparison between the talker-specific and control conditions, which estimates the benefit of talker-specific adaptation. In Experiment 1a, for example, the estimated median difference between the talker-specific and the control condition is .81 log-odds. The posterior probability of the hypothesis that the benefit of talker-specific exposure is larger than zero is estimated to be at least 0.9999 for Experiment 1a (and would therefore meet standards well beyond the traditional significance criterion).
Figure 2.
Posterior density estimates (over log-odds of correct transcription) for the comparisons associated with research questions 1–3 for both Experiment 1a (top) and 1b (bottom). For comparison, we also show the effect of talker-specific adaptation. Points represent the median of the estimates, 50% of the estimates fall within the thick lines, and 95% within the thin lines.
Table 4.
Summary of Bayesian hypothesis testing for Questions 1–3 for Experiments 1a and 1b. For comparison, we include the estimates for talker-specific adaptation. CNTL: control exposure; ST: single-talker; MT: multi-talker; TS: talker-specific. The first four columns show the median estimate for each comparison (in log-odds), its estimated standard error, and 95% credible intervals. The evidence ratio (Bayes Factor, BF) in support of each one-sided hypothesis is given in the fifth column. BFs > 1 indicate support for the hypothesis; BFs < 1 indicate support against it. The final column gives the estimated posterior probability of the hypothesis.
Hypothesis | Est. | SE | CIlower | CIupper | BF | pposterior |
---|---|---|---|---|---|---|
Adaptation: TS > CNTL | 0.81 | 0.161 | 0.55 | 1.08 | >79999.0 | >0.9999 * |
Question 1: MT > CNTL | 0.51 | 0.140 | 0.28 | 0.74 | 6665.7 | 0.999 * |
Question 2: ST > CNTL | 0.30 | 0.147 | 0.05 | 0.54 | 41.3 | 0.976 * |
Question 3: MT > ST | 0.21 | 0.137 | −0.01 | 0.44 | 16.3 | 0.942 + |
(a) Experiment 1a

Hypothesis | Est. | SE | CIlower | CIupper | BF | pposterior |
---|---|---|---|---|---|---|
Adaptation: TS > CNTL | 0.48 | 0.149 | 0.24 | 0.73 | 2050.3 | 0.999 * |
Question 1: MT > CNTL | 0.27 | 0.131 | 0.05 | 0.49 | 49.0 | 0.988 * |
Question 2: ST > CNTL | 0.14 | 0.124 | −0.07 | 0.34 | 6.6 | 0.868 |
Question 3: MT > ST | 0.13 | 0.126 | −0.07 | 0.34 | 5.7 | 0.850 |
(b) Experiment 1b
Results that meet traditional thresholds for significance (*) or marginal significance (+) are marked.
There are a number of clear similarities between the two experiments. Most importantly, all critical comparisons show effects in the same direction in both Experiments 1a and 1b. Support for these effects is positive or stronger for all comparisons, and the relative benefits of the three exposure conditions rank consistently across both experiments, with the best test performance after talker-specific exposure, followed by multi-talker exposure and then by single-talker exposure. The relative support for each of the three questions, too, ranks consistently across the two experiments.
Both Experiment 1a and 1b provide very strong evidence of talker-specific adaptation, replicating previous work including BB08. Both Experiment 1a and 1b also provide at least strong evidence for generalization to an unfamiliar L2 talker after exposure to multiple L2 talkers of the same accent (Question 1), again replicating BB08 and other studies showing the benefit of talker-specific or multi-talker exposure (e.g., Baese-Berk et al., 2013; Sidaras et al., 2009; Tzeng et al., 2016). Unlike in BB08, both experiments provide support for generalization after exposure to a single talker of the same accent (Question 2), although the degree of support for this hypothesis differs across the two experiments. Whereas Experiment 1a provides strong support, Experiment 1b only provides positive support for the hypothesis that cross-talker generalization is possible after exposure to a single talker. Finally, both experiments provide positive support for the hypothesis that multi-talker exposure facilitates cross-talker generalization beyond single-talker exposure (Question 3). This qualitatively replicates BB08, but unlike BB08 the evidence from Experiments 1a and 1b does not reach the threshold of the traditional significance criterion.
Strikingly, all differences between conditions were smaller in Experiment 1b compared to Experiment 1a. Next, we discuss possible reasons for this difference and ask what can be concluded from our data.
What can be concluded from Experiments 1a and 1b together?
We begin with a visual comparison of Experiments 1a and 1b, and then present additional Bayesian analyses that assess the extent to which the two experiments support the same conclusions. We recognize that it would be preferable to assess the degree of replication between the present experiments and the original study (BB08). However, only aggregate statistics are available from the original study, which was published before reproducibility became a more broadly recognized standard.
Figure 3 shows by-participant transcription accuracy during test side by side for both experiments (see SI, §6, for the same plots in empirical logits). This reveals a striking difference between Experiments 1a and 1b: transcription accuracy in the control condition of Experiment 1b was much higher than in the control condition of Experiment 1a. That is, the two experiments differ markedly in how well participants performed who did not receive exposure to the L2 accent within the experiment.
Figure 3.
Transcription accuracy during test following different exposure conditions (Experiments 1a and 1b). Small points show by-participant means. Solid larger points show averages across the by-participant means, and error bars represent 95% confidence intervals bootstrapped over by-participant means.
We can only speculate as to the causes of this difference. One possibility is that, by chance, the participants recruited for the control condition in Experiment 1a had lower performance than would have been representative of the population they were recruited from. Another possibility is that the participants in both Experiment 1a and Experiment 1b were representative of the population we were recruiting from at that moment in time. Experiments 1a and 1b were conducted almost three years apart, between 2015 and 2018. Although we excluded participants who reported prior familiarity with Chinese or Chinese-accented English, it is possible that overall exposure to Chinese or Chinese-accented English in the population we recruited from increased enough during that time to explain the overall increase in performance in the control condition.8
Regardless of the reasons, higher baseline performance in the control condition of Experiment 1b reduces the power to detect increases in performance in the other conditions (for a demonstration, see Dixon, 2008): the closer the average performance in the baseline condition is to ceiling, the harder it is to detect facilitation. For the multi-talker and single-talker conditions—which are expected to elicit equal or poorer performance than the talker-specific condition—ceiling performance is defined by the talker-specific condition (about 90% in both Experiments 1a and 1b). For Experiment 1b, this means that there was little room for improvement above the control condition (86.7% accuracy). Indeed, the estimated effect of talker-specific adaptation in Experiment 1b is about half the size of that in Experiment 1a (see Table 4). Figure 3 suggests that this is primarily, though perhaps not completely, due to increased performance in the control condition.
What then should we conclude from Experiments 1a and 1b together? Bayesian data analysis provides us with two tools to address this question. The first analysis treats the two experiments as exchangeable and assesses whether each effect is present or not using the pooled data from Experiments 1a and 1b. The second analysis gives up the (in this case, questionable) assumption of exchangeability, and instead asks whether the effect sizes found in Experiment 1a are replicated in Experiment 1b. The two analyses thus address related, but different, questions. Together they delineate what we can conclude from Experiments 1a and 1b.
Meta-analysis Bayes Factor.
When the data from both experiments are combined and analyzed with the exact same approach as in the previous section (SI, §6.1), they provide strong evidence for affirmative answers to all of Questions 1–3. The relative sizes of the effects are ordered in the same way as in the separate analyses of Experiments 1a and 1b. Specifically, the two experiments together provide very strong support both for talker-specific adaptation (BF > 79999.0, pposterior > 0.9999) and for cross-talker generalization after multi-talker exposure (Question 1: BF > 79999.0, pposterior > 0.9999). Support for cross-talker generalization after single-talker exposure is strong (Question 2: BF = 109.0, pposterior > 0.991), as is support for the advantage of multi-talker over single-talker exposure (Question 3: BF = 45.6, pposterior = 0.979). As in the individual analyses of Experiments 1a and 1b, support for the difference between single- and multi-talker exposure (Question 3) remains the weakest of the effects investigated here (though still above the traditional significance criterion).
The meta-analysis takes advantage of all available data, which helps to alleviate issues with the increased baseline performance in the control condition of Experiment 1b. However, exchangeability is a potentially problematic assumption. Recall, for example, that we initially recruited participants for just the control and talker-specific conditions in Experiment 1a (to ascertain that the paradigm was able to detect talker-specific adaptation). Experiment 1b instead used fully random assignment of participants to any of the four exposure conditions. For the present purpose, the meta-analysis approach thus likely provides an inflated estimate of the overall available evidence (in particular, with regard to comparisons involving the talker-specific condition). This motivates the next analysis.
Replication Bayes Factor.
The Bayesian replication test we present assesses whether Experiment 1b is more likely to reflect a replication of the effect sizes observed in Experiment 1a or a null effect. This is a particularly stringent test: even when the effect of the replication study (here Experiment 1b) goes into the same direction as the original study (Experiment 1a), this does not necessarily mean that the two studies support the same effect size, compared to the alternative hypothesis of a null effect.
Our approach follows Verhagen & Wagenmakers (2014). Verhagen and Wagenmakers (p. 1461) describe a Bayesian replication test that pits the hypothesis that the effect observed in a replication experiment constitutes a replication of the effect observed in an original experiment against the skeptic’s hypothesis of a null effect. The test calculates the Bayes Factor of the replication hypothesis over the null hypothesis:
(1)  BFHrepH0 = p(Yrep | Hrep) / p(Yrep | H0) = ∫ p(Yrep | δ) p(δ | Yorig) dδ / p(Yrep | δ = 0)
where Yrep is the vector of responses in the replication experiment, Yorig is the original data, and δ is the size of the effect for which we seek to test replication. The numerator in Equation 1 describes the marginal likelihood of the replication data given prior beliefs (i.e., given prior uncertainty) about the effect size after having observed the original data. That is, we use the original data to estimate the posterior probability of different effect sizes, which then constitutes our estimate of the prior probability of different effect sizes under the proponent’s hypothesis of replication. We then estimate the likelihood of the replication data under the different effect sizes, each weighted by its relative prior probability. The denominator in Equation 1 is the likelihood of the replication data if the effect is zero.
Verhagen & Wagenmakers (2014) illustrate this approach for t-tests. We apply it to generalized linear mixed models (here, with a Bernoulli response distribution, though our code readily extends to GLMMs with other link functions). As described in Appendix D, the approach captures uncertainty about any of the other predictors in the model when assessing replication success for a particular predictor, and provides a principled quantitative measure of the support for the replication hypothesis. Appendix E summarizes simulations validating the test.
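Under normal approximations to the posterior and the likelihood, the logic of Equation 1 reduces to a few lines. The sketch below uses hypothetical effect sizes and standard errors (not the paper's GLMM posteriors, for which the integral is evaluated over posterior samples rather than in closed form):

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Posterior for the effect size delta after the "original" experiment,
# approximated as Normal(m, s^2) (hypothetical values).
m, s = 0.81, 0.16
# Observed effect and standard error in the "replication" (hypothetical).
d_rep, se_rep = 0.48, 0.15

# Numerator: marginal likelihood of the replication effect, integrating
# the likelihood over the posterior of delta. For normals this integral
# is available in closed form: Normal(d_rep; m, sqrt(s^2 + se_rep^2)).
numerator = normal_pdf(d_rep, m, math.sqrt(s**2 + se_rep**2))
# Denominator: likelihood of the replication effect under delta = 0.
denominator = normal_pdf(d_rep, 0.0, se_rep)
bf_rep0 = numerator / denominator   # > 1 favors the replication hypothesis
```

With these (hypothetical) numbers, the replication data are several dozen times more likely under the replication hypothesis than under the null.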
We use this test to ask whether Experiment 1b replicates the specific answers to Questions 1–3 provided by Experiment 1a. The results are shown in Figure 5. With a replication Bayes factor (BFHrepH0) of 584.4, there is very strong evidence for talker-specific adaptation effects (talker-specific exposure vs. control) across both Experiments 1a and 1b (i.e., pposterior > 0.998). The Bayes Factor suggests that it is over 500 times more likely to observe the data of Experiment 1b under the replication hypothesis than under the hypothesis of a null effect.
Figure 5.
Results of the replication test assessing whether the effects observed in Experiment 1b are more likely to result under the hypothesis of replication vs. the hypothesis of a null effect. Panels on the right zoom in on the region around δ = 0. The colored vertical lines indicate the ordinates of the prior and posterior at the skeptic’s null hypothesis that the effect size is zero. The Bayes Factor for the replication test, BFHrepH0, is identical to the ordinate of the prior divided by the ordinate of the posterior (Verhagen & Wagenmakers, 2014). A BFHrepH0 > 1 thus indicates how much less likely the null hypothesis is after seeing the data from Experiment 1b, compared to before. Note that the y-axis limits (not shown) vary across the rows of the plot, so as to facilitate comparison of the prior and posterior within each row.
There is strong support for a replication with regard to Question 1 (multi-talker exposure vs. control, BFHrepH0 = 24.5, pposterior = 0.96). This result confirms that Experiment 1a and 1b consistently yield generalization after multi-talker exposure. For Questions 2 and 3, the evidence is non-decisive. For Question 2, there is positive evidence for a replication of cross-talker generalization following single-talker exposure (BFHrepH0 = 4.4, pposterior = 0.81). For Question 3, there is only weak evidence for the hypothesis that Experiment 1b replicates Experiment 1a (BFHrepH0 = 1.6, pposterior = 0.62).
Both the meta-analysis and the replication test agree that all of Questions 1–3 have affirmative answers. At first blush, it might be surprising that the replication test does not return more convincing evidence of replication for Questions 2–3 despite a total of 640 participants. However, it is important to keep in mind that the replication test, too, is affected by the high baseline performance in Experiment 1b. Recall that the statistical power to detect facilitation decreases as baseline performance approaches ceiling. Similarly, the ability to find decisive evidence that Experiment 1b replicates Experiment 1a also decreases as baseline performance in Experiment 1b approaches ceiling. This is confirmed by computational analyses presented in Appendix E. Put differently, the replication test has low sensitivity for data like that of Experiment 1b.
Altogether, our results suggest that cross-talker generalization can occur after both single- and multi-talker exposure, while also suggesting that the specific sizes of these effects vary substantially across participants (see also Figure 3). This is a clear signal that the field should maintain uncertainty about the effects of exposure on cross-talker generalization—uncertainty that is often lost when data are summarized in terms of significances (e.g., Kruschke, 2014; Gelman, Carlin, Stern, Dunson, Vehtari & Rubin, 2013; Vasishth et al., 2018; Wagenmakers, 2007). To further highlight the consequences of this uncertainty, we explore our results across the four test talkers.
To what extent do the results depend on the test talker?
Unlike BB08 (and most work on speech perception), Experiments 1a and 1b employed multiple test talkers. Following suggestions by reviewers, we present post-hoc comparisons of exposure effects for these different talkers. We emphasize, however, that even 320 participants per experiment are likely not enough to yield reliable results at the by-talker level. Only 20 participants per experiment were available for each unique combination of exposure condition and test talker—barely meeting the suggested minimum for between-participant tests (Simmons et al., 2011). The results presented here should thus be interpreted with caution.9
To investigate effects by test talker, we refit the analyses reported above, including random effects by test talker (intercepts and slopes for exposure condition) in addition to the random effects by participants and items already included in the analysis. Compared to the alternative approach of separate analyses by test talker, the approach taken here reduces, but does not eliminate, the risk of over-fitting (which would exaggerate differences between talkers). The random effects by test talker act as a regularizing prior that ‘pulls’ estimates for individual test talkers towards the mean of the by-talker means (sometimes referred to as “shrinkage”; see, e.g., Kliegl, Masson & Richter, 2010, Figure 5 for visualization). Conceptually, this regularization implements a form of Occam’s razor.
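The effect of this regularization can be illustrated with a toy one-level example (all by-talker estimates and variance components below are hypothetical, not fit to our data):

```python
import numpy as np

# Hypothetical raw by-talker effect estimates (in log-odds) and assumed
# between-talker (tau2) and estimation-error (sigma2) variances.
talker_effects = np.array([0.70, 0.55, 0.20, 0.45])
grand_mean = talker_effects.mean()
tau2, sigma2 = 0.02, 0.04

# Pooling factor: 0 = complete pooling, 1 = no pooling. A small tau2
# relative to sigma2 means more shrinkage towards the grand mean.
w = tau2 / (tau2 + sigma2)
shrunk = grand_mean + w * (talker_effects - grand_mean)
# Every shrunk estimate lies closer to the grand mean than its raw
# counterpart, while the ordering of talkers is preserved.
```

In the full mixed model the amount of pooling is estimated from the data rather than assumed, but the qualitative behavior is the same: apparent by-talker differences are attenuated unless the data strongly support them.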
Figure 6 summarizes the posterior distribution of effect sizes for Questions 1–3 obtained from the updated analysis, separately by test talker. The full model summaries are reported in the SI (§7.4). Overall, Experiments 1a and 1b continue to show a remarkable degree of agreement even at the level of individual test talkers. As can be seen by comparing the first three rows of each panel, the four exposure conditions rank identically across all four test talkers in both experiments: talker-specific exposure always provides at least as much benefit as multi-talker exposure, which provides at least as much benefit as single-talker exposure, which provides at least as much benefit as control exposure (all median estimates for Question 2 are larger than zero). Bayesian hypothesis tests confirm that for both experiments, support for effects of exposure was strongest for test talkers 032 and 035, and weakest for test talker 043 (see BFs in Figure 6; for details, see SI §7.1). Additionally, the support for an affirmative answer to the three research questions ranks consistently across all four test talkers and both experiments, with one exception (the relative ordering of Questions 2 and 3 for test talker 037 in Experiment 1b).
Figure 6.
Same as Figure 2 but split by test talker, based on an analysis that additionally contained random effects by test talker. Posterior density estimates for the comparisons associated with research questions 1–3. Points represent the median of the estimates, 50% of the estimates fall within the thick lines, and 95% within the thin lines. BFs quantify the support for the hypothesis that the effect is larger than zero.
Talker-specific adaptation is the only effect for which there was at least strong support (BFs ≥ 34.7, pposteriors ≥ 0.972) for all four test talkers in both experiments. Indeed, talker-specific adaptation is the only effect that receives at least strong support for any test talker in Experiment 1b. Cross-talker generalization after multi-talker exposure receives strong support for three test talkers in Experiment 1a (Question 1: BFs ≥ 20.7, pposteriors ≥ 0.954); this support is positive for the remaining test talker in Experiment 1a (Talker 043, BF > 14.0, pposterior = 0.93) and all test talkers in Experiment 1b (BFs ≥ 5.8, pposteriors ≥ 0.853). Support for cross-talker generalization after single-talker exposure is strong for two test talkers in Experiment 1a (Question 2: BFs ≥ 36.7, pposteriors ≥ 0.974); this support is positive for the remaining two test talkers in Experiment 1a and all test talkers in Experiment 1b (BFs ≥ 4.2, pposteriors ≥ 0.809), except for Talker 037, for which support was weak (BF > 1.4, pposterior = 0.575). Finally, support for the hypothesis that multi-talker exposure facilitates generalization beyond single-talker exposure is not strong for any test talker in either experiment (all BFs < 20). The strongest support comes from Talker 037 in Experiment 1b (BF > 17.8, pposterior = 0.947). A look at Figure 6 reveals that this is caused by the very small estimate for the effect of single-talker exposure for this test talker in Experiment 1b.10
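The directional Bayes factors and posterior probabilities reported here can be related deterministically when the two directions of an effect have equal prior probability (BF = p / (1 − p)). A minimal Python sketch of this conversion from posterior samples, using invented numbers rather than the actual model posteriors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples for one exposure effect (log-odds scale);
# in the paper these would come from the fitted Bayesian GLMM instead.
samples = rng.normal(loc=0.25, scale=0.15, size=40_000)

# Posterior probability that the effect is larger than zero ...
p_posterior = (samples > 0).mean()
# ... and the corresponding directional Bayes factor, assuming equal
# prior probability for a positive vs. a negative effect.
bf = p_posterior / (1 - p_posterior)
print(f"p_posterior = {p_posterior:.3f}, BF = {bf:.1f}")
```

This makes explicit why a BF of, say, 14 corresponds to a posterior probability of roughly 0.93: the BF is simply the posterior odds in favor of a positive effect.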
The four test talkers in Experiments 1a and 1b were selected to have similar baseline intelligibility and large effects of talker-specific adaptation (in the pilot experiment reported in Appendix B). It is thus encouraging, but not entirely surprising, that we find a fair amount of agreement across test talkers—with just 20 participants per test talker. However, even for this rather homogeneous set of test talkers, it is clear that we might have arrived at different conclusions if Experiments 1a and 1b had employed only one test talker. For example, there is little evidence for any difference between single- and multi-talker exposure in Talker 043 in either experiment (see Figure 6). This contrasts with Talker 035, for which both experiments find a consistent advantage of multi-talker over single-talker exposure. Similarly, we could have gotten ‘unlucky’ by solely relying on test talker 037, for which we find less consistent results across the two experiments—an inconsistency that, as it turns out, is due to substantially larger by-participant variability for this talker in both experiments (see footnote 10). In the Bayesian analyses presented here, these differences between test talkers show up as varying degrees of support (the Bayes Factors). Under common practices of null hypothesis significance testing, however, the same differences could have resulted in seemingly qualitative differences in significance (see also discussion of the “significance filter”, Vasishth et al., 2018).
General Discussion
We set out to replicate key findings of an influential study on cross-talker generalization during speech perception (Bradlow & Bent, 2008). This seminal study by Bradlow and Bent established listeners’ ability to generalize previously experienced speech input across talkers of the same accent, while also suggesting clear limits to this ability. Bradlow and Bent found that exposure to multiple L2 talkers facilitated comprehension for L2-accented speech from a different talker of the same accent. Participants’ transcription accuracy on the unfamiliar test talker after multi-talker exposure was, in fact, indistinguishable from their performance after the same amount of exposure to that specific test talker (talker specific adaptation). This ability to generalize to an unfamiliar talker after multi-talker exposure stands in stark contrast to BB08’s findings for single-talker exposure. After exposure to the same total number of trials from just a single L2 talker, BB08 did not observe any facilitation for the unfamiliar test talker. Bradlow and Bent interpreted these results as evidence that listeners can learn talker-independent representations about an L2 accent, provided they are exposed to sufficient variability (as in the multi-talker condition). Both in its categorical form (multi-talker exposure as a requirement for generalization) and in more gradient interpretations (multi-talker exposure as an additional benefit beyond single-talker exposure), this hypothesis has been influential within research on speech perception (e.g., Kleinschmidt & Jaeger 2015; Baese-Berk 2018) and beyond (e.g., Schmale, Seidl & Cristia 2015; Potter & Saffran 2017; Paquette-Smith, Cooper & Johnson 2020).
The design of the original study did, however, contain a critical confound: while all between-participant conditions employed the same L2 test talker, the single- and multi-talker conditions employed different L2 exposure talkers. As a consequence of this confound, there is an alternative explanation for the lack of cross-talker generalization after single-talker exposure in BB08. As we detail below, this alternative explanation focuses on the objective similarity of talkers’ speech—specifically with regards to the talker-specific mappings from linguistic categories (like phonological segments or words) onto the speech signal—and thus appeals to different mechanisms and theoretical constructs than the explanation advanced in BB08. Under this alternative explanation, the lack of a cross-talker generalization in BB08’s single-talker condition is a consequence of the particular talkers that were employed in that condition, and the fact that they differed from the talkers in the multi-talker condition. Finally, further complicating the interpretation of BB08’s finding is the fact that the study employed a small number of participants (10 per condition), following the standards of the field at that time.
The present large-scale replications with a total of 640 participants aimed to closely follow BB08’s paradigm while avoiding the confound of the original study. Across the participants in our replications, the single- and multi-talker conditions employed the exact same exposure talkers. We further aimed to assess the replicability of the key findings across a variety of combinations of exposure and test talkers. Experiments 1a and 1b each employed four different test talkers, with fully balanced designs within each test talker, including 20 unique combinations of exposure and test talkers in the single-talker condition.
Replicating both BB08 and other studies across a variety of paradigms, we find clear support that exposure to a few dozen sentences of L2-accented speech facilitates subsequent comprehension of speech from the same talker (talker-specific adaptation to an L2 talker, e.g., Clarke & Garrett 2004; Weil 2001; Xie et al. 2018; Paquette-Smith et al. 2020; Gordon-Salant, Yeni-Komshian, Fitzgibbons & Schurman 2010; for review, see Baese-Berk 2018). This result received strong support across all four test talkers in both Experiment 1a and 1b. With regard to our three research questions about cross-talker generalization, we replicate BB08 and find clear support for cross-talker generalization after exposure to multiple talkers of the same L2 accent (Question 1; see also Alexander & Nygaard 2019; Sidaras et al. 2009; Tzeng et al. 2016). We replicate this result for all four test talkers in both Experiment 1a and 1b, though the strength of the support for this finding varied across test talkers and experiments (Experiment 1a vs. 1b). Unlike BB08, we also find support for cross-talker generalization after exposure to a single talker (Question 2). We find positive evidence of cross-talker generalization across all four test talkers though the strength of the support again varied across test talkers and experiments. As we discuss below, the degree of cross-talker generalization seems to depend on the specific combination of exposure and test talker. Overall, support for this finding was weaker than for Question 1. Finally, also unlike BB08, we find less clear evidence that multi-talker exposure facilitates cross-talker generalization beyond exposure to a single talker (Question 3). While multi-talker exposure led to better transcription accuracy than single-talker exposure for all four test talkers in both Experiment 1a and 1b, these effects were often very small.
Both differences between the present replications and the original BB08 study relate to participants’ performance after single-talker exposure. In the remainder of this discussion, we focus on this difference and its relevance for theories of speech perception. We first discuss procedural differences between our replications and BB08 (other than the intended difference in design). We conclude that procedural differences are unlikely to account for the difference in results. We then return to the most likely explanation for the different findings: the removal of the design confound coupled with increased statistical power. We discuss findings since BB08’s seminal study that speak to the mechanisms underlying cross-talker generalization—both from research on the perception of L2-accented speech and other lines of work (e.g., Kraljic & Samuel, 2006; Reinisch & Holt, 2014; Xie & Myers, 2017; Xie et al., 2018). Drawing on these works, we discuss an account in terms of similarity-based generalization, and contrast it with the previous focus on variability as the explanatory variable in generalization. We lay out how similarity-based generalization can explain both BB08’s failure to find cross-talker generalization after single-talker exposure and the results of our replication experiments. The same account also offers explanations for the relative performance in the talker-specific vs. multi-talker conditions (identical in BB08, but different in our replications), and the differences between the 20 unique exposure-test talker combinations in the single-talker condition in our experiments (which we visualize below). Finally, the proposed interpretation of our results has consequences for the theoretical role of variability in learning, training, teaching, and therapy. We thus close with a brief discussion of why variability during learning often seems to facilitate generalization beyond the specific training.
Procedural differences between original and replication studies
Experiments 1a and 1b follow the general design and procedure of BB08 very closely. Like BB08, we used an exposure-test paradigm with four exposure conditions manipulated between participants. We used the same target language (American English) and the L2-accented talkers had the same L1 background (L1 Mandarin). The task on each trial was the same as in BB08 (transcription), and the stimuli were of similar complexity (short sentences of 3–4 content words with uninformative context), using the same level of noise-masking. Like BB08, we held the test talker(s) constant across exposure conditions, and used the same number of test tokens. Like BB08, all exposure and test talkers were male.
One difference from BB08 is that our replications were conducted over the web. This decision was motivated by practical, rather than theoretical, considerations, in order to facilitate the recruitment of a large number of participants (see also Liu & Jaeger, 2018, 2020; Harrington Stack, James & Watson, 2018). However, web-based recruitment entails i) a more heterogeneous participant group, ii) less control over the way participants approach the task, and iii) more variability in auditory equipment across participants. The first two properties are not necessarily undesirable. For example, participants might act more naturally when they do not feel observed, and results from more heterogeneous participant groups are more likely to generalize to a broader population. That said, all three properties of web-based experiments can in theory inject additional variability into participants’ responses, reducing statistical power. In the present case, this is unlikely to be of concern, as the vastly larger number of participants we recruited is more than likely to make up for this potential loss in power (see, e.g., Germine, Nakayama, Duchaine, Chabris, Chatterjee & Wilmer, 2012; Hilbig, 2016, for evidence that web-based and lab-based perceptual experiments yield comparable performance variability). And, critically, any hypothetical reduction in power would not explain why we do find evidence for cross-talker generalization after single-talker exposure, whereas BB08 did not.
The second procedural difference between the replications and BB08 pertains to the presentation order during exposure. For the control and multi-talker conditions, BB08 blocked stimuli by talker (first all 16 recordings from talker 1, then all recordings from talker 2, etc.). In Experiments 1a and 1b, by contrast, each block consisted of 3–4 sentences from each of the five exposure talkers, for a total of 16 sentence recordings per block. Across all blocks, all 16 recordings from each of the five talkers were played. This difference in stimulus order was unintended and did not serve any practical or theoretical purpose. There is some evidence that this interleaved exposure structure facilitates generalization across talkers for multi-talker exposure (Tzeng et al., 2016), in line with similar evidence from motor (Knock, Ballard, Robin & Schmidt, 2000) and visual category learning (Carvalho & Goldstone, 2014).11 It is therefore possible that the effects of multi-talker exposure were enhanced in Experiments 1a and 1b, compared to BB08. However, for the single-talker and talker-specific conditions, the original and replication studies did not differ in their exposure order: in both studies, all 16 recordings were played within each block, which was then repeated five times. Therefore, differences in stimulus order cannot explain why we observed cross-talker generalization after single-talker exposure.
A third and final difference between our replications and the original study is the amount of exposure. BB08 assessed generalization following two days of exposure, with twice the total number of exposure sentences compared to our replications (see also Baese-Berk et al., 2013). Theoretically, it is possible that listeners initially exhibit single-talker generalization, and that this ability to generalize decreases with extended exposure or over time. Two previous findings suggest that this is an unlikely explanation for our finding. First, there is evidence that sleep-mediated consolidation facilitates generalization following single-talker exposure. Xie et al. (2018) exposed listeners to Mandarin-accented English of one talker, and then tested categorization on another talker of the same accent both immediately after exposure and a day later. Between participants, two different Mandarin-accented test talkers were used. Immediately after exposure, cross-talker generalization was only found for one of the two test talkers. One day after exposure, cross-talker generalization was found for both talkers. This suggests that sleep consolidation can facilitate generalization, but is not necessary for cross-talker generalization (for related results from talker-specific adaptation to L1-accented talkers, see Eisner & McQueen 2006). Second, an unpublished study from Weil’s thesis found cross-talker generalization to a talker of the same accent after three separate exposure sessions to a single Marathi-accented talker of English distributed across three days (Weil, 2001). These results suggest that multi-day exposure and intermittent sleep promote, rather than interfere with, cross-talker generalization. If anything, we would thus expect even stronger evidence for an effect of single-talker exposure had we followed BB08 and assessed generalization after two days of exposure. This is also expected based on the hypothesis—proposed by Bradlow and Bent (BB08, p. 723) to explain the results of Weil (2001)—that longer exposure to a single talker might somehow allow listeners to distinguish between idiosyncratic and accent-specific aspects of the talker’s speech, facilitating cross-talker generalization. This theoretical proposal by BB08 was also the very reason we decided on a single-session replication. The fact that we observe single-talker generalization after exposure to just five repetitions of the same 16 recordings highlights that cross-talker generalization can occur after very little evidence of within-talker variability and zero evidence of cross-talker variability (see also Xie et al. 2018 with 30 critical words of exposure; Xie et al. 2018 with 18 sentences of exposure).
In short, none of the procedural differences between our replications and BB08 explain why we do observe cross-talker generalization after single-talker exposure, whereas BB08 did not. Rather, as we discuss next, the primary reason for the key difference in results is likely the removal of the design confound present in BB08.
Theoretical considerations about the mechanism underlying cross-talker generalization
The remainder of this section considers a theoretical perspective that offers an explanation for both the present results and those of BB08. This perspective is inspired by—and empirically supported by—a number of previous works (in particular, Alexander & Nygaard, 2019; Kraljic & Samuel, 2006; Eisner & McQueen, 2005; Reinisch & Holt, 2014; Xie & Myers, 2017), and makes testable predictions for future work. We start with the common assumption that listeners store distributional representations of previously experienced input, and draw on them during the categorization of subsequent input. This assumption is shared by most contemporary accounts of speech perception and spoken word recognition, including Bayesian (Clayards, Tanenhaus, Aslin & Jacobs, 2008; Feldman, Griffiths & Morgan, 2009; Norris & McQueen, 2008; Kleinschmidt & Jaeger, 2015), episodic (Goldinger, 1996, 1998), exemplar (Hay, 2018; Johnson, 1997; Pierrehumbert, 2002), Fuzzy Logical (Oden & Massaro, 1978) and other accounts (Luce & Pisoni, 1998; Lancia & Winter, 2013). In all of these accounts, speech is recognized based on the statistical mapping from linguistic categories (e.g., allophones, phones, or words) onto phonetic cues (e.g., the voice onset time of a stop sound) in previously experienced speech similar to the current input. Here we argue that this process is sufficient to explain existing findings about cross-talker generalization, without any appeal to additional theoretical constructs, such as talker-independent representations or a special role of high-variability input (unlike in Bradlow & Bent 2008).
Under the proposed account, the mechanisms underlying cross-talker generalization are the same as those that underlie talker-specific adaptation. Just as the benefits of talker-specific exposure derive from the similarity between the previously experienced input and the subsequent input during test, cross-talker generalization is predicted to be successful—and thus lead to increased comprehension accuracy during test—when the relevant phonetic distributions inferred from exposure are similar to those during test. By “relevant phonetic distributions” we mean the mapping from linguistic categories to phonetic cues (category-cue mappings), specifically for those categories that are observed during test. Evidence for the relevant category-cue mappings might be observed directly during exposure. For example, exposure to repeated instances of a category with atypical cue realization (e.g., /b/s with atypically long or atypically variable voice onset times) can affect how listeners interpret that cue during test (e.g., Clayards et al., 2008; Kraljic & Samuel, 2006; Munson, 2011; Theodore & Monto, 2019). Evidence for the relevant distribution might also be inferred based on implicit knowledge listeners have about the correlational structure of cues across categories (see Chodroff, Golden & Wilson, 2019; Idemaru & Holt, 2011; Kleinschmidt & Jaeger, 2015, Part II). For example, talkers who produce long VOTs for /p/s tend to produce long VOTs for /t/s (e.g., Chodroff & Wilson, 2017; Chodroff et al., 2019), and F0 tends to be higher in voiceless stops than in voiced stops (references in Idemaru & Holt, 2011). Whether through direct observation or inference, exposure is expected to facilitate comprehension during test only if the category-cue mappings during exposure lead to sufficiently correct expectations about the actual category-cue mappings during test.
In other words, exposure is helpful to the extent that it suggests category-cue mappings that are at least similar to the actual category-cue mappings during test. We will refer to this as similarity-based generalization.
When is exposure to a single talker expected to result in successful cross-talker generalization?
Large-scale tests of similarity-based generalization are so far lacking. There are, however, a number of results that lend plausibility to this hypothesis. One line of research has investigated the role of talker-to-talker similarity in atypical L1 speech (e.g., Eisner & McQueen, 2005; Kraljic & Samuel, 2006, 2007). In these works, listeners are exposed to one or more talkers with atypical realizations of a specific sound (e.g., an /f/ shifted somewhat towards the typical pronunciation of an /s/ in Eisner & McQueen 2005). During a subsequent test, listeners categorize sounds along a phonetic continuum (e.g., ranging from /f/ to /s/ in Eisner & McQueen 2005). If exposure and test employ the same talker, this type of paradigm elicits what is known as perceptual recalibration: after exposure to atypical /f/, for example, more sounds along the /f/-/s/ continuum will be categorized as /f/ (e.g., Babel, McAuliffe, Norton, Senior & Vaughn, 2019; Eisner & McQueen, 2006; Kraljic & Samuel, 2005; Drouin, Theodore & Myers, 2016; Norris et al., 2003; Vroomen, van Linden, De Gelder & Bertelson, 2007). If the exposure and test talker differ, however, such perceptual recalibration does not always carry over to the test talker. Rather, it seems that the phonetic similarity between the exposure and test talker determines whether perceptual recalibration is observed during the test (Kraljic & Samuel, 2006, 2007; Reinisch & Holt, 2014). Some evidence further suggests that it is specifically the similarity with regard to the phonetic realization of the tested category, rather than whether the speakers are perceived to be, for example, of the same gender (Eisner & McQueen, 2005; Reinisch & Holt, 2014; Van der Zande, Jesse & Cutler, 2014).
More recent work has demonstrated that similar effects are at work during the comprehension of L2-accented speech (Alexander & Nygaard, 2019; Reinisch & Holt, 2014; Xie & Myers, 2017). Reinisch and Holt found that perceptual recalibration to fricatives was observed even when the shifted sound was embedded in globally L2-accented speech, and this perceptual recalibration carried over to an unfamiliar talker only if the exposure and test talkers were sufficiently similar in their productions of the relevant sound contrast. Xie & Myers (2017) forewent the use of artificially manipulated speech altogether. Listeners were exposed to Mandarin-accented English that either did or did not contain instances of (naturally produced L2-accented) syllable-final /d/. When listeners’ perception of the /d/-/t/ continuum was tested on speech from the same talker, exposure to the L2-accented /d/ facilitated accurate comprehension during test, compared to the control exposure without /d/ (see also Eisner, Melinger & Weber, 2013; Xie, Theodore & Myers, 2017). When listeners were tested on speech from another Mandarin-accented talker, however, the benefit of /d/ exposure did not always carry over to the test talker. Of the two test talkers employed by Xie & Myers (2017), a benefit was found only for the test talker who produced /d/ with similar phonetic cue distributions to the exposure talker.
Taken together, these findings indicate that talker-to-talker generalization of individual phonetic categories depends on how similar the exposure and test talkers are with respect to relevant category-cue mappings (see also Alexander & Nygaard, 2019). The idea that phonetic similarity during exposure affects generalization between learned stimuli and novel stimuli has received supportive evidence in other work on speech sound learning more generally (e.g., Cristia, Mielke, Daland & Peperkamp, 2013; Mitterer, Scharenborg & McQueen, 2013; Reinisch et al., 2014; Reinisch & Mitterer, 2016). Of note, like in the present experiments, adaptation and generalization effects occur rapidly in these studies–typically within a single experimental session. These findings leave open whether similarity-based generalization is sufficient to explain all aspects of cross-talker generalization after exposure to naturally produced L2 speech. An account along these lines can, however, at least qualitatively account for both BB08 and our replications, as we outline next.
For paradigms like BB08 and our replications, similarity-based generalization predicts successful cross-talker generalization when the talker(s) during exposure exhibit similar mappings from linguistic categories to phonetic cues as the talker during test. This can explain why we observed cross-talker generalization after single-talker exposure. It also can explain why we observed considerable variability in the extent of cross-talker generalization across the different test talkers (recall Figure 6). Similarity-based generalization further predicts that cross-talker generalization depends not only on the test talker, but on the combination of exposure and test talker. Experiments 1a and 1b employed 20 unique combinations of exposure and test talkers. While any comparison of these combinations should be interpreted with caution (we only have 8 participants for each of the 20 combinations across Experiments 1a and 1b), it seems clear from Figure 7 that the benefits of an exposure talker can vary within each test talker. For some combinations of exposure and test talker, there do not seem to be any benefits over control exposure. For other combinations, single-talker exposure seems to match talker-specific exposure. The same exposure talker can even yield cross-talker generalization for one test talker but not for another—compare, for example, the effect of exposure talker 016 for test talkers 032 and 037 (in both cases, the leftmost blue pointrange in the panel).
Figure 7.
Fitted transcription accuracy during test for the combined data from Experiments 1a and 1b depending on the particular combination of exposure and test talker. The four panels represent test performance for the four test talkers. Within each panel, exposure talkers in the single-talker condition are sorted alphanumerically (from left to right: 016, 021, 032, 035, 037, 043). These estimates are obtained from a Bayesian GLMM that adds random effects by exposure talker in addition to those by test talkers, participants, and items. Additional plots in the SI (§7.2) show models fit separately to Experiments 1a and 1b.
This suggests an explanation as to why BB08 did not find an effect of single-talker exposure. Both BB08 and our replications employed test talkers whose L2-English speech was rated to be of medium intelligibility to L1-English speakers (BB08: 74 RAU baseline intelligibility; replications: 71 to 79 RAU; see also SI, §11). For our replication studies, the same applied to all six exposure talkers used in all of our L2 exposure conditions (71 to 86 RAU; see also SI, §11). This contrasts with BB08: while the multi-talker condition in BB08 employed five talkers of medium intelligibility (79 to 88 RAU), the four talkers in the single-talker conditions differed vastly in their intelligibility, ranging from low to high (43 to 92 RAU). While similar intelligibility does not necessarily imply similarity in terms of the relevant phonetic distributions, it is plausible that the four exposure talkers in BB08’s single-talker conditions were overall not sufficiently similar to BB08’s test talker to yield cross-talker generalization. The hypothesis of similarity-based generalization predicts that any sufficiently high-powered replication of BB08 will detect an effect of single-talker exposure, provided that (i) the single- and multi-talker conditions counterbalance the same exposure talkers (across participants) and (ii) the multi-talker condition yields an effect. This makes a testable prediction for future work: an effect of single-talker exposure should be found even for the materials from BB08, as long as the same five exposure talkers from the multi-talker condition are used in the single-talker conditions.
When is exposure to multiple talkers expected to facilitate cross-talker generalization beyond exposure to a single talker?
Under the hypothesis of similarity-based generalization, the primary benefit of multi-talker exposure is that it increases the probability that at least some of the exposure talkers exhibit a mapping from linguistic categories to phonetic cues that is similar to that employed by the test talker. Multi-talker exposure should thus facilitate cross-talker generalization beyond single-talker exposure when multiple exposure talkers exhibit relevant similarities to the test talker, and in particular when the different exposure talkers exhibit different similarities with the test talker (so that the additive effect of these similarities goes beyond the effects of the single-talker conditions). This is a prediction that can be tested in future work.
One important trade-off to keep in mind in such comparisons relates to the relative amount of exposure from each talker. For the same amount of total exposure, multi-talker exposure provides less information about each individual talker. Especially for relatively short exposure, this constitutes a potential disadvantage of multi-talker exposure. Consider, for example, a scenario in which only one of the exposure talkers is at all similar to the test talker’s speech. In this scenario, having more exposure talkers means less exposure per talker, including for the only exposure talker that is similar to the test talker. The hypothesis of similarity-based generalization then predicts that the degree of cross-talker generalization can decrease if additional non-similar talkers are included in the exposure while keeping the total amount of exposure constant. This is worth emphasizing, as it is exactly the opposite of the prediction made by accounts that treat variability as a direct cause of successful learning and generalization (e.g., Bradlow & Bent, 2008; Wade et al., 2007). In the theoretical perspective considered here, by contrast, variability is not a direct cause of learning and generalization. Variability facilitates learning and generalization only indirectly, namely when it affords information about the relevant phonetic distributions. When exposure is not informative about the test materials (or, for that matter, when listeners are unable to make use of that informativeness), the hypothesis of similarity-based generalization predicts no benefit of increased variability during exposure.
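The exposure trade-off described above can be made concrete with a small simulation (invented numbers; not part of the paper’s analyses): with total exposure held constant, splitting tokens across more talkers leaves fewer tokens from the one talker whose category-cue mappings resemble the test talker’s, yielding noisier estimates of the relevant mapping:

```python
import numpy as np

rng = np.random.default_rng(3)

TOTAL_TOKENS = 80        # total amount of exposure, held constant
RELEVANT_MEAN = 50.0     # invented cue mean of the one exposure talker
RELEVANT_SD = 10.0       # whose speech is similar to the test talker's

def estimation_error(n_exposure_talkers, n_sims=20_000):
    # Exposure tokens are split evenly across talkers; by assumption only one
    # talker's tokens inform the category-cue mapping relevant at test.
    n_relevant = TOTAL_TOKENS // n_exposure_talkers
    estimates = rng.normal(RELEVANT_MEAN, RELEVANT_SD,
                           size=(n_sims, n_relevant)).mean(axis=1)
    return np.abs(estimates - RELEVANT_MEAN).mean()

for k in (1, 5):
    print(f"{k} exposure talker(s): mean estimation error {estimation_error(k):.2f} ms")
```

Under this assumption, the estimation error for the relevant mapping grows as talkers are added, which is the predicted cost of variability when the added talkers are uninformative about the test talker.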
It also follows from these considerations that multi-talker exposure is predicted to not yield cross-talker generalization if none of the exposure talkers provide relevant evidence. Indeed, if future studies find that the effect of multi-talker exposure cannot be completely reduced to the similarity in the category-to-cue mapping between exposure and test tokens, this would constitute strong evidence that listeners have learned talker-independent representations (as considered by BB08). In addressing this question, it will be important to consider similarity in the relevant phonetic space. It is, for example, an open question whether the learning processes hypothesized by distributional learning and exemplar accounts (Hay, 2018; Johnson, 1997; Kleinschmidt & Jaeger, 2015, 2016) operate over relatively unprocessed acoustic input or over context-normalized cues (for relevant discussion, see Cristia et al., 2013; McMurray & Jongman, 2011; Tamminga, Wilder, Lai & Wade, 2020; Xie, Buxó-Lugo & Kurumada, 2020).
When is exposure to multiple talkers expected to elicit cross-talker generalization indistinguishable from talker-specific adaptation?
Throughout this paper, we have focused on cross-talker generalization and the relative benefit of single- and multi-talker exposure. The hypothesis of similarity-based generalization also predicts that the comparison of multi-talker and talker-specific exposure—like everything else—depends on the specific exposure and test talkers. This is reflected both in BB08 and in our replications. In BB08, multi-talker exposure (+8 RAU above control exposure) provided almost the same benefit as talker-specific exposure (+10 RAU); the difference between these two conditions did not reach statistical significance. In Experiments 1a and 1b, the difference between multi-talker and talker-specific exposure was somewhat more pronounced (see, e.g., Figure 7). In Experiment 1a, there was strong evidence that the two conditions differed (BF = 62.2, pposterior > 0.98; see SI, §4.1.2); in Experiment 1b, there was positive evidence (BF = 5.7, pposterior = 0.85; see SI, §5.1.2). For the relatively homogeneous group of exposure and test talkers in our replications, we observe only relatively small differences between the test talkers: for the combined data shown in Figure 7, multi-talker and talker-specific exposure yield almost identical benefits for test talkers 035 and 037, whereas talker-specific exposure shows a clearer advantage for test talkers 032 and 043.
A final prediction of similarity-based generalization is that prolonged exposure to the same number of talkers should ameliorate the disadvantage of multi-talker relative to the same total amount of talker-specific exposure. Once enough data is observed for each exposure talker, the benefit of additional exposure will be minimal, reducing the advantage that talker-specific exposure has. It is possible that this explains why the difference between multi-talker and talker-specific exposure was less pronounced in BB08.
Towards stronger tests of similarity-based generalization.
Moving forward, a strong test of similarity-based generalization requires expansions beyond existing studies. In the SI, we describe two initial attempts to quantitatively predict cross-talker generalization after single-talker exposure by linking measures of talker similarity to degrees of generalization. As we discuss in the SI, some characteristics of our design (which was not intended to address this question) limit our ability to find conclusive evidence in this regard. Future experiments in which exposure and test talkers are selected to exhibit a wide range of talker-to-talker similarity would make a particularly valuable contribution to the field. In planning such experiments, it is important to keep in mind that successful talker-to-talker generalization requires that the talkers are objectively similar in their realization of linguistic categories (e.g., phonemes and/or words, prosodic accents and boundaries). This requires either annotation and quantification of the relevant phonetic distributions of candidate talkers prior to data collection (as in Kraljic & Samuel, 2006; Xie & Myers, 2017), or active manipulation of these exposure distributions (as in Eisner & McQueen, 2005; Reinisch & Holt, 2014). Either approach constitutes a substantial amount of upfront work and has so far only been attempted for one specific phonetic contrast at a time. Advances in automatic speech synthesis (e.g., Descript/Lyrebird, www.descript.com) allow the generation of natural speech while providing full control over phonetic distributions; this may prove an interesting and feasible path forward in overcoming these challenges.
An increased focus in future work on the relevant phonetic distributions also might shed light on seemingly conflicting results on cross-accent generalization. Research on this question has asked whether the type of cross-talker generalization observed in BB08 and our replications is specific to the L2 accent during exposure (accent-specific generalization) or extends to other types of L2 accents (accent-general generalization). In the original experiment by BB08, a second post-exposure test assessed cross-talker generalization for a novel L2 accent (Slovakian-accented English). Neither single- nor multi-talker exposure resulted in a detectable advantage for this novel accent beyond that afforded by task familiarity. This apparent accent-specificity of cross-talker generalization was another important contribution of BB08. One limitation of our replication is that it does not speak to this question (see footnote 1).
There are, however, other studies that have since addressed this question. For example, Alexander & Nygaard (2019) exposed L1-English listeners to L1-accented, Spanish-accented, or Korean-accented English, or to a mix of the two L2 accents. During test, listeners transcribed either Spanish- or Korean-accented English speech. Accuracy was highest when the L2 accent during test was the same as during exposure, and intermediate for mixed-accent exposure. As in BB08, exposure to a different L2 accent did not yield an overall benefit during test compared to control (L1-accented exposure). Critically, however, detailed phonetic analyses found that the perception of some vowels was improved after exposure to a different L2 accent compared to control. For some vowels, these improvements were, in fact, indistinguishable from the improvements after exposure to the same accent. This suggests that exposure to one L2 accent can sometimes lead to cross-accent generalization.
As Alexander and Nygaard conclude from their phonetic analyses (p. 3395), whether a benefit is found for a particular vowel seemed to depend in part on the similarity in the realization of that vowel during exposure and test. Similarly, enhanced cross-accent cross-talker generalization after exposure to talkers of multiple accents (Baese-Berk et al., 2013) might not necessarily reflect the induction of accent-independent models but rather might be derived from similarity-based generalization. These are testable hypotheses: if the presence or absence of cross-accent generalization is explainable by similarity-based generalization, we should find generalization precisely when the relevant phonetic distributions are similar across the accents.
Closing considerations on the role of variability in learning and generalization
Research on generalization in speech perception, language learning and beyond has put a strong emphasis on the role of “variability” during exposure in determining the success of learning (Aguilar, Plante & Sandoval, 2018; Bradlow, 2008; Giannakopoulou, Brown, Clayards & Wonnacott, 2017; Leong, Price, Pitchford & van Heuven, 2018; Sadakata & McQueen, 2014). One key insight of these works has been that increased variability during exposure often furthers, rather than hinders, learning and generalization. These insights have important real-life consequences for pedagogy (Barriuso & Hayes-Harb, 2018), therapy (Aguilar et al., 2018), and beyond. The present findings—and the theoretical perspective they support—suggest a specific reason why and when increased variability during exposure (or, depending on the field, “training”, “practice”, or “therapy”) is expected to facilitate generalization: when it increases the informativity of exposure with regard to the inputs and tasks that are targeted in later tests (or real life, i.e., the training, teaching, or therapy goals). To appreciate this insight—which is likely obvious to any trainer, teacher, or therapist—it serves to illustrate how focused some subfields of the cognitive sciences have become on the notion of variability.
Bradlow and Bent (2008) frame the key insights from their study in terms of the benefit of increased variability. For example, “[t]he finding that talker-independent adaptation to Chinese-accented English required exposure to multiple talkers of the accent is consistent with studies demonstrating the efficacy of a high variability approach to non-native phoneme contrast training.” (p. 722) and “. . . there appears to be ever-mounting evidence that exposure to highly variable training stimuli promotes, rather than interferes with, perceptual learning for speech be it at the level of phoneme or dialect/accent category representation” (ibid.). This framing has been highly influential both before and after the publication of BB08. While BB08’s description carefully avoids attributing a causal role to variability, subsequent works sometimes use terminology that seems to suggest a direct causal role of variability in learning and generalization (Potter & Saffran, 2017; Schmale et al., 2015). The hypothesis of similarity-based generalization instead holds that variability often has an indirect causal role: the reason high-variability training/therapy facilitates generalization is because it increases the probability that the input contains relevant information that will transfer to subsequent tests.
To use another example from speech perception, exposure to highly variable non-native vowel productions has been found to facilitate comprehension of non-native vowel productions, compared to exposure to vowel productions with the same mean but smaller variance (Wade et al., 2007; see also Sadakata & McQueen, 2014), which can be seen as demonstrating the benefit of high variability exposure. However, in this study, the test items followed the exact same high variability vowel distributions as in the high variability exposure. In other words, the high variability condition exposed participants to input that was objectively more similar to the subsequent tests than the input in the other conditions. Similarly, prolonged exposure to L2-accented speech in Weil (2001) might have led to cross-talker generalization not because there was overall more variability (the explanation advanced in BB08), but because the exposure and test talker were sufficiently similar. In short, one way through which increased variability during exposure can facilitate learning and generalization is simply by providing a more accurate picture of the relevant statistics.12 Critically, this predicts that high variability exposure can actually be harmful. For example, when listeners are tested on vowel productions with low variability, previous exposure to high-variability vowel production is expected to hinder comprehension, compared to low-variability exposure.
None of this changes the fact that high-variability training/therapy is often helpful in real life, but it explains why: the ‘test’ of real life almost always involves a broader set of stimuli and tasks than even those employed in high-variability training/therapy.
Conclusion
Our experiments provide new support for the generalizability of talker accent adaptation. We find evidence that even short exposure to a single L2-accented talker is helpful for subsequent perception of an unfamiliar talker of the same accent. The degree of this cross-talker generalization depended on the specific combination of exposure and test talker. We also find evidence that exposure to multiple talkers of an L2 accent facilitates generalization beyond that associated with single-talker exposure, though the difference between single- and multi-talker exposure we find is substantially less pronounced than in the original study (Bradlow & Bent, 2008). Bayesian replication analyses suggest that there is considerable variability in the strength of evidence for cross-talker generalization depending on the exposure and test talkers, and even across comparatively high-powered replication instances.
Context of research
This project is part of a series of studies on the adaptive mechanisms underlying speech perception. The long-term goal of this research program is to test whether the same implicit distributional learning underlies the perception of talker-specific differences in L1 speech, adaptation to sociolects, regional dialects, and L2-accented speech, as well as acquisition of L2 phonetic contrasts during second language acquisition (Kleinschmidt & Jaeger, 2015; Pajak et al., 2016; Xie & Myers, 2017). These topics continue to be largely studied in isolation (though see, e.g., Bruggeman & Cutler, 2020; Kim, Clayards & Kong, 2020; Reinisch & Holt, 2014; Xie et al., 2017; Wade et al., 2007). And so it remains an open question whether, e.g., adaptation and generalization to L2-accented speech can be explained by the same mechanisms that seem to drive perceptual recalibration during L1 speech perception (for discussion, see Baese-Berk, 2018; Foulkes & Hay, 2015; Hay, 2018; Samuel & Kraljic, 2009; Samuel, 2011). In this context, the present work speaks to two questions: (i) whether adaptive processes during the perception of L2-accented speech can occur as rapidly as during L1 speech perception, within a single session with only minutes of exposure (see also Clarke & Garrett, 2004; Xie et al., 2018), and (ii) whether exposure to multiple talkers necessarily involves adaptive mechanisms that are qualitatively different from those that seem to explain adaptation to a single talker.
Supplementary Material
Figure 4.
Posterior density estimates (over log-odds of correct transcription) for the comparisons associated with research questions 1–3 for the combined data from Experiments 1a and 1b. For details, see caption of Figure 2.
Acknowledgments
Experiment 1a and the pilot in the Appendix were first presented at the 2016 CUNY and LabPhon conferences (Weatherholtz, Liu, & Jaeger, 2016a,b). Initial analyses of Experiments 1a and 1b were presented at CUNY 2017 (Liu, Xie, Weatherholtz, & Jaeger, 2016). The authors are grateful for helpful feedback on earlier presentations, in particular by Sarah Brown-Schmidt, Michael K. Tanenhaus, Ehsan Hogue, and Davy Temperley, as well as members of the Human Language Processing lab at the University of Rochester. We thank, in particular, Kodi Weatherholtz who designed and implemented the pilot and Experiment 1a and got the whole project started. We also owe many thanks to Paul Bürkner, Stephen Martin, and Henrik Singmann for invaluable help with the implementation of the Bayesian replication test we present. This test would not have been possible without their help (all mistakes are, of course, our own). This work also benefited from discussions with Wednesday Bushong, Crystal Lee, Leslie Li, Lauren Oey, Emily Simon, and Lily Steiger. All data and analysis code are available via OSF https://osf.io/brwx5/. This research was funded by NIH R01 grant HD075797. The views expressed here do not necessarily reflect those of the funding agency.
Appendix A. Post-experiment questionnaire
The following post-experiment questionnaire was shown to participants at the end of the experiment. Each question was presented on its own page, with no option for participants to go back and modify their previous responses.
- Did any of the audio clips jump, stall, or skip during the experiment?
- Yes, very frequently.
- Yes, a handful of times.
- Yes, once or twice.
- No, they all played smoothly.
- Did any of the video clips jump, stall, or skip during the experiment?
- Yes, very frequently.
- Yes, a handful of times.
- Yes, once or twice.
- No, they all played smoothly.
- What kind of audio equipment did you use for the experiment?
- In-ear headphones
- Over-the-ear headphones
- Laptop speakers
- External speakers
- How does your audio equipment sound when you watch (high quality) music videos on YouTube, watch movies on Netflix, or engage in other similar activities involving audio?
- Poor (most words cannot be understood)
- Okay (many sounds are distorted)
- Good (occasionally some minor distorted sounds)
- Excellent (crystal clear with no distorted sounds)
- Professional quality
- I don’t know or don’t do any of the above.
- You might have noticed that the last talker you heard had an accent. How often do you hear talkers with a similar accent, whether in person or in movies/TV shows, etc.?
- A few times a day or more
- Perhaps once a day
- Perhaps a few times a week
- Perhaps a few times a month
- Perhaps a few times a year
- I don’t recall ever hearing this accent before
- If you somewhat regularly hear talkers with a similar accent, please tell us in what context you encounter these talkers (select all that apply).
- In my family
- Among my close friends
- At work
- My own accent sounds like that
- In movies, shows, etc.
- I never heard this accent before
- Based on his accent, where do you think the last talker you heard was from? (Free response)
- Please select the choice that best describes your language background.
- I speak American English only.
- I speak an East Asian language (e.g., Chinese, Korean)
- I speak other languages, but none from East Asia.
- Enter any other comments that you have below (optional).
Appendix B. Pilot experiment to determine suitable test talkers
As mentioned under Materials in the main text, we initially recruited participants for only the control and talker-specific conditions of Experiment 1a. Specifically, we initially considered all six L2-accented talkers employed in the exposure phase of Experiment 1a as possible test talkers. The goal of this initial pilot phase was to ensure that our paradigm was capable of detecting at least the effects of talker-specific adaptation (compared to control exposure) for the test talkers to be used for the remainder of the experiment, as those talker-specific effects are expected to be larger than the generalization effects we were ultimately interested in. The resulting design of the pilot experiment is shown in Figure B1.
Figure B1.
Design and lists of pilot experiment. Each shape represents a different talker. The exposure talkers in the control condition are the only American-English accented talkers. All other talkers are Mandarin-accented talkers.
Method
Participants.
We recruited 104 participants so as to have 48 participants for the analysis for each of the two exposure conditions (control and talker-specific). The choice of 48 participants per condition followed from our desire to have more power than previous work, and the number of participants required to balance across lists, as described under Procedure. The recruiting procedure was identical to Experiment 1.13
Table B1.
Total number of participants in pilot experiment before and after exclusions.
Exposure condition | Recruited | Excluded | Analyzed
---|---|---|---
Control | 51 | 3 (5.9%) | 48
Talker-specific | 53 | 5 (9.4%) | 48
Exclusions.
The same criteria as in Experiments 1a and 1b were applied. Exclusions are summarized in Table B1. After exclusions, the targeted number of 48 participants remained in each condition (compared to 10 participants in each condition in BB08).
Materials.
The materials were identical to Experiments 1a and 1b.
Procedure.
The procedure was identical to Experiments 1a and 1b. List generation and counter-balancing, too, were identical to Experiments 1a and 1b (see Figure B1). This resulted in a total of 24 lists in the control condition (6 Mandarin-accented test talkers × 2 assignments of sentence sets to exposure vs. test × 2 presentation orders within blocks) and 24 lists in the talker-specific condition (6 exposure-test talker pairs × 2 assignments of sentence sets to exposure vs. test × 2 presentation orders within blocks). Two participants per list (after exclusions) resulted in the 48 participants per condition reported in Table B1.
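The counterbalancing arithmetic above (6 × 2 × 2 = 24 lists per condition) can be sketched as a simple cross-product; the labels below are placeholders for illustration, not the actual list identifiers used in the experiment.

```python
from itertools import product

# Hypothetical labels standing in for the design factors of one condition:
test_talkers = ["016", "021", "032", "035", "037", "043"]  # six Mandarin-accented talkers
sentence_assignments = ["setA=exposure", "setB=exposure"]  # which sentence set serves as exposure
orders = ["order1", "order2"]                              # presentation order within blocks

# Fully crossing the three factors yields the 24 lists per condition.
lists = list(product(test_talkers, sentence_assignments, orders))
print(len(lists))  # 24
```

With two participants assigned per list after exclusions, this cross-product directly yields the 48 participants per condition reported in Table B1.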
Results
Analysis approach.
The analysis approach was identical to Experiments 1a and 1b. We first conducted a talker-independent analysis to determine whether we could replicate the basic adaptation effect found in BB08, namely whether exposure to an accented talker (as compared to exposure to an unaccented talker) would lead to improved understanding of novel sentences from that same accented talker during the test block. We then examined differences in talker-specific adaptation by test talker.
As in Experiments 1a and 1b, we used Bayesian mixed-effects logistic regression to predict the proportion of correct keyword transcriptions from the exposure condition (sum-coded: talker-specific = 1 vs. control = −1), while including a control predictor for a priori performance. The R code and full model outputs are provided in the SI (§11).
Talker-independent analysis (paralleling BB08).
As expected, participants in the talker-specific condition performed significantly better (higher transcription accuracy) during the test block than participants in the control condition (BF = 3999.0, pposterior > 0.999). This is consistent with accent adaptation: exposure to sentences from an accented talker leads to improved understanding of novel sentences from that same talker. This is illustrated in Figure B2. In separate analyses reported in the SI (§11.2), we identified a significant effect of a priori performance, such that participants with a higher a priori performance (as estimated at the beginning of the exposure phase) tended to transcribe more words correctly during the test block (β̂ = 0.88, BF > 7999.0, pposterior > 0.999), regardless of exposure condition.
Figure B2.
Comparison of transcription accuracy during test block of the pilot experiment.
Talker-dependent analysis.
Because each of the six Mandarin-accented talkers served as the test talker for some participants, we can assess the benefit of talker-specific exposure for each individual test talker, i.e., we can assess how comprehension of each test talker differs following exposure (Figure B3). Different talkers—even when they share the same L2 accent—may result in different degrees of adaptation following the same amount of exposure. Taking the performance of the control condition during test as an estimate of the a priori intelligibility of the different test talkers (from the perspective of a native listener), we see in Figure B3 that our talkers ranged in intelligibility from 75% to 94% transcription accuracy (mean: 86%, SD = 7%). Certain talkers (e.g., Talker 1 and 2) may be more intelligible to listeners because their pronunciations are more acoustically similar to native pronunciations.
Figure B3.
Comparison of transcription accuracy during test block by test talker (pilot experiment). Conditions are the control, L1-accented (left, red) and talker-specific, L2-accented (right, blue) exposure conditions. Panels show the six test talkers. Each shape represents a different accented talker, heard during the test block. The percentages at the bottom of each panel represent the difference in performance from the talker-specific condition to the control condition (i.e., benefit of talker-specific exposure). Four of the test talkers show a benefit of exposure (talkers 032, 035, 037, 043), while others do not (016, 021). Error bars represent 95% confidence intervals bootstrapped over by-participant means.
Table B2.
Results of Talker-dependent Bayesian Analysis in the pilot experiment. Proportion of the samples of the posterior density for which the estimated effect β̂ > 0, along with 95% CIs of β̂, and the evidence ratio (Bayes factor) for each comparison. This ratio represents how likely it is that the comparison shows an exposure benefit, relative to the hypothesis that it does not.
Adaptation: TS vs. CNTL | Talker 016 | Talker 021 | Talker 032 | Talker 035 | Talker 037 | Talker 043
---|---|---|---|---|---|---
pposterior(β̂ > 0) | 0.69 | 0.35 | 0.96 | 0.98 | 0.99 | >0.99
95% CIlower(β̂) | −0.34 | −0.59 | −0.03 | 0.01 | 0.05 | 0.23
95% CIupper(β̂) | 0.57 | 0.42 | 0.86 | 0.90 | 0.94 | 1.16
Evidence ratio (BF) | 2.25 | 0.53 | 26.78 | 42.48 | 66.8 | 499
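The quantities in Table B2 can all be read off the posterior samples of the condition coefficient β̂: the posterior probability that β̂ > 0, the evidence ratio in favor of that hypothesis, and the 95% credible interval. Below is a minimal sketch using simulated Gaussian draws as a stand-in for the real MCMC samples; the mean and SD are arbitrary assumptions, not values from our models.

```python
import random
random.seed(1)

# Stand-in for MCMC samples of the exposure-condition coefficient beta.
# (In the actual analysis these come from the fitted Bayesian GLMM.)
samples = [random.gauss(0.45, 0.25) for _ in range(8000)]

# Posterior probability that the effect is positive ...
p_posterior = sum(b > 0 for b in samples) / len(samples)
# ... and the corresponding evidence ratio (odds in favor of beta > 0).
evidence_ratio = p_posterior / (1 - p_posterior)

# 95% credible interval from the empirical posterior quantiles.
samples.sort()
ci_lower = samples[int(0.025 * len(samples))]
ci_upper = samples[int(0.975 * len(samples))]
```

This makes the relationship between the table's columns transparent: the evidence ratio is a deterministic transformation of pposterior, and both are computed from the same posterior samples as the credible intervals.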
Planned follow-up analysis found that the benefit of talker-specific exposure differed significantly between talkers. To assess whether, and if so by how much, talker-specific exposure resulted in a benefit over control exposure for each individual test talker, we conducted a simple effects analysis. This talker-dependent analysis was based on 8 subjects per combination of condition and test talker (cf. 10 participants per condition in the talker-independent analysis in BB08, and 48 in the talker-independent analysis reported above).
The estimated coefficients of the condition comparison (talker-specific vs. control) are shown in Table B2 for each test talker. A positive coefficient indicates that participants in the talker-specific condition transcribed more words correctly than participants in the control condition (the direction consistent with talker-specific adaptation). High posterior probabilities that β̂ > 0 indicate stronger evidence that listeners benefit from talker-specific exposure to that talker.
Discussion
We find that exposure to a Mandarin-accented talker leads to improved comprehension (measured via transcription accuracy) on novel sentences from that same talker. This talker-specific adaptation is found above and beyond any effects of practice with the transcription task, since participants in both exposure conditions transcribed equally many sentences prior to test. Our web-based paradigm thus successfully replicates the lab-based paradigm employed in BB08.
We find the largest adaptation effects for talkers 032, 035, 037, and 043. These talkers were thus chosen as the test talkers for Experiments 1a and 1b, so as to maximize the statistical power to distinguish between effects of the different generalization conditions (single- and multi-talker exposure). We note that these initial estimates of the talker-specific benefit of adaptation are based on a relatively small sample of participants.
For talkers 016 and 021, we find little evidence that listeners benefit from exposure. The perhaps most likely reason for this is that there was little room for improvement for talkers 016 and 021, as their baseline intelligibility was already high (Talker 016: 89%; Talker 021: 94%; Figure B3). This emphasizes the same point we make in the general discussion with regard to cross-talker generalization: already at 90% baseline performance, it can be difficult to detect robust facilitation effects, even for talker-specific adaptation, which has been replicated many times.
Alternative explanations of the lack of adaptation for talkers 016 and 021 are possible. One possibility, discussed in BB08, is that to achieve the same degree of adaptation, listeners require longer exposure to less intelligible talkers than to more intelligible talkers. Another possibility is that certain L2-accented talkers may be easier to learn because their pronunciations are more systematic (i.e., lower within-category variance; see Wade et al., 2007). For example, a talker who consistently pronounces the /i/ vowel as /ε/ would be easier to learn than one who pronounces /i/ as many different vowels. A third, closely related possibility is that certain talkers have greater overlap in their pronunciation of different phonemes (as has been observed for L1-accented talkers, Newman et al., 2001). This would make it more difficult for listeners to distinguish phonemes from that talker even after perfect adaptation (Kleinschmidt & Jaeger, 2015; Xie & Myers, 2017).
Appendix C. Power simulations
We present a series of (frequentist) power analyses. We note that the standard non-Bayesian notion of statistical power is defined for null hypothesis significance testing: an estimate of the statistical power is an estimate of the proportion of times that a repetition of the experiment would yield significance. A variety of more or less related measures have been proposed for Bayesian analyses (e.g., Gelman & Carlin, 2014; Kruschke & Liddell, 2018).
We use non-Bayesian mixed-effects logistic regression for this purpose for reasons of computational feasibility—for the amount of data we have, the Bayesian regressions presented in the main text take many hours to complete, so that the large number of iterations necessary for ‘power’ estimates was not feasible.
We conducted three simulations. Simulation 1 compared the power of Experiment 1a to that of BB08 under the assumption that the data from both experiments were drawn from the same underlying population of participants and items, and thus had the same underlying by-participant and by-item variability. We set the effect sizes for the different exposure conditions to be identical in BB08 and our replication. Specifically, we set the effect sizes to the (median) estimates of the Bayesian analysis in Experiment 1a. This ignores uncertainty in these estimates, and might over-estimate the effect size (an issue addressed in Simulations 2 and 3). The standard deviations of the by-subject and by-item random effects, too, were set to the median estimates obtained from the Bayesian analysis of Experiment 1a.
Ideally, power simulations for BB08 should be based on estimates of the effect sizes and by-participant and by-item variability for BB08 (incl. knowledge about the appropriate level of uncertainty over those estimates). However, this information is not reported in BB08 (reporting such information was neither required, nor common, in 2008).
For both experiments, we assumed 51 data points per subject, as both BB08 and the present experiments employed 16 sentences during test, each with 3–4 keywords. The number of participants was set to 70 for BB08 (10 each in the control, multi-talker, and talker-specific condition, plus 10 talkers for each of four single-talker conditions) and 320 for our Experiment 1a (80 per condition). We then generated 1,000 random data sets for each of BB08 and Experiment 1a. Each of these simulated data sets was then analyzed with a mixed-effects logistic regression (implemented in the glmer function of the lme4 package; Bates, Mächler, Bolker & Walker, 2015) using the same condition coding and random effect structure employed in the analyses presented in the main text. We followed common practice and refit non-converged models with reduced random effect structure. Specifically, we first removed the by-item slope for exposure condition. If that model still did not converge, we removed the by-subject intercepts (100% of these models converged, so that no further reduction needed to be considered). On 10 cores of a 2013 MacPro (2.7 GHz 12-Core Intel Xeon E5, 64 GB RAM), all three simulations together took about 2–3 days to complete.
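For readers unfamiliar with simulation-based power estimation, the core loop is: simulate data under an assumed effect size, fit a model, and count how often the effect reaches significance. The sketch below illustrates this logic with a deliberately simplified two-proportion z-test in place of the mixed-effects logistic regressions used in our simulations; because it ignores by-participant and by-item variability, it will over-estimate power. All parameter values are illustrative assumptions, not our actual estimates.

```python
import math
import random
random.seed(42)

def simulate_power(n_per_condition, p_control, p_exposure,
                   n_trials_per_subject=51, n_sims=200):
    """Simplified power estimate: simulate correct keyword transcriptions
    per condition as independent Bernoulli trials and apply a one-sided
    two-proportion z-test. (A GLMM-based simulation would additionally
    model by-participant and by-item random effects.)"""
    hits = 0
    n = n_per_condition * n_trials_per_subject  # trials pooled per condition
    for _ in range(n_sims):
        k1 = sum(random.random() < p_control for _ in range(n))
        k2 = sum(random.random() < p_exposure for _ in range(n))
        p_pool = (k1 + k2) / (2 * n)
        se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
        z = ((k2 - k1) / n) / se if se > 0 else 0.0
        if z > 1.645:  # significant in the predicted direction
            hits += 1
    return hits / n_sims

power_small = simulate_power(10, 0.80, 0.85)  # BB08-sized cells
power_large = simulate_power(80, 0.80, 0.85)  # Experiment 1a-sized cells
```

As in our GLMM simulations, power grows with the number of participants per condition, which is why even this toy version reproduces the qualitative gap between the BB08-sized and Experiment 1a-sized designs.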
Table C1.
Results of non-Bayesian power simulations. CNTL: control exposure; ST: single-talker; MT: multi-talker; TS: talker-specific. Each cell shows the percentage of 1,000 GLMM power simulations that found a significant effect in the correct direction. Parentheses show the percentage of GLMMs that converged with the full random effect structure (regardless of successful detection of the effect).
 | Simulation 1: BB08 | Simulation 1: Exp 1a | Simulation 2: Exp 1a (2σ) | Simulation 3: Exp 1a ()
---|---|---|---|---
Adaptation: TS vs. CNTL | 57.9% (12%) | 98.8% (36%) | 99.1% (36%) | 63.4% (80%)
Question 1: MT vs. CNTL | 38.4% (12%) | 96.4% (36%) | 95.6% (36%) | 50.1% (80%)
Question 2: ST vs. CNTL | 25.9% (12%) | 56.7% (36%) | 54.8% (36%) | 19.0% (80%)
Question 3: MT vs. ST | 21.0% (10%) | 45.9% (42%) | 44.7% (42%) | 14.1% (86%)
The results of Simulation 1 are shown in Table C1. Power was estimated as the proportion of data sets (out of 1,000 each) in which the effect for a condition reached significance in the correct direction (i.e., z > 1.96) and the model converged (non-converged models should not be interpreted). Table C1 also summarizes the percentage of models that converged with the full random effect structure. Simulation 1 found 21.0–38.4% power for Questions 1–3 for the design and number of subjects employed in BB08. Power for our replication Experiment 1a was estimated at 45.9–96.4%. If the same underlying effect size and variability are assumed, Experiment 1a thus had substantially more statistical power than BB08. We also see that the rate of models that converged with the full random effect structure was about three to four times higher for Experiment 1a than for BB08.
To be conservative in our comparison to BB08 and in our estimation of the power of our replication, we conducted Simulation 2. We calculated the power of Experiment 1a under the assumption that the true standard deviation (SD) of the by-participant and by-item intercepts was actually twice as high as the estimate obtained in our analyses of Experiment 1a. Increased variability could result, for example, from the use of a web-based paradigm in the present study and the concomitant increase in variability in audio equipment across participants (though see Oey, Lee & Simon, 2018, described in the general discussion). Even under the assumption of doubled by-participant and by-item SDs, we find that the replication study would have substantially more statistical power than BB08 (see Table C1).
Finally, we considered the possibility that the effects of exposure in Experiment 1a would be substantially smaller than in BB08. To be clear, Experiment 1a suggests that this is not the case. In particular for the single-talker condition, we obtained larger effects than BB08. Prior to conducting Experiment 1a, however, we did not know this. Smaller effects in Experiment 1a were a possibility, for example, because the present study employed only half the number of exposure sentences used in BB08 (see Methods). Simulation 3 thus estimated power under the assumption that the true effect for the replication study was actually half the size of the effect observed in Experiment 1a. This simulation also provides a lower bound estimate of the power for Experiment 1b, which had substantially smaller effect sizes than Experiment 1a (though still more than half the size of the effects of Experiment 1a). Under this rather conservative assumption of halved effect sizes, Simulation 3 found that the statistical power of our replication would be higher than in BB08 for Question 1, but lower than in BB08 for Question 2. For Experiment 1a, the power estimates are mostly based on GLMMs that converged with the full random effect structure (80–86%), whereas very few of the GLMM analyses of the simulated BB08 data converged with full random effect structures (10–12%). If only models with full random effect structure are considered, then power in Experiment 1a was at least twice that of BB08 for Question 1 (Experiment 1a: 45.4% vs. BB08: 16.7%) and Question 3 (12.1% vs. 5.6%), but still somewhat smaller for Question 2 (11.1% vs. 16.7%).
In summary, the power analyses suggest that Experiments 1a and 1b each have similar, or substantially larger, statistical power compared to the original study of BB08.
Appendix D. A Bayesian replication test for Generalized Linear Mixed Models (GLMMs)
We first describe the general approach, and then describe the specifics of the analysis conducted. The R code for the replication analysis is provided in the SI (§6.2).
Overview
Consider a GLMM analysis with outcome Y and k predictors x1, ..., xk plus random by-subject and by-item intercepts and slopes for all predictors. We are interested in whether the effect of a specific predictor xj replicated between the original experiment and its replication. We thus apply the following six steps.
Step 1. Fit the ‘full’ Bayesian GLMM to the data from the original experiment: Y ∼ 1 + x1 + ... + xk + (1 + x1 + ... + xk|subject) + (1 + x1 + ... + xk|item). It is important to specify well-formed priors.
Step 2. From the posterior distribution of parameter estimates obtained in Step 1, derive a new set of priors for all parameters that will be used for the analysis of the replication data. This includes the priors for the intercept, the coefficients β1, ..., βk of the fixed effects, and the variances and covariances of the random effects. Specifically, we set the prior for each parameter to a Normal distribution with the mean and standard deviation of that parameter’s posterior distribution.14
Step 3. Fit the same model as in Step 1 to the replication data but with the priors derived in Step 2 (instead of the weakly regularizing uninformative priors). These revised priors capture our knowledge (and uncertainty) about all effects after observing the original experiment.
Step 4. For each level of the three sliding difference contrasts, repeat Step 3 for the ‘reduced’ model with all the same predictors and priors as in Step 3, except without xj. Following recommendations for the assessment of significance for fixed effects in GLMMs, Step 4 should keep the random slopes for xj. I.e., we use Yrep ∼ 1 + x1 + ... + xj−1 + xj+1 + ... + xk + (1 + x1 + ... + xk|subject) + (1 + x1 + ... + xk|item).
Step 5. Assess the marginal likelihood of the replication data Yrep under a) the full model from Step 3 and b) the reduced model from Step 4.
Step 6. Calculate the replication Bayes Factor BFr0 (Verhagen & Wagenmakers, 2014) as the ratio of the marginal likelihood of the replication data under the proponent’s hypothesis (obtained in Step 5a) to the marginal likelihood under the skeptic’s hypothesis (obtained in Step 5b).
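Step 6 reduces to simple arithmetic once Step 5 has produced the two marginal likelihoods, which are best handled on the log scale. A minimal sketch (in Python for illustration; the marginal likelihoods themselves come from bridge sampling in R, and the numbers below are hypothetical; `posterior_prob` is an illustrative helper, not part of the paper's pipeline):

```python
import math

def replication_bf(logml_full, logml_reduced):
    """Replication Bayes factor BF_r0: ratio of the marginal likelihood of
    the replication data under the proponent's model (priors informed by
    the original experiment, Step 5a) to that under the skeptic's reduced
    model (Step 5b). Computed on the log scale for numerical stability."""
    return math.exp(logml_full - logml_reduced)

def posterior_prob(bf, prior_odds=1.0):
    """Posterior probability of the proponent's hypothesis, assuming equal
    prior odds unless specified (hypothetical helper for illustration)."""
    odds = bf * prior_odds
    return odds / (1 + odds)

# Hypothetical log marginal likelihoods, as returned by bridge sampling
bf = replication_bf(-1520.3, -1523.3)
print(round(bf, 1), round(posterior_prob(bf), 3))  # → 20.1 0.953
```

Note that a log-marginal-likelihood difference of about 3 already yields BFr0 ≈ 20, i.e., a posterior probability of replication above .95 under equal prior odds.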
Details for Steps 1–5
We now elaborate on Steps 1–5.
For Step 1. We fit the same GLMM used in the main analysis to the data from Experiment 1a, except that we used three manually created numerical predictors to code the contrasts between the four exposure conditions. The resulting model has a different R formula than the analyses reported above, but provides identical predictions and parameter estimates (up to minor differences due to the stochastic nature of the MCMC sampling used to fit brm models). Separate numerical predictors were used because we wanted to be able to remove individual parameters from the model in Step 4.
Specifically, we implemented two versions of this model. The first model used numerical predictors that correspond to treatment coding of the four exposure conditions with the control condition as reference level. The three numerical predictors in this model correspond to the contrast between talker-specific and control exposure, multi-talker and control exposure (Question 1), and single-talker and control exposure (Question 2). To address Question 3, we also fit a model with numerical predictors corresponding to the treatment coding with the single-talker condition as reference level. One of the numerical predictors in this model codes the difference between multi-talker and single-talker exposure (Question 3).
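For concreteness, the numerical predictors corresponding to treatment coding can be constructed as follows (a Python sketch with hypothetical predictor names; the actual models were fit in R):

```python
# Build numerical 0/1 predictors equivalent to treatment coding of the four
# exposure conditions, so that each contrast is a separate column that can
# be dropped individually in Step 4.
CONDITIONS = ["CNTL", "TS", "MT", "ST"]

def treatment_code(condition, reference="CNTL"):
    """Return one 0/1 indicator per non-reference condition."""
    return {f"{c}_vs_{reference}": float(condition == c)
            for c in CONDITIONS if c != reference}

# With CNTL as reference: TS_vs_CNTL, MT_vs_CNTL (Question 1), ST_vs_CNTL (Question 2)
print(treatment_code("MT"))
# With ST as reference, MT_vs_ST codes Question 3
print(treatment_code("MT", reference="ST"))
```

Because each contrast is its own column, removing a single predictor in Step 4 removes exactly one comparison while leaving the rest of the model (including the random effects) unchanged.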
In Step 1, we always used the same weakly regularizing uninformative priors as in the main analysis. Also following the main analysis, we used brm (Bürkner, 2017) to fit the GLMM, again with 10,000 post-warmup samples for each of 8 chains (and the default 1,000 warmup samples).
For Step 3. Whereas parameter estimation arrives at robust results with comparatively few MCMC samples, robust estimation of the marginal likelihoods in Step 5 requires large numbers of samples. It is recommended to use at least 10 times as many posterior samples as would be required for robust parameter estimation (Gronau, Sarafoglou, Matzke, Ly, Boehm, Marsman, Leslie, Forster, Wagenmakers & Steingroever, 2017).
For our data, parameter estimation was robust with 4 chains of 1,000 post-warmup samples (and 1,000 warmup samples). To err on the safe side, we fit the new model with 8 chains of 10,000 post-warmup samples each (and the default 1,000 warmup samples). As required for Step 5, we set save_all_pars = TRUE for the call to brm.
For Step 4. We fit a total of four separate ‘reduced’ models to the replication data, one for each replication test. Each of these reduced models was obtained by removing from the model the numerical predictor that coded the contrast in question. For example, to assess whether Experiment 1b replicates the answer to Question 3 provided by Experiment 1a, we removed the numerical predictor that coded the contrast between the multi- and single-talker condition in the sliding difference variant of the full model described under Step 1. Each model in Step 4 was fit with the same priors and arguments as in Step 3. The random effects were always the same as for the full model in Step 3. As in Step 3, we fit each model with 8 chains with 10,000 post-warmup samples per chain (and the default 1,000 warmup samples). As required for Step 5, we set save_all_pars = TRUE for the call to brm.
For Step 5. We use bridge sampling (Meng & Wong, 1996; Gronau et al., 2017) as implemented in the bridge_sampler function from the R package bridgesampling (Gronau, Singmann & Wagenmakers, 2020) to estimate the marginal likelihoods of the replication data under both the full model from Step 3 and the reduced model from Step 4. The former represents the likelihood of the replication data under the proponent’s hypothesis that the effect from Experiment 1a replicates in Experiment 1b. The latter represents the likelihood of the replication data under the skeptic’s hypothesis of a null effect. Specifically, we use warp bridge sampling (method = “warp3”), which uses a standard multivariate normal distribution as proposal distribution and “warps” the posterior distribution so that it has the same mean vector, covariance matrix, and skew as the samples (Meng & Schilling, 2002). To assess the robustness of the estimate of the marginal likelihood, we repeated each estimation five times. The resulting estimates differed only in the first or second digit, which never affected any conclusions.
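The idea behind bridge sampling can be illustrated with a toy example. The sketch below (Python, purely illustrative; it is not the warp-III variant used by the bridgesampling package) estimates the known normalizing constant of an unnormalized 1-D Gaussian from ‘posterior’ draws and proposal draws, using the geometric bridge function h = (q·g)^(−1/2), for which the bridge identity gives Z = E_g[√(q/g)] / E_post[√(g/q)]:

```python
import math
import random

random.seed(0)
SQRT2PI = math.sqrt(2 * math.pi)

def q(x):
    """Unnormalized target: exp(-x^2/2). True normalizer Z = sqrt(2*pi)."""
    return math.exp(-x * x / 2)

def g(x, s=1.2):
    """Normalized Gaussian proposal N(0, s^2)."""
    return math.exp(-x * x / (2 * s * s)) / (s * SQRT2PI)

def bridge_estimate(n=20000, s=1.2):
    """Bridge-sampling estimate of Z using the geometric bridge
    h = (q*g)^(-1/2): Z = E_g[sqrt(q/g)] / E_post[sqrt(g/q)]."""
    post = [random.gauss(0, 1) for _ in range(n)]  # 'posterior' draws
    prop = [random.gauss(0, s) for _ in range(n)]  # proposal draws
    num = sum(math.sqrt(q(x) / g(x, s)) for x in prop) / n
    den = sum(math.sqrt(g(x, s) / q(x)) for x in post) / n
    return num / den

print(bridge_estimate())  # close to sqrt(2*pi) ≈ 2.5066
```

The bridgesampling package applies the same identity to the high-dimensional posteriors of the GLMMs, with an iteratively optimized bridge function and a warped normal proposal instead of the simple choices made here.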
We conducted a total of four replication tests, each involving Steps 1–5: three tests to assess replication of Questions 1–3, and one test to assess replication of the result for talker-specific adaptation. The latter test was conducted for comparison’s sake. As was the case for the main analysis, all analyses converged.
Appendix E. Effects of soft ceilings on the ability to obtain decisive evidence of replication
Just as the statistical power for non-Bayesian mixed-effects logistic regression decreases for mean outcome proportions close to 0 or 1, the ability to obtain decisive evidence for an effect in a Bayesian analysis decreases as baseline performance (here, the performance of the control group) increases. From this it follows that weaker evidentiary support (smaller BFs) with regard to Questions 1–3 is expected in Experiment 1b, compared to Experiment 1a.
To assess whether such differences in baseline performance also affect the replication test, we conducted simulations similar to the frequentist power simulations reported in Appendix C. We repeatedly generated ‘original’ and ‘replication’ data using the estimates from Experiments 1a and 1b, and then applied Steps 1–5 of the replication test. The original and the replication data were generated from a ground truth that was identical except for the baseline performance. Specifically, both the original and the replication data were generated using the effect size (e.g., for the ST vs. CNTL comparison) and the by-subject and by-item random effect estimates obtained from Experiment 1a. However, whereas the baseline for the generation of the original data was set to the intercept from Experiment 1a (1.66 log-odds), the baseline for the generation of the replication data was set to the intercept of Experiment 1b (2.00 log-odds). That is, while the underlying effect (e.g., ST vs. CNTL) was identical in the original and replication data, it was added on top of different CNTL performances.
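The intuition behind the soft ceiling can be verified directly on the logistic scale: an identical effect in log-odds produces a smaller gain in proportion correct when added to a higher baseline. A minimal sketch (Python; the 0.5 log-odds effect is a hypothetical value chosen for illustration, while the intercepts are those of Experiments 1a and 1b):

```python
import math

def logistic(x):
    """Inverse logit: map log-odds to proportion correct."""
    return 1 / (1 + math.exp(-x))

def gain(intercept, effect=0.5):
    """Gain in proportion correct from adding `effect` (in log-odds)."""
    return logistic(intercept + effect) - logistic(intercept)

# Identical log-odds effect, different baselines (Exp 1a vs. Exp 1b intercepts)
for label, intercept in [("Exp 1a baseline", 1.66), ("Exp 1b baseline", 2.00)]:
    print(f"{label}: {logistic(intercept):.3f} -> "
          f"{logistic(intercept + 0.5):.3f} (gain {gain(intercept):.3f})")
```

The higher baseline of Experiment 1b compresses the observable gain in proportion correct, which is one reason weaker evidentiary support is expected there even when the underlying effect is identical.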
These simulations are computationally demanding and, despite our best efforts to automate them, they had to be manually restarted after every iteration. We thus conducted only a few dozen iterations of each simulation, enough to see that stable trends emerged (this took several weeks). We ran simulations for the ST vs. CNTL comparison and the MT vs. CNTL comparison. The comparison between these two simulations serves to illustrate how the relative effect size of the ST vs. CNTL compared to the MT vs. CNTL effect in Experiment 1a is expected to affect our ability to obtain decisive evidence of replication (when the ground truth is that the effect is identical).
As shown in Figure E1, the simulations lead us to expect support for the MT vs. CNTL comparison, in most cases strong or very strong support (most BFHrepH0 > 20, indicating posterior probabilities of replication success above .95). This differed markedly from the ST vs. CNTL comparison, for which only a small proportion of simulations indicated at least strong support for the replication hypothesis. In short, we had substantially less ‘power’ to find strong evidence of replication with regard to Question 2 than with regard to Question 1.
Figure E1.
Distribution of replication Bayes Factors for the comparisons of the multi-talker vs. control conditions (top) and the single-talker vs. control condition (bottom). Bayes Factors in support of replication (BFHrepH0 > 1) are shaded in gray. Due to the computationally demanding nature of these simulations, each density is based on only a few dozen iterations. The ground truths for the original and replication data were identical except for intercept (for details, see text).
Additionally, we repeated the latter comparison (MT vs. CNTL) while keeping the baseline identical across the original and replication data, using the intercept of Experiment 1a in both cases. This simulation, visualized in Figure E2, demonstrates that the increased performance in the control condition in Experiment 1b, compared to 1a, indeed decreased the probability of obtaining decisive (strong or very strong) support for the replication hypothesis (compare the top panel of Figure E1).
Figure E2.
Distribution of replication Bayes Factors for the comparisons of the multi-talker vs. control conditions when not only the effect, but also the baseline (intercept) were identical for the ‘original’ and ‘replication’ data. For details, see caption of Figure E1.
Footnotes
A second test had all participants transcribe speech from an unfamiliar talker of another L2 accent (Slovakian-accented English). The results of this second test suggest that the cross-talker generalization observed after multi-talker exposure was largely specific to the exposure accent (Mandarin-accented English; for related results and discussion, see Baese-Berk, Bradlow & Wright 2013; Baese-Berk 2018). Here we replicate only the first test, assessing cross-talker generalization within the same L2 accent. Our results thus do not speak to whether the adaptation and generalization we observe is specific to the L2 accent under investigation.
A separate line of work—also typically employing a single test talker—has investigated cross-talker generalization for noise-vocoded speech (Huyck, Smith, Hawkins & Johnsrude, 2017) or perceptual recalibration in L1 speech (e.g., Eisner & McQueen, 2005; Kraljic & Samuel, 2006; Kraljic & Samuel, 2007). It is an open question whether findings from these works extend to accent adaptation where numerous phonetic features deviate from listeners’ expectations.
While some work has employed multiple L2 test talkers, these studies have exclusively investigated multi-talker exposure (Sidaras, Alexander & Nygaard, 2009; Tzeng, Alexander, Sidaras & Nygaard, 2016; Alexander & Nygaard, 2019).
The large number of lists is also the reason we limit our replication to the first of the two post-exposure tests in BB08. The inclusion of a second test with four additional test talkers of another L2 accent would have increased the number of between-participant lists required for a fully balanced design from 128 to 512, requiring over 1000 participants per experiment. Since the second test in BB08 always followed the test of interest to us, our decision to omit the second test cannot explain differences in results between BB08 and the present study.
All effects found for Experiment 1a also hold if participants from the pilot experiment are excluded from analysis.
The use of 3, instead of 1 (as in Gelman et al., 2008), degrees of freedom follows current recommendations (https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations, retrieved on 11/24/2019).
The pre-registration of Experiment 1b proposed a predictor specifically to control for differences in audio-equipment. Since the approach employed here—developed after pre-registration—subsumes effect of audio equipment, we did not conduct the pre-registered analysis of audio-equipment.
From 2000–2010, the number of people with Chinese ethnicity reported by the US Census as living in the US increased by 36% from 2,314,537 to 3,137,061 (https://tinyurl.com/w4c324t). No census numbers are yet available for later dates. It is also an open question to what extent workers on Mechanical Turk proportionally represent the US population, even when recruitment is limited to US IP addresses. Mandarin has consistently been reported to be an infrequent language background on Mechanical Turk (Munro & Tily, 2011; Pavlick, Post, Irvine, Kachaev & Callison-Burch, 2014).
The general discussion presents additional visualizations for all 20 unique combinations of exposure and test talkers in the single-talker condition. Further quantification of the generalizability of our results across exposure and test talkers is available in the SI (§7.3), addressing a question of increasing interest across the psychological sciences (see discussion in Yarkoni, 2019).
This seemingly contrasts with Experiment 1a, which did not find a drastically reduced effect of single-talker exposure for Talker 037. At first blush, this difference between Experiments 1a and 1b might be puzzling. We hypothesized that both experiments do, in fact, exhibit consistent evidence in that something about Talker 037 leads to increased variability in the effect of single-talker exposure across participants. To test this hypothesis, we refit the main analysis to each test talker separately to estimate the standard deviation of the by-participant random intercepts (σ_participant) for each test talker and experiment. These analyses reveal that both experiments exhibit particularly high by-participant variability for Talker 037 (σ_participant for Experiment 1a = .85; Experiment 1b = .80) compared to all other test talkers (mean of all other σ_participant = .58, range = .53 to .73). For both experiments, the σ_participant for Talker 037 was outside of, or at the edge of, the 95% highest posterior density intervals of the σ_participant for all other talkers. In short, the effects observed for Talker 037 varied substantially across participants, and did so in both experiments. This type of dependency of the results on interactions between listeners (participants) and talkers emphasizes the need for experiments with multiple talkers.
In the SI (§12), we report results from an additional small-scale replication of only the single- and multi-talker conditions (69 participants). For this replication, we drastically increased variability during exposure by playing 80 different sentence recordings, rather than five instances of 16 recordings. Compared to Experiment 1a, we indeed find an increase in cross-talker generalization, but only for the multi-talker condition and not the single-talker condition. We note that in our manipulation (unlike in Tzeng et al., 2016) more variability implied objectively more evidence about the phonetic distributions of the exposure talker, rather than the same input in more varied order.
Variability during exposure can have additional effects—for example, by recruiting attention and increasing task engagement. It is, however, worth noting that some of the results taken to argue for such effects involve tests that are more similar in task demands to those of the high variability exposure (e.g., Tzeng et al., 2016). It is possible that even some of the effects of variability that seem to originate in increased task engagement are ultimately caused by the similarity between the task demands during exposure and test.
As reported in the main text, those 32 participants that were exposed to the four test talkers selected for the remainder of the experiment are included in the 80 participants for the control and talker-specific conditions of Experiment 1a.
We also explored more sophisticated priors, such as Inverse-Gamma priors for the random effect variances. As this did not affect results, but sometimes led to convergence problems, we decided to employ the simpler approach described here.
Contributor Information
Xin Xie, University of Rochester, Department of Brain and Cognitive Sciences.
Linda Liu, University of Rochester, Department of Brain and Cognitive Sciences.
T. Florian Jaeger, University of Rochester, Departments of Brain and Cognitive Sciences and Computer Science.
References
- Aguilar JM, Plante E, & Sandoval M (2018). Exemplar variability facilitates retention of word learning by children with specific language impairment. Language, speech, and hearing services in schools, 49 (1), 72–84. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Alexander J & Nygaard L (2019). Specificity and generalization in perceptual adaptation to accented speech. The Journal of the Acoustical Society of America, 145 (6), 3382–3398. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Allen JS, Miller JL, & DeSteno D (2003). Individual talker differences in voice-onset-time. The Journal of the Acoustical Society of America, 113 (1), 544–552. [DOI] [PubMed] [Google Scholar]
- Baayen H, Vasishth S, Bates DM, & Kliegl R (2016). The Cave of Shadows: Addressing the human factor with generalized additive mixed models. Journal of Memory and Language. [Google Scholar]
- Babel M, McAuliffe M, Norton C, Senior B, & Vaughn C (2019). The goldilocks zone of perceptual learning. Phonetica, 76 (2–3), 179–200. [DOI] [PubMed] [Google Scholar]
- Baese-Berk M (2018). Perceptual Learning for Native and Non-native Speech, volume 68. Elsevier Ltd. [Google Scholar]
- Baese-Berk MM, Bradlow AR, & Wright BA (2013). Accent-independent adaptation to foreign accented speech. The Journal of the Acoustical Society of America, 133 (3), EL174–EL180. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Barriuso TA & Hayes-Harb R (2018). High variability phonetic training as a bridge from research to practice. CATESOL Journal, 30 (1), 177–194. [Google Scholar]
- Bates D, Mächler M, Bolker B, & Walker S (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67 (1), 1–48. [Google Scholar]
- Bent T & Bradlow AR (2003). The interlanguage speech intelligibility benefit. The Journal of the Acoustical Society of America, 114 (3), 1600–1610. [DOI] [PubMed] [Google Scholar]
- Best CT, Shaw JA, Mulak KE, Docherty G, Evans BG, Foulkes P, Hay J, Al-Tamimi J, Mair K, & Wood S (2015). Perceiving and adapting to regional accent differences among vowel subsystems. In Proceedings of the 18th International Congress of Phonetic Sciences (ICPhS 2015), 10–14 August 2015, Glasgow, Scotland, UK. [Google Scholar]
- Bradlow A, Ackerman L, Burchfield L, Hesterberg L, Luque J, & Mok K (2010). Allsstar: Archive of l1 and l2 scripted and spontaneous transcripts and recordings. In Proceedings of the International Congress on Phonetic Sciences, (pp. 356–359). [PMC free article] [PubMed] [Google Scholar]
- Bradlow AR (2008). Training non-native language sound patterns: lessons from training japanese adults on the english. Phonology of Second Language Acquisition, 36, 287–308. [Google Scholar]
- Bradlow AR & Bent T (2008). Perceptual adaptation to non-native speech. Cognition, 106 (2), 707–729. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bruggeman L & Cutler A (2020). No l1 privilege in talker adaptation. Bilingualism: Language and Cognition, 23 (3), 681–693. [Google Scholar]
- Burchill Z, Liu L, & Jaeger TF (2018). Maintaining perceptual information during accent adaptation. PLOS ONE. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bürkner P (2017). Advanced Bayesian multilevel modeling with the R package brms. arXiv preprint arXiv:1705.11123. [Google Scholar]
- Carpenter B, Gelman A, Hoffman M, Lee D, Goodrich B, Betancourt M, Brubaker MA, Guo J, Li P, Riddell A, et al. (2016). Stan: A probabilistic programming language. Journal of Statistical Software, 20 (2), 1–37. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Carvalho PF & Goldstone RL (2014). Putting category learning in order: Category structure and temporal arrangement affect the benefit of interleaved over blocked study. Memory & cognition, 42 (3), 481–495. [DOI] [PubMed] [Google Scholar]
- Chodroff E, Golden A, & Wilson C (2019). Covariation of stop voice onset time across languages: Evidence for a universal constraint on phonetic realization. The Journal of the Acoustical Society of America, 145 (1), EL109–EL115. [DOI] [PubMed] [Google Scholar]
- Chodroff E & Wilson C (2017). Predictability of stop consonant phonetics across talkers: Between-category and within-category dependencies among cues for place and voice. Linguistics Vanguard. [Google Scholar]
- Clarke CM (2003). Processing time effects of short-term exposure to foreign-accented English. PhD thesis, The University of Arizona. [Google Scholar]
- Clarke CM & Garrett MF (2004). Rapid adaptation to foreign-accented english. The Journal of the Acoustical Society of America, 116 (6), 3647–3658. [DOI] [PubMed] [Google Scholar]
- Clayards M, Tanenhaus MK, Aslin RN, & Jacobs RA (2008). Perception of speech reflects optimal use of probabilistic speech cues. Cognition, 108 (3), 804–809. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cristia A, Mielke J, Daland R, & Peperkamp S (2013). Similarity in the generalization of implicitly learned sound patterns. Laboratory Phonology, 4 (2), 259–285. [Google Scholar]
- Dixon P (2008). Models of Accuracy in Repeated Measures Designs. Journal of Memory and Language, 59 (4), 447–456. [Google Scholar]
- Drouin JR, Theodore RM, & Myers EB (2016). Lexically guided perceptual tuning of internal phonetic category structure. The Journal of the Acoustical Society of America, 140 (4), EL307–EL313. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eckert P (2012). Three Waves of Variation Study: The Emergence of Meaning in the Study of Sociolinguistic Variation. Annual Review of Anthropology, 41, 87–100. [Google Scholar]
- Eisner F & McQueen JM (2005). The specificity of perceptual learning in speech processing. Perception & Psychophysics, 67 (2), 224–238. [DOI] [PubMed] [Google Scholar]
- Eisner F & McQueen JM (2006). Perceptual learning in speech: Stability over time. The Journal of the Acoustical Society of America, 119 (4), 1950–1953. [DOI] [PubMed] [Google Scholar]
- Eisner F, Melinger A, & Weber A (2013). Constraints on the transfer of perceptual learning in accented speech. Frontiers in Psychology, 4, 148. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Feldman NH, Griffiths TL, & Morgan JL (2009). The influence of categories on perception: Explaining the perceptual magnet effect as optimal statistical inference. Psychological review, 116 (4), 752. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Foulkes P & Hay JB (2015). The emergence of sociophonetic structure. In The handbook of language emergence (pp. 292–313). John Wiley & Sons, Inc. [Google Scholar]
- Fuertes JN, Gottdiener WH, Martin H, Gilbert TC, & Giles H (2012). A meta-analysis of the effects of speakers’ accents on interpersonal evaluations. European Journal of Social Psychology, 42 (1), 120–133. [Google Scholar]
- Gass S & Varonis EM (1984). The effect of familiarity on the comprehensibility of nonnative speech. Language Learning, 34 (1), 65–87. [Google Scholar]
- Gelman A & Carlin J (2014). Beyond power calculations: Assessing type s (sign) and type m (magnitude) errors. Perspectives on Psychological Science, 9 (6), 641–651. [DOI] [PubMed] [Google Scholar]
- Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, & Rubin DB (2013). Bayesian data analysis. CRC press. [Google Scholar]
- Germine L, Nakayama K, Duchaine BC, Chabris CF, Chatterjee G, & Wilmer JB (2012). Is the web as good as the lab? comparable performance from web and lab in cognitive/perceptual experiments. Psychonomic bulletin & review, 19 (5), 847–857. [DOI] [PubMed] [Google Scholar]
- Giannakopoulou A, Brown H, Clayards M, & Wonnacott E (2017). High or low? comparing high and low-variability phonetic training in adult and child second language learners. PeerJ, 5, e3209. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Goldinger SD (1996). Words and voices: episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22 (5), 1166. [DOI] [PubMed] [Google Scholar]
- Goldinger SD (1998). Echoes of echoes? an episodic theory of lexical access. Psychological Review, 105 (2), 251. [DOI] [PubMed] [Google Scholar]
- Gordon-Salant S, Yeni-Komshian GH, Fitzgibbons PJ, & Schurman J (2010). Short-term adaptation to accented english by younger and older adults. The Journal of the Acoustical Society of America, 128 (4), EL200–EL204. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gronau QF, Sarafoglou A, Matzke D, Ly A, Boehm U, Marsman M, Leslie DS, Forster JJ, Wagenmakers EJ, & Steingroever H (2017). A tutorial on bridge sampling. Journal of Mathematical Psychology, 81, 80–97. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gronau QF, Singmann H, & Wagenmakers E-J (2020). bridgesampling: An R package for estimating normalizing constants. Journal of Statistical Software, 92 (10), 1–29. [Google Scholar]
- Harrington Stack CM, James AN, & Watson DG (2018). A failure to replicate rapid syntactic adaptation in comprehension. Memory & Cognition, 1–14.
- Hay J (2018). Sociophonetics: The role of words, the role of context, and the role of words in context. Topics in Cognitive Science, 10 (4), 696–706.
- Hilbig BE (2016). Reaction time effects in lab- versus web-based research: Experimental evidence. Behavior Research Methods, 48 (4), 1718–1724.
- Huyck JJ, Smith RH, Hawkins S, & Johnsrude IS (2017). Generalization of perceptual learning of degraded speech across talkers. Journal of Speech, Language, and Hearing Research, 60 (11), 3334–3341.
- Idemaru K & Holt LL (2011). Word recognition reflects dimension-based statistical learning. Journal of Experimental Psychology: Human Perception and Performance, 37 (6), 1939–1956.
- Jaeger TF (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59 (4), 434–446.
- Janse E & Adank P (2012). Predicting foreign-accent adaptation in older adults. The Quarterly Journal of Experimental Psychology, 65 (8), 1563–1585.
- Jeffreys H (1961). Theory of probability (3rd ed.). Oxford University Press.
- Johnson DE (2009). Getting off the GoldVarb standard: Introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass, 3 (1), 359–383.
- Johnson K (1997). Speech perception without speaker normalization: An exemplar model. In Johnson K & Mullennix J (Eds.), Talker Variability in Speech Processing (pp. 145–165). Academic Press.
- Kass RE & Raftery AE (1995). Bayes factors. Journal of the American Statistical Association, 90 (430), 773–795.
- Kim D, Clayards M, & Kong EJ (2020). Individual differences in perceptual adaptation to unfamiliar phonetic categories. Journal of Phonetics, 81, 100984.
- Kleinschmidt DF & Jaeger TF (2012). A continuum of phonetic adaptation: Evaluating an incremental belief-updating model of recalibration and selective adaptation. In Proceedings of the 34th Annual Meeting of the Cognitive Science Society (CogSci12).
- Kleinschmidt DF & Jaeger TF (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122 (2), 148.
- Kleinschmidt DF & Jaeger TF (2016). What do you expect from an unfamiliar talker? In Proceedings of the 38th Annual Meeting of the Cognitive Science Society (CogSci16).
- Kleinschmidt DF, Raizada RD, & Jaeger TF (2015). Supervised and unsupervised learning in phonetic adaptation. In Noelle D, Dale R, Warlaumont A, Yoshimi J, Matlock T, Jennings CD, & Maglio PP (Eds.), Proceedings of the 37th Annual Meeting of the Cognitive Science Society (CogSci15) (pp. 1129–1134). Cognitive Science Society.
- Kliegl R, Masson MEJ, & Richter EM (2010). A linear mixed model analysis of masked repetition priming. Visual Cognition, 18 (5), 655–681.
- Knock TR, Ballard KJ, Robin DA, & Schmidt RA (2000). Influence of order of stimulus presentation on speech motor learning: A principled approach to treatment for apraxia of speech. Aphasiology, 14 (5–6), 653–668.
- Kraljic T & Samuel AG (2005). Perceptual learning for speech: Is there a return to normal? Cognitive Psychology, 51 (2), 141–178.
- Kraljic T & Samuel AG (2006). Generalization in perceptual learning for speech. Psychonomic Bulletin & Review, 13 (2), 262–268.
- Kraljic T & Samuel AG (2007). Perceptual adjustments to multiple speakers. Journal of Memory and Language, 56 (1), 1–15.
- Kruschke J (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press.
- Kruschke JK & Liddell TM (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25 (1), 178–206.
- Lancia L & Winter B (2013). The interaction between competition, learning, and habituation dynamics in speech perception. Laboratory Phonology, 4 (1), 221–257.
- Leong CXR, Price JM, Pitchford NJ, & van Heuven WJ (2018). High variability phonetic training in adaptive adverse conditions is rapid, effective, and sustained. PloS one, 13 (10), e0204888.
- Lewandowski D, Kurowicka D, & Joe H (2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100 (9), 1989–2001.
- Lippi-Green R (2012). English with an accent: Language, ideology and discrimination in the United States. Routledge.
- Liu L & Jaeger TF (2018). Inferring causes during speech perception. Cognition, 174, 55–70.
- Liu L & Jaeger TF (2020). Talker-specific pronunciation or just a speech error? Discounting (or not) atypical pronunciations during speech perception. Journal of Experimental Psychology: Human Perception and Performance, 45 (12), 1562–1588.
- Luce PA & Pisoni DB (1998). Recognizing spoken words: The neighborhood activation model. Ear and Hearing, 19 (1), 1.
- McMurray B & Jongman A (2011). What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review, 118 (2), 219.
- Meng X-L & Schilling S (2002). Warp bridge sampling. Journal of Computational and Graphical Statistics, 11 (3), 552–586.
- Meng X-L & Wong H (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831–860.
- Mitterer H, Scharenborg O, & McQueen JM (2013). Phonological abstraction without phonemes in speech perception. Cognition, 129 (2), 356–361.
- Munro MJ (2003). A Primer on Accent Discrimination in the Canadian Context. TESL Canada Journal, 20 (2), 38.
- Munro MJ & Derwing TM (1995). Processing time, accent, and comprehensibility in the perception of native and foreign-accented speech. Language and Speech, 38 (3), 289–306.
- Munro R & Tily H (2011). The start of the art: An introduction to crowdsourcing technologies for language and cognition studies. In Workshop on Crowdsourcing Technologies for Language and Cognition Studies.
- Munson B (2011). The influence of actual and imputed talker gender on fricative perception, revisited (L). The Journal of the Acoustical Society of America, 130 (5), 2631–2634.
- Newman RS, Clouse SA, & Burnham JL (2001). The perceptual consequences of within-talker variability in fricative production. The Journal of the Acoustical Society of America, 109 (3), 1181–1196.
- Norris D & McQueen JM (2008). Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115 (2), 357.
- Norris D, McQueen JM, & Cutler A (2003). Perceptual learning in speech. Cognitive Psychology, 47 (2), 204–238.
- Oden GC & Massaro DW (1978). Integration of featural information in speech perception. Psychological Review, 85 (3), 172.
- Oey L, Lee C, & Simon E (2018). How we comprehend foreign-accented speech: Learning to generalize across talkers. Journal of Undergraduate Research.
- Pajak B, Fine AB, Kleinschmidt DF, & Jaeger TF (2016). Learning additional languages as hierarchical probabilistic inference: Insights from first language processing. Language Learning, 66 (4), 900–944.
- Paquette-Smith M, Cooper A, & Johnson EK (2020). Targeted adaptation in infants following live exposure to an accented talker. Journal of Child Language, 1–25.
- Pavlick E, Post M, Irvine A, Kachaev D, & Callison-Burch C (2014). The language demographics of Amazon Mechanical Turk. Transactions of the Association for Computational Linguistics, 2, 79–92.
- Peterson GE & Barney HL (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24 (2), 175–184.
- Pierrehumbert J (2001). Lenition and contrast. Frequency and the Emergence of Linguistic Structure, 45, 137.
- Pierrehumbert J (2002). Word-specific phonetics. Laboratory Phonology, 7.
- Porretta V, Tucker BV, & Järvikivi J (2016). The influence of gradient foreign accentedness and listener experience on word recognition. Journal of Phonetics, 58, 1–21.
- Potter CE & Saffran JR (2017). Exposure to multiple accents supports infants’ understanding of novel accents. Cognition, 166, 67–72.
- R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
- Raftery AE (1995). Bayesian model selection in social research. Sociological Methodology, 111–163.
- Reinisch E & Holt LL (2014). Lexically guided phonetic retuning of foreign-accented speech and its generalization. Journal of Experimental Psychology: Human Perception and Performance, 40 (2), 539–555.
- Reinisch E & Mitterer H (2016). Exposure modality, input variability and the categories of perceptual recalibration. Journal of Phonetics, 55, 96–108.
- Reinisch E, Wozny DR, Mitterer H, & Holt LL (2014). Phonetic category recalibration: What are the categories? Journal of Phonetics, 45, 91–105.
- Sadakata M & McQueen JM (2014). Individual aptitude in Mandarin lexical tone perception predicts effectiveness of high-variability training. Frontiers in Psychology, 5, 1318.
- Samuel AG (2011). Speech perception. Annual Review of Psychology, 62, 49–72.
- Samuel AG & Kraljic T (2009). Perceptual learning for speech. Attention, Perception, & Psychophysics, 71 (6), 1207–1218.
- Schmale R & Seidl A (2009). Accommodating variability in voice and foreign accent: Flexibility of early word representations. Developmental Science, 12 (4), 583–601.
- Schmale R, Seidl A, & Cristia A (2015). Mechanisms underlying accent accommodation in early word learning: Evidence for general expansion. Developmental Science, 18 (4), 664–670.
- Sidaras SK, Alexander JE, & Nygaard LC (2009). Perceptual learning of systematic variation in Spanish-accented speech. The Journal of the Acoustical Society of America, 125 (5), 3306–3316.
- Simmons JP, Nelson LD, & Simonsohn U (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22 (11), 1359–1366.
- Smith R, Holmes-Elliott S, Pettinato M, & Knight R-A (2014). Cross-accent intelligibility of speech in noise: Long-term familiarity and short-term familiarization. The Quarterly Journal of Experimental Psychology, 67 (3), 590–608.
- Soto V, Siohan O, Elfeky M, & Moreno P (2016). Selection and combination of hypotheses for dialectal speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on (pp. 5845–5849). IEEE.
- Stuart-Smith J (2008). Scottish English: Phonology. Varieties of English, 1, 48–70.
- Sumner M (2011). The role of variation in the perception of accented speech. Cognition, 119 (1), 131–136.
- Tamminga M, Wilder R, Lai W, & Wade L (2020). Perceptual learning, talker specificity, and sound change. Papers in Historical Phonology, 5, 90–122.
- Tatman R (2016). Speaker dialect is a necessary feature to model perceptual accent adaptation in humans. In 4th Pacific Northwest Regional NLP Workshop: NW-NLP 2016.
- Theodore RM & Monto NR (2019). Distributional learning for speech reflects cumulative exposure to a talker’s phonetic distributions. Psychonomic Bulletin & Review, 26 (3), 985–992.
- Tzeng CY, Alexander JE, Sidaras SK, & Nygaard LC (2016). The role of training structure in perceptual learning of accented speech. Journal of Experimental Psychology: Human Perception and Performance, 42 (11), 1793.
- Van der Zande P, Jesse A, & Cutler A (2014). Cross-speaker generalisation in two phoneme-level perceptual adaptation processes. Journal of Phonetics, 43, 38–46.
- Vasishth S, Mertzen D, Jäger LA, & Gelman A (2018). The statistical significance filter leads to overoptimistic expectations of replicability. Journal of Memory and Language, 103, 151–175.
- Verhagen J & Wagenmakers E-J (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology: General, 143 (4), 1457.
- Vroomen J, van Linden S, De Gelder B, & Bertelson P (2007). Visual recalibration and selective adaptation in auditory–visual speech perception: Contrasting build-up courses. Neuropsychologia, 45 (3), 572–577.
- Wade T, Jongman A, & Sereno J (2007). Effects of acoustic variability in the perceptual learning of non-native-accented speech sounds. Phonetica, 64 (2–3), 122–144.
- Wagenmakers E-J (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14 (5), 779–804.
- Weatherholtz K & Jaeger TF (2016). Speech perception and generalization across talkers and accents. Oxford Research Encyclopedia.
- Weil SA (2001). Foreign accented speech: Adaptation and generalization. Master’s thesis, Ohio State University.
- Witteman MJ, Weber A, & McQueen JM (2013). Foreign accent strength and listener familiarity with an accent codetermine speed of perceptual adaptation. Attention, Perception, & Psychophysics, 75 (3), 537–556.
- Xie X, Buxó-Lugo A, & Kurumada C (2020). Encoding and decoding of meaning through structured variability in intonational speech prosody. Cognition.
- Xie X, Earle FS, & Myers EB (2018). Sleep facilitates generalisation of accent adaptation to a new talker. Language, Cognition and Neuroscience, 33 (2), 196–210.
- Xie X, Liu L, & Jaeger TF (2020). Cross-talker generalization in foreign-accented speech perception. Retrieved from osf.io/brwx5.
- Xie X & Myers EB (2017). Learning a talker or learning an accent: Acoustic similarity constrains generalization of foreign accent adaptation to new talkers. Journal of Memory and Language, 97, 30–46.
- Xie X, Theodore RM, & Myers EB (2017). More than a boundary shift: Perceptual adaptation to foreign-accented speech reshapes the internal structure of phonetic categories. Journal of Experimental Psychology: Human Perception and Performance, 43 (1), 206.
- Xie X, Weatherholtz K, Bainton L, Rowe E, Burchill Z, Liu L, & Jaeger TF (2018). Rapid adaptation to foreign-accented speech and its transfer to an unfamiliar talker. The Journal of the Acoustical Society of America, 143 (4), 2013–2031.
- Yarkoni T (2019). The generalizability crisis. PsyArXiv.