Abstract
When speech is degraded or challenging to recognize, young adult listeners with normal hearing are able to quickly adapt, improving their recognition of the speech over a short period of time. This rapid adaptation is robust, but the factors influencing rate, magnitude, and generalization of improvement have not been fully described. Two factors of interest are lexico-semantic information and talker and accent variability; lexico-semantic information promotes perceptual learning for acoustically ambiguous speech, while talker and accent variability are beneficial for generalization of learning. In the present study, rate and magnitude of adaptation were measured for speech varying in level of semantic context, and in the type and number of talkers. Generalization of learning to an unfamiliar talker was also assessed. Results indicate that rate of rapid adaptation was slowed for semantically anomalous sentences, as compared to semantically intact or topic-grouped sentences; however, generalization was seen in the anomalous conditions. Magnitude of adaptation was greater for non-native as compared to native talker conditions, with no difference between single and multiple non-native talker conditions. These findings indicate that the previously documented benefit of lexical information in supporting rapid adaptation is not enhanced by the addition of supra-sentence context.
I. INTRODUCTION
Speech produced by non-native talkers is often associated with poorer or more effortful speech recognition performance as compared to native speech (Wade et al., 2007; Goslin et al., 2012; Porretta and Kyröläinen, 2019). A non-native accent results from a combination of the segmental (i.e., phoneme-level), subsegmental (i.e., phonetic-level), and suprasegmental (i.e., word or phrase-level) features of a non-native talker's first language and the second language in which they are speaking (Flege, 1988). These changes lead to alterations of the speech signal and can cause listeners difficulty when mapping lexical meaning onto an acoustic signal that may not align with the expected features of a native production. In addition to containing altered acoustic features, non-native productions may be more acoustically variable than those produced by native talkers [Baese-Berk and Morrill (2015) and Wade et al. (2007), though see the distinctions highlighted by Vaughn et al. (2019) and Xie and Jaeger (2020)].
Even in the absence of the signal alterations associated with non-native accent, native speech is highly variable: the acoustic properties of the same lexical item will not be identical from utterance to utterance, either across or within individual talkers. Yet listeners appear largely unaffected by this variability, and recognition of spoken language proceeds fairly smoothly. Further, when the speech signal is altered, most young adult listeners are able to overcome an initial decrease in speech recognition ability after a short period of exposure, improving their recognition of this challenging speech over time (Bradlow and Bent, 2008; Clarke and Garrett, 2004; Samuel and Kraljic, 2009). In this study, this process of rapid adaptation is explored for conditions containing both native and non-native speech.
A. Semantic context, stimulus type, and speech recognition
Semantic context supports speech recognition (Miller et al., 1951; Kalikow et al., 1977; Nittrouer and Boothroyd, 1990): words in isolation or in weakly constraining sentence contexts are recognized less accurately than words presented in meaningful sentences. Several studies have compared the context benefit under conditions varying in stimulus quality. Reductions in the context benefit have been observed for recognition of degraded speech [e.g., low-pass filtered, time-compressed, or noise-vocoded speech presented in noise (Goy et al., 2013; Winn, 2016)], as compared to the benefit seen for undistorted speech. Goy et al. (2013) examined the semantic context benefit for a lexical decision task using three forms of signal distortion: low-pass filtering, time-compression, and concurrent 12-talker babble. They found a greater benefit of context for undistorted stimuli, with no significant differences across distortion types. Aydelott et al. (2006) also found that the N400 effect, an electrophysiologic marker of the context benefit, was delayed and reduced in magnitude when stimuli were low-pass filtered, suggesting an electrophysiologic analog of the reduced context benefit for acoustically degraded speech.
Semantic context also benefits recognition of speech produced by non-native English talkers. Behrman and Akhund (2013) measured listeners' ratings of comprehensibility and accent strength, as well as intelligibility scores, for Spanish-accented English produced by talkers with mild, moderate, and strong accents. Listeners benefitted from contextual information at all three accent strength levels, but the context effect was largest and most consistent in the strongest accent condition. Similarly, young adults showed a greater context benefit for non-native compared to native speech on a word-identification task (Bent et al., 2019). This contrasts with the findings of Goy et al. (2013) described above, suggesting that the alterations to segmental, subsegmental, and suprasegmental features present in non-native speech have a different impact on the context benefit than do the spectral changes imposed by filtering or noise-vocoding. Overall, the literature reviewed indicates that semantic context is beneficial across different forms of signal alteration, but that the nature and degree of signal alteration may influence the strength and direction of the context benefit. In the present study, conditions containing both single and multiple non-native English talkers are included and compared to speech produced by native English talkers, to examine the effects of talker and accent variability.
B. Semantic context, stimulus type, and rapid adaptation
In addition to benefitting overall speech recognition performance, availability of lexico-semantic information is known to promote perceptual adaptation to unfamiliar speech signals. When faced with a stimulus containing an ambiguous speech sound, for example, listeners are thought to take advantage of the lexical information present in the stimulus to adjust their internal boundaries of category representation to include the ambiguous phoneme (Eisner and McQueen, 2005). Lexically guided learning is more robust for more ambiguous stimuli, with learning reduced for maximally or minimally altered stimuli (Babel et al., 2019).
Availability of lexical-semantic information has also been shown to influence perceptual learning and rapid adaptation to unfamiliar or challenging speech (Norris et al., 2003; Davis et al., 2005; Maye et al., 2008). For example, Davis et al. (2005) assessed the benefits of training with noise-vocoded versions of standard English sentences, semantically anomalous (but syntactically intact with real words) sentences, and Jabberwocky (syntactically intact with non-real content words) sentences. They found that listeners who trained on sentences containing lexical information (standard English, semantically anomalous, Jabberwocky) all showed greater learning than those who trained with non-words or who did not train at all. The presence of lexical information in the training stimuli appears to have been critical for learning of spectrally distorted stimuli. Similar findings have been documented with non-native speech (Cooper and Bradlow, 2016).
Baese-Berk et al. (2021) examined the effects of semantic context on adaptation to Mandarin-accented English. Listener groups were trained on 40 sentences at one of three levels of semantic predictability: high-predictability, low-predictability, or anomalous sentences. Training with all three sentence types provided a similar degree of benefit on a post-test when compared to a group who had received no training. However, closer examination showed that, among the listeners who had completed training, the greatest benefits were seen when the level of semantic information present during training was identical to that heard at post-test. The authors suggest that listeners may adopt different strategies in adapting to an unfamiliar accent, depending on the linguistic structure of the target speech. While this study examined adaptation to non-native accented speech, no comparisons were made with a native control talker, or with accented talkers varying in number or accent type. Considering the interactions of talker type and context level found in prior studies (Behrman and Akhund, 2013; Bent et al., 2019), an investigation of these effects on patterns of rapid adaptation is warranted. In the present study, three levels of semantic context are examined for their influence on rapid adaptation to non-native speech, and for their potential interactions with talker and accent variability.
C. Inhibitory control and rapid adaptation
An individual's cognitive capacity may influence their speech recognition ability, particularly in challenging listening circumstances (Dey and Sommers, 2015; Janse, 2012; Rönnberg et al., 2008, 2013; Sommers and Danielson, 1999). Most relevant here, inhibitory control has been shown to be predictive of the rate and magnitude of adaptation to non-native speech (Banks et al., 2015). The hypothesized role of inhibitory mechanisms in speech recognition is to prevent information that is irrelevant to the target from consuming resources that would otherwise be used for processing the target speech. In the present study, listeners completed the Stroop task (Stroop, 1935), a common measure of inhibitory control. The Stroop effect, as measured here, represents an individual's capacity to inhibit their automatic response to a written word in order to process and respond to the text color. When listening to non-native speech, listeners must at times inhibit an automatic misperception of a speech segment that has been produced differently than they might expect; phonemic changes induced by non-native accent can increase the activation of lexical competitors during the speech recognition process (Porretta and Kyröläinen, 2019). Stronger inhibition of the automatic bottom-up response should allow a listener to better integrate the remainder of the sentence and utilize its lexico-semantic information in order to improve recognition of future tokens. Thus, listeners with better Stroop scores are expected to show both higher overall speech recognition and a faster rate of learning.
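For concreteness, the Stroop interference score is conventionally computed as the difference between mean response times on incongruent and congruent trials. A minimal sketch of that computation (the function name and response-time values below are hypothetical, not data from the present study):

```python
def stroop_effect(congruent_rts_ms, incongruent_rts_ms):
    """Stroop interference: mean incongruent RT minus mean congruent RT.

    Larger positive scores indicate more interference, i.e., weaker
    inhibitory control.
    """
    mean_congruent = sum(congruent_rts_ms) / len(congruent_rts_ms)
    mean_incongruent = sum(incongruent_rts_ms) / len(incongruent_rts_ms)
    return mean_incongruent - mean_congruent

# Hypothetical response times (ms) for a single listener
interference = stroop_effect([610, 650, 640], [720, 760, 740])  # ~106.7 ms
```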
D. Summary
In the present study, the relative effects and interactions of talker type (individual talkers with the same native language and with different native languages) and semantic context on rapid adaptation to non-native speech are explored. Listeners are expected to show a faster rate and larger magnitude of adaptation in conditions with a greater degree of semantic information, because lexical information is known to guide adaptation (Norris et al., 2003; Davis et al., 2005; Scharenborg and Janse, 2013). Prior literature suggests that the benefit of semantic context for speech recognition is reduced when the signal is spectrally or temporally degraded (Aydelott et al., 2006; Goy et al., 2013), but that it may be strengthened for a more naturalistic form of signal alteration such as non-native speech (Behrman and Akhund, 2013; Bent et al., 2019). To further probe the effects of talker type on rapid adaptation, two non-native talker conditions are evaluated in addition to a native talker condition: a single non-native talker condition and a multiple non-native talker condition. This manipulation of stimulus variability is hypothesized to interact with the effect of semantic context, such that the context benefit will be maximal at the intermediate level of stimulus variability (i.e., single non-native talker) (Babel et al., 2019).
II. METHOD
A. Participants
Participants for this study included 365 listeners between the ages of 18 and 31 years (mean 24.17 years). Participants were recruited and compensated for their time via the Prolific online recruitment platform (Prolific, 2022). Listeners were required to report the United States as their country of birth and country of current residence. In addition, all participants reported learning American English as their first language, and no experience with any languages other than English before the age of 7. Further, any listener reporting regular exposure during childhood to non-native English speech from family members or caregivers was disqualified from participation. All listeners reported no hearing difficulties and no history of ear surgeries, and were required to score lower than a 6 on the Revised Hearing Handicap Inventory (Cassarly et al., 2020). Details about the participants included in each condition can be found in Table I.
TABLE I.
Characteristics of the study participants. ANOM = anomalous; STD = standard; TG = topic-grouped; ENG = Native English; SPA = Native Spanish; ML1 = Multiple L1s.
| Condition | n (#females) | Age mean (sd) | Hearing handicap score mean (sd) |
|---|---|---|---|
| ANOM_ENG | 40 (21) | 25.03 (3.64) | 0.30 (0.85) |
| ANOM_SPA | 40 (26) | 24.28 (3.82) | 0.15 (0.7) |
| ANOM_ML1 | 41 (15) | 24.27 (3.64) | 0.20 (0.75) |
| STD_ENG | 39 (19) | 24.41 (3.61) | 0.41 (1.04) |
| STD_SPA | 41 (24) | 24.02 (3.66) | 0.49 (1.25) |
| STD_ML1 | 40 (17) | 23.73 (3.23) | 0.05 (0.32) |
| TG_ENG | 40 (18) | 23.9 (3.5) | 0.15 (0.53) |
| TG_SPA | 42 (17) | 23.45 (3.66) | 0.38 (1.01) |
| TG_ML1 | 42 (20) | 24.43 (3.6) | 0.33 (0.87) |
B. Stimuli and procedure
1. Talkers
Stimuli for this experiment were produced by both native talkers and non-native English talkers with moderately strong foreign accents. Stimuli were obtained from the ALLSSTAR (Bradlow, 2022) and Hoosier database (Atagi and Bent, 2013) corpora housed within SpeechBox (formerly OSCAAR) (Bradlow, 2022); additional talkers were recruited from the UMD community and were recorded in the Hearing Research Laboratory. Talkers were rated for accent strength (Atagi and Bent, 2013) on a scale of 1 (no accent) to 9 (very strong accent) by a group of 14 young, normal-hearing, native English listeners, with the goal of including recordings from moderately accented talkers (i.e., ratings of 4–6/9) as stimuli. Following pilot testing, a total of 24 talkers were included in the final experiment. Of these, 14 were recorded at UMD, and 10 were obtained from the SpeechBox database (Bradlow, 2022). The non-native talkers had a variety of native languages including French, Hindi, Japanese, Korean, Mandarin, Portuguese, and Spanish. All talkers were male, and the non-native talkers had a mean accent rating of 5.39/9 (SD 0.79). See Table II for details of the talkers and their characteristics.
TABLE II.
Characteristics of the talkers. L1 = native language; ANOM = anomalous; STD = standard; TG = topic-grouped; ENG = Native English; SPA = Native Spanish; ML1 = Multiple L1s. Average intelligibility is measured in quiet. Asterisks represent the talkers in the ML1 conditions who were alternated for adaptation and generalization.
| Talker ID | Database | L1 | Experimental Condition | Accent Rating (x/9) | Avg. Intelligibility (all sentences used) |
|---|---|---|---|---|---|
| UMD1E | UMD | English | ANOM_ENG | 1.17 | 99% |
| UMD2E | UMD | English | ANOM_ENG | 1.31 | 98% |
| UMD6S | UMD | Spanish | ANOM_SPA | 5.06 | 99% |
| UMD5S | UMD | Spanish | ANOM_SPA | 6.36 | 97% |
| UMD1N | UMD | Hindi | ANOM_ML1 | 6.1 | 99% |
| UMD1M | UMD | Mandarin | ANOM_ML1 | 4.74 | 99% |
| UMD3S* | UMD | Spanish | ANOM_ML1 | 5.14 | 96% |
| UMD4S* | UMD | Spanish | ANOM_ML1 | 3.56 | 99% |
| UMD3E | UMD | English | STD_ENG | 1.49 | 100% |
| UMD4E | UMD | English | STD_ENG | 1.05 | 99% |
| 662 | ALLSSTAR | Spanish | STD_SPA | 6.24 | 99% |
| 837 | ALLSSTAR | Spanish | STD_SPA | 6.48 | 95% |
| J2M | Hoosier | Japanese | STD_ML1 | 5.48 | 98% |
| 544 | ALLSSTAR | Portuguese | STD_ML1 | 5.4 | 96% |
| 839* | ALLSSTAR | Spanish | STD_ML1 | 5.48 | 98% |
| UMD1S* | UMD | Spanish | STD_ML1 | 5.74 | 97% |
| E1M | Hoosier | English | TG_ENG | 1.11 | 100% |
| E5M | Hoosier | English | TG_ENG | 1.42 | 99% |
| S1M | Hoosier | Spanish | TG_SPA | 5.89 | 96% |
| UMD9S | UMD | Spanish | TG_SPA | 5.89 | 99% |
| K7M | Hoosier | Korean | TG_ML1 | 5.77 | 98% |
| F2M | Hoosier | French | TG_ML1 | 4.68 | 95% |
| UMD8S* | UMD | Spanish | TG_ML1 | 4 | 99% |
| UMD7S* | UMD | Spanish | TG_ML1 | 5 | 97% |
2. Stimuli
The stimuli included BKB/HINT-type sentence sets (Bench et al., 1979; Nilsson et al., 1994) that were altered from their original form to create three levels of stimulus context, listed here from least to greatest amount of available semantic information: anomalous sentences (ANOM), standard sentences (STD), and topic-grouped sentences (TG). The text for the anomalous sentences was constructed by scrambling the keywords of the sentence corpus within grammatical type, such that sentences retained their syntactic structure but were semantically meaningless. For example, the sentence "A/The FARMER KEEPS a/the BULL" becomes "A/The DOG HELPED the POTATOES." The anomalous sentences were then recorded in the Hearing Research Laboratory. Because the anomalous sentences retained the same syntactic structure as the standard sentences, there were no perceived differences in prosody between the two recorded sets. The standard sentence sets contained unaltered sentences presented in randomized order, and the topic-grouped sentences included unaltered sentences presented in lists organized by topic, such as "Food and Drink" or "Transportation and Travel." Listeners were informed of the topic prior to the presentation of the first sentence. The sentences' conformity to the topic categories was confirmed by pilot testing with 14 young, normal-hearing listeners. An additional round of pilot testing was conducted to confirm that all sentences and talkers used in the experiment had a similar, high level of intelligibility: young, normal-hearing listeners (five per talker) listened to and transcribed each sentence in quiet, and these intelligibility scores were used to guide the formation of the experimental lists. See Table II for the mean intelligibility of each talker's stimuli. Stimuli were presented in six-talker babble at a signal-to-noise ratio (SNR) of 0 dB.
The SNR was set at this level following pilot testing with ten young, normal-hearing listeners, and was intended to avoid ceiling and floor effects. It should be noted that different listeners served in each of the pilot studies described above.
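Setting the masker level for a target SNR amounts to scaling the babble so that its level sits the desired number of decibels below the speech level; at 0 dB SNR the two are equal. A sketch of this mixing step, under the assumption that levels are equated on RMS (the function name is ours, not from the study materials):

```python
import numpy as np

def mix_at_snr(speech, babble, snr_db):
    """Scale the babble so that the speech-to-babble RMS ratio equals the
    target SNR in dB, then return the speech + babble mixture."""
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    target_babble_rms = rms(speech) / (10.0 ** (snr_db / 20.0))
    scaled_babble = babble * (target_babble_rms / rms(babble))
    return speech + scaled_babble
```

At `snr_db = 0`, the scaled babble has exactly the same RMS as the speech, corresponding to the presentation level used here.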
C. Procedure
The experimental procedures were carried out using two online data collection platforms: Qualtrics (Qualtrics, Provo, UT) and PennController (Zehr and Schwarz, 2018). First, listeners completed a headphone check screening developed by Woods et al. (2017), which confirmed that listeners were using headphones to complete the experiment, rather than listening in the sound field. The headphone check was implemented via Qualtrics. In each trial of the listening check, listeners were asked to judge which of three presented tones was the softest. Each tone involved stereo presentation, but one of the three tones was presented 180° out of phase across channels. Thus, for listeners not using headphones, the task becomes inordinately difficult due to phase cancellation. Participants who did not pass this screening were disqualified from completing the listening experiment. Following the headphone check, listeners completed a series of Qualtrics-administered questionnaires probing hearing history, language experience, and accent exposure history. All listeners who passed the headphone screening and were not disqualified based on their language and accent exposure histories were then advanced to the listening experiment.
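The logic of the phase-inverted tone can be sketched numerically: over headphones each ear receives one channel, so an anti-phase tone is just as audible as an in-phase one, but loudspeaker playback sums the two channels acoustically. A toy demonstration of this cancellation, assuming an idealized, symmetric acoustic sum (variable names are illustrative):

```python
import numpy as np

fs = 44100                                  # sample rate (Hz)
t = np.arange(int(0.5 * fs)) / fs           # 0.5-s time axis
tone = np.sin(2 * np.pi * 200 * t)          # 200-Hz pure tone

in_phase = np.stack([tone, tone])           # normal stereo presentation
anti_phase = np.stack([tone, -tone])        # one channel inverted 180 degrees

# Idealized sound-field playback: the two channels sum before reaching the ear.
rms = lambda x: np.sqrt(np.mean(np.square(x)))
summed_in_phase = rms(in_phase.sum(axis=0))      # twice the single-channel level
summed_anti_phase = rms(anti_phase.sum(axis=0))  # complete cancellation (0)
```

In the sound field the anti-phase tone is thus strongly attenuated, which makes the loudness judgments unreliable for listeners who are not wearing headphones.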
In each trial of the listening tasks, listeners heard a sentence and were asked to transcribe it to the best of their ability. Written responses were recorded and stored for scoring. Following their transcription, listeners heard the same sentence a second time spoken by the same talker and saw the text of the sentence written on the computer screen in front of them. This explicit feedback was designed to facilitate lexically guided learning (Davis et al., 2005). In each condition, listeners heard an initial set of 30 sentences (adaptation phase), followed by an additional set of 10 sentences (generalization phase featuring a new talker). After completing the speech tasks, all listeners completed the color-naming Stroop task (Stroop, 1935). The speech experiment and Stroop tasks were implemented via PennController (Zehr and Schwarz, 2018).
A total of nine conditions were included in the experiment, crossing three levels of supportive semantic context (anomalous, standard, topic-grouped) with three levels of talker type [native English (ENG), native Spanish (SPA), and multiple L1s (ML1)]. Three talkers with three different language backgrounds were heard during the ML1 adaptation condition. This set of conditions allows for an examination not just of the main effect of context, but also of any potential interactions of a theorized context benefit with the degree of stimulus variability imposed by the various talker conditions. In each condition, adaptation was immediately followed by a test of generalization to an unfamiliar talker: listeners heard 10 sentences produced by an unfamiliar talker who shared an L1 with the talkers heard during adaptation. For the ML1 conditions, the generalization talker's L1 was Spanish, which was one of the accents heard during adaptation. Within each condition, the talkers used for adaptation and generalization were alternated and counterbalanced in order to minimize potential talker effects. For example, in the single ENG condition, half of the listeners heard talker E1M for adaptation and E5M for generalization, and for the other half of the listeners, talker E5M was heard for adaptation and talker E1M was heard for generalization. The context level of the generalization sentences was the same as in the adaptation condition, and the structure of the trials was identical.
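The alternation of adaptation and generalization talkers across listeners can be sketched as a simple round-robin assignment. The helper below is hypothetical (the study does not describe its assignment code); the talker IDs are the two native English Hoosier talkers from Table II:

```python
from itertools import cycle

def counterbalance(listener_ids, talker_pair):
    """Alternate which talker of the pair serves for adaptation vs.
    generalization, so each talker fills each role for half the listeners."""
    a, b = talker_pair
    orders = cycle([(a, b), (b, a)])  # (adaptation talker, generalization talker)
    return dict(zip(listener_ids, orders))

assignments = counterbalance(["L01", "L02", "L03", "L04"], ("E1M", "E5M"))
# L01/L03 adapt to E1M and generalize to E5M; L02/L04 get the reverse order.
```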
III. STATISTICAL ANALYSES
The speech recognition analyses center around three main outcome measures: time course of adaptation, magnitude of adaptation, and generalization to unfamiliar talkers with familiar accents.
A. Time course of adaptation
In order to model the non-linear time course patterns of adaptation, generalized additive mixed models (GAMMs) (Hastie and Tibshirani, 1987; Wood, 2006) were utilized. GAMMs include both traditional parametric model terms and smooth terms. These smooth terms (or "smooths") allow for the description of non-linearity, estimating the degree of non-linearity from the data. GAMM analysis thus allows for a determination of whether various predictor variables change the time course of the dependent variable, and if so, how those time-course patterns differ, which makes it well suited to examinations of rapid adaptation. GAMM analyses for time course of adaptation were built following the recommendations of Wieling (2018) and Sóskuthy (2021), utilizing binary and ordinal contrast coding schemes to target the comparisons of interest within the analyses. All GAMMs included random smooths for token and subject. Bonferroni correction was used with ordinal-coded models, as each comparison is represented twice within the same model in order to examine both intercept-level and slope-level effects (Sóskuthy, 2021). Weighted binomial distributions were utilized in the models.
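The GAMMs themselves are fit with dedicated software, but the binary difference coding underlying model terms such as s(Trial):Is_std_bin in Table IV reduces to a 0/1 indicator column per non-reference level. A schematic of that coding step only (the data frame is invented; the analysis itself would be run with mixed-model GAMM software):

```python
import pandas as pd

df = pd.DataFrame({"context": ["ANOM", "STD", "TG", "STD", "ANOM", "TG"]})

# With ANOM as the reference level, each remaining context level gets its own
# 0/1 indicator; a difference smooth such as s(Trial):Is_std_bin is then
# estimated only over rows where the indicator equals 1.
df["Is_std_bin"] = (df["context"] == "STD").astype(int)
df["Is_tg_bin"] = (df["context"] == "TG").astype(int)
```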
B. Magnitude of adaptation
Magnitude of adaptation was derived by calculating a relative change measure comparing performance at the start and end of adaptation [(End − Start)/Start]. For this measure, "start" and "end" consisted of the average of the first and last five trials of adaptation, respectively. The relative change measures were analyzed using multiple linear regression. Talker type and context level were evaluated, and their interaction was inspected for contribution to model fit.
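The relative change measure can be written as a small function; the sketch below assumes trial-level proportion-correct scores (the function name is ours):

```python
def magnitude_of_adaptation(trial_scores, window=5):
    """Relative change from start to end of adaptation, (end - start) / start,
    where start and end are means of the first and last `window` trials."""
    start = sum(trial_scores[:window]) / window
    end = sum(trial_scores[-window:]) / window
    return (end - start) / start

# e.g., improving from 50% to 75% keywords correct gives a relative change of 0.5
scores = [0.50] * 5 + [0.60] * 20 + [0.75] * 5
change = magnitude_of_adaptation(scores)  # 0.5
```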
C. Generalization
In order to examine generalization to an unfamiliar talker, a generalized linear mixed effects regression (GLMER) was constructed with proportion keywords correct included as the dependent variable. A three-level factor of test (start of adaptation, end of adaptation, generalization; reference: generalization) was included as a predictor variable, allowing for an evaluation of the performance at generalization as an improvement relative to the start of adaptation, and as a maintenance of performance at the end of adaptation. For this analysis, “start” and “end” included ten trials each, in order to make a balanced comparison with the 10 trials included in the generalization phase. Talker type, context level, and their interactions were all evaluated as predictors for generalization.
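Constructing the three-level test factor amounts to labelling matched 10-trial windows and dummy-coding them with generalization as the reference level. A sketch of that data-preparation step only (the labelling function is hypothetical; the GLMER itself would be fit with mixed-model software):

```python
import pandas as pd

def label_phases(n_adapt=30, n_gen=10, window=10):
    """Label each trial as start of adaptation, end of adaptation, or
    generalization; mid-adaptation trials are excluded from this contrast."""
    labels = []
    for trial in range(1, n_adapt + n_gen + 1):
        if trial <= window:
            labels.append("start")
        elif n_adapt - window < trial <= n_adapt:
            labels.append("end")
        elif trial > n_adapt:
            labels.append("gen")
        else:
            labels.append(None)
    return labels

phases = label_phases()
# Dummy coding with generalization ("gen") as the reference level:
test_factor = pd.Categorical(phases, categories=["gen", "start", "end"])
dummies = pd.get_dummies(test_factor, drop_first=True)  # columns: start, end
```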
IV. RESULTS
A. Single talker conditions
Performance patterns for the two single-talker conditions are visualized in Fig. 1. The effects of talker language and context level were examined using an ordinal-coded GAMM, which allowed for consideration of both intercept- and slope-related differences, i.e., whether significant effects were due to differences in overall performance level, differences in the pattern of speech recognition across trials, or both. A full model was run including contrast-coded model terms for the effects of condition, talker, and their interactions. The random effects structure included random smooths for participant and token. The model summary is contained in Table III.
FIG. 1.

Speech recognition performance over the course of 30 trials for sentences spoken by a single native English talker (solid lines) and single native Spanish (dotted lines) talker, with separate panels for each level of semantic context. Each point represents the raw group mean for a single trial. Lines represent the predicted values generated by the GAMM analysis, with shading representing the 95% confidence interval. ANOM: semantically anomalous sentences; STD: standard sentences; TG: topic-grouped sentences.
TABLE III.
GAMM including ordinal terms to compare single talker conditions. Reference levels: Context = ANOM; Talker = SPA. Note that an alpha of 0.025 is used for significance testing due to Bonferroni correction for an ordinal-coded model.
| Accuracy | ||||
|---|---|---|---|---|
| Parametric coefficients | Estimate | Std. Error | z-value | p |
| (Intercept) | 0.55 | 0.18 | 3.11 | <0.01 |
| Is_std_ordTRUE | 0.01 | 0.25 | 0.03 | 0.97 |
| Is_tg_ordTRUE | 0.40 | 0.25 | 1.60 | 0.11 |
| Is_eng_ordTRUE | 0.79 | 0.25 | 3.16 | <0.01 |
| Is_std_eng_ordTRUE | 1.63 | 0.36 | 4.44 | <0.0001 |
| Is_tg_eng_ordTRUE | 0.21 | 0.35 | 0.59 | 0.56 |
| Smooth terms | edf | Ref.df | χ 2 | p |
| s(Trial) | 6.14 | 7.18 | 35.18 | <0.0001 |
| s(Trial): Is_std_ordTRUE | 3.00 | 3.63 | 13.70 | <0.01 |
| s(Trial): Is_tg_ordTRUE | 3.24 | 3.93 | 14.7 | <0.01 |
| s(Trial): Is_eng_ordTRUE | 1.00 | 1.00 | 0.07 | 0.80 |
| s(Trial): Is_std_eng_ordTRUE | 3.18 | 3.88 | 4.66 | 0.25 |
| s(Trial): Is_tg_eng_ordTRUE | 1.00 | 1.00 | 1.71 | 0.19 |
| s(Trial, Subject) | 367.15 | 537.00 | 1659.09 | <0.0001 |
| s(Trial, Token) | 550.77 | 718.00 | 3468.63 | <0.0001 |
Releveling was used to examine the comparisons of interest not represented in the model seen in Table III. The models indicated that, at each level of context (anomalous, standard, topic-grouped), performance was significantly lower for speech produced by the native Spanish talker as compared to the native English talker (STD: β = 2.4, SE = 0.26, z = 9.3, p < 0.001; TG: β = 0.99, SE = 0.25, z = 3.95, p < 0.001; ANOM: β = 0.79, SE = 0.25, z = 3.16, p < 0.01). Additionally, the parametric interaction of talker and context was significant for the standard vs anomalous (β = 1.63, SE = 0.36, z = 4.44, p < 0.001) and topic-grouped vs standard comparisons (β = –1.42, SE = 0.36, z = –3.95, p < 0.001), but not the anomalous vs topic-grouped comparison (β = 0.21, SE = 0.35, z = 0.59, p = 0.56). These interactions indicate that the overall talker effect was larger in the standard condition than in either of the other two conditions. This effect is driven by higher performance in the standard condition with the native English talker, whereas all three context levels for the native Spanish talker elicited similar overall performance levels.
Performance patterns, represented by the smooth terms, did not differ significantly by talker type within any of the context levels (STD: edf = 1.73, χ2 = 2.65, p = 0.31; TG: edf = 1.08, χ2 = 2.22, p = 0.14; ANOM: edf = 1.0, χ2 = 0.06, p = 0.8). The patterns of adaptation did, however, differ across context types: the pattern of adaptation to the anomalous sentences was significantly different from the patterns for both the standard and topic-grouped sentences (ANOM vs STD: edf = 3.0, χ2 = 13.7, p < 0.01; ANOM vs TG: edf = 3.24, χ2 = 14.7, p < 0.01; note these values reflect the reference levels, i.e., the native Spanish conditions). In Fig. 2, the terms from the model representing the smooth condition effects for standard vs anomalous and topic-grouped vs anomalous at the reference level (native Spanish) are visualized.
FIG. 2.
Visualization of the difference smooth terms between the standard and anomalous conditions (left) and the topic-grouped and anomalous conditions (right), for the single native Spanish condition. These curves represent just the differences in smooth patterns; the intercept-level differences are not visualized. Note the estimated difference is plotted in log-odds, due to the logistic modelling approach. Shading represents the 95% confidence interval; regions where the shading deviates from 0 indicate a significant difference between the patterns in the two conditions.
In this figure, the areas where the shaded regions differ from 0 represent the trial ranges of significant difference. Thus, for both the standard and topic-grouped sentences, performance in the early trials is significantly lower than in the anomalous condition. However, the standard and topic-grouped conditions improve rapidly, coming to outperform the anomalous condition in the mid and late trials. It should be noted that these visualized terms represent the condition effects for the native Spanish talker; however, the non-significant interaction smooth terms indicate that these condition effects did not significantly differ for the native English talker. The patterns of performance for the standard and topic-grouped sentences did not significantly differ from one another (STD vs TG: edf = 1.0, χ2 = 1.04, p = 0.31).
In summary, when listening to a single unfamiliar talker, performance was significantly lower for the non-native talker than for the native talker. Speech recognition performance improved over the course of 30 sentences, but the rate of adaptation differed depending on the degree of semantic information available in the sentence. When sentences contained no semantic information (anomalous sentences), the rate of adaptation was more gradual than when sentences had a standard degree of semantic information (standard sentences). The addition of global list-wise context cues (topic-grouped sentences) did not provide additional benefit in terms of an increased rate of adaptation.
B. Non-native talker conditions
Performance patterns for the two non-native talker conditions are visualized in Fig. 3. These two talker conditions were compared using a binary-coded GAMM to examine the effects of talker type, context, and any potential interactions. A full model was fit including the full effects and interactions structure; random effects structure included random smooths for participant and token. Non-significant terms were removed iteratively from the model, until the final model was selected. There was no significant difference between performance with the two talker types (single native Spanish, multiple L1s), nor did talker type interact significantly with context level; these terms were removed from the model (Wieling, 2021). The final model included difference terms for the context effects only and is summarized in Table IV.
FIG. 3.

Speech recognition performance over the course of 30 trials for sentences spoken by a single native Spanish talker (dotted lines) and by multiple talkers with unique L1s (dashed lines), with separate panels for each level of semantic context. Each point represents the raw group mean for a single trial. Lines represent the predicted values generated by the GAMM analysis, with shading representing the 95% confidence interval. ANOM: semantically anomalous sentences; STD: standard sentences; TG: topic-grouped sentences.
TABLE IV.
GAMM including binary difference terms comparing multiple-talker conditions. Reference level: Context = ANOM.
| Accuracy | ||||
|---|---|---|---|---|
| Parametric coefficients | Estimate | Std. Error | z value | p |
| (Intercept) | 0.37 | 0.15 | 2.48 | <0.05 |
| Smooth terms | edf | Ref.df | χ2 | p |
| s(Trial) | 4.83 | 5.77 | 34.8 | <0.0001 |
| s(Trial): Is_std_bin | 4.09 | 4.72 | 16.16 | <0.01 |
| s(Trial): Is_tg_bin | 4.14 | 4.78 | 27.76 | <0.0001 |
| s(Trial, Subject) | 455.9 | 2175.00 | 2071.78 | <0.0001 |
| s(Trial, Token) | 542.49 | 2697.00 | 5017.38 | <0.0001 |
Examination of model terms revealed that the pattern of adaptation to anomalous sentences differed significantly from that for both standard sentences (edf = 4.09, χ2 = 16.16, p < 0.01) and topic-grouped sentences (edf = 4.14, χ2 = 27.76, p < 0.0001). However, when the model was releveled to examine this comparison, there was no significant difference in performance between the standard and the topic-grouped sentences (edf = 42.51, χ2 = 2.43, p = 0.41).
Figure 4 visualizes the difference smooth terms for the two significant condition effects, averaged across talker type. In both cases, performance was similar at the outset of trials, but increased more rapidly in the two context-rich conditions than in the anomalous condition during the early trials. The divergence between conditions slowed during the second half of trials, as listeners in the standard and topic-grouped conditions reached a plateau in performance. These difference curves are collapsed across talker types, as talker type had been dropped from the model as a non-significant predictor.
FIG. 4.
Visualization of the difference smooth terms between the standard and anomalous conditions (left) and the topic-grouped and anomalous conditions (right), averaged across talker conditions. Note that the estimated difference is plotted in log-odds, due to the logistic modeling approach. Shading represents the 95% confidence interval; regions where the shading deviates from 0 indicate a significant difference between performance in the two conditions.
Individual listeners' scores on the Stroop task were evaluated for contribution to the variance in rapid adaptation performance; inclusion of Stroop scores did not significantly contribute to model fit (p = 0.99).
C. Magnitude of adaptation
Magnitude of adaptation was calculated by comparing performance on the initial and final five trials of adaptation. This relative change measure is plotted in Fig. 5 and was compared across conditions. Seven outlier values were removed from the analysis. A linear regression was fit to the data examining the effects of talker type, context level, and their interaction. The final model selected [F(2, 355) = 8.31, R2 = 0.04, p < 0.001] contained only a main effect of talker type; the main effect of context and the interaction of talker type and context did not contribute significantly to model fit (p > 0.05). Individual Stroop effect scores were also examined for contribution to model fit, but their inclusion did not improve the model (p > 0.05). The final model output can be found in Table V. The effect of talker type indicates that the magnitude of adaptation was greater for both non-native talker conditions than for the ENG condition (ENG vs SPA: β = 0.26, SE = 0.09, t = 2.96, p < 0.01; ENG vs ML1: β = 0.35, SE = 0.09, t = 3.91, p < 0.001). The magnitude of adaptation did not differ significantly between the two non-native talker conditions (SPA vs ML1: β = 0.08, SE = 0.09, t = 0.95, p = 0.35). Notably, the magnitude of adaptation for the native English talker was itself significantly greater than 0 (β = 0.15, SE = 0.06, t = 2.40, p < 0.05), meaning that improvement over trials was seen even for the native English talker. In sum, magnitude of adaptation was greater for the non-native talker conditions than for the native English condition, regardless of context level.
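The magnitude measure and the 2-SD outlier criterion can be illustrated as follows. The exact relative-change formula is not stated in the text, so the normalization below ((late − early) / early) is an assumption, and the function names are hypothetical.

```python
import numpy as np

def magnitude_of_adaptation(trial_scores, k=5):
    """Relative change in accuracy from the first k to the last k
    adaptation trials. NOTE: the paper reports a 'relative change
    measure' without giving its formula; (late - early) / early is an
    assumed normalization for illustration."""
    scores = np.asarray(trial_scores, dtype=float)
    early = scores[:k].mean()
    late = scores[-k:].mean()
    return (late - early) / early

def drop_outliers(values, n_sd=2):
    """Remove values more than n_sd standard deviations from the mean,
    matching the 2-SD exclusion criterion reported for Fig. 5."""
    values = np.asarray(values, dtype=float)
    mu, sd = values.mean(), values.std()
    return values[np.abs(values - mu) <= n_sd * sd]
```

For example, a listener whose per-trial accuracy rises from about 0.4 in the first five trials to 0.6 in the last five would receive a magnitude of 0.5 under this assumed normalization, i.e., a 50% relative improvement.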
FIG. 5.

Magnitude of rapid adaptation observed in each condition. Outlier values falling outside of 2 standard deviations from the mean were removed from plots and analysis. Error bars reflect standard error. ANOM: semantically anomalous sentences; STD: standard sentences; TG: topic-grouped sentences.
TABLE V.
General linear regression for magnitude of adaptation. Reference level: Talker = ENG.
| Relative Change | ||||
|---|---|---|---|---|
| Predictors | Estimate | std. Error | t-value | p |
| (Intercept) | 0.15 | 0.06 | 2.40 | <0.05 |
| Stim [SPA] | 0.26 | 0.09 | 2.96 | <0.01 |
| Stim [M.L1] | 0.35 | 0.09 | 3.91 | <0.001 |
| Observations | 358 | |||
| R2 / R2 adjusted | 0.045 / 0.039 | |||
D. Generalization to an unfamiliar talker
Performance on the generalization task was compared to performance at both the starting and ending points of the adaptation period. Speech recognition scores for the first and last 10 trials of adaptation were averaged and are plotted alongside the 10 generalization trials in Fig. 6. The speech recognition scores for these three time points were fit with a GLMER using a weighted binomial distribution, following the recommendations of Hox et al. (2010), in order to examine the effects of talker type, context level, and individual predictors. The final model selected to describe the data is summarized in Table VI.
FIG. 6.

Generalization to an unfamiliar talker speaking with a familiar accent. Data displayed include the first 10 trials of adaptation (Start), the last 10 trials of adaptation (End), and the generalization phase (Gen). Error bars indicate standard error of the mean. ANOM: semantically anomalous sentences; STD: standard sentences; TG: topic-grouped sentences.
TABLE VI.
Generalization to an unfamiliar talker. Reference levels: Test = Generalization; Stimulus type = NE; Context = ANOM.
| Accuracy | ||||
|---|---|---|---|---|
| Predictors | Odds Ratios | Std. Error | z-value | p |
| (Intercept) | 3.92 | 0.76 | 7.06 | <0.001 |
| Test [Start] | 0.85 | 0.06 | −2.37 | <0.05 |
| Test [End] | 1.19 | 0.08 | 2.49 | <0.05 |
| Stim [SPA] | 0.46 | 0.12 | −2.87 | <0.01 |
| Stim [M.L1] | 0.28 | 0.08 | −4.27 | <0.001 |
| Context [STD] | 6.00 | 1.67 | 6.43 | <0.001 |
| Context [TG] | 0.87 | 0.31 | −0.40 | 0.69 |
| Test [Start] * Stim [SPA] | 1.00 | 0.10 | 0.02 | 0.99 |
| Test [End] * Stim [SPA] | 0.94 | 0.09 | −0.62 | 0.54 |
| Test [Start] * Stim [M.L1] | 0.98 | 0.09 | −0.20 | 0.84 |
| Test [End] * Stim [M.L1] | 0.93 | 0.09 | −0.75 | 0.45 |
| Test [Start] * Context [STD] | 0.89 | 0.11 | −0.95 | 0.34 |
| Test [End] * Context [STD] | 0.96 | 0.13 | −0.33 | 0.74 |
| Test [Start] * Context [TG] | 1.87 | 0.64 | 1.81 | 0.07 |
| Test [End] * Context [TG] | 2.16 | 0.75 | 2.22 | <0.05 |
| Stim [SPA] * Context [STD] | 0.17 | 0.07 | −4.56 | <0.001 |
| Stim [M.L1] * Context [STD] | 0.32 | 0.13 | −2.73 | <0.01 |
| Stim [SPA] * Context [TG] | 1.56 | 0.80 | 0.87 | 0.38 |
| Stim [M.L1] * Context [TG] | 5.18 | 2.70 | 3.15 | <0.01 |
| (Test [Start] * Stim [SPA]) * Context [STD] | 0.93 | 0.15 | −0.45 | 0.65 |
| (Test [End] * Stim [SPA]) * Context [STD] | 1.07 | 0.18 | 0.41 | 0.68 |
| (Test [Start] * Stim [M.L1]) * Context [STD] | 0.92 | 0.14 | −0.51 | 0.61 |
| (Test [End] * Stim [M.L1]) * Context [STD] | 1.26 | 0.21 | 1.43 | 0.15 |
| (Test [Start] * Stim [SPA]) * Context [TG] | 0.48 | 0.23 | −1.50 | 0.13 |
| (Test [End] * Stim [SPA]) * Context [TG] | 0.64 | 0.31 | −0.90 | 0.37 |
| (Test [Start] * Stim [M.L1]) * Context [TG] | 0.19 | 0.10 | −3.28 | <0.01 |
| (Test [End] * Stim [M.L1]) * Context [TG] | 0.26 | 0.13 | −2.73 | <0.01 |
| Random Effects | ||||
| σ2 | 3.29 | |||
| τ00 token | 1.54 | |||
| τ00 Name | 0.41 | |||
| ICC | 0.37 | |||
| N Name | 361 | |||
| N token | 550 | |||
| Observations | 15 748 | |||
| Marginal R2 / Conditional R2 | 0.135 / 0.457 | |||
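Table VI reports the fixed effects as odds ratios rather than on the model's native log-odds scale. As a reminder of the conversion (a generic helper with made-up inputs, not the study's analysis code), an estimate b with standard error se on the log-odds scale maps to an odds ratio exp(b) with Wald 95% CI exp(b ± 1.96·se):

```python
import math

def odds_ratio_ci(log_odds, se, crit=1.96):
    """Convert a logistic-regression coefficient (log-odds scale) and its
    standard error into an odds ratio with a Wald 95% confidence interval."""
    point = math.exp(log_odds)
    lower = math.exp(log_odds - crit * se)
    upper = math.exp(log_odds + crit * se)
    return point, lower, upper

# A coefficient of 0 maps to an odds ratio of 1 (no effect); the CI
# excludes 1 exactly when |log_odds / se| exceeds the critical value.
```

Under this convention, odds ratios below 1 in Table VI (e.g., the Stim terms) correspond to negative coefficients, consistent with their negative z values.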
The significant three-way interaction of test point, talker type, and context level was examined using the emmeans package. The post hoc comparisons showed that, in the anomalous and standard conditions, performance at generalization was consistently higher than at the start of adaptation (p < 0.05, all comparisons). In the topic-grouped conditions, performance at generalization was higher than at the start of adaptation only for the multiple talker condition (β = −1.2, SE = 0.35, z = 3.37, p < 0.01); for the single talker conditions, the difference was not significant (p > 0.05, both comparisons). When comparing generalization with the end of adaptation, performance was typically stable, with the exception of three conditions in which performance at generalization was significantly lower (ANOM_ENG, STD_ML1, TG_ENG; p < 0.05, all comparisons).
This analysis also allowed for an alternate measure of adaptation; performance was significantly higher at the end of adaptation than at the start for all conditions tested (p < 0.05, all conditions). This finding supplements those seen in the nonlinear time course analyses above: recognition improved over the course of 30 trials for all combinations of context level and talker type.
In summary, listeners' performance improved during the course of adaptation and, in most conditions, listeners were able to maintain these improvements when tested on a new talker. The two conditions where listeners did not maintain improved performance at generalization were the two single-talker conditions with topic-grouped stimuli. One possible explanation for this finding is that, in the single-talker conditions, the supra-sentence context allowed listeners to rely on the top-down information for recognition, rather than using the contextual information to help adjust their internal category boundaries for processing the acoustic input. The acoustic challenge posed by the multiple-talker condition may have been sufficient to trigger some adjustment to the bottom-up processes for recognition that carried over to the generalization phase in this condition.
V. DISCUSSION
The goal of this study was to evaluate two stimulus-related factors for their effects on rapid adaptation to unfamiliar speech. A series of listening conditions were evaluated that varied the level of semantic context available to the listener, as well as the type and number of talkers. It was expected that increasing the level of semantic context would facilitate rapid adaptation, and that increasing stimulus variability would hinder adaptation. It was also hypothesized that semantic context and stimulus variability would interact, such that the benefit of context would be greatest for the intermediate level of stimulus variability, i.e., when listeners heard a single native Spanish talker.
A. Effects of talker type on recognition and adaptation
When the two single-talker conditions were compared, a clear difference in overall speech recognition performance was observed. Listeners recognized speech from a single native English talker more accurately than speech from a single native Spanish talker. This finding is consistent with prior literature; the presence of a non-native accent is known to impede speech recognition, especially in the presence of competing talkers (Gordon-Salant et al., 2010a, 2010b). Non-native speech contains alterations to the segmental, subsegmental, and suprasegmental features of speech that can lead to misperceptions (Flege, 1988). Fortunately, young adults are able to quickly adjust to an unfamiliar non-native accented talker, as is seen in this and prior studies (Clarke and Garrett, 2004; Bradlow and Bent, 2008; Sidaras et al., 2009).
Performance in the single non-native talker condition was also compared with a multiple-talker condition, in which all talkers had different language backgrounds. In many prior reports, speech recognition and recall were lower for conditions containing multiple vs single talkers (Mullennix et al., 1989; Goldinger et al., 1991; Sommers et al., 1994; Nygaard et al., 1995); this effect was not seen in the present study. There were no significant differences in overall level of performance between the single and multiple non-native conditions. One possible explanation for the lack of an effect of talker variability may be that both conditions contained stimuli from non-native, accented talkers. The classic reports cited above of the detrimental effect of talker variability on speech recognition compare performance with native talkers. Perhaps the challenge imposed by non-native speech is more salient than that imposed by multiple talkers, when all three talkers have similar intelligibility and ratings of accent strength.
A few prior studies have specifically examined the effects of single vs multiple talkers when measuring rapid adaptation to non-native speech. Bent and Holt (2013) compared word identification performance with single and multiple (n = 4) non-native talkers and found a detriment of multiple talkers. However, their study used individual word stimuli, whereas the present study utilized sentence-length stimuli, which are known to elicit relatively higher performance than single words. Kapolowicz et al. (2018) included conditions comparing performance on IEEE sentences (IEEE, 1969) with a single non-native talker vs five non-native talkers and found that performance was lower for the multiple talker condition. The IEEE sentences are complex sentences that are not representative of everyday speech (e.g., "The birch canoe slid on the smooth planks."). The HINT/BKB sentences used in the present study were originally designed to test speech recognition in children (Bench et al., 1979) and therefore are relatively less challenging than those presented in the two prior studies.
When considered in conjunction with the prior studies utilizing native [e.g., Mullennix et al. (1989) and Sommers et al. (1994)] and non-native talkers (Bent and Holt, 2013; Kapolowicz et al., 2018), the present study is distinctive in not finding a detrimental effect of multiple talkers. It may be that the talker and stimulus combinations used in the present study represent an intermediate level of challenge. As a result, the detrimental effect of a multiple talker condition is not observed under the present conditions, though it is observed when the overall level of performance is made higher (e.g., with native talkers only) or lower (e.g., with a more challenging sentence set), as in the studies described above.
Luthra et al. (2021) found that listeners required additional trials to achieve the same degree of learning for an ambiguous phoneme in a two-talker condition as opposed to two sequential single-talker conditions. This slowed rate of learning is taken to indicate that listeners experience a perceptual cost when adapting simultaneously to multiple talkers, as they must update multiple internal models at once. This talker-specific understanding of the perceptual adaptation process is not supported by the present study, in which the addition of multiple talkers did not slow learning. Rather, the present findings may be more closely aligned with a belief-updating model of adaptation (Kleinschmidt and Jaeger, 2011, 2016), where learning is guided by the listeners' expectations of the talker, which are partially informed by their overall listening experiences with any talker. One key difference between the present findings and those of Luthra et al. (2021) is the use of naturally occurring accents rather than a synthetic phoneme. Most listeners experience a range of accents and dialects in their everyday lives, and thus draw on their expectations of these talkers when recognizing and adapting to their speech (Hanulíková et al., 2012; Hanulíková and Weber, 2012; Vaughn, 2019), which may reduce or eliminate the need for updating multiple generative models in this process. An explicit comparison of adaptation to naturalistic vs synthetic speech may help distinguish these underlying processing theories, though the balance between ecological validity and experimental control is inherently challenging.
The nature of the between-subjects study design introduced a potential confound into the study, in that different talkers with different language backgrounds were used for each condition. To minimize this potential confound, the talkers were balanced in terms of perceived accent strength, and all stimuli used in the study had similar, high levels of baseline intelligibility in quiet. Additionally, counter-balancing within condition and the random effects structure utilized in the statistical models were expected to minimize the potential talker effects. Nevertheless, the findings presented may be limited in their generalizability to different stimulus types and should be replicated for confirmation. An additional possible limitation is the use of a single native and single non-native talker in the single talker conditions. Although the overall difference in recognition of native speech vs non-native speech is comparable to that seen in other studies, the effect of context and pattern of rapid adaptation may be talker specific, suggesting that patterns of rapid adaptation with different types of context would be important to examine in additional single talkers.
B. Effects of semantic context on recognition and adaptation
Across all three levels of context, listeners showed improvements in speech recognition performance over the course of the 30 sentences. However, the patterns of improvement over trials differed depending on context level. When sentences were syntactically correct but lacking semantic meaning, the pattern of adaptation was linear, and relatively shallow. This pattern contrasted with those seen for standard and topic-grouped sentences. In these conditions, listeners showed a steep initial increase in performance, with a plateau and/or shallower improvement in the second half of trials. These performance differences were seen across all three talker types.
This general finding aligns with the literature indicating that lexical and contextual information support perceptual learning for challenging speech stimuli (Norris et al., 2003; Davis et al., 2005; Maye et al., 2008; Hervais-Adelman et al., 2011; Jesse and McQueen, 2011). In these prior studies of lexically guided learning, there is evidence that the degree of semantic context available does not influence rate or magnitude of learning, provided that some meaningful lexical information in the listener's native language is available (Baese-Berk et al., 2021; Davis et al., 2005; Cooper and Bradlow, 2016; Luthra et al., 2020). While all conditions in this study did contain intact lexical information [i.e., this study did not contain non-word conditions, as seen in studies by Davis et al. (2005) and Cooper and Bradlow (2016)], the rate of learning was slowed in conditions where the sentences were semantically anomalous. The present finding contrasts with the findings of Baese-Berk et al. (2021), who noted that the rate of adaptation to a single Mandarin-accented talker was similar for high-predictability, low-predictability, and anomalous sentences. However, that conclusion was based on a comparison of the first and second blocks of training, each consisting of 20 trials, rather than on detailed trial-by-trial learning trajectories. That coarser analysis makes a direct comparison across studies challenging, particularly given that rapid adaptation can be observed in as few as 10 to 20 trials (Adank and Janse, 2010; Brown et al., 2020; Davis et al., 2005; Golomb et al., 2007; Peelle and Wingfield, 2005).
It was hypothesized that this study would find not only a benefit for semantically rich sentences as compared to semantically anomalous sentences, but an additional advantage of supra-sentence context in the form of topic-grouping; this enhancement was not seen for any of the talker conditions. One possible explanation for this lack of additional benefit may be related to the sentences used as stimuli in this experiment. The HINT/BKB sentences are relatively simple, declarative sentences that contain a relatively high level of internal contextual information as compared to more challenging corpora such as the Harvard IEEE sentences (IEEE, 1969). Thus, the topic groupings may not have been beneficial in boosting learning for these simple sentences.
The additional challenges imposed by the non-native talker(s) utilized in this experiment were expected to increase listeners' reliance on contextual information, resulting in interactions of context and talker type on recognition of and adaptation to the non-native speech. These interactions were not observed, contrasting with prior findings of an interaction of context level and stimulus quality (Aydelott et al., 2006; Goy et al., 2013; Winn, 2016). However, the alterations to stimulus type in the present study differ from those in prior studies; here, listeners were tested on non-native English speech, while in prior studies the listeners heard low-pass filtered, vocoded, and time-compressed speech. These results suggest that the context benefit is similar when listening to non-native speech, a naturalistic form of signal alteration, as compared to native English speech. The study additionally contrasts with the findings of Behrman and Akhund (2013) and Bent et al. (2019), who showed an increased context benefit for non-native speech, particularly in stronger accent conditions. In the present study, the non-native talkers were selected to have a similar, moderate level of accent strength. Thus, the contrast between single and multiple non-native talkers may not have been sufficiently detrimental to trigger an increased reliance on semantic context for recognition and learning.
Overall, the findings of the current study generally align with a model of lexically guided adaptation in which listeners take advantage of linguistic information contained within the target signal in order to adjust their internal parameters for the mapping of sound to meaning. In all conditions examined here, listeners were able to improve their recognition of an unfamiliar talker over the duration of 30 trials. The findings also support prior studies that found a benefit of any level of linguistic information in supporting adaptation (Cooper and Bradlow, 2016; Davis et al., 2005), and extend them to show that, when examining the trial-by-trial performance patterns, constrained semantic information may initially slow adaptation, even if magnitude of adaptation is ultimately shown to be equivalent across sentence types.
C. Magnitude of adaptation
In addition to examining the time-course patterns of rapid adaptation, this study measured the magnitude of adaptation to each context and stimulus variability manipulation. Magnitude was measured as a comparison of performance between the first and last five trials of the adaptation condition. The analyses showed that talker type had an effect on magnitude of adaptation, with the native talker condition showing a smaller magnitude than either of the non-native talker conditions. This reduced magnitude for native talker conditions was true regardless of the level of context. This finding likely relates to the overall higher level of speech recognition performance for the native as compared to the non-native talker conditions. In conditions where starting performance is lower, listeners often show greater magnitude of learning and adaptation. This effect of starting level on auditory learning and adaptation has been observed by others for rapid speech (Manheim et al., 2018; Peelle and Wingfield, 2005) and foreign-accented speech (Banks et al., 2015; Tzeng et al., 2016). In the present study, the effects of accent and starting performance level on magnitude of adaptation cannot be disentangled; controlling for starting level of performance could help highlight whether magnitude of adaptation is greater for a non-native talker as compared to a native talker regardless of overall performance level.
The results of this study indicate that the presence of a non-native accent influences the magnitude of adaptation, while the presence of semantically meaningful sentences increases the rate of adaptation to unfamiliar speech. The dissociation between the relative influences of talker type and sentence type on rate versus magnitude of adaptation is notable and underscores the value of examining these measures separately. In the present findings, adaptation to standard and topic-grouped sentences has a steeper initial rate, but performance plateaus with additional sentences. This plateau likely contributes to the similarity in magnitude between these conditions and the anomalous condition: the overall change is similar in all conditions despite differences in how listeners achieved it. While magnitude may be the most ecologically relevant outcome, future studies that compare outcomes in terms of both rate and magnitude are critical in constructing a model of rapid adaptation that allows for optimization of both outcomes.
D. Generalization
Generalization to unfamiliar talkers was examined in this study. While the benefit of lexico-semantic information for perceptual learning of speech is well-documented (Davis et al., 2005; Cooper and Bradlow, 2016), it was unclear whether the level of context present in the adaptation stimulus would influence the degree of generalization of learning. The literature indicates that exposure to multiple talkers during an adaptation phase is beneficial in facilitating generalization, above adapting to a single talker (Bradlow and Bent, 2008; Sidaras et al., 2009; Baese-Berk et al., 2013).
In each condition, generalization was tested with an unfamiliar talker who shared a language background with a talker heard during adaptation; the level of context was constant between adaptation and generalization. The findings of the generalization analysis indicate that generalization of learning was dependent on both the level of context and talker type. In the anomalous and standard sentence conditions, listeners performed significantly better at generalization than at the start of adaptation, regardless of talker type. In the topic-grouped sentences, higher performance at generalization was only seen for the multiple talker condition; for the single talker conditions, there were no significant differences in performance between start of adaptation and generalization. Start of adaptation and generalization both constitute the first exposure to an unfamiliar talker; higher performance at generalization indicates that learning occurred and was maintained in these conditions, at least to some degree. Generalization to an unfamiliar talker in the anomalous conditions suggests that although the rate of learning was slower in these conditions, the learning experienced was still sufficient to facilitate relatively high recognition of an unfamiliar talker with a familiar accent. Thus, the retuning of internal category boundaries for mapping phonemic input to lexical meaning that occurred during adaptation was slowed by the lack of contextual information but was not eliminated altogether.
These findings expand upon those of Baese-Berk et al. (2021), who found that listeners showed the greatest amount of learning when the level of predictability was similar between training and test items, and that learning did not extend across sentence types. In the present study, sentence type and talker accent were held constant between adaptation and generalization items, but the identity of the talker changed. Thus, under the ideal condition observed by Baese-Berk et al. (2021)—that is, semantic match between training and test items—listeners are able to generalize learning to an unfamiliar talker with a shared accent, even when the sentence is semantically anomalous.
E. Inhibitory control and adaptation
In this study, Stroop scores were measured and examined as a potential predictor for recognition of and adaptation to non-native speech. Stroop was not found to be a significant variable in any of the analyses presented here. This finding contrasts with the findings of Banks et al. (2015), who documented a relationship between Stroop scores and adaptation to non-native speech. However, that prior study utilized an artificial accent, while the accents used in the present study were all naturalistic accents. Listeners encounter and adapt to natural non-native accents in their everyday listening environments, while the artificial accent would have been unfamiliar to all listeners. Thus, it is possible that additional processing demands were imposed by the non-naturalistic accent employed in the Banks et al. (2015) study, which manifested as a greater reliance on inhibitory control.
F. Conclusion
This study evaluated the relative contributions of semantic context and stimulus variability to rapid adaptation and generalization to non-native English speech in young adults with normal hearing. Collectively, the present results indicate that manipulations of bottom-up phonemic detail influenced overall performance levels, and contextual manipulations affected the time-course of adaptation, but that these two effects were independent of one another. In conditions with semantically congruous sentences, the rate of learning was increased, regardless of the stimulus type. However, learning was still observed in the semantically anomalous conditions, and the magnitude of learning and generalization to unfamiliar non-native English talkers was not reduced as compared to the semantically intact conditions. Models of perceptual learning for acoustically ambiguous or challenging speech (Norris et al., 2003) indicate that lexical information influences changes to lower-level processing over the course of learning. In the present study, listeners were presented with lexically intact information (i.e., real words) in all context conditions, and were provided feedback throughout the course of training to support lexically guided learning. The introduction of semantic incongruity may incur a processing cost that disrupts but does not eliminate the lexically guided learning process.
ACKNOWLEDGMENTS
This work was supported by NIH Grant No. T32DC000046 (PIs: C.E. Carr and S.G.S.) and was conducted in the Hearing Research Lab supported by NIH Grant No. R01AG009191 (S.G.-S.). The authors would like to thank Sarah Elazar and Marjan Davoodian for their assistance with stimulus preparation and data collection.
References
- 1. Adank, P., and Janse, E. (2010). “Comprehension of a novel accent by young and older listeners,” Psychol. Aging 25, 736–740. 10.1037/a0020054
- 2. Atagi, E., and Bent, T. (2013). “Auditory free classification of nonnative speech,” J. Phon. 41, 509–519. 10.1016/j.wocn.2013.09.003
- 3. Aydelott, J., Dick, F., and Mills, D. L. (2006). “Effects of acoustic distortion and semantic context on event-related potentials to spoken words,” Psychophysiology 43, 454–464. 10.1111/j.1469-8986.2006.00448.x
- 4. Babel, M., McAuliffe, M., Norton, C., Senior, B., and Vaughn, C. (2019). “The goldilocks zone of perceptual learning,” Phonetica 76, 179–200. 10.1159/000494929
- 5. Baese-Berk, M. M., Bent, T., and Walker, K. (2021). “Semantic predictability and adaptation to nonnative speech,” JASA Express Lett. 1, 015207. 10.1121/10.0003326
- 6. Baese-Berk, M. M., Bradlow, A. R., and Wright, B. A. (2013). “Accent-independent adaptation to foreign accented speech,” J. Acoust. Soc. Am. 133, EL174–EL180. 10.1121/1.4789864
- 7. Baese-Berk, M. M., and Morrill, T. H. (2015). “Speaking rate consistency in native and non-native speakers of English,” J. Acoust. Soc. Am. 138, EL223–EL228. 10.1121/1.4929622
- 8. Banks, B., Gowen, E., Munro, K. J., and Adank, P. (2015). “Cognitive predictors of perceptual adaptation to accented speech,” J. Acoust. Soc. Am. 137, 2015–2024. 10.1121/1.4916265
- 9. Behrman, A., and Akhund, A. (2013). “The influence of semantic context on the perception of Spanish-accented American English,” J. Speech, Lang. Hear. Res. 56, 1567–1578. 10.1044/1092-4388(2013/12-0192)
- 10. Bench, J., Kowal, A., and Bamford, J. (1979). “The BKB (Bamford-Kowal-Bench) sentence lists for partially-hearing children,” Br. J. Audiol. 13, 108–112. 10.3109/03005367909078884
- 11. Bent, T., and Holt, R. F. (2013). “The influence of talker and foreign-accent variability on spoken word identification,” J. Acoust. Soc. Am. 133, 1677–1686. 10.1121/1.4776212
- 12. Bent, T., Holt, R. F., Miller, K., and Libersky, E. (2019). “Sentence context facilitation for children's and adults' recognition of native- and nonnative-accented speech,” J. Speech, Lang. Hear. Res. 62, 423–433. 10.1044/2018_JSLHR-H-18-0273
- 13. Bradlow, A. (2022). “OSCAAR—The online speech/corpora archive & analysis resource,” https://oscaar3.ling.northwestern.edu/#!/about (Last viewed March 26, 2020).
- 14. Bradlow, A. R., and Bent, T. (2008). “Perceptual adaptation to non-native speech,” Cognition 106, 707–729. 10.1016/j.cognition.2007.04.005
- 15. Brown, V. A., McLaughlin, D. J., Strand, J. F., and van Engen, K. J. (2020). “Rapid adaptation to fully intelligible nonnative-accented speech reduces listening effort,” Q. J. Exp. Psychol. 73, 1431–1443. 10.1177/1747021820916726
- 16. Cassarly, C., Matthews, L. J., Simpson, A. N., and Dubno, J. R. (2020). “The revised hearing handicap inventory and screening tool based on psychometric reevaluation of the hearing handicap inventories for the elderly and adults,” Ear Hear. 41, 95–105. 10.1097/AUD.0000000000000746
- 17. Clarke, C. M., and Garrett, M. F. (2004). “Rapid adaptation to foreign-accented English,” J. Acoust. Soc. Am. 116, 3647–3658. 10.1121/1.1815131
- 18. Cooper, A., and Bradlow, A. R. (2016). “Linguistically guided adaptation to foreign-accented speech,” J. Acoust. Soc. Am. 140, EL378–EL384. 10.1121/1.4966585
- 19. Davis, M. H., Johnsrude, I. S., Hervais-Adelman, A. G., Taylor, K., and McGettigan, C. (2005). “Lexical information drives perceptual learning of distorted speech: Evidence from the comprehension of noise-vocoded sentences,” J. Exp. Psychol. Gen. 134, 222–241. 10.1037/0096-3445.134.2.222
- 20. Dey, A., and Sommers, M. S. (2015). “Age-related differences in inhibitory control predict audiovisual speech perception,” Psychol. Aging 30, 634–646. 10.1037/pag0000033
- 21. Eisner, F., and McQueen, J. M. (2005). “Specificity of perceptual learning,” Percept. Psychophys. 67, 224–238. 10.3758/BF03206487
- 22. Flege, J. E. (1988). “Factors affecting degree of perceived foreign accent in English sentences,” J. Acoust. Soc. Am. 84, 70–79. 10.1121/1.396876
- 23. Goldinger, S. D., Pisoni, D. B., and Logan, J. S. (1991). “On the nature of talker variability effects on recall of spoken word lists,” J. Exp. Psychol. Learn. Mem. Cogn. 17, 152–162. 10.1037/0278-7393.17.1.152
- 24. Golomb, J. D., Peelle, J. E., and Wingfield, A. (2007). “Effects of stimulus variability and adult aging on adaptation to time-compressed speech,” J. Acoust. Soc. Am. 121, 1701–1708. 10.1121/1.2436635
- 25. Gordon-Salant, S., Yeni-Komshian, G. H., and Fitzgibbons, P. J. (2010a). “Recognition of accented English in quiet by younger normal-hearing listeners and older listeners with normal-hearing and hearing loss,” J. Acoust. Soc. Am. 128, 444–455. 10.1121/1.3397409
- 26. Gordon-Salant, S., Yeni-Komshian, G. H., and Fitzgibbons, P. J. (2010b). “Recognition of accented English in quiet and noise by younger and older listeners,” J. Acoust. Soc. Am. 128, 3152–3160. 10.1121/1.3495940
- 27. Goslin, J., Duffy, H., and Floccia, C. (2012). “An ERP investigation of regional and foreign accent processing,” Brain Lang. 122, 92–102. 10.1016/j.bandl.2012.04.017
- 28. Goy, H., Pelletier, M., Coletta, M., and Pichora-Fuller, M. K. (2013). “The effects of semantic context and the type and amount of acoustic distortion on lexical decision by younger and older adults,” J. Speech, Lang. Hear. Res. 56, 1715–1732. 10.1044/1092-4388(2013/12-0053)
- 29. Hanulíková, A., van Alphen, P. M., Goch, M. M., and Weber, A. (2012). “When one person's mistake is another's standard usage: The effect of foreign accent on syntactic processing,” J. Cogn. Neurosci. 24, 878–887. 10.1162/jocn_a_00103
- 30. Hanulíková, A., and Weber, A. (2012). “Sink positive: Linguistic experience with th substitutions influences nonnative word recognition,” Atten. Percept. Psychophys. 74, 613–629. 10.3758/s13414-011-0259-7
- 31. Hastie, T., and Tibshirani, R. (1987). “Generalized additive models: Some applications,” J. Am. Stat. Assoc. 82, 371–386. 10.1080/01621459.1987.10478440
- 32. Hervais-Adelman, A. G., Davis, M. H., Johnsrude, I. S., Taylor, K. J., and Carlyon, R. P. (2011). “Generalization of perceptual learning of vocoded speech,” J. Exp. Psychol. Hum. Percept. Perform. 37, 283–295. 10.1037/a0020772
- 33. Hox, J. J., Moerbeek, M., and van de Schoot, R. (2010). Multilevel Analysis: Techniques and Applications, 2nd ed. (Routledge, New York).
- 34. IEEE (1969). “IEEE recommended practice for speech quality measurements,” IEEE Trans. Audio Electroacoust. 17, 225–246. 10.1109/TAU.1969.1162058
- 35. Janse, E. (2012). “A non-auditory measure of interference predicts distraction by competing speech in older adults,” Aging Neuropsychol. Cogn. 19, 741–758. 10.1080/13825585.2011.652590
- 36. Jesse, A., and McQueen, J. M. (2011). “Positional effects in the lexical retuning of speech perception,” Psychon. Bull. Rev. 18, 943–950. 10.3758/s13423-011-0129-2
- 37. Kalikow, D. N., Stevens, K. N., and Elliott, L. L. (1977). “Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability,” J. Acoust. Soc. Am. 61, 1337–1351. 10.1121/1.381436
- 38. Kapolowicz, M. R., Montazeri, V., and Assmann, P. F. (2018). “Perceiving foreign-accented speech with decreased spectral resolution in single- and multiple-talker conditions,” J. Acoust. Soc. Am. 143, EL99–EL104. 10.1121/1.5023594
- 39. Kleinschmidt, D. F., and Jaeger, T. F. (2016). “What do you expect from an unfamiliar talker?,” in Proceedings of the 38th Annual Meeting of the Cognitive Science Society, edited by J. Trueswell, A. Papafragou, D. Grodner, and D. Mirman (Cognitive Science Society, Austin, TX).
- 40. Kleinschmidt, D., and Jaeger, T. F. (2011). A Bayesian Belief Updating Model of Phonetic Recalibration and Selective Adaptation (Association for Computational Linguistics), pp. 10–19.
- 41. Luthra, S., Magnuson, J. S., and Myers, E. B. (2020). “Boosting lexical support does not enhance lexically guided perceptual learning,” J. Exp. Psychol. Learn. Mem. Cogn. 47, 685–704. 10.1037/xlm0000945
- 42. Luthra, S., Mechtenberg, H., and Myers, E. B. (2021). “Perceptual learning of multiple talkers requires additional exposure,” Atten. Percept. Psychophys. 83, 2217–2228. 10.3758/s13414-021-02261-w
- 43. Manheim, M., Lavie, L., and Banai, K. (2018). “Age, hearing, and the perceptual learning of rapid speech,” Trends Hear. 22, 1–18. 10.1177/2331216518778651
- 44. Maye, J., Aslin, R. N., and Tanenhaus, M. K. (2008). “The Weckud Wetch of the Wast: Lexical adaptation to a novel accent,” Cogn. Sci. 32, 543–562. 10.1080/03640210802035357
- 45. Miller, G. A., Heise, G. A., and Lichten, W. (1951). “The intelligibility of speech as a function of the context of the test materials,” J. Exp. Psychol. 41, 329–335. 10.1037/h0062491
- 46. Mullennix, J. W., Pisoni, D. B., and Martin, C. (1989). “Some effects of talker variability on spoken word recognition,” J. Acoust. Soc. Am. 85, 365–378. 10.1121/1.397688
- 47. Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). “Development of the Hearing In Noise Test for the measurement of speech reception thresholds in quiet and in noise,” J. Acoust. Soc. Am. 95, 1085–1099. 10.1121/1.408469
- 48. Nittrouer, S., and Boothroyd, A. (1990). “Context effects in phoneme and word recognition by young children and older adults,” J. Acoust. Soc. Am. 87, 2705–2715. 10.1121/1.399061
- 49. Norris, D., McQueen, J. M., and Cutler, A. (2003). “Perceptual learning in speech,” Cogn. Psychol. 47, 204–238. 10.1016/S0010-0285(03)00006-9
- 50. Nygaard, L. C., Sommers, M. S., and Pisoni, D. B. (1995). “Effects of stimulus variability on perception and representation of spoken words in memory,” Percept. Psychophys. 57, 989–1001. 10.3758/BF03205458
- 51. Peelle, J. E., and Wingfield, A. (2005). “Dissociations in perceptual learning revealed by adult age differences in adaptation to time-compressed speech,” J. Exp. Psychol. Hum. Percept. Perform. 31, 1315–1330. 10.1037/0096-1523.31.6.1315
- 52. Porretta, V., and Kyröläinen, A. J. (2019). “Influencing the time and space of lexical competition: The effect of gradient foreign accentedness,” J. Exp. Psychol. Learn. Mem. Cogn. 45, 1832–1851. 10.1037/xlm0000674
- 53. Prolific (2022). www.prolific.co (Last viewed May 1, 2021).
- 54. Rönnberg, J., Lunner, T., Zekveld, A., Sörqvist, P., Danielsson, H., Lyxell, B., Dahlström, Ö., Signoret, C., Stenfelt, S., Pichora-Fuller, M. K., and Rudner, M. (2013). “The Ease of Language Understanding (ELU) model: Theoretical, empirical, and clinical advances,” Front. Syst. Neurosci. 7, 1–17. 10.3389/fnsys.2013.00031
- 55. Rönnberg, J., Rudner, M., Foo, C., and Lunner, T. (2008). “Cognition counts: A working memory system for ease of language understanding (ELU),” Int. J. Audiol. 47, S99–S105. 10.1080/14992020802301167
- 56. Samuel, A., and Kraljic, T. (2009). “Perceptual learning for speech,” Atten. Percept. Psychophys. 71, 1207–1218. 10.3758/APP.71.6.1207
- 57. Scharenborg, O., and Janse, E. (2013). “Comparing lexically guided perceptual learning in younger and older listeners,” Atten. Percept. Psychophys. 75, 525–536. 10.3758/s13414-013-0422-4
- 58. Sidaras, S. K., Alexander, J. E. D., and Nygaard, L. C. (2009). “Perceptual learning of systematic variation in Spanish-accented speech,” J. Acoust. Soc. Am. 125, 3306–3316. 10.1121/1.3101452
- 59. Sommers, M. S., and Danielson, S. M. (1999). “Inhibitory processes and spoken word recognition in young and older adults: The interaction of lexical competition and semantic context,” Psychol. Aging 14, 458–472. 10.1037/0882-7974.14.3.458
- 60. Sommers, M. S., Nygaard, L. C., and Pisoni, D. B. (1994). “Stimulus variability and spoken word recognition. I. Effects of variability in speaking rate and overall amplitude,” J. Acoust. Soc. Am. 96, 1314–1324. 10.1121/1.411453
- 61. Sóskuthy, M. (2021). “Evaluating generalised additive mixed modelling strategies for dynamic speech analysis,” J. Phon. 84, 101017. 10.1016/j.wocn.2020.101017
- 62. Stroop, J. R. (1935). “Studies of interference in serial verbal reactions,” J. Exp. Psychol. 18, 643–662. 10.1037/h0054651
- 63. Tzeng, C. Y., Alexander, J. E. D., Sidaras, S. K., and Nygaard, L. C. (2016). “The role of training structure in perceptual learning of accented speech,” J. Exp. Psychol. Hum. Percept. Perform. 42, 1793–1805. 10.1037/xhp0000260
- 64. Vaughn, C., Baese-Berk, M., and Idemaru, K. (2019). “Re-examining phonetic variability in native and non-native speech,” Phonetica 76, 327–358. 10.1159/000487269
- 65. Vaughn, C. R. (2019). “Expectations about the source of a speaker's accent affect accent adaptation,” J. Acoust. Soc. Am. 145, 3218–3232. 10.1121/1.5108831
- 66. Wade, T., Jongman, A., and Sereno, J. (2007). “Effects of acoustic variability in the perceptual learning of non-native-accented speech sounds,” Phonetica 64, 122–144. 10.1159/000107913
- 67. Wieling, M. (2018). “Analyzing dynamic phonetic data using generalized additive mixed modeling: A tutorial focusing on articulatory differences between L1 and L2 speakers of English,” J. Phon. 70, 86–116. 10.1016/j.wocn.2018.03.002
- 68. Wieling, M. (2021). (private communication).
- 69. Winn, M. B. (2016). “Rapid release from listening effort resulting from semantic context, and effects of spectral degradation and cochlear implants,” Trends Hear. 20, 1–17. 10.1177/2331216516669723
- 70. Wood, S. N. (2006). “Low-rank scale-invariant tensor product smooths for generalized additive mixed models,” Biometrics 62, 1025–1036. 10.1111/j.1541-0420.2006.00574.x
- 71. Woods, K. J. P., Siegel, M. H., Traer, J., and McDermott, J. H. (2017). “Headphone screening to facilitate web-based auditory experiments,” Atten. Percept. Psychophys. 79, 2064–2072. 10.3758/s13414-017-1361-2
- 72. Xie, X., and Jaeger, T. F. (2020). “Comparing non-native and native speech: Are L2 productions more variable?,” J. Acoust. Soc. Am. 147, 3322–3347. 10.1121/10.0001141
- 73. Zehr, J., and Schwarz, F. (2018). “PennController for Internet Based Experiments (IBEX).”