Abstract
Earlier work has explored spoken word production during irrelevant background speech such as intelligible and unintelligible word lists. The present study compared how different types of irrelevant background speech (word lists vs. sentences) influenced spoken word production relative to a quiet control condition, and whether the influence depended on the intelligibility of the background speech. Experiment 1 presented native Dutch speakers with Chinese word lists and sentences. Experiment 2 presented a similar group with Dutch word lists and sentences. In both experiments, the lexical selection demands in speech production were manipulated by varying name agreement (high vs. low) of the to-be-named pictures. Results showed that background speech, regardless of its intelligibility, disrupted spoken word production relative to a quiet condition, but no effects of word lists versus sentences in either language were found. Moreover, the disruption by intelligible background speech compared with the quiet condition was eliminated when planning low name agreement pictures. These findings suggest that any speech, even unintelligible speech, interferes with production, which implies that the disruption of spoken word production is mainly phonological in nature. The disruption by intelligible background speech can be reduced or eliminated via top–down attentional engagement.
Keywords: Irrelevant speech effect, name agreement, speech production
Introduction
Much of daily conversation, which requires both speech comprehension and production, occurs in the presence of irrelevant external auditory stimulation, including noise from nearby traffic or construction, a television playing in the background, or a colleague talking on the phone. Extensive work has shown that background noise, music, and speech all have detrimental effects on spoken language comprehension (e.g., Eckert et al., 2016). However, very few studies have investigated how speakers plan their speech in the presence of irrelevant background noise, especially irrelevant background speech (e.g., Fargier & Laganaro, 2016, 2019; He, Meyer & Brehm, 2021). Understanding speech production amid non-verbal and verbal sources of noise advances our understanding of how speakers cope with auditory disruption when planning their speech. The present study thus investigated how different types of irrelevant background speech (word lists and sentences) influenced spoken word production, and whether this influence was modulated by the difficulty of speech production, manipulated via the lexical selection demands of the to-be-named pictures.
One irrelevant speech effect, two relevant theories
Previous studies have found that speech and non-speech sounds disrupt cognitive tasks such as serial recall (e.g., Parmentier & Beaman, 2015; Röer et al., 2014, 2015; Schlittmeier et al., 2012) and reading (e.g., Cauchard et al., 2012; Hyönä & Ekholm, 2016; Yan et al., 2018), even when they are irrelevant to the task and can be ignored. This is referred to as the irrelevant speech effect (or irrelevant sound effect; Colle & Welsh, 1976; Jones & Morris, 1992). One major account of the irrelevant speech effect appeals to the involvement of shared mechanisms or representations in both tasks; this is known as the domain-specific interference-by-similarity account (e.g., Jones et al., 1993; Martin et al., 1988; Salamé & Baddeley, 1982, 1989). It was first proposed to explain the changing-state effect in serial recall, in which changing distractor sequences like A B C D E F G H disrupt recall more than steady-state sequences like A A A A A A A A (Hughes, 2014; Hughes et al., 2007; Jones et al., 1993; Jones & Morris, 1992). That effect has also been attributed to conflict driven by automatic processing of the order of the irrelevant auditory distractors (the interference-by-process account; e.g., Hughes, 2014; Jones et al., 1993). The interference-by-similarity account resembles the crosstalk account of dual-task processing based on neural resources (Pashler, 1994; outcome conflict: Navon & Miller, 1987), which claims that shared or similar representations or processes cause interference in task performance.
Two hypotheses attribute the irrelevant speech effect to different sources that are both important to consider for the effect of background speech on speech production. The phonological disruption view (Salamé & Baddeley, 1982, 1989) hypothesises that the irrelevant speech effect results from the similarity in content of phonological codes (e.g., of reading and of irrelevant background speech), which are both buffered in a phonological memory store (a component of the phonological loop; Baddeley, 2000, 2003). This view predicts that disruption in speaking should occur from the presence of irrelevant background speech, regardless of its content. By contrast, the semantic disruption view (Martin et al., 1988) attributes the effect to the shared use of semantic processing (e.g., reading English is disrupted more by intelligible English than by unintelligible Russian background speech). This view predicts that disruption in speaking should only arise from intelligible, meaningful speech, because meaningless speech does not recruit semantic processing.
In contrast to the domain-specific interference-by-similarity, the domain-general attention capture account posits that irrelevant speech or sound disrupts focal task performance by diverting attention away from the task (Buchner et al., 2004; Cowan, 1995; Elliott & Briganti, 2012; Röer et al., 2013, 2015). When the focus of attention is captured by task-irrelevant sounds, fewer attentional resources are available and task performance is impaired. The attention capture theory has some support in how irrelevant background speech interferes with serial recall performance (e.g., Buchner et al., 2004; Cowan, 1995; Elliott & Briganti, 2012; Röer et al., 2013, 2015) and reading (e.g., Hyönä & Ekholm, 2016). This attention capture account is compatible with the capacity limitation account for dual-task processing (Pashler, 1994; Ruthruff et al., 2003), which states that the amount of attentional resources available to focal cognitive tasks determines task performance.
There is a similar divide within this domain-general attention capture view, with different predictions of the effects of irrelevant background speech on speech production (Eimer et al., 1996). Aspecific attention capture occurs when a sound captures attention because of the context in which it occurs, such as the sudden onset of speech following a period of silence (Eimer et al., 1996). This view predicts that irrelevant background speech with varied context (stimulus-aspecific variation, e.g., pauses in speech) should interfere more with the focal task than background speech with constant context (e.g., continuous speech). Alternatively, specific attention capture can occur when the content of the sound diverts attention (e.g., Eimer et al., 1996; Röer et al., 2013; Wood & Cowan, 1995), which implies that the attention-diverting power is attributable to the stimulus itself (stimulus-specific variation). This view predicts that irrelevant background speech with rich linguistic representations (e.g., full sentences) should elicit more disruption than speech with less linguistic information (e.g., word lists).
Irrelevant speech effects in spoken language production
Earlier work on the irrelevant speech effect has nearly all been conducted on language comprehension, and importantly, similar processes may or may not be relevant for speech production. Prior literature has indicated that speech production and comprehension draw upon similar processes and representations (e.g., Glaser & Düngelhoff, 1984; Kittredge & Dell, 2016; Mitterer & Ernestus, 2008; Schriefers et al., 1990), and both require attention (Cleland et al., 2006; Lien et al., 2008; Roelofs & Piai, 2011). This implies that the domain-specific interference-by-similarity (Martin et al., 1988; Salamé & Baddeley, 1982, 1989) and domain-general attention capture (Buchner et al., 2004; Cowan, 1995; Elliott & Briganti, 2012; Röer et al., 2013, 2015) mechanisms may both play roles in the disruption of speech production by irrelevant background speech. However, speech production and speech comprehension are also fundamentally different processes, with different goals (production converts a message to an output form; comprehension converts an input form to a message) and different attentional demands. This makes it important to systematically investigate the irrelevant speech effect in language production.
Evidence from picture–word interference (PWI) studies (Glaser & Düngelhoff, 1984; Schriefers et al., 1990) has supported the interference-by-similarity explanation. When naming a picture (e.g., DOG) in the presence of a spoken semantically related distractor word (e.g., FOX), naming latencies and error rates increased compared with trials with an unrelated distractor (e.g., RANK; Damian & Martin, 1999; Schriefers et al., 1990). This suggests that the distractor word activated semantic representations required by the target word, interfering with spoken word production when they are related (see Roelofs, 1992, 2003), which is consistent with the semantic disruption view (Martin et al., 1988). When naming a picture (e.g., BED), a phonologically related distractor word (e.g., BEND) elicited less interference than an unrelated distractor (e.g., DUKE; Damian & Martin, 1999; Schriefers et al., 1990). This suggests that comprehending a distractor word pre-activates phonological representations similar to the target, facilitating production when they are related. This, in turn, implies that if what is produced mismatches what is comprehended, pre-activation of phonological/phonetic representations could also elicit interference, which is consistent with the phonological disruption view (Salamé & Baddeley, 1982, 1989).
Fargier and Laganaro (2016) investigated the roles of both interference-by-similarity and capacity limitation mechanisms by using a dual-task paradigm. Participants named pictures in three listening conditions with varying attentional demand: without distractors (low), while passively listening to distractors (medium), and during a distractor detection task (high). The auditory distractors were either tones (non-verbal stimuli) or syllables (verbal stimuli). Production latencies were longer for syllables relative to tones, and increased for tasks with higher attentional demand. These results suggest that increased representational similarity and attentional demand cause more interference on speech production performance.
To expand on earlier work on interference between single-word production and comprehension (e.g., Fargier & Laganaro, 2016; Glaser & Düngelhoff, 1984; Schriefers et al., 1990), He, Meyer and Brehm (2021) conducted a study that mainly supports the role of interference-by-similarity in the irrelevant speech effect for speech production. In this study, Dutch speakers named sets of pictures while ignoring Dutch word lists, Chinese word lists, or eight-talker babble (i.e., language-like noise). Irrelevant background speech (Dutch and Chinese word lists) disrupted spoken word production more than eight-talker babble, and Dutch word lists caused more disruption than Chinese word lists. This suggests that more interference on spoken word production is obtained as the representational similarity between speech production and irrelevant background speech increases, consistent with the interference-by-similarity view (Martin et al., 1988; Salamé & Baddeley, 1982, 1989). However, He, Meyer and Brehm (2021) did not distinguish between phonological and semantic sources of disruption, which might both contribute to interference. This study also does not rule out disruption by attention capture, because the irrelevant background speech varied in both aspecific context (pauses in word lists but not in eight-talker babble) and specific linguistic content (information content in word lists but not in eight-talker babble).
Furthermore, because speaking requires attention, task demands may modulate the irrelevant speech effect in language production. He, Meyer and Brehm (2021) also manipulated the difficulty of speech production by varying name agreement (high, low) of to-be-named pictures. Name agreement is the extent to which participants agree on the name of a picture. Previous studies have found that naming a picture with high name agreement (e.g., the item called banana) is faster and more accurate than naming one with low name agreement (e.g., the item called sofa or couch; e.g., Alario et al., 2004; Cheng et al., 2010; Shao et al., 2014; Vitkovitch & Tyrrell, 1995). The effect is caused by both difficulty in object recognition (confusion over what the object should be called) and the demands of lexical selection (the need to select among competing lexical candidates); He, Meyer and Brehm (2021) used stimuli designed to elicit the latter effect. Irrelevant speech effects were strongest for high name agreement pictures with low lexical selection demands, which suggests that the interference can be eliminated when speech production is more demanding. The finding is consistent with a top–down attention engagement mechanism (also referred to as task engagement; see Halin et al., 2014; Marsh et al., 2015): difficult speech production may make speakers concentrate harder and reduce processing of irrelevant background speech. This means that to study irrelevant speech effects in speech production, it is also important to consider the production demands.
Current study
The present study was designed to explore how different types of irrelevant background speech affected spoken language production. Given that previous studies have supported the reliability of conducting speech production research online (e.g., Fairs & Strijkers, 2021; He, Meyer, Creemers, & Brehm, 2021; Stark et al., 2022; Vogt et al., 2022), we designed two web-based experiments which focused on teasing apart the variants of the interference-by-similarity and attention capture accounts. To distinguish between the semantic and phonological interference-by-similarity views, we examined disruption by unintelligible (Chinese, Experiment 1) and intelligible background speech (Dutch, Experiment 2) on Dutch spoken word production. The phonological disruption view (Salamé & Baddeley, 1982, 1989) predicts that background speech, regardless of its intelligibility, should disrupt speech production relative to a quiet condition, predicting a similar pattern of results across experiments. By contrast, the semantic disruption view (Martin et al., 1988) predicts that only intelligible background speech should interfere with speech production, predicting more disruption in Experiment 2 than Experiment 1. The predictions for each account in the present study are shown in Table 1.
Table 1.
A summary of predictions in the present study.
| Account | Predictions |
|---|---|
| Interference-by-similarity account (e.g., Jones et al., 1993; Martin et al., 1988; Salamé & Baddeley, 1982, 1989) | |
| Phonological disruption view (Salamé & Baddeley, 1982, 1989) | Both Chinese speech (in Exp1) and Dutch speech (in Exp2) should disrupt spoken word production relative to a quiet condition. |
| Semantic disruption view (Martin et al., 1988) | Chinese speech (in Exp1) should not disrupt spoken word production relative to a quiet condition, but Dutch speech (in Exp2) should. |
| Attention capture account (e.g., Buchner et al., 2004; Cowan, 1995; Elliott & Briganti, 2012; Röer et al., 2013, 2015) | |
| Aspecific attention capture view (Eimer et al., 1996) | Exp1: Chinese word lists should be more disruptive than Chinese sentences. Exp2: Dutch word lists may be more disruptive than Dutch sentences. |
| Specific attention capture view (Eimer et al., 1996) | Exp1: Chinese word lists should have the same disruptive potency as the sentences. Exp2: Dutch word lists may be less disruptive than Dutch sentences. |
| Attention engagement account (Halin et al., 2014; Marsh et al., 2015) | |
| Stimulus-aspecific disruption | Interference elicited by Chinese background speech (in Exp1) should not be affected by name agreement. |
| Stimulus-specific disruption | Interference elicited by Dutch background speech (in Exp2) should be reduced for low name agreement pictures. |
In both experiments, we compared word lists containing silent pauses (e.g., 渔夫,合唱团,足球,苹果,尺子,鹿; “fisherman, choir, football, apple, ruler, deer”) with sentences that form continuous speech without pauses (e.g., 鹿和尺子在苹果的左边, 并且足球和合唱团在渔夫的右边. “The deer and the ruler are to the left of the apple, and the football and the choir are to the right of the fisherman.”). This allows us to distinguish between the two attention capture view variants (Buchner et al., 2004; Cowan, 1995; Elliott & Briganti, 2012; Röer et al., 2013, 2015). In Experiment 1, if attention capture is only caused by aspecific context variation (e.g., the presence/absence of pauses), Chinese word lists should elicit more interference than Chinese sentences because they contain more pauses. By contrast, if attention capture is only caused by specific linguistic content (e.g., semantics or syntax), Chinese word lists should cause the same disruption as the Chinese sentences because both are meaningless to our Dutch speakers. Aspecific properties should elicit similar patterns of disruption in Experiment 2, though these may be modulated by specific linguistic content, because Dutch word lists and sentences differ for Dutch speakers in both semantics and syntax. We thus make relatively weak predictions under the attention capture view variants for Experiment 2. See Table 1 for more details.
In both experiments, we also investigated the role of top–down attention engagement by manipulating the name agreement (high vs. low), and therefore the lexical selection demands, of to-be-named pictures. This provides insight into whether and how speakers adopt top–down strategies to shield against auditory disruption when planning their speech. Following earlier work (Alario et al., 2004; Cheng et al., 2010; Shao et al., 2014; Vitkovitch & Tyrrell, 1995), we predicted that pictures with low name agreement would be named more slowly than those with high name agreement in both experiments. Interactions between the type of irrelevant background speech and name agreement would also show how the irrelevant speech effects are affected by the attentional demand of speech production. Because stimulus-aspecific disruption occurs automatically, we predicted that any interference present in Experiment 1 would not be affected by name agreement; stimulus-aspecific disruption is rooted in the automatic processing of the auditory input that escapes cognitive control (Hughes, 2014). By contrast, stimulus-specific disruption is non-automatic, which means that any disruption caused by the attention-capturing properties of intelligible background speech in Experiment 2 might be reduced for low compared with high name agreement pictures; stimulus-specific disruption requires central attention that taps into cognitive control (Hughes, 2014; Marsh et al., 2018).
Experiment 1
Method
Participants
We recruited 50 native speakers of Dutch who had no experience with Chinese (45 females, Mage = 25 years, range: 20–35 years) from the participant pool at the Max Planck Institute for Psycholinguistics. Power simulations (see https://osf.io/wuafh/) showed that 50 participants and 144 items (80% of the items in the study named successfully) would provide 95% power to measure a plausibly sized condition difference of 20 ms (SD = 900 ms). All participants reported normal or corrected-to-normal vision and no speech or hearing problems. They signed an online informed consent form and received a payment of €6 for their participation. The study was approved by the ethics board of the Faculty of Social Sciences of Radboud University.
Apparatus
The experiment was implemented in FRINEX (FRamework for INteractive EXperiments; Withers, 2017), a web-based platform developed at the Max Planck Institute for Psycholinguistics. Participants used their own laptops with headphones/earphones. We restricted participation to 14-in. or larger laptops (range: 14–24 in.) with Google Chrome, Firefox, Microsoft Edge, or Brave web browsers. Each participant’s speech was recorded by a built-in voice recorder in the web browser. WebMAUS Basic was used for phonetic segmentation and transcription (https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/WebMAUSBasic). Praat (Boersma & Weenink, 2009) was then used to extract the onsets and offsets of all segmented responses.
Materials
Visual stimuli
A total of 240 pictures from He, Meyer and Brehm (2021, Experiment 2; selected from the MultiPic database, Duñabeitia et al., 2018; see Supplementary Material, Table A1) were used in the present study. Of these, 120 were high name agreement pictures, all with a name agreement percentage of 100%, and 120 were low name agreement pictures, with a name agreement between 50% and 87% (M = 72%, SD = 11%). Independent t-tests revealed that the two sets of pictures differed significantly in name agreement, but not in any of the following psycholinguistic attributes: visual complexity, word frequency (WF), age-of-acquisition (AoA), number of phonemes, number of syllables, word prevalence, phonological neighbourhood frequency (PNF), phonological neighbourhood size (PNS), orthographic neighbourhood frequency (ONF), and orthographic neighbourhood size (ONS).
The 120 high name agreement and 120 low name agreement pictures were each divided into three subsets and paired with the two background speech conditions (Chinese word list, Chinese sentence) and a quiet control condition, meaning that each auditory condition was paired with 40 high name agreement and 40 low name agreement pictures. The three sets of pictures were matched on the 10 above-mentioned attributes, and each auditory condition was paired with one high and one low name agreement picture subset.
On each trial of the experiment, four pictures, all with high name agreement or all with low name agreement, were presented simultaneously in a 1 × 4 grid (size: 10 cm × 40 cm). The pictures per grid were all from different semantic categories and the first phoneme of each word was unique, as judged by a native speaker of Dutch. There were 20 picture grids for each background speech condition, resulting in 60 grids in total; 24 additional pictures (6 picture grids) were selected as practice stimuli from the same database.
Irrelevant background speech
For the Chinese word list condition (see Supplementary Material, Table A2), 120 additional Dutch nouns were selected from the MultiPic database (Duñabeitia et al., 2018) and translated into Chinese by a native Mandarin Chinese speaker. These 120 Chinese nouns were divided into 20 word lists of 6 nouns and paired with the 20 picture grids. All 20 lists were matched on the number of phonemes and number of syllables. The number of syllables was also matched between the Chinese nouns and the sets of to-be-named pictures, t(305.91) = −1.58, p > .05. To avoid phonological overlap between picture naming and background speech, we designed the word lists so that the six Chinese nouns per list did not share the first phoneme, and any five consecutive Chinese nouns per list also did not share the first phoneme with the to-be-named pictures in the same ordinal position. To create practice stimuli, 12 additional Dutch nouns were selected from the same database (Duñabeitia et al., 2018) and translated into Chinese, resulting in two lists. All of the word lists were recorded by a female native Mandarin Chinese speaker in neutral prosody using Audacity software (https://www.audacityteam.org/download/) at a sample rate of 44,100 Hz. Each word list was processed using Adobe Audition (https://www.adobe.com/products/audition.html) and Praat to delete initial and final silences and compressed by up to 0.74%, so that each word list lasted 8 s and there were similar periods of silence (about 700 ms) between consecutive nouns. Naming latencies for pictures can be around 1 s (e.g., Shao et al., 2014; Vitkovitch & Tyrrell, 1995), the duration (the interval between speech onset and offset of a word) of a spoken one- or two-syllable word may be up to 500 ms (e.g., Damian, 2003), and both utterance onset and articulation may be slowed in the presence of background speech.
Therefore, we estimated that it takes approximately 2 s to name one picture (see also He, Meyer and Brehm, 2021), totalling 8 s per word list.
For the Chinese sentence condition (see Supplementary Material, Table A3), the 20 Chinese word lists were transformed into 20 Chinese sentences by reversing the order of nouns in the list and adding conjunctions (e.g., 和/并且, “and”) and prepositional phrases (e.g., 在左边/在右边; “to the left/right of”) to link the nouns. Again, no five consecutive Chinese nouns per sentence were phonologically related to any to-be-named pictures in the same ordinal position. The two Chinese word lists were also transformed into two Chinese sentences as practice stimuli. The same speaker recorded these in neutral prosody and they were edited in the same fashion as each Chinese word list (by stretching up to a maximum of 9.59%) to last 8 s.
To test the participants’ concentration level and compliance with wearing headphones throughout the experiment, 19 additional two-syllable Dutch nouns (4 for the practice stage, 15 for the test stage) were selected from Duñabeitia et al. (2018) as attention check stimuli to be repeated back during the experiment. These were recorded by a native Dutch speaker in neutral prosody and matched on intensity, total RMS (root mean square) = −33.98 dB, in Adobe Audition.
Design
The type of background speech (Chinese word list, Chinese sentence, quiet) and the difficulty of lexical selection in speech production (name agreement: high, low) were treated as within-participant variables; both were randomised within experimental blocks and counterbalanced across participants. Items were repeated three times, resulting in three blocks of 60 trials, each containing one repetition of each background speech condition and picture grid. Across blocks, the same set of four pictures was paired with all three background speech conditions, and the pictures were presented in a different arrangement within each repetition. A unique order of stimulus presentation was created for each participant with the Mix programme (van Casteren & Davis, 2006), with the constraints that word lists and sentences sharing the same nouns were separated by at least three trials, and attention check trials were presented at least every five trials.
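One standard way to implement this rotation is a Latin square over blocks, so that each picture grid meets each background speech condition exactly once across the three blocks. The sketch below is a hypothetical Python illustration (the exact assignment scheme is not spelled out here), assuming the 60 grids are split into three groups of 20:

```python
CONDS = ["chinese_word_list", "chinese_sentence", "quiet"]

def block_pairings(grid_groups, block):
    """Latin-square rotation: grid_groups is three lists of 20 grid IDs,
    block is 0, 1, or 2. Returns a dict mapping each grid to its background
    speech condition in this block; across the three blocks, every grid
    meets every condition exactly once."""
    pairing = {}
    for i, group in enumerate(grid_groups):
        cond = CONDS[(i + block) % 3]  # rotate group-to-condition per block
        for grid in group:
            pairing[grid] = cond
    return pairing
```

Trial order within each block would then still be randomised per participant (here, by the Mix programme) subject to the stated spacing constraints.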
Procedure
Participants were tested online and received instructions that they should perform the experiment in a quiet room with the door shut and with potentially distracting electronic equipment turned off. They were asked to imagine that they were in a laboratory during the experiment, to wear headphones properly, and to set the volume of their laptops to a level that they usually use (e.g., to watch a video) and not to change it during the experiment. We asked them to report their volume setting before the test began.
During the experiment, a practice session of 10 trials (six test trials and four attention check trials) was followed by three blocks of experimental trials, each containing 60 test trials and five attention check trials. Participants were allowed to take a short break after each block. After completing the main portion of the experiment, participants were asked to type the value of their volume again, which allowed us to check whether they had changed it during the experiment. They were also asked to fill out a questionnaire about their Chinese experience (see Supplementary Material, Table A4). The experiment lasted about 30 min.
Practice and experimental trials began with a fixation cross presented for 500 ms, followed by a blank screen for 300 ms. Then, a 1 × 4 grid appeared on the screen in which four pictures were presented simultaneously while a sound file played for up to 8 s. Participants named the four pictures one by one from left to right as quickly and accurately as possible while ignoring the background speech. Once finished, they clicked the mouse to end the trial, at which point a blank screen was presented for 1,500 ms. An example of a test trial is shown in Figure 1. Attention check trials, included to test participants’ concentration, shared the same structure as the test trials, but the stimulus screen was blank and an audio file of a single Dutch word was played; on these trials, participants repeated the Dutch word as quickly and accurately as possible.
Figure 1.
An example trial in which participants named pictures with high name agreement while ignoring a Chinese word list (translation: fisherman, choir, football, apple, ruler, deer).
Analyses
Seven dependent variables were coded to index naming performance. This provides a full description of the many ways production performance can be disrupted. Production accuracy reflects the proportion of trials where all four pictures were named correctly. Picture names were coded as correct if they matched any of the multiple names given to the picture in the MultiPic database (Duñabeitia et al., 2018), if they were diminutive versions of one of those names (e.g., munt “coin” named as muntje “little coin”), or if they were judged reasonable by trained research assistants (e.g., kruk “stool” named as stoel “chair”).
For trials on which all pictures were named correctly and which had no hesitations or self-corrections (hereafter, “fully correct trials”), we calculated four time-based measures. Onset latency was defined as the interval from the onset of stimulus presentation to onset of the utterance, and indexes the beginning stages of speech planning. Utterance duration was defined as the interval between the onset of the first picture name and the offset of the fourth picture name, and reflects how long participants took to produce all four picture names. Total pause time was defined as the sum of all pauses between object names, and indexes the planning done between producing responses. Articulation time was defined as the sum of the articulation durations of all four picture names, and reflects processing during articulations.
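Given per-word onset and offset times for a fully correct trial, the four measures follow directly from these definitions. The sketch below (a Python illustration with hypothetical variable names, not the authors' code) makes the relations explicit, including that total pause time plus articulation time equals utterance duration:

```python
def time_measures(onsets, offsets):
    """Compute the four time-based measures for a fully correct trial.

    onsets/offsets: times (ms) of the four picture names, relative to
    stimulus presentation, in spoken order.
    """
    onset_latency = onsets[0]                     # start of the utterance
    utterance_duration = offsets[-1] - onsets[0]  # first onset to last offset
    # pauses: silent gaps between one name's offset and the next name's onset
    total_pause_time = sum(on - off for on, off in zip(onsets[1:], offsets[:-1]))
    # articulation time: summed spoken duration of the four names
    articulation_time = sum(off - on for on, off in zip(onsets, offsets))
    return onset_latency, utterance_duration, total_pause_time, articulation_time
```

By construction, `utterance_duration` decomposes exactly into `total_pause_time + articulation_time`, which is why the two are analysed as separate indices of between-word planning and of articulation.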
For fully correct trials, we also examined how participants grouped their four responses. Since earlier studies of spontaneous speech coded silent durations longer than 200 ms as silent pauses (e.g., Heldner & Edlund, 2010), we coded responses with 200 ms or less between them as a single response chunk. Two measures were derived: Total chunk number refers to how many response chunks participants made on one trial, with a larger number meaning more separate planning units for production. First chunk length refers to how many names participants produced in their initial response, and provides a measure of how much information participants planned before starting to speak.
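As an illustration of this coding scheme (a Python sketch of the 200 ms rule, not the authors' scripts), the two chunk measures can be derived from the same per-word onset and offset times:

```python
def chunk_measures(onsets, offsets, threshold=200):
    """Group the four responses into chunks: a silent gap of more than
    `threshold` ms between one name's offset and the next name's onset
    starts a new chunk. Returns (total_chunk_number, first_chunk_length)."""
    chunk_lengths = [1]  # the first name always opens the first chunk
    for prev_off, next_on in zip(offsets[:-1], onsets[1:]):
        if next_on - prev_off > threshold:  # silent pause: start a new chunk
            chunk_lengths.append(1)
        else:                               # gap of 200 ms or less: same chunk
            chunk_lengths[-1] += 1
    return len(chunk_lengths), chunk_lengths[0]
```

For example, two pairs of names separated by one long gap yield two chunks of two names each, whereas four names produced with only short gaps yield a single four-name chunk.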
To quantify the magnitude of all effects, Bayesian mixed-effects models (Nicenboim & Vasishth, 2016) were fitted in R version 4.0.3 (R Core Team, 2020) with the package brms (version 2.14.4, Bürkner, 2017). Predictors were name agreement (high/low) and the type of background speech (Chinese word list/Chinese sentence/quiet). Name agreement (high/low) was contrast coded as (0.5, -0.5). Two contrasts were made for the type of background speech: the first, coded as (0.25, 0.25, -0.5), compared the two Chinese speech conditions (word list and sentence) with the quiet condition; the second, coded as (0.5, -0.5, 0), compared the Chinese word list and Chinese sentence conditions. The random effect structure included random intercepts for participants and items, and random slopes for name agreement and the type of background speech by participants and items. Separate models were fitted for each dependent measure. All models had four chains; each chain had up to 24,000 iterations, depending on model convergence (exact values are listed in the model output tables). The first 2,000 iterations of each chain were discarded as warm-up (burn-in) to remove initial sampling bias.
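The contrast coding can be written out and checked explicitly. The minimal sketch below (in Python for illustration, although the analyses themselves were run in R with brms) confirms that both speech contrasts are centred and mutually orthogonal, so the intercept estimates the grand mean and the two comparisons do not overlap:

```python
# Contrast codes from the text: rows are the three background speech
# conditions, columns are the two planned comparisons.
speech_contrasts = {
    #             speech-vs-quiet   word-list-vs-sentence
    "word_list": (0.25,             0.5),
    "sentence":  (0.25,            -0.5),
    "quiet":     (-0.5,             0.0),
}
name_agreement = {"high": 0.5, "low": -0.5}

# Each contrast sums to zero across conditions (centred coding).
col1, col2 = zip(*speech_contrasts.values())
assert sum(col1) == 0 and sum(col2) == 0

# The two contrasts are orthogonal: elementwise products sum to zero.
assert sum(a * b for a, b in speech_contrasts.values()) == 0
```

With this coding, the beta for the first contrast estimates the average difference between the two speech conditions and quiet, and the beta for the second estimates the word list versus sentence difference.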
All models used weak, widely spread priors consistent with a range of null to moderate effects. The model of accuracy used a Bernoulli family with a logit link, and a Student-t prior with 1 degree of freedom and a scale parameter of 2.5. The models of log-transformed onset latency, log-transformed utterance duration, and log-transformed articulation time used a weak normal prior with an SD of 0.2, and the model of log-transformed total pause time used a weak normal prior with an SD of 1; these models were fitted with a Gaussian family and identity link. The models of total chunk number and first chunk length used weak normal priors centred at zero with an SD of 1, and a Poisson family with a log link. All models were run until the R-hat value for each parameter was 1.00, indicating convergence.
For these models, larger absolute beta values indicate larger estimated effects. We report parameters whose 95% credible intervals (hereafter, Cr.I) do not contain zero, which is analogous to a frequentist null hypothesis significance test: the parameter has a non-zero effect with high certainty. We also report any parameters for which the point estimate of the beta is about twice the size of its error, as this suggests that the estimated effect is large relative to the uncertainty around it. For these weaker effects, we additionally report the posterior probability, that is, the proportion of posterior samples falling on the same side of zero as the beta estimate.
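These posterior summaries follow directly from the posterior samples themselves. The sketch below (plain Python, purely illustrative, with a crude sort-based credible interval rather than brms's own summaries) shows the three quantities reported here: the point estimate, the 95% Cr.I, and the posterior probability that the effect lies on the same side of zero as the estimate.

```python
def posterior_summary(samples):
    """Summarise posterior samples of a fixed-effect beta: mean point
    estimate, 95% credible interval (via sorted quantile indices), and
    the posterior probability of the effect's estimated direction."""
    s = sorted(samples)
    n = len(s)
    beta = sum(s) / n
    lower = s[int(0.025 * (n - 1))]   # 2.5th percentile
    upper = s[int(0.975 * (n - 1))]   # 97.5th percentile
    sign = 1 if beta >= 0 else -1
    p_direction = sum(1 for x in s if sign * x > 0) / n
    return beta, (lower, upper), p_direction
```

For example, if 96% of the posterior samples for a positive beta lie above zero, the reported posterior probability is .96, even when the 95% Cr.I still contains zero.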
Results
Six participants were removed from further analyses: three could not complete the experiment due to a poor internet connection, two gave no responses on attention check trials, and one had too much Chinese experience, as indicated by their responses on the Chinese experience questionnaire. The data from the remaining 44 participants were checked for errors, removing from analysis any trials with implausible names (e.g., koekje “cookie” named as virus), hesitations (e.g., komkommer “cucumber” named as kom . . . komkommer), self-corrections (e.g., komkommer “cucumber” misnamed as courgette . . . komkommer “courgette . . . cucumber”), and any trials on which objects were omitted or named in the wrong order. The exclusion of these inaccurate trials resulted in a loss of 13.7% of the data (range by participants: 1.1%–30% of removed trials). Next, onset latencies below 200 ms were removed, resulting in a loss of 0.47% of the data, and total pause times below 20 ms were removed, resulting in a loss of 12.98% of the data. Finally, data points more than 2.5 SDs below or above the mean were removed for each time measure (1.87% for log-transformed onset latency, 0.86% for log-transformed utterance duration, 0.97% for log-transformed total pause time, and 1.33% for log-transformed articulation time). Descriptive statistics appear in Table 2.
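This two-step trimming procedure (a plausibility floor, then an SD-based criterion) can be sketched as follows. This is an illustrative Python sketch, not the authors' code; note that in the study the 2.5-SD step was applied to log-transformed measures, whereas raw values are used here for simplicity.

```python
def trim(values, floor=None, sd_criterion=2.5):
    """Two-step trimming for a time measure: drop values below a
    plausibility floor (e.g., 200 ms for onset latency, 20 ms for total
    pause time), then drop points more than `sd_criterion` sample
    standard deviations from the mean."""
    x = [v for v in values if floor is None or v >= floor]
    mean = sum(x) / len(x)
    sd = (sum((v - mean) ** 2 for v in x) / (len(x) - 1)) ** 0.5
    return [v for v in x if abs(v - mean) <= sd_criterion * sd]
```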
Table 2.
Means and standard deviations of the dependent variables by name agreement and the type of background speech in Experiment 1.
| | High NA | | | Low NA | | |
|---|---|---|---|---|---|---|
| | Chinese word list | Chinese sentence | Quiet | Chinese word list | Chinese sentence | Quiet |
| Accuracy | 91% | 91% | 92% | 82% | 82% | 81% |
| Onset latency (ms) | 1,246 (462) | 1,279 (522) | 1,198 (408) | 1,434 (579) | 1,413 (539) | 1,345 (486) |
| Utterance duration (ms) | 2,868 (790) | 2,868 (771) | 2,791 (765) | 3,475 (1,062) | 3,482 (1,025) | 3,392 (970) |
| Total pause time (ms) | 685 (621) | 662 (590) | 645 (582) | 1,078 (860) | 1,043 (790) | 1,040 (805) |
| Articulation time (ms) | 2,309 (431) | 2,332 (429) | 2,246 (392) | 2,518 (498) | 2,536 (522) | 2,450 (476) |
| Total chunk number | 1.9 (1.0) | 1.9 (1.0) | 1.9 (1.0) | 2.3 (1.1) | 2.4 (1.1) | 2.4 (1.1) |
| First chunk length | 2.7 (1.3) | 2.7 (1.3) | 2.8 (1.3) | 2.3 (1.3) | 2.2 (1.2) | 2.2 (1.2) |
Note. Standard deviations are given in parentheses. All time and chunking measures reflect fully correct trials only. NA: name agreement.
Attention check
The mean accuracy for attention check responses was 97% (range by participants: 73%–100%), showing that participants’ attention levels were good and that they indeed heard the background speech.
Accuracy
Participants produced sensible responses on 86% of the naming trials. As shown in Table 3, a Bayesian mixed-effect model showed that accuracy was considerably lower for low name agreement pictures than high name agreement pictures (β = .099, SE = .025, 95% Cr.I = [0.051, 0.147]), but it was not influenced by the type of background speech. Name agreement and the type of background speech did not interact.
Table 3.
Results of Bayesian mixed-effect models for all dependent variables in Experiment 1.
| | Estimate | Est. error | 95% Cr.I (lower) | 95% Cr.I (upper) | Effective samples |
|---|---|---|---|---|---|
| Accuracy | |||||
| Population-level effects | |||||
| Intercept | 0.863 | 0.017 | 0.83 | 0.895 | 32,170 |
| Name agreement | 0.099 | 0.025 | 0.051 | 0.147 | 59,697 |
| Speech vs. quiet | 0 | 0.014 | –0.028 | 0.029 | 107,958 |
| Word list vs. sentence | 0.003 | 0.011 | –0.019 | 0.025 | 131,954 |
| NA × (S vs. Q) | –0.02 | 0.028 | –0.076 | 0.036 | 107,878 |
| NA × (WL vs. S) | 0.001 | 0.022 | –0.042 | 0.045 | 134,552 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.075 | 0.009 | 0.06 | 0.095 | 27,257 |
| sd(NA) | 0.043 | 0.01 | 0.024 | 0.064 | 54,647 |
| sd(S vs. Q) | 0.016 | 0.012 | 0.001 | 0.043 | 48,050 |
| sd(WL vs. S) | 0.012 | 0.009 | 0.001 | 0.033 | 56,746 |
| sd(NA × (S vs. Q)) | 0.021 | 0.016 | 0.001 | 0.061 | 69,866 |
| sd(NA × (WL vs. S)) | 0.023 | 0.017 | 0.001 | 0.065 | 55,462 |
| Items | |||||
| sd(Intercept) | 0.058 | 0.02 | 0.016 | 0.092 | 6,156 |
| sd(NA) | 0.117 | 0.04 | 0.033 | 0.184 | 6,086 |
| sd(S vs. Q) | 0.05 | 0.018 | 0.011 | 0.085 | 20,580 |
| sd(WL vs. S) | 0.03 | 0.018 | 0.002 | 0.066 | 16,829 |
| sd(NA × (S vs. Q)) | 0.099 | 0.037 | 0.023 | 0.17 | 22,166 |
| sd(NA × (WL vs. S)) | 0.06 | 0.036 | 0.003 | 0.133 | 17,133 |
| Log-transformed onset latency | |||||
| Population-level effects | |||||
| Intercept | 7.133 | 0.028 | 7.078 | 7.188 | 5,293 |
| Name agreement | –0.122 | 0.014 | –0.149 | –0.095 | 48,510 |
| Speech vs. quiet | 0.064 | 0.038 | –0.011 | 0.138 | 49,911 |
| Word list vs. sentence | –0.002 | 0.037 | –0.074 | 0.071 | 47,960 |
| NA × (S vs. Q) | –0.006 | 0.07 | –0.144 | 0.132 | 50,854 |
| NA × (WL vs. S) | –0.014 | 0.069 | –0.15 | 0.122 | 56,068 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.177 | 0.02 | 0.143 | 0.223 | 10,270 |
| sd(NA) | 0.029 | 0.011 | 0.005 | 0.051 | 18,616 |
| sd(S vs. Q) | 0.077 | 0.015 | 0.049 | 0.109 | 31,488 |
| sd(WL vs. S) | 0.05 | 0.013 | 0.024 | 0.077 | 24,869 |
| sd(NA × (S vs. Q)) | 0.035 | 0.025 | 0.001 | 0.091 | 27,704 |
| sd(NA × (WL vs. S)) | 0.048 | 0.027 | 0.003 | 0.105 | 21,254 |
| Items | |||||
| sd(Intercept) | 0.029 | 0.012 | 0.004 | 0.049 | 2,331 |
| sd(NA) | 0.058 | 0.024 | 0.008 | 0.098 | 2,319 |
| sd(S vs. Q) | 0.173 | 0.095 | 0.008 | 0.311 | 1,284 |
| sd(WL vs. S) | 0.177 | 0.1 | 0.006 | 0.316 | 1,181 |
| sd(NA × (S vs. Q)) | 0.345 | 0.189 | 0.016 | 0.622 | 1,222 |
| sd(NA × (WL vs. S)) | 0.325 | 0.202 | 0.011 | 0.626 | 1,228 |
| Log-transformed utterance duration | |||||
| Population-level effects | |||||
| Intercept | 8.021 | 0.023 | 7.974 | 8.066 | 6,414 |
| Name agreement | –0.191 | 0.02 | –0.231 | –0.151 | 39,748 |
| Speech vs. quiet | 0.029 | 0.026 | –0.022 | 0.08 | 54,056 |
| Word list vs. sentence | –0.003 | 0.022 | –0.046 | 0.04 | 51,599 |
| NA × (S vs. Q) | 0.018 | 0.05 | –0.081 | 0.117 | 56,494 |
| NA × (WL vs. S) | 0.005 | 0.044 | –0.081 | 0.091 | 49,868 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.142 | 0.016 | 0.115 | 0.178 | 12,242 |
| sd(NA) | 0.064 | 0.009 | 0.047 | 0.084 | 35,908 |
| sd(S vs. Q) | 0.014 | 0.01 | 0.001 | 0.036 | 35,029 |
| sd(WL vs. S) | 0.01 | 0.007 | 0 | 0.026 | 45,776 |
| sd(NA × (S vs. Q)) | 0.019 | 0.014 | 0.001 | 0.054 | 49,185 |
| sd(NA × (WL vs. S)) | 0.04 | 0.02 | 0.004 | 0.081 | 31,111 |
| Items | |||||
| sd(Intercept) | 0.04 | 0.023 | 0.002 | 0.074 | 1,565 |
| sd(NA) | 0.081 | 0.045 | 0.004 | 0.148 | 1,643 |
| sd(S vs. Q) | 0.125 | 0.055 | 0.015 | 0.21 | 3,193 |
| sd(WL vs. S) | 0.111 | 0.036 | 0.037 | 0.173 | 5,059 |
| sd(NA × (S vs. Q)) | 0.251 | 0.109 | 0.032 | 0.422 | 3,182 |
| sd(NA × (WL vs. S)) | 0.222 | 0.073 | 0.072 | 0.346 | 4,698 |
| Log-transformed total pause time | |||||
| Population-level effects | |||||
| Intercept | 6.274 | 0.081 | 6.115 | 6.432 | 7,041 |
| Name agreement | –0.574 | 0.058 | –0.687 | –0.46 | 43,884 |
| Speech vs. quiet | 0.009 | 0.07 | –0.127 | 0.147 | 67,063 |
| Word list vs. sentence | 0.017 | 0.064 | –0.108 | 0.143 | 58,586 |
| NA × (S vs. Q) | 0.039 | 0.134 | –0.224 | 0.304 | 69,382 |
| NA × (WL vs. S) | 0.033 | 0.126 | –0.216 | 0.283 | 62,853 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.508 | 0.058 | 0.41 | 0.635 | 13,162 |
| sd(NA) | 0.177 | 0.033 | 0.116 | 0.247 | 43,499 |
| sd(S vs. Q) | 0.122 | 0.052 | 0.017 | 0.222 | 26,954 |
| sd(WL vs. S) | 0.067 | 0.04 | 0.004 | 0.152 | 31,799 |
| sd(NA × (S vs. Q)) | 0.078 | 0.06 | 0.003 | 0.223 | 53,517 |
| sd(NA × (WL vs. S)) | 0.126 | 0.08 | 0.006 | 0.298 | 32,126 |
| Items | |||||
| sd(Intercept) | 0.107 | 0.063 | 0.004 | 0.204 | 2,282 |
| sd(NA) | 0.222 | 0.124 | 0.01 | 0.409 | 2,251 |
| sd(S vs. Q) | 0.293 | 0.14 | 0.023 | 0.518 | 3,763 |
| sd(WL vs. S) | 0.292 | 0.102 | 0.078 | 0.469 | 6,780 |
| sd(NA × (S vs. Q)) | 0.59 | 0.279 | 0.049 | 1.038 | 3,738 |
| sd(NA × (WL vs. S)) | 0.579 | 0.205 | 0.151 | 0.935 | 6,811 |
| Log-transformed articulation time | |||||
| Population-level effects | |||||
| Intercept | 7.768 | 0.019 | 7.731 | 7.805 | 5,872 |
| Name agreement | –0.085 | 0.02 | –0.125 | –0.046 | 46,351 |
| Speech vs. quiet | 0.038 | 0.014 | 0.01 | 0.066 | 61,569 |
| Word list vs. sentence | –0.007 | 0.012 | –0.031 | 0.017 | 64,224 |
| NA × (S vs. Q) | 0.007 | 0.027 | –0.046 | 0.06 | 66,049 |
| NA × (WL vs. S) | –0.003 | 0.024 | –0.05 | 0.044 | 62,948 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.108 | 0.013 | 0.087 | 0.136 | 11,302 |
| sd(NA) | 0.053 | 0.007 | 0.041 | 0.069 | 28,988 |
| sd(S vs. Q) | 0.029 | 0.008 | 0.011 | 0.045 | 20,619 |
| sd(WL vs. S) | 0.008 | 0.005 | 0 | 0.02 | 35,991 |
| sd(NA × (S vs. Q)) | 0.014 | 0.011 | 0.001 | 0.039 | 41,441 |
| sd(NA × (WL vs. S)) | 0.021 | 0.014 | 0.001 | 0.051 | 21,175 |
| Items | |||||
| sd(Intercept) | 0.042 | 0.026 | 0.001 | 0.078 | 1,378 |
| sd(NA) | 0.083 | 0.051 | 0.003 | 0.157 | 1,380 |
| sd(S vs. Q) | 0.06 | 0.036 | 0.002 | 0.113 | 1,763 |
| sd(WL vs. S) | 0.055 | 0.029 | 0.003 | 0.098 | 1,923 |
| sd(NA × (S vs. Q)) | 0.121 | 0.071 | 0.005 | 0.225 | 1,729 |
| sd(NA × (WL vs. S)) | 0.106 | 0.059 | 0.005 | 0.195 | 1,932 |
| Total chunk number | |||||
| Population-level effects | |||||
| Intercept | 0.715 | 0.041 | 0.635 | 0.795 | 9,365 |
| Name agreement | –0.252 | 0.025 | –0.301 | –0.203 | 52,559 |
| Speech vs. quiet | –0.016 | 0.035 | –0.085 | 0.053 | 74,601 |
| Word list vs. sentence | –0.017 | 0.029 | –0.074 | 0.040 | 79,456 |
| NA × (S vs. Q) | 0.014 | 0.070 | –0.123 | 0.152 | 77,761 |
| NA × (WL vs. S) | 0.009 | 0.058 | –0.105 | 0.123 | 78,972 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.256 | 0.030 | 0.206 | 0.321 | 15,391 |
| sd(NA) | 0.062 | 0.021 | 0.020 | 0.104 | 46,312 |
| sd(S vs. Q) | 0.023 | 0.018 | 0.001 | 0.067 | 62,627 |
| sd(WL vs. S) | 0.020 | 0.016 | 0.001 | 0.058 | 63,929 |
| sd(NA × (S vs. Q)) | 0.049 | 0.037 | 0.002 | 0.139 | 64,075 |
| sd(NA × (WL vs. S)) | 0.043 | 0.033 | 0.002 | 0.122 | 61,696 |
| Items | |||||
| sd(Intercept) | 0.035 | 0.020 | 0.002 | 0.073 | 8,804 |
| sd(NA) | 0.070 | 0.040 | 0.004 | 0.146 | 7,966 |
| sd(S vs. Q) | 0.124 | 0.058 | 0.012 | 0.229 | 9,285 |
| sd(WL vs. S) | 0.102 | 0.043 | 0.014 | 0.183 | 13,656 |
| sd(NA × (S vs. Q)) | 0.246 | 0.116 | 0.020 | 0.458 | 9,163 |
| sd(NA × (WL vs. S)) | 0.202 | 0.087 | 0.025 | 0.365 | 13,743 |
| First chunk length | |||||
| Population-level effects | |||||
| Intercept | 0.863 | 0.042 | 0.781 | 0.946 | 11,967 |
| Name agreement | 0.218 | 0.025 | 0.168 | 0.268 | 96,798 |
| Speech vs. quiet | –0.012 | 0.034 | –0.077 | 0.055 | 95,932 |
| Word list vs. sentence | 0.013 | 0.030 | –0.046 | 0.072 | 92,168 |
| NA × (S vs. Q) | –0.030 | 0.067 | –0.162 | 0.101 | 95,948 |
| NA × (WL vs. S) | –0.027 | 0.060 | –0.145 | 0.091 | 95,897 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.262 | 0.031 | 0.210 | 0.330 | 19,220 |
| sd(NA) | 0.022 | 0.016 | 0.001 | 0.061 | 50,297 |
| sd(S vs. Q) | 0.025 | 0.019 | 0.001 | 0.069 | 64,357 |
| sd(WL vs. S) | 0.023 | 0.018 | 0.001 | 0.065 | 61,516 |
| sd(NA × (S vs. Q)) | 0.047 | 0.036 | 0.002 | 0.135 | 64,675 |
| sd(NA × (WL vs. S)) | 0.043 | 0.033 | 0.002 | 0.122 | 63,963 |
| Items | |||||
| sd(Intercept) | 0.047 | 0.025 | 0.003 | 0.090 | 5,967 |
| sd(NA) | 0.094 | 0.050 | 0.005 | 0.179 | 5,836 |
| sd(S vs. Q) | 0.124 | 0.053 | 0.015 | 0.221 | 11,407 |
| sd(WL vs. S) | 0.116 | 0.042 | 0.028 | 0.195 | 19,228 |
| sd(NA × (S vs. Q)) | 0.249 | 0.106 | 0.031 | 0.442 | 13,355 |
| sd(NA × (WL vs. S)) | 0.230 | 0.085 | 0.051 | 0.389 | 18,080 |
NA: name agreement; WL: word list; S: sentence; Q: quiet.
Models for all dependent variables were run for 24,000 iterations. Bolded values indicate effects where the 95% Cr.I does not contain zero.
Onset latency
As shown in Table 3 and the top-left panel of Figure 2, a Bayesian mixed-effect model showed that log-transformed onset latency was affected by name agreement: it took participants longer to begin naming low name agreement pictures than high name agreement pictures (β = −.122, SE = 0.014, 95% Cr.I = [−0.149, −0.095]). There was moderate evidence for the first contrast of background speech (speech vs. quiet): log-transformed onset latencies in the two Chinese speech conditions (word list and sentence) were longer than in the quiet condition (β = .064, SE = 0.038, 95% Cr.I = [−0.011, 0.138]). Note that while the 95% Cr.I contains zero, the point estimate is large relative to the error around it, and 96% of the posterior distribution of the estimated effect lies above zero. Name agreement and the type of background speech did not interact.
Figure 2.
Log-transformed onset latency (top-left), log-transformed utterance duration (top-right), log-transformed total pause time (bottom-left), and log-transformed articulation time (bottom-right) split by name agreement (NA: high, low) and the type of background speech (Chinese word list, Chinese sentence, Quiet) in Experiment 1. Blue squares represent condition means and red points reflect outliers.
Utterance duration
As shown in Table 3 and the top-right panel of Figure 2, a Bayesian mixed-effect model showed that the log-transformed utterance duration was longer for low name agreement pictures than high name agreement pictures (β = −.191, SE = 0.02, 95% Cr.I = [−0.231, −0.151]), but it was not influenced by the type of background speech. Again, name agreement and the type of background speech did not interact.
Total pause time
As shown in Table 3 and the bottom-left panel of Figure 2, the results for this measure patterned in the same way as log-transformed utterance duration: a Bayesian mixed-effect model showed that log-transformed total pause time was considerably longer for low name agreement pictures than high name agreement pictures (β = −0.574, SE = 0.058, 95% Cr.I = [−0.687, −0.46]), but it did not vary with the type of background speech. Name agreement and the type of background speech did not interact.
Articulation time
As shown in Table 3 and the bottom-right panel of Figure 2, a Bayesian mixed-effect model showed that log-transformed articulation time was influenced by both name agreement and the type of background speech: it was longer for low name agreement pictures than high name agreement pictures (β = −.085, SE = 0.02, 95% Cr.I = [−0.125, −0.046]), and reliably longer in the two Chinese speech conditions (word list and sentence) than in the quiet condition (β = 0.038, SE = 0.014, 95% Cr.I = [0.01, 0.066]). Again, name agreement did not interact with the type of background speech.
Total chunk number
As shown in Table 3 and the left panel of Figure 3, a Bayesian mixed-effect model showed that participants grouped their responses into more chunks for low name agreement pictures than high name agreement pictures (β = −.252, SE = 0.025, 95% Cr.I = [−0.301, −0.203]). Total chunk number was not influenced by the type of background speech, and there was no interaction between name agreement and the type of background speech.
Figure 3.
Total chunk number (left) and first chunk length (right) split by name agreement (NA: high, low) and the type of background speech (Chinese word list, Chinese sentence, Quiet) in Experiment 1.
First chunk length
As shown in Table 3 and the right panel of Figure 3, a Bayesian mixed-effect model showed that participants planned fewer names in their first response chunk for low name agreement pictures than high name agreement pictures (β = .218, SE = 0.025, 95% Cr.I = [0.168, 0.268]). First chunk length was not affected by the type of background speech, and there was no interaction between name agreement and the type of background speech.
Interim discussion
This experiment provides support for both phonological disruption and specific attention capture in speech production. Consistent with the phonological disruption view (Salamé & Baddeley, 1982, 1989), the presence of Chinese background speech (word lists and sentences) reliably increased articulation time, but had only a weak impact on speech onset latencies, relative to a quiet condition. Consistent with the specific attention capture view (Eimer et al., 1996), there was no difference between the Chinese word list and Chinese sentence conditions on any dependent measure. Finally, name agreement had a main effect on all dependent measures (as in Alario et al., 2004; He et al., 2021; Shao et al., 2014), but did not interact with the type of Chinese background speech, consistent with the automatic stimulus-aspecific disruption proposal by Hughes (2014).
Experiment 2
Experiment 1 demonstrated clear phonological disruption and specific attention capture effects with unintelligible background speech. However, it is unclear whether these patterns generalise to intelligible background speech. Thus, we extended our investigation to an intelligible-background-speech context by replacing the Chinese speech with Dutch speech in Experiment 2. Here, both the phonological and semantic disruption views (Martin et al., 1988; Salamé & Baddeley, 1982, 1989) predict that Dutch speech (word lists and sentences) should disrupt speech production relative to a quiet condition. The aspecific attention capture view (Eimer et al., 1996) predicts more interference in the Dutch word list condition (because of the pauses it contains), while the specific attention capture view (Eimer et al., 1996) predicts more disruption in the Dutch sentence condition (because sentences recruit richer linguistic representations); taken together, the attention capture variants thus yield relatively weak predictions. Finally, following the claim that stimulus-specific auditory distraction should be reduced or eliminated by an increase in attentional engagement, because such distraction requires central attention and cognitive control (Hughes, 2014; Marsh et al., 2018), we predicted that planning low name agreement pictures would reduce the processing of Dutch background speech, and thus its interference.
Method
Participants
We recruited 47 native Dutch speakers (33 females, Mage = 26 years, range: 18–39 years) from the same participant pool as Experiment 1. This sample size was selected because power simulations (see https://osf.io/wuafh/ for scripts) showed that 46 participants and 144 items (assuming an 80% accuracy rate) would provide 96% power to detect an interaction between the type of background speech and name agreement on utterance duration, with a background-speech effect of 20 ms or smaller (SD = 900 ms) for low name agreement pictures and of 60 ms or larger (SD = 900 ms) for high name agreement pictures. All participants reported normal or corrected-to-normal vision and no speech or hearing problems. They signed an online informed consent form and received a payment of €6 for their participation. The study was approved by the ethics board of the Faculty of Social Sciences of Radboud University.
Apparatus
The same apparatus was used as in Experiment 1.
Materials
Visual stimuli
As in Experiment 1.
Irrelevant background speech
For the Dutch word lists (see Supplementary Material, Table B1), the Dutch versions of the 120 nouns from Experiment 1 were used, matched with the picture names on WF, number of syllables, number of phonemes, age-of-acquisition, and word prevalence. To pair with the set of 20 picture grids, these 120 Dutch nouns were divided into 20 word lists of 6 nouns each, with lists matched on WF and number of syllables. To equate the amount of semantic and phonological overlap between speech planning and auditory background speech across trials, we ensured that the six Dutch nouns within each word list were neither semantically nor phonologically related to each other, as described in Experiment 1. In addition, Dutch versions of 12 nouns from Experiment 1 were used as practice stimuli, resulting in two Dutch practice word lists. All Dutch word lists were recorded in neutral prosody by a female native Dutch speaker 2 and, like the Chinese word lists, edited to last 8 s each with similar silent periods (about 700 ms) between consecutive nouns, by stretching by up to 9.38%.
For the Dutch sentence condition (see Supplementary Material, Table B2), the 20 Dutch word lists were transformed into 20 Dutch sentences as in Experiment 1, by reversing the order of the nouns and then combining them with conjunctions (e.g., en “and”) and prepositional phrases (e.g., bevinden zich links/rechts van “are to the left/right of”). The two Dutch practice word lists were also converted into two Dutch sentences as practice stimuli. The same female native Dutch speaker recorded these sentences in neutral prosody. Sentences were edited to last 8 s each by stretching by up to 14.29%. The same 19 attention check trials (15 as test stimuli, 4 as practice stimuli) from Experiment 1 were also included. All auditory files were matched on intensity (total RMS = −33.98 dB) in Adobe Audition.
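As an aside on the editing step, the reported maximum stretch percentages follow directly from the original and target durations. A minimal illustration (the 7-s original duration below is a hypothetical example, not a value reported in the text):

```python
def stretch_percent(original_s, target_s=8.0):
    """Percentage by which a recording must be uniformly time-stretched
    to reach the target duration."""
    return (target_s / original_s - 1) * 100
```

For instance, uniformly stretching a 7-s recording to the 8-s target corresponds to a stretch of about 14.29%, matching the maximum reported for the Dutch sentences.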
Design
The design was identical to Experiment 1.
Procedure
The procedure was identical to Experiment 1 except that participants did not fill out the Chinese experience questionnaire. 3
Analyses
The analysis was the same as Experiment 1.
Results
Six participants were removed from further analyses: one had no audio recordings, three gave no responses on attention check trials, one had also participated in Experiment 1, and one had extremely poor-quality audio recordings. The data from the remaining 41 participants were checked for errors as described in Experiment 1. The exclusion of these inaccurate trials resulted in a loss of 12.7% of the data (range by participants: 2.8%–42% of removed trials). Next, onset latencies below 200 ms were removed, resulting in a loss of 0.02% of the data, and total pause times below 20 ms were removed, resulting in a loss of 12.17% of the data. Finally, data points more than 2.5 SDs below or above the mean were removed for the time measures (1.61% for log-transformed onset latency, 0.85% for log-transformed utterance duration, 1.01% for log-transformed total pause time, and 1.18% for log-transformed articulation time). Descriptive statistics of all dependent variables are shown in Table 4.
Table 4.
Means and standard deviations of the dependent variables by name agreement and the type of background speech in Experiment 2.
| | High NA | | | Low NA | | |
|---|---|---|---|---|---|---|
| Dutch word list | Dutch sentence | Quiet | Dutch word list | Dutch sentence | Quiet | |
| Accuracy | 92% | 92% | 93% | 82% | 82% | 84% |
| Onset latency (ms) | 1,304 (496) | 1,300 (493) | 1,195 (362) | 1,451 (568) | 1,486 (611) | 1,392 (492) |
| Utterance duration (ms) | 2,864 (859) | 2,871 (872) | 2,690 (776) | 3,481 (1,028) | 3,463 (1,078) | 3,474 (1,087) |
| Total pause time (ms) | 771 (759) | 726 (745) | 632 (636) | 1,090 (877) | 1,072 (903) | 1,160 (909) |
| Articulation time (ms) | 2,260 (393) | 2,274 (415) | 2,172 (387) | 2,484 (467) | 2,482 (482) | 2,392 (458) |
| Total chunk number | 1.9 (1.0) | 1.9 (1.0) | 1.9 (1.0) | 2.4 (1.0) | 2.4 (1.1) | 2.5 (1.1) |
| First chunk length | 2.7 (1.3) | 2.8 (1.3) | 2.8 (1.3) | 2.2 (1.2) | 2.3 (1.2) | 2.2 (1.2) |
Note. Standard deviations are given in parentheses. All time and chunking measures reflect fully correct trials only. NA: name agreement.
Attention check
The mean accuracy for attention check responses was 98% (range by participants: 73%–100%), showing that participants indeed processed the background speech during the experiment.
Accuracy
Participants produced the intended responses on 87% of the naming trials. As shown in Table 5, a Bayesian mixed-effect model showed that accuracy was lower for low name agreement pictures than high name agreement pictures (β = 1.061, SE = 0.223, 95% Cr.I = [0.630, 1.506]), but it was not affected by the type of background speech. Name agreement and the type of background speech did not interact.
Table 5.
Results of Bayesian mixed-effect models for all dependent variables in Experiment 2.
| | Estimate | Est. error | 95% Cr.I (lower) | 95% Cr.I (upper) | Effective samples |
|---|---|---|---|---|---|
| Accuracy | |||||
| Population-level effects | |||||
| Intercept | 2.295 | 0.165 | 1.974 | 2.628 | 29,013 |
| Name agreement | 1.061 | 0.223 | 0.630 | 1.506 | 79,513 |
| Speech vs. quiet | –0.043 | 0.142 | –0.328 | 0.230 | 118,039 |
| Word list vs. sentence | 0.016 | 0.123 | –0.231 | 0.256 | 109,284 |
| NA × (S vs. Q) | –0.134 | 0.275 | –0.669 | 0.412 | 118,838 |
| NA × (WL vs. S) | 0.063 | 0.246 | –0.416 | 0.553 | 112,914 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.812 | 0.103 | 0.634 | 1.038 | 28,016 |
| sd(NA) | 0.317 | 0.135 | 0.043 | 0.582 | 25,107 |
| sd(S vs. Q) | 0.171 | 0.123 | 0.007 | 0.455 | 45,424 |
| sd(WL vs. S) | 0.125 | 0.093 | 0.005 | 0.345 | 54,483 |
| sd(NA × (S vs. Q)) | 0.220 | 0.169 | 0.008 | 0.630 | 64,394 |
| sd(NA × (WL vs. S)) | 0.236 | 0.178 | 0.009 | 0.663 | 53,301 |
| Items | |||||
| sd(Intercept) | 0.478 | 0.265 | 0.020 | 0.868 | 2,980 |
| sd(NA) | 0.901 | 0.531 | 0.034 | 1.714 | 3,066 |
| sd(S vs. Q) | 0.340 | 0.189 | 0.021 | 0.715 | 19,407 |
| sd(WL vs. S) | 0.315 | 0.187 | 0.017 | 0.692 | 18,572 |
| sd(NA × [S vs. Q]) | 0.652 | 0.371 | 0.039 | 1.394 | 21,918 |
| sd(NA × (WL vs. S)) | 0.601 | 0.366 | 0.030 | 1.338 | 18,389 |
| Log-transformed onset latency | |||||
| Population-level effects | |||||
| Intercept | 7.161 | 0.028 | 7.105 | 7.216 | 5,610 |
| Name agreement | –0.128 | 0.014 | –0.155 | –0.1 | 60,813 |
| Speech vs. quiet | 0.076 | 0.04 | –0.003 | 0.155 | 61,479 |
| Word list vs. sentence | –0.004 | 0.046 | –0.096 | 0.086 | 65,617 |
| NA × (S vs. Q) | 0.04 | 0.074 | –0.104 | 0.187 | 64,085 |
| NA × (WL vs. S) | 0.022 | 0.086 | –0.147 | 0.19 | 66,181 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.171 | 0.02 | 0.136 | 0.217 | 12,128 |
| sd(NA) | 0.024 | 0.011 | 0.003 | 0.044 | 22,175 |
| sd(S vs. Q) | 0.05 | 0.014 | 0.021 | 0.078 | 26,754 |
| sd(WL vs. S) | 0.028 | 0.014 | 0.002 | 0.054 | 20,076 |
| sd(NA × (S vs. Q)) | 0.027 | 0.02 | 0.001 | 0.074 | 39,897 |
| sd(NA × (WL vs. S)) | 0.026 | 0.018 | 0.001 | 0.067 | 39,453 |
| Items | |||||
| sd(Intercept) | 0.029 | 0.016 | 0.001 | 0.053 | 1,183 |
| sd(NA) | 0.059 | 0.031 | 0.003 | 0.107 | 1,196 |
| sd(S vs. Q) | 0.184 | 0.106 | 0.008 | 0.339 | 1,012 |
| sd(WL vs. S) | 0.233 | 0.117 | 0.016 | 0.405 | 2,193 |
| sd(NA × (S vs. Q)) | 0.376 | 0.213 | 0.015 | 0.68 | 1,029 |
| sd(NA × (WL vs. S)) | 0.454 | 0.237 | 0.029 | 0.807 | 2,111 |
| Log-transformed utterance duration | |||||
| Population-level effects | |||||
| Intercept | 8.012 | 0.028 | 7.957 | 8.067 | 4,298 |
| Name agreement | –0.215 | 0.022 | –0.257 | –0.172 | 34,356 |
| Speech vs. quiet | 0.050 | 0.031 | –0.012 | 0.111 | 48,720 |
| Word list vs. sentence | 0.005 | 0.024 | –0.042 | 0.052 | 54,738 |
| NA × (S vs. Q) | 0.070 | 0.060 | –0.047 | 0.187 | 50,417 |
| NA × (WL vs. S) | –0.007 | 0.047 | –0.100 | 0.085 | 58,527 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.171 | 0.021 | 0.136 | 0.216 | 11,188 |
| sd(NA) | 0.073 | 0.011 | 0.054 | 0.097 | 31,638 |
| sd(S vs. Q) | 0.045 | 0.014 | 0.014 | 0.072 | 16,224 |
| sd(WL vs. S) | 0.008 | 0.006 | 0.000 | 0.023 | 55,147 |
| sd(NA × (S vs. Q)) | 0.039 | 0.027 | 0.002 | 0.097 | 21,573 |
| sd(NA × (WL vs. S)) | 0.019 | 0.014 | 0.001 | 0.054 | 45,545 |
| Items | |||||
| sd(Intercept) | 0.044 | 0.023 | 0.002 | 0.078 | 1,561 |
| sd(NA) | 0.085 | 0.046 | 0.004 | 0.155 | 1,554 |
| sd(S vs. Q) | 0.151 | 0.065 | 0.021 | 0.253 | 2,658 |
| sd(WL vs. S) | 0.112 | 0.059 | 0.006 | 0.200 | 1,808 |
| sd(NA × (S vs. Q)) | 0.301 | 0.130 | 0.040 | 0.504 | 2,617 |
| sd(NA × (WL vs. S)) | 0.225 | 0.119 | 0.012 | 0.401 | 1,766 |
| Log-transformed total pause time | |||||
| Population-level effects | |||||
| Intercept | 6.298 | 0.09 | 6.12 | 6.476 | 8,463 |
| Name agreement | –0.599 | 0.072 | –0.741 | –0.458 | 50,058 |
| Speech vs. quiet | 0.055 | 0.086 | –0.114 | 0.224 | 74,556 |
| Word list vs. sentence | 0.059 | 0.068 | –0.075 | 0.194 | 87,601 |
| NA × (S vs. Q) | 0.28 | 0.173 | –0.06 | 0.621 | 74,891 |
| NA × (WL vs. S) | –0.006 | 0.137 | –0.275 | 0.263 | 88,114 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.542 | 0.065 | 0.432 | 0.687 | 16,813 |
| sd(NA) | 0.28 | 0.042 | 0.207 | 0.373 | 38,849 |
| sd(S vs. Q) | 0.078 | 0.051 | 0.004 | 0.188 | 27,262 |
| sd(WL vs. S) | 0.035 | 0.027 | 0.001 | 0.099 | 55,607 |
| sd(NA × (S vs. Q)) | 0.28 | 0.12 | 0.035 | 0.51 | 25,088 |
| sd(NA × (WL vs. S)) | 0.117 | 0.078 | 0.005 | 0.29 | 35,367 |
| Items | |||||
| sd(Intercept) | 0.125 | 0.067 | 0.007 | 0.227 | 2,808 |
| sd(NA) | 0.249 | 0.134 | 0.014 | 0.455 | 2,789 |
| sd(S vs. Q) | 0.401 | 0.163 | 0.067 | 0.665 | 4,686 |
| sd(WL vs. S) | 0.297 | 0.168 | 0.012 | 0.549 | 2,653 |
| sd(NA × (S vs. Q)) | 0.786 | 0.326 | 0.123 | 1.322 | 4,524 |
| sd(NA × (WL vs. S)) | 0.589 | 0.337 | 0.024 | 1.099 | 2,693 |
| Log-transformed articulation time | |||||
| Population-level effects | |||||
| Intercept | 7.744 | 0.021 | 7.704 | 7.785 | 8,367 |
| Name agreement | –0.093 | 0.020 | –0.133 | –0.054 | 63,460 |
| Speech vs. quiet | 0.054 | 0.016 | 0.023 | 0.085 | 97,570 |
| Word list vs. sentence | –0.003 | 0.013 | –0.029 | 0.022 | 100,970 |
| NA × (S vs. Q) | 0.010 | 0.030 | –0.048 | 0.069 | 103,634 |
| NA × (WL vs. S) | 0.000 | 0.026 | –0.050 | 0.051 | 101,332 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.120 | 0.014 | 0.096 | 0.152 | 16,082 |
| sd(NA) | 0.055 | 0.008 | 0.042 | 0.071 | 33,143 |
| sd(S vs. Q) | 0.031 | 0.007 | 0.018 | 0.046 | 24,300 |
| sd(WL vs. S) | 0.007 | 0.005 | 0.000 | 0.018 | 43,960 |
| sd(NA × (S vs. Q)) | 0.033 | 0.017 | 0.002 | 0.067 | 20,736 |
| sd(NA × (WL vs. S)) | 0.017 | 0.011 | 0.001 | 0.041 | 37,705 |
| Items | |||||
| sd(Intercept) | 0.042 | 0.025 | 0.001 | 0.078 | 1,772 |
| sd(NA) | 0.083 | 0.051 | 0.003 | 0.156 | 1,798 |
| sd(S vs. Q) | 0.066 | 0.040 | 0.002 | 0.124 | 1,927 |
| sd(WL vs. S) | 0.058 | 0.035 | 0.002 | 0.108 | 2,217 |
| sd(NA × (S vs. Q)) | 0.130 | 0.080 | 0.004 | 0.247 | 1,977 |
| sd(NA × (WL vs. S)) | 0.116 | 0.069 | 0.004 | 0.217 | 2,209 |
| Total chunk number | |||||
| Population-level effects | |||||
| Intercept | 0.728 | 0.041 | 0.647 | 0.808 | 8,660 |
| Name agreement | –0.266 | 0.030 | –0.325 | –0.208 | 41,811 |
| Speech vs. quiet | –0.003 | 0.037 | –0.077 | 0.071 | 73,370 |
| Word list vs. sentence | 0.015 | 0.030 | –0.045 | 0.074 | 77,365 |
| NA × (S vs. Q) | 0.070 | 0.075 | –0.078 | 0.217 | 74,377 |
| NA × (WL vs. S) | 0.014 | 0.061 | –0.105 | 0.133 | 79,264 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.246 | 0.030 | 0.196 | 0.312 | 15,554 |
| sd(NA) | 0.086 | 0.022 | 0.045 | 0.132 | 47,199 |
| sd(S vs. Q) | 0.024 | 0.019 | 0.001 | 0.070 | 62,041 |
| sd(WL vs. S) | 0.020 | 0.015 | 0.001 | 0.057 | 68,947 |
| sd(NA × (S vs. Q)) | 0.051 | 0.040 | 0.002 | 0.148 | 61,109 |
| sd(NA × (WL vs. S)) | 0.040 | 0.031 | 0.002 | 0.114 | 70,155 |
| Items | |||||
| sd(Intercept) | 0.047 | 0.026 | 0.002 | 0.092 | 4,816 |
| sd(NA) | 0.094 | 0.052 | 0.005 | 0.184 | 4,829 |
| sd(S vs. Q) | 0.140 | 0.066 | 0.012 | 0.257 | 7,236 |
| sd(WL vs. S) | 0.102 | 0.057 | 0.005 | 0.204 | 6,819 |
| sd(NA × (S vs. Q)) | 0.278 | 0.132 | 0.023 | 0.512 | 7,343 |
| sd(NA × (WL vs. S)) | 0.201 | 0.114 | 0.010 | 0.407 | 6,661 |
| First chunk length | |||||
| Population-level effects | |||||
| Intercept | 0.858 | 0.045 | 0.767 | 0.948 | 8,363 |
| Name agreement | 0.237 | 0.027 | 0.183 | 0.291 | 74,876 |
| Speech vs. quiet | –0.008 | 0.043 | –0.092 | 0.076 | 64,681 |
| Word list vs. sentence | –0.022 | 0.036 | –0.093 | 0.048 | 70,214 |
| NA × (S vs. Q) | –0.090 | 0.085 | –0.257 | 0.078 | 65,380 |
| NA × (WL vs. S) | –0.005 | 0.072 | –0.146 | 0.137 | 70,142 |
| Group-level effects | |||||
| Participants | |||||
| sd(Intercept) | 0.272 | 0.034 | 0.214 | 0.346 | 17,057 |
| sd(NA) | 0.030 | 0.021 | 0.001 | 0.079 | 35,240 |
| sd(S vs. Q) | 0.026 | 0.019 | 0.001 | 0.073 | 58,663 |
| sd(WL vs. S) | 0.021 | 0.016 | 0.001 | 0.060 | 67,790 |
| sd(NA × (S vs. Q)) | 0.059 | 0.044 | 0.002 | 0.164 | 54,199 |
| sd(NA × (WL vs. S)) | 0.040 | 0.031 | 0.002 | 0.115 | 72,032 |
| Items | |||||
| sd(Intercept) | 0.050 | 0.027 | 0.003 | 0.095 | 4,599 |
| sd(NA) | 0.100 | 0.053 | 0.006 | 0.190 | 4,610 |
| sd(S vs. Q) | 0.185 | 0.064 | 0.049 | 0.300 | 8,825 |
| sd(WL vs. S) | 0.150 | 0.063 | 0.020 | 0.258 | 6,981 |
| sd(NA × (S vs. Q)) | 0.367 | 0.128 | 0.093 | 0.595 | 9,005 |
| sd(NA × (WL vs. S)) | 0.301 | 0.125 | 0.040 | 0.519 | 7,420 |
NA: name agreement; WL: word list; S: sentence; Q: quiet.
Models for all dependent variables were run for 24,000 iterations. Bolded values indicate effects where the 95% Cr.I does not contain zero; italicised values indicate effects where the beta estimate is at least twice its standard error.
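The two background-speech contrasts in these models (speech vs. quiet; word list vs. sentence) can be implemented as centered, orthogonal contrast codes. The paper does not report the exact numeric coding, so the Helmert-style scheme below is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

# Hypothetical contrast codes for the three background-speech conditions.
# The exact coding is not reported in the paper; this is one common scheme.
conditions = ["quiet", "word list", "sentence"]
speech_vs_quiet = np.array([-2 / 3, 1 / 3, 1 / 3])  # both speech conditions vs. quiet
word_list_vs_sentence = np.array([0.0, -0.5, 0.5])  # quiet contributes nothing

for cond, c1, c2 in zip(conditions, speech_vs_quiet, word_list_vs_sentence):
    print(f"{cond:>9}: S vs. Q = {c1:+.3f}, WL vs. S = {c2:+.3f}")

# Each column sums to zero (centered) and the columns are orthogonal,
# so each regression coefficient isolates exactly one comparison.
assert abs(speech_vs_quiet.sum()) < 1e-12
assert abs(word_list_vs_sentence.sum()) < 1e-12
assert abs(speech_vs_quiet @ word_list_vs_sentence) < 1e-12
```

Because the codes are orthogonal, adding or dropping one contrast does not change the estimate of the other.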
Onset latency
As shown in Table 5 and the top-left panel of Figure 4, a Bayesian mixed-effects model confirmed that log-transformed onset latency was longer when planning names for low name agreement pictures than for high name agreement pictures (β = −0.128, SE = 0.014, 95% Cr.I = [−0.155, −0.1]). There was moderate evidence for the first contrast of background speech (Dutch speech vs. Quiet), such that log-transformed onset latencies in the two Dutch speech conditions (word list and sentence) were longer than in the quiet condition (β = 0.076, SE = 0.04, 95% Cr.I = [−0.003, 0.155]). While the 95% Cr.I contains zero, 93% of the posterior distribution around the estimated effect lies above zero. Again, name agreement did not interact with the type of background speech.
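Summaries of the form "93% of the posterior distribution is above zero" can be computed directly from the MCMC draws. A minimal sketch using simulated draws in place of the model's actual posterior samples (the normal shape, the mean and scale, and the draw count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated posterior draws standing in for the fitted model's samples of
# the speech-vs-quiet coefficient (real draws would come from the model).
draws = rng.normal(loc=0.076, scale=0.04, size=96_000)

# 95% credible interval from the 2.5th and 97.5th percentiles.
lower, upper = np.percentile(draws, [2.5, 97.5])

# Proportion of posterior mass above zero (the "93%"-style summary).
p_above_zero = (draws > 0).mean()

print(f"95% Cr.I = [{lower:.3f}, {upper:.3f}], P(beta > 0) = {p_above_zero:.2f}")
```

This is why an effect can be described as having "moderate evidence" even when the 95% Cr.I narrowly includes zero: most of the posterior mass still falls on one side of zero.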
Figure 4.
Log-transformed onset latency (top-left), log-transformed utterance duration (top-right), log-transformed total pause time (bottom-left), and log-transformed articulation time (bottom-right) split by name agreement (NA: high, low) and the type of background speech (Dutch word list, Dutch sentence, Quiet) in Experiment 2. Blue squares represent condition means and red points reflect outliers.
Utterance duration
As shown in Table 5 and the top-right panel of Figure 4, a Bayesian mixed-effects model showed that log-transformed utterance duration was longer for low name agreement pictures than for high name agreement pictures (β = −0.215, SE = 0.022, 95% Cr.I = [−0.257, −0.172]). There was moderate evidence for the first contrast of background speech (Dutch speech vs. Quiet), such that log-transformed utterance durations in the two Dutch speech conditions (word list and sentence) were longer than in the quiet condition (β = 0.05, SE = 0.031, 95% Cr.I = [−0.012, 0.111]). Here, the 95% Cr.I contains zero, but 93% of the posterior distribution around the estimated effect lies above zero. Again, name agreement did not interact with the type of background speech.
Total pause time
As shown in Table 5 and the bottom-left panel of Figure 4, a Bayesian mixed-effects model showed that log-transformed total pause time was longer for low name agreement pictures than for high name agreement pictures (β = −0.599, SE = 0.072, 95% Cr.I = [−0.741, −0.458]), but it did not vary with the type of background speech. There was moderate evidence for the interaction of name agreement and the first contrast (Dutch speech vs. Quiet) of background speech (β = 0.28, SE = 0.173, 95% Cr.I = [−0.06, 0.621]). While the 95% Cr.I contains zero, 93% of the posterior distribution around the estimated effect lies above zero. This indicates that log-transformed total pause time in the Dutch speech conditions was longer than in the quiet condition for high name agreement pictures (β = 0.394, SE = 0.171, 95% Cr.I = [0.058, 0.727]), but not for low name agreement pictures.
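The conditional (simple) effects reported here, such as speech vs. quiet within high name agreement pictures, can be derived by combining posterior draws of the main effect and the interaction. A minimal sketch with simulated draws, an assumed ±0.5 coding for name agreement, and illustrative numbers throughout (not the authors' scripts):

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws = 96_000

# Illustrative posterior draws; in practice these come from the fitted model.
beta_speech = rng.normal(0.25, 0.10, n_draws)       # speech vs. quiet (illustrative)
beta_na_x_speech = rng.normal(0.28, 0.17, n_draws)  # NA x (S vs. Q) interaction

# Assumed coding: high name agreement = +0.5, low = -0.5.
effect_high_na = beta_speech + 0.5 * beta_na_x_speech
effect_low_na = beta_speech - 0.5 * beta_na_x_speech

# Summarise each simple effect with a posterior mean and 95% credible interval.
for label, eff in [("high NA", effect_high_na), ("low NA", effect_low_na)]:
    lo, hi = np.percentile(eff, [2.5, 97.5])
    print(f"S vs. Q within {label}: mean={eff.mean():.3f}, 95% Cr.I=[{lo:.3f}, {hi:.3f}]")
```

Working on the draws themselves (rather than point estimates) propagates the uncertainty of both coefficients into each simple effect.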
Articulation time
As shown in Table 5 and the bottom-right panel of Figure 4, a Bayesian mixed-effects model showed that log-transformed articulation time was affected by both name agreement and the type of background speech: It took longer to articulate names of low name agreement pictures than high name agreement pictures (β = −0.093, SE = 0.020, 95% Cr.I = [−0.133, −0.054]), and articulation time was longer in the two Dutch speech conditions (word list and sentence) than in the quiet condition (β = 0.054, SE = 0.016, 95% Cr.I = [0.023, 0.085]). There was no interaction between name agreement and the type of background speech.
Total chunk number
As shown in Table 5 and the left panel of Figure 5, a Bayesian mixed-effects model showed that participants grouped their responses into more chunks for low name agreement pictures than for high name agreement pictures (β = −0.266, SE = 0.030, 95% Cr.I = [−0.325, −0.208]). Total chunk number was not affected by the type of background speech. Again, name agreement did not interact with the type of background speech.
Figure 5.
Total chunk number (left) and first chunk length (right) split by name agreement (NA: high, low) and the type of background speech (Dutch word list, Dutch sentence, Quiet) in Experiment 2.
First chunk length
As shown in Table 5 and the right panel of Figure 5, a Bayesian mixed-effects model showed that participants planned fewer names in their first response chunk for low name agreement pictures than for high name agreement pictures (β = 0.237, SE = 0.027, 95% Cr.I = [0.183, 0.291]). First chunk length was not affected by the type of background speech. Again, name agreement did not interact with the type of background speech.
Interim discussion
The results of Experiment 2 were remarkably similar to those of Experiment 1. Consistent with the phonological disruption view (Salamé & Baddeley, 1982, 1989), the presence of background speech, now in the participants’ native language, increased onset latencies and articulation time, and also had a weak impact on utterance durations. There was no difference between the Dutch word list and Dutch sentence conditions on any dependent measure. We also found main effects of name agreement on all dependent measures, as well as a weak modulation by name agreement of the effect of background speech, such that Dutch background speech increased total pause time during the planning of high, but not low, name agreement pictures. This is consistent with earlier work by He, Meyer and Brehm (2021) and suggests that stronger attentional engagement in the more difficult low name agreement condition leads to less interference from background speech.
General discussion
In two experiments, we explored how different types of unintelligible (Experiment 1) and intelligible (Experiment 2) background speech affected spoken language production, with a focus on their impact on lexical selection in speech planning. There were four major findings. First, we obtained consistent name agreement effects on all measures in both experiments: participants produced the names of low name agreement pictures more slowly, with more errors, and in shorter sets (“chunks”) than those of high name agreement pictures. Second, irrelevant background speech in Experiment 1 (Chinese, unintelligible to speakers) and Experiment 2 (Dutch, intelligible to speakers) consistently disrupted spoken word production relative to a quiet condition. This disruption manifested as increased articulation time and onset latencies in Experiment 1 (Chinese background speech), and as increased articulation time, onset latencies, and utterance duration in Experiment 2 (Dutch background speech). Third, no systematic difference between word lists and sentences was found in either experiment. Finally, the two types of irrelevant speech effects were modulated differently by the difficulty of speech production: the disruptive effects of Dutch background speech in Experiment 2 were strongest when high name agreement pictures were named.
The effect of name agreement (indexing lexical selection demands in production) was remarkably consistent across all measures and experiments (also see Supplementary Material, Table C1), replicating earlier work (e.g., Alario et al., 2004; He et al., 2021; Shao et al., 2014). The name agreement effects on the time measures (onset latencies, utterance duration, total pause time, and articulation time) are noteworthy because they show how lexical selection demands affect processing both before and after speech onset. This suggests that speakers retrieve picture names throughout the planning of a sequence of picture names, indicative of incremental speech planning during which speakers must coordinate the planning and articulation of successive words (e.g., Levelt et al., 1999; Roelofs, 1998; Wheeldon & Lahiri, 1997). Moreover, the finding that name agreement affected the response chunking measures (total chunk number, first chunk length) indicates that increased lexical selection demands reduced the size of the planned utterance units in each response: speakers appear to plan names with less temporal overlap, resulting in more and shorter response chunks, for pictures with low compared with high name agreement.
In both experiments, irrelevant speech consistently increased onset latencies and articulation time relative to a quiet control condition, which is in line with the phonological disruption view (Salamé & Baddeley, 1982, 1989) under the framework of the interference-by-similarity account (e.g., Hughes, 2014; Jones et al., 1993; also the crosstalk account, Pashler, 1994). This view predicts that any background speech (whether it is intelligible or not) should disrupt speech production due to the similarity of phonological codes between the focal task and background speech. Since Dutch speech (Experiment 2) did not cause more disruption than Chinese speech (Experiment 1) during initial planning and articulation processes (see Supplementary Material, Table C1), our results further argue against the importance of semantic similarity in disrupting speech planning.
Combined with earlier results from He, Meyer and Brehm (2021) who showed that word lists (regardless of intelligibility) interfered with onset latencies relative to a speech-like noise condition (i.e., eight-talker babble), these results also argue against the contribution of low-level acoustic properties shared between speech production and speech-like noise. Thus, these results are most in line with the phonological disruption view (Salamé & Baddeley, 1982, 1989).
We also found that Dutch but not Chinese background speech had a weak effect on utterance duration. This is consistent with He, Meyer and Brehm (2021), where Dutch word lists increased utterance duration relative to Chinese word lists, indicating that intelligible background speech elicits more disruption than unintelligible background speech. This suggests that intelligible background speech specifically interferes with the planning that is done between producing chunks of words, where a speaker needs to multi-task between speaking, planning, and listening. The extra disruption on utterance duration may result from similarity in semantics and/or phonology, or from an attention capture mechanism; further research would be needed to disentangle these possibilities.
In contrast to the robust differences between the background speech and quiet conditions, we did not observe any difference between the background word lists and sentences in either Experiment 1 or 2. The results of Experiment 1 suggest that stimulus-aspecific variation in unintelligible background speech does not disrupt spoken word production, which goes against the aspecific attention capture view but seems consistent with the specific attention capture view (Eimer et al., 1996).
However, the specific attention capture view (Eimer et al., 1996) also predicts that in Experiment 2, Dutch sentences (richer syntactic/semantic representation) should disrupt spoken word production more than Dutch word lists (weaker syntactic/semantic representation). This was not the case: we did not find any difference between Dutch word lists and sentences on any measure in Experiment 2. This is consistent with three possibilities. First, a stimulus-specific effect may indeed exist but have been too small to detect and further attenuated by the repetition of stimuli, which all appeared three times across three blocks in the present study. To test this possibility, we conducted all analyses including repetition (i.e., block) as a within-participant factor. However, we did not find any interaction between the type of irrelevant background speech (word list vs. sentence) and block in either experiment (see Supplementary Material, Table A5 for Experiment 1; Table B3 for Experiment 2), so there is no evidence that any background speech effect changes with repetition. Another possibility, and one we deem more likely, is that the aspecific and specific effects cancelled each other out. In other words, disruption by the presence of pauses (aspecific context variation) in Dutch word lists cancelled interference by richer linguistic information (specific linguistic variation) in Dutch sentences. This possibility could be pursued in future research with larger sources of stimulus-specific interference. Finally, it is possible that the manipulation of stimulus-aspecific variation in Experiment 2 was weak because the background speech stimuli were too uniform and boring (word lists had a regular acoustic pattern, sentences had a uniform syntactic structure).
Participants might adapt to the regular tempo of the word lists and use a strategy to name the pictures, causing weaker interference than we predicted and resulting in the absence of a word list versus sentence effect. This possibility is supported by a follow-up study in He (2023, Chapter 5), which directly manipulated the relative interestingness (boring vs. funny) of irrelevant background sentences and found that boring sentences were more disruptive than funny sentences. This suggests that stimulus-aspecific variation in the present experiments could have been weak due to the relative uniformity of the stimuli, and also that attention to background speech may be influenced by a wide variety of other factors.
Consistent with the predictions of the attention engagement account (Halin et al., 2014; Marsh et al., 2015), the interaction between background speech and name agreement was absent in Experiment 1 but present in Experiment 2 on the measure of total pause time. Disruption by Chinese background speech remained unaffected by changes in attention engagement manipulated via name agreement, presumably because the processing of unintelligible auditory input is automatic and escapes cognitive control (Hughes, 2014). In contrast, interference by Dutch background speech was reduced by increased attention engagement (for low name agreement pictures), because the processing of intelligible background speech requires central attention that taps into cognitive control (Marsh et al., 2018). This is largely consistent with He, Meyer and Brehm (2021), though note that the effects appeared on total pause time in Experiment 2 but on onset latencies in He, Meyer and Brehm (2021). The inconsistency may be due to small effect sizes or to variations in the baseline condition (quiet in the present study, eight-talker babble in He, Meyer and Brehm, 2021) and the speech production task (naming four pictures in the present study, naming six pictures in He, Meyer and Brehm, 2021). Future work is needed to determine the cause of the difference.
The fact that many facets of irrelevant background speech interfere with speech production leaves many possibilities for future work; we sketch some of these now. First, we saw clear evidence for the phonological but not the semantic disruption view (Martin et al., 1988; Salamé & Baddeley, 1982, 1989). To understand the nature of interference-by-similarity, more work should therefore manipulate specific relationships (e.g., phonological, semantic) between speaking and background speech, thereby assessing the role of shared representations in speaking-while-listening in a targeted way. Second, this study showed more evidence for specific than for aspecific attention capture (Eimer et al., 1996), but could not cleanly distinguish between the two. A further comparison between different types of irrelevant background speech matched closely on specific content and acoustic variation would be more informative about how the two variants of attention capture (aspecific and specific) affect speech production performance. Third, the non-continuous background speech in this study was regularly timed (with a consistent interval of 700 ms between words), which may have led to habituation effects over time. Future studies with irregular timing in the background speech would provide more clarity regarding the aspecific attention capture account. Fourth, the present research used a multi-object naming task that was relatively easy, and therefore not necessarily representative of typical speech production.
Given the complex interplay between the demands of speaking, listening, and attention, it would be fruitful to expand this line of research to more naturalistic speech production tasks such as sentence or dialogue production, and to assess whether other aspects of speech production difficulty (such as object recognition, phonological encoding, and phonetic encoding) show effects similar to those of lexical selection difficulty. Finally, this study focused mostly on two accounts (interference-by-similarity and attention capture) without considering other theoretical interpretations. Future research should consider alternative explanations for the irrelevant speech effect in speaking. For instance, the timing of the interference could play a role, based on results from picture–word interference studies (e.g., Glaser & Düngelhoff, 1984; Schriefers et al., 1990), which would inform other theories of the irrelevant speech effect.
Conclusion
Two experiments using a speaking-while-listening paradigm showed that irrelevant background speech (regardless of its intelligibility) disrupts spoken word production relative to a quiet condition, and that intelligible background speech elicits additional disruption. This finding underscores the importance of similarity in phonological representations between speech production and background speech in eliciting interference. Moreover, the absence of differences between the word list and sentence conditions in unintelligible background speech suggests that aspecific properties of background speech (in this case, the presence of pauses) do not affect naming performance by diverting attention away from the task. Finally, while intelligible background speech had a larger impact on spoken word production, this impact can be reduced through greater engagement with the task, for example, by increasing the difficulty of speech production. The implication is that when disruption by background speech occurs during speech production, speakers may be able to manage it by changing when and how they plan their speech.
Supplemental Material
Supplemental material, sj-docx-1-qjp-10.1177_17470218231219971 for Effects of irrelevant unintelligible and intelligible background speech on spoken language production by Jieying He, Candice Frances, Ava Creemers and Laurel Brehm in Quarterly Journal of Experimental Psychology
Acknowledgments
This project was part of the doctoral work conducted by the first author under the supervision of A.S. Meyer at the Max Planck Institute for Psycholinguistics. The authors also thank Maarten van den Heuvel and Thijs Rinsma for programming; Annelies van Wijngaarden and Sophie Slaats for translating and recording materials; Dennis Joosen, Esther de Kerf, Elizardo Laclé, Elsa Opheij, Marije Veeneman, and Sanne van Eck for data coding.
Footnotes
Here is an example of Experiment 1 for one participant: https://frinexproduction.mpi.nl/image_naming_noise_cn/?stimulusList=List1.
This was a different speaker from the one who recorded Dutch words for attention check trials.
Here is an example of Experiment 2 for one participant: https://frinexproduction.mpi.nl/image_naming_noise_nl/?stimulusList=List1.
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Max Planck Society.
ORCID iDs: Jieying He
https://orcid.org/0000-0002-2937-5100
Ava Creemers
https://orcid.org/0000-0002-7566-0658
Supplemental material: The supplementary material is available at qjep.sagepub.com.
Data availability statement: All stimuli, participant data, and analysis scripts can be found on this paper’s project page on the OSF (https://osf.io/wuafh/).
References
- Alario F. X., Ferrand L., Laganaro M., New B., Frauenfelder U. H., Segui J. (2004). Predictors of picture naming speed. Behavior Research Methods, Instruments, & Computers, 36(1), 140–155. 10.3758/BF03195559 [DOI] [PubMed] [Google Scholar]
- Baddeley A. (2000). The episodic buffer: A new component of working memory? Trends in Cognitive Sciences, 4(11), 417–423. 10.1016/S1364-6613(00)01538-2 [DOI] [PubMed] [Google Scholar]
- Baddeley A. (2003). Working memory: Looking back and looking forward. Nature Reviews Neuroscience, 4(10), 829–839. 10.1038/nrn1201 [DOI] [PubMed] [Google Scholar]
- Boersma P., Weenink D. (2009). Praat: Doing phonetics by computer (Version 5.1.05) [Computer program]. University of Amsterdam.
- Buchner A., Rothermund K., Wentura D., Mehl B. (2004). Valence of distractor words increases the effects of irrelevant speech on serial recall. Memory & Cognition, 32(5), 722–731. 10.3758/BF03195862 [DOI] [PubMed] [Google Scholar]
- Bürkner P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. 10.18637/jss.v080.i01 [DOI] [Google Scholar]
- Cauchard F., Cane J. E., Weger U. W. (2012). Influence of background speech and music in interrupted reading: An eye-tracking study. Applied Cognitive Psychology, 26(3), 381–390. 10.1002/acp.1837 [DOI] [Google Scholar]
- Cheng X., Schafer G., Akyürek E. G. (2010). Name agreement in picture naming: An ERP study. International Journal of Psychophysiology, 76(3), 130–141. 10.1016/j.ijpsycho.2010.03.003 [DOI] [PubMed] [Google Scholar]
- Cleland A. A., Gaskell M. G., Quinlan P. T., Tamminen J. (2006). Frequency effects in spoken and visual word recognition: Evidence from dual-task methodologies. Journal of Experimental Psychology: Human Perception and Performance, 32(1), 104–119. 10.1037/0096-1523.32.1.104 [DOI] [PubMed] [Google Scholar]
- Colle H. A., Welsh A. (1976). Acoustic masking in primary memory. Journal of Verbal Learning and Verbal Behavior, 15(1), 17–31. 10.1016/S0022-5371(76)90003-7 [DOI] [Google Scholar]
- Cowan N. (1995). Verbal working memory: A view with a room. American Journal of Psychology, 108(1995), 123–155. 10.2307/1423105 [DOI] [Google Scholar]
- Damian M. F. (2003). Articulatory duration in single-word speech production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29(3), 416–431. 10.1037/0278-7393.29.3.416 [DOI] [PubMed] [Google Scholar]
- Damian M. F., Martin R. C. (1999). Semantic and phonological codes interact in single word production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25(2), 345–361. 10.1037/0278-7393.25.2.345 [DOI] [PubMed] [Google Scholar]
- Duñabeitia J. A., Crepaldi D., Meyer A. S., New B., Pliatsikas C., Smolka E., Brysbaert M. (2018). MultiPic: A standardized set of 750 drawings with norms for six European languages. Quarterly Journal of Experimental Psychology, 71(4), 808–816. 10.1080/17470218.2017.1310261 [DOI] [PubMed] [Google Scholar]
- Eckert M. A., Teubner-Rhodes S., Vaden K. I., Jr. (2016). Is listening in noise worth it? The neurobiology of speech recognition in challenging listening conditions. Ear and Hearing, 37(Suppl. 1), 101s–110s. 10.1097/aud.0000000000000300 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eimer M., Nattkemper D., Schröger E., Prinz W. (1996). Involuntary attention. In Neumann O., Sanders A. F. (Eds.), Handbook of perception and action (Vol. 3, pp. 155–184). Academic Press. 10.1016/S1874-5822(96)80022-3 [DOI] [Google Scholar]
- Elliott E. M., Briganti A. M. (2012). Investigating the role of attentional resources in the irrelevant speech effect. Acta Psychologica, 140(1), 64–74. 10.1016/j.actpsy.2012.02.009 [DOI] [PubMed] [Google Scholar]
- Fairs A., Strijkers K. (2021). Can we use the internet to study speech production? Yes we can! Evidence contrasting online versus laboratory naming latencies and errors. PLOS ONE, 16(10), Article e0258908. 10.1371/journal.pone.0258908 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fargier R., Laganaro M. (2016). Neurophysiological modulations of non-verbal and verbal dual-tasks interference during word planning. PLOS ONE, 11(12), Article e0168358. 10.1371/journal.pone.0168358 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fargier R., Laganaro M. (2019). Interference in speaking while hearing and vice versa. Scientific Reports, 9(1), 1–13. 10.1038/s41598-019-41752-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Glaser W. R., Düngelhoff F.-J. (1984). The time course of picture-word interference. Journal of Experimental Psychology: Human Perception and Performance, 10(5), 640–654. 10.1037/0096-1523.10.5.640 [DOI] [PubMed] [Google Scholar]
- Halin N., Marsh J. E., Hellman A., Hellström I., Sörqvist P. (2014). A shield against distraction. Journal of Applied Research in Memory and Cognition, 3(1), 31–36. 10.1016/j.jarmac.2014.01.003 [DOI] [Google Scholar]
- He J. (2023). Coordination of spoken language production and comprehension: How speech production is affected by irrelevant background speech. Radboud University. [Google Scholar]
- He J., Meyer A. S., Brehm L. (2021). Concurrent listening affects speech planning and fluency: The roles of representational similarity and capacity limitation. Language, Cognition and Neuroscience, 36(10), 1258–1280. 10.1080/23273798.2021.1925130 [DOI] [Google Scholar]
- He J., Meyer A. S., Creemers A., Brehm L. (2021). Conducting language production research online: A web-based study of semantic context and name agreement effects in multi-word production. Collabra: Psychology, 7(1), 29935. 10.1525/collabra.29935 [DOI] [Google Scholar]
- Heldner M., Edlund J. (2010). Pauses, gaps and overlaps in conversations. Journal of Phonetics, 38(4), 555–568. 10.1016/j.wocn.2010.08.002 [DOI] [Google Scholar]
- Hughes R. W. (2014). Auditory distraction: A duplex-mechanism account. PsyCh Journal, 3(1), 30–41. 10.1002/pchj.44 [DOI] [PubMed] [Google Scholar]
- Hughes R. W., Vachon F., Jones D. M. (2007). Disruption of short-term memory by changing and deviant sounds: Support for a duplex-mechanism account of auditory distraction. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(6), 1050–1061. 10.1037/0278-7393.33.6.1050 [DOI] [PubMed] [Google Scholar]
- Hyönä J., Ekholm M. (2016). Background speech effects on sentence processing during reading: An eye movement study. PLOS ONE, 11(3), Article e0152133. 10.1371/journal.pone.0152133 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jones D., Morris N. (1992). Irrelevant speech and serial recall: Implications for theories of attention and working memory. Scandinavian Journal of Psychology, 33(3), 212–229. [DOI] [PubMed] [Google Scholar]
- Jones D. M., Macken W. J., Murray A. C. (1993). Disruption of visual short-term memory by changing-state auditory stimuli: The role of segmentation. Memory & Cognition, 21(3), 318–328. 10.3758/BF03208264 [DOI] [PubMed] [Google Scholar]
- Kittredge A. K., Dell G. S. (2016). Learning to speak by listening: Transfer of phonotactics from perception to production. Journal of Memory and Language, 89, 8–22. 10.1016/j.jml.2015.08.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Levelt W. J. M., Roelofs A., Meyer A. S. (1999). A theory of lexical access in speech production. Behavioral and Brain Sciences, 22(1), 1–38. 10.1017/S0140525X99001776 [DOI] [PubMed] [Google Scholar]
- Lien M. C., Ruthruff E., Cornett L., Goodin Z., Allen P. A. (2008). On the nonautomaticity of visual word processing: Electrophysiological evidence that word processing requires central attention. Journal of Experimental Psychology: Human Perception and Performance, 34(3), 751–773. 10.1037/0096-1523.34.3.751
- Marsh J. E., Ljung R., Jahncke H., MacCutcheon D., Pausch F., Ball L. J., Vachon F. (2018). Why are background telephone conversations distracting? Journal of Experimental Psychology: Applied, 24(2), 222–235. 10.1037/xap0000170
- Marsh J. E., Sörqvist P., Hughes R. W. (2015). Dynamic cognitive control of irrelevant sound: Increased task engagement attenuates semantic auditory distraction. Journal of Experimental Psychology: Human Perception and Performance, 41(5), 1462–1474. 10.1037/xhp0000060
- Martin R. C., Wogalter M. S., Forlano J. G. (1988). Reading comprehension in the presence of unattended speech and music. Journal of Memory and Language, 27(4), 382–398. 10.1016/0749-596X(88)90063-0
- Mitterer H., Ernestus M. (2008). The link between speech perception and production is phonological and abstract: Evidence from the shadowing task. Cognition, 109(1), 168–173. 10.1016/j.cognition.2008.08.002
- Navon D., Miller J. (1987). Role of outcome conflict in dual-task interference. Journal of Experimental Psychology: Human Perception and Performance, 13(3), 435–448. 10.1037/0096-1523.13.3.435
- Nicenboim B., Vasishth S. (2016). Statistical methods for linguistic research: Foundational ideas—Part II. Language and Linguistics Compass, 10(11), 591–613.
- Parmentier F. B. R., Beaman C. P. (2015). Contrasting effects of changing rhythm and content on auditory distraction in immediate memory. Canadian Journal of Experimental Psychology/Revue Canadienne de Psychologie Expérimentale, 69(1), 28–38. 10.1037/cep0000036
- Pashler H. (1994). Dual-task interference in simple tasks: Data and theory. Psychological Bulletin, 116(2), 220–244. 10.1037/0033-2909.116.2.220
- R Core Team. (2020). R: A language and environment for statistical computing (Version 4.0.3) [Computer software]. http://www.R-project.org
- Roelofs A. (1992). A spreading-activation theory of lemma retrieval in speaking. Cognition, 42(1), 107–142. 10.1016/0010-0277(92)90041-F
- Roelofs A. (1998). Rightward incrementality in encoding simple phrasal forms in speech production: Verb–particle combinations. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24(4), 904–921. 10.1037/0278-7393.24.4.904
- Roelofs A. (2003). Goal-referenced selection of verbal action: Modeling attentional control in the Stroop task. Psychological Review, 110(1), 88–125. 10.1037/0033-295X.110.1.88
- Roelofs A., Piai V. (2011). Attention demands of spoken word planning: A review. Frontiers in Psychology, 2, Article 307. 10.3389/fpsyg.2011.00307
- Röer J. P., Bell R., Buchner A. (2013). Self-relevance increases the irrelevant sound effect: Attentional disruption by one’s own name. Journal of Cognitive Psychology, 25(8), 925–931. 10.1080/20445911.2013.828063
- Röer J. P., Bell R., Buchner A. (2014). Evidence for habituation of the irrelevant-sound effect on serial recall. Memory & Cognition, 42(4), 609–621. 10.3758/s13421-013-0381-y
- Röer J. P., Bell R., Buchner A. (2015). Specific foreknowledge reduces auditory distraction by irrelevant speech. Journal of Experimental Psychology: Human Perception and Performance, 41(3), 692–702. 10.1037/xhp0000028
- Ruthruff E., Pashler H. E., Hazeltine E. (2003). Dual-task interference with equal task emphasis: Graded capacity sharing or central postponement? Perception & Psychophysics, 65(5), 801–816. 10.3758/BF03194816
- Salamé P., Baddeley A. (1982). Disruption of short-term memory by unattended speech: Implications for the structure of working memory. Journal of Verbal Learning and Verbal Behavior, 21(2), 150–164. 10.1016/S0022-5371(82)90521-7
- Salamé P., Baddeley A. (1989). Effects of background music on phonological short-term memory. The Quarterly Journal of Experimental Psychology Section A, 41(1), 107–122. 10.1080/14640748908402355
- Schlittmeier S. J., Weißgerber T., Kerber S., Fastl H., Hellbrück J. (2012). Algorithmic modeling of the irrelevant sound effect (ISE) by the hearing sensation fluctuation strength. Attention, Perception, & Psychophysics, 74(1), 194–203. 10.3758/s13414-011-0230-7
- Schriefers H., Meyer A. S., Levelt W. J. M. (1990). Exploring the time course of lexical access in language production: Picture-word interference studies. Journal of Memory and Language, 29(1), 86–102. 10.1016/0749-596X(90)90011-N
- Shao Z., Roelofs A., Acheson D. J., Meyer A. S. (2014). Electrophysiological evidence that inhibition supports lexical selection in picture naming. Brain Research, 1586, 130–142. 10.1016/j.brainres.2014.07.009
- Stark K., van Scherpenberg C., Obrig H., Abdel Rahman R. (2022). Web-based language production experiments: Semantic interference assessment is robust for spoken and typed response modalities. Behavior Research Methods, 55(1), 236–262. 10.3758/s13428-021-01768-2
- van Casteren M., Davis M. H. (2006). Mix, a program for pseudorandomization. Behavior Research Methods, 38(4), 584–589. 10.3758/BF03193889
- Vitkovitch M., Tyrrell L. (1995). Sources of disagreement in object naming. The Quarterly Journal of Experimental Psychology Section A, 48(4), 822–848. 10.1080/14640749508401419
- Vogt A., Hauber R., Kuhlen A. K., Rahman R. A. (2022). Internet-based language production research with overt articulation: Proof of concept, challenges, and practical advice. Behavior Research Methods, 54(4), 1954–1975. 10.3758/s13428-021-01686-3
- Wheeldon L., Lahiri A. (1997). Prosodic units in speech production. Journal of Memory and Language, 37(3), 356–381. 10.1006/jmla.1997.2517
- Withers P. (2017). Frinex: Framework for interactive experiments. 10.5281/zenodo.3522911
- Wood N., Cowan N. (1995). The cocktail party phenomenon revisited: How frequent are attention shifts to one’s name in an irrelevant auditory channel? Journal of Experimental Psychology: Learning, Memory, and Cognition, 21(1), 255–260. 10.1037/0278-7393.21.1.255
- Yan G., Meng Z., Liu N., He L., Paterson K. B. (2018). Effects of irrelevant background speech on eye movements during reading. Quarterly Journal of Experimental Psychology, 71(6), 1270–1275. 10.1080/17470218.2017.1339718
Supplementary Materials
Supplemental material, sj-docx-1-qjp-10.1177_17470218231219971 for Effects of irrelevant unintelligible and intelligible background speech on spoken language production by Jieying He, Candice Frances, Ava Creemers and Laurel Brehm in Quarterly Journal of Experimental Psychology