Journal of Speech, Language, and Hearing Research. 2021 Aug 17;64(9):3617–3626. doi: 10.1044/2021_JSLHR-21-00108

Speech-in-Speech Recognition and Spatially Selective Attention in Children and Adults

Stacey G Kane, Kelly M Dean, Emily Buss
PMCID: PMC8642097  PMID: 34403280

Abstract

Purpose

Knowing target location can improve adults' speech-in-speech recognition in complex auditory environments, but it is unknown whether young children listen selectively in space. This study evaluated masked word recognition with and without a pretrial cue to location to characterize the influence of listener age and masker type on the benefit of spatial cues.

Method

Participants were children (5–13 years of age) and adults with normal hearing. Testing occurred in a 180° arc of 11 loudspeakers. Targets were spondees produced by a female talker and presented from a randomly selected loudspeaker; that location was either known, based on a pretrial cue, or unknown. Maskers were two sequences comprising spondees or speech-shaped noise bursts, each presented from a random loudspeaker. Speech maskers were produced by one male talker or by three talkers, two male and one female.

Results

Children and adults benefited from the pretrial cue to target location with the three-voice masker, and the magnitude of benefit increased with increasing child age. There was no benefit of location cues in the one-voice or noise-burst maskers. Incorrect responses in the three-voice masker tended to correspond to masker words produced by the female talker, and in the location-known condition, those masker intrusions were more likely near the cued loudspeaker for both age groups.

Conclusions

Increasing benefit of the location cue with increasing child age in the three-voice masker suggests maturation of spatially selective attention, but error patterns do not support this idea. Differences in performance in the location-unknown condition could play a role in the differential benefit of the location cue.


Speech recognition in the presence of competing speech is important for successful communication in complex auditory environments for both adults and children. Moreover, when compared to speech-in-noise tasks, speech-in-speech recognition predicts subjective listening difficulties in listeners with hearing loss (Hillock-Dunn et al., 2015; Phatak et al., 2019). The difficulties encountered when listening to speech in the presence of a small number of masker talkers are often described in terms of informational masking (Brungart, 2001; Brungart et al., 2006). Informational masking occurs when the target is encoded at the auditory periphery, but the listener struggles to segregate it from the masker due to stimulus uncertainty and perceptual similarity between the target and masker (Kidd et al., 2002; Leek et al., 1991; Neff & Green, 1987; Watson et al., 1976).

While informational masking is generally greater for school-age children than adults, results to date suggest that manipulations of perceptual similarity affect performance in similar ways. For example, speech-in-speech recognition for children and adults is influenced by mismatches in target and masker talker sex (Brungart, 2001; Leibold et al., 2018), language (Calandruccio et al., 2013, 2016), and location on the horizontal plane (Freyman et al., 2001; Yuen & Yuan, 2014). Much less is known about the role of stimulus uncertainty for children and adults. In the context of speech-in-speech recognition, uncertainty can arise due to variability in the content of the message or the acoustic features of speech, such as voice pitch, location, or presentation level (Eipert et al., 2019). Adults benefit from pretrial cues indicating target location (Brungart & Simpson, 2007; Kidd et al., 2005), target talker identity (Kidd et al., 2005), and stimulus content prior to the target (Freyman et al., 2004). It is unknown whether children experience similar improvements in performance with reduced stimulus uncertainty. The experiments described in this report were designed to evaluate how the use of spatial cues meant to reduce uncertainty and guide selective attention to the location of the target is affected by development in school-age children with normal hearing. Understanding how children are affected by stimulus uncertainty with respect to talker location is important for understanding their ability to communicate and learn in complex multisource environments.

When the target and masker location are fixed across a block of trials, spatial separation between talkers benefits speech recognition for both children and adults. This benefit is described as spatial release from masking (SRM), and it is typically quantified as the difference in speech recognition scores when the target and masker are separated in space relative to when they are colocated. This effect is modest for speech-in-noise recognition, but it can be quite pronounced for speech-in-speech recognition (Arbogast et al., 2002; Freyman et al., 2001), with effects of 10 dB commonly observed. A positive SRM for speech-in-speech recognition is thought to reflect a reduction in perceptual similarity and the introduction of segregation cues associated with differences in sound source location. Some studies with children indicate comparable magnitude of SRM for children and adults (Garadat & Litovsky, 2007; Litovsky, 2005), whereas other studies report increasing SRM with increasing age in children (Cameron et al., 2006; Corbin et al., 2017; Yuen & Yuan, 2014). It is unclear how to account for this discrepancy in findings across studies, but it could be due to differences in the amount of informational masking in the baseline condition, differences in the quality or quantity of speech cues required for children and adults to recognize speech, and the relative contributions of head shadow and “true” binaural hearing to SRM. Regardless, data on SRM in children indicate an ability to benefit from spatial separation.

Uncertainty regarding the spatial position of the target can affect the magnitude of SRM in adults (Brungart & Simpson, 2007; Kidd et al., 2005). For example, Kidd et al. (2005) evaluated speech-in-speech recognition with adult listeners for conditions in which the target location was either known in advance or unknown prior to stimulus presentation. Stimuli were sentences from the coordinate response measure (Bolia et al., 2000) corpus spoken by four male talkers. On any given trial, sentences produced by three talkers were presented simultaneously from three loudspeakers positioned at −60°, 0°, and 60° on the horizontal plane. Prior expectation regarding target location was established by manipulating the probability of the target coming from each of the three loudspeakers within a block of trials, and the call sign associated with the target was presented visually on a computer screen either before or after the stimulus was presented. When the target location was fixed across trials, performance was ~90% correct both with and without the pretrial cue indicating the target call sign. As location became increasingly uncertain, the pretrial call-sign cue conferred more benefit. When location was random (p = .33 for each of three locations), mean performance was 67% correct with a pretrial call-sign cue and 31% without that cue, suggesting that listeners randomly selected a talker to attend to if the call sign was not known ahead of time. The results of Kidd et al. (2005) indicate that prior expectation regarding target location was more beneficial than the pretrial call-sign cue. Whereas adults benefit from prior information about target location, we do not know whether young children have a comparable ability or if spatially selective attention develops during childhood.

The idea that immature selective attention to the target affects children's speech-in-speech recognition is consistent with data from psychoacoustic studies using nonspeech stimuli (Leibold, 2012). For example, Leibold and Buss (2016) measured detection thresholds for a 2-kHz pure-tone signal presented in quiet or with a synchronously gated 60-dB SPL band of noise that was one or more octaves above or below the signal frequency, including a range of noise bandwidths (75–6000 Hz). Compared to thresholds in quiet, those maskers elevated thresholds by 10.5–15.2 dB for 4- to 6-year-olds, by 3.2–8.7 dB for 7- to 10-year-olds, and by 2.0–6.3 dB for adults. This progressive reduction in remote masking with increasing age was attributed to maturation of the ability to selectively attend to the pitch of the signal. One possibility is that spatial listening is similarly affected by maturation in selective attention, such that young children are immature in their ability to preferentially attend to auditory cues based on source location. If that is the case, then young children would derive less benefit from a pretrial cue to target location compared to older children and adults when the target location is roved across trials.

Two experiments were undertaken to better understand the role of spatial uncertainty for speech-in-speech recognition in children and adults. In both cases, the task was spondee word recognition in a masker composed of either two sequences of concatenated spondees or concatenated noise bursts that were shaped using the envelopes of spondees. Target speech was produced by a single female talker, and masker speech was produced by either one male or a mix of male and female talkers. The rationale for manipulating masker talker sex was to vary the overall amount of informational masking; sex differences across talkers provide a segregation cue for both children and adults (Brungart, 2001; Leibold et al., 2018), so more informational masking was expected when the masker contained a female voice. Each spondee or noise burst in the masker was presented from a randomly selected loudspeaker. The spondee masker bears some resemblance to the two-talker speech that is commonly used to study speech-in-speech recognition (e.g., Freyman et al., 2004), in that it contains two voices at any given point in time, but it differs by virtue of unpredictable changes in location and (in the case of the three-voice masker) talker. A sentence produced by the target talker was presented before each interval; in the location-known condition, this pretrial cue came from the target location, and in the location-unknown condition, it was presented from a randomly selected loudspeaker.

At the outset, we expected adults to benefit from a pretrial location cue when the masker was composed of speech. In contrast, little or no benefit was expected for the masker composed of noise bursts because the marked perceptual differences between target speech and masking noise support segregation even in the absence of additional cues. If children have an immature ability to focus their selective attention at a cued location in space, then they should derive less benefit than adults from the pretrial location cue. Support for this hypothesis would suggest that the development of SRM observed in some previous studies (Cameron et al., 2006; Corbin et al., 2017; Yuen & Yuan, 2014) may be due, at least in part, to development of spatially selective listening.

General Method

Stimuli

Speech stimuli were 72 spondees, as shown in Table 1. These words were spoken by two males and two females, all native speakers of American English; one female talker was always used as the target voice, and the other three talkers served as maskers. Mean word length was 780 ms (SD = 105) for the masker voices and 849 ms (SD = 90) for the target voice. These stimuli were developed by Atagi and Bent (2013) and Bent (2014).

Table 1.

Spondee stimuli.

airplane    doorbell    hopscotch   playground
armchair    doormat     horseshoe   playmate
ashtray     doorstep    hotdog      playpen
backyard    downtown    hothouse    railroad
barnyard    drawbridge  iceberg     rainbow
baseball    drugstore   ice cream   scarecrow
bathtub     duck pond   inkwell     schoolboy
bedroom     eardrum     jackknife   schoolroom
bird nest   eyebrow     jump rope   shoelace
birthday    farewell    meatball    sidewalk
blackbird   football    mousetrap   stairway
blue-jay    footstool   mushroom    sunset
bus-stop    grandson    necktie     sunshine
cowboy      greyhound   northwest   toothbrush
cupcake     hairbrush   oatmeal     toyshop
daybreak    hardware    outside     whitewash
daylight    headlight   padlock     woodwork
dollhouse   highchair   pancake     workshop

The target was a randomly selected word, presented from a randomly selected loudspeaker in an arc of 11 speakers spanning −90° to 90°, with 18° between neighboring speakers. The masker comprised two sequences of either spondees or noise bursts. Spondee sequences were generated prior to each trial by randomly selecting words without replacement from the pool of 72 spondees, excluding the target. For the one-voice masker, all masker spondees were produced by the same male talker; for the three-voice masker, they were produced by three talkers (two males and one female), selected at random for each spondee. Each spondee in each stream was presented from a randomly selected loudspeaker, and spondees within a sequence were played without intervening delays. Noise-burst maskers were generated the same way, with the additional step that the temporal envelope of each spondee was extracted and multiplied with a speech-shaped noise (SSN) matching the long-term average power of the three masker talkers. Maskers began 2 s prior to the onset of the target word and continued for 2 s following target offset, gated with 20-ms raised-cosine ramps. Each masker sequence began at a random point in the first spondee or noise burst to limit synchrony across sequences. The two masker sequences presented on each trial contained a total of 12–16 words or noise bursts; on approximately 60% of trials, the masker contained 14 words.
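The masker construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the short word list, the talker labels, and the choice to draw words without replacement across both sequences of a trial (rather than within each sequence) are not taken from the study, and the sketch handles only the bookkeeping of word, talker, and loudspeaker, not the audio.

```python
import random

# 11 loudspeakers from -90 to +90 degrees in 18-degree steps (from the text).
ANGLES_DEG = [-90 + 18 * i for i in range(11)]

def make_masker_sequences(words, target, lengths, talkers, rng):
    """Return one list of (word, talker, azimuth) tuples per masker sequence.

    Assumption: words are sampled without replacement across the whole
    trial, and the target word is excluded from the pool.
    """
    pool = [w for w in words if w != target]
    chosen = rng.sample(pool, sum(lengths))  # no repeats within a trial
    sequences, start = [], 0
    for n in lengths:
        seq = [(w, rng.choice(talkers), rng.choice(ANGLES_DEG))
               for w in chosen[start:start + n]]
        sequences.append(seq)
        start += n
    return sequences

# Hypothetical subset of the Table 1 spondees, for illustration only.
WORDS = ["airplane", "armchair", "ashtray", "backyard", "barnyard",
         "baseball", "bathtub", "bedroom", "birthday", "blackbird",
         "cowboy", "cupcake", "daybreak", "daylight", "dollhouse", "doorbell"]

rng = random.Random(1)
# Three-voice masker: talker chosen at random for each spondee.
seqs = make_masker_sequences(WORDS, "cupcake", (7, 7),
                             ["male 1", "male 2", "female"], rng)
```

For the one-voice masker, the same call would pass a single-element talker list.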

Each trial was preceded by the sentence, “The sky is very blue,” produced by the target talker. In one set of conditions, this pretrial cue was presented from a randomly selected loudspeaker, and in another set of conditions, it was presented from the same loudspeaker as the subsequent target spondee. Both cues provided the listener with a sample of the target talker's voice, but they differed with respect to information about target location. These two cue conditions are therefore described as location-unknown and location-known, respectively.

The level of the target-plus-masker was 65–68 dB SPL over the duration of the target word regardless of target-to-masker ratio (TMR). This was achieved by (a) scaling the root-mean-square (RMS) level of the target word and each masker sequence to 65 dB SPL, (b) incrementing the level of the target by the desired TMR, (c) estimating the level associated with the target and one masker sequence [Δ = 10*log10(1 + 10^(TMR/10))], and (d) summing the target and the two masker sequences, scaling the result by −Δ dB. The level of the target-plus-masker approached 68 dB SPL at low TMRs and 65 dB SPL at high TMRs.
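The arithmetic behind the 65–68 dB SPL range can be checked with a short calculation. The sketch below assumes that the target and the two masker sequences are uncorrelated, so that their powers add; the function names are illustrative and not from the study.

```python
import math

def db_sum(levels_db):
    """Power-sum a list of levels in dB (assumes uncorrelated sources)."""
    return 10 * math.log10(sum(10 ** (level / 10) for level in levels_db))

def overall_level(tmr_db, base_db=65.0):
    """Level of target plus two maskers after scaling by -delta, in dB SPL."""
    target = base_db + tmr_db                          # steps (a) and (b)
    delta = 10 * math.log10(1 + 10 ** (tmr_db / 10))   # step (c)
    mixture = db_sum([target, base_db, base_db])       # step (d), before scaling
    return mixture - delta

for tmr in (-30, 0, 30):
    print(f"TMR {tmr:+d} dB -> {overall_level(tmr):.1f} dB SPL")
```

Under these assumptions the overall level is 65 + 10*log10[(2 + 10^(TMR/10)) / (1 + 10^(TMR/10))] dB SPL, which tends to 65 + 10*log10(2) ≈ 68 dB SPL as the TMR falls and to 65 dB SPL as it rises.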

Listeners

Listeners were school-age children and young adults with normal hearing, defined as ≤ 20 dB HL at octave frequencies from 250 to 8000 Hz in both ears (American National Standards Institute, 2018). One exception was an adult participant with a threshold of 25 dB HL in one ear at 8000 Hz. All were native speakers of Mainstream American English. Exclusion criteria were history of middle ear dysfunction within 3 months of participation and speech/language or neurologic disorders. Listeners were tested in one to two test sessions, each lasting approximately 1–1.5 hr. All test procedures were approved by the institutional review board at The University of North Carolina at Chapel Hill.

Procedure

Data were collected in a sound-treated, double-walled booth (10' × 10') with an array of 11 powered loudspeakers (JBL LSR305). Speakers were equidistant from one another in a 180° hemifield, 2 m in diameter. Participants were seated in the center of the arc facing the center speaker at 0° azimuth. Chair height was adjusted so that the listeners' ears were on the same plane as the loudspeaker cones. Participants were instructed to face the center speaker throughout each listening trial; listener head position was monitored using a small video camera mounted on top of the center speaker, and listeners were reinstructed as necessary. Loudspeakers were connected to individual channels of a 24-channel soundcard (MOTU 24i), with differential balanced input. The experiment was controlled using custom MATLAB (MathWorks) scripts and the Psychtoolbox-3 (Brainard, 1997).

Listening conditions were blocked. For the location-known conditions, listeners were told that the target word would come from the same location as the pretrial cue. In the location-unknown trials, they were told that the target word could come from any loudspeaker, regardless of the location of the pretrial cue. After each trial, the participant responded by repeating the target word. The examiner scored the response as correct or incorrect. If the listener's response was incorrect, the examiner noted the participant's spoken response.

Speech reception thresholds (SRTs) were estimated using data collected with two interleaved adaptive tracks. One track used a one-down, two-up stepping rule converging on 29% correct, and the other used a two-down, one-up rule converging on 71% correct. Each track initially adjusted the TMR in steps of 8 dB, and steps were reduced to 4 dB after the second track reversal. The number of trials per track differed for Experiments 1 and 2, as defined below. Data were combined across tracks for each listener and in each condition, and the results were fitted with a logit function. This psychometric function was used to estimate the SRT associated with 50% correct performance for each condition. Both experiments used these methods to estimate SRTs; Experiment 2 also collected fixed-TMR data, with the TMR set to each listener's SRT. Fixed-TMR testing was carried out to evaluate error patterns in listener responses at the TMR associated with ~50% correct recognition.
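The two stepping rules can be illustrated with a minimal simulation. This is a sketch, not the authors' MATLAB implementation: the class and function names, the starting TMR, the strict alternation between tracks, and the logistic listener with its SRT and slope are all assumptions made for illustration. A two-down, one-up rule converges where p² = .5 (p ≈ .71), and a one-down, two-up rule where (1 − p)² = .5 (p ≈ .29).

```python
import math
import random

class Track:
    """One adaptive track: n_down consecutive correct responses lower the
    TMR, and n_up consecutive errors raise it."""

    def __init__(self, n_down, n_up, start_tmr=10.0):
        self.n_down, self.n_up = n_down, n_up
        self.tmr = start_tmr          # hypothetical starting TMR
        self.step = 8.0               # initial step size in dB (from the text)
        self.correct_run = 0
        self.error_run = 0
        self.last_direction = 0
        self.reversals = 0

    def update(self, correct):
        if correct:
            self.correct_run += 1
            self.error_run = 0
        else:
            self.error_run += 1
            self.correct_run = 0
        if self.correct_run >= self.n_down:
            direction = -1
        elif self.error_run >= self.n_up:
            direction = 1
        else:
            return self.tmr           # no change on this trial
        self.correct_run = self.error_run = 0
        self.tmr += direction * self.step
        if self.last_direction and direction != self.last_direction:
            self.reversals += 1
            if self.reversals == 2:   # steps shrink after the second reversal
                self.step = 4.0
        self.last_direction = direction
        return self.tmr

def simulate(srt=-5.0, slope=0.5, trials_per_track=20, seed=0):
    """Run both tracks against a hypothetical logistic listener and return
    the pooled (TMR, correct) pairs that a logit fit would use."""
    rng = random.Random(seed)
    tracks = [Track(1, 2), Track(2, 1)]   # target about 29% and 71% correct
    data = []
    for _ in range(trials_per_track):
        for track in tracks:              # assumption: strict alternation
            p = 1 / (1 + math.exp(-slope * (track.tmr - srt)))
            correct = rng.random() < p
            data.append((track.tmr, correct))
            track.update(correct)
    return data
```

Pooling the two tracks places trials on both sides of the 50% point, which is what allows the fitted logit function to be read off at 50% correct.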

Analysis

The SRTs measured for children in Experiments 1 and 2 were evaluated as a function of child age using linear mixed models, with random intercepts for each listener. While previous studies sometimes log-transform age to accommodate decelerating effects of development (e.g., Porter et al., 2020), no transformation was applied here. Slightly better fits were observed using age than the log transform of age, although the same pattern of significance was observed either way. Adult data were evaluated without consideration of listener age, with the assumption that young adults' performance is consistent along this dimension.

Analysis of fixed-TMR data in Experiment 2 entailed evaluating the incidence of errors across age and across groups. Trial-by-trial responses were recorded and evaluated for intrusions, which were defined as errors in which the participant did not correctly name the target word but instead named a spondee spoken by one of the masker talkers. The incidence of intrusions was represented as a proportion of the total, transformed using a logit function prior to analysis. Comparisons of the incidence of intrusions from the female masker talker and chance performance used Wilcoxon signed-ranks tests, due to deviations from normality in the data. A two-tailed significance criterion of α = .05 was adopted for all analyses.

Experiment 1

The first experiment evaluated the effect of providing a pretrial cue to target location on masked word recognition as a function of listener age and masker type. Listeners were 18 children (age: 6.5–13.2 years, M = 9.8 years) and 13 adults (age: 19.1–38.1 years, M = 24.5 years). SRTs were measured in each of three maskers: (a) the three-voice masker, (b) the one-voice masker, and (c) the noise-burst masker. Recall that all of these maskers were composed of two sequences of spondees or SSN bursts modulated by spondee envelopes. In the one-voice masker, listeners could monitor their environment for a female voice, but this strategy would not work in the three-voice condition due to the presence of a female masker talker. We know that both children and adults can use differences in talker sex to help them selectively attend to the target (Leibold et al., 2018); for that reason, the one-voice condition was predicted to produce less masking than the three-voice condition. The clear perceptual differences between speech and noise are thought to facilitate stream segregation for both children and adults (Leibold et al., 2016), so performance was predicted to be best for the noise-burst masker.

In this experiment, the TMR was adaptively varied using methods described above, with 20 trials per track. All participants provided data in two tracks for the location-known and location-unknown conditions for each of the three maskers. Data collection was blocked by cue type (location-known and location-unknown), and listeners completed these two blocks in quasi-random order. The order of maskers within each block was counterbalanced.

Results

Figure 1 shows the results of Experiment 1 plotted as a function of child age. Results for the three masker conditions are shown in separate columns, with SRTs on the top row and the benefit of a pretrial cue to target location shown on the bottom row. Boxplots to the far right of each panel indicate the distribution of adult SRTs. The pretrial cue indicating target position had little or no effect for the noise-burst and one-voice masker conditions. For the three-voice condition, the pretrial location cue had very little effect for the youngest children, benefit increased with child age, and adults obtained a mean benefit of 8.5 dB. Lines in Figure 1 indicate significant correlations with child age, including SRTs in the one-voice condition when the location was unknown (r = −.62, p = .008), SRTs for both three-voice conditions (location-known: r = −.74, p = .001; location-unknown: r = −.57, p = .016), and the benefit of the location cue for the three-voice condition (r = .69, p = .002).

Figure 1.

Results from Experiment 1, plotted as a function of child age. The distributions of adult values appear at the far right of each panel: Horizontal lines indicate the median, boxes span the 25th–75th percentiles, and vertical lines span the 10th–90th percentiles. The top row of panels shows speech reception thresholds (SRTs), and the bottom row shows the benefit of a pretrial cue indicating target location. Lines indicate significant effects of child age. SNR = signal-to-noise ratio; yrs = years.

A linear mixed model with random intercepts for each listener was used to evaluate these effects in data from children. Age was mean-centered on 9.8 years. Categorical variables were coded such that conditional effects referenced performance with the one-voice masker in the location-unknown condition. The model was specified to evaluate differences in performance for masker type (one-voice vs. noise-burst) and number of masker talkers (one-voice vs. three-voice). Table 2 contains coefficient and significance values for this model, including three-way interactions and all lower-order effects.

Table 2.

Model estimates for the effects of masker, location cue, and child age on speech reception threshold in Experiment 1.

Variable                      Value   SE    df  t value  p value
(Intercept)                   −9.25   0.81  75  −11.41   < .001
Age                           −1.35   0.39  15  −3.43    .004
Type                          −1.80   1.00  75  −1.80    .076
Number                        11.29   1.00  75  11.30    < .001
Location cue                  −1.49   1.00  75  −1.49    .140
Type × Location Cue           2.03    1.41  75  1.44     .155
Number × Location Cue         −3.75   1.41  75  −2.66    .010
Age × Type                    0.69    0.48  75  1.42     .160
Age × Number                  0.62    0.48  75  1.29     .202
Age × Location Cue            0.76    0.48  75  1.57     .122
Age × Type × Location Cue     −0.24   0.69  75  −0.36    .723
Age × Number × Location Cue   −1.85   0.69  75  −2.70    .009

Note. Age was mean centered (9.8 years). Values in bold font are statistically significant. Categorical variables were reference-coded to evaluate masker type (0 = one-voice, 1 = noise-burst), the number of masker talkers (0 = one-voice, 1 = three-voice), and effects of the location cue (0 = location-unknown, 1 = location-known).

Results revealed a conditional effect of child age, consistent with development in the one-voice, location-unknown condition (p = .004). There was a nonsignificant trend for an effect of masker type that suggested lower (better) SRTs in the noise-burst masker compared to one-voice (p = .076), while the effect of masker number revealed significantly lower SRTs in the one-voice masker compared to the three-voice masker (p < .001). There was no effect of location cue in the reference condition (p = .140), but there was a significant interaction between masker number and the location cue (p = .010), indicating significant improvements in SRT when listeners were provided a location cue in the three-voice masker. Finally, a significant three-way interaction (p = .009) was consistent with the observation that the benefit from location cue in the three-voice masker was dependent on child age. All other effects were nonsignificant.
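To make the interaction terms concrete, the fixed-effect estimates in Table 2 can be combined into predicted SRTs for a child of a given age. The sketch below is only a reading aid: it uses the point estimates as reported, ignores the random intercepts and all uncertainty, and the function and key names are illustrative.

```python
# Fixed-effect estimates copied from Table 2 (dB; age in years, centered on 9.8).
COEF = {
    "intercept": -9.25, "age": -1.35, "type": -1.80, "number": 11.29,
    "cue": -1.49, "type:cue": 2.03, "number:cue": -3.75,
    "age:type": 0.69, "age:number": 0.62, "age:cue": 0.76,
    "age:type:cue": -0.24, "age:number:cue": -1.85,
}

def predicted_srt(age_years, masker, cue_known):
    """Predicted SRT (dB) from the fixed effects only."""
    age = age_years - 9.8                        # mean-centered age
    t = 1 if masker == "noise-burst" else 0      # masker type dummy
    n = 1 if masker == "three-voice" else 0      # masker number dummy
    c = 1 if cue_known else 0                    # location cue dummy
    b = COEF
    return (b["intercept"] + b["age"] * age + b["type"] * t
            + b["number"] * n + b["cue"] * c
            + b["type:cue"] * t * c + b["number:cue"] * n * c
            + b["age:type"] * age * t + b["age:number"] * age * n
            + b["age:cue"] * age * c
            + b["age:type:cue"] * age * t * c
            + b["age:number:cue"] * age * n * c)

def cue_benefit(age_years, masker):
    """Location-unknown SRT minus location-known SRT (positive = benefit)."""
    return (predicted_srt(age_years, masker, False)
            - predicted_srt(age_years, masker, True))

for age in (6.8, 9.8, 12.8):
    print(age, round(cue_benefit(age, "three-voice"), 2))
```

For the three-voice masker these estimates imply a benefit of 1.49 + 3.75 = 5.24 dB at the mean age of 9.8 years, changing by −(0.76 − 1.85) = 1.09 dB per year of age.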

In addition to effects of age within the population of child listeners, age effects can also be characterized by comparing performance for children and adults. This was evaluated with multiple Welch's independent-samples t tests, which do not assume equal variance. These tests were evaluated as one tailed and not adjusted for multiple comparisons due to the strong prior predictions based on published data and age effects in child data. For all six conditions (3 maskers × 2 location cue conditions), SRTs were significantly higher for children than adults (p ≤ .028), with mean differences ranging from 2.3 dB (noise-burst masker, location unknown) to 8.0 dB (one-voice masker, location unknown). One-tailed, one-sample t tests were also run to evaluate the benefit of a pretrial cue to location in adult data. Consistent with results from children, data from adults indicate that the benefit was not different from zero for the noise-burst masker (M = 0.14), t(12) = 0.14, p = .444, or the one-voice masker (M = −0.38), t(12) = −0.28, p = .601, but it was significantly greater than zero for the three-voice masker (M = 8.50), t(12) = 5.30, p < .001.

Experiment 2

Experiment 2 was designed with two goals in mind: (a) to replicate the findings from Experiment 1 with respect to development of the ability to benefit from a cue to target location in the three-voice masker condition and (b) to gain insight into listening strategies by evaluating intrusions from the masker. Listeners were 16 children (age: 5.6–12.7 years, M = 8.5 years) and 15 young adults (age: 20.8–26.0 years, M = 22.7 years). There was no overlap in participants between experiments. Stimuli and procedures were the same as in Experiment 1, with the following exceptions. Performance was evaluated only in the three-voice masker, with and without the pretrial cue to location. A single estimate of SRT was obtained adaptively in each condition, with 60 trials per estimate, using the same interleaved adaptive tracking procedures as Experiment 1. Fixed-level testing was then carried out at each listener's SRT, with 60 trials in each condition. Participants completed testing blocked by location cue condition (known and unknown), and the order of conditions was counterbalanced.

Results

Figure 2 displays SRTs plotted as a function of child age in the location-known and location-unknown conditions (top left panel), as well as benefit from the location cue (bottom left panel). Boxplots to the far right of each panel show SRTs for adults. Similar to Experiment 1, SRTs in the three-voice masker improved with increasing child age in the location-known condition (r = −.93, p < .001); this association did not reach significance for the location-unknown condition (r = −.35, p = .206). The correlation between age and benefit was significant (r = .91, p < .001). Also similar to Experiment 1, SRTs were significantly higher for children than adults for both the location-known and location-unknown conditions (p ≤ .036). There was a nonsignificant trend for the benefit of a pretrial cue to location to be smaller for children than adults, t(16.2) = −1.47, p = .081, and the benefit of a pretrial cue was significantly greater than zero for adults (M = 10.2), t(11) = 4.08, p < .001.

Figure 2.

Results from Experiment 2. The left column of panels shows speech reception thresholds (SRTs; top) and the benefit of the pretrial cue to talker location (bottom), with plotting conventions following those of Figure 1. The right column shows results of the error analysis of fixed target-to-masker ratio data. The top panel shows the proportion of intrusions from the female masker talker, and the bottom panel shows the root-mean-square (RMS) separation between cued locations and locations of intrusions from the masker. Circle size reflects the number of intrusions for each child, as defined in the legend. As in the panel depicting SRTs, child data are plotted as a function of age, symbol fill reflects the cue condition, and the distribution of adult data is shown with boxplots. The horizontal lines in panels in the right column indicate chance performance, which would be expected if intrusions were from randomly selected masker talkers (top) or randomly selected loudspeakers (bottom). SNR = signal-to-noise ratio; yrs = years.

Recall that fixed-TMR data were collected at each listener's SRT, so performance in the fixed-TMR conditions was expected to be approximately 50% correct. For children, mean scores were 53% correct (35%–73%) for the location-known condition and 52% correct (37%–75%) for the location-unknown condition. A trend for higher scores was observed for adults, with values of 60% (23%–80%) and 63% (32%–95%), respectively. Two-sided t tests of logit-transformed data indicate a nonsignificant trend for deviation from the expected value of 50% in adult data, t(11) = 1.98, p = .073, and t(11) = 2.11, p = .059, which could reflect effects of practice within a test session for adults.

To assess effects of child age and location cue on SRT estimates in the three-voice masker, a linear mixed model with random intercepts for each participant was applied to this dataset. Age was mean-centered on 8.5 years. Location was coded to reference the location-unknown condition. Table 3 displays coefficient and significance values for this model. As observed in Experiment 1, there was a significant effect of location cue (p < .001) indicating better performance with than without the location cue. Moreover, a significant interaction between age and location cue (p < .001) suggested greater benefit from the location cue with increasing child age.

Table 3.

Model estimates for the effects of location cue and child age on speech reception threshold in Experiment 2.

Variable             Value   SE    df  t value  p value
(Intercept)          4.76    0.49  13  9.72     < .001
Age                  −0.30   0.24  13  −1.24    .237
Location cue         −6.16   0.53  13  −11.57   < .001
Age × Location Cue   −2.05   0.26  13  −7.86    < .001

Note. Age was mean centered (8.5 years). Values in bold font are statistically significant. Location cue condition was reference-coded (0 = location-unknown, 1 = location-known).

Error Analysis

Errors made in the fixed-TMR conditions were evaluated to better understand the strategies that listeners relied on to recognize target words presented in the three-voice masker. For example, if listeners are selectively attending to the cued location, then intrusions from the masker should be more likely for masker words presented at or near the cued location than words presented far from the cued location. Similarly, if listeners are selectively attending to qualitative features of the target voice, then intrusions from the female masker talker should be more likely than those from the male masker talkers, due to greater perceptual similarity of the same-sex voices. The trial-by-trial data for each listener were evaluated, and each incorrect response was categorized as an intrusion or random error. An intrusion was defined as any response that exactly matched one of the 12–16 words in the masker for that trial. Partial matches (e.g., bluebird instead of blue-jay) were not categorized as intrusions in this analysis; the rationale for requiring a perfect match was to avoid ambiguity among partial matches (e.g., the response bluebird when the masker includes blue-jay and blackbird). When an intrusion was identified, the talker sex and loudspeaker location were noted.
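The exact-match rule for scoring intrusions could be expressed as follows. This is a sketch of the stated rule rather than the authors' scoring procedure: the function name, the tuple layout for masker items, and the normalization of case and hyphens are assumptions introduced here.

```python
def classify_response(response, target, masker_items):
    """Classify one spoken response.

    masker_items: list of (word, talker_sex, loudspeaker_deg) tuples for the
    masker words presented on that trial (hypothetical data layout).
    Returns (category, details), where details holds the talker sex and
    loudspeaker location for intrusions and is None otherwise.
    """
    def norm(word):
        # Assumption: case and hyphenation differences are not meaningful.
        return word.lower().replace("-", " ").strip()

    if norm(response) == norm(target):
        return ("correct", None)
    for word, sex, azimuth in masker_items:
        if norm(word) == norm(response):      # exact match only
            return ("intrusion", (sex, azimuth))
    return ("random error", None)             # includes partial matches

# Hypothetical trial: the response "bluebird" matches neither masker word.
masker = [("blue-jay", "female", -18), ("blackbird", "male", 36)]
print(classify_response("bluebird", "cupcake", masker))
print(classify_response("blackbird", "cupcake", masker))
```

Requiring an exact match means the partial-match response in the example is counted as a random error, which sidesteps the question of which masker word it should be attributed to.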

There were substantial individual differences in the proportion of errors classified as intrusions across both age groups and conditions, with 13%–100% of errors corresponding to words in the masker. The mean proportion of errors categorized as intrusions for the location-known condition was 74% for children and 60% for adults. Values for the location-unknown condition were 83% and 73%, respectively. In neither case was the difference between age groups significant (p ≥ .367). While most listener responses were spondees from the stimulus set, there were a few common listener responses that were not among the response alternatives: stairwell (n = 15), doghouse (n = 6), and raindrop (n = 4).

For both children and adults, intrusions were more likely to come from the female masker talker than the male masker talkers. This is depicted in the upper right panel of Figure 2. As a proportion of intrusions in the location-known condition, on average, 79% and 80% were from the female talker for children and adults, respectively. Those values were 87% and 90%, respectively, for the location-unknown condition. Recall that there were two male masker talkers and only one female masker talker. If intrusions were equally likely from each masker talker, 33% would come from the female talker; proportions above 33% are therefore consistent with selective attention to female voices. A set of four Wilcoxon signed-ranks tests indicated that the proportional incidence of intrusions from the female masker voice was > 33% for both age groups and both location cue conditions (p ≤ .003).

The final analysis evaluated the RMS error in degrees for each intrusion relative to the associated pretrial cue. Recall that the pretrial cue indicated the target location in the location-known condition but not in the location-unknown condition. If listeners were using the information contained in these cues optimally, that might be reflected in a greater probability of intrusions from the cued location for the location-known condition, but no association between the cue and response locations for the location-unknown condition. This expectation was met for both children and adults, as illustrated in the lower right panel of Figure 2. If intrusions were randomly distributed across loudspeakers, the mean RMS error would be approximately 80° (see Footnote 1). For both children and adults, RMS error was lower than chance in the location-known condition (p ≤ .001) but not for the location-unknown condition (p ≥ .160).
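The chance value of approximately 80° can be checked with a short calculation. The derivation below is a sketch under two assumptions that are not stated explicitly in the text: the 11 loudspeakers are equally spaced across the 180° arc (18° apart), and the cued location θ_C and the intrusion location θ_I are independent and uniformly distributed over the loudspeakers.

```latex
\begin{align*}
\theta_k &= -90^\circ + 18^\circ k, \qquad k = 0, 1, \dots, 10 \\
\operatorname{Var}(\theta) &= \frac{\Delta^2 (N^2 - 1)}{12}
  = \frac{18^2 \,(11^2 - 1)}{12} = 3240~\text{deg}^2 \\
\mathbb{E}\big[(\theta_I - \theta_C)^2\big] &= 2\operatorname{Var}(\theta) = 6480~\text{deg}^2 \\
\text{RMS}_{\text{chance}} &= \sqrt{6480} \approx 80.5^\circ
\end{align*}
```

Here Δ is the loudspeaker spacing and N is the number of loudspeakers; the second line uses the variance of a discrete uniform distribution on N equally spaced points.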

A linear mixed model with random intercepts for each participant was used to evaluate the effects of age and location cue on RMS error in the data of child listeners. Age was represented as a continuous variable, and the location-unknown condition served as the reference condition. Coefficients and significance values are reported in Table 4. This analysis revealed a significant effect of location cue (p < .001); neither age nor the Age × Location Cue interaction reached significance (p = .724 and p = .231, respectively). In other words, children's intrusions tended to cluster near the cued location, but this effect was not larger in older children than in younger children. This pattern of results fails to support the idea that the ability to listen in a spatially selective manner develops over the age range tested here.

Table 4.

Model estimates for the effect of child age, mean centered on 8.5 years, on root-mean-square error for intrusions in Experiment 2.

| Variable | Value | SE | df | t value | p value |
|---|---|---|---|---|---|
| (Intercept) | 64.741 | 3.966 | 647 | 16.325 | < .001 |
| Age | −0.704 | 1.949 | 13 | −0.361 | .724 |
| Location cue | −27.057 | 3.543 | 647 | −7.636 | < .001 |
| Age × Location Cue | −2.114 | 1.762 | 647 | −1.200 | .231 |

Note. Values in bold font are statistically significant. Location cue condition was reference-coded (0 = location-unknown, 1 = location-known).

General Discussion

The main objective of these studies was to better understand how school-age children use information about the location of a target when tasked with recognizing speech in a spatially complex multitalker environment. To do this, we assessed the effects of child age, masker type (noise-burst, one-voice, three-voice), and provision of a pretrial location cue on masked word recognition for a target word whose trial-to-trial position varied randomly between −90° and +90° on the horizontal plane. Spatial uncertainty was expected to be particularly challenging for targets presented in the three-voice masker, due to the compounding effects of stimulus uncertainty and perceptual similarity. Additionally, we expected cues to location would be more beneficial for adults and older children than for younger children, due to maturation in the ability to listen selectively in space. The location cue had little or no effect on spondee recognition in the noise-burst or one-voice masker; for the three-voice masker, the pretrial cue to location improved performance, with larger effects with increasing child age. These results are broadly consistent with the prior prediction, with the caveat that some effect of spatial uncertainty was anticipated for the one-voice masker.

Results of the present study are consistent with previous data showing that young children as a group appear more vulnerable to the effects of informational masking than adults. For instance, Corbin et al. (2016) reported that open-set word recognition became adultlike earlier in development when the masker was SSN (10 years) than when it was two-talker speech (13 years). For the noise-burst masker in the present study, children had higher SRTs compared to adults; however, younger children and older children produced similar SRTs regardless of location cue. In contrast, performance improved significantly with child age in the location-unknown condition with the one-voice masker and in both the location-known and location-unknown conditions with the three-voice masker. These results are consistent with existing literature suggesting that the developmental trajectory for speech-in-speech recognition is more protracted compared to the trajectory for speech-in-noise recognition (e.g., Corbin et al., 2016).

Error analysis of data with the three-voice masker in Experiment 2 showed that both adults and children had substantially more intrusions from the female masker talker than the male masker talkers, suggesting they were preferentially listening for a female target. This supports the idea that talker sex differences are salient cues that reduce perceptual similarity among voices for both adult and child listeners and agrees with findings from Leibold et al. (2018), who showed that a difference in target and masker talker sex provides similar release from masking for adults and school-age children with normal hearing.

Recall that error patterns with respect to intrusion location were predicted to indicate greater evidence of spatially selective listening in older children and adults compared to younger children. Contrary to this prediction, intrusions in the three-voice masker for the location-known condition tended to cluster near the location of the pretrial location cue for both children and adults, with no evidence of reduced benefit in the younger children. This suggests that listeners of all ages were preferentially listening at the cued location in space, as shown in previous studies with adults (Brungart & Simpson, 2007; Kidd et al., 2005). Initially, this may appear inconsistent with the finding that location cues in the three-voice masker condition reduced SRTs more for older children and adults compared to younger children. This combination of results—similar error patterns but differences in the pattern of SRTs—suggests that spatial selectivity may not be responsible for the developmental effects observed for SRTs in the three-voice masker. That is, greater benefit of the location cue in older children may not be related to maturation in the ability to listen selectively in space.

One possible explanation for reduced benefit of the location cue in the three-voice data of young children is related to differences in performance in the baseline (location-unknown) condition. Speech-in-speech recognition in young children is likely affected by a limited ability to recognize target speech based on cues coinciding with temporal and spectral gaps in a speech masker (Sobon et al., 2019), which increases the TMR at threshold. As TMR approaches 0 dB, recognition relies less on glimpsing of information in temporal and spectral gaps in the masker and more on the loudest cues in the summed target-plus-masker stimulus. Consequently, SRTs at or near 0 dB TMR could reflect failure to segregate the target and masker, rather than reflecting the listener's ability to recognize speech based on spectro-temporally sparse cues. The availability of target cues in the absence of segregation may therefore confound our ability to observe trends associated with use of spatial attention and listener age.

One limitation of this study is that randomly selecting target location for each trial may tend to overestimate the challenges faced in typical listening environments. In adult listeners, Brungart and Simpson (2007) measured sentence recognition in a multitalker masker and showed that performance tended to improve when target location was stable over three to four sequential trials. While everyday listening situations require listeners to switch attention to different spatial positions throughout the course of a conversation, these shifts are not random, sources are often visible to the listener, and targets typically consist of speech samples longer than a single word. Context cues present in natural environments would be expected to facilitate spatially selective listening.

Next steps in this line of research include further investigation into spatially selective listening in young children and evaluation of spatially selective attention in children with hearing loss. One way to evaluate hypothesized effects related to loudness cues near 0 dB TMR will be to identify stimulus conditions for which performance in the location-unknown condition falls well below 0 dB TMR. For example, task difficulty could be manipulated by introducing a closed set of response alternatives, introducing a temporal gap between masker words, or reducing similarity of target and masker talkers (e.g., accented speech). Additionally, children with hearing loss often perform more poorly than their peers with normal hearing on SRM tasks (Ching et al., 2011). Future research could evaluate whether the ability to listen selectively in space plays a role in this result.

In summary, data from this study showed that the benefit of a pretrial cue to location increased with increasing child age when the masker was two sequences of spondees produced by male and female talkers and presented from random locations. While this is consistent with maturation in spatially selective attention, error patterns do not support this conclusion. One reason for this discrepancy may be decreased reliance on cues supporting auditory stream segregation at or near 0 dB TMR. More research is needed to understand how performance in the baseline condition, when target location is uncertain, affects the ability to use cues to target location and how maturation of spatial selectivity affects SRM under natural listening conditions.

Acknowledgments

This study was funded by National Institute on Deafness and Other Communication Disorders Grant R01 DC000397 (awarded to E. B.). Special thanks to Kathryn Sobon and Meredith Braza for their help with recruitment and data collection. We also thank Chris Wiessen from the University of North Carolina at Chapel Hill Odum Institute for assistance with statistical analyses.

Footnote

1. Chance performance with respect to the spatial position of the intrusion was determined with a Monte Carlo simulation (n = 100,000 trials). In the absence of prior information about the spatial position, listeners could pursue a range of strategies, with different consequences for chance performance. For example, a participant might preferentially attend to the loudspeaker in the middle of the array or randomly select a loudspeaker on a trial-by-trial basis. Examination of data in the location-unknown condition indicates approximately uniform distribution of intrusion locations, consistent with uniform distribution of attention across loudspeakers or selective attention to a randomly selected location on each trial. This was modeled by randomly selecting one of the 11 loudspeakers on each trial of the simulation.
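A simulation of this kind might look like the sketch below. It is not the authors' code. It assumes that the 11 loudspeakers are equally spaced from −90° to +90°, that the cued and intrusion locations are drawn independently and uniformly on each simulated trial, and that RMS error is pooled across all simulated trials; the function name and seed are hypothetical.

```python
import math
import random

# Assumed layout: 11 loudspeakers, 18 degrees apart, spanning -90 to +90 degrees.
SPEAKER_AZIMUTHS = [-90 + 18 * k for k in range(11)]


def chance_rms_error(n_trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the RMS angular error expected if intrusions ignore the cue."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    sum_sq = 0.0
    for _ in range(n_trials):
        cued = rng.choice(SPEAKER_AZIMUTHS)       # location of the pretrial cue
        intrusion = rng.choice(SPEAKER_AZIMUTHS)  # location of the intruding word
        sum_sq += (intrusion - cued) ** 2
    return math.sqrt(sum_sq / n_trials)


if __name__ == "__main__":
    print(f"chance RMS error: {chance_rms_error():.1f} degrees")
```

Under these assumptions the estimate lands close to the chance value of approximately 80° cited in the error analysis. A strategy such as always attending to the center loudspeaker would give a different chance value, which is why the choice of strategy matters for the comparison.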

References

  1. American National Standards Institute. (2018). ANSI S3.6-2018, American national standard specification for audiometers.
  2. Arbogast, T. L., Mason, C. R., & Kidd, G. (2002). The effect of spatial separation on informational and energetic masking of speech. The Journal of the Acoustical Society of America, 112(5), 2086–2098. https://doi.org/10.1121/1.1510141
  3. Atagi, E., & Bent, T. (2013). Auditory free classification of nonnative speech. Journal of Phonetics, 41(6), 509–519. https://doi.org/10.1016/j.wocn.2013.09.003
  4. Bent, T. (2014). Children's perception of foreign-accented words. Journal of Child Language, 41(6), 1334–1355. https://doi.org/10.1017/S0305000913000457
  5. Bolia, R. S., Nelson, W. T., Ericson, M. A., & Simpson, B. D. (2000). A speech corpus for multitalker communications research. The Journal of the Acoustical Society of America, 107(2), 1065–1066. https://doi.org/10.1121/1.428288
  6. Brainard, D. H. (1997). The psychophysics toolbox. Spatial Vision, 10(4), 433–436. https://doi.org/10.1163/156856897X00357
  7. Brungart, D. S. (2001). Informational and energetic masking effects in the perception of two simultaneous talkers. The Journal of the Acoustical Society of America, 109(3), 1101–1109. https://doi.org/10.1121/1.1345696
  8. Brungart, D. S., Chang, P. S., Simpson, B. D., & Wang, D. (2006). Isolating the energetic component of speech-on-speech masking with ideal time–frequency segregation. The Journal of the Acoustical Society of America, 120(6), 4007–4018. https://doi.org/10.1121/1.2363929
  9. Brungart, D. S., & Simpson, B. D. (2007). Cocktail party listening in a dynamic multitalker environment. Perception & Psychophysics, 69(1), 79–91. https://doi.org/10.3758/bf03194455
  10. Calandruccio, L., Brouwer, S., Van Engen, K. J., Dhar, S., & Bradlow, A. R. (2013). Masking release due to linguistic and phonetic dissimilarity between the target and masker speech. American Journal of Audiology, 22(1), 157–164. https://doi.org/10.1044/1059-0889(2013/12-0072)
  11. Calandruccio, L., Leibold, L. J., & Buss, E. (2016). Linguistic masking release in school-age children and adults. American Journal of Audiology, 25(1), 34–40. https://doi.org/10.1044/2015_AJA-15-0053
  12. Cameron, S., Dillon, H., & Newall, P. (2006). The listening in spatialized noise test: Normative data for children. International Journal of Audiology, 45(2), 99–108. https://doi.org/10.1080/14992020500377931
  13. Ching, T. Y. C., van Wanrooy, E., Dillon, H., & Carter, L. (2011). Spatial release from masking in normal-hearing children and children who use hearing aids. The Journal of the Acoustical Society of America, 129(1), 368–375. https://doi.org/10.1121/1.3523295
  14. Corbin, N. E., Bonino, A. Y., Buss, E., & Leibold, L. J. (2016). Development of open-set word recognition in children: Speech-shaped noise and two-talker speech maskers. Ear and Hearing, 37(1), 55–63. https://doi.org/10.1097/AUD.0000000000000201
  15. Corbin, N. E., Buss, E., & Leibold, L. J. (2017). Spatial release from masking in children: Effects of simulated unilateral hearing loss. Ear and Hearing, 38(2), 223–235. https://doi.org/10.1097/AUD.0000000000000376
  16. Eipert, L., Selle, A., & Klump, G. M. (2019). Uncertainty in location, level and fundamental frequency results in informational masking in a vowel discrimination task for young and elderly subjects. Hearing Research, 377, 142–152. https://doi.org/10.1016/j.heares.2019.03.015
  17. Freyman, R. L., Balakrishnan, U., & Helfer, K. S. (2001). Spatial release from informational masking in speech recognition. The Journal of the Acoustical Society of America, 109(5), 2112–2122. https://doi.org/10.1121/1.1354984
  18. Freyman, R. L., Balakrishnan, U., & Helfer, K. S. (2004). Effect of number of masking talkers and auditory priming on informational masking in speech recognition. The Journal of the Acoustical Society of America, 115(5), 2246–2256. https://doi.org/10.1121/1.1689343
  19. Garadat, S. N., & Litovsky, R. Y. (2007). Speech intelligibility in free field: Spatial unmasking in preschool children. The Journal of the Acoustical Society of America, 121(2), 1047–1055. https://doi.org/10.1121/1.2409863
  20. Hillock-Dunn, A., Taylor, C., Buss, E., & Leibold, L. J. (2015). Assessing speech perception in children with hearing loss: What conventional clinical tools may miss. Ear and Hearing, 36(2), Article e57. https://doi.org/10.1097/AUD.0000000000000110
  21. Kidd, G., Arbogast, T. L., Mason, C. R., & Gallun, F. J. (2005). The advantage of knowing where to listen. The Journal of the Acoustical Society of America, 118(6), 3804–3815. https://doi.org/10.1121/1.2109187
  22. Kidd, G., Mason, C. R., & Arbogast, T. L. (2002). Similarity, uncertainty, and masking in the identification of nonspeech auditory patterns. The Journal of the Acoustical Society of America, 111(3), 1367–1376. https://doi.org/10.1121/1.1448342
  23. Leek, M. R., Brown, M. E., & Dorman, M. F. (1991). Informational masking and auditory attention. Perception & Psychophysics, 50(3), 205–214. https://doi.org/10.3758/BF03206743
  24. Leibold, L. J. (2012). Development of auditory scene analysis and auditory attention. In L. Werner, R. R. Fay, & A. N. Popper (Eds.), Human auditory development (Vol. 42, pp. 137–161). Springer. https://doi.org/10.1007/978-1-4614-1421-6_5
  25. Leibold, L. J., & Buss, E. (2016). Factors responsible for remote-frequency masking in children and adults. The Journal of the Acoustical Society of America, 140(6), 4367–4377. https://doi.org/10.1121/1.4971780
  26. Leibold, L. J., Buss, E., & Calandruccio, L. (2018). Developmental effects in masking release for speech-in-speech perception due to a target/masker sex mismatch. Ear and Hearing, 39(5), 935–945. https://doi.org/10.1097/AUD.0000000000000554
  27. Leibold, L. J., Yarnell Bonino, A., & Buss, E. (2016). Masked speech perception thresholds in infants, children, and adults. Ear and Hearing, 37(3), 345–353. https://doi.org/10.1097/AUD.0000000000000270
  28. Litovsky, R. Y. (2005). Speech intelligibility and spatial release from masking in young children. The Journal of the Acoustical Society of America, 117(5), 3091–3099. https://doi.org/10.1121/1.1873913
  29. Neff, D. L., & Green, D. M. (1987). Masking produced by spectral uncertainty with multicomponent maskers. Perception & Psychophysics, 41(5), 409–415. https://doi.org/10.3758/BF03203033
  30. Phatak, S. A., Brungart, D. S., Zion, D. J., & Grant, K. W. (2019). Clinical assessment of functional hearing deficits: Speech-in-noise performance. Ear and Hearing, 40(2), 426–436. https://doi.org/10.1097/AUD.0000000000000635
  31. Porter, H. L., Leibold, L. J., & Buss, E. (2020). Effects of self-generated noise on quiet threshold by transducer type in school-age children and adults. Journal of Speech, Language, and Hearing Research, 63(6), 2027–2033. https://doi.org/10.1044/2020_JSLHR-19-00302
  32. Sobon, K. A., Taleb, N. M., Buss, E., Grose, J. H., & Calandruccio, L. (2019). Psychometric function slope for speech-in-noise and speech-in-speech: Effects of development and aging. The Journal of the Acoustical Society of America, 145(4), EL284–EL290. https://doi.org/10.1121/1.5097377
  33. Watson, C. S., Kelly, W. J., & Wroton, H. W. (1976). Factors in the discrimination of tonal patterns: II. Selective attention and learning under various levels of stimulus uncertainty. The Journal of the Acoustical Society of America, 60(5), 1176–1186. https://doi.org/10.1121/1.381220
  34. Yuen, K. C. P., & Yuan, M. (2014). Development of spatial release from masking in Mandarin-speaking children with normal hearing. Journal of Speech, Language, and Hearing Research, 57(5), 2005–2023. https://doi.org/10.1044/2014_JSLHR-H-13-0060

Articles from Journal of Speech, Language, and Hearing Research : JSLHR are provided here courtesy of American Speech-Language-Hearing Association