The Journal of the Acoustical Society of America
2020 Dec 9;148(6):3527–3543. doi: 10.1121/10.0002661

The effect of fundamental frequency contour similarity on multi-talker listening in older and younger adults

Peter A Wasiuk 1, Mathieu Lavandier 2, Emily Buss 3, Jacob Oleson 4, Lauren Calandruccio 1,a)
PMCID: PMC7863686  PMID: 33379934

Abstract

Older adults with hearing loss have greater difficulty recognizing target speech in multi-talker environments than young adults with normal hearing, especially when target and masker speech streams are perceptually similar. A difference in fundamental frequency (f0) contour depth is an effective stream segregation cue for young adults with normal hearing. This study examined whether older adults with varying degrees of sensorineural hearing loss are able to utilize differences in target/masker f0 contour depth to improve speech recognition in multi-talker listening. Speech recognition thresholds (SRTs) were measured for speech mixtures composed of target/masker streams with flat, normal, and exaggerated speaking styles, in which f0 contour depth systematically varied. Computational modeling estimated differences in energetic masking across listening conditions. Young adults had lower SRTs than older adults, a result that was partially explained by differences in audibility predicted by the model. However, audibility differences did not explain why young adults experienced a benefit from mismatched target/masker f0 contour depth, while in most conditions, older adults did not. A reduced ability to use segregation cues (differences in target/masker f0 contour depth) and deficits in grouping speech with variable f0 contours likely contribute to the difficulties experienced by older adults in challenging acoustic environments.

I. INTRODUCTION

From crowded coffee shops to busy open-plan offices, challenging listening environments are pervasive in daily life (Agus et al., 2009; Hodgson et al., 2007; Jahncke et al., 2016). Successful speech recognition in such adverse scenarios is contingent upon the complex interaction of the acoustic signal of interest (i.e., the target) and competing sounds within the environment (i.e., the masker or maskers), as well as a variety of both auditory and cognitive factors that are intrinsic to the listener (Mattys et al., 2012; Pichora-Fuller et al., 2016; Rönnberg et al., 2013). Speech recognition in a multi-talker environment requires the listener to perceptually isolate target speech from masker speech, and selectively attend to the target. This process is commonly referred to as auditory stream segregation (Bregman, 1990), and is often exemplified in “the cocktail party problem” (Cherry, 1953). Successful streaming of target and masker speech can break down as a result of energetic masking (EM), informational masking (IM), or, more often, a combination of the two. EM occurs as a result of spectro-temporal overlap of excitation patterns in the peripheral auditory system, causing the neural response to the target speech to be “swamped” by response to the masker speech (Yost, 2013). In contrast, IM is observed when there is stimulus uncertainty (as to what, where, or who to listen for) and/or high degrees of perceptual similarity between the target and masker speech (Durlach et al., 2003; Kidd et al., 2007; Kidd and Colburn, 2017). It is common to explain IM as any masking that is “nonenergetic” in nature (Durlach et al., 2003; Watson, 2005).

The magnitude of IM for speech-in-speech recognition varies based on acoustic features of the target and masker. For example, young adults with normal hearing perform better when the target and masker speech are spatially separated compared to when they are co-located, a benefit based in part on the introduction of differences in perceived position associated with binaural cues (e.g., Freyman et al., 1999). There are also many monaural cues that young adults utilize to help segregate target from masker speech, including differences in mean fundamental frequency (f0; e.g., Assmann, 1999; Bird and Darwin, 1998), vocal tract length (e.g., Darwin et al., 2003), and linguistic features (such as language and accent; e.g., Rhebergen et al., 2005; Calandruccio and Zhou, 2014). The resulting improvements in target intelligibility are often described as a release from (informational) masking.

Some stimulus manipulations also cause a release from EM. For example, it has been argued that binaural cues can cause a release from EM (e.g., Bronkhorst and Plomp, 1988; Lavandier and Culling, 2008). Differences in f0 between a speech target and a harmonic complex have also been shown to produce a release from EM (Deroche and Culling, 2011; Deroche et al., 2014). This release could be due to spectral glimpsing of the target in between the resolved partials of the harmonic masker and/or harmonic cancellation of the masker (de Cheveigné et al., 1995), in which the auditory system cancels the harmonic structure of the masker to reduce EM. Leclère et al. (2017) showed that masking release associated with differences in f0 with a harmonic masker is greatly reduced when the masker f0 fluctuates over time. With speech maskers that vary in f0 over time and contain unvoiced segments (with no f0), it is unclear whether f0-based release from EM affects speech-in-speech recognition.

Unfortunately, older adults with sensorineural hearing loss (SNHL) typically see a reduced benefit from segregation cues associated with masking release (Best et al., 2017; Humes and Coughlin, 2009; Lee and Humes, 2012), and they tend to perform more poorly than adults with normal hearing under a wide range of conditions (Arehart et al., 1997; Kidd et al., 2019; Mackersie et al., 2011). Age, working memory capacity, and degree of hearing loss are often identified as intrinsic predictors of speech recognition in multi-talker listening environments, where lower age, higher working memory capacity, and lower pure tone average (PTA; the average of three to four mid-frequency audiometric thresholds) support better target speech recognition (Edwards, 2016; Gordon-Salant and Cole, 2016; Helfer et al., 2020; Helfer and Freyman, 2008; Kidd et al., 2019; Rönnberg et al., 2013). One acoustic-perceptual cue that has recently been reported to benefit young adults with normal hearing in multi-talker environments is differences in f0 contour depth between the target and masker speech (Calandruccio et al., 2019). The purpose of the present study is to determine whether older adults with SNHL can utilize f0 contour depth differences to improve target speech recognition in a similar manner to young adults with normal hearing. An inability to utilize this cue could contribute to poorer speech recognition for older adults in the challenging environments of everyday life.

A. Utilizing target/masker f0 differences in multi-talker environments: Effects of SNHL and age

A mean difference in f0 between target and masker speech can act as a powerful segregation cue (Assmann, 1999; Bird and Darwin, 1998; Brokx and Nooteboom, 1982; Darwin et al., 2003), improving speech recognition performance by as much as 20 dB in young adults with normal hearing (Kidd et al., 2019). Mean differences in f0 reduce the similarity of the target and masker speech streams, making them easier to segregate (David et al., 2017). This is often explained in terms of reductions in perceptual similarity (a release from IM; Brungart, 2001; Brungart et al., 2001). However, the benefit of differences in mean f0 is often reduced for people with SNHL (Arehart et al., 1997; Mackersie et al., 2011; Misurelli and Litovsky, 2015) and older adults (Helfer and Freyman, 2008; Lee and Humes, 2012).

In one demonstration of the effect of hearing loss on the ability to benefit from target/masker f0 differences, Kidd et al. (2019) reported data for 13 adults with SNHL and an age-matched control group with normal hearing. All of the adults were 40 years old or younger. Participants were tasked with recognizing closed-set sentences spoken by a female talker, presented with a two-talker masker composed of either male or female talkers. The mean f0 for the male and female voices differed by about 11 semitones. Target speech recognition performance was poorer in the female two-talker masker than in the male two-talker masker for both listener groups, but this effect was larger for the group with normal hearing (∼21 dB) than the group with SNHL (∼11 dB). The reduced benefit for listeners with SNHL was argued to be mainly due to a decreased ability to use the mean f0 difference between target and masker speech as a segregation cue. A similar result was obtained by Mackersie et al. (2011), who evaluated sentence recognition in a two-talker masker for targets with a mean f0 that either matched the mean f0 of the masker or differed by up to nine semitones. In that study, adults with SNHL benefited less from a nine-semitone difference in f0 compared to a control group with normal hearing, with mean group differences of about 15 percentage points. The participants in the two groups were not individually matched in age, but the median age was similar across groups.

B. Target/masker differences in f0 contour depth in multi-talker listening environments

In addition to differences in mean f0 across talkers, natural speech also contains dynamic variation in f0 over time (i.e., intonation). The f0 contour provides important prosodic information that can enhance speech recognition in quiet environments, help distinguish word boundaries and word meaning, and communicate information about a talker's emotion and intent to the listener (for a review see Cutler et al., 1997). While a number of investigations have shown that f0 contour is an important cue for speech-in-noise recognition for people with normal hearing (Binns and Culling, 2007; Hillenbrand, 2003; Laures and Weismer, 1999; Miller et al., 2010) and people with SNHL (Grant, 1987; Shen and Souza, 2017), these observations have primarily been made by manipulating the f0 of the speech via digital signal processing. That is, investigators have taken naturally produced speech recordings and processed them using software that alters the f0 contour in a systematic way, including exaggerating (Miller et al., 2010), reducing (e.g., Binns and Culling, 2007), or flattening (e.g., Laures and Weismer, 1999) the natural contour. Such a methodology is useful for examining the importance of the f0 contour as a segregation cue because the contour can be manipulated independently of other stimulus parameters with high degrees of precision. However, this type of signal processing can introduce unwanted distortions and/or artifacts to the speech signal and results in somewhat unnatural speech. For example, a completely flattened, or “monotone,” f0 contour is unlikely to be experienced in everyday life. Based on current speech processing models (e.g., Rönnberg et al., 2013), artificial alterations of auditory cues that are used in stream segregation are hypothesized to impose cognitive processing demands on the listener. 
Furthermore, such signal alterations may increase the degree of IM by reducing the predictability of prosodic cues, therefore introducing greater listener uncertainty (as suggested by Calandruccio et al., 2019). Conversely, it is possible that digital f0 manipulations applied to only one speech stream (target or masker speech) could reduce perceptual similarity across voices and thereby decrease IM.

An alternative to digital manipulation of f0 contour depth is to record speech stimuli spoken using different speaking styles. A recent study by Calandruccio et al. (2019) used this approach to examine the influence of f0 contour depth while avoiding the use of digital signal processing to manipulate the contours. Sentence recordings were produced by trained actors in a flat, normal, or exaggerated speaking style, which systematically varied f0 contour depth. The authors then investigated whether differences in target and masker f0 contour depth influenced target speech recognition. In conditions in which the target and masker speech were most similar in f0 contour (e.g., flat target and flat masker), listeners were expected to perform worse than in conditions in which the f0 contours were different (e.g., exaggerated target and flat masker). In that study, young adults with normal hearing benefitted from a mismatch in speaking style between the target and masker speech, particularly for the most dissimilar target/masker pairs (i.e., flat/flat vs exaggerated/flat and exaggerated/exaggerated vs flat/exaggerated). These results were interpreted in terms of IM, but additional effects related to EM cannot be ruled out.

C. Current experiments

The experiments of the present study were designed to examine the influence of f0 contour depth similarity on target speech recognition for older adults with varying degrees of SNHL. In experiment I, speech-in-speech recognition with the nine possible target/masker combinations of flat, normal, and exaggerated speaking styles was tested in older adults who had a range of hearing thresholds, from mild to moderate SNHL. In contrast to the methods of Calandruccio et al. (2019), stimuli were spectrally shaped using prescriptive gains for individual participants to increase audibility. The goal of the first experiment was to examine whether older adults with age-related changes in hearing and auditory-cognitive processing would be able to take advantage of differences in f0 contour depth between target and masker speech to the same degree as young adults with normal hearing. Experiment II tested young adults with normal hearing with and without spectral shaping to ensure that this stimulus manipulation did not degrade their ability to utilize differences in f0 contour depth to segregate target and masker speech.

II. METHODS

A. Participants

Twenty-two older adults (12 female; 61–75 years old, M age = 68 years, SD = 4 years) participated in experiment I, and 44 young adults (33 female; 18–31 years old, M age = 20 years, SD = 3 years) participated in experiment II. Older adults were recruited from the Case Western Reserve University (CWRU) community and the local Cleveland, OH area. Recruitment was accomplished primarily through word-of-mouth, as well as through postings in the CWRU Daily newsletter, a daily email sent to the campus community throughout the workweek. Young adults were recruited through word-of-mouth and the Department of Psychological Sciences research subject pool. Participants recruited through word-of-mouth were compensated $15/h, while those recruited through the research subject pool were given course credit for participation. All recruitment and testing methods were approved by the CWRU Institutional Review Board (IRB). Participants first signed an informed consent document approved by the CWRU IRB, and then completed a demographic questionnaire to confirm that they were native speakers of English (American dialect).

Prior to experimental testing, older adults had their hearing thresholds measured bilaterally at all octave frequencies from 250 to 8000 Hz, as well as at 3000 and 6000 Hz, using standard clinical audiometric diagnostic procedures (ANSI, 2009). To qualify for the experiment, older adults were required to have symmetric, age-appropriate audiometric thresholds. Older adults with severe or profound hearing loss were not recruited. This meant that the older adults tested here had mild-to-moderate SNHL (confirmed using standard clinical procedures for bone conduction). Based on a four-frequency PTA (mean of 500, 1000, 2000, and 4000 Hz thresholds), all older adults had a PTA ≤ 55 dB hearing level (HL) for both ears. Older adults with SNHL had symmetric hearing, defined as thresholds within 10 dB between ears for a minimum of six of the eight test frequencies (Corso, 1963; ISO, 2000). This group of participants was presented spectrally shaped stimuli (described below) and is therefore referred to as the "o-hi-sh" group, in which "o" stands for older adult, "hi" stands for hearing impairment, and "sh" stands for spectral shaping. All older adult participants also passed a cognitive health screener (>26 out of 30 points on the Mini-Mental State Exam; Folstein et al., 1975). Age, audiometric thresholds, and four-frequency PTAs for the older adults are provided as supplementary materials.1
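The inclusion criteria above can be sketched as follows (a minimal illustration; the function names and the example audiogram are hypothetical, not from the study):

```python
# Sketch of the audiometric inclusion criteria described above.
PTA_FREQS = (500, 1000, 2000, 4000)

def four_freq_pta(thresholds_db_hl):
    """Four-frequency pure tone average.
    thresholds_db_hl: dict mapping frequency (Hz) -> threshold (dB HL)."""
    return sum(thresholds_db_hl[f] for f in PTA_FREQS) / len(PTA_FREQS)

def is_symmetric(left, right, tol_db=10, min_matching=6):
    """Symmetry criterion: interaural threshold difference <= 10 dB
    at a minimum of six of the eight test frequencies."""
    matches = sum(1 for f in left if abs(left[f] - right[f]) <= tol_db)
    return matches >= min_matching

# Hypothetical mild sloping loss that would qualify (PTA <= 55 dB HL, symmetric).
left = {250: 15, 500: 20, 1000: 25, 2000: 35, 3000: 40, 4000: 45, 6000: 50, 8000: 55}
right = {250: 20, 500: 25, 1000: 30, 2000: 35, 3000: 45, 4000: 50, 6000: 60, 8000: 60}
```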

Younger adults participating in experiment II were required to pass a hearing screening at 20 dB HL bilaterally at all octave frequencies from 250 to 8000 Hz (indicating normal audiometric thresholds). The first 22 participants were assigned to the “y-nh” group (16 female; 18–31 years old, M age = 22 years, SD = 4 years). This group heard the original, unshaped speech stimuli that were used in Calandruccio et al. (2019). The second 22 participants were assigned to the “y-nh-sh” group (17 female; 18–20 years old, M age = 19 years, SD = 2 years). This group heard the same spectrally shaped stimuli used with the older adults in experiment I. Each young adult in the y-nh-sh group was paired with one of the older adults and heard stimuli with the associated spectral shaping, as described below. The y-nh-sh group allowed us to explore whether the spectral shaping of the stimuli in itself influenced speech-in-speech recognition.

B. Stimuli

The stimuli used in experiment I were developed by Calandruccio et al. (2019). The speech materials were recorded by three different 21-year-old native-English speaking female actors with no noticeable regional accent (referred to as talkers A, B, and C). Sentences used for the target and masker were from the Bamford-Kowal-Bench (BKB) speech corpus (Bench et al., 1979). Each of the 336 sentences within the corpus consists of three to seven words, with three or four keywords per sentence used for scoring. An example BKB sentence is: "The boy has black hair" (scoreable keywords were underlined in the original materials).

Sentences were recorded by each talker using three different speaking styles in order to systematically manipulate f0 contour depth. Specifically, each talker produced all BKB sentences using flat, normal, and exaggerated prosody. Calandruccio et al. (2019) report the average, standard deviation, maximum, and minimum f0 values for each talker, as well as the amount of f0 contour fluctuation over time for each contour type. All three talkers showed similar patterns of f0 contour fluctuation across the three production types. For example, for Talker A, the f0 was not entirely constant over time in the flat speaking style (SD ≈ 0.84 semitones, 12 Hz) due to the fact that these stimuli were produced naturally, but variation in f0 was modest compared to the normal (SD ≈ 2.90 semitones, 42 Hz) and exaggerated (SD ≈ 3.32 semitones, 56 Hz) speaking styles. As in Calandruccio et al. (2019), talker A was the target talker, while talkers B and C were masker talkers. Talker A had a mean f0 of 231 Hz (flat), 231 Hz (normal), and 264 Hz (exaggerated). Talker B had a mean f0 of 244 Hz (flat), 237 Hz (normal), and 260 Hz (exaggerated). Talker C had a mean f0 of 233 Hz (flat), 226 Hz (normal), and 276 Hz (exaggerated).
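The semitone SDs above express f0 fluctuation on a log-frequency scale. A minimal sketch of how an f0 track's contour depth can be quantified in semitones (the function name and the choice of the track's geometric mean as reference are assumptions; Calandruccio et al. (2019) may have computed this differently):

```python
import math

def f0_sd_semitones(f0_track_hz):
    """SD of an f0 track expressed in semitones, computed relative to the
    track's geometric mean (12 semitones = one octave = a doubling of f0)."""
    ref = math.exp(sum(math.log(f) for f in f0_track_hz) / len(f0_track_hz))
    st = [12 * math.log2(f / ref) for f in f0_track_hz]
    mean_st = sum(st) / len(st)
    return (sum((s - mean_st) ** 2 for s in st) / len(st)) ** 0.5

# A perfectly monotone track has an SD of 0 semitones; alternating between
# 200 and 252 Hz (about 4 semitones apart) gives an SD of about 2 semitones.
```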

A two-talker masker was used in this experiment because two competing talkers tend to cause higher levels of IM than a single competing voice (Freyman et al., 2004; Iyer et al., 2010; Rosen et al., 2013), and the manipulation of IM via target/masker f0 contour depth similarity was the primary effect of interest in this study. All three two-talker maskers were composed of talkers B and C, each reading the same set of 50 sentences in the same speaking style (e.g., both using the flat speaking style). The shortest string of concatenated sentences within a masker was 75 s. Sentence order for each talker differed to ensure that the two masker talkers were never speaking the same sentence at the same time. For each speaking style, recordings for each of the two masker talkers were root-mean-square (rms) equalized using Praat software (Boersma and Weenink, 2017) and combined. This resulted in three unique two-talker maskers (flat, normal, and exaggerated). Data tables and figures displaying acoustic analyses and visualizations of the f0 contours of these stimuli can be found in the aforementioned paper (Calandruccio et al., 2019) and in Fig. 1.
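The masker construction described above (rms equalization of the two talkers' concatenated streams, then summation) can be sketched as follows; the study used Praat, so this NumPy version is only an illustrative equivalent with assumed names:

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a waveform."""
    return np.sqrt(np.mean(x ** 2))

def make_two_talker_masker(talker_b, talker_c, target_rms=0.05):
    """RMS-equalize two concatenated sentence streams and sum them into a
    single two-talker masker (streams truncated to a common length)."""
    n = min(len(talker_b), len(talker_c))
    b = talker_b[:n] * (target_rms / rms(talker_b[:n]))
    c = talker_c[:n] * (target_rms / rms(talker_c[:n]))
    return b + c
```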

FIG. 1.

Histogram count (percent) of the f0 (averaged within 200-ms windows) for flat (fl.), normal (norm.), and exaggerated (ex.) target (filled bars) and masker (unfilled bars) speech productions. Masker histograms represent aggregated counts across the two masker talkers analyzed individually. Subplots represent the nine target/masker listening conditions. Subplots on the diagonal represent matched target/masker f0 contour depth conditions.

To improve the audibility of the stimuli during the speech recognition task for the older adults, all stimuli (both targets and maskers) were amplified and spectrally shaped to prescribed levels across frequency in accordance with the NAL-RP prescriptive algorithm (linear amplification; Byrne and Dillon, 1986; Dillon, 2012). This was done individually for each older adult participant. Stimuli were filtered into seven component waveforms using octave-wide, band-pass filters (second-order Butterworth) with center frequencies of 250, 500, 1000, 2000, 4000, 6000, and 8000 Hz. Each component waveform was then amplified so that its rms level met the frequency-specific NAL-RP prescriptive target, and the amplified components were summed to produce the target and masker speech signals for that participant.
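The filter-bank shaping described above can be sketched as follows. This is a simplified illustration: the sampling rate, the per-band dB gains, and the function name are assumptions (the study set each band's rms to the listener's NAL-RP prescriptive level rather than applying fixed gains):

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 32000  # assumed sampling rate; must exceed twice the highest band edge
CENTERS = [250, 500, 1000, 2000, 4000, 6000, 8000]

def shape_to_prescription(x, gains_db):
    """Filter x into octave-wide bands (second-order Butterworth), apply a
    per-band gain, and resum. gains_db maps center frequency -> gain (dB);
    in the study these would follow the listener's NAL-RP targets."""
    out = np.zeros_like(x)
    for fc in CENTERS:
        lo = fc / np.sqrt(2)
        hi = min(fc * np.sqrt(2), 0.99 * FS / 2)  # keep edge below Nyquist
        sos = butter(2, [lo, hi], btype="bandpass", fs=FS, output="sos")
        out += sosfilt(sos, x) * 10 ** (gains_db[fc] / 20)
    return out
```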

For experiment II, the stimuli presented differed between the two groups: the y-nh group heard the original, natural speech recordings used in Calandruccio et al. (2019), and the y-nh-sh group heard the spectrally shaped stimuli utilized for older adults in experiment I at a similar overall presentation level as the y-nh group.

C. Working memory capacity (WMC) task

Prior to speech recognition testing, verbal working memory capacity was measured for all participants using a reading span task (Daneman and Carpenter, 1980), following procedures described by Klaus and Schriefers (2016). The participant was asked to judge the semantic plausibility of a sentence, while also performing a simultaneous memory task. Testing was performed in blocks of two to six trials each. Each trial consisted of a single sentence judgment and noun presentation. For the sentence judgment task, a sentence (M length = 12 words, SD = 1.5 words) was presented at the center of a computer screen for a maximum of 10 s, or until the participant made the yes/no plausibility judgment by pressing the “Y” key or the “N” key. An example of an implausible sentence is, “Most people would agree that Monday is the worst stick of the week,” whereas an example of a plausible sentence is, “Most people would agree that Monday is the worst day of the week.” After each judgment, a blank screen was presented for 500 ms, followed by a to-be-remembered noun which was presented for 1200 ms. To-be-remembered nouns within a block of trials were not phonologically, semantically, or associatively related to the sentences. At the end of each block of trials, the participant was asked to repeat back as many of the to-be-remembered nouns as possible, and responses were scored without regard for order. Each block size was tested three times, resulting in 15 total blocks per participant (with 60 total trials). A participant's reading span score was the total proportion of to-be-remembered words recalled correctly; this score was interpreted as a measure of each participant's verbal working memory capacity (Daneman and Carpenter, 1980; Klaus and Schriefers, 2016).
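The scoring of the reading span task can be sketched as follows (a minimal sketch; the function and variable names are assumptions):

```python
def reading_span_score(blocks):
    """blocks: list of (nouns_presented, nouns_recalled) tuples, one per block.
    The score is the overall proportion of to-be-remembered nouns recalled,
    without regard for recall order."""
    presented = sum(p for p, _ in blocks)
    recalled = sum(r for _, r in blocks)
    return recalled / presented

# Block sizes 2-6, each tested three times: 15 blocks, 60 nouns in total.
sizes = [s for s in range(2, 7) for _ in range(3)]
```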

D. Speech recognition task

All participants listened to all three target speech types (flat, normal, and exaggerated) in the presence of each of the two-talker maskers (flat, normal, and exaggerated) for a total of nine listening conditions, which were presented in a random order for each participant. Participants heard 32 sentences per masker condition and did not hear any target sentence more than once. For each trial, a random sample of the masker file was chosen; that sample was 1 s longer than the target sentence. The target sentence was temporally centered in the masker sample. The masker was gated on and off using a 50-ms raised-cosine ramp.
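The per-trial stimulus construction described above (random masker excerpt 1 s longer than the target, target temporally centered, 50-ms raised-cosine gating of the masker) can be sketched as follows, with assumed function names:

```python
import numpy as np

def make_trial(target, masker, fs, rng, ramp_ms=50):
    """Draw a random masker excerpt 1 s longer than the target, gate it on
    and off with raised-cosine ramps, and temporally center the target
    within it. Returns the zero-padded target and the gated masker."""
    n_extra = fs  # one extra second of masker
    n = len(target) + n_extra
    start = rng.integers(0, len(masker) - n + 1)
    m = masker[start:start + n].astype(float).copy()
    nr = int(ramp_ms / 1000 * fs)
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(nr) / nr))  # rises from 0
    m[:nr] *= ramp
    m[-nr:] *= ramp[::-1]
    t = np.zeros(n)
    t[n_extra // 2 : n_extra // 2 + len(target)] = target
    return t, m
```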

Testing was carried out using loudspeakers in a double-walled, sound-isolated suite (Acoustic Systems, ETS-Lindgren). Participants were seated at the appropriate calibration point in the sound field, while the examiner sat in the single-walled control room. Participants were instructed that they would be listening to a female target talker and that the target talker would not change throughout the experiment. They were also told that other females would be talking at the same time and that their task was to try to ignore the competing speech and repeat back only what the target talker said. Further, they were told that they would be able to identify the target initially because hers would be the loudest voice. Participants were encouraged to guess after trials in which they were uncertain.

Stimuli were controlled using a custom MATLAB program, routed to an audiometer (GSI Audiostar Pro, Grason-Stadler) using the External A port, and presented to the participant in the sound field at equal levels from each of two loudspeakers (positioned −45° and +45° azimuth). The target speech was presented at a fixed nominal level of 70 dB sound pressure level (SPL), but the actual level differed across older participants in experiment I as a result of the NAL-RP prescribed amplification. In experiment II, target speech was consistently presented at 70 dB SPL for both groups of young adults (y-nh and y-nh-sh). For the y-nh-sh participants, spectral shaping reduced the low frequency energy within the stimuli, while middle-to-high frequency energy was increased compared to the natural recordings. For the stimuli presented to the y-nh group (without spectral shaping), the average low frequency (<1000 Hz) SPL was 68 dB SPL, while the average middle-to-high frequency (>1000 Hz) level was 66 dB SPL. For the spectrally shaped stimuli presented to the y-nh-sh group, the average low frequency level was 63 dB SPL, while the average middle-to-high frequency level was 69 dB SPL.

The level of the masker speech varied from trial-to-trial and was dependent upon the participant's previous responses. Initially, the signal-to-noise ratio (SNR) was set to +5 dB to ensure that the participant could easily identify the target talker's voice and become familiar with the task. Two interleaved one-up, one-down adaptive tracks of 32 sentences were utilized to estimate the speech recognition threshold (SRT) associated with 50% correct recognition of the target keywords in each target-masker condition. In this procedure, one of the two tracks employed a lax criterion, decreasing the SNR if the participant reported one or more keywords correctly, while the other track used a strict criterion, decreasing the SNR if the participant got all or all but one keyword correct. These two different stepping rules ensured that data points above and below the desired threshold point of 50% correct recognition were sampled, providing data necessary for an accurate psychometric function fit. The function used to establish threshold for each participant and for each condition was a logit, defined as

y = 1 / (1 + exp(-(x - α)/β)), (1)

where y is the proportion correct, x is the signal level in dB SNR, α is the SRT, and β is the function slope. Fits were made by minimizing the sum of squared error, with the word-level data at each SNR weighted by the number of observations.
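One way to perform the observation-weighted least-squares fit of Eq. (1) is sketched below using SciPy (the starting values and function names are assumptions, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import curve_fit

def logit(x, alpha, beta):
    """Eq. (1): alpha is the SRT (50% point), beta is the slope parameter."""
    return 1.0 / (1.0 + np.exp(-(x - alpha) / beta))

def fit_srt(snr_db, prop_correct, n_obs):
    """Weighted least-squares fit of Eq. (1); each point is weighted by its
    number of observations (sigma proportional to 1/sqrt(n))."""
    (alpha, beta), _ = curve_fit(logit, snr_db, prop_correct,
                                 p0=[float(np.median(snr_db)), 2.0],
                                 sigma=1.0 / np.sqrt(n_obs))
    return alpha, beta
```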

Responses were spoken aloud and scored on-the-fly by an examiner seated in the control room of the suite. The examiner had a clear view of the participant's face and heard the response over a headphone connected to the audiometer, giving the examiner both visual and auditory cues for accurate scoring. Any variations of the keyword (including pluralization and tense changes) were marked as incorrect. Total test time for all of the procedures was approximately two hours.
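The stepping rules for the two interleaved adaptive tracks described above can be sketched as a single update function (the 2-dB step size is an assumption; the text does not specify it):

```python
def next_snr(snr_db, n_correct, n_keywords, rule, step_db=2.0):
    """One-up, one-down update for the interleaved tracks.
    'lax' track: SNR decreases if one or more keywords are correct.
    'strict' track: SNR decreases only if all, or all but one, are correct."""
    if rule == "lax":
        harder = n_correct >= 1
    elif rule == "strict":
        harder = n_correct >= n_keywords - 1
    else:
        raise ValueError("rule must be 'lax' or 'strict'")
    return snr_db - step_db if harder else snr_db + step_db
```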

E. Quantification of EM and audibility

The aim of the present study was to evaluate the influence of target/masker f0 contour depth similarity on IM in older adults with SNHL and young adults with normal hearing. To that end, EM was quantified using a speech intelligibility model. The rationale behind incorporating this modeling procedure was that the differences across conditions and groups could reflect a combination of IM and differences in audibility and EM, associated with differences in spectral content, spectral shaping, and hearing loss of the participants. The model does not capture effects of IM, so the differences between the behavioral data and model predictions should reflect differences in susceptibility to IM.

The speech intelligibility model was initially developed to characterize speech-in-noise recognition with and without binaural difference cues in the presence of modulated noises for participants with normal hearing (Collin and Lavandier, 2013; Vicente and Lavandier, 2020), and was later extended to predict speech intelligibility for participants with hearing loss (Lavandier et al., 2018; Vicente et al., 2020). In the present study, all stimuli were diotic (with no binaural effects involved). The model used the calibrated target and masker signals at the two ears as inputs, along with the participant's pure-tone audiograms. For each listener, an internal noise was spectrally shaped using the audiogram. This noise was scaled independently in each frequency band, to capture effects related to outer and inner hair cell loss (see Vicente et al., 2020). Therefore, in this modeling framework, audibility differences based on participant hearing thresholds were incorporated via EM (i.e., internal noise). The target and masker signals were passed through a gammatone filterbank (Patterson et al., 1987), with two half-overlapping filters per equivalent rectangular bandwidth (ERB; Moore and Glasberg, 1983).
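The "two half-overlapping filters per ERB" spacing corresponds to gammatone center frequencies placed 0.5 ERB apart on the ERB-number scale. A minimal sketch (the ERB-scale coefficients follow the common Glasberg and Moore approximation, and the frequency range is an assumption; the model's exact formula may differ):

```python
import numpy as np

def erb_number(f_hz):
    """ERB-number scale (Glasberg and Moore's common approximation)."""
    return 21.4 * np.log10(0.00437 * f_hz + 1.0)

def erb_to_hz(e):
    """Inverse of erb_number."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def gammatone_centers(f_lo=100.0, f_hi=8000.0, per_erb=2):
    """Center frequencies spaced 1/per_erb ERB apart, giving two
    half-overlapping filters per ERB as in the model."""
    e = np.arange(erb_number(f_lo), erb_number(f_hi), 1.0 / per_erb)
    return erb_to_hz(e)
```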

Envelope peaks in the speech masker are associated with greater masking of the target, whereas envelope minima and pauses are associated with little or no masking. The model captures this temporal effect by segmenting the input masker signal using 24-ms half-overlapping Hann windows, before applying the gammatone filtering. The SNR as a function of frequency and time is computed as the difference between the target level and either the masker or internal noise level, whichever is larger. Masker level is computed separately for each temporal window, but target level is the average level of the target across time (Vicente and Lavandier, 2020; Vicente et al., 2020; Cubick et al., 2018). The rationale for using the long-term average target level is that both the presence and the absence of target energy are phonemically informative. Computing instantaneous SNR based on short samples of the target would result in low values for internal SNR during spectro-temporal segments of the target associated with gaps or pauses, mistakenly reducing predicted intelligibility (Rhebergen and Versfeld, 2005).

The resulting values of SNR as a function of frequency are capped at a 20-dB ceiling, weighted by the associated frequency band importance (ANSI, 1997), and summed across frequency and averaged across time. The model predicts as output an “internal” broadband long-term SNR, which is then used to predict the SRT (i.e., the higher the internal SNR, the lower the predicted SRT). Differences in inverted predicted internal SNR can be directly compared to differences in SRT. The normal/normal condition (i.e., the condition with normal f0 contour depth for the target and masker speech) measured with the young adults (i.e., the y-nh group with no spectral shaping) was used as the reference condition. The predicted differences in EM are expressed relative to this reference (see Fig. 5).
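The combination stage described above (larger of masker and internal noise, 20-dB SNR ceiling, band-importance weighting, summation across frequency, averaging across time) can be sketched on per-band dB levels as follows (a minimal sketch; argument names are assumptions):

```python
import numpy as np

def internal_snr(target_db, masker_db, noise_db, band_weights):
    """Combine per-band levels into the model's 'internal' broadband SNR.
    target_db: long-term target level per band, shape (n_bands,).
    masker_db: masker level per time window and band, shape (n_windows, n_bands).
    noise_db: audiogram-derived internal noise level per band, shape (n_bands,).
    band_weights: band-importance weights summing to 1, shape (n_bands,)."""
    effective = np.maximum(masker_db, noise_db)    # larger of masker and noise
    snr = np.minimum(target_db - effective, 20.0)  # 20-dB ceiling
    per_window = snr @ band_weights                # importance-weighted sum
    return per_window.mean()                       # average across time
```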

FIG. 5.

Predicted SRT differences (dB SNR) associated with EM and audibility for the three groups of listeners (y-nh, y-nh-sh, o-hi-sh) in the nine conditions tested in this study. Predicted differences are expressed relative to the reference condition measured with the y-nh group (no spectral-shaping of the stimuli) with the normal target/normal masker condition. Only one predicted value per condition was obtained for the y-nh group; for the y-nh-sh and o-hi-sh groups, means and one standard deviation from the mean are shown.

The model considers hearing loss only in terms of an increase in audiometric thresholds. In particular, it does not incorporate the reduced frequency selectivity or reduced temporal resolution often associated with SNHL. Despite these simplifications, this model has been shown to accurately predict the effects of audibility and masker envelope modulations when tested on SRTs measured in the presence of vocoded-speech maskers for listeners with and without hearing loss (Vicente et al., 2020; Lavandier et al., 2018). In contrast to vocoded-speech maskers, natural-speech maskers contain voiced parts with a harmonic structure and corresponding f0 information. Harmonic cancellation was not modelled here, and it is unclear how large a role this mechanism plays with a speech masker containing unvoiced segments and a time-varying f0 (Leclère et al., 2017). A harmonic-cancellation model has recently been proposed to predict intelligibility in the presence of a single harmonic complex with no f0 variation (Prud'homme et al., 2020), but to the best of our knowledge no such model is currently available to predict intelligibility in the presence of a competing speech masker, let alone in the presence of multiple speech maskers, as in the present study.

Following the approach of Vicente et al. (2020), the target signal used to compute the predictions in each condition (flat, normal, or exaggerated) was created by averaging the waveforms of all 336 target sentences, truncated to the shortest sentence duration before averaging. The entire two-talker masker signal was used as model input. The masker signal duration was at least 75 s for each of the three tested conditions. All of the stimuli used for the older adults and the second group of younger adults tested in experiment II (i.e., o-hi-sh and y-nh-sh, respectively) were spectrally shaped using the individual NAL-RP gains as in the experiment. Finally, all signals were calibrated to the sound level used for the target in the experiment. The model computes internal noise levels based on each audiogram and the overall presentation level of the stimuli (target + masker), but this overall level is not known a priori because it depends on the SNR, which varies during the adaptive measurement of SRT. Vicente et al. (2020) approximated this overall level using the presentation level of the masker that was fixed during their experiments. In the present study, the target level was fixed, so target level was used instead as a proxy for overall level when computing internal noise levels.
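
The construction of the model's target input described above (averaging all sentence waveforms after truncating each to the shortest duration) can be sketched directly:

```python
import numpy as np

def average_target(sentences):
    """Average target sentence waveforms after truncating each to the
    duration of the shortest sentence, following Vicente et al. (2020)."""
    n_min = min(len(s) for s in sentences)
    return np.mean([np.asarray(s[:n_min], dtype=float) for s in sentences],
                   axis=0)
```

Averaging many sentences yields a single waveform whose long-term spectrum represents the target speech in that condition, which is all the model needs given that it uses the long-term target level.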

The pure-tone thresholds of the young adults were assumed to be 0 dB HL at all frequencies. An analysis including individual thresholds for the younger listeners was not possible because those subjects were screened for normal hearing but individual thresholds were not measured. However, when the model was run with thresholds of either −10 or 15 dB HL at all frequencies, the results were functionally identical (within 0.1 dB). Individual binaural room impulse responses were not measured at the time of the experiment, so the exact signals at the listeners' ears were not available in this study. The signals sent to the two loudspeakers used in the experiment were therefore used as model inputs, as if they were the signals at the listener's ears. This assumption neglects the filtering by the loudspeakers, listening room, and head-related transfer functions of the listener. However, because this filtering was constant across conditions (assuming a fixed head position), it is expected to have little or no impact on the model predictions. Model predictions were computed only once for the y-nh group (identical stimuli and audiograms for all listeners) and for each participant in the o-hi-sh and y-nh-sh groups; average predictions for each group are presented. Data points considered to be outliers (as described below) were also omitted from modeling.

III. RESULTS

A. SRT results for older adults (experiment I)

Older adults' SRTs for the nine target/masker listening conditions were analyzed using a multiple regression model. The effects in the model were target speaking style, masker speaking style, and the interaction between them. Each subject was evaluated in all nine condition combinations, so each subject's nine responses are correlated with one another. In other words, subjects who scored higher than average in one condition are likely to score higher than average in all of the conditions. To account for this correlation due to repeated measures, an unstructured correlation matrix for the residuals was used. An unstructured matrix of correlated errors allows for different variances across conditions, as well as different correlations between conditions (Oleson et al., 2019). The parameter estimates for the full regression model are shown in Table I. False discovery rate (FDR; Benjamini and Hochberg, 1995) adjusted p-values were used to adjust for multiple comparisons. Of the 198 possible data points (9 conditions × 22 participants) there were three missing data points due to experimental error, one in the normal/normal condition and two in the normal/flat condition. The one data point associated with a poor psychometric function fit (r² = 0.46; flat/exaggerated) was excluded. Data reported below are based on 194 data points, all associated with good psychometric function fits (median r² = 0.91, interquartile range of r² = 0.84–0.97).

TABLE I.

Parameter estimates from the regression model assessing the main effects of target and masker speaking styles, and the interaction of these effects on SRT for older adults with hearing loss. Speaking styles are flat (fl.), normal (norm.), and exaggerated (ex.). An * indicates an FDR adjusted p-value.

Effect | Target vs Target | Masker | Estimate | Test | p-value
Target | | | | F(2,21) = 27.95 | <0.0001
Masker | | | | F(2,21) = 33.35 | <0.0001
Target × Masker | | | | F(4,21) = 29.14 | <0.0001
| fl. vs ex. | fl. | 0.2 | t(21) = 0.34 | 0.7400*
| fl. vs norm. | fl. | 0.5 | t(21) = 1.27 | 0.2784*
| ex. vs norm. | fl. | 0.3 | t(21) = 0.82 | 0.4723*
| ex. vs fl. | norm. | 1.2 | t(21) = 3.76 | 0.0027*
| ex. vs norm. | norm. | 0.6 | t(21) = 1.77 | 0.1642*
| norm. vs fl. | norm. | 0.5 | t(21) = 1.55 | 0.2042*
| ex. vs fl. | ex. | 3.8 | t(21) = 12.92 | <0.0001*
| ex. vs norm. | ex. | 2.2 | t(21) = 7.28 | <0.0001*
| norm. vs fl. | ex. | 1.6 | t(21) = 5.62 | <0.0001*

Across the nine target/masker conditions, mean SRTs ranged from −0.5 dB SNR (the most difficult target/masker combination) to −4.3 dB SNR (the easiest; see Fig. 2). Means and standard deviations for SRTs in all target/masker conditions are provided as supplementary materials.1 For the flat masker, there were no statistically significant differences in SRT between the three target types. For the normal masker, there was a statistically significant difference between the exaggerated and flat targets. For the exaggerated masker condition, all three target types were significantly different from each other. For the exaggerated masker, SRTs were worst for the matched target/masker condition (exaggerated/exaggerated), while performance improved as a function of dissimilarity in f0 contour in the two mismatched target/masker conditions (normal/exaggerated; flat/exaggerated).

FIG. 2.

SRTs (dB SNR) for older adults with mild to moderate SNHL presented with spectrally shaped stimuli (o-hi-sh) for three target speaking styles: flat (fl.), normal (norm.), and exaggerated (ex.) in two-talker maskers with three different speaking styles: flat (left column), normal (middle column), and exaggerated (right column). Open circles represent individual listener data, while boxes indicate the 25th to 75th percentiles of the data. Whiskers extend to the last data point falling within 1.5 times the interquartile range. Lower SRTs (smaller or more negative SNRs) indicate better performance.

The results shown in Fig. 2 indicate that when listening in the flat two-talker masker, the similarity of the f0 contour between the target and masker speech did not influence SRT. However, within the normal two-talker masker, flat targets were associated with lower SRTs than exaggerated targets, a difference of ∼1 dB. This effect may reflect lower intelligibility of the exaggerated targets in general. Calandruccio et al. (2019) reported a 1-dB decrement for young adults in the overall intelligibility of exaggerated targets compared to both flat and normal targets in a steady-noise masker. They proposed that this decrease in intelligibility could be due to a discrepancy between the pitch contour and the semantic meaning of the sentence (see also Fig. 6 in Sec. IV). For the exaggerated masker condition, the older adults did display a release from masking as the f0 contour depth of the target speech became less similar to that of the masker speech (exaggerated target SRTs > normal target SRTs > flat target SRTs).

FIG. 6.

F0 contours for the flat (dotted line), normal (dashed line), and exaggerated (solid line) productions of the BKB sentence, "The farmer keeps a bull." The text within the figure is aligned with the onset of each respective word. Due to the differences in duration across the three types of sentences, time is normalized to total duration.

B. Intrinsic predictors of multi-talker listening for older adults (experiment I)

In experiment I, the highest mean SRT occurred in the matched exaggerated/exaggerated condition, and significant masking release was observed when the target was mismatched in speaking style with respect to the exaggerated masker (normal/exaggerated and flat/exaggerated; see right column of Fig. 2). This observation of masking release could indicate that there was substantial IM in the exaggerated/exaggerated condition. Based on this observation, results in the exaggerated/exaggerated condition were selected for further analysis. A multiple regression model was used to determine if listener age, working memory capacity (quantified via a reading span score), and hearing loss (quantified via 4-frequency PTA) predicted the SRT in the exaggerated/exaggerated condition. The model was significant (r² = 0.58, p < 0.0001); higher SRTs were predicted by larger values of PTA (β̂ = 0.11, t = 4.51, p = 0.0003) and older age (β̂ = 0.13, t = 2.24, p = 0.0381). Working memory capacity was not related to SRT (β̂ = 0.04, t = 1.59, p = 0.1294) after controlling for age and PTA.2

C. SRT results for young adults (experiment II)

Analysis of the pattern of SRTs for young adults was similar to the approach described above, with the exception that a main effect of listener group (y-nh and y-nh-sh) and interactions with listener group were also included in the model. The addition of listener group to this analysis resulted in an additional three-way interaction (target × masker × group). Differences in SRTs for the nine target/masker listening conditions were analyzed using t-tests with FDR adjusted p-values, based on least squares mean estimates for each target × masker × group combination. Of the 396 possible data points (9 conditions × 44 participants) there were three missing data points due to experimental errors: one for the exaggerated/exaggerated condition and two for the normal/normal condition. Of the remaining 393 data points, 20 (<5%) were excluded from the analysis due to poor psychometric function fits (r² values < 0.50). This left a total of 373 data points with strong psychometric function fits (median r² = 0.88, interquartile range of r² = 0.78–0.95).

Young adults benefited when the target speaking style became less similar to that of the masker speech. This was observed for all three masker conditions and occurred regardless of spectral shaping of the stimuli (Fig. 3). No significant difference was observed between groups tested with and without spectral shaping (F1,42 = 0.33, p = 0.5560), and there was not a significant three-way interaction (F4,42 = 2.43, p = 0.0621). The parameter estimates for the full regression model are shown in Table II. Means and standard deviations for all target/masker conditions for both groups of young adults are provided as supplementary materials.1

FIG. 3.

SRTs (dB SNR) for two different young adult listener groups with normal hearing. One group (y-nh) was presented with the original speech stimuli (unfilled box plots), while the second group (y-nh-sh) was presented with the same spectrally shaped stimuli as the older adults in experiment I (filled box plots). Filled circles and open diamonds represent individual listener thresholds for the original and shaped conditions, respectively. Plot layout follows the same format as Fig. 2.

TABLE II.

Parameter estimates from the regression model assessing the main effects of target speaking style, masker speaking style, and listener group, and the interaction of these effects on SRT for young adults (y-nh, young adults tested with original stimuli; y-nh-sh, young adults tested with spectrally shaped stimuli). Speaking styles are flat (fl.), normal (norm.), and exaggerated (ex.). An * indicates an FDR adjusted p-value.

Effect | Target vs Target | Masker | Group | Estimate | Test | p-value
Target | | | | | F(1,42) = 25.00 | <0.0001
Masker | | | | | F(1,42) = 36.13 | <0.0001
Target × Masker | | | | | F(2,42) = 47.90 | <0.0001
Group | | | | | F(1,42) = 0.33 | 0.5660
Group × Target | | | | | F(2,42) = 2.30 | 0.1128
Group × Masker | | | | | F(2,42) = 0.16 | 0.8511
Group × Target × Masker | | | | | F(4,42) = 2.43 | 0.0621
| fl. vs ex. | fl. | y-nh | 6.6 | t(42) = 7.78 | <0.0001*
| fl. vs norm. | fl. | y-nh | 2.5 | t(42) = 5.38 | <0.0001*
| ex. vs norm. | fl. | y-nh | −4 | t(42) = −5.47 | <0.0001*
| ex. vs fl. | norm. | y-nh | −1.5 | t(42) = −2.11 | 0.0473*
| ex. vs norm. | norm. | y-nh | −4.4 | t(42) = −4.98 | <0.0001*
| norm. vs fl. | norm. | y-nh | 2.9 | t(42) = 4.16 | 0.0002*
| ex. vs fl. | ex. | y-nh | 9.4 | t(42) = 11.39 | <0.0001*
| ex. vs norm. | ex. | y-nh | 2.5 | t(42) = 7.64 | <0.0001*
| norm. vs fl. | ex. | y-nh | 6.9 | t(42) = 8.03 | <0.0001*
| fl. vs ex. | fl. | y-nh-sh | 4 | t(42) = 4.84 | <0.0001*
| fl. vs norm. | fl. | y-nh-sh | 2.3 | t(42) = 4.90 | <0.0001*
| ex. vs norm. | fl. | y-nh-sh | −1.7 | t(42) = −2.31 | 0.0316*
| ex. vs fl. | norm. | y-nh-sh | 0.5 | t(42) = 0.78 | 0.4398*
| ex. vs norm. | norm. | y-nh-sh | −2.2 | t(42) = −2.52 | 0.0202*
| norm. vs fl. | norm. | y-nh-sh | 2.7 | t(42) = 3.97 | 0.0004*
| ex. vs fl. | ex. | y-nh-sh | 7.6 | t(42) = 8.96 | <0.0001*
| ex. vs norm. | ex. | y-nh-sh | 2.1 | t(42) = 6.30 | <0.0001*
| norm. vs fl. | ex. | y-nh-sh | 5.5 | t(42) = 6.21 | <0.0001*

D. Matched vs mismatched target/masker conditions (experiments I and II)

At the outset of this project, we predicted lower SRTs when there was a mismatch in the target and masker speaking style as compared to a matched style. This prediction was tested in an additional analysis that pooled all thresholds in the three matched conditions and those in the six mismatched conditions. These tests used contrast statements within a regression model that included all three groups of listeners, with FDR adjusted p-values (see Table III). Effects in the model were target style, masker style, group, and all two- and three-way interaction terms. All factors were significant (p < 0.001) other than the group × masker style interaction (F(4,63) = 1.90, p = 0.1212). See supplementary materials1 for the parameter estimates of the full regression model. Contrasts demonstrated statistically significantly higher thresholds in the matched conditions than the mismatched conditions for young adults (y-nh: t(42) = 10.58, p < 0.0001), young adults with shaped stimuli (y-nh-sh: t(42) = 7.87, p < 0.0001), and older adults with hearing loss and shaped stimuli (o-hi-sh: t(42) = 6.07, p < 0.0001). Averaged across both groups of young adults, there was a 4-dB decrease in SRT from the matched to the mismatched conditions. However, for the older adult group, mismatching the target and the masker in f0 contour depth only decreased thresholds by about 1 dB on average compared to the matched conditions.

TABLE III.

Contrast statements tested within the framework of the regression model evaluating the main effects of target speaking style, masker speaking style, and listener group, as well as all two- and three-way interactions. This model allowed for an evaluation of all matched and mismatched target/masker conditions compared across all three listener groups (o-hi-sh, older adults with hearing loss and spectral shaping; y-nh, young adults tested with original stimuli; y-nh-sh, young adults tested with spectrally shaped stimuli).

Contrast Estimate Standard error df t Value FDR adjusted p-value
matched vs mismatch o-hi-sh 1.075 0.374 63 2.88 0.0055
matched vs mismatch y-nh 4.691 0.379 63 12.39 <0.0001
matched vs mismatch y-nh-sh 3.478 0.376 63 9.24 <0.0001
matched o-hi-sh vs y-nh 3.758 1.026 63 3.66 0.0008
matched o-hi-sh vs y-nh-sh 5.197 1.019 63 5.1 <0.0001
matched y-nh vs y-nh-sh 1.439 1.028 63 1.4 0.1996
mismatched o-hi-sh vs y-nh 29.221 4.154 63 7.03 <0.0001
mismatched o-hi-sh vs y-nh-sh 24.815 4.154 63 5.97 <0.0001
mismatched y-nh vs y-nh-sh −4.398 4.161 63 −1.06 0.2945

Interestingly, for the matched conditions, the thresholds across all three groups only varied by about 1.5 dB, indicating that the older adults did not actually perform much worse than the young adults in those conditions (see left side of Fig. 4). These data are in contrast to the mismatched conditions, in which a difference of almost 5 dB was observed between older adults and the two groups of young adults; the two groups of young adults (y-nh and y-nh-sh) were not significantly different from each other (t(42) = −0.81, FDR-adjusted p-value = 0.4170).

FIG. 4.

Mean SRTs for the three matched conditions and six mismatched target/masker conditions for the three groups of listeners (o-hi-sh, y-nh, y-nh-sh). Error bars represent 1 standard error of the mean.

E. Quantification of EM and audibility

Differences in EM and audibility predicted by the model of Vicente et al. (2020) for the three groups of listeners and all nine listening conditions are shown in Fig. 5. With the unshaped stimuli, the flat masker provided about 1 dB more masking than the normal masker. Using the spectrally shaped stimuli (for the o-hi-sh and y-nh-sh groups) slightly reduced differences in EM across conditions; specifically, spectral shaping eliminated the difference between the flat and normal maskers for the three target types. The pattern of EM across conditions is very similar for the y-nh-sh and o-hi-sh listener groups. The model predicts that SRTs should be about 2.5 dB higher for the o-hi-sh compared to the y-nh-sh listeners, and that this increase in threshold should be the same across all tested conditions. Overall, for each group of listeners and for each masker condition, the predicted differences in EM produced by changing the target f0 contour are always smaller than 0.8 dB.

IV. DISCUSSION

The purpose of these experiments was to examine the importance of f0 contour depth similarity between target and masker speech on sentence recognition for older adults with varying degrees of SNHL in comparison to young adults with normal hearing. f0 contour depth has been demonstrated to be an effective stream segregation cue for young adults with normal hearing, with differences in the perceptual similarity of target and masker f0 contour depth supporting better speech recognition (Brokx and Nooteboom, 1982; Calandruccio et al., 2019; Darwin et al., 2003). Older adults with SNHL demonstrated significantly less benefit when the target/masker f0 contour depth was mismatched between the target and masker speech, rather than matched. This was true even though individualized amplification was applied to the stimuli in order to compensate for reductions in hearing thresholds and to improve audibility of speech cues. In fact, only in the exaggerated masker condition (right panel, Fig. 2) were the older adults (o-hi-sh) able to benefit from a target/masker f0 contour depth mismatch, and this benefit was still reduced compared to the young adults with normal hearing (Table IV in supplementary materials1).

A. The effect of target/masker f0 contour depth similarity on SRTs

Similar to results reported in Calandruccio et al. (2019), significant masking release was observed for young adults with normal hearing when f0 contour depth was mismatched between the target and the masker speech (see Fig. 3). This phenomenon is most clearly visible when examining the left- and right-hand columns, comparing matched and mismatched target/masker speaking style conditions for the y-nh and y-nh-sh groups; thresholds are poor when the target and masker are matched in speaking style, and improve significantly when a difference in f0 contour depth is introduced (see also Fig. 4).

In the flat masker condition, the o-hi-sh group displayed no significant differences in SRT across any of the target speaking styles. This result is in line with the EM modeling data, which predict about a 0.5 dB difference between the three target conditions. However, this result is quite different from that of the young adults, who realized a stepwise improvement (approximately 7 dB) in SRT when the f0 contour depth of the target speech was perceptually dissimilar from the flat masker (see left panel of Fig. 2, in comparison to left panel of Fig. 3). In the normal f0 contour masker, older adults also failed to utilize differences between the target and masker speech to appreciably improve speech recognition (other than the small, 1-dB mean difference between the flat target/normal masker and the exaggerated target/normal masker conditions). This is in contrast to both groups of young adults, who displayed significant benefit (ranging between a 2.1 and 4.1 dB improvement in threshold) for the flat and exaggerated target speech compared to the matched normal target speech condition (see middle panel of Fig. 2 in comparison to middle panel of Fig. 3). In contrast, the model predicts thresholds about 0.5 dB or less higher for the flat target than for either the normal or exaggerated targets in the normal masker condition.

Only in the exaggerated f0 contour masker condition did older adults demonstrate a release from masking when target f0 contour depth was dissimilar from that of the masker. In the exaggerated masker, older adults' speech recognition improved significantly between the matched (exaggerated/exaggerated) target/masker condition and the mismatched (normal/exaggerated and flat/exaggerated) target/masker conditions. While this pattern of masking release was similar to that of the young adults, the magnitude of benefit demonstrated by the older adults was less than that of the young adults. Average improvement for older adults was about 3.7 dB, while the young adults realized an average masking release of 9.2 dB (y-nh) and 7.4 dB (y-nh-sh; see right panel of Fig. 2 in comparison to the right panel of Fig. 3). The predicted EM differences between the three target types were largest for the o-hi-sh group and were predicted to be just over 0.5 dB poorer for the flat/exaggerated condition compared to the two other target/masker conditions (normal/exaggerated and exaggerated/exaggerated). Further, the predicted difference in EM (flat > exaggerated) is the opposite of the behavioral data, which indicated lower SRTs for the flat targets (flat < exaggerated) in the exaggerated masker.

B. Effects of age on the ability to benefit from mismatches in f0 contour depth

Why did older adults benefit from differences in the target and masker f0 contour for the exaggerated masker, but not the flat masker? For the y-nh and y-nh-sh groups, the largest masking release is observed when the speaking style of the target and masker is markedly different: flat/exaggerated or exaggerated/flat. However, older adults experience masking release in only one of these two conditions, benefiting from a mismatch in the flat/exaggerated but not the exaggerated/flat condition. One explanation for this discrepancy is that younger and older adults may rely on different cues related to segregation and grouping (Jesse and Helfer, 2019; Russo and Pichora-Fuller, 2008).

Successful speech recognition in a speech masker requires that the listener segregate the target from the masker and that they group together glimpses of the target speech (Cooke, 2006). If segregation is the primary challenge facing the listener, any perceptual difference (flat/exaggerated or exaggerated/flat) should be equally beneficial (Bregman, 1990). However, if grouping were a limiting factor (Bologna et al., 2018), we might expect differential benefit in these two conditions. One of the Gestalt principles of perceptual organization is continuity (Koffka, 1935). When a visual or auditory object is partially masked, we often perceive it as continuing "through" the masker, even though the masked portion is not directly perceived; for acoustic speech stimuli, continuity can result in phonemic restoration (Warren, 1970). When the target has a flat f0 contour, the f0 before and after the masked epochs is very similar, and continuity is clear. However, when the target is exaggerated, the f0 changes markedly and unpredictably; this could reduce perceived continuity (Clarke et al., 2014), and, as a result, compromise grouping of information related to the target.

The unpredictable nature of the exaggerated sentences is due in part to the fact that the current stimulus set consisted of declarative sentences, which are most naturally produced with a neutral f0 contour, characterized by an f0 decrease of about 2–7 Hz across the length of the sentence (Liberman et al., 2002; Paulmann and Pell, 2010; Raphael et al., 2007). Figure 6 illustrates f0 contours for the sentence, “The farmer keeps a bull.” For the normal target, the f0 contour peaks at the first keyword (“farmer,” the subject of the sentence) and falls in frequency as the sentence progresses. For the flat production, we observe very little variation in the f0 contour across the entire duration of the sentence. However, for the exaggerated target, the f0 contour is not only more variable for the first keyword, but we also observe a sharp (unexpected) increase in f0 contour for the final keyword (“bull,” the object of the sentence), which is atypical for declarative sentence production in natural speech. The idea that predictability improves grouping is broadly consistent with the finding from Calandruccio et al. (2019) that even young adults recognized exaggerated speech targets more poorly than either flat or normal speech when the masker was noise. Older adults are known to rely more heavily on semantic context (see Abada et al., 2008; Sommers and Danielson, 1999; Wingfield et al., 2006) for speech recognition. If older adults also rely more heavily on the acoustic f0 context than younger adults, this may explain their ability to benefit from an f0 contour mismatch for flat/exaggerated but not exaggerated/flat conditions.

C. Audibility and source segregation cues in multi-talker listening

The data for the young adults replicate the results of Calandruccio et al. (2019), supporting the idea that increased target and masker similarity negatively influences speech recognition. This is likely a result of increased confusability between the speech streams due to similarity in f0 contour depths (Durlach et al., 2003; Kidd and Colburn, 2017). Furthermore, spectral shaping of the stimuli had no significant consequence for SRTs (no significant group difference between y-nh and y-nh-sh), despite our modeling data suggesting small changes (∼0.8–2.1 dB) in EM between the two stimulus conditions. The lack of a difference in SRTs between the two groups of young adults in experiment II indicates that spectral shaping cannot explain the disparate results of young adults and older adults in experiment I. However, it is not uncommon for older adults to realize less benefit from segregation cues than young adults. In particular, older adults with SNHL often demonstrate a reduced ability to utilize source segregation cues to improve target speech recognition in multi-talker environments (Festen and Plomp, 1990; Humes and Coughlin, 2009; Kidd et al., 2019; Lee and Humes, 2012). For adults with hearing loss, it has been argued that this reduced ability may be due in part to reduced audibility of important segregation cues (Arbogast et al., 2002; Glyde et al., 2015).

Data examining spatial release from masking (i.e., the difference in SRT between co-located target/masker and spatially separated target/masker conditions) suggest that differences in audibility between listener groups may account for group differences in speech recognition in multi-talker environments (Arbogast et al., 2005; Best et al., 2017; Gallun et al., 2013; Glyde et al., 2015; Jakien et al., 2017). For example, Glyde et al. (2015) systematically varied the amount of amplification applied to speech mixtures for adults with SNHL and for a group of adults with simulated hearing loss. The degree of benefit realized in the spatially separated conditions improved for both groups of listeners as a result of enhanced audibility. Specifically, once increased gain exceeded that prescribed by NAL-RP (the same individualized linear prescription as the one used in our study; Byrne et al., 1990), differences in the magnitude of spatial release from masking were reduced between the adults with SNHL and those with normal hearing. It should be noted that the individuals with SNHL still performed significantly worse than the adults with normal hearing, even though the latter group listened with a simulated hearing loss. Therefore, while it seems that audibility played a significant role in the poorer masking release of those with SNHL, other factors, such as age and suprathreshold processing deficits that occur as a result of SNHL (i.e., poorer spectral and temporal resolution), also likely contributed to the deficits observed in multi-talker listening for these participants (Glyde et al., 2015; see also Best et al., 2017).

While these studies each examined the role of spatial segregation cues rather than f0 contour depth cues in multi-talker listening, the influence of audibility and sensation level must be considered when interpreting the current data set. An important finding in the modeling predictions was related to group differences in terms of relative SRTs. The modeling suggested that we should expect about a 2.5 dB SRT difference between our o-hi-sh and y-nh-sh groups, with the o-hi-sh having poorer performance due to reduced audibility. This difference in SRTs is consistent with the difference observed between the o-hi-sh and y-nh-sh groups for the matched target/masker conditions, where a maximum group difference of 1.8 dB was observed for the exaggerated/exaggerated condition. The modeling data in combination with the speech recognition data imply that differences in audibility are indeed consistent with the between-group differences for the matched conditions tested here. Though we applied individualized linear gain (NAL-RP) to the speech stimuli presented to the o-hi-sh group, this amplification prescription does not fully restore audibility for listeners with sloping hearing losses, especially in the high frequencies (>2–3 kHz; Best et al., 2017; Glyde et al., 2015), and did not ensure the same audibility and/or sensation levels achieved by the two younger groups. This idea, that audibility is necessary to utilize a segregation cue because the cue must be heard at an adequate sensation level, is also supported by the fact that four-frequency PTA was the strongest intrinsic predictor of SRT performance in every condition for our older participants (i.e., greater hearing loss was associated with worse SRT performance, even with age and working memory capacity controlled for).

One interesting result from the present study is that the computational modeling of EM and audibility does not predict between-group differences for the mismatched target/masker conditions. Differences in audibility between the older and younger adults are predicted to account for a 2.5 dB difference between groups. However, for the two extreme mismatched conditions (the flat/exaggerated and exaggerated/flat target/masker conditions), an average difference of ∼5.6 dB is observed between the o-hi-sh group and the average of the two y-nh groups. Therefore, although audibility may account for group differences in some speech-in-speech conditions (the matched conditions), utilization of segregation cues remains deficient for older adults with SNHL, as evidenced by the mismatched conditions. Though individual cochlear filter bandwidths and frequency resolution were not measured here, it is widely accepted that older adults with SNHL tend to show broadened cochlear filters and reduced frequency selectivity (Moore, 1985; Oxenham and Bacon, 2003), resulting in poorer pitch perception (Mackersie, 2003; Oxenham, 2018). Therefore, it is possible that, in addition to reduced audibility, peripheral changes associated with SNHL decreased the older adults' ability to exploit target/masker f0 contour depth differences, due in part to poorer pitch processing (Mackersie, 2003; Oxenham, 2018). Further, if attentional resources are required to exploit the acoustic differences between target and masker f0 contours in the mismatched conditions, insufficient residual attentional resources may also partly explain the older adults' performance.

D. EM components

In Calandruccio et al. (2019), an ideal binary mask (IBM) was utilized to estimate EM across the same nine listening conditions evaluated in the present study. IBMs simulate an ideal target/masker segregation process by zeroing out time/frequency epochs within the speech mixture that are dominated by masker energy (Anzalone et al., 2006; Brungart et al., 2006; Wang, 2005). In each trial, the participant is therefore left with sparse cues from time/frequency epochs dominated by target speech energy. The IBM results from Calandruccio et al. (2019) indicated that EM was equivalent for all three target speaking styles in both the normal and exaggerated masker conditions (see Fig. 5 of that paper). In addition, those data indicated that participants had significantly poorer SRTs when listening in the flat masker conditions (∼3 dB poorer than in the normal and exaggerated IBM conditions). The authors concluded that the flat masker caused more EM than the other two maskers. Most importantly, however, no significant EM differences were found across target speaking styles within any of the three maskers.
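As a rough illustration of the IBM procedure described above, the sketch below zeroes masker-dominated time/frequency cells and retains target-dominated "glimpses." All names and toy values are hypothetical (this is not the processing chain of Calandruccio et al., 2019); it assumes magnitude time/frequency representations, a simplistic additive mixture, and a 0 dB local criterion:

```python
import numpy as np

def ideal_binary_mask(target_mag, masker_mag, criterion_db=0.0):
    """Return a binary mask keeping time/frequency cells where the local
    target-to-masker ratio (in dB) exceeds criterion_db; all other cells
    are zeroed out."""
    eps = np.finfo(float).tiny  # guard against log10(0)
    local_snr_db = 10.0 * np.log10((target_mag**2 + eps) / (masker_mag**2 + eps))
    return (local_snr_db > criterion_db).astype(float)

# Toy magnitude "spectrograms": 3 frequency bands x 4 time frames.
target = np.array([[2.0, 0.1, 2.0, 0.1],
                   [0.1, 2.0, 0.1, 2.0],
                   [1.0, 1.0, 1.0, 1.0]])
masker = np.array([[0.5, 1.0, 0.5, 1.0],
                   [1.0, 0.5, 1.0, 0.5],
                   [2.0, 2.0, 0.2, 0.2]])

mask = ideal_binary_mask(target, masker)
mixture = target + masker   # simplistic additive mixture (phase ignored)
glimpsed = mask * mixture   # sparse, target-dominated glimpses remain
```

In practice, the masker-dominated cells that survive in `glimpsed` are what limit performance: the more target-dominated cells a masker swamps, the more EM it exerts.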

This main finding was confirmed by the modeling approach followed in the present study, and extended here to older adults with SNHL. For each group of listeners and each masker type, changing the target contour depth produced very limited differences in EM (<0.8 dB), so the corresponding within-masker differences in SRTs observed for the y-nh and y-nh-sh listeners can be attributed to IM. Model results were consistent with the IBM data of Calandruccio et al. (2019) for the unshaped stimuli (those presented to the y-nh group in experiment II). However, the modeling also suggests that once spectral shaping was applied to the target/masker mixture, the difference in EM between the flat and normal masker conditions (for each target/masker pair) was no longer observed.

As noted in Sec. II E, harmonic cancellation was not modeled here, so modest differences in EM related to f0 contour depth cannot be ruled out. Incorporating harmonic cancellation into the model would likely reduce estimates of EM in the flat masker condition, because effects of harmonic cancellation are expected to be larger for the flat condition than for the normal or exaggerated conditions, based on prior results showing reduced harmonic cancellation for harmonic complexes with a dynamic f0 (Leclère et al., 2017). However, reducing estimates of EM for the flat masker relative to the other two styles would lead to greater divergence between predicted and observed data. This suggests that effects of harmonic cancellation were probably modest or absent in the conditions tested.

E. Cognitive factors influencing differences in SRT performance

Working memory capacity (WMC) has been posited as a strong predictor of speech recognition performance in many widely cited models of speech perception (Edwards, 2016; Rönnberg et al., 2013; Rönnberg et al., 2019). In experiment I, we did not observe a significant relationship between masked SRTs and WMC in older adults. This was true even for the condition thought to exert high IM as a result of semantic mismatch between exaggerated f0 contours and the linguistic content of the BKB sentence stimuli (exaggerated/exaggerated), and remained true in a larger statistical model that included all nine listening conditions (see Sec. III B). However, this outcome is not unique to our data; across the literature there are a number of empirical examples in which WMC did not predict speech recognition performance (e.g., Füllgrabe and Rosen, 2016; Füllgrabe et al., 2015). Some of these inconsistencies may reflect studies that were underpowered for investigating this relationship, and/or samples of participants that were too homogeneous and therefore lacked adequate variability in WMC between listeners (Dryden et al., 2017). Some researchers have argued that the relationship between speech recognition performance and WMC is more apparent when listeners are faced with high degrees of IM (Rönnberg et al., 2019).

To increase variance in performance across participants and increase the power of the analysis, we conducted a supplemental analysis that examined the relationship between SRT for the exaggerated/exaggerated condition and verbal WMC for all three groups of participants (o-hi-sh, y-nh, and y-nh-sh). As in the analysis of experiment I with the o-hi-sh group, the exaggerated/exaggerated condition is thought to cause a high degree of IM due to the similarity of target and masker pitch contour depths and the uncertainty associated with incongruent dynamic pitch movement and semantic content of the stimuli. A multiple regression model predicting SRT from WMC indicates a significant effect (β̂ = −0.06, t(63) = −2.94, p = 0.0046). However, when group is included in the analysis, WMC is no longer a significant predictor of SRT (β̂ = −0.03, t(61) = −1.14, p = 0.2569). This suggests that the variance explained by WMC is primarily due to between-group variability in age or hearing loss.
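The pattern described here, a predictor that appears significant until group membership is controlled for, can be illustrated with synthetic data. The sketch below is purely illustrative (the values are not the study data; it simulates a case where group drives both SRT and WMC, with no within-group WMC effect) and shows the WMC slope shrinking once a group regressor is added:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: two groups whose SRTs differ, and a WMC score
# that also differs by group but has no within-group relationship to SRT.
n = 40
group = np.repeat([0.0, 1.0], n)                 # 0 = young, 1 = older
wmc = 40 - 10 * group + rng.normal(0, 3, 2 * n)  # older group: lower WMC
srt = -6 + 5 * group + rng.normal(0, 1, 2 * n)   # group drives SRT, not WMC

def ols_coefs(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_wmc_only = ols_coefs(wmc[:, None], srt)[1]                 # WMC alone
b_wmc_ctrl = ols_coefs(np.column_stack([wmc, group]), srt)[1]  # + group

# b_wmc_only is substantially negative (higher WMC, lower SRT), but the
# WMC slope shrinks toward zero once group membership is controlled for.
```

This is the same logic as the supplemental regression above: a spurious WMC effect can arise purely from between-group differences in age or hearing loss.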

V. CONCLUSIONS

(1) Older adults with SNHL are less adept at utilizing f0 contour depth difference cues between the target and masker speech to improve target speech recognition compared to young adults with normal hearing.

(2) Predicted differences in EM (including reduced stimulus audibility) can explain differences between groups when target/masker f0 contour depth is matched, but not when there is a mismatch between target and masker f0 contour depth. In particular, f0 contour depth mismatch caused very limited differences in EM, so the corresponding behavioral differences can be attributed to a release from IM for the young adults with normal hearing.

(3) Older adults' pure-tone thresholds were a better predictor of their performance than age. Older adults' WMC did not predict performance.

ACKNOWLEDGMENTS

This work was supported by the National Institutes of Health (Grant No. R03DC015074). M.L. is part of the LabEx CeLyA (ANR-10-LABX-0060/ANR-16-IDEX-0005) and supported by the Fondation pour l'Audition (Speech2Ears grant). We are thankful to Dr. Frederick (Erick) Gallun and Dr. Virginia (Gin) Best for their help and thoughtful comments throughout this project.

Footnotes

1. See supplementary material at https://www.scitation.org/doi/suppl/10.1121/10.0002661. Table I shows four-frequency PTAs, audiometric thresholds, age, and reading span scores for the older listener (o-hi-sh) group. Table II shows mean and SD threshold values for the nine target/masker conditions for the older adult (o-hi-sh) group. Table III shows mean and SD threshold values for the nine target/masker conditions for the two groups of young adults (y-nh: original/natural stimuli; y-nh-sh: spectrally shaped stimuli). Table IV shows parameter estimates from the regression model assessing the main effects of target speaking style (fl., norm., ex.), masker speaking style (fl., norm., ex.), and listener group (o-hi-sh, y-nh, y-nh-sh) and the interaction of these effects on SRTs. Table V shows reading span scores and SRT values for the exaggerated target/exaggerated masker condition for each listener in all three groups (o-hi-sh, y-nh, and y-nh-sh).

2. These three intrinsic predictors were also evaluated in a regression model including SRT results from all nine target/masker conditions. Specific conditions can be examined by including two-way interactions between target condition and each of the predictors, two-way interactions between masker condition and each of the predictors, and three-way interactions between target condition, masker condition, and each of the predictors. None of the interaction terms were statistically significant, and the results of this larger model were consistent with the more parsimonious and easier-to-interpret multiple regression analysis including only the data collected with the exaggerated target/exaggerated masker.

References

  • 1. Abada, S. H. , Baum, S. R. , and Titone, D. (2008). “ The effects of contextual strength on phonetic identification in younger and older listeners,” Exp. Aging Res. 34(3), 232–250. 10.1080/03610730802070183 [DOI] [PubMed] [Google Scholar]
  • 2. Agus, T. R. , Akeroyd, M. A. , Noble, W. , and Bhullar, N. (2009). “ An analysis of the masking of speech by competing speech using self-report data,” J. Acoust. Soc. Am. 125(1), 23–26. 10.1121/1.3025915 [DOI] [PubMed] [Google Scholar]
  • 3.ANSI (1997). S3.5, American National Standard Methods for Calculation of the Speech Intelligibility Index ( Acoustical Society of America, New York: ). [Google Scholar]
  • 4.ANSI (2009). S3.21 2004 (R2009), American National Standard Methods for Manual Pure-Tone Threshold Audiometry ( Acoustical Society of America, New York: ). [Google Scholar]
  • 5. Anzalone, M. , Calandruccio, L. , Doherty, K. , and Carney, L. (2006).“ Determination of the potential benefit of time-frequency gain manipulation,” Ear Hear. 27(5), 480–492. 10.1097/01.aud.0000233891.86809.df [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Arbogast, T. L. , Mason, C. R. , and Kidd, G., Jr. (2002). “ The effect of spatial separation on informational and energetic masking of speech,” J. Acoust. Soc. Am. 112, 2086–2098. 10.1121/1.1510141 [DOI] [PubMed] [Google Scholar]
  • 7. Arbogast, T. L. , Mason, C. R. , and Kidd, G., Jr. (2005). “ The effect of spatial separation on informational masking of speech in normal-hearing and hearing-impaired listeners,” J. Acoust. Soc. Am. 117, 2169–2180. 10.1121/1.1861598 [DOI] [PubMed] [Google Scholar]
  • 8. Arehart, K. , King, C. , and McLean-Mudgett, K. (1997). “ Role of fundamental frequency differences in the perceptual separation of competing vowel sounds by listeners with normal hearing and listeners with hearing loss,” J. Speech Lang. Hear. Res. 40(6), 1434–1444. 10.1044/jslhr.4006.1434 [DOI] [PubMed] [Google Scholar]
  • 9. Assmann, P. F. (1999). “ Fundamental frequency and the intelligibility of competing voices,” in Proceedings of the 14th International Congress of Phonetic Sciences, August 1–7, San Francisco, CA, pp. 179–182. [Google Scholar]
  • 10. Bench, J. , Kowal, Å. , and Bamford, J. (1979). “ The BKB (Bamford-Kowal-Bench) sentence lists for partially-hearing children,” Brit. J. Audiol. 13, 108–112. 10.3109/03005367909078884 [DOI] [PubMed] [Google Scholar]
  • 11. Benjamini, Y. , and Hochberg, Y. (1995). “ Controlling the false discovery rate: A practical and powerful approach to multiple testing,” J. R. Stat. Soc. Ser. B 57, 289–300. 10.1111/j.2517-6161.1995.tb02031.x [DOI] [Google Scholar]
  • 12. Best, V. , Mason, C. R. , Swaminathan, J. , Roverud, E. , and Kidd, G., Jr. (2017). “ Use of a glimpsing model to understand the performance of listeners with and without hearing loss in spatialized speech mixtures,” J. Acoust. Soc. Am. 141(1), 81–91. 10.1121/1.4973620 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Binns, C. , and Culling, J. F. (2007). “ The role of fundamental frequency contours in the perception of speech against interfering speech,” J. Acoust. Soc. Am. 122(3), 1765–1776. 10.1121/1.2751394 [DOI] [PubMed] [Google Scholar]
  • 14. Bird, J. , and Darwin, C. J. (1998). “ Effects of a difference in fundamental frequency in separating two sentences,” in Psychophysical and Physiological Advances in Hearing, edited by Palmer A. R., Rees A., Summerfield A. Q., and Meddis R. ( Whurr, London: ), pp. 263–269. [Google Scholar]
  • 15. Boersma, P. , and Weenink, D. (2017). “ Praat: Doing phonetics by computer [computer program],” http://www.praat.org/ (Last viewed 1/10/2017).
  • 16. Bologna, W. J. , Vaden, K. I., Jr. , Ahlstrom, J. B. , and Dubno, J. R. (2018). “ Age effects on perceptual organization of speech: Contributions of glimpsing, phonemic restoration, and speech segregation,” J. Acoust. Soc. Am. 144(1), 267. 10.1121/1.5044397 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound ( MIT Press, Cambridge, MA: ). [Google Scholar]
  • 18. Brokx, J. P. L. , and Nooteboom, S. G. (1982). “ Intonation and the perceptual separation of simultaneous voices,” J. Phon. 10, 23–36. 10.1016/S0095-4470(19)30909-X [DOI] [Google Scholar]
  • 19. Bronkhorst, A. , and Plomp, R. (1988). “ The effect of head-induced interaural time and level differences on speech intelligibility in noise,” J. Acoust. Soc. Am. 83, 1508–1516. 10.1121/1.395906 [DOI] [PubMed] [Google Scholar]
  • 20. Brungart, D. S. (2001). “ Informational and energetic masking effects in the perception of two simultaneous talkers,” J. Acoust. Soc. Am. 109(3), 1101–1109. 10.1121/1.1345696 [DOI] [PubMed] [Google Scholar]
  • 21. Brungart, D. S. , Chang, P. S. , Simpson, B. D. , and Wang, D. (2006). “ Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation,” J. Acoust. Soc. Am. 120(6), 4007–4018. 10.1121/1.2363929 [DOI] [PubMed] [Google Scholar]
  • 22. Brungart, D. S. , Simpson, B. D. , Ericson, M. A. , and Scott, K. R. (2001). “ Informational and energetic masking effects in the perception of multiple simultaneous talkers,” J. Acoust. Soc. Am. 110(5), 2527–2538. 10.1121/1.1408946 [DOI] [PubMed] [Google Scholar]
  • 23. Byrne, D. , and Dillon, H. (1986). “ The National Acoustic Laboratories' (NAL) new procedure for selecting the gain and frequency response of a hearing aid,” Ear Hear. 7(4), 257–265. 10.1097/00003446-198608000-00007 [DOI] [PubMed] [Google Scholar]
  • 24. Byrne, D. J. , Parkinson, A. , and Newall, P. (1990). “ Hearing aid gain and frequency response requirements for the severely/profoundly hearing impaired,” Ear Hear. 11(1), 40–49. 10.1097/00003446-199002000-00009 [DOI] [PubMed] [Google Scholar]
  • 25. Calandruccio, L. , Wasiuk, P. A. , Buss, E. , Leibold, L. J. , Kong, J. , Holmes, A. , and Oleson, J. (2019). “ The effect of target/masker fundamental frequency contour similarity on masked-speech recognition,” J. Acoust. Soc. Am. 146(2), 1065. 10.1121/1.5121314 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Calandruccio, L. , and Zhou, H. (2014). “ Increase in speech recognition due to linguistic mismatch between target and masker speech: Monolingual and simultaneous bilingual performance,” J. Speech Lang. Hear. Res. 57(3), 1089–1097. 10.1044/2013_JSLHR-H-12-0378 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Cherry, E. C. (1953). “ Some experiments on the recognition of speech, with one and with two ears,” J. Acoust. Soc. Am. 25, 975–979. 10.1121/1.1907229 [DOI] [Google Scholar]
  • 28. Clarke, J. , Gaudrain, E. , Chatterjee, M. , and Başkent, D. (2014). “ T'ain't the way you say it, it's what you say—Perceptual continuity of voice and top-down restoration of speech,” Hear. Res. 315, 80–87. 10.1016/j.heares.2014.07.002 [DOI] [PubMed] [Google Scholar]
  • 29. Collin, B. , and Lavandier, M. (2013). “ Binaural speech intelligibility in rooms with variations in spatial location of sources and modulation depth of noise interferers,” J. Acoust. Soc. Am. 134(2), 1146–1159. 10.1121/1.4812248 [DOI] [PubMed] [Google Scholar]
  • 30. Cooke, M. (2006). “ A glimpsing model of speech perception in noise,” J. Acoust. Soc. Am. 119(3), 1562–1573. 10.1121/1.2166600 [DOI] [PubMed] [Google Scholar]
  • 31. Corso, J. F. (1963). “ Aging and auditory thresholds in men and women,” Arch. Environ. Health 6, 350–356. 10.1080/00039896.1963.10663405 [DOI] [PubMed] [Google Scholar]
  • 32. Cubick, J. , Buchholz, J. M. , Best, V. , Lavandier, M. , and Dau, T. (2018). “ Listening through hearing aids affects spatial perception and speech intelligibility in normal-hearing listeners,” J. Acoust. Soc. Am. 144(5), 2896–2905. 10.1121/1.5078582 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Cutler, A. , Dahan, D. , and van Donselaar, W. (1997). “ Prosody in the comprehension of spoken language: A literature review,” Lang. Speech 40(2), 141–201. 10.1177/002383099704000203 [DOI] [PubMed] [Google Scholar]
  • 34. Daneman, M. , and Carpenter, P. (1980). “ Individual differences in working memory and reading,” J. Verbal Learn. Verbal Behav. 19, 450–466. 10.1016/S0022-5371(80)90312-6 [DOI] [Google Scholar]
  • 35. Darwin, C. J. , Brungart, D. S. , and Simpson, B. D. (2003). “ Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers,” J. Acoust. Soc. Am. 114(5), 2913–2922. 10.1121/1.1616924 [DOI] [PubMed] [Google Scholar]
  • 36. David, M. , Lavandier, M. , Grimault, N. , and Oxenham, A. (2017). “ Sequential stream segregation of voiced and unvoiced speech sounds based on fundamental frequency,” Hear. Res. 344, 235–243. 10.1016/j.heares.2016.11.016 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. de Cheveigné, A. , McAdams, S. , Laroche, J. , and Rosenberg, M. (1995). “ Identification of concurrent harmonic and inharmonic vowels: A test of the theory of harmonic cancellation and enhancement,” J. Acoust. Soc. Am. 97, 3736–3748. 10.1121/1.412389 [DOI] [PubMed] [Google Scholar]
  • 38. Deroche, M. L. D. , and Culling, J. F. (2011). “ Voice segregation by difference in fundamental frequency: Evidence for harmonic cancellation,” J. Acoust. Soc. Am. 130, 2855–2865. 10.1121/1.3643812 [DOI] [PubMed] [Google Scholar]
  • 39. Deroche, M. , Culling, J. , Chatterjee, M. , and Limb, C. (2014). “ Speech recognition against harmonic and inharmonic complexes: Spectral dips and periodicity,” J. Acoust. Soc. Am. 135, 2873–2884. 10.1121/1.4870056 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Dillon, H. (2012). Hearing Aids ( Boomerang Press, Turramurra, Australia: ), pp. 286–335. [Google Scholar]
  • 41. Dryden, A. , Allen, H. A. , Henshaw, H. , and Heinrich, A. (2017). “ The association between cognitive performance and speech-in-noise perception for adult listeners: A systematic literature review and meta-analysis,” Trends Hear. 21, 1–21. 10.1177/2331216517744675 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Durlach, N. I. , Mason, C. R. , Kidd, G., Jr. , Arbogast, T. , Colburn, H. S. , and Shinn-Cunningham, B. G. (2003). “ Note on informational masking (L),” J. Acoust. Soc. Am. 113(6), 2984–2987. 10.1121/1.1570435 [DOI] [PubMed] [Google Scholar]
  • 43. Edwards, B. (2016). “ A model of auditory-cognitive processing and relevance to clinical applicability,” Ear Hear. 37(S1), 85S–91S. 10.1097/AUD.0000000000000308 [DOI] [PubMed] [Google Scholar]
  • 44. Festen, J. M. , and Plomp, R. (1990). “ Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing,” J. Acoust. Soc. Am. 88, 1725–1736. 10.1121/1.400247 [DOI] [PubMed] [Google Scholar]
  • 45. Folstein, M. F. , Folstein, S. E. , and McHugh, P. R. (1975).“  ‘Mini-mental state.’ A practical method for grading the cognitive state of patients for the clinician,” J. Psychiatr. Res. 12(3), 189–198. 10.1016/0022-3956(75)90026-6 [DOI] [PubMed] [Google Scholar]
  • 46. Freyman, R. L. , Balakrishnan, U. , and Helfer, K. S. (2004). “ Effect of number of masking talkers and auditory priming on informational masking in speech recognition,” J. Acoust. Soc. Am. 115(5), 2246–2256. 10.1121/1.1689343 [DOI] [PubMed] [Google Scholar]
  • 47. Freyman, R. L. , Helfer, K. S. , McCall, D. D. , and Clifton, R. K. (1999). “ The role of perceived spatial separation in the unmasking of speech,” J. Acoust. Soc. Am. 106, 3578–3588. 10.1121/1.428211 [DOI] [PubMed] [Google Scholar]
  • 48. Füllgrabe, C. , Moore, B. C. J. , and Stone, M. A. (2015). “ Age-group differences in speech identification despite matched audiometrically normal hearing: Contributions from auditory temporal processing and cognition,” Front. Aging Neurosci. 6, 347. 10.3389/fnagi.2014.00347 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Füllgrabe, C. , and Rosen, S. (2016). “ On the (un)importance of working memory in speech-in-noise processing for listeners with normal hearing thresholds,” Front. Psychol. 7, 1268. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Gallun, F. J. , Diedesch, A. C. , Kampel, S. D. , and Jakien, K. M. (2013). “ Independent impacts of age and hearing loss on spatial release in a complex auditory environment,” Front. Neurosci. 7, 252. 10.3389/fnins.2013.00252 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Glyde, H. , Buchholz, J. M. , Nielsen, L. , Best, V. , Dillon, H. , Cameron, S. , and Hickson, L. (2015). “ Effect of audibility on spatial release from speech-on-speech masking,” J. Acoust. Soc. Am. 138, 3311–3319. 10.1121/1.4934732 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Gordon-Salant, S. , and Cole, S. S. (2016). “ Effects of age and working memory capacity on speech recognition performance in noise among listeners with normal hearing,” Ear Hear. 37(5), 593–602. 10.1097/AUD.0000000000000316 [DOI] [PubMed] [Google Scholar]
  • 53. Grant, K. W. (1987). “ Identification of intonation contours by normally hearing and profoundly hearing-impaired listeners,” J. Acoust. Soc. Am. 82(4), 1172–1178. 10.1121/1.395253 [DOI] [PubMed] [Google Scholar]
  • 54. Helfer, K. S. , and Freyman, R. L. (2008). “ Aging and speech-on-speech masking,” Ear Hear. 29(1), 87–98. 10.1097/AUD.0b013e31815d638b [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Helfer, K. S. , Poissant, S. F. , and Merchant, G. R. (2020). “ Word identification with temporally interleaved competing sounds by younger and older adult listeners,” Ear Hear. 41(3), 603–614. 10.1097/AUD.0000000000000786 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Hillenbrand, J. M. (2003). “ Some effects of intonation contour on sentence intelligibility,” J. Acoust. Soc. Am. 114(4), 2338. 10.1121/1.4781079 [DOI] [Google Scholar]
  • 57. Hodgson, M. , Steininger, G. , and Razavi, Z. (2007). “ Measurement and prediction of speech and noise levels and the Lombard effect in eating establishments,” J. Acoust. Soc. Am. 121(4), 2023–2033. 10.1121/1.2535571 [DOI] [PubMed] [Google Scholar]
  • 58. Humes, L. E. , and Coughlin, M. (2009). “ Aided speech-identification performance in single-talker competition by older adults with impaired hearing,” Scand. J. Psychol. 50(5), 485–494. 10.1111/j.1467-9450.2009.00740.x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.ISO (2000). ISO-7029, Acoustics-Statistical distribution of hearing thresholds as a function of Age ( ISO, Geneva, Switzerland: ). [Google Scholar]
  • 60. Iyer, N. , Brungart, D. S. , and Simpson, B. D. (2010). “ Effects of target-masker contextual similarity on the multimasker penalty in a three-talker diotic listening task,” J. Acoust. Soc. Am. 128(5), 2998–3010. 10.1121/1.3479547 [DOI] [PubMed] [Google Scholar]
  • 61. Jahncke, H. , Bjorkeholm, P. , Marsh, J. E. , Odelius, J. , and Sorqvist, P. (2016). “ Office noise: Can headphones and masking sound attenuate distraction by background speech,” Work 55(3), 505–513. 10.3233/WOR-162421 [DOI] [PubMed] [Google Scholar]
  • 62. Jakien, K. M. , Kampel, S. D. , Gordon, S. Y. , and Gallun, F. J. (2017). “ The benefits of increased sensation level and bandwidth for spatial release from masking,” Ear Hear. 38, e13–e21. 10.1097/AUD.0000000000000352 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Jesse, A. , and Helfer, K. S. (2019). “ Lexical influences on errors in masked speech perception in younger, middle-aged, and older adults,” J. Speech Lang. Hear. Res. 62(4S), 1152–1166. 10.1044/2018_JSLHR-H-ASCC7-18-0091 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Kidd, G., Jr. , and Colburn, H. S. (2017). “ Informational masking in speech recognition,” in The Auditory System at the Cocktail Party ( Springer, New York: ), pp. 75–109. [Google Scholar]
  • 65. Kidd, G., Jr. , Mason, C. R. , Best, V. , Roverud, E. , Swaminathan, J. , Jennings, T. , Clayton, K. , and Colburn, H. S. (2019). “ Determining the energetic and informational components of speech-on-speech masking in listeners with sensorineural hearing loss,” J. Acoust. Soc. Am. 145(1), 440. 10.1121/1.5087555 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Kidd, G., Jr. , Mason, C. R. , Richards, V. M. , Gallun, F. J. , and Durlach, N. I. (2007). “ Informational Masking,” in The Auditory System at the Cocktail Party (Springer, New York: ), pp. 143–189. [Google Scholar]
  • 67. Klaus, J. , and Schriefers, H. (2016). “ Measuring working memory capacity: A reading span task for laboratory and web-based use,” 10.31219/osf.io/nj48x (Last viewed 22 April 2019). [DOI]
  • 68. Koffka, K. (1935). Principles of Gestalt Psychology ( Harcourt, Brace, and Company, New York: ), LCCN 35007711. [Google Scholar]
  • 69. Laures, J. S. , and Weismer, G. (1999). “ The effects of a flattened fundamental frequency on intelligibility at the sentence level,” J. Speech Lang. Hear. Res. 42, 1148–1156. 10.1044/jslhr.4205.1148 [DOI] [PubMed] [Google Scholar]
  • 70. Lavandier, M. , Buchholz, J. M. , and Rana, B. (2018). “ A binaural model predicting speech intelligibility in the presence of stationary noise and noise-vocoded speech interferers for normal-hearing and hearing-impaired listeners,” Acta Acust. united Ac. 104(5), 909–913. 10.3813/AAA.919243 [DOI] [Google Scholar]
  • 71. Lavandier, M. , and Culling, J. F. (2008). “ Speech segregation in rooms: Monaural, binaural, and interacting effects of reverberation on target and interferer,” J. Acoust. Soc. Am. 123, 2237–2248. 10.1121/1.2871943 [DOI] [PubMed] [Google Scholar]
  • 72. Leclère, T. , Lavandier, M. , and Deroche M. L. D. (2017). “ The intelligibility of speech in a harmonic masker varying in fundamental frequency contour, broadband temporal envelope, and spatial location,” Hear. Res. 350, 1–10. 10.1016/j.heares.2017.03.012 [DOI] [PubMed] [Google Scholar]
  • 73. Lee, J. H. , and Humes, L. E. (2012). “ Effect of fundamental-frequency and sentence-onset differences on speech-identification performance of young and older adults in a competing-talker background,” J. Acoust. Soc. Am. 132(3), 1700–1717. 10.1121/1.4740482 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Liberman, M. , Davis, K. , Grossman, M. , Martey, N. , and Bell, J. (2002). “ Emotional prosody speech and transcripts. LDC2002S28. Web Download,” Linguistic Data Consortium (Philadelphia, PA).
  • 75. Mackersie, C. L. (2003). “ Talker separation and sequential stream segregation in listeners with hearing loss: Patterns associated with talker gender,” J. Speech Lang. Hear. Res. 46(4), 912–918. 10.1044/1092-4388(2003/071) [DOI] [PubMed] [Google Scholar]
  • 76. Mackersie, C. L. , Dewey, J. , and Guthrie, L. A. (2011). “ Effects of fundamental frequency and vocal-tract length cues on sentence segregation by listeners with hearing loss,” J. Acoust. Soc. Am. 130(2), 1006–1019. 10.1121/1.3605548 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Mattys, S. L. , Davis, M. H. , Bradlow, A. R. , and Scott, S. K. (2012). “ Speech recognition in adverse conditions: A review,” Lang. Cogn. Process. 27(7–8), 953–978. 10.1080/01690965.2012.705006 [DOI] [Google Scholar]
  • 78. Miller, S. E. , Schlauch, R. S. , and Watson, P. J. (2010). “ The effects of fundamental frequency contour manipulations on speech intelligibility in background noise,” J. Acoust. Soc. Am. 128(1), 435–443. 10.1121/1.3397384 [DOI] [PubMed] [Google Scholar]
  • 79. Misurelli, S. M. , and Litovsky, R. Y. (2015). “ Spatial release from masking in children with bilateral cochlear implants and with normal hearing: Effect of target-interferer similarity,” J. Acoust. Soc. Am. 138(1), 319–331. 10.1121/1.4922777 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80. Moore B. C. (1985). “ Frequency selectivity and temporal resolution in normal and hearing-impaired listeners,” Brit. J. Audiol. 19(3), 189–201. 10.3109/03005368509078973 [DOI] [PubMed] [Google Scholar]
  • 81. Moore, B. C. J. , and Glasberg, B. R. (1983). “ Suggested formulae for calculating auditory-filter bandwidths and excitation patterns,” J. Acoust. Soc. Am. 74(3), 750–753. 10.1121/1.389861 [DOI] [PubMed] [Google Scholar]
  • 82. Oleson, J. J. , Brown, G. D. , and McCreery, R. (2019). “ The evolution of statistical methods in speech, language, and hearing sciences,” J. Speech Lang. Hear. Res. 62, 498–506. 10.1044/2018_JSLHR-H-ASTM-18-0378 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83. Oxenham, A. J. (2018). “ How we hear: The perception and neural coding of sound,” Ann. Rev. Psychol. 69, 27–50. 10.1146/annurev-psych-122216-011635 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84. Oxenham, A. J. , and Bacon, S. P. (2003). “ Cochlear compression: Perceptual measures and implications for normal and impaired hearing,” Ear Hear. 24(5), 352–366. 10.1097/01.AUD.0000090470.73934.78 [DOI] [PubMed] [Google Scholar]
  • 85. Patterson, R. D. , Nimmo-Smith, I. , Holdsworth, J. , and Rice, P. (1987). “ An efficient auditory filterbank based on the gammatone function,” presented to the Institute of Acoustics speech group on auditory modeling at the Royal Signal Research Establishment.
  • 86. Paulmann, S. , and Pell, M. D. (2010). “ Contextual influences of emotional speech prosody on face processing: How much is enough?,” Cogn. Affect. Behav. Neurosci. 10(2), 230–242. 10.3758/CABN.10.2.230 [DOI] [PubMed] [Google Scholar]
  • 87. Pichora-Fuller, M. K. , Kramer, S. E. , Eckert, M. A. , Edwards, B. , Hornsby, B. , Humes, L. E. , Lemke, U. , Lunner, T. , Matthen, M. , Mackersie, C. L. , Naylor, G. , Phillips, N. A. , Richter, M. , Rudner, M. , Sommers, M. S. , Tremblay, K. L. , and Wingfield, A. (2016). “ Hearing impairment and cognitive energy: The framework for understanding effortful listening (FUEL),” Ear Hear. 37(S1), 5S–27S. 10.1097/AUD.0000000000000312 [DOI] [PubMed] [Google Scholar]
  • 88. Prud'homme, L. , Lavandier, M. , and Best, V. (2020). “ A harmonic-cancellation-based model to predict speech intelligibility against a harmonic masker,” J. Acoust. Soc. Am. 148, 3246–3254. 10.1121/10.0002492 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89. Raphael, L. J. , Borden, G. J. , and Harris, K. S. (2007). Speech Science Primer: Physiology, Acoustics, and Perception of Speech, 5th ed ( Lippincott Williams & Wilkins, Baltimore, MD). [Google Scholar]
  • 90. Rhebergen, K. S. , and Versfeld, N. J. (2005). “ A speech intelligibility index based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners,” J. Acoust. Soc. Am. 117, 2181–2192. 10.1121/1.1861713 [DOI] [PubMed] [Google Scholar]
  • 91. Rhebergen, K. S. , Versfeld, N. J. , and Dreschler, W. A. (2005). “ Release from informational masking by time reversal of native and non-native interfering speech,” J. Acoust. Soc. Am. 118(3 Pt 1), 1274–1277. 10.1121/1.2000751 [DOI] [PubMed] [Google Scholar]
  • 92. Rönnberg, J. , Holmer, E. , and Rudner, M. (2019). “ Cognitive hearing science and ease of language understanding,” Int. J. Audiol. 58(5), 247–261. 10.1080/14992027.2018.1551631 [DOI] [PubMed] [Google Scholar]
  • 93. Rönnberg, J. , Lunner, T. , Zekveld, A. , Sorqvist, P. , Danielsson, H. , Lyxell, B. , Dahlstrom, O. , Signoret, C. , Stenfelt, S. , Pichora-Fuller, M. K. , and Rudner, M. (2013). “ The ease of language understanding (ELU) model: Theoretical, empirical, and clinical advances,” Front. Syst. Neurosci. 7, 31. 10.3389/fnsys.2013.00031 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94. Rosen, S. , Souza, P. , Ekelund, C. , and Majeed, A. (2013). “ Listening to speech in a background of other talkers: Effects of talker number and noise vocoding,” J. Acoust. Soc. Am. 133(4), 2431–2443. 10.1121/1.4794379 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95. Russo, F. , and Pichora-Fuller, M. K. (2008). “ Tune in or tune out: Age-related differences in listening when speech is in the foreground and music is in the background,” Ear Hear. 29, 746–760. 10.1097/AUD.0b013e31817bdd1f [DOI] [PubMed] [Google Scholar]
  • 96. Shen, J. , and Souza, P. E. (2017). “ Do older listeners with hearing loss benefit from dynamic pitch for speech recognition in noise,” Am. J. Audiol. 26(3S), 462–466. 10.1044/2017_AJA-16-0137 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97. Sommers, M. S. , and Danielson, S. M. (1999). “ Inhibitory processes and spoken word recognition in young and older adults: The interaction of lexical competition and semantic context,” Psychol. Aging 14(3), 458–472. 10.1037/0882-7974.14.3.458 [DOI] [PubMed] [Google Scholar]
  • 98. Vicente, T. , and Lavandier, M. (2020). “ Further validation of a binaural model predicting speech intelligibility against envelope-modulated noises,” Hear. Res. 390, 107937. 10.1016/j.heares.2020.107937 [DOI] [PubMed] [Google Scholar]
  • 99. Vicente, T. , Lavandier, M. , and Buchholz, J. M. (2020). “ A binaural model implementing an internal noise to predict the effect of hearing impairment on speech intelligibility in non-stationary noises,” J. Acoust. Soc. Am. 148, 3305–3317. 10.1121/10.0002660 [DOI] [PubMed] [Google Scholar]
  • 100. Wang, D. (2005). “ On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines, edited by Divenyi P. ( Kluwer Academic, Dordrecht, the Netherlands: ), pp. 181–197. [Google Scholar]
  • 101. Warren, R. M. (1970). “ Perceptual restoration of missing speech sounds,” Science 167, 392–393. 10.1126/science.167.3917.392 [DOI] [PubMed] [Google Scholar]
  • 102. Watson, C. S. (2005). “ Some comments on informational masking,” Acta Acust. united Ac. 91, 502–512. [Google Scholar]
  • 103. Wingfield, A. , McCoy, S. L. , Peelle, J. E. , Tun, P. A. , and Cox, L. C. (2006). “ Effects of adult aging and hearing loss on comprehension of rapid speech varying in syntactic complexity,” J. Am. Acad. Audiol. 17(7), 487–497. 10.3766/jaaa.17.7.4 [DOI] [PubMed] [Google Scholar]
  • 104. Yost, W. A. (2013). Fundamentals of Hearing: An Introduction. 5th ed ( Brill Academic Publishers, Inc, Leiden, the Netherlands: ). [Google Scholar]

Articles from The Journal of the Acoustical Society of America are provided here courtesy of Acoustical Society of America