The Journal of the Acoustical Society of America
2019 Aug 12;146(2):1065–1076. doi: 10.1121/1.5121314

The effect of target/masker fundamental frequency contour similarity on masked-speech recognition

Lauren Calandruccio,1,a) Peter A. Wasiuk,1 Emily Buss,2 Lori J. Leibold,3 Jessica Kong,1 Ann Holmes,1 Jacob Oleson4
PMCID: PMC6690832  PMID: 31472562

Abstract

Greater informational masking is observed when the target and masker speech are more perceptually similar. Fundamental frequency (f0) contour, or the dynamic movement of f0, is thought to provide cues for segregating target speech presented in a speech masker. Most of the data demonstrating this effect have been collected using digitally modified stimuli. Less work has been done exploring the role of f0 contour for speech-in-speech recognition when all of the stimuli have been produced naturally. The goal of this project was to explore the importance of target and masker f0 contour similarity by manipulating the speaking style of talkers producing the target and masker speech streams. Sentence recognition thresholds were evaluated for target and masker speech that was produced with either flat, normal, or exaggerated speaking styles; performance was also measured in speech spectrum shaped noise and for conditions in which the stimuli were processed through an ideal-binary mask. Results confirmed that similarities in f0 contour depth elevated speech-in-speech recognition thresholds; however, when the target and masker had similar contour depths, targets with normal f0 contours were more resistant to masking than targets with flat or exaggerated contours. Differences in energetic masking across stimuli cannot account for these results.

I. INTRODUCTION

Listeners can use differences in voice fundamental frequency (f0) to segregate target from masker speech, a finding that has been extensively documented in the literature (Assmann, 1999; Bird and Darwin, 1998; Brokx and Nooteboom, 1982; Darwin et al., 2003). Mean differences in f0 are believed to decrease the similarity between the target and masker speech, reducing informational masking (Brungart, 2001; Brungart et al., 2001). For young, normal-hearing listeners, dynamic changes in f0 also support effective speech recognition in noisy backgrounds (Binns and Culling, 2007; Laures and Weismer, 1999; Miller et al., 2010). Dynamic changes in f0 (or the fluctuations in f0 over time) provide essential suprasegmental cues that help distinguish word meaning and word boundaries, and convey information about talker intent (for a review see Cutler et al., 1997). What is less clear, however, is how dynamic changes in f0 aid the listener in segregating the target from the masker speech. The purpose of the present study was to explore how similarity between dynamic changes in f0 contour of target and masker speech affects masked-speech recognition.

A. The importance of f0 contours for masked-speech recognition

Dynamic f0 contours have been shown to help listeners recognize target speech in competing noise (Binns and Culling, 2007; Hillenbrand, 2003; Laures and Weismer, 1999; Miller et al., 2010). This observation has often been made by taking naturally produced speech, which has natural f0 variation or a “normal” pitch contour, and manipulating this speech using digital-signal processing. Such signal-processing strategies are used to generate stimuli with reduced (e.g., Binns and Culling, 2007), exaggerated (e.g., Miller et al., 2010), flattened (e.g., Assmann, 1999; Laures and Weismer, 1999), or inverted (e.g., Hillenbrand, 2003) contours. Listeners' masked-speech recognition is then compared across the different f0-manipulated stimulus sets. In many cases, flattening the f0 contour (removing dynamic changes in f0) has resulted in decreased masked-speech recognition (e.g., Laures and Weismer, 1999; Binns and Culling, 2007; Miller et al., 2010). However, there are also data consistent with the conclusion that masked-speech recognition is stable regardless of f0 contour type (normal, reduced, or flattened; see Assmann, 1999; Chatterjee et al., 2010; Clarke et al., 2017; Darwin and Hukin, 2000).

Differences in mean f0 between the target and masker speech are one of the strongest and most studied monaural acoustic cues that help listeners segregate two competing voices (e.g., Broadbent and Ladefoged, 1957; Brokx and Nooteboom, 1982; Darwin and Hukin, 2000) and have been shown to be a beneficial segregation cue for young children (Leibold et al., 2018; Flaherty et al., 2018), older adults (Helfer and Freyman, 2008), and listeners with hearing loss (Arehart et al., 1997; Mackersie et al., 2011). The beneficial effect of increasing differences in mean f0 between the target and masker speech is largest when the mean baseline target and masker f0s are similar to each other (Darwin et al., 2003).

B. Target/masker similarity and f0 contours: Implications for masked-speech recognition

Darwin and colleagues (2003) used the Coordinate Response Measure speech corpus (Bolia et al., 2000) to explore the impact of changes in mean f0 on masked-speech recognition. In their experiment, the authors utilized all available talkers within the corpus, which included 4 females and 4 males. The Pitch-Synchronous Overlap-Add (PSOLA) algorithm (Charpentier and Stella, 1986; Moulines and Charpentier, 1990) was employed to shift the mean f0 to one of six semitone differences (including lower and higher frequency shifts), and target speech recognition was measured for each of the eight target talkers with their own voice (pitch shifted) in the background. Overall results (averaged across the eight talkers) indicated that the poorest recognition occurred when there was no difference between the target and masker f0. As f0 separations between the target and masker speech increased (up to 12 semitones), speech recognition performance improved. However, an analysis examining masked-speech recognition for each talker individually indicated significant variability in results across the eight talkers. Most notable were the results observed for Talker 5 (see Fig. 2, p. 2915 and Fig. 4, p. 2916). Talker 5 had a relatively flat f0 contour in the majority of his speech productions, with the exception of sentences spoken with the call sign “Baron,” which served as the target. Data for Talker 5 indicated extremely high performance scores (over 80%) across even the hardest signal-to-noise ratios (SNRs) tested, and minimal differences in performance between the 0-semitone and 12-semitone pitch-shifted conditions. This led the authors to examine the improvement observed with a 12-semitone difference in f0 in relation to the original difference in f0 contour between the target and masker speech.
They observed a significantly smaller benefit in performance with changes in mean f0 when f0 contours were less similar in the original target/masker pairs, suggesting that (1) differences between contour patterns were being used to segregate the two voices and (2) the benefit of further changes in mean f0 was not additive.

FIG. 2.

The f0 pitch contours for speech stimuli. The left most column shows the pitch contour from the same sentence spoken by Talker A in all three speaking styles. The middle column shows the pitch contour for a 5-s sample from masker Talker B. The right column shows the pitch contour after the voices of masker Talkers B and C were summed. The three voices (Talkers A, B, and C) produced speech with each of three f0 contours: flat (top panel), normal (middle panel), and exaggerated (bottom panel).

FIG. 4.

(Color online) Speech-in-speech sentence-recognition thresholds (dB SNR) for three different target speaking styles: flat (fl.), normal (norm.), and exaggerated (ex.) in three different two-talker maskers (flat [leftmost column], normal [middle column], and exaggerated [rightmost column]). Plot layout follows the same configuration as Fig. 3. Black diamonds are shown with individual data points to indicate outliers.

The importance of f0 contour similarity for speech-in-speech recognition is in line with our current understanding of the factors determining informational masking (Durlach et al., 2003; Kidd et al., 2008; Kidd and Colburn, 2017; Watson, 1987). Both target uncertainty and target/masker similarity tend to increase informational masking, or difficulty attending to and/or segregating the target speech from competing sounds. Brokx and Nooteboom (1982) provide an additional demonstration that similarity of f0 contour depth between the target and masker speech can impact masked-speech recognition. In their experiment, all changes in f0 contour and mean f0 were generated using the same male talker. This talker first produced sentences using a natural speaking style, with normal prosodic variation. He then repeated this process, but increased his voice pitch to approximate that of a female voice exemplar (which resulted in a mean f0 shift of ∼50 Hz, almost 7 semitones). Additionally, he recorded these materials with a flat pitch contour using either his natural or the high mean f0 production (shift of ∼110 Hz, or 12 semitones). This methodology allowed the researchers to explore changes in mean f0 and f0 contour depth without utilizing signal-processed speech. For the sentence-recognition task, the same male talker's voice was used as the competing speech, which included a continuous stream comprising a short story spoken with natural intonation. For the normal f0 contour target conditions, large improvements in recognition were observed for the pitch-shifted productions (+7 semitones); however, for the monotone targets, minimal differences were observed for the pitch shifted conditions (+12 semitones). The authors speculated that this result may have occurred due to listeners having an easier time tracking the flat contour in the presence of a fluctuating masker, regardless of the difference in mean f0 between the target and the masker streams. 
This result, which agrees with Darwin et al. (2003), suggests that decreasing similarity in f0 contour depth between the target and masker speech will in turn decrease informational masking, thus improving target speech recognition.

The purpose of this project was to systematically test whether similarity in f0 contour depth between the target and masker speech increases informational masking. We utilized only naturally produced speech recordings for this experiment; that is, no signal processing was used to manipulate f0 contours. Furthermore, we used a two-talker masker instead of one competing voice. The rationale for this approach is that two competing talkers tend to cause higher levels of informational masking for masked-sentence recognition than a single competing voice (Freyman et al., 2004; Iyer et al., 2010; Rosen et al., 2013). Significant amounts of informational masking would allow for large releases from masking when similarity between the f0 contour depths of the target and masker speech decreased. In addition, ideal time-frequency segregation (Anzalone et al., 2006; Brungart et al., 2006; Wang, 2005) was used to estimate contributions of energetic masking to performance in the different target/masker conditions. We hypothesized, similar to the results observed in Darwin et al. (2003) and Brokx and Nooteboom (1982), that recognition would be poor when f0 contours were most similar between the target and masker speech, due to difficulty segregating the target from the masker, and that performance would systematically increase as f0 contour depth differences increased between the target and masker speech.

II. METHODS

A. Participants

All participants were recruited from the Case Western Reserve University (CWRU) community and the local Cleveland, Ohio area. Participants were recruited through word-of-mouth and the Department of Psychological Sciences research subject pool. All recruitment and testing methods were approved by the CWRU Institutional Review Board (IRB). Participants recruited through word-of-mouth were compensated $15/h, while participants recruited through the research subject pool were given course credit for participation. All participants signed an informed consent document approved by the CWRU IRB. Participants first completed a demographic and linguistic questionnaire to confirm that they were native speakers of English (American dialect). Prior to experimental testing, all participants had their hearing tested using standard clinical audiometric test procedures to obtain hearing thresholds bilaterally at all octave frequencies between 250 and 8000 Hz (ANSI, 2009). For data to be included within analyses, participants needed to have thresholds ≤25 dB hearing level bilaterally at all test frequencies. Overall, a total of 82 participants were recruited for this study. Four participants did not meet the hearing threshold requirements during audiometric testing, and one participant did not meet the language requirements, as he was a non-native speaker of English. Two subjects were excluded because they displayed abnormally elevated speech recognition thresholds during experimental testing; in both cases, participants provided multiple incorrect responses at 25 dB SNR. The final dataset contained a total of 75 participants, of which 44 were female and 31 were male. Participants ranged in age from 18 to 22 years (M = 20, SD = 1).

B. Stimulus development

All stimuli used throughout testing were recorded by three 21-year-old native-English speaking females. The three talkers were trained actors with no noticeable regional accent (referred to as Talkers A, B, and C). Sentences used for the target and masker speech were from the Bamford-Kowal-Bench (BKB) speech corpus (Bench et al., 1979). This corpus, which was developed using spoken utterances by children who were deaf and/or hard-of-hearing, uses elementary English syntax and vocabulary. Each of the 336 sentences within the corpus consists of between three and seven words, with three to four keywords being used for scoring. An example sentence from this corpus is the following: The boy has black hair (with the scoreable keywords underlined in the original).

All sentences were recorded using three different speaking styles. Specifically, during the first recording session, actors were instructed to say all 336 sentences using a natural speaking style. On a different day, the actors were instructed to maintain a monotone voice pitch, as if they were sad. On the third day of recordings, the actors were instructed to speak with wide variations in voice pitch, as if they were happy and excited. Examples of the flat and exaggerated speaking styles were provided to the actors prior to recording. All recording sessions were completed within a three-week period.

Stimuli were recorded in a double-walled sound-isolated booth. Actors stood six inches in front of a Shure KSM-42 omnidirectional cardioid condenser microphone. A pop-filter, used to reduce popping sounds in speech recordings and prevent saliva from getting on the microphone, was placed between the talker and the diaphragm of the microphone. The microphone was connected to an M-Audio M-Track 2 × 2 converter. All recorded sentences were edited into individual WAV files with minimal silence at the beginning and end of the sentences. The sentences were then root-mean-square (RMS) equalized to the same pressure level using praat software (Boersma and Weenink, 2017).
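The RMS-equalization step described above can be sketched as follows. This is a minimal Python sketch of what praat's level scaling accomplishes; the function name and reference level are placeholders, not values from the study.

```python
import numpy as np

def rms_equalize(x, target_rms=0.05):
    # scale a recording so its root-mean-square (RMS) level matches a
    # common reference; target_rms here is an arbitrary placeholder value
    rms = np.sqrt(np.mean(np.square(x)))
    return x * (target_rms / rms)
```

Applying this to every edited WAV file gives all sentences the same nominal pressure level before they are mixed into masker streams.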

Talker A was chosen to be the target talker throughout all testing, while Talkers B and C were used to create three different two-talker maskers, each of which varied in speaking style (flat, normal, and exaggerated). For all two-talker masker streams, 50 sentences were first concatenated for talkers B and C into two separate WAV files (each lasting approximately 80 s). The same 50 sentences were used for both talkers in all three speaking styles. However, sentence order varied to ensure that the two talkers were never speaking the same sentence at the same time. After the audio files were created for each talker and for each speaking style, streams of the same speaking style were combined to produce three unique two-talker masker conditions (normal, flat, and exaggerated).
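The masker construction above can be sketched as below. All names are hypothetical; in the study the sentence orders were chosen so that the two talkers never spoke the same sentence at the same time, which a simple shuffle does not guarantee.

```python
import numpy as np

def two_talker_masker(sents_b, sents_c, order_b, order_c):
    # concatenate each talker's sentences in a specified order, trim the
    # two streams to equal length, and sum them into one two-talker masker
    stream_b = np.concatenate([sents_b[i] for i in order_b])
    stream_c = np.concatenate([sents_c[i] for i in order_c])
    n = min(len(stream_b), len(stream_c))
    return stream_b[:n] + stream_c[:n]
```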

Average f0, maximum f0, minimum f0, and standard deviation of f0 values for each talker, aggregated across sentences for each speaking style, are shown in Table I. For all target types, the f0 contour fluctuated over time as these were naturally spoken sentences, as opposed to digitally signal processed recordings. However, sentences recorded using a flat speaking style fluctuated the least (an average standard deviation of 0.84 semitones; corresponding to an average of about 12 Hz), while recordings using an exaggerated speaking style fluctuated the most (an average standard deviation of 3.32 semitones; corresponding to an average of about 56 Hz). Figure 1 illustrates the distributions (shown in percent) of the mean f0 contour for 200 ms windows of the target and aggregate masker sentences spoken using all three speech production styles. As can be seen in Fig. 1, the width of the distribution of mean f0 increases from flat, to normal, to exaggerated speaking styles for both target and masker sentences. Figure 2 illustrates the pitch contour from an example target sentence spoken using all three target styles (leftmost column; spoken by Talker A), a 5-s sample of the pitch contour from one of the two streams of speech within the two-talker masker (middle column; spoken by Talker B) and a 5-s sample of the pitch contour from the three different two-talker masker files (rightmost column; the speech spoken by Talkers B and C after the two talkers voices were summed). These contours were extracted using praat. This figure illustrates that even after combining both talkers' voices, the pitch contours of each speaking style remain similar to those observed within the single-talker files.
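The semitone/Hz correspondences quoted above follow from the standard logarithmic definition of the semitone scale (12 semitones per octave); the function names below are ours.

```python
import math

def semitone_diff(f1_hz, f2_hz):
    # signed difference between two frequencies in semitones (12 per octave)
    return 12.0 * math.log2(f2_hz / f1_hz)

def semitones_to_hz_span(f0_hz, semitones):
    # approximate Hz excursion for a semitone deviation above a base f0
    return f0_hz * (2.0 ** (semitones / 12.0) - 1.0)
```

For example, a 0.84-semitone deviation around a ~230-Hz voice corresponds to roughly 12 Hz, and a 3.32-semitone deviation around ~265 Hz to roughly 56 Hz, matching the values reported above.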

TABLE I.

Mean fundamental frequency (f0; Hz), mean maximum and minimum frequency (Hz), and mean standard deviation of f0 (Hz) for sentences produced by Talker A (target) and Talkers B and C (maskers), measured across all sentences.

Speaking Style    Mean f0 (Hz)      Mean Max f0 (Hz)    Mean Min f0 (Hz)    Mean f0 SD (Hz)
                  A     B     C     A     B     C       A     B     C       A    B    C
Flat              231   244   233   283   315   301     208   203   191     12   17   17
Normal            231   237   226   316   348   317     130   178   148     42   35   35
Exaggerated       264   260   276   401   386   411     164   179   178     56   52   56

(A = target Talker A; B and C = masker Talkers B and C.)

FIG. 1.

Histogram count (percent) of the mean f0 (within 200-ms windows) for flat, normal, and exaggerated target (left column) and masker (right column) speech productions. Masker histograms represent aggregated counts across the two masker talkers. The black vertical bar indicates the grand mean f0 for all stimuli within each panel.

C. Ideal-binary mask

Ideal-binary masks have been used as a way to estimate energetic masking for speech-in-noise conditions (Anzalone et al., 2006; Brungart et al., 2006; Wang, 2005). This approach simulates an ideal-segregation process by effectively eliminating time/frequency epochs that are energetically dominated by the masker signal. This stimulus manipulation was used to assess energetic masking associated with the three different two-talker maskers. Methods used to create the ideal-binary mask (processed using matlab) followed closely those described in Brungart et al. (2006). Speech was filtered into bands using a bank of 128 4th order gammatone filters (0.5–11 kHz). The SNR for each band was evaluated in sequential 20-ms Hanning windows, spaced at 10-ms intervals. Windows with an SNR of −6 dB or greater were retained, and those with lower SNRs were discarded. The bands were then recombined.
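The processing pipeline above can be illustrated with a simplified sketch. The study used a 128-channel gammatone filterbank in matlab; the Python sketch below substitutes a short-time Fourier transform for the filterbank but keeps the 20-ms Hann windows, 10-ms spacing, and −6 dB retention criterion. All function names and the 16-kHz sampling-rate assumption are ours.

```python
import numpy as np

def stft(x, win=320, hop=160):
    # 20-ms Hann windows spaced at 10-ms intervals, assuming fs = 16 kHz
    w = np.hanning(win)
    n = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop:i * hop + win] * w for i in range(n)])
    return np.fft.rfft(frames, axis=1)

def ideal_binary_mask(target, masker, thresh_db=-6.0, win=320, hop=160):
    # retain time/frequency cells where the local target-to-masker ratio
    # is -6 dB or better; discard the rest
    T, M = stft(target, win, hop), stft(masker, win, hop)
    snr_db = 10.0 * np.log10((np.abs(T) ** 2 + 1e-12) / (np.abs(M) ** 2 + 1e-12))
    return (snr_db >= thresh_db).astype(float)

def apply_mask(target, masker, mask, win=320, hop=160):
    # mask the target+masker mixture and resynthesize via overlap-add
    mix = stft(target + masker, win, hop) * mask
    frames = np.fft.irfft(mix, n=win, axis=1)
    w = np.hanning(win)
    out = np.zeros(len(target))
    wsum = np.zeros(len(target))
    for i, f in enumerate(frames):
        out[i * hop:i * hop + win] += f * w
        wsum[i * hop:i * hop + win] += w ** 2
    return out / np.maximum(wsum, 1e-12)
```

Listening to `apply_mask` output, masker-dominated regions are silenced, which is why the competing speech in the ideal-binary-mask conditions is unintelligible.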

D. Procedures and equipment

Participants were randomly selected to be in one of three groups. Each group of 25 participants was assigned to hear one of the three target talker types (flat, normal, or exaggerated). Participant characteristics were similar across the three groups. Specifically, the flat target group included 16 females and 9 males (M age = 20 years, SD = 1), the normal target group included 13 females and 12 males (M age = 20 years, SD = 1), and the exaggerated target group included 15 females and 10 males (M age = 19 years, SD = 1). Participants listened to the target speech type that they were assigned to in the presence of each of the two-talker masker conditions (flat, normal, and exaggerated), a speech-shaped noise masker, and ideal-binary-mask processed versions of these four masker conditions. The spectrum of the noise masker was matched to the grand average of the long-term average spectrum of the three two-talker maskers. The noise masker condition was included to examine the intelligibility of each target type (flat, normal, and exaggerated f0 contour) under conditions of mainly energetic masking (but see Stone, Füllgrabe, and Moore, 2012). In total, each participant completed eight experimental conditions, presented in random order and blocked by spectrum type (full spectrum or ideal-binary mask processed speech). No sentences were repeated within a participant. Total testing time was approximately two hours.
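The spectrum matching of the noise masker can be sketched as follows: average the long-term magnitude spectra of the masker waveforms, then impose that spectral envelope on random-phase noise. This is a simplified sketch, and all names are ours; the study's exact synthesis method is not specified here.

```python
import numpy as np

def speech_shaped_noise(maskers, n_out, seed=0):
    # average the long-term magnitude spectra of the masker waveforms,
    # then impose that spectral envelope on random-phase noise
    rng = np.random.default_rng(seed)
    mags = np.mean([np.abs(np.fft.rfft(m, n_out)) for m in maskers], axis=0)
    phases = rng.uniform(0.0, 2.0 * np.pi, mags.shape)
    phases[0] = phases[-1] = 0.0  # keep DC and Nyquist bins real
    return np.fft.irfft(mags * np.exp(1j * phases), n_out)
```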

Testing occurred in a double-walled, sound-isolated suite (Acoustic Systems, ETS-Lindgren). Participants were seated directly underneath the sound field calibration point, while the examiner sat in the single-walled control room. Participants were instructed that they would be listening to a female target talker, and that the target talker would not change throughout the experiment. They were also told that other females would be talking at the same time, and that their task was to ignore the competing speech and repeat back only what the target talker said. Participants were encouraged to guess after trials in which they were uncertain.

Stimuli were controlled using a custom matlab program, routed to an audiometer (GSI Audiostar Pro, Grason-Stadler) using the external A port, and presented to the participant in the sound field at equal levels from each of two loudspeakers (positioned at −45° and +45° azimuth, on either side of the participant). This positioned the participant directly in front of the observation window, giving the examiner a clear view of the participant's mouth and thus both auditory and visual cues during scoring. Responses were spoken aloud while facing a mounted microphone, and the input was routed through the audiometer so that the examiner could hear the response. All responses were scored on-the-fly. Any variations of the keyword (including pluralization and tense changes) were marked as incorrect.

The target speech was fixed at 75 dB sound pressure level. The level of the masker speech varied from trial to trial, dependent upon the participant's response. Initially, the SNR was set to +5 dB to ensure that the participant could easily identify the target talker's voice and become familiar with the task. Two interleaved one-up, one-down adaptive tracks of 32 sentences were utilized to estimate the speech recognition threshold associated with 50% correct recognition of the target sentences. In this procedure, one of the two tracks employed a lax criterion, decreasing the SNR if the participant reported one or more keywords correctly, while the other track used a strict criterion, decreasing the SNR only if the participant reported three or more keywords correctly. These two different stepping rules ensured that data points above and below 50% correct recognition were sampled. Word-level data were fitted with a logit function, defined as y = 1/{1 + exp[−(x − α)/β]}, where y is the proportion correct, x is the signal level in dB SNR, α is the speech recognition threshold, and β is the function slope. Fits were made by minimizing the sum of squared error, with the data at each SNR weighted by the number of observations.
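The threshold estimation described above can be sketched as a weighted least-squares fit of the logit function. A brute-force grid search stands in for whatever optimizer the authors actually used, and the parameter ranges are arbitrary assumptions.

```python
import numpy as np

def logit(x, alpha, beta):
    # proportion correct vs SNR; alpha is the 50%-correct threshold (dB SNR)
    return 1.0 / (1.0 + np.exp(-(x - alpha) / beta))

def fit_threshold(snr, p_correct, n_obs):
    # minimize the observation-weighted sum of squared error over a grid
    alphas = np.arange(-20.0, 10.0, 0.05)
    best = (np.nan, np.nan, np.inf)
    for beta in np.arange(0.2, 10.0, 0.05):
        pred = logit(snr[None, :], alphas[:, None], beta)
        sse = (n_obs * (p_correct - pred) ** 2).sum(axis=1)
        i = int(sse.argmin())
        if sse[i] < best[2]:
            best = (alphas[i], beta, sse[i])
    return best[0], best[1]  # (threshold alpha, slope beta)
```

The returned α is the SNR at which the fitted function crosses 50% correct, i.e., the speech recognition threshold reported in the Results.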

III. RESULTS

For all statistical analyses, linear model assumptions were evaluated by examining the residuals; for the models reported below, there was no evidence of model assumptions being violated. All analyses were conducted using sas v9.4. The pairwise comparisons are contrasts performed within the linear model framework. sas defaults were used to determine degrees of freedom for each test.

First, we were interested in understanding the overall intelligibility of the target speech for each of the three speaking styles (flat, normal, and exaggerated). This could not be done in quiet, as all three target speaking styles were highly intelligible and ceiling performance would have been achieved for all three types. Further, intelligibility estimates for the three target speech productions could not be assessed in the two-talker masker without informational masking effects. Therefore, thresholds are reported for each target type when measured in the speech-shaped noise masker (between subjects). Data for all 75 participants are included, as all psychometric function fits were strong, with a median r2 = 0.939 (r2 interquartile range between 0.885 and 0.980). Figure 3 indicates a trend for higher thresholds for the exaggerated target speech than the two other target speaking styles. A linear-regression model was used to assess the main effect of target speech speaking style on sentence recognition thresholds (dB SNR). This statistical analysis, reported in Table II, confirmed the significant main effect of speaking style and post hoc testing indicated that exaggerated target speech resulted in significantly poorer thresholds compared to both flat and normal targets. There was no difference in thresholds between the flat and normal targets (see Table III for mean thresholds and standard deviations).

FIG. 3.

Sentence-recognition thresholds in speech-shaped noise (dB SNR) for normal-hearing participants for three distinct speaking styles (flat, normal, and exaggerated). Filled circles indicate individual listener thresholds. The edge of each box plot represents the 25th and 75th percentile of the data. Whiskers indicate the spread of data falling within 1.5 times the interquartile range. A lower threshold (more negative number) indicates the participant tolerated a poorer SNR at 50% correct recognition.

TABLE II.

Regression model data assessing the main effect of target speaking style (flat, normal, and exaggerated) in a noise masker for sentence-recognition thresholds.

Effect                  Target   vs Target   Estimate   Test             p-value
Target Speaking Style                                   F(2,72) = 3.99   0.0200
                        Exag.    Flat        0.7        t(72) = 2.33     0.0228
                        Exag.    Normal      0.8        t(72) = 2.55     0.0129
                        Flat     Normal      0.1        t(72) = 0.23     0.8225

TABLE III.

Means and standard deviations for sentence-recognition thresholds for three target speaking styles in a speech-shaped noise masker.

Target Speaking Style   Mean (dB SNR)   SD
Flat                    −9.1            1
Normal                  −9.1            1
Exaggerated             −8.4            1

To examine the effect of similarity in f0 contour depth between the target and masker speech, a linear-regression model evaluated the main effects of target speech speaking style (flat, normal, or exaggerated) and masker speaking style (flat, normal, or exaggerated), as well as the interaction of these two effects, on participants' sentence-recognition thresholds (dB SNR). Note that masker speaking style was included as a within-subject variable, meaning that we accounted for the within-subject correlation in the statistical model. To accomplish this, we allowed the residuals in the linear-regression model to be correlated, specifically via an unstructured correlation matrix. Not only was the effect of masker speech speaking style correlated as a within-subject variable, but the variances for each masker type also differed, which the unstructured matrix accounted for. Only thresholds that had a strong psychometric function fit were included in the data analysis, with a median r2 = 0.879 (r2 interquartile range between 0.776 and 0.949). Five of the 225 total fits were poor (r2 < 0.50) and were excluded; this included two fits in the exaggerated target/normal masker condition, one in the exaggerated target/exaggerated masker condition, one in the normal target/flat masker condition, and one in the flat target/flat masker condition. Data are shown for all target/masker combinations in Fig. 4. This figure shows that thresholds are poorest when f0 contour depth is similar between the target and the masker speech for the flat and exaggerated conditions (leftmost and rightmost boxes in Fig. 4). However, there are smaller differences across the three target conditions for the normal masker speech speaking style (middle panel). Statistical analyses, reported in Table IV, corroborate these observations. There was a significant main effect of target speech speaking style, a significant main effect of masker speech speaking style, and a significant interaction between them.
To understand the statistically significant interaction, pairwise comparisons were examined. See Table V for means and standard deviations for all target/masker combinations.

TABLE IV.

Regression model data assessing the main effects of target and masker speaking style and the interaction of these two effects for sentence-recognition thresholds.

Effect                                Target   vs Target   Masker   Estimate   Test              p-value
Target Speaking Style                                                          F(2,73) = 7.29    0.0013
Masker Speaking Style                                                          F(2,73) = 14.24   <0.0001
Target and Masker Style Interaction                                            F(4,73) = 7.29    <0.0001
                                      Flat     Exag.       Flat     5.3        t(73) = 7.42      <0.0001
                                      Flat     Normal      Flat     2.7        t(73) = 3.74      0.0004
                                      Exag.    Normal      Flat     −2.6       t(73) = −3.67     0.0010
                                      Exag.    Flat        Exag.    8.7        t(73) = 16.27     <0.0001
                                      Exag.    Normal      Exag.    3.3        t(73) = 6.13      <0.0001
                                      Normal   Flat        Exag.    5.4        t(73) = 10.13     <0.0001
                                      Exag.    Flat        Normal   −0.2       t(73) = −0.19     0.8460
                                      Normal   Exag.       Normal   2.9        t(73) = 3.64      0.0005
                                      Normal   Flat        Normal   2.7        t(73) = 3.40      0.0011

TABLE V.

Means and standard deviations for sentence-recognition thresholds for three target speaking styles in three masker speech speaking styles.

Target speaking style   Masker speaking style   Mean (dB SNR)   SD
Flat                    Flat                    −1.7            1.3
Normal                  Flat                    −4.4            1.5
Exaggerated             Flat                    −6.9            3.7
Flat                    Normal                  −6.9            2.7
Normal                  Normal                  −4.2            1.2
Exaggerated             Normal                  −7.2            4.0
Flat                    Exaggerated             −9.5            2.6
Normal                  Exaggerated             −4.1            0.9
Exaggerated             Exaggerated             −0.7            1.4

For the flat masker speech speaking style (leftmost column of Fig. 4), all three target types resulted in significantly different thresholds. The flat target/flat masker combination was the most difficult, while the exaggerated target/flat masker combination was the easiest. Thresholds in the flat target condition were significantly higher than those in the normal and exaggerated target conditions, and thresholds for the exaggerated targets were significantly better (lower) than those for the normal targets.

For the exaggerated masker speech conditions (rightmost column of Fig. 4), the target means were all significantly different from one another, but in the opposite order relative to the flat masker condition: the exaggerated target speech had the highest thresholds, the flat target speech had the lowest, and the normal target speech thresholds were intermediate.

In contrast to the pattern of results observed in the flat and exaggerated masker conditions, thresholds for the exaggerated and flat targets were not significantly different in the normal masker condition. Thresholds for the normal targets were higher than both the exaggerated targets and the flat targets in the normal masker condition.

The data support our initial hypothesis that participants would have the greatest difficulty with the sentence-recognition task when f0 contour depths were most similar between the target and masker speech. In the leftmost and rightmost columns of Fig. 4, it is evident that performance changed systematically as the f0 contour depth differences increased (in an orderly fashion: flat → normal → exaggerated). This pattern of results is in line with the current conceptual framework of informational masking (Durlach et al., 2003; Kidd et al., 2008; Kidd and Colburn, 2017; Watson, 1987), in that greater similarity (in this case with respect to f0 contour depth) is associated with greater informational masking. For the normal targets, a similar pattern of results is observed, in which the normal target/normal masker combination resulted in the highest threshold; we observed a significant and similar improvement in performance when the flat and exaggerated targets were paired with the normal masker.

To explore how thresholds could have differed due to energetic masking differences between the nine conditions, participants also listened to the experimental conditions after the speech was processed through an ideal-binary mask (see Fig. 5). It should be noted that once the ideal-binary mask is implemented, the listener hears a degraded target signal. The target signal is spectrally and temporally sparse because the portions of the target speech that were masked by the competing signal have been removed. Further, the competing masker speech is not intelligible after processing. Therefore, this task is expected to be much easier for the listener.
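The ideal-binary mask manipulation described above can be sketched in a few lines. This is a minimal illustration, not the study's exact implementation: the choice of time-frequency analysis and the 0 dB local criterion (`lc_db`) are assumptions, and the function operates on magnitude spectrograms computed separately for the clean target and masker recordings.

```python
import numpy as np

def ideal_binary_mask(target_tf, masker_tf, lc_db=0.0):
    """Apply an ideal binary mask to a target+masker mixture.

    target_tf, masker_tf: magnitude spectrograms (freq x time) of the
    separately recorded target and masker signals.
    lc_db: local criterion in dB; time-frequency units where the local
    target-to-masker ratio falls below this value are discarded.
    """
    eps = 1e-12  # avoid log of zero in silent frames
    local_snr_db = 20.0 * np.log10((target_tf + eps) / (masker_tf + eps))
    mask = (local_snr_db > lc_db).astype(float)
    # retain the mixture only in units where the target dominates;
    # everything else is zeroed, leaving a sparse, glimpsed target
    return mask * (target_tf + masker_tf), mask
```

Because the mask is computed from the separately known target and masker signals, it is "ideal" in the sense that it requires oracle knowledge unavailable to a real listener; this is precisely what makes it useful as an estimate of energetic masking.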

FIG. 5.

(Color online) Sentence-recognition thresholds (dB SNR) after ideal-binary mask processing for three different target speaking styles: flat (fl.), normal (norm.), and exaggerated (ex.) in three different two-talker maskers (flat [leftmost column], normal [middle column], and exaggerated [rightmost column]). Plot layout follows the same configuration as Fig. 4.

Of the 225 data sets (75 participants × 3 masker conditions), only three had poor psychometric fits (r2 < 0.5); these thresholds were excluded from the analysis. The remaining fits were strong, with a median r2 of 0.829 (interquartile range: 0.747–0.892). To examine the amount of energetic masking caused by the different combinations of target/masker f0 contours, an additional linear-regression model was used to evaluate the main effects of target speech speaking style (flat, normal, or exaggerated) and masker speech speaking style (flat, normal, or exaggerated), as well as their interaction, on participants' ideal-binary mask sentence-recognition thresholds (dB SNR). The statistical procedures used for the full-spectrum thresholds were also used here, and results are shown in Table VI. For the ideal-binary mask conditions, the main effect of masker speaking style and the interaction between target and masker speaking style were statistically significant; the main effect of target speaking style was not. Given the significant interaction, pairwise comparisons were evaluated. For the flat masker conditions, flat and exaggerated target thresholds differed significantly, while exaggerated versus normal and flat versus normal thresholds did not. For the exaggerated masker conditions, none of the target types differed significantly from one another, and there were no significant differences across target types for the normal masker conditions. Means and standard deviations are shown in Table VII.
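The fit-quality screening above (discarding thresholds whose psychometric fits yielded r2 < 0.5) can be illustrated with a minimal sketch. The logistic function and the coarse grid-search fit below are illustrative assumptions, not the paper's actual fitting procedure, which is described in its methods; the sketch simply shows how a threshold (here, the 50% point) and an r2 value fall out of fitting proportion-correct data against SNR.

```python
import numpy as np

def logistic(snr, midpoint, slope):
    # simple two-parameter psychometric function (illustrative)
    return 1.0 / (1.0 + np.exp(-slope * (snr - midpoint)))

def fit_threshold(snrs, prop_correct):
    """Fit a logistic to proportion-correct vs SNR by coarse grid
    search and return (threshold at the 50% point, r-squared)."""
    snrs = np.asarray(snrs, dtype=float)
    prop_correct = np.asarray(prop_correct, dtype=float)
    best = None
    for mid in np.arange(min(snrs) - 5.0, max(snrs) + 5.0, 0.1):
        for slope in np.arange(0.1, 2.0, 0.05):
            pred = logistic(snrs, mid, slope)
            sse = np.sum((prop_correct - pred) ** 2)
            if best is None or sse < best[0]:
                best = (sse, mid, slope)
    sse, mid, slope = best
    sst = np.sum((prop_correct - np.mean(prop_correct)) ** 2)
    r2 = 1.0 - sse / sst  # proportion of variance explained by the fit
    return mid, r2  # the logistic midpoint is the 50% threshold
```

Under a screening rule like the paper's, a data set whose best fit still gives r2 < 0.5 would be excluded from further analysis.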

TABLE VI.

Regression model data assessing the main effects of target and masker speaking style, and the interaction of these two effects for sentence-recognition thresholds for IBM processed speech.

Effect Target Vs target Masker Estimate Test p-value
Target speaking style F2,74 = 1.81 0.1706
Masker speaking style F2,74 = 82.62 <0.0001
Target and masker style interaction F4,74 = 2.63 0.0412
Flat Exag. Flat 1.8 t74 = 3.67 0.0005
Flat Normal Flat 0.9 t74 = 1.78 0.0800
Exag. Normal Flat −0.9 t74 = −1.89 0.0626
Exag. Flat Exag. −0.6 t74 = −0.90 0.3705
Exag. Normal Exag. −0.3 t74 = −0.41 0.6814
Normal Flat Exag. 0.3 t74 = 0.49 0.6256
Exag. Flat Normal −0.1 t74 = −0.24 0.8144
Normal Flat Normal −0.3 t74 = −0.41 0.6843
Normal Exag. Normal −0.4 t74 = −0.65 0.5187

TABLE VII.

Means and standard deviations for sentence-recognition thresholds for three target speaking styles in three masker speech speaking styles for IBM processed speech.

Target speaking style Masker speaking style Mean (dB SNR) SD
Flat Flat −27.6 1.7
Exaggerated Flat −29.4 1.8
Normal Flat −28.4 1.7
Flat Normal −30.6 2.1
Exaggerated Normal −30.8 2.3
Normal Normal −30.9 2.1
Flat Exaggerated −31.2 2.3
Exaggerated Exaggerated −31.8 1.9
Normal Exaggerated −31.5 2.4

The ideal-binary mask data suggest that energetic masking was equivalent for the three target types in the normal and exaggerated masker conditions. On average, participants performed more poorly for the three target types in the flat masker condition. This result is not surprising because the flat masker displayed reduced spectral and temporal modulations when compared to the other two maskers (see Figs. 1 and 2 above). The difference in modulations likely resulted in fewer “dip listening” opportunities for the flat masker than the normal or exaggerated masker conditions, contributing to the ∼3 dB poorer performance in the flat masker condition.

To further explore the effect of target/masker similarity on speech-in-speech masking, thresholds were compared between matched and mismatched target/masker combinations for both the full-spectrum and ideal-binary mask conditions. The assumption behind this analysis was that, for the full-spectrum stimuli only, conditions with similar target/masker combinations (matched) should result in elevated sentence-recognition thresholds compared to conditions in which the target/masker f0 contour types differed (mismatched). For the ideal-binary mask conditions, where informational masking is thought to be eliminated, target/masker similarity should not matter. Matched thresholds included the flat/flat, normal/normal, and exaggerated/exaggerated target/masker combinations; mismatched thresholds included the six remaining experimental conditions. For the full-spectrum data, thresholds for the matched conditions were significantly elevated compared to the mismatched thresholds (matched M threshold = −2.2 dB SNR, SD = 2.0; mismatched M threshold = −6.5 dB SNR, SD = 3.3). However, once the stimuli were processed through the ideal-binary mask, no difference in thresholds between the matched and mismatched conditions was observed (matched M threshold = −30.1 dB SNR, SD = 2.6; mismatched M threshold = −30.3 dB SNR, SD = 2.3). This observation was confirmed by a statistical analysis using a similarly structured regression model to that described in Sec. III. After controlling for target and masker and examining the fixed effects of target, masker, and type (matched or mismatched) in the full-spectrum conditions, there were significant effects of masker (F2,73 = 8.79, p = 0.0004) and type (F1,73 = 131.5, p < 0.0001), but not of target (F2,73 = 2.51, p = 0.0880).
However, a separate, similarly structured analysis of the ideal-binary mask thresholds indicated a main effect of masker (F2,74 = 80.90, p < 0.0001), but no effect of target (F2,74 = 2.15, p = 0.1242) or type (F1,74 = 0.48, p = 0.4888). Therefore, there is a matched versus mismatched effect for the full-spectrum conditions but not for the ideal-binary mask conditions. This result was not unexpected, but it highlights the role of f0 contour depth similarity between the target and masker speech in informational masking.

IV. DISCUSSION

The purpose of this experiment was to evaluate the importance of target/masker f0-contour depth similarity for sentence recognition. No digital-signal processing was employed to produce these stimuli; that is, only natural speech recordings were used in the experiment. Ideal-binary mask conditions were also included to examine the potentially different contributions of energetic masking within each of the nine unique speech-in-speech conditions. For the speech-in-speech conditions, the poorest thresholds were observed when the target and masker f0 contours were most similar in depth, and the best thresholds were observed when they were most dissimilar (i.e., when the target speech was flat and the masker speech was exaggerated, or vice versa). Sentence-recognition thresholds obtained in a speech-shaped noise masker indicated no significant differences between the flat and normal f0-contour target conditions, suggesting that the flat and normal f0 contour targets were equally intelligible. Thresholds for exaggerated targets were significantly poorer than for the two other target types in the speech-shaped noise masker, although this difference was small (∼0.7 dB). Ideal-binary mask thresholds indicated similar amounts of energetic masking across conditions, except for the flat masker speech. One interpretation of this result is that the flat masker provided fewer spectral/temporal dip-listening opportunities, resulting in thresholds that were about 3 dB poorer.

A. Target/masker f0 contour similarity

The data from this experiment support our initial hypothesis that, as the similarity between target and masker f0 contour depth increases, listeners' thresholds also increase. For both the flat and exaggerated f0 contour targets, sentence recognition became progressively more challenging as the target and masker f0 contours became more similar. Yet, a closer examination of the data reveals an interesting pattern: there is a clear benefit when the f0 contour of the target speech fluctuates less than that of the masker (flat target conditions) or more than that of the masker (exaggerated target conditions; this pattern can also be seen in the center column of Fig. 4). However, thresholds for normal targets do not change when the masker speech becomes less similar in either direction (flat or exaggerated masker speech) relative to when it is matched in f0 contour depth (M threshold range: −4.1 to −4.4 dB SNR). For these particular conditions, when the target speech is produced with a normal f0 contour, the data do not support the idea that similarity between target and masker f0 contour depth matters. This general finding was replicated in a different group of young, normal-hearing listeners (see supplemental materials1).
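F0 contour depth, the quantity manipulated across the speaking styles discussed above, can be operationalized in several ways. As one illustration (this is an assumed metric, not necessarily the measure used in the paper), the sketch below summarizes a voiced-frame f0 track as its standard deviation in semitones around the track's geometric-mean f0, so that a flatter contour yields a smaller depth value:

```python
import math

def contour_depth_semitones(f0_track_hz):
    """Illustrative f0 contour depth: standard deviation of the voiced
    f0 track, expressed in semitones re: the track's geometric mean.
    (Assumed metric; the paper does not specify this exact measure.)"""
    voiced = [f for f in f0_track_hz if f and f > 0]  # drop unvoiced frames
    # geometric mean keeps the semitone deviations centered on zero
    mean_f0 = math.exp(sum(math.log(f) for f in voiced) / len(voiced))
    st = [12.0 * math.log2(f / mean_f0) for f in voiced]
    mean_st = sum(st) / len(st)
    var = sum((s - mean_st) ** 2 for s in st) / len(st)
    return math.sqrt(var)
```

By such a measure, the flat, normal, and exaggerated speaking styles would form an ordered series of increasing depth, which is the dimension along which target/masker similarity is evaluated here.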

The finding that recognition of target speech produced with normal f0 contours was insensitive to masker type was unexpected. However, listeners encounter speech with normal f0 contours far more often than speech with flat or exaggerated contours, and this greater exposure likely increases the prosodic predictability of normal targets. The "prosodic predictability hypothesis" (Kakouros and Rasanen, 2016; Kakouros et al., 2018) proposes that humans are sensitive to regularities in the perceptual patterns of their environment: statistical or distributional learning mechanisms help individuals perceive the stimuli of everyday life in a cognitively economical fashion. That is, stimuli that are less predictable, or more novel, are likely to engage greater attentional and/or information-processing resources than those that are more common or predictable in the sensory context. This account may explain why performance is poor when target and masker f0 contour depths are matched in the flat and exaggerated speech conditions (M threshold = −1.2 dB SNR), but listeners' thresholds are relatively good when the target and masker f0 contour depths are matched in the normal speech conditions (M threshold = −4.2 dB SNR). By this account, listeners need fewer attentional resources to attend to the "predictable" normal f0 contour targets and therefore may have more resources available to deal with the acoustic similarity between the target and masker speech. In contrast, greater attentional cost is incurred when processing the flat and exaggerated f0 contour targets, leaving fewer resources to contend with the similarity in acoustic features between the target and masker speech (Kakouros et al., 2018).

It is less clear why mismatching the masker f0 contour depth is not more beneficial to listeners' recognition of normal f0 contour target speech. One explanation is that, unlike the two other target conditions (flat and exaggerated), in which the matched conditions resulted in poor performance (M thresholds near −1 dB SNR), there was less room for masking release to be observed. A second explanation is that the change in f0 contour from the normal f0 stimuli was not as large in either direction (less fluctuation = flat; more fluctuation = exaggerated) as the potential change for either the flat f0 contour speech (more fluctuation = normal, even more fluctuation = exaggerated) or the exaggerated f0 contour speech (less fluctuation = normal, even less fluctuation = flat). For the two other target types, large differences between mean thresholds (∼7 dB SNR) are only observed when the difference between target and masker f0 contours changes by two steps (flat/flat to flat/exaggerated, exaggerated/exaggerated to exaggerated/flat).

B. Differences between naturally produced speech and digitally manipulated speech that varies in f0 contour

Thresholds did not significantly differ between the flat and normal f0 contour target types in the speech-shaped noise masker across our stimulus sets. It should be noted that though we refer to our targets as “flat,” there is still dynamic movement of the f0 within these sentences (as shown in Figs. 1 and 2). Recall that these stimuli were produced by a trained actor, and though they sound very monotone compared to normal speech (see Mm. 1), they are unlike stimuli that are manipulated using signal processing techniques that result in a constant f0. One possible explanation for similar thresholds for the noise-masked flat and normal targets is that there was not enough of an f0 contour reduction in the flat condition to impact overall performance in steady-state noise. This result agrees with Binns and Culling (2007), who showed that sentence recognition was not detrimentally affected when contour fluctuations were reduced by either half or three quarters of their original value in a noise masker. Rather, significantly poorer performance was only observed when the f0 contour was inverted from its original shape (also see Miller et al., 2010). Binns and Culling also reported that reducing f0-contour modulation by half of its original level did not negatively impact sentence recognition in a one-talker masker. Consistent with the results presented here, Binns and Culling also observed larger differences between target conditions for the speech-in-speech task (Exp. 2) than the speech-in-noise task (Exp. 1).

Mm. 1.

Trained actor's production of the stimulus sets (wav audio).
DOI: 10.1121/1.5121314.1

The results of Binns and Culling (2007) differ from those of the present study in one salient respect. That study manipulated the f0 contour of its one-talker masker, and no effect of masker (normal, monotone, or inverted f0 contour) was observed, nor was there a significant interaction between target and masker. This result is quite different from our data. The most obvious difference between the two paradigms is the use of a one-talker instead of a two-talker masker. For sentence-recognition tasks, two competing talkers have been shown to cause significantly more informational masking than one competing talker (Brungart et al., 2001; Calandruccio et al., 2017; Freyman et al., 2004; Rosen et al., 2013). Further, there was a consistent nine-semitone difference between the target and masker talkers in the Binns and Culling experiments, whereas mean semitone differences were minimal across the three female talkers used in the present experiment. The difference between the target and masker talkers' mean f0 may have served as a strong grouping cue in the Binns and Culling paradigm, negating the importance of the f0 contour difference between the target and masker talkers. Task demands also differ considerably between one-talker and two-talker maskers (a difference often described as a multi-masker penalty; see Durlach, 2006), and therefore comparisons between data sets utilizing these different types of maskers should be made with caution.
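The semitone separations discussed above follow directly from the ratio of two f0 values: a difference of n semitones corresponds to an f0 ratio of 2^(n/12), so the nine-semitone target/masker separation in Binns and Culling corresponds to an f0 ratio of roughly 1.68. A minimal helper (illustrative, not from the paper) makes the conversion explicit:

```python
import math

def semitone_difference(f0_a, f0_b):
    # signed distance in semitones between two fundamental frequencies;
    # 12 semitones = one octave = a doubling of f0
    return 12.0 * math.log2(f0_a / f0_b)
```

For example, a talker at 220 Hz sits exactly 12 semitones (one octave) above a talker at 110 Hz, whereas two female talkers with nearly equal mean f0, as in the present study, would differ by a fraction of a semitone.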

It was surprising that our exaggerated f0 contour targets resulted in poorer thresholds than the flat and normal targets in the speech-shaped noise masker. If anything, we would have expected the exaggerated f0 fluctuation to be beneficial for speech recognition, in that many speaking styles known to enhance recognition result in exaggerated f0 contours. This exaggerated dynamic movement in f0 contour is seen in clear speech (Krause and Braida, 2002; Picheny et al., 1985), Lombard speech (Garnier and Henrich, 2014), and infant-directed speech (Fernald et al., 1989; Song et al., 2010), all speaking styles associated with improved speech recognition. Although unexpected, the finding of poorer thresholds for exaggerated targets in noise is similar to the results of Miller et al. (2010), who reported significantly decreased recognition scores for speech processed with an f0 contour exaggerated to 1.75 times its normal depth (however, see Clarke et al., 2017, who found no difference in recognition between normal targets and targets with f0 contours exaggerated by factors of 1.5 and 1.75). It is possible that poorer thresholds for the exaggerated targets were observed because the increased fluctuation in f0 was less predictable and less linguistically appropriate (i.e., the exaggerated pitch was not semantically or situationally meaningful). Miller et al. (2010) reported a similar result when artificially applied f0 frequency modulation (at rates of 2.5 and 5.0 Hz) and inverted f0 contours were imposed on the target stimuli. While the authors originally hypothesized that f0 frequency modulation within the target speech might benefit listeners via increased perceptual salience and enhanced stream segregation of the target speech, their results indicated that these unnatural and linguistically inappropriate contours were actually detrimental to listeners' perception of the target speech.
Their results and the results from our exaggerated target stimuli in speech-shaped noise suggest that manipulations of normal f0 contours that significantly diverge from a listener's linguistic expectations can be detrimental for speech recognition in noise. Future studies could include stimuli with emotional content that would occur more typically in daily environments with either a flat or exaggerated f0 contour.

C. The interaction of f0 contour and sentence meaning

Another unexpected finding is the observation that performance with the normal f0 target was relatively insensitive to masker style, and data on this condition are mixed in the literature. Flat f0 contours have been shown to degrade speech recognition compared to speech produced with normal f0 contours (Binns and Culling, 2007; Laures and Weismer, 1999; Miller et al., 2010); yet, there are also several examples where no change in intelligibility is reported for masked-speech recognition across the two contour types (Assmann, 1999; Chatterjee et al., 2010; Clarke et al., 2017). A close examination of these studies shows an interesting trend. Sentences with low semantic predictability, paired with digitally flattened f0 contours, consistently result in a decrease in intelligibility. This was observed by Binns and Culling (2007), who employed low-predictability sentences from the IEEE corpus (Egan, 1948); Miller et al. (2010), who also used IEEE sentences; Darwin and Hukin (2000), who used sentences with very low semantic context (e.g., "You'll also hear the sound bead/globe played here," in which the word "bead" or "globe" was a keyword); and Laures and Bunton (2003) and Laures and Weismer (1999), who used low-predictability sentences from the Speech Perception In Noise (SPIN) Test (Kalikow et al., 1977). In contrast, when semantically meaningful speech was presented with either normal f0 contours or flattened contours, no significant difference in intelligibility was observed. This was seen by Assmann (1999), who also used sentences from the SPIN test but included only those with high semantic predictability; Clarke et al. (2017), who used Dutch sentences with "high sentential context"; and Chatterjee et al. (2010), who used sentences from the Hearing in Noise Test (HINT) (Nilsson et al., 1994).
These data support the idea that f0 contour fluctuation may be more important for effective speech recognition when semantic information is not available. However, if strong semantic cues are accessible to the listener, the loss of f0 fluctuation may be less detrimental to performance due to the increased redundancy of the semantically meaningful speech stream. In line with these results, we used meaningful BKB sentences in the current study and did not observe a difference in thresholds between our flat and normal f0 contour targets. One caveat is that this may not hold in natural environments that are acoustically complex; under these conditions, it is plausible that regardless of semantic content, normal f0 fluctuations may be required for effective speech recognition (Darwin and Hukin, 2000).

V. CONCLUSIONS

These data support the idea that similarity between the f0 contour depth of the target and masker speech increases sentence-recognition difficulty. The larger the difference in f0 contour depth between the target and masker (e.g., going from flat/flat to flat/exaggerated target/masker speech), the greater the change in sentence-recognition performance. Normal-hearing listeners seem to be less negatively affected by the similarity of target and masker f0 contour depths when attending to speech targets produced with normal f0 contours than with flat or exaggerated contours. Further research is required to fully understand why normal f0 contour targets seem to be more resilient than flat or exaggerated targets under matched target/masker f0 contour conditions, though this is likely due to listeners' familiarity with (or the predictable nature of) target speech produced with normal f0 contours.

ACKNOWLEDGMENTS

This work was funded by the National Institutes of Health (Grant No. NIDCD R03DC015074). Support was also provided by the Support of Undergraduate Research and Creative Endeavors (SOURCE) office at Case Western Reserve University. We thank the participants in this experiment and all of the undergraduate and graduate research assistants in our labs who helped with stimulus development, data collection, and data management.

Footnotes

1

See supplementary material at https://doi.org/10.1121/1.5121314 for sentence-recognition thresholds, a figure of sentence-recognition thresholds, and a linear regression analysis for sentence-recognition thresholds for a separate group of young-adult listeners tested across all nine listening conditions.

References

  • 1.ANSI (2009). S3.21 2004 (R2009): American National Standard Methods for Manual Pure-tone Threshold Audiometry ( Acoustical Society of America, New York: ). [Google Scholar]
  • 2. Anzalone, M. , Calandruccio, L. , Doherty, K. , and Carney, L. (2006). “ Determination of the potential benefit of time-frequency gain manipulation,” Ear Hear. 27(5), 480–492. 10.1097/01.aud.0000233891.86809.df [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Arehart, K. , King, C. , and McLean-Mudgett, K. (1997). “ Role of fundamental frequency differences in the perceptual separation of competing vowel sounds by listeners with normal hearing and listeners with hearing loss,” J. Speech Lang. Hear. Res. 40(6), 1434–1444. 10.1044/jslhr.4006.1434 [DOI] [PubMed] [Google Scholar]
  • 4. Assmann, P. F. (1999). “ Fundamental frequency and the intelligibility of competing voices,” in Proceedings of the 14th International Congress of Phonetic Sciences, pp. 179–182. [Google Scholar]
  • 5. Bench, J. , Kowal, Å. , and Bamford, J. (1979). “ The BKB (Bamford-Kowal-Bench) sentence lists for partially-hearing children,” Brit. J. Audiol. 13, 108–112. 10.3109/03005367909078884 [DOI] [PubMed] [Google Scholar]
  • 6. Binns, C. , and Culling, J. F. (2007). “ The role of fundamental frequency contours in the perception of speech against interfering speech,” J. Acoust. Soc. Am. 122(3), 1765–1776. 10.1121/1.2751394 [DOI] [PubMed] [Google Scholar]
  • 7. Bird, J. , and Darwin, C. J. (1998). “ Effects of a difference in fundamental frequency in separating two sentences,” in Psychophysical and Physiological Advances in Hearing, edited by Palmer A. R., Rees A., Summerfield A. Q., and Meedis R. ( Whurr, London: ), pp. 263–269. [Google Scholar]
  • 8. Boersma, P. , and Weenink, D. (2017). “ Praat: Doing phonetics by computer” [computer program], http://www.praat.org/ (Last viewed 1/10/2017).
  • 9. Bolia, R. , Nelson, W. , Ericson, M. , and Simpson, B. (2000). “ A speech corpus for multi-talker communications research,” J. Acoust. Soc. Am. 107, 1065–1066. 10.1121/1.428288 [DOI] [PubMed] [Google Scholar]
  • 10. Broadbent, D. E. , and Ladefoged, P. (1957). “ On the fusion of sounds reaching different sense organs,” J. Acoust. Soc. Am. 29(6), 708–710. 10.1121/1.1909019 [DOI] [Google Scholar]
  • 11. Brokx, J. P. L. , and Nooteboom, S. G. (1982). “ Intonation and the perceptual separation of simultaneous voices,” J. Phon. 10, 23–36. [Google Scholar]
  • 12. Brungart, D. S. (2001). “ Informational and energetic masking effects in the perception of two simultaneous talkers,” J. Acoust. Soc. Am. 109(3), 1101–1109. 10.1121/1.1345696 [DOI] [PubMed] [Google Scholar]
  • 13. Brungart, D. S. , Chang, P. S. , Simpson, B. D. , and Wang, D. (2006). “ Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation,” J. Acoust. Soc. Am. 120(6), 4007–4018. 10.1121/1.2363929 [DOI] [PubMed] [Google Scholar]
  • 14. Brungart, D. S. , Simpson, B. D. , Ericson, M. A. , and Scott, K. R. (2001). “ Informational and energetic masking effects in the perception of multiple simultaneous talkers,” J. Acoust. Soc. Am. 110(5), 2527–2538. 10.1121/1.1408946 [DOI] [PubMed] [Google Scholar]
  • 15. Calandruccio, L. , Buss, E. , and Bowdrie, K. (2017). “ Effectiveness of two-talker maskers that differ in talker congruity and perceptual similarity to the target speech,” Trends Hear. 21, 2331216517709385. 10.1177/2331216517709385 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Charpentier, F. , and Stella, M. (1986). “ Diphone synthesis using an overlap-add technique for speech waveforms concatenation,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP'86, p. 11. [Google Scholar]
  • 17. Chatterjee, M. , Peredo, F. , Nelson, D. , and Baskent, D. (2010). “ Recognition of interrupted sentences under conditions of spectral degradation,” J. Acoust. Soc. Am. 127(2), EL37–EL41. 10.1121/1.3284544 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Clarke, J. , Kazanoğlu, D. , Başkent, D. , and Gaudrain, E. (2017). “ Effect of F0 contours on top-down repair of interrupted speech,” J. Acoust. Soc. Am. 142(1), EL7–EL12. 10.1121/1.4990398 [DOI] [PubMed] [Google Scholar]
  • 19. Cutler, A. , Dahan, D. , and van Donselaar, W. (1997). “ Prosody in the comprehension of spoken language: A literature review,” Lang. Speech 40(2), 141–201. 10.1177/002383099704000203 [DOI] [PubMed] [Google Scholar]
  • 20. Darwin, C. J. , Brungart, D. S. , and Simpson, B. D. (2003). “ Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers,” J. Acoust. Soc. Am. 114(5), 2913–2922. 10.1121/1.1616924 [DOI] [PubMed] [Google Scholar]
  • 21. Darwin, C. J. , and Hukin, R. W. (2000). “ Effectiveness of spatial cues, prosody, and talker characteristics in selective attention,” J. Acoust. Soc. Am. 107(2), 970–977. 10.1121/1.428278 [DOI] [PubMed] [Google Scholar]
  • 22. Durlach, N. I. (2006). “ Auditory masking: Need for improved conceptual structure,” J. Acoust. Soc. Am. 120, 1787–1790. 10.1121/1.2335426 [DOI] [PubMed] [Google Scholar]
  • 23. Durlach, N. I. , Mason, C. R. , Kidd, G., Jr. , Arbogast, T. , Colburn, H. S. , and Shinn-Cunningham, B. G. (2003). “ Note on informational masking (L),” J. Acoust. Soc. Am. 113(6), 2984–2987. 10.1121/1.1570435 [DOI] [PubMed] [Google Scholar]
  • 24. Egan, J. P. (1948). “ Articulation testing methods,” Laryngoscope 58(9), 955–991. 10.1288/00005537-194809000-00002 [DOI] [PubMed] [Google Scholar]
  • 25. Fernald, A. , Taeschner, T. , Dunn, J. , Papousek, M. , de Boysson-Bardies, B. , and Fukui, I. (1989). “ A cross-language study of prosodic modifications in mothers' and fathers' speech to preverbal infants,” J. Child Lang. 16(3), 477–501. 10.1017/S0305000900010679 [DOI] [PubMed] [Google Scholar]
  • 26. Flaherty, M. , Buss, E. , and Leibold, L. (2019). “ Developmental effects in children's ability to benefit from F0 differences between target and masker speech,” Ear Hear. 40, 927–937. 10.1097/AUD.0000000000000673 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Freyman, R. L. , Balakrishnan, U. , and Helfer, K. S. (2004). “ Effect of number of masking talkers and auditory priming on informational masking in speech recognition,” J. Acoust. Soc. Am. 115(5), 2246–2256. 10.1121/1.1689343 [DOI] [PubMed] [Google Scholar]
  • 28. Garnier, M. , and Henrich, N. (2014). “ Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?,” Comput. Speech Lang. 28, 580–597. 10.1016/j.csl.2013.07.005 [DOI] [Google Scholar]
  • 29. Helfer, K. S. , and Freyman, R. L. (2008). “ Aging and speech-on-speech masking,” Ear Hear. 29(1), 87–98. 10.1097/AUD.0b013e31815d638b [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Hillenbrand, J. M. (2003). “ Some effects of intonation contour on sentence intelligibility,” J. Acoust. Soc. Am. 114(4), 2338. 10.1121/1.4781079 [DOI] [Google Scholar]
  • 31. Iyer, N. , Brungart, D. S. , and Simpson, B. D. (2010). “ Effects of target-masker contextual similarity on the multimasker penalty in a three-talker diotic listening task,” J. Acoust. Soc. Am. 128(5), 2998–3010. 10.1121/1.3479547 [DOI] [PubMed] [Google Scholar]
  • 32. Kakouros, S. , and Rasanen, O. (2016). “ Perception of sentence stress in speech correlates with the temporal unpredictability of prosodic features,” Cogn. Sci. 40(7), 1739–1774. 10.1111/cogs.12306 [DOI] [PubMed] [Google Scholar]
  • 33. Kakouros, S. , Salminen, N. , and Rasanen, O. (2018). “ Making predictable unpredictable with style—Behavioral and electrophysiological evidence for the critical role of prosodic expectations in the perception of prominence in speech,” Neuropsychologia 109, 181–199. 10.1016/j.neuropsychologia.2017.12.011 [DOI] [PubMed] [Google Scholar]
  • 34. Kalikow, D. N. , Stevens, K. N. , and Elliott, L. L. (1977). “ Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability,” J. Acoust. Soc. Am. 61(5), 1337–1351. 10.1121/1.381436 [DOI] [PubMed] [Google Scholar]
  • 35. Kidd, G., Jr. , and Colburn, H. S. (2017). “ Informational masking in speech recognition,” in The Auditory System at the Cocktail Party ( Springer, Cham: ), pp. 75–109. [Google Scholar]
  • 36. Kidd, G., Jr. , Mason, C. R. , Richards, V. M. , Gallun, F. J. , and Durlach, N. I. (2008). “ Informational masking,” in The Auditory System at the Cocktail Party ( Springer, Boston: ), pp. 143–189. [Google Scholar]
  • 37. Krause, J., and Braida, L. (2002). "Investigating alternative forms of clear speech: The effects of speaking rate and speaking mode on intelligibility," J. Acoust. Soc. Am. 112(5), 2165–2172. 10.1121/1.1509432
  • 38. Laures, J., and Bunton, K. (2003). "Perceptual effects of a flattened fundamental frequency at the sentence level under different listening conditions," J. Commun. Disord. 36(6), 449–464. 10.1016/S0021-9924(03)00032-7
  • 39. Laures, J. S., and Weismer, G. (1999). "The effects of a flattened fundamental frequency on intelligibility at the sentence level," J. Speech Lang. Hear. Res. 42, 1148–1156. 10.1044/jslhr.4205.1148
  • 40. Leibold, L., Buss, E., and Calandruccio, L. (2018). "Developmental effects in masking release for speech-in-speech perception due to a target/masker sex mismatch," Ear Hear. 39(5), 935–945. 10.1097/AUD.0000000000000554
  • 41. Mackersie, C. L., Dewey, J., and Guthrie, L. A. (2011). "Effects of fundamental frequency and vocal-tract length cues on sentence segregation by listeners with hearing loss," J. Acoust. Soc. Am. 130(2), 1006–1019. 10.1121/1.3605548
  • 42. Miller, S. E., Schlauch, R. S., and Watson, P. J. (2010). "The effects of fundamental frequency contour manipulations on speech intelligibility in background noise," J. Acoust. Soc. Am. 128(1), 435–443. 10.1121/1.3397384
  • 43. Moulines, E., and Charpentier, F. (1990). "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Commun. 9(5-6), 453–467. 10.1016/0167-6393(90)90021-Z
  • 44. Nilsson, M., Soli, S., and Sullivan, J. (1994). "Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise," J. Acoust. Soc. Am. 95(2), 1085–1099. 10.1121/1.408469
  • 45. Picheny, M., Durlach, N., and Braida, L. (1985). "Speaking clearly for the hard of hearing. I: Intelligibility differences between clear and conversational speech," J. Speech Hear. Res. 28(1), 96–103. 10.1044/jshr.2801.96
  • 46. Rosen, S., Souza, P., Ekelund, C., and Majeed, A. (2013). "Listening to speech in a background of other talkers: Effects of talker number and noise vocoding," J. Acoust. Soc. Am. 133(4), 2431–2443. 10.1121/1.4794379
  • 47. Song, J., Demuth, K., and Morgan, J. (2010). "Effects of the acoustic properties of infant-directed speech on infant word recognition," J. Acoust. Soc. Am. 128(1), 389–400. 10.1121/1.3419786
  • 48. Stone, M. A., Füllgrabe, C., and Moore, B. C. J. (2012). "Notionally steady background noise acts primarily as a modulation masker of speech," J. Acoust. Soc. Am. 132, 317–326. 10.1121/1.4725766
  • 49. Wang, D. (2005). "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines, edited by Divenyi P. (Kluwer Academic, Dordrecht), pp. 181–197.
  • 50. Watson, C. S. (1987). "Uncertainty, informational masking, and the capacity of immediate auditory memory," in Auditory Processing of Complex Sounds, edited by Yost W. A. and Watson C. S. (Lawrence Erlbaum, Mahwah, NJ), pp. 267–277.
