Masking release due to linguistic and phonetic dissimilarity between the target and masker speech

Lauren Calandruccio; Susanne Brouwer; Kristin J Van Engen; Sumitrajit Dhar; Ann R Bradlow

doi:10.1044/1059-0889(2013/12-0072)

Author manuscript; available in PMC: 2013 Jun 27.

Published in final edited form as: Am J Audiol. 2013 Jun;22(1):157–164. doi: 10.1044/1059-0889(2013/12-0072)

PMCID: PMC3694489 NIHMSID: NIHMS469085 PMID: 23800811

Abstract

Purpose

To investigate masking release for speech maskers for linguistically and phonetically close (English and Dutch) and distant (English and Mandarin) language pairs.

Method

Twenty monolingual speakers of English with normal audiometric thresholds participated. Data are reported for an English sentence recognition task in English, Dutch, and Mandarin competing speech maskers (Experiment I) and in noise maskers (Experiment II) that were matched either to the long-term average speech spectra or to the temporal modulations of the speech maskers from Experiment I.

Results

Results indicated that listener performance increased as the target-to-masker linguistic distance increased (English-in-English < English-in-Dutch < English-in-Mandarin).

Conclusions

Spectral differences between maskers can account for some, but not all, of the variation in performance between maskers; however, temporal differences did not seem to play a significant role.

Keywords: masking, native and non-native English speech perception

I. INTRODUCTION

Recognizing speech in the presence of competing speech can be difficult. Speech-recognition performance in noise can improve for listeners (i.e., they benefit from a masking release) when the relationship between the target and competing stimuli is manipulated (see Miller and Licklider, 1950; Festen and Plomp, 1990; Helfer and Freyman, 2008; Bernstein and Grant, 2009; and many others). Of particular interest for the present study is the suggestion from previous work that manipulations in the linguistic content of a masking speech signal can have a substantial influence on recognition of speech in the target signal. A masking release, or a decrease in overall masking, has been reported when the competing speech signal contained syntactically normal but semantically anomalous speech rather than meaningful linguistic content (Brouwer, Van Engen, Calandruccio, Dhar, and Bradlow, 2012). In addition, several studies have reported a release from masking for first (L1) and second language (L2) speech perception when the target speech and masker speech were not spoken in the same language (e.g., Garcia Lecumberri and Cooke, 2006; Van Engen and Bradlow, 2007; Calandruccio, Van Engen, Dhar, and Bradlow, 2010). Even non-native listeners attending to their L2 obtained a masking release when the competing speech was changed from their L2 to their L1 (i.e., they benefitted when the target and masker speech were not spoken in the same language, regardless of their proficiency in the two different competing languages; see Van Engen, 2010 and Brouwer et al., 2012).

It was hypothesized that, because a release from masking has been observed both for speech maskers that are less meaningful and for speech maskers that are linguistically different from the target speech, the magnitude of the masking release should differ when the competing speech varies along a continuum in the degree of linguistic/phonetic similarity to the target speech. Testing this hypothesis will further our understanding of the contributions of linguistic/phonetic information to overall masking, which could potentially improve signal-processing strategies within assistive listening devices for hearing-impaired listeners.

The goal of this research was to investigate masking release for foreign speech maskers that varied in the degree of linguistic/phonetic similarity to the target speech. Specifically, we were interested in comparing the magnitude of the masking release for linguistically and phonetically close (English and Dutch) and distant (English and Mandarin) language pairs. The three degrees of target-masker linguistic similarity were: (a) identical target-masker (English-in-English recognition), (b) linguistically close target-masker (English-in-Dutch recognition), and (c) linguistically distant target-masker (English-in-Mandarin recognition). We predicted that listeners would obtain a greater masking release when the competing language was more distant from the target speech than when it was close, because there should be greater differences in linguistic sound structure at the level of the phoneme inventories, syllable- and phrase-level phonetic structures, and rhythmic structure (and in turn, less overall masking). That is, we predicted that even when meaning was removed from the speech signal, the degree of similarity in such variables as rhythm class, phonemes, and syllable structures would be positively related to the extent of confusion between the target and masker signals. Data that support this prediction will be presented. A follow-up investigation of the influence of spectral and temporal differences between the maskers will also be presented.

II. EXPERIMENT I: Linguistically and phonetically close and distant masker pairs

A. METHODS

Listeners

Twenty normal-hearing listeners (audiometric thresholds < 25 dB HL bilaterally at octave frequencies between 250 and 8000 Hz) participated in the experiment. All listeners were monolingual speakers of American English and included 13 females and 7 males (M age = 21 years, SD = 2.4 years). Listeners were recruited from the student body at Northwestern University in Evanston, IL and were paid for their participation.

Stimuli

Target stimuli included sentences from the Bamford-Kowal-Bench (BKB) sentence lists (Bench, Kowal and Bamford, 1979; ® Cochlear Corporation) spoken by a native-English female speaker and recorded at Northwestern University. An example from the BKB sentences is, “The clown had a funny face”, in which the keywords used for scoring are underlined.

The competing speech stimuli consisted of three different two-talker maskers, spoken in English, Dutch, and Mandarin. The two non-English masker languages differ from the target language, English, in various ways (see Table I). For example, Dutch and English are both from the West Germanic language family and have similar rhythm (both traditionally considered stress-timed) and phonotactics (wide range of permissible syllable structures). Mandarin is a Sino-Tibetan language; it has a much more restricted range of syllable structures (primarily CV syllables) compared to English and Dutch, and is a tonal language. During the experiment, subjects were also tested using a Croatian masker and a semantically anomalous English masker, but these results are not reported in this manuscript (see Calandruccio, Van Engen, et al., 2010 for a reported masking release for native-English speaking listeners listening to English in the presence of Croatian two-talker babble compared to English two-talker babble; see Brouwer et al. (2012) for results on masker effectiveness for meaningful and anomalous speech).

Table I. Languages used for the masker conditions

Language of the masker	Number of vocalic phonemes	Number of consonantal phonemes	Linguistic family	Syllable structure	Lexical tones	Rhythm class
English¹	14	24	Indo-European (West Germanic)	(C)³V(C)⁴	No	Stress-timed
Dutch²	13	26	Indo-European (West Germanic)	(C)³V(C)⁴	No	Stress-timed
Mandarin³	35	28	Sino-Tibetan	(C)V(C)	Yes	Syllable-timed

¹ Dryer and Haspelmath, 2011
² Booij, 1995
³ Li and Thompson, 1989

The Dutch sentences used during testing in Experiment I were direct translations (made by the second author who is a native-Dutch speaker) of the Nye and Gaitenby (1974) sentences that are syntactically correct but semantically anomalous. An example of these sentences is: The great car met the milk. An example of the same sentence translated into Dutch is: De geweldige auto ontmoette de melk. The Mandarin sentences, originally used in Van Engen and Bradlow (2007), are also syntactically correct, but semantically anomalous materials. The English masker consisted of syntactically correct, meaningful sentences spoken in English taken from the Harvard/IEEE sentence lists (IEEE, 1969). An example of a sentence from these lists is: Rice is often served in round bowls. It should be noted that though the English competing sentences were meaningful whereas the Dutch and Mandarin competing sentences were semantically anomalous, all listeners were monolingual speakers of English and had no knowledge of either Dutch or Mandarin. Brouwer et al. (2012) reported data for monolingual English listeners in the presence of meaningful and anomalous Dutch maskers. Results indicated no significant differences between the masker conditions; therefore, we would expect that since the listeners in the present study were all monolingual English speakers the fact that Dutch and Mandarin maskers were anomalous should not matter.

Six different female voices were used to create the three two-talker maskers (two native speakers each of English, Dutch, and Mandarin). The two-talker maskers were created by concatenating sentences spoken by each talker with no silent intervals between sentences. Though each of the two talkers spoke the same sentences in each language, the order of concatenation differed between the talkers in each masker condition. The sentences were equalized to the same root-mean-square (RMS) pressure level using Praat (Boersma and Weenink, 2012) prior to concatenation. The two strings of sentences were combined into a single audio file using Audacity®. The final audio files (one for each masker condition) were RMS equalized to the same overall pressure. Lastly, the ends of the audio files were digitally trimmed so that all three maskers were 34 seconds in length.
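The masker-construction steps described above (per-sentence RMS equalization, gapless concatenation, summing the two talkers, re-equalizing the mix, then trimming) can be sketched in a few lines. This is an illustrative sketch only: the authors used Praat and Audacity, and the function names, the list-of-floats signal representation, and the `target_rms` value are assumptions introduced here.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def equalize_rms(samples, target_rms):
    """Scale samples so that their RMS equals target_rms."""
    current = rms(samples)
    if current == 0:
        raise ValueError("cannot scale a silent signal")
    gain = target_rms / current
    return [s * gain for s in samples]

def build_talker_stream(sentences, target_rms):
    """Equalize each sentence, then concatenate with no silent gaps."""
    stream = []
    for sentence in sentences:
        stream.extend(equalize_rms(sentence, target_rms))
    return stream

def mix_two_talkers(stream_a, stream_b, target_rms, n_samples):
    """Sum two talker streams, equalize the mix, then trim its end."""
    n = min(len(stream_a), len(stream_b))
    mixed = [stream_a[i] + stream_b[i] for i in range(n)]
    return equalize_rms(mixed, target_rms)[:n_samples]
```

In this sketch the two talkers would receive the same sentences in different orders, and `n_samples` would correspond to 34 seconds at the recording's sample rate.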

Instrumentation

The target and masker speech were mixed in real time using custom software created using MaxMSP (distributed by Cycling ’74) running on an Apple Macintosh computer. Stimuli were passed to a MOTU 828 MkII input/output firewire device for digital-to-analog conversion (24 bit), passed through a Behringer Pro XL headphone amplifier, and output to MB Quart 13.01HX drivers. Stimuli were then presented via disposable foam insert earphones (13 mm) to the listeners, who were seated in a comfortable chair within a double-walled sound-treated audiometric suite.

Experimental Testing

Listeners first participated in a pre-experiment with an easier signal-to-noise ratio (SNR) of −3 dB on the same day of testing. This experience allowed our listeners to become comfortable with the speech-in-speech task and familiar with the target voice. These initial 80 practice trials also helped to reduce learning effects in listeners’ performance (see Felty, Buchwald, and Pisoni, 2009).

Throughout testing, the level of the target speech remained fixed at 65 dB SPL, while the level of the competing (two-talker masker) speech was fixed at 70 dB SPL, resulting in a −5 dB SNR. The presentation order of the masker conditions (English, Dutch, Mandarin) was randomly varied across listeners and 16 sentences (1 BKB list; 50 keywords) were presented per masker condition.
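The level arithmetic in this paragraph can be checked directly: the SNR in dB is the target level minus the masker level, and a dB difference converts to an amplitude ratio as 10^(dB/20). The helper names below are hypothetical and serve only to illustrate the calculation.

```python
def snr_db(target_level_db, masker_level_db):
    """SNR in dB is the target level minus the masker level."""
    return target_level_db - masker_level_db

def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to a linear amplitude ratio."""
    return 10 ** (db / 20)

snr = snr_db(65, 70)                 # 65 dB SPL target, 70 dB SPL masker
ratio = db_to_amplitude_ratio(snr)   # target amplitude relative to masker
print(snr, round(ratio, 3))
```

At −5 dB SNR the target's RMS amplitude is a little over half that of the masker.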

Stimuli were presented binaurally. One target sentence was presented to the listener on each trial and a random portion of the appropriate two-talker masker was chosen and presented one second longer in duration compared to the target sentence (500 ms prior to the beginning of the target sentence, and 500 ms at the end of the target sentence). Listeners were asked to orthographically record what they heard on each trial. The written responses were scored as incorrect if the keyword was missing, incomplete, morphologically incorrect, or just wrong. Incorrect spelling of a word, however, was not considered incorrect.
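The trial timing described above (a randomly chosen masker portion that starts 500 ms before the target and ends 500 ms after it) can be sketched as follows. The experiment itself used custom MaxMSP software; the function names, the sample-index representation, and the use of Python's `random` module are assumptions made for illustration.

```python
import random

def choose_masker_segment(masker_len, target_len, sample_rate, rng=random):
    """Pick a random masker span that is one second longer than the target.

    Lengths are in samples. Returns (start, end) indices into the masker.
    """
    pad = int(round(0.5 * sample_rate))      # 500 ms before and after
    segment_len = target_len + 2 * pad
    if segment_len > masker_len:
        raise ValueError("masker is shorter than the requested segment")
    start = rng.randrange(0, masker_len - segment_len + 1)
    return start, start + segment_len

def overlay_target(masker_segment, target, sample_rate):
    """Add the target to the masker segment, starting 500 ms in."""
    pad = int(round(0.5 * sample_rate))
    mixed = list(masker_segment)
    for i, sample in enumerate(target):
        mixed[pad + i] += sample
    return mixed
```

Passing a seeded `random.Random` instance as `rng` makes the segment choice reproducible, which can be convenient when re-creating a particular trial.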

B. RESULTS

The following statistical analyses are based on percent-correct data. The analysis was conducted to test whether English-sentence recognition differed among the three two-talker masker conditions. A mixed effects model with listener as a random variable was used (Baayen, Davidson, and Bates, 2008). The fixed effect of masker was significant (F = 36.04, p < .0001). The least square means (LSM) for the three maskers were English = 38.9 (SE = 3.29), Dutch = 56.4 (SE = 3.29), and Mandarin = 72.4 (SE = 3.29). A post-hoc LSM Differences Tukey Honestly Significant Difference (HSD) test (Tukey, 1953) indicated significant differences among all three maskers. Data are illustrated in Figure 1 using boxplots. The length of the box indicates the interquartile range of performance scores, while the intermediate horizontal line indicates the median. The whiskers are calculated using the following two formulae: upper whisker = third quartile + 1.5 × (interquartile range), lower whisker = first quartile − 1.5 × (interquartile range).
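The boxplot quantities defined above can be computed as in the following sketch. It assumes Python's `statistics.quantiles` with the "inclusive" method; statistical packages differ in how they interpolate quartiles, so the values may not match the authors' software exactly. The function name and the example scores are hypothetical and are not data from the study.

```python
import statistics

def boxplot_summary(scores):
    """Quartiles, interquartile range, and whisker limits for one condition."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "upper_whisker": q3 + 1.5 * iqr,   # third quartile + 1.5 * IQR
        "lower_whisker": q1 - 1.5 * iqr,   # first quartile - 1.5 * IQR
    }

# Made-up percent-correct scores, for illustration only.
summary = boxplot_summary([10, 20, 30, 40, 50])
print(summary["lower_whisker"], summary["upper_whisker"])
```

Because percent-correct scores are bounded by 0 and 100, whisker limits computed this way can fall outside the possible range of the data.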

Figure 1. Sentence recognition performance (percent correct) in the presence of three two-talker maskers spoken in English, Dutch, and Mandarin. Boxplots for each linguistic masker are shown. The length of the box indicates the interquartile range of performance scores, while the intermediate horizontal line indicates the median. The whiskers are calculated using the following two formulae: upper whisker = third quartile + 1.5 × (interquartile range), lower whisker = first quartile − 1.5 × (interquartile range). Individual data points are also indicated for all 20 listeners within each boxplot.

A post-hoc analysis was conducted to examine masking release relative to the most difficult condition (i.e., the English masker condition). Specifically, masking release was calculated by taking the within-participant difference in performance scores between (a) the Dutch and English masker conditions and (b) the Mandarin and English masker conditions. A mixed effects regression model with subject as a random variable was fit to test for a difference in masking release between Dutch-English and Mandarin-English. Results indicated a significant effect of masker language on masking release (p = .0099). That is, there was a significantly larger masking release for the Mandarin-English condition than for the Dutch-English condition (see Figure 2). In addition, one-sample t-tests indicated that the masking release observed for both Dutch and Mandarin was significantly different from zero (t(19) = 3.59, p = .0019 and t(19) = 9.99, p < .0001, respectively).
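The masking-release calculation and the test against zero can be sketched as below. This is a minimal illustration with hypothetical function names and made-up scores, not the authors' analysis code; it computes only the t statistic and degrees of freedom, leaving the p value to a statistics package.

```python
import math
import statistics

def masking_release(foreign_scores, english_scores):
    """Within-participant difference: foreign-masker minus English-masker score."""
    return [f - e for f, e in zip(foreign_scores, english_scores)]

def one_sample_t(values, mu0=0.0):
    """t statistic and degrees of freedom for H0: mean(values) == mu0."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)            # sample SD (n - 1 denominator)
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1

# Made-up percent-correct scores for four listeners, for illustration only.
english = [40, 30, 50, 36]
dutch = [60, 50, 62, 52]
release = masking_release(dutch, english)
t_stat, df = one_sample_t(release)
```

With 20 listeners the degrees of freedom are 19, as in the reported tests. The least square means reported above imply average releases of roughly 17.5 (Dutch) and 33.5 (Mandarin) percentage points.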

Figure 2. Masking release for data reported in Experiment I. Masking release was calculated by subtracting each subject’s sentence recognition performance in the English masker from their performance in the foreign-language masker (i.e., Dutch minus English, and Mandarin minus English). The magnitude of the masking release was significantly different between the two foreign languages. Specifically, Mandarin allowed for a significantly greater masking release than Dutch. The masking release for both languages was significant.

C. DISCUSSION

Data from monolingual English speakers indicate that when listening to English sentences in competing speech, a competing English masker is most effective, followed by Dutch, and then by Mandarin. These data support the original hypothesis that masker effectiveness decreases as the competing speech becomes phonetically more distant from the target speech. These data suggest that similar phonemes, phonotactics, and other phonetic or phonological structure similarities between a target and a masker speech signal can increase overall masking. However, it must be considered that the different voices used to create the two-talker maskers had different spectral and temporal properties. The long-term average speech spectra (LTASS) of the three maskers are shown in Figure 3. The Mandarin masker has noticeably less energy above 5000 Hz than the English and Dutch maskers. Therefore, it is possible that differences between the maskers other than those that are linguistically driven might have contributed to the significant results observed in Experiment I. The purpose of Experiment II was to attempt to isolate some of these potential spectral-temporal signal-related features across the three two-talker maskers.

Figure 3. Long-term average speech spectra for the three linguistic maskers used in Experiment I. The Mandarin masker has noticeably less energy above 5000 Hz than the English and Dutch maskers.

III. EXPERIMENT II: Spectrally matched steady-state and temporally modulated white-noise maskers

To examine spectral and temporal differences between the masker conditions that could have affected the results of Experiment I, a second experiment was conducted using noise (rather than speech) maskers. Two different sets of noise maskers were created. The first set of noise maskers was spectrally matched to the three two-talker maskers (English, Dutch, and Mandarin) used in Experiment I. This manipulation removed temporal differences between the three maskers, while preserving the long-term spectral content of the original maskers. The second set of noise maskers included three white-noise maskers temporally modulated to match the low-frequency modulations of the three two-talker maskers used in Experiment I. Thus, this manipulation removed all spectral differences between the three maskers, but preserved the low-frequency temporal modulations of the original two-talker maskers.
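One common way to build a temporally modulated white-noise masker is to extract the low-frequency amplitude envelope of the speech masker and multiply it with white noise. The sketch below assumes full-wave rectification followed by a one-pole low-pass filter; the 40 Hz cutoff, the Gaussian noise source, and all function names are assumptions chosen for illustration, since this section does not specify the procedure the authors used.

```python
import math
import random

def extract_envelope(signal, sample_rate, cutoff_hz=40.0):
    """Full-wave rectify, then smooth with a one-pole low-pass filter.

    cutoff_hz is an assumed value, not one taken from the study.
    """
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    envelope, state = [], 0.0
    for sample in signal:
        state += alpha * (abs(sample) - state)
        envelope.append(state)
    return envelope

def modulated_white_noise(signal, sample_rate, cutoff_hz=40.0, seed=0):
    """White Gaussian noise shaped by the envelope of the input signal."""
    rng = random.Random(seed)
    envelope = extract_envelope(signal, sample_rate, cutoff_hz)
    return [e * rng.gauss(0.0, 1.0) for e in envelope]
```

Because the carrier is white noise, the output carries none of the long-term spectral shape of the original masker, which is the property the temporally modulated condition relies on. A spectrally matched steady-state masker would need the complementary operation (shaping noise to the masker's long-term spectrum without imposing its envelope), which is not sketched here.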