Journal of Speech, Language, and Hearing Research. 2020 Apr 15;63(4):1002–1017. doi: 10.1044/2020_JSLHR-19-00345

Vowel and Sibilant Production in Noise: Effects of Noise Frequency and Phonological Similarity

Kevin J. Reilly

Abstract

Purpose

This study investigated vowel and sibilant productions in noise to determine whether responses to noise (a) are sensitive to the spectral characteristics of the noise signal and (b) are modulated by the contribution of vowel or sibilant contrasts to word discrimination.

Method

Vowel and sibilant productions were elicited during serial recall of three-word sequences that were produced in quiet or during exposure to speaker-specific noise signals. These signals either masked a speaker's productions of the sibilants /s/ and /ʃ/ or their productions of the vowels /a/ and /æ/. The contribution of the vowel and sibilant contrasts to word discrimination in a sequence was manipulated by varying the number of times that the target sibilant and vowel pairs occurred in the same word position in each sequence.

Results

Spectral noise effects were observed for both sibilants and vowels: Responses to noise were larger and/or involved more acoustic features when the noise signal masked the acoustic characteristics of that phoneme class. Word discrimination effects were limited and consisted of only small increases in vowel duration. Interaction effects between noise and similarity indicated that the phonological similarity of sequences containing both sibilants and/or both vowels influenced articulation in ways not related to speech clarity.

Conclusion

The findings of this study indicate that sensorimotor control of speech exhibits some sensitivity to noise spectral characteristics. However, productions of sibilants and vowels were not sensitive to their importance in discriminating the words in a sequence. In addition, phonological similarity effects were observed that likely reflected processing demands related to the recall and sequencing of high-similarity words.


Background noise elicits several characteristic changes in speech output that include increases in speech intensity and vocal fundamental frequency (F0) as well as decreases in spectral tilt and speaking rate (Junqua, 1996; Pittman & Wiley, 2001; Rivers & Rastatter, 1985; Tartter et al., 1993; Van Summers et al., 1988). Such findings provide important insights into sensorimotor control policies governing intelligible speech production in adverse conditions (Garnier & Henrich, 2014; Patel & Schell, 2008; Perkell et al., 2007). This study investigates responses to background noise to determine whether the control of speech output is influenced by the spectral characteristics of noise signals and/or the confusability of words in a sequence.

Speech responses to background noise are sensitive to noise levels such that increases in intensity elicit progressively larger changes in speech output (Hanley & Steer, 1949; Perkell et al., 2007; Tartter et al., 1993; Van Summers et al., 1988). It is less clear whether speech responses are also sensitive to the spectral content of noise signals. For example, Junqua et al. (1999) manipulated the spectral tilt of background noise signals and reported that speakers communicating with a speech recognizer increased the distribution of spectral energy in frequency bands where the intensity of the background noise was high. In addition, Stowe and Golob (2013) reported that responses to background noise were greater for noise signals that overlapped more with the speech spectrum than for those that overlapped less with the speech spectrum. In contrast to these findings, Lu and Cooke (2009) did not observe any differences between speech responses to low-pass and high-pass filtered noise signals, even when narrow-band filtered signals were used. Similarly, Garnier and Henrich (2014) found that speech changes in response to a cocktail party noise signal with energy concentrated below 1 kHz were largely similar to speech changes in response to a broadband noise signal with energy concentrated below 10 kHz. However, Garnier and Henrich also observed subtle, “secondary” changes involving, for example, shifts in vocal F0 to regions of minimal energy in the cocktail noise signal. These authors suggested that secondary, frequency-sensitive speech changes supported the communicative function of speech production.

While some component of speech responses to background noise is intended to improve the monitoring of speech accuracy in adverse speaking conditions (Mahl, 1972; Siegel & Pick, 1974), there is considerable evidence supporting the notion that speech changes in noise are also listener oriented and organized to preserve successful communication of a speech message in adverse speaking conditions. For example, speech responses to background noise are larger in contexts that place a premium on intelligible communication than in those that do not (Junqua et al., 1998, 1999; Lane & Tranel, 1971). Related to these findings are the findings of Patel and Schell (2008), who reported that noise-induced changes to word duration and peak F0 were greater during information-bearing words compared to function words. Moreover, Perkell et al. (2007) reported increases in the contrast distances separating the vowels /i/, /u/, /ɛ/, and /æ/ at high and moderate background signal-to-noise levels. Since increases in vowel contrasts are associated with greater speech intelligibility (Bradlow et al., 1996; Moon & Lindblom, 1994; Picheny et al., 1986), Perkell et al. suggested that speakers enhanced intervowel contrast distances to increase the clarity of their speech and counter the adverse effects of background noise on speech communication. Interestingly, Perkell et al. did not observe similar effects for the sibilants /s/ and /ʃ/, as the contrast between these sounds, measured as the distance between the spectral means, tended to decrease even at high signal-to-noise levels. The authors suggested that, because these sibilants are differentiated by the characteristics of their noise spectra, control of the sibilant contrast in noise would be limited since auditory information regarding that contrast would be more susceptible to masking by the noise signal. Together, these findings indicate that speech adjustments promoting comprehension of speech in noise both exploit and are constrained by the frequency content of the noise signal.

In summary, an important question regarding sensorimotor control of speech in noise concerns whether speech responses are sensitive to the spectral characteristics of background noise and whether such responses increase the distinctiveness of individual speech segments. The present question is related to the broader question of goals for speech production. While contemporary models of speech production account for a range of speech phenomena (Guenther, 2016; Houde & Nagarajan, 2011; Tourville & Guenther, 2011), they are not currently constructed to respond to factors that affect listeners' speech comprehension. The potential of a background noise to interfere with a speech message may be predictable (i.e., learned as a forward prediction) given that speech production is an overlearned behavior (Netsell, 1982) and that speakers engage in this behavior in a diverse range of noise conditions. A similar prediction regarding the comprehensibility of an utterance likely mediates the fine-tuned changes in intensity observed with increases in speaker-to-listener distance (Zahorik & Kelly, 2007). When communication conditions are sufficiently adverse, Garnier and Henrich (2014) proposed three possible response types: (a) boosting strategies that consist of global intensity increases, especially in frequency bands where background noise is maximal; (b) bypass strategies that involve shifting spectral energy or, at least, the spectral energy of important cues for speech comprehension to frequency bands where background noise is minimal; and (c) modulation strategies that increase the modulation of vocal F0 and intensity. In this study, changes in the production of vowels and sibilants in noise were evaluated and compared to the boosting and bypassing strategies proposed by Garnier and Henrich.

This study focused on production of vowel and sibilant sounds because background noise has been shown to elicit different acoustic changes in these sound classes and because of the suggestion that these differences are due in part to their different spectral characteristics (Perkell et al., 2007). The effects of noise spectra were investigated by analyzing production of the sibilants /s/ and /ʃ/ and the vowels /a/ and /æ/ in quiet and in background noise signals that were shaped to individual speakers' sibilant productions (sibilant-shaped noise) and vowel productions (vowel-shaped noise). The vowels /a/ and /æ/ were selected because they have been shown to be particularly confusable in noise (Phatak & Allen, 2007), and the sibilants /s/ and /ʃ/ were selected because of the similarities in their articulation and because noise-induced changes in these sounds have been documented previously (Perkell et al., 2007). Target phonemes were embedded in words and elicited using an immediate serial recall paradigm that involved overt recall of three-word sequences. The effects of word confusability were investigated by systematically varying the importance of the sibilant and vowel contrasts to the discrimination of sequence words. This manipulation was accomplished by varying the phonological similarity of words in a sequence. Sibilant and vowel productions were investigated to assess whether production of these sounds (a) is differentially affected by noise signals that were spectrally similar versus dissimilar to speakers' production of those sounds, (b) is modulated by the contribution of sibilant and/or vowel contrasts to discrimination of words in a sequence, and (c) is modulated by the interaction between noise spectral characteristics and contrastive importance.

Method

Participants

Participants in this study were 10 male speakers (M = 25.7 years, SD = 6.2) and four female speakers (M = 24.3 years, SD = 5.7). Participants were all native speakers of English with no reported history of speech, language, or hearing impairments. All protocols, including participant recruitment and experimental procedures, were approved by the institutional review board where the research was conducted.

Experimental Protocol

The experiment was conducted in a single-wall audio booth (Acoustic Systems, Model RE-147 S) containing a computer monitor for displaying speech stimuli. The experimental protocol consisted of three runs containing 60 trials each. The task was an immediate serial recall task involving three-word sequences that were presented one word at a time on the computer monitor. The length of the sequences was limited to three to avoid especially large error rates associated with longer, phonologically similar sequences (Baddeley, 1968; Drewnowski & Murdock, 1980; Henson, 1996; Jared & Seidenberg, 1990; Reilly & Spencer, 2013a). Each word was presented for a duration of 1.5 s, and the interval between presentations was .75 s. After a variable delay of between .75 and 2.0 s, the prompt 〈recall〉 appeared on the monitor, and speakers produced the three words in the order they were presented.

Speech signals were transduced with a headworn directional microphone (AKG Model C520) placed approximately 5.5 cm from the speaker's lips. The microphone signal was preamplified (Mackie VLZ3) and digitized using an external sound card (Delta 44, M-Audio) at a sampling rate of 48 kHz. Speakers received auditory feedback of their speech output by streaming the incoming audio signal to a digital output buffer and routing it back to the speaker via Etymotic ER4 microPro earphones. The delay between the input and output speech streams was approximately 15 ms, and the gain of the output stream was approximately 12 dB relative to the input stream. Following each trial, a copy of the speech microphone signal was saved to the computer's hard drive.

On one third of the trials, speakers produced the sequences in quiet. On another third of trials, speakers heard a vowel-shaped noise signal, and on the remaining third of trials, speakers heard a sibilant-shaped noise signal. The different noise signals were presented on random trials in a run, and the same noise condition was presented on no more than two consecutive trials. Noise signals were mixed with the speech output stream from the external sound card. The noise was presented 500 ms prior to the display of the first word in a sequence and continued throughout the presentation and recall of word sequences. As a result, speakers heard the noise signal and their speech feedback during recall of the sequences. The signal-to-noise ratio of the mixture and the method for deriving it are described below (see Noise Signals section).

Stimuli

Stimuli contained multiple instances of the target sibilants /s/ and /ʃ/ and the target vowels /a/ and /æ/ in sequences where these contrasts were more or less important to discrimination of the sequence words.

Discriminability was expressed in terms of phonological similarity, which was coded as either "low," "medium," or "high" based on the word-initial consonants and vowels in a sequence. Sequences with high phonological similarity contained a pair of sibilant–vowel words, either "sock" and "shack" or "sat" and "shot," and a stop diphthong word (e.g., "doubt"). An example of a high-similarity sequence is "shack sock doubt." Medium-similarity sequences contained a sibilant–vowel word, a nasal vowel word (either "knock," "knack," "gnat," or "knot"), and a stop diphthong word. The sequence "knack sock doubt" is an example of a medium-similarity sequence. In these sequences, the sibilant–vowel word and the nasal vowel word had the same final consonant but different target vowels (e.g., "sat" was paired with "knot," and "shack" was paired with "knock"). Medium-similarity sequences were less similar than high-similarity sequences because of the differences in the initial consonants of their words. Sequences with low phonological similarity were composed of a sibilant–vowel word and two stop diphthong words. An example of a low-similarity sequence is "guide sock boat." The final consonants of words in these sequences were either stops or affricates. These sequences were considered minimally similar because the vowels consisted of both monophthongs and diphthongs and because the sequences did not contain words whose contrast depended on discrimination of /a/ and /æ/ or /s/ and /ʃ/. For these reasons, and because listeners are quite accurate at discriminating diphthongs in noise (Cutler et al., 2004), the words in these sequences were unlikely to be confused in noise. High-, medium-, and low-similarity sequences were presented an equal number of times in the experiment (i.e., 60 trials each). This resulted in 150 productions of the target vowels /a/ and /æ/ and 120 productions of the target sibilants /s/ and /ʃ/.

Noise Signals

During the presentation and production of each sequence, speakers were exposed to one of three noise conditions: (a) quiet, (b) vowel-shaped noise, and (c) sibilant-shaped noise. The noise signals were generated by passing white noise through speaker-specific sibilant and vowel filter functions reflecting the distribution of spectral energy for each speaker's productions of /a/ and /æ/ (vowel filter) and /s/ and /ʃ/ (sibilant filter). Recordings of speakers' productions of the target phonemes were obtained from a preexperiment practice session, during which speakers produced the individual words comprising the three-word sequences used in the experiment. The sibilant–vowel words were each presented three times during the practice session, resulting in six productions each of /s/, /ʃ/, /a/, and /æ/.

Immediately following the practice session, sound files containing productions of the target phonemes were loaded into a graphical user interface (GUI), which displayed automatically derived onsets and offsets for vowel and sibilant phonemes as well as the first and second formants (F1 and F2) of the target vowel(s). Onsets and offsets were automatically derived based on the rate of change of spectral features associated with sibilant and vowel sounds (Reilly & Spencer, 2013a, 2013b). The experimenter had the option of correcting inaccuracies in the estimated onsets and offsets, and then root-mean-square (RMS) amplitudes of target vowel and sibilant productions were calculated. The vowel and sibilant RMS amplitudes were used to set the intensity levels of the noise signals (see below). F1 and F2 time series were derived for vowel productions using linear predictive coding (LPC) analysis. The order of the LPC model was adjusted to the value that yielded the most accurate formant estimates based on an examination of the corresponding spectrogram. Following this, LPC spectra of the vowels produced during the practice session were calculated using the model order that had provided the best fit to the speaker's data. The average LPC spectra for /a/ and /æ/ were then derived for each speaker and provided the basis for constructing speaker-specific vowel filter functions. For each sibilant produced during the practice session, a power spectrum was calculated using Welch's method: the signal was segmented into windows of 512 points (approximately 11 ms) that were overlapped by 50%, and the windowed spectra were averaged. The average spectra of /s/ and /ʃ/ were calculated and used to derive the speaker-specific sibilant filter functions.
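For readers wishing to reproduce the spectral estimation, the following is a minimal sketch of the Welch computation described above, written in Python with NumPy/SciPy (an assumption; the paper does not name its analysis software). The 512-point, 50%-overlap segmentation follows the text; the Hamming taper is an assumption, chosen to match the windowing used elsewhere in the analysis.

```python
import numpy as np
from scipy.signal import welch

FS = 48_000  # sampling rate (Hz)

def sibilant_spectrum(sibilant: np.ndarray):
    """Average power spectrum of one parsed sibilant via Welch's method:
    512-point (~11 ms at 48 kHz) segments, 50% overlap, averaged."""
    freqs, psd = welch(
        sibilant,
        fs=FS,
        window="hamming",  # assumed taper
        nperseg=512,       # ~10.7 ms at 48 kHz
        noverlap=256,      # 50% overlap
    )
    return freqs, 10 * np.log10(psd)  # dB scale for reading off filter edges
```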

Filter parameters were derived from the averaged spectra and used to construct Butterworth bandpass filters corresponding to each speaker's vowel and sibilant filter functions. For the vowel filter, the first passband frequency was set to the lower of the F1 averages for /a/ and /æ/, and the second passband frequency was set to the average F2 for /æ/, which is higher than the F2 for /a/. The stopband frequencies were set to the 10-dB down points in the averaged spectra located above and below each passband. Attenuation at the stopband frequencies was set to 6 dB.

For the sibilant filter, the first bandpass frequency was set to the lowest frequency local peak whose amplitude was within 3 dB of the global peak of the averaged spectrum for /ʃ/. The second bandpass frequency was set to the highest frequency local peak whose amplitude was within 3 dB of the global peak of the average spectrum for /s/. The stopband frequencies were set to the 20-dB down points of the averaged spectra above and below the bandpass frequencies. Attenuation at the stopband frequencies was set to 6 dB.

The decision to set the stopband of the vowel filter to the 10-dB down point and the stopband of the sibilant filter to the 20-dB down point was motivated by the distributional characteristics of vowel and sibilant spectra revealed during pilot testing. The spectrum for /ʃ/ tended to be quite skewed, such that the drop-off from the peak amplitude on the low-frequency side was quite steep. As a result, it was not possible to reliably fit a sibilant filter function to the 3-dB down point (the first passband) and the 10-dB down point (the first stopband). In addition, using a 10-dB down stopband did not adequately "cover" the low-frequency energy of /ʃ/. At the same time, a vowel filter function with a 20-dB down stopband overlapped the F0 of many speakers, which could have altered responses to vowel noise in ways that sibilant noise did not. Accordingly, different stopbands were used to better accommodate the different spectral characteristics of the vowel and sibilant sounds.
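As a rough illustration of the filter construction, the sketch below fits a Butterworth bandpass to the passband and stopband edges read off a speaker's averaged spectra, again in Python/SciPy. The 6-dB stopband attenuation follows the text; the 3-dB passband ripple is an assumption, since the paper does not report it.

```python
from scipy.signal import buttord, butter

FS = 48_000  # sampling rate (Hz)

def design_filter(pass_lo, pass_hi, stop_lo, stop_hi):
    """Butterworth bandpass fitted to the spectral edges (Hz) derived
    from a speaker's averaged vowel or sibilant spectra."""
    order, wn = buttord(
        wp=[pass_lo, pass_hi],  # passband edges from the averaged spectrum
        ws=[stop_lo, stop_hi],  # 10- or 20-dB down points (stopband edges)
        gpass=3,                # assumed maximum passband ripple (dB)
        gstop=6,                # stopband attenuation per the text (dB)
        fs=FS,
    )
    return butter(order, wn, btype="bandpass", output="sos", fs=FS)
```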

Vowel- and sibilant-shaped noise signals were generated by passing white noise through the filter functions for each speaker. The amplitudes of the noise signals were set to the average of the vowel and sibilant RMS amplitudes derived earlier. The noise signals were then set to identical dB A levels by dividing them by their A-weighted RMS values. This latter step controlled for differences in the perceived loudness of the noise signals at different frequencies. Analysis of the pretest productions and derivation of the noise signals was typically completed in less than 5 min. The left panel of Figure 1 displays the individual and averaged spectra (thin and thick lines, respectively) from one speaker's test productions of /s/ (black) and /ʃ/ (gray). The solid vertical lines denote the passband frequencies, and the dashed vertical lines denote the stopband frequencies. The thick red line represents the sibilant filter function. The corresponding vowel spectra and filter function are displayed in the right panel of Figure 1. As evident in this figure, the method for scaling the intensities of the noise signals produced lower signal-to-noise ratios for sibilant phonemes than for vowel phonemes. This was a consequence of the decision to set the amplitudes of the noise signals to the same level and of the fact that the vowel sounds had larger amplitudes than the sibilant sounds.
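The generation and level-matching steps might look like the sketch below, which filters white noise through a speaker-specific filter (e.g., from the `design_filter` sketch above) and scales it to unit A-weighted RMS. The standard IEC 61672 A-weighting magnitude curve is applied in the frequency domain; this is an assumed implementation, as the paper does not describe how the A-weighting was computed.

```python
import numpy as np
from scipy.signal import sosfilt

FS = 48_000  # sampling rate (Hz)

def a_weight_mag(freqs):
    """IEC 61672 A-weighting magnitude response (linear gain)."""
    f2 = np.asarray(freqs, dtype=float) ** 2
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return ra * 10 ** (2.0 / 20.0)  # normalized so the gain at 1 kHz is ~1

def a_weighted_rms(x):
    """RMS of the signal after zero-phase A-weighting in the frequency domain."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / FS)
    weighted = np.fft.irfft(spectrum * a_weight_mag(freqs), n=len(x))
    return np.sqrt(np.mean(weighted**2))

def shaped_noise(sos, dur_s=60.0, seed=0):
    """White noise passed through a speaker-specific filter, scaled to
    unit A-weighted RMS so both noise types play at matched dB A levels."""
    rng = np.random.default_rng(seed)
    noise = sosfilt(sos, rng.standard_normal(int(dur_s * FS)))
    return noise / a_weighted_rms(noise)
```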

Figure 1. Sibilant (left) and vowel (right) spectral data and filter functions for one speaker. For each panel, spectra from individual pretest productions are indicated with thin lines, and the average spectra for each target sibilant (left) and vowel (right) are indicated with thick lines. The average spectra were used to derive the passband frequencies (solid vertical lines) and the stopband frequencies (dashed vertical lines) for the sibilant and vowel filter functions (thick red lines in each panel).

Analysis

Data from each trial were loaded into a GUI that displayed the microphone signal, the preemphasized microphone signal, and the broadband spectrogram. Playback of the microphone signal was used to identify errors in recall of the three-word sequence, and trials containing errors were excluded from acoustic analysis. Information displayed in the GUI was used to manually identify the onset(s) and offset(s) of target sibilant and vowel productions in each trial. Seven acoustic features were extracted from the parsed productions of /s/ and /ʃ/: spectral moments 1–4 (i.e., spectral mean, spectral variance, skewness, and kurtosis), spectrum peak frequency, sibilant intensity, and sibilant duration. Spectral moments and sibilant intensities were derived using a 40-ms window that was passed through the sibilant waveform in 10-ms increments. Since sibilant durations were rarely, if ever, divisible by 10 ms, a portion of each sibilant production was inevitably left "unwindowed" and excluded from analysis. The amount of unwindowed data was divided equally between the beginning and end of the production so that data sampling did not favor sibilant onset or sibilant offset.
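As a small illustration of the window placement just described, the following sketch computes 40-ms window start indices at 10-ms increments, splitting any "unwindowed" remainder evenly between the onset and offset of the production (it assumes the production is at least one window long).

```python
import numpy as np

FS = 48_000
WIN = int(0.040 * FS)  # 40-ms analysis window (1,920 samples)
HOP = int(0.010 * FS)  # 10-ms increment (480 samples)

def window_starts(n_samples):
    """Start indices of analysis windows centered within the production."""
    n_win = 1 + (n_samples - WIN) // HOP   # windows that fit fully
    covered = (n_win - 1) * HOP + WIN      # samples actually analyzed
    offset = (n_samples - covered) // 2    # split the remainder evenly
    return offset + np.arange(n_win) * HOP
```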

The intensity of a sibilant production was measured by calculating the RMS amplitude of each 40-ms window of data spanning the sibilant's production. The resulting time series of RMS amplitudes was then converted to dB using an arbitrary reference, and the average of the dB values represented the sibilant's intensity. To calculate spectral moments, each 40-ms window of data was preemphasized, and then a Hamming window was applied to the signal. A 1,024-point fast Fourier transform was then calculated, and the first four spectral moments were derived from the resulting energy-by-frequency distribution using the methods described in previous studies (Forrest et al., 1988; Jongman et al., 2000). Time series of the spectral moments were then averaged for each sibilant production. Spectrum peak frequency was calculated from a 40-ms window centered at the middle of the sibilant. Lastly, sibilant duration was determined as the time span between the sibilant and vowel onsets.
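A minimal sketch of the per-window moment computation is given below, following the distribution-based definitions of Forrest et al. (1988). The preemphasis coefficient (0.98) is an assumption, and the FFT length here is tied to the frame length rather than the 1,024 points reported above, so that no samples of the 40-ms frame are truncated.

```python
import numpy as np

FS = 48_000  # sampling rate (Hz)

def spectral_moments(frame, preemph=0.98):
    """First four spectral moments of one analysis window, treating the
    normalized power spectrum as a distribution over frequency."""
    frame = np.append(frame[0], frame[1:] - preemph * frame[:-1])  # preemphasis
    frame = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / FS)
    p = power / power.sum()                           # energy-by-frequency distribution
    m1 = np.sum(p * freqs)                            # spectral mean (Hz)
    m2 = np.sum(p * (freqs - m1) ** 2)                # spectral variance
    m3 = np.sum(p * (freqs - m1) ** 3) / m2 ** 1.5    # skewness
    m4 = np.sum(p * (freqs - m1) ** 4) / m2 ** 2 - 3  # (excess) kurtosis
    return m1, m2, m3, m4
```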

Four acoustic features were extracted from each vowel production: vowel F1, F2, intensity, and duration. For this analysis, the microphone signal was down-sampled by a factor of four to a sampling rate of 12 kHz, and a 24-ms analysis window was incremented in 12-ms steps through the vowel waveform. Like the measurement of sibilant intensity, each vowel's intensity was measured by calculating the RMS amplitude of each data window spanning the vowel's production. These RMS amplitudes were converted to dB, and the average dB value represented the intensity of that vowel. To calculate vowel F1 and F2, each data window was preemphasized, and a Hamming window was applied to the resulting signal. LPC analysis was used to calculate the F1 and F2 values of the resulting data. Time series of the F1 and F2 values were superimposed on a broadband spectrogram of the speech signal to evaluate their accuracy, and a user adjusted the order of the LPC model as necessary to ensure a good fit between the derived formant values and the amplitude peaks in the spectrogram. In general, the model order for female speakers varied between 10 and 11, and the model order for male speakers varied between 12 and 14. The "steady-state" portion of the vowel was then extracted, and the average F1 and F2 values were calculated. The vowel steady state was identified as the F1 and F2 values following the formant transition out of the initial consonant and preceding the formant transition into the final consonant.
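The formant step could be sketched as below, using librosa's LPC fit (an assumption; the paper does not name its LPC implementation). Formants are read from the angles of the complex roots of the LPC polynomial, and the default model order of 12 matches the male-speaker range reported above. The frame is assumed to be already down-sampled, preemphasized, and Hamming-windowed as described.

```python
import numpy as np
import librosa

FS_DOWN = 12_000  # after down-sampling by a factor of four

def lpc_formants(frame, order=12):
    """F1 and F2 estimates from the roots of the LPC polynomial for one
    24-ms analysis frame."""
    a = librosa.lpc(frame.astype(float), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]     # keep one root per conjugate pair
    freqs = np.angle(roots) * FS_DOWN / (2 * np.pi)
    freqs = np.sort(freqs[freqs > 90.0])  # discard near-DC roots
    return freqs[:2]                      # F1, F2
```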

Results

The means and standard errors of speakers' sibilant and vowel acoustic features are displayed in Table 1. Data are displayed by phoneme for each sex. Sibilant and vowel intensities are omitted from this table because the dB levels for these variables were derived using an arbitrary reference. The values for M1 in Table 1 differ from those of previous studies and, specifically, are larger than previously reported (Fox & Nissen, 2005; Jongman et al., 2000). This is likely a result of the higher sampling rate (i.e., 48 kHz) used in this study compared to previous studies of spectral moments, which typically use a sampling rate of 22050 Hz. In contrast, Nittrouer (1995) sampled audio at 50 kHz and reported values that were comparable to the ones observed in this study. To further evaluate the reason for the higher M1 values, the acoustic signals were down-sampled to 24 kHz, and the spectral measures were recalculated. These results are displayed in Table 2 and are generally consistent with the findings of studies that sampled at 22050 Hz.

Table 1.

Means and standard errors (parentheses) of sibilant (left) and vowel (right) features across experimental conditions by phoneme and sex.

                    Sibilant feature averages a                                                                      Vowel feature averages a
Sex      Phoneme    M1 (Hz)       M2 (kHz)      M3 (skewness)  M4 (kurtosis)  Peak (Hz)    Duration (s)    Phoneme   F1 (Hz)    F2 (Hz)     Duration (s)
Female   /s/        10301 (192)   3158 (348)    0.874 (0.10)   3.170 (0.47)   9338 (287)   0.169 (0.016)   /a/       939 (29)   1568 (34)   0.195 (0.007)
         /ʃ/        6327 (160)    7827 (1081)   1.440 (0.09)   4.217 (0.46)   3758 (157)   0.168 (0.014)   /æ/       962 (45)   2030 (26)   0.197 (0.007)
Male     /s/        8378 (296)    5156 (541)    0.756 (0.14)   4.237 (0.72)   6376 (495)   0.166 (0.006)   /a/       760 (17)   1382 (33)   0.160 (0.008)
         /ʃ/        5327 (270)    7697 (600)    1.594 (0.17)   2.782 (0.36)   2772 (103)   0.169 (0.006)   /æ/       744 (14)   1770 (35)   0.163 (0.007)

Note. Sibilant values in the table were derived using a sampling rate of 48 kHz.

a Sibilant and vowel intensities are omitted as these values were derived using an arbitrary reference.

Table 2.

Means and standard errors (parentheses) of sibilant spectral moments after down-sampling audio to 24 kHz.

                    Sibilant feature averages (down-sampled)
Sex      Phoneme    M1 (Hz)      M2 (kHz)     M3 (skewness)   M4 (kurtosis)   Peak (Hz)
Female   /s/        8680 (111)   986 (40)     −1.242 (0.18)   0.402 (0.80)    9048 (298)
         /ʃ/        5425 (94)    3156 (266)   0.656 (0.07)    0.923 (0.10)    3758 (157)
Male     /s/        7261 (280)   2326 (225)   −0.040 (0.28)   0.801 (0.40)    6360 (496)
         /ʃ/        4532 (196)   3485 (392)   0.923 (0.16)    −0.016 (0.26)   2772 (103)

Sibilant Productions

Effects of Noise Type

For each speaker, the seven sibilant acoustic features were averaged by noise condition, phonological similarity, and phoneme (/s/ or /ʃ/). Figure 2A displays the mean values of speakers' spectral moments by noise type for low-similarity (left column), medium-similarity (middle column), and high-similarity (right column) sequences. From top to bottom, the panels depict results for spectral mean (M1), spectral variance (M2), skewness (M3), and kurtosis (M4). Figure 2B displays the corresponding findings for peak frequency (top row), intensity (middle row), and duration (bottom row). In each panel, the results for /s/ are depicted with circles, and the results for /ʃ/ are depicted with squares. The dashed gray line in each panel denotes the averaged findings for /s/ and /ʃ/.

Figure 2. (A) The mean values of speakers' spectral moments by noise type for low-similarity (left column), medium-similarity (middle column), and high-similarity (right column) sequences. From top to bottom, the panels display the results for spectral mean (M1), spectral variance (M2), skewness (M3), and kurtosis (M4). In each panel, the results for /s/ are depicted with circles, the results for /ʃ/ are depicted with squares, and the dashed gray line denotes the averaged findings for /s/ and /ʃ/. (B) The corresponding findings for peak frequency (top row), intensity (middle row), and duration (bottom row). Med = medium.

Three thousand five hundred eighty-three vowel productions and 2,811 sibilant productions were available for the analysis of noise effects. Repeated-measures analyses of variance (ANOVAs) were performed on each feature to evaluate the effects of noise (“quiet,” “vowel noise,” and “sibilant noise”), similarity (“low,” “medium,” and “high”), and phoneme (/s/ and /ʃ/) on sibilant production. For each of the ANOVAs, Mauchly's test of sphericity was performed to identify violations of the homogeneity of variance assumption. For tests that violated this assumption, the adjusted degrees of freedom and Greenhouse–Geisser corrected p values are reported. Bonferroni correction was used to control for Type I error associated with performing seven ANOVAs, and a Bonferroni-corrected p value of .0071 was used to determine statistical significance. Acoustic differences between /s/ and /ʃ/ have been documented in previous studies (Jongman et al., 2000; Nittrouer, 1995; Shadle & Mair, 1996) and are not reported here.
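For concreteness, one of these tests might be run as in the sketch below using the pingouin package (an assumption; the paper does not name its statistics software). `df` is assumed to be a long-format table with one row per speaker and noise condition containing that speaker's mean value of a feature; the full design additionally crossed similarity and phoneme.

```python
import pingouin as pg

def noise_anova(df):
    """Mauchly's sphericity test plus a repeated-measures ANOVA on one
    sibilant feature; correction=True returns the Greenhouse-Geisser
    corrected p value used when sphericity is violated."""
    spher = pg.sphericity(df, dv="M1", subject="speaker", within="noise")
    aov = pg.rm_anova(df, dv="M1", subject="speaker", within="noise",
                      correction=True)
    return spher, aov

# Significance is then judged against the Bonferroni-corrected
# threshold of p = .0071 (.05 / 7 features).
```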

The effects of noise type on sibilant features are summarized in Table 3. These analyses revealed main effects of noise type for spectral mean (M1), F(2, 26) = 20.30, p < .0001; kurtosis (M4), F(2, 26) = 7.76, p < .005; spectral peak frequency, F(2, 26) = 7.14, p < .005; sibilant intensity, F(2, 26) = 40.39, p < .0001; and sibilant duration, F(1.24, 16.01) = 18.51, p < .0005. Noise type effects were not observed for spectral variance (M2), F(2, 26) = 1.46, p = .25, or spectral skew (M3), F(2, 26) = 2.69, p = .07.

Table 3.

Summary of results from repeated-measures analysis of variance tests evaluating the main effects of noise, similarity, and their interaction on sibilant features.

Sibilant feature    Effect                           F       p        Pairwise differences between noise conditions
M1                  Noise                            20.30   < .001   Sibilant–quiet: 132 Hz (p < .005); Sibilant–vowel: —; Vowel–quiet: 116 Hz (p < .0005)
                    Similarity                       1.10    .348
                    Noise × Sibilant                 2.87    .075
                    Noise × Similarity               0.34    .852
                    Sibilant × Similarity            0.78    .471
                    Noise × Sibilant × Similarity    1.15    .344
M2                  Noise                            1.46    .25
                    Similarity                       1.15    .331
                    Noise × Sibilant                 5.43    .011
                    Noise × Similarity               0.77    .495
                    Sibilant × Similarity            0.68    .513
                    Noise × Sibilant × Similarity    1.42    .257
M3                  Noise                            2.69    .070
                    Similarity                       0.30    .741
                    Noise × Sibilant                 3.52    .044
                    Noise × Similarity               0.78    .542
                    Sibilant × Similarity            0.20    .820
                    Noise × Sibilant × Similarity    0.67    .618
M4                  Noise                            7.76    < .005   Sibilant–quiet: —; Sibilant–vowel: —; Vowel–quiet: —
                    Similarity                       0.21    .810
                    Noise × Sibilant                 2.79    .080
                    Noise × Similarity               0.36    .836
                    Sibilant × Similarity            0.29    .752
                    Noise × Sibilant × Similarity    0.81    .522
Spectral peak       Noise                            7.14    < .005   Sibilant–quiet: 181 Hz (p < .0167); Sibilant–vowel: —; Vowel–quiet: 152 Hz (p < .01)
                    Similarity                       0.70    .504
                    Noise × Sibilant                 0.09    .918
                    Noise × Similarity               1.01    .411
                    Sibilant × Similarity            0.76    .480
                    Noise × Sibilant × Similarity    2.37    .126
Relative intensity  Noise                            42.21   < .001   Sibilant–quiet: 1.0 dB (p < .0001); Sibilant–vowel: 0.4 dB (p < .01); Vowel–quiet: 0.6 dB (p < .0001)
                    Similarity                       2.02    .153
                    Noise × Sibilant                 0.98    .389
                    Noise × Similarity               0.10    .982
                    Sibilant × Similarity            0.14    .872
                    Noise × Sibilant × Similarity    2.57    .048
Sibilant duration   Noise                            18.51   < .001   Sibilant–quiet: 11 ms (p < .001); Sibilant–vowel: 9 ms (p < .001); Vowel–quiet: —
                    Similarity                       0.86    .434
                    Noise × Sibilant                 3.75    .037
                    Noise × Similarity               0.42    .796
                    Sibilant × Similarity            0.14    .870
                    Noise × Sibilant × Similarity    0.52    .720

Note. The main effects of phoneme (/s/ vs. /ʃ/) are not displayed as these have been previously documented in the research literature. Dashes in the table denote nonsignificant findings.

Multiple comparisons were carried out to evaluate pairwise differences between noise conditions for each of the sibilant features that exhibited a main effect of noise condition. A Bonferroni-adjusted threshold of p < .0167 was used to identify significant differences. This analysis revealed that, compared to quiet, spectral means (M1) increased significantly during both vowel noise, t(13) = 5.64, p < .0005 (mean increase = 116 Hz), and sibilant noise, t(13) = 3.28, p < .005 (mean increase = 132 Hz). No difference was observed between the spectral means during vowel versus sibilant noise, t(13) = 0.29, p = .61. Pairwise comparisons of kurtosis values during the different noise types did not yield any significant differences across noise conditions at the corrected threshold for significance. To evaluate whether kurtosis changes were a general response to noise, Helmert contrasts were used to compare kurtosis values in quiet to those in vowel and sibilant noise. This comparison identified a significant difference between quiet and noise, t(13) = 2.48, p < .0167, with kurtosis values decreasing in noise (i.e., spectra were flatter and less peaked).
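The Helmert contrast described here reduces to a paired t test of quiet against the average of the two noise conditions; a minimal sketch, assuming one mean kurtosis value per speaker in each condition:

```python
import numpy as np
from scipy.stats import ttest_rel

def quiet_vs_noise(quiet, vowel_noise, sibilant_noise):
    """Helmert-style contrast: quiet versus the mean of the two noise
    conditions, one value per speaker (n = 14)."""
    pooled = (np.asarray(vowel_noise) + np.asarray(sibilant_noise)) / 2.0
    return ttest_rel(np.asarray(quiet), pooled)
```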

Spectral peak frequencies were also significantly greater during vowel and sibilant noise conditions compared to quiet. On average, peak frequencies increased by 152 Hz during vowel noise, t(13) = 3.01, p < .01, and 181 Hz during sibilant noise, t(13) = 2.44, p < .0167. Differences between peak frequencies during vowel noise and sibilant noise were not significant, t(13) = 1.05, p = .90. Compared to quiet, sibilant intensities were also significantly greater during vowel noise, t(13) = 6.04, p < .0001 (mean increase = 0.57 dB), and sibilant noise, t(13) = 6.62, p < .0001 (mean increase = 1.0 dB). In addition, sibilant intensities were significantly greater during sibilant noise than they were during vowel noise, t(13) = 3.00 (mean increase = 0.46 dB). Lastly, sibilant durations were significantly greater during sibilant noise than during quiet, t(13) = 3.80, p < .001 (mean increase = .011 s), and vowel noise, t(13) = 3.80, p < .001 (mean increase = .010 s). Differences in sibilant duration were not observed between quiet and vowel noise, t(13) = 1.20, p = .13.

Effects of Phonological Similarity

Sibilant acoustic features were also analyzed to determine whether the phonological similarity of words in a sequence altered production of the sibilants /s/ and /ʃ/. Similarity effects on sibilant productions are displayed in Figures 2A and 2B. In these figures, means and standard errors of sibilant features during low-similarity sequences are shown in the panels on the left, values during medium-similarity sequences are shown in the middle column of panels, and values during high-similarity sequences are shown in the panels on the right. The results of this analysis are shown in Table 3, which indicates that main effects of phonological similarity were not observed for any of the sibilant features (p > .16).

Interaction Effects During Sibilant Productions

Analyses of interaction effects between sibilant phoneme and noise type evaluated whether responses to noise differed during production of /s/ versus /ʃ/. The results are displayed in Table 3, which shows that this analysis failed to identify differences in the production of /s/ versus /ʃ/ during the different noise types for any of the seven sibilant features (p > .01). In addition, interactions between sibilant phoneme and similarity were not significant for any sibilant features (p > .47; see Table 3).

Lastly, an analysis of interaction effects between noise and similarity was performed to determine whether responses to noise were modulated by the similarity of words in a sequence. As indicated in Table 3, this analysis failed to identify any interactions between noise condition and phonological similarity for any of the seven sibilant features (p > .11). Similarly, the three-way interaction between noise, sibilant phoneme, and similarity was not significant for any sibilant features (p ≥ .048).

Vowel Productions

Effects of Noise Type

Vowel formant, intensity, and duration values were averaged by noise condition, phonological similarity, and vowel phoneme (/a/ or /æ/) for each speaker and analyzed to determine whether vowel production changed in the different noise conditions. Mean values of vowel features are displayed by noise type in the panels of Figure 3. Panel columns depict the effects of low-similarity (left), medium-similarity (middle) and high-similarity (right) sequences. From top to bottom, panel rows depict vowel F1, F2, intensity, and duration. Mean values for /a/ are depicted with circles, and those for /æ/ are depicted with squares. Gray dashed lines denote the averages of /a/ and /æ/.

Figure 3. Each panel displays the mean values of vowel features by noise type. Panel columns depict the effects of low-similarity (left), medium-similarity (middle), and high-similarity (right) sequences. From top to bottom, panel rows depict vowel F1, F2, intensity, and duration. Mean values for /a/ are depicted with circles, and those for /æ/ are depicted with squares; gray dashed lines denote the averages of /a/ and /æ/.

Separate repeated-measures ANOVAs evaluated the main effects of noise on the four vowel features using a Bonferroni-corrected threshold of p < .0125 to identify significant differences. The results of these analyses are summarized in Table 4. Main effects of vowel phoneme (i.e., /a/ vs. /æ/) are not displayed in this table as the acoustic differences between these vowels have been previously documented (Stevens, 2000).

Table 4.

Summary of results from repeated-measures analysis of variance tests evaluating main effects of noise type, similarity, and their interaction on vowel features.

Vowel feature       Effect                        F       p        Pairwise differences
F1                  Noise                         14.04   < .001   Sibilant–quiet: —; Sibilant–vowel: —; Vowel–quiet: 14 Hz (p < .001)
                    Similarity                    2.68    .090
                    Noise × Vowel                 2.57    .096
                    Noise × Similarity            0.69    .502
                    Vowel × Similarity            1.21    .302
                    Noise × Vowel × Similarity    0.65    .629
F2                  Noise                         5.18    .013
                    Similarity                    4.52    .020
                    Noise × Vowel                 2.86    .076
                    Noise × Similarity            0.72    .580
                    Vowel × Similarity            3.91    .033
                    Noise × Vowel × Similarity    2.99    .027
Relative intensity  Noise                         21.81   < .001   Sibilant–quiet: 0.6 dB (p < .001); Sibilant–vowel: —; Vowel–quiet: 0.7 dB (p < .0005)
                    Similarity                    5.12    .013
                    Noise × Vowel                 9.99    .001     /a/ vs. /æ/: Sibilant–quiet —; Sibilant–vowel —; Vowel–quiet 0.4 dB (p < .001)
                    Noise × Similarity            3.84    .008     Vowel–quiet: Med–low —, High–low —, High–medium —; Sibilant–quiet: Med–low —, High–low —, High–medium —; Vowel–sibilant: Med–low —, High–low −0.4 dB (p < .01), High–medium 0.5 dB (p < .0167)
                    Vowel × Similarity            0.18    .839
                    Noise × Vowel × Similarity    1.66    .173
Vowel duration      Noise                         2.86    .080
                    Similarity                    10.29   < .001   Medium–low: 6 ms (p < .0167); High–low: —; High–medium: —
                    Noise × Vowel                 3.66    .040
                    Noise × Similarity            2.10    .094
                    Vowel × Similarity            1.15    .333
                    Noise × Vowel × Similarity    2.86    .032

Note. The main effects of phoneme (/a/ vs. /æ/) are not displayed as these have been previously documented in the research literature. Dashes in the table denote nonsignificant findings.

Main effects of noise type were observed for F1, F(2, 26) = 14.04, p < .0001, and vowel intensity, F(2, 26) = 21.81, p < .0001. Noise type effects were not observed for F2, F(2, 26) = 5.18, p = .013, or vowel duration, F(2, 26) = 2.86, p = .08. Multiple comparisons analysis revealed that F1 values were significantly higher during the vowel noise than during quiet, t(13) = 3.84, p < .001 (mean difference = 14 Hz). No difference between F1 values during quiet and sibilant noise, t(13) = 2.39, p = .019, or between vowel and sibilant noise, t(13) = 1.29, p = .11, was detected. A comparison of vowel intensity levels across noise types revealed that, compared to quiet, vowel intensity increased an average of 0.68 dB during vowel noise, t(13) = 3.98, p < .001, and an average of 0.67 dB during sibilant noise, t(13) = 5.89, p < .001. Differences in intensity levels were not observed between the vowel noise and sibilant noise conditions, t(13) = 2.06, p = .97.

Vowel Productions and Similarity

Similarity effects were also evaluated in speakers' vowel productions. For this analysis, vowels from nasal vowel words (i.e., “knock,” “knack,” “gnat,” or “knot”) in the medium-similarity condition were excluded as nasalization of a vowel can increase the F1 of that vowel (Dickson, 1962; Fujimura, 1960; House & Stevens, 1956). In the present analysis, such an increase would potentially increase F1 in medium-similarity sequences in ways that were unrelated to phonological similarity. Exclusion of these productions also ensured that each condition had the same number of vowel samples. As a result, there were 2,699 vowel productions available for the similarity analysis. The effects of similarity on vowel features are displayed by column in the panels of Figure 3. No effects of phonological similarity on vowel production were observed for F1, F(2, 26) = 2.68, p = .09; F2, F(2, 26) = 4.52, p = .02; or vowel intensity, F(2, 26) = 5.12, p = .013. Similarity effects were observed for vowel duration, F(2, 26) = 10.29, p < .001. Comparisons of vowel durations across levels of similarity revealed that small but significant increases in vowel durations were observed during medium- versus low-similarity sequences, t(13) = 2.69, p < .0167 (mean increase = 6 ms), but no differences were detected between medium- and high-similarity sequences, t(13) = 1.87, p = .04, or between low- and high-similarity sequences, t(13) = 0.98, p = .17.

Interaction Effects During Vowel Productions

Interaction effects between vowel phoneme and noise type were observed for vowel intensity, F(2, 26) = 9.99, p < .001, but not for any other vowel features (p > .04; see Table 4). The nature of the interaction between vowel phoneme and noise type is displayed in Figure 4, which depicts means and standard errors of the vowel intensities for /a/ and /æ/ during quiet (open circles), vowel noise (gray circles), and sibilant noise (black circles). Multiple comparisons analysis revealed that increases in intensity from quiet to vowel noise were 0.4 dB larger during production of /a/ than during /æ/. No interactions were observed between vowel phoneme and similarity for any vowel features (p > .03; see Table 4).

Figure 4. Means and standard errors of the vowel intensities for /a/ and /æ/ during quiet (open circles), vowel noise (gray circles), and sibilant noise (black circles).

An interaction between phonological similarity and noise type was observed for vowel intensity, F(4, 52) = 4.64, p < .005. The interaction effect is displayed in Figure 5, which depicts means and standard errors of the vowel intensities for each noise type during low-similarity (open circles), medium-similarity (dark circles), and high-similarity (gray circles) sequences. As shown in this figure, intensity increases from quiet to vowel noise were smaller during high-similarity sequences than during low- and medium-similarity sequences. Pairwise comparisons of this interaction effect confirmed that increases in vowel intensity during vowel noise versus quiet were significantly smaller during high-similarity sequences than during medium-similarity sequences, t(13) = 3.19, p < .0167 (mean difference = −0.4 dB), and low-similarity sequences, t(13) = 2.86, p < .01 (mean difference = −0.4 dB). The direction of the interaction was opposite what was expected and is addressed in the Discussion section.

Figure 5. Means and standard errors of vowel intensities for each noise type during low-similarity (open circles), medium-similarity (dark circles), and high-similarity (gray circles) sequences. This figure depicts the interaction between noise type and similarity: Intensity increases during vowel noise versus quiet were smaller during high-similarity sequences than during low- and medium-similarity sequences. Med = medium.

Intervowel and Intersibilant Spectral Contrast Distances

The previous analyses evaluated main and interaction effects of noise and similarity on individual acoustic features for sibilants and vowels. An analysis of intersibilant and intervowel contrast distances provided a means of analyzing changes in multiple acoustic features simultaneously. The effects of noise condition and phonological similarity on intersibilant and intervowel contrast distances were evaluated to determine whether these conditions affected the distinctiveness of the sounds. Sibilant contrast distances were calculated from the spectral mean (M1), peak frequency, and kurtosis (M4), as these variables exhibited main effects of noise type. The Bhattacharyya distances were calculated between the /s/ and /ʃ/ feature sets for each experimental condition. Bhattacharyya distance was chosen over Euclidean distance because it does not require normalizing variables and because it is well suited to the calculation of distances between distributions of values (Basseville, 2013; Bhattacharyya, 1943), such as the distributions of features for the experimental conditions in this study. For each speaker, the inputs to the contrast analysis were n × 3 matrices, in which the three columns corresponded to the three acoustic features for each of the n productions of either /s/ or /ʃ/ in a condition. As such, 18 matrices (3 noise conditions × 3 similarity conditions × 2 sibilants) were created for each speaker, and derivation of the Bhattacharyya distances between sibilants for each condition produced nine distance values. A repeated-measures ANOVA did not detect a main effect of noise, F(1.3, 17.3) = 0.05, p = .89, or similarity, F(1.1, 14.3) = 1.82, p = .20, on the Bhattacharyya distances. An interaction effect between noise and similarity was also not detected, F(2.1, 28.0) = 1.41, p = .26.
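A sketch of the distance computation is given below, assuming the closed-form Bhattacharyya distance between two multivariate Gaussians fitted to the feature matrices (the paper does not state which estimator was used). Inputs are the n × 3 matrices of M1, peak frequency, and M4 for /s/ and /ʃ/ in one condition.

```python
import numpy as np

def bhattacharyya(x, y):
    """Bhattacharyya distance between Gaussians fitted to two n x 3
    feature matrices (rows = productions, columns = features)."""
    mx, my = x.mean(axis=0), y.mean(axis=0)
    cx = np.cov(x, rowvar=False)
    cy = np.cov(y, rowvar=False)
    c = (cx + cy) / 2.0                       # pooled covariance
    diff = mx - my
    term1 = diff @ np.linalg.solve(c, diff) / 8.0
    term2 = 0.5 * np.log(
        np.linalg.det(c) / np.sqrt(np.linalg.det(cx) * np.linalg.det(cy))
    )
    return term1 + term2
```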

Although vowel spectral changes were limited to F1 in the noise condition, an identical analysis was performed to determine whether these F1 changes produced a significant increase in the Bhattacharyya distances between the vowels /a/ and /æ/ across noise and similarity conditions. A repeated-measures ANOVA of Bhattacharyya distances did not reveal a main effect of noise condition, F(2, 26) = 0.33, p = .72, or similarity, F(2, 26) = 1.25, p = .30. Finally, no interaction between noise and similarity was present, F(4, 52) = 1.58, p = .19.

Error Analysis

Speakers produced errors on 188 trials, or about 8% of trials. The average number of trials containing an error was 13.3 (SD = 9.4), and the minimum and maximum numbers of errors were 2 and 37, respectively. The error rate produced by each speaker in each of the similarity and noise conditions was calculated and evaluated using a mixed-effects logistic regression model. Individual speakers were coded as random effects, and similarity and noise were coded as fixed effects. This analysis identified a main effect of noise, F(2, 117) = 11.70, p < .005, and Bonferroni-corrected pairwise comparisons revealed that vowel noise was associated with a significantly higher error rate than either quiet (p < .01; mean increase = 4.6%, or about three errors) or sibilant noise (p < .01; mean increase = 4.1%). No effect of similarity was observed, F(2, 117) = 0.29, p = .75, and the interaction between noise and similarity was not significant, F(4, 117) = 1.54, p = .19.
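A hedged sketch of the error-rate model is shown below, using statsmodels' variational Bayes mixed GLM. This is an assumption: the paper does not name its software, and this estimator fits the same random-intercept logistic structure by a different method than whatever produced the F tests reported above. `trials` is assumed to hold one row per trial with a binary `error` column and categorical `noise`, `similarity`, and `speaker` columns.

```python
import statsmodels.api as sm

def fit_error_model(trials):
    """Mixed-effects logistic regression of trial errors on noise and
    similarity, with random intercepts for speakers."""
    model = sm.BinomialBayesMixedGLM.from_formula(
        "error ~ C(noise) * C(similarity)",  # fixed effects
        {"speaker": "0 + C(speaker)"},       # random speaker intercepts
        trials,
    )
    return model.fit_vb()
```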

Discussion

The present investigation evaluated the effects of noise spectral characteristics and phonological similarity on production of sibilants and vowels. Target sounds were embedded in three-word sequences that speakers recalled in quiet and during exposure to speaker-specific noise signals that masked sibilant and vowel productions. The results indicated that speech responses to noise were partly dependent on the characteristics of the noise spectrum: Sibilant changes in noise were larger and involved more acoustic features during sibilant-shaped noise than vowel-shaped noise; vowel changes were larger and involved more features during vowel-shaped noise. The phonological similarity of words in a sequence did not alter production of sibilants and was only associated with small increases in vowel duration. An interaction between noise type and similarity condition was observed that consisted of decreases in vowel intensity during high-similarity sequences in the presence of vowel noise. The direction of the interaction would not increase the clarity of vowel productions and instead likely reflects processing demands related to selecting and sequencing phonologically similar words in noise.

Noise Spectrum Effects on Vowels and Sibilants

Sibilant production in noise was associated with increases in spectral mean (M1), spectral peak frequency, sibilant intensity, and sibilant duration. The findings for spectral mean (M1) are consistent with those of Perkell et al. (2007). Noise-induced changes in higher spectral moments have not been evaluated previously, but this study did not find evidence that the spectral variance or skewness of sibilants was sensitive to background noise. However, kurtosis was found to decrease during both vowel and sibilant noise types. Together, these findings indicate that sibilant production in noise is associated with longer durations, increased intensity, and a slight flattening of the spectrum accompanied by a shift in the spectrum toward higher frequencies. The sibilant changes in noise were generally consistent with the changes observed during clear versus conversational fricative productions (Maniwa et al., 2009) and have also been associated with better recognition of fricative place of articulation in noise (Maniwa et al., 2008). The findings of Maniwa and colleagues (Maniwa et al., 2008, 2009) suggest that sibilant productions in noise were clearer than those produced in quiet. The mechanism responsible for these increases in clarity was likely an increase in vocal effort rather than fine-tuned articulatory adjustments. That is, increases in volume velocity during sibilants tend to increase the amplitudes of higher frequencies more than those of lower frequencies (Krane, 1999; Shadle & Mair, 1996). As a result, increases in volume velocity can account not only for the increases in sibilant intensity but also for the increases in spectral mean and peak frequency that were observed in noise. In this regard, the sibilant changes in noise are mechanistically similar to the changes observed in the speech spectrum, where many noise-induced changes (e.g., increases in vocal intensity, flatter spectral tilt, rises in F0 and F1) are due to general increases in vocal effort (Cooke et al., 2014; Lu & Cooke, 2009).

Several of the sibilant changes observed in noise were specific to the spectral characteristics of the noise signal. Increases in sibilant intensity and duration were greater in sibilant noise than in vowel noise. In addition, sibilant durations were greater during sibilant noise than quiet, but no duration differences were observed between vowel noise and quiet. At the same time, no sibilant changes were specific to the vowel noise condition. The sensitivity of sibilant sounds to sibilant noise indicates that speech responses to noise were affected by the frequency content of background noise. Corresponding changes in the production of vowels were not observed, possibly because the sibilant noise did not mask the acoustic distinction between vowels. Thus, while sibilant changes can be explained by increases in vocal effort, these increases were applied selectively, resulting in greater changes to sibilants than to vowels during sibilant-shaped noise.

Vowel changes in noise were also affected by the spectral characteristics of the noise signal. Specifically, vowel-shaped noise was associated with greater increases in vowel intensity than sibilant-shaped noise. In addition, F1 values increased significantly during vowel-shaped noise compared to quiet, but changes in F1 were not observed between sibilant noise and quiet. Lastly, vowel changes specific to sibilant-shaped noise were not observed. Increases in vowel F1 and intensity have been reported in previous investigations of speech production in noise (Bond et al., 1989; Garnier & Henrich, 2014; Perkell et al., 2007; Van Summers et al., 1988). As noted by both Lu and Cooke (2009) and Garnier and Henrich (2014), increases in F1 during speech in noise are likely a consequence of increased vocal effort associated with increasing speech intensity (Sundberg & Nordenberg, 2006; Titze & Sundberg, 1992). As such, the increases in vowel F1 and intensity may be attributable to a common mechanism that did not involve active articulatory adjustments.

It is worth noting that noise-induced speech changes were generally small in magnitude and did not include some variables, such as increases in vowel duration, typically associated with speech production in noise. One possible reason for the reduced response to noise is the level of the background noise signal. The intensity levels of the vowel and sibilant noise were set to the intensity levels of speakers' vowel and sibilant productions produced in quiet during a preexperiment session. This method contrasts with previous studies (Tartter et al., 1993; Van Summers et al., 1988) that have used noise intensity levels considerably higher than the intensity of speech produced in quiet. Moreover, the perceived loudness of speech auditory feedback was likely greater than that of the noise signals, owing to the combined effects of ear canal occlusion and bone conduction (Hood, 1962). As a result, the perceived loudness of the background noise may not have been sufficient to elicit more robust changes in speech output. This would have been especially true for low-frequency sounds like vowels and may explain why speech changes in noise involved more vowel features than sibilant features. In addition, the finding of greater changes for vowels may have been due to the relative scaling of the noise signals, which produced more intense vowel noise signals than sibilant noise signals.

The smaller responses to noise in this study may also have been due to the speech task. Previous studies have revealed that speech responses to background noise are larger in communicative contexts than in noncommunicative contexts (Garnier et al., 2010; Junqua, 1993; Lane & Tranel, 1971). The fact that speakers were not communicating with a listener, together with the focus on accurate recall of speech sequences, reduced the communicative nature of the speech task and may have limited the magnitude of speech responses in this study.

Together, the vowel and sibilant findings indicate that the motor control of speech in noise possesses the capacity to alter production of sounds in ways that are sensitive to the spectral characteristics of background noise. The direction of these responses is consistent with a compensation for the noise signal. The magnitude of many of these responses was small, which is consistent with the magnitude of the noise signals. It is not clear whether these changes would be perceptible to a listener or whether responses to noise would be different if a listener was present. The nature of these responses to noise is similar to speech responses to perturbations of auditory feedback. Such responses are often small (Katseff et al., 2012; Reilly & Dougherty, 2013; Tourville et al., 2008; Villacorta et al., 2007) and may not be perceptually identifiable. Despite their small magnitude, these responses to auditory perturbations provide insight into the sensorimotor control of speech production.

Phonological Similarity Effects on Vowels and Sibilants

This study also evaluated whether speakers modulated their production of sibilants and vowels in noise depending on the importance of those phonemes for discriminating the words in each sequence. This question was investigated by manipulating the phonological similarity of the consonant–vowel–consonant words in each sequence: In high-similarity sequences, two of the three words were identical except for their word-initial sibilant (/s/ vs. /ʃ/) and their vowel (/a/ vs. /æ/); medium-similarity sequences contained two words that differed in their vowels (/a/ vs. /æ/) but in no other phonemes; and low-similarity sequences did not contain words that differed in either their word-initial sibilant or their vowel. Analyses of sibilant and vowel productions in each similarity condition revealed that increases in word similarity were not associated with changes in any sibilant feature. In addition, no changes were detected in speakers' vowel formants or vowel intensity levels. The only effect associated with phonological similarity was a small increase in vowel duration during medium- and high-similarity sequences compared to low-similarity sequences.

It is notable that similarity effects were not observed for any sibilant features and that, unlike vowel productions in vowel noise, sibilant features were not affected by sibilant noise during high-similarity sequences. Cutler et al. (2000) have posited that consonant production may be more constrained and less variable than vowel production, since vowels are more susceptible to phonetic context and because vowel variability is less likely than consonant variability to affect word perception. Constraints on consonant production, if present, could have contributed to the absence of similarity effects on sibilant productions in this study.

Interaction Effect Between Noise Type and Similarity

The near absence of main effects of similarity indicates that this factor was not associated with robust or widespread changes in sibilant or vowel productions. However, changes in vowel intensity during the different noise types were not uniform but depended on the similarity of the sequence in which the vowels were embedded. Specifically, vowel intensities for low- and medium-similarity sequences were nearly identical across noise types but were significantly greater than the vowel intensities for high-similarity sequences during vowel noise. During quiet and sibilant noise, the vowel intensities of high-similarity sequences were comparable to those of the low- and medium-similarity sequences. The direction of the interaction between noise type and similarity was inconsistent with several predictions. It was expected that the intensities of medium- and high-similarity sequences would be larger than those of the low-similarity sequences, as the latter would be less confusable in noise. This was not the case: vowel intensities for low-similarity sequences were not lower than those for the other sequences in any noise condition. These findings failed to support the prediction that speakers would modulate the intensity of their speech according to the acoustic confusability of the words in a sequence. Moreover, vowel intensities for low- and medium-similarity sequences were actually greater than the intensities produced during high-similarity sequences in one noise type, the vowel noise. This finding indicates that high similarity among the words in a sequence was associated with smaller intensity values in noise.
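
For readers interested in how such a noise type × similarity interaction on vowel intensity might be tested, the sketch below fits a linear mixed-effects model with a random intercept per speaker. This is one common approach, not necessarily the analysis reported in this study; the data, column names, and values are invented solely so the sketch runs end to end.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic long-format data: one row per vowel token (all values made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "noise": np.tile(["quiet", "sibilant", "vowel"], 60),
    "similarity": np.repeat(["low", "medium", "high"], 60),
    "speaker": np.tile([f"s{i}" for i in range(10)], 18),
    "intensity": 62 + rng.normal(0, 1.5, 180),  # dB SPL (synthetic)
})

# Random intercept per speaker; the C(noise):C(similarity) coefficients
# test whether the effect of noise type on intensity depends on similarity.
model = smf.mixedlm("intensity ~ C(noise) * C(similarity)",
                    data=df, groups=df["speaker"]).fit()
print(model.summary())
```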

A possible explanation for these findings concerns the general redundancy between acoustic and phonological similarity. Despite this redundancy, the two kinds of similarity affect different aspects of spoken language processing. In this study, acoustic similarity was expected to make the words in a sequence more confusable, and it was predicted that speakers would respond by increasing their discriminability. This was not observed. At the same time, the words in medium- and high-similarity sequences shared phonological features that may also have affected the articulation of sibilants and vowels in this study. Previous research has found that phonological similarity increases serial processing demands in reaction time and speeded speech tasks, as indicated by increased vowel and pause durations (Reilly & Spencer, 2013a), reductions in speaking rate (Sevald & Dell, 1994), longer speech reaction times (Meyer & Gordon, 1985; Rogers & Storkel, 1998), and higher error rates (Meyer & Gordon, 1985; Reilly & Spencer, 2013a; Rogers & Storkel, 1998). It was believed that limiting the sequences to three words and not requiring rapid production of the sequences would minimize such phonological similarity effects. However, the smaller increases in vowel intensity during high-similarity sequences suggest that the demands of sequencing phonologically similar words partly attenuated the speech system's capacity to respond to background noise. This finding indicates that speaker- or production-internal processes, such as phonological retrieval/sequencing (Baese-Berk & Goldrick, 2009), interact and possibly compete with listener-oriented processes, and that the demands of the former can affect the latter in the form of reduced intensity. This account is consistent with the findings of a dual-task study by Hansen and Patil (2007), who reported decreases in word and vowel RMS while speakers performed two computer tasks (i.e., single and dual tracking tasks).

Conclusions

The findings of this study are generally consistent with the proposal by Garnier and Henrich (2014) that there are two levels of speech responses to background noise. The primary level consists of global increases in vocal effort that raise the intensity of the speech acoustic output. In this study, evidence for this type of global change included increases in sibilant spectral mean (M1), kurtosis (M4), peak frequency, intensity, and duration, as well as increases in vowel F1 and intensity. According to Garnier and Henrich, secondary adjustments consist of subtle changes that enhance the separation of the speech signal from the background noise to improve speech communication. Evidence for such secondary adjustments included sibilant changes that were specific to the sibilant noise condition (i.e., increases in duration and larger increases in sibilant intensity than vowel intensity) and vowel changes that were specific to the vowel noise condition (i.e., increases in F1 and larger increases in vowel intensity). The absence of either sibilant-specific or vowel-specific changes during spectrally dissimilar noise signals reinforces the argument that these changes were elicited by the spectral overlap, or masking, of the background noise signals. In the case of sibilant and vowel sounds, the secondary adjustments would have increased the speech energy in the region of the spectrum where noise energy was prominent.
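
As an illustration of the sibilant spectral features named above, the sketch below computes the spectral mean (M1), kurtosis (M4), and peak frequency of a windowed sibilant frame by treating the normalized power spectrum as a probability distribution over frequency, in the general spirit of the spectral-moment analysis of Forrest et al. (1988). The windowing and FFT settings are assumptions, not the study's reported analysis parameters.

```python
import numpy as np

def sibilant_spectral_features(frame, sr=44100):
    """Spectral mean (M1), kurtosis (M4), and peak frequency of one frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    p = spectrum / np.sum(spectrum)            # normalize to a distribution

    m1 = np.sum(p * freqs)                     # spectral mean (M1), in Hz
    variance = np.sum(p * (freqs - m1) ** 2)   # second central moment
    m4 = np.sum(p * (freqs - m1) ** 4) / variance ** 2  # kurtosis (M4)
    peak = freqs[np.argmax(spectrum)]          # peak frequency, in Hz
    return m1, m4, peak
```

Note that M4 is computed here as the standardized fourth central moment; some studies instead report excess kurtosis, which subtracts 3 from this value.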

This study also investigated whether the importance of phonemic contrasts to word discrimination modulates their production in noise. This manipulation elicited only limited similarity effects, and the observed interaction between noise type and similarity likely reflected the increased processing demands of recalling and sequencing phonologically similar sequences. The possibility that retrieving and/or sequencing phonologically similar words can influence speech responses to noise is interesting and warrants further investigation.

Acknowledgment

This research was supported by National Institute on Deafness and Other Communication Disorders Grant R03-DC011159 awarded to the author.

References

1. Baddeley A. D. (1968). How does acoustic similarity influence short-term memory? Quarterly Journal of Experimental Psychology, 20(3), 249–263. https://doi.org/10.1080/14640746808400159
2. Baese-Berk M., & Goldrick M. (2009). Mechanisms of interaction in speech production. Language and Cognitive Processes, 24(4), 527–554. https://doi.org/10.1080/01690960802299378
3. Basseville M. (2013). Divergence measures for statistical data processing—An annotated bibliography. Signal Processing, 93(4), 621–633. https://doi.org/10.1016/j.sigpro.2012.09.003
4. Bhattacharyya A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, 35, 99–109.
5. Bond Z., Moore T. J., & Gable B. (1989). Acoustic–phonetic characteristics of speech produced in noise and while wearing an oxygen mask. The Journal of the Acoustical Society of America, 85(2), 907–912. https://doi.org/10.1121/1.397563
6. Bradlow A. R., Torretta G. M., & Pisoni D. B. (1996). Intelligibility of normal speech. I: Global and fine-grained acoustic–phonetic talker characteristics. Speech Communication, 20(3), 255–272. https://doi.org/10.1016/S0167-6393(96)00063-5
7. Cooke M., King S., Garnier M., & Aubanel V. (2014). The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language, 28(2), 543–571. https://doi.org/10.1016/j.csl.2013.08.003
8. Cutler A., Sebastián-Gallés N., Soler-Vilageliu O., & Van Ooijen B. (2000). Constraints of vowels and consonants on lexical selection: Cross-linguistic comparisons. Memory & Cognition, 28(5), 746–755. https://doi.org/10.3758/BF03198409
9. Cutler A., Weber A., Smits R., & Cooper N. (2004). Patterns of English phoneme confusions by native and non-native listeners. The Journal of the Acoustical Society of America, 116(6), 3668–3678. https://doi.org/10.1121/1.1810292
10. Dickson D. R. (1962). An acoustic study of nasality. Journal of Speech and Hearing Research, 5(2), 103–111. https://doi.org/10.1044/jshr.0502.103
11. Drewnowski A., & Murdock B. B. Jr. (1980). The role of auditory features in memory span for words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 6(3), 319–332. https://doi.org/10.1037/0278-7393.6.3.319
12. Forrest K., Weismer G., Milenkovic P., & Dougall R. N. (1988). Statistical analysis of word-initial voiceless obstruents: Preliminary data. The Journal of the Acoustical Society of America, 84(1), 115–123. https://doi.org/10.1121/1.396977
13. Fox R. A., & Nissen S. L. (2005). Sex-related acoustic changes in voiceless English fricatives. Journal of Speech, Language, and Hearing Research, 48(4), 753–765. https://doi.org/10.1044/1092-4388(2005/052)
14. Fujimura O. (1960). Spectra of nasalized vowels. Research Laboratory of Electronics, Quarterly Progress Reports, 58, 214–218.
15. Garnier M., & Henrich N. (2014). Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise? Computer Speech & Language, 28(2), 580–597. https://doi.org/10.1016/j.csl.2013.07.005
16. Garnier M., Henrich N., & Dubois D. (2010). Influence of sound immersion and communicative interaction on the Lombard effect. Journal of Speech, Language, and Hearing Research, 53(3), 588–608. https://doi.org/10.1044/1092-4388(2009/08-0138)
17. Guenther F. H. (2016). Neural control of speech. MIT Press. https://doi.org/10.7551/mitpress/10471.001.0001
18. Hanley T. D., & Steer M. D. (1949). Effect of level of distracting noise upon speaking rate, duration and intensity. Journal of Speech and Hearing Disorders, 14(4), 363–368. https://doi.org/10.1044/jshd.1404.363
19. Hansen J., & Patil S. (2007). Speech under stress: Analysis, modeling and recognition. In Müller C. (Ed.), Speaker classification I (pp. 108–137). Springer. https://doi.org/10.1007/978-3-540-74200-5_6
20. Henson R. N. (1996). Unchained memory: Error patterns rule out chaining models of immediate serial recall. The Quarterly Journal of Experimental Psychology: Section A, 49(1), 80–115. https://doi.org/10.1080/713755612
21. Hood J. (1962). Bone conduction: A review of the present position with especial reference to the contributions of Dr. Georg von Békésy. The Journal of the Acoustical Society of America, 34(9B), 1325–1332. https://doi.org/10.1121/1.1918339
22. Houde J. F., & Nagarajan S. S. (2011). Speech production as state feedback control. Frontiers in Human Neuroscience, 5, 82. https://doi.org/10.3389/fnhum.2011.00082
23. House A. S., & Stevens K. N. (1956). Analog studies of the nasalization of vowels. Journal of Speech and Hearing Disorders, 21(2), 218–232. https://doi.org/10.1044/jshd.2102.218
24. Jared D., & Seidenberg M. S. (1990). Naming multisyllabic words. Journal of Experimental Psychology: Human Perception and Performance, 16(1), 92–105. https://doi.org/10.1037/0096-1523.16.1.92
25. Jongman A., Wayland R., & Wong S. (2000). Acoustic characteristics of English fricatives. The Journal of the Acoustical Society of America, 108(3), 1252–1263. https://doi.org/10.1121/1.1288413
26. Junqua J. C. (1993). The Lombard reflex and its role on human listeners and automatic speech recognizers. The Journal of the Acoustical Society of America, 93(1), 510–524. https://doi.org/10.1121/1.405631
27. Junqua J. C. (1996). The influence of acoustics on speech production: A noise-induced stress phenomenon known as the Lombard reflex. Speech Communication, 20(1–2), 13–22. https://doi.org/10.1016/S0167-6393(96)00041-6
28. Junqua J. C., Fincke S., & Field K. (1998, Nov–Dec). Influence of the speaking style and the noise spectral tilt on the Lombard reflex and automatic speech recognition. Paper presented at the International Conference on Spoken Language Processing, Sydney, Australia.
29. Junqua J. C., Fincke S., & Field K. (1999). The Lombard effect: A reflex to better communicate with others in noise. Paper presented at the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ, United States.
30. Katseff S., Houde J., & Johnson K. (2012). Partial compensation for altered auditory feedback: A tradeoff with somatosensory feedback? Language and Speech, 55(2), 295–308. https://doi.org/10.1177/0023830911417802
31. Krane M. (1999). Fluid dynamic effects in speech. The Journal of the Acoustical Society of America, 105(2), 1159. https://doi.org/10.1121/1.425507
32. Lane H., & Tranel B. (1971). The Lombard sign and the role of hearing in speech. Journal of Speech and Hearing Research, 14(4), 677–709. https://doi.org/10.1044/jshr.1404.677
33. Lu Y., & Cooke M. (2009). Speech production modifications produced in the presence of low-pass and high-pass filtered noise. The Journal of the Acoustical Society of America, 126(3), 1495–1499. https://doi.org/10.1121/1.3179668
34. Mahl G. (1972). People talking when they can't hear their voices. In Siegman A. & Pope B. (Eds.), Studies in dyadic communication. Pergamon Press. https://doi.org/10.1016/B978-0-08-015867-9.50014-9
35. Maniwa K., Jongman A., & Wade T. (2008). Perception of clear fricatives by normal-hearing and simulated hearing-impaired listeners. The Journal of the Acoustical Society of America, 123(2), 1114–1125. https://doi.org/10.1121/1.2821966
36. Maniwa K., Jongman A., & Wade T. (2009). Acoustic characteristics of clearly spoken English fricatives. The Journal of the Acoustical Society of America, 125(6), 3962–3973. https://doi.org/10.1121/1.2990715
37. Meyer D. E., & Gordon P. C. (1985). Speech production: Motor programming of phonetic features. Journal of Memory and Language, 24(1), 3–26. https://doi.org/10.1016/0749-596x(85)90013-0
38. Moon S. J., & Lindblom B. (1994). Interaction between duration, context, and speaking style in English stressed vowels. The Journal of the Acoustical Society of America, 96(1), 40–55. https://doi.org/10.1121/1.410492
39. Netsell R. (1982). Speech motor control and selected neurologic disorders. In Grillner S., Lindstrom M. J., Lubker J., & Persson A. (Eds.), Speech motor control (pp. 247–261). Pergamon Press. https://doi.org/10.1016/B978-0-08-028892-5.50024-4
40. Nittrouer S. (1995). Children learn separate aspects of speech production at different rates: Evidence from spectral moments. The Journal of the Acoustical Society of America, 97(1), 520–530. https://doi.org/10.1121/1.412278
41. Patel R., & Schell K. W. (2008). The influence of linguistic content on the Lombard effect. Journal of Speech, Language, and Hearing Research, 51(1), 209–220. https://doi.org/10.1044/1092-4388(2008/016)
42. Perkell J. S., Denny M., Lane H., Guenther F., Matthies M. L., Tiede M., Vick J., Zandipour M., & Burton E. (2007). Effects of masking noise on vowel and sibilant contrasts in normal-hearing speakers and postlingually deafened cochlear implant users. The Journal of the Acoustical Society of America, 121(1), 505–518. https://doi.org/10.1121/1.2384848
43. Phatak S. A., & Allen J. B. (2007). Consonant and vowel confusions in speech-weighted noise. The Journal of the Acoustical Society of America, 121(4), 2312–2326. https://doi.org/10.1121/1.2642397
44. Picheny M. A., Durlach N. I., & Braida L. D. (1986). Speaking clearly for the hard of hearing. II: Acoustic characteristics of clear and conversational speech. Journal of Speech and Hearing Research, 29(4), 434–446. https://doi.org/10.1044/jshr.2904.434
45. Pittman A. L., & Wiley T. L. (2001). Recognition of speech produced in noise. Journal of Speech, Language, and Hearing Research, 44(3), 487–496. https://doi.org/10.1044/1092-4388(2001/038)
46. Reilly K. J., & Dougherty K. E. (2013). The role of vowel perceptual cues in compensatory responses to perturbations of speech auditory feedback. The Journal of the Acoustical Society of America, 134(2), 1314–1323. https://doi.org/10.1121/1.4812763
47. Reilly K. J., & Spencer K. A. (2013a). Sequence complexity effects on speech production in healthy speakers and speakers with hypokinetic or ataxic dysarthria. PLOS ONE, 8(10), e77450. https://doi.org/10.1371/journal.pone.0077450
48. Reilly K. J., & Spencer K. A. (2013b). Speech serial control in healthy speakers and speakers with hypokinetic or ataxic dysarthria: Effects of sequence length and practice. Frontiers in Human Neuroscience, 7, 665. https://doi.org/10.3389/fnhum.2013.00665
49. Rivers C., & Rastatter M. P. (1985). The effects of multitalker and masker noise on fundamental frequency variability during spontaneous speech for children and adults. Journal of Auditory Research, 25(1), 37–45.
50. Rogers M. A., & Storkel H. L. (1998). Reprogramming phonologically similar utterances: The role of phonetic features in pre-motor encoding. Journal of Speech, Language, and Hearing Research, 41(2), 258–274. https://doi.org/10.1044/jslhr.4102.258
51. Sevald C. A., & Dell G. S. (1994). The sequential cuing effect in speech production. Cognition, 53(2), 91–127. https://doi.org/10.1016/0010-0277(94)90067-1
52. Shadle C. H., & Mair S. J. (1996). Quantifying spectral characteristics of fricatives. Paper presented at the Fourth International Conference on Spoken Language Processing, Philadelphia, PA, United States.
53. Siegel G. M., & Pick H. L. (1974). Auditory feedback in the regulation of voice. The Journal of the Acoustical Society of America, 56(5), 1618–1624. https://doi.org/10.1121/1.1903486
54. Stevens K. N. (2000). Acoustic phonetics (Vol. 30). MIT Press.
55. Stowe L. M., & Golob E. J. (2013). Evidence that the Lombard effect is frequency-specific in humans. The Journal of the Acoustical Society of America, 134(1), 640–647. https://doi.org/10.1121/1.4807645
56. Sundberg J., & Nordenberg M. (2006). Effects of vocal loudness variation on spectrum balance as reflected by the alpha measure of long-term-average spectra of speech. The Journal of the Acoustical Society of America, 120(1), 453–457. https://doi.org/10.1121/1.2208451
57. Tartter V. C., Gomes H., & Litwin E. (1993). Some acoustic effects of listening to noise on speech production. The Journal of the Acoustical Society of America, 94(4), 2437–2440. https://doi.org/10.1121/1.408234
58. Titze I. R., & Sundberg J. (1992). Vocal intensity in speakers and singers. The Journal of the Acoustical Society of America, 91(5), 2936–2946. https://doi.org/10.1121/1.402929
59. Tourville J. A., & Guenther F. H. (2011). The DIVA model: A neural theory of speech acquisition and production. Language and Cognitive Processes, 26(7), 952–981. https://doi.org/10.1080/01690960903498424
60. Tourville J. A., Reilly K. J., & Guenther F. H. (2008). Neural mechanisms underlying auditory feedback control of speech. NeuroImage, 39(3), 1429–1443. https://doi.org/10.1016/j.neuroimage.2007.09.054
61. Van Summers W., Pisoni D. B., Bernacki R. H., Pedlow R. I., & Stokes M. A. (1988). Effects of noise on speech production: Acoustic and perceptual analysis. The Journal of the Acoustical Society of America, 84(3), 486–490.
62. Villacorta V. M., Perkell J. S., & Guenther F. H. (2007). Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. The Journal of the Acoustical Society of America, 122(4), 2306–2319. https://doi.org/10.1121/1.2773966
63. Zahorik P., & Kelly J. W. (2007). Accurate vocal compensation for sound intensity loss with increasing distance in natural environments. The Journal of the Acoustical Society of America, 122(5), EL143–EL150. https://doi.org/10.1121/1.2784148
