Effects of noise on speech production: Acoustic and perceptual analyses

W Van Summers; David B Pisoni; Robert H Bernacki; Robert I Pedlow; Michael A Stokes

doi:10.1121/1.396660

. Author manuscript; available in PMC: 2012 Nov 27.

Published in final edited form as: J Acoust Soc Am. 1988 Sep;84(3):917–928. doi: 10.1121/1.396660

Effects of noise on speech production: Acoustic and perceptual analyses

W Van Summers ¹, David B Pisoni ¹, Robert H Bernacki ¹, Robert I Pedlow ¹, Michael A Stokes ¹

PMCID: PMC3507387 NIHMSID: NIHMS418734 PMID: 3183209

Abstract

Acoustical analyses were carried out on a set of utterances produced by two male speakers talking in quiet and in 80, 90, and 100 dB SPL of masking noise. In addition to replicating previous studies demonstrating increases in amplitude, duration, and vocal pitch while talking in noise, these analyses also found reliable differences in the formant frequencies and short-term spectra of vowels. Perceptual experiments were also conducted to assess the intelligibility of utterances produced in quiet and in noise when they were presented at equal S/N ratios for identification. In each experiment, utterances originally produced in noise were found to be more intelligible than utterances produced in the quiet. The results of the acoustic analyses showed clear and consistent differences in the acoustic–phonetic characteristics of speech produced in quiet versus noisy environments. Moreover, these acoustic differences produced reliable effects on intelligibility. The findings are discussed in terms of: (1) the nature of the acoustic changes that take place when speakers produce speech under adverse conditions such as noise, psychological stress, or high cognitive load; (2) the role of training and feedback in controlling and modifying a talker’s speech to improve performance of current speech recognizers; and (3) the development of robust algorithms for recognition of speech in noise.

INTRODUCTION

It has been known for many years that a speaker will increase his/her vocal effort in the presence of a loud background noise. Informal observations confirm that people talk much louder in a noisy environment such as a subway, airplane, or cocktail party than in a quiet environment such as a library or doctor’s office. This effect, known as the Lombard reflex, was first described by Etienne Lombard in 1911 and has attracted a moderate degree of attention by researchers over the years. The observation that speakers increase their vocal effort in the presence of noise in the environment suggests that speakers monitor their vocal output rather carefully when speaking. Apparently, speakers attempt to maintain a constant level of intelligibility in the face of degradation of the message by the environmental noise source and the corresponding decrease in auditory sidetone at their ears. Lane and his colleagues (Lane et al., 1970; Lane and Tranel, 1971) have summarized much of the early literature on the Lombard effect and have tried to account for a wide range of findings reported in the literature. The interested reader is encouraged to read these reports for further background and interpretation.

Despite the extensive literature on the Lombard effect over the last 30 years, little, if any, data have been published reporting details of the acoustic–phonetic changes that take place when a speaker modifies his vocal output while speaking in the presence of noise. A number of studies have reported reliable changes in the prosodic characteristics of speech produced in noise. However, very few studies have examined changes in the spectral properties of speech produced in masking noise.

In an earlier study, Hanley and Steer (1949) found that in the presence of masking noise, speakers reduce their rate of speaking and increase the duration and intensity of their utterances. In another study, Draegert (1951) examined the relations between a large number of physical measures of voice quality and speech intelligibility in high levels of noise and found a similar pattern of results. His interest was focused primarily on factors that correlated with measures of speech intelligibility rather than on a description of the acoustic–phonetic changes that take place in the speaker’s speech. In addition to changes in duration and intensity, Draegert reported increases in vocal pitch and changes in voice quality due to a shift in the harmonic structure. The change in harmonic structure was shown by a difference in intensity between the low- and high-frequency components. To obtain these measures, the speech was bandpass filtered to obtain estimates of the locations of the major concentrations of energy in the spectrum. Unfortunately, no measurements of the size of the effects were reported in this article.

In another study on the intelligibility of speech produced in noise, Dreher and O’Neill (1957) reported that, when presented at a constant speech-to-noise ratio, speech produced by a speaker with noise in his ears is more intelligible than speech produced in quiet. This result was observed for both isolated words and sentences. In each case, for the noise condition, a broadband random noise source was presented over the speaker’s headphones during production.

Related findings have been reported by Ladefoged (1967, pp. 163–165) in an informal study designed to examine how eliminating auditory feedback affects a speaker’s speech. Auditory feedback was eliminated by presenting a loud masking noise over headphones at an intensity level that prevented the subject from hearing his/her voice even via bone conduction. Subjects read a prepared passage and also engaged in spontaneous conversation. According to Ladefoged, although subjects’ speech remained intelligible, it became “very disorganized” by removal of auditory feedback through the presentation of masking noise. Of special interest to us was the observation by Ladefoged that the length and quality of many of the vowel sounds were affected quite considerably by the masking noise. Some sounds became more nasalized, others lost appropriate nasalization. Pitch increased and there appeared to be much less variability in the range of pitch. Ladefoged also noticed a striking alteration in voice quality brought about by the tightening of the muscles of the pharynx. These findings were summarized informally by Ladefoged in his book without reporting any quantitative data. To our knowledge, these results have never been published. Nonetheless, they are suggestive of a number of important changes that may take place when speakers are required to speak under conditions of high masking noise.

The Dreher and O’Neill (1957) results suggest that masking noise which does not eliminate auditory feedback to the subject may have a positive influence on speech intelligibility. On the other hand, the Ladefoged (1967) findings suggest that this may not be the case when environmental noise is so loud that all auditory feedback is eliminated.

The present investigation is concerned with the effects of masking noise on speech production. Our interest in this problem was stimulated, in part, by recent efforts of the Air Force to place speech recognition devices in noisy environments such as the cockpits of military aircraft. Although it is obvious that background noise poses a serious problem for the operation of any speech recognizer, the underlying reasons for this problem are not readily apparent at first glance. While extensive research efforts are currently being devoted to improving processing algorithms for speech recognition in noise, particularly algorithms for isolated speaker-dependent speech recognition, a great deal less interest has been devoted to examining the acoustic–phonetic changes that take place in the speech produced by talkers in high ambient noise environments. If it is the case, as suggested by the published literature, that speakers show reliable and systematic changes in their speech as the noise level at their ears increases, then it would be appropriate to examine these differences in some detail and to eventually incorporate an understanding of these factors into current and future algorithm development. Thus the problem of improving the performance of speech recognizers may not only be related to developing new methods of extracting the speech signal from the noise but may also require consideration of how speakers change their speech in noisy or adverse environments.

As noted earlier, a search through the literature on speech communication and acoustic–phonetics published over the last 40 years revealed a number of studies on the effects of noise on speech production and speech intelligibility. While changes in duration, intensity, and vocal pitch have been reported, and while changes in voice quality have been observed by a number of investigators, little is currently known about the changes that take place in the distribution of spectral energy over time such as modifications in the patterns of vowel formant frequencies or in the short-term spectra of speech sounds produced in noise. The present investigation was aimed at specifying the gross acoustic–phonetic changes that take place when speech is produced under high levels of noise as might be encountered in an aircraft cockpit. We expected to find reliable changes in prosodic parameters such as amplitude, duration, and vocal pitch, which have previously been reported in the literature. We were also interested in various segmental measures related to changes in formant frequencies and in the distribution of spectral energy in the short-term spectra of various segments. These measures might reflect changes in the speaker’s source function as well as the articulatory gestures used to implement various classes of speech sounds. In the present study, digital signal processing techniques were used to obtain quantitative measures of changes in the acoustic–phonetic characteristics of speech produced in quiet and in three ambient noise conditions. A second aspect of the study involved perceptual testing with these utterances to verify Dreher and O’Neill’s earlier finding that speech produced in noise was more intelligible than speech produced in quiet when the two conditions were presented at equivalent S/N ratios (Dreher and O’Neill, 1957).

I. ACOUSTIC ANALYSES

A. Method

1. Subjects

Two male native English speakers (SC and MD) were recruited as subjects. SC was a graduate student in psychology and was paid $5.00 for his participation. MD was a member of the laboratory staff and participated as part of his routine duties. Both speakers were naive to the purpose of the study and neither speaker reported a hearing or speech problem at the time of testing. Both speakers served for approximately 1 h.

2. Stimulus materials

Stimulus materials consisted of the 15 words in the Air Force speech recognition vocabulary: the digits “zero,” “one,” “two,” “three,” “four,” “five,” “six,” “seven,” “eight,” “nine”; and the control words “enter,” “frequency,” “step,” “threat,” and “CCIP.” These words were typed into computer files and different randomizations of the list of 15 words were printed out for the subjects to read during the course of the experiment.

3. Procedure

Subjects were run individually in a single-walled sound-attenuated booth (IAC model 401 A). The subject was seated comfortably in the booth and wore a pair of matched and calibrated TDH-39 headphones. An Electrovoice condenser microphone (model C090) was attached to the headset with an adjustable boom. Once adjusted, the microphone remained at a fixed distance of 4 in. from the subject’s lips throughout the experiment.

The masking noise consisted of a broadband white noise source that was generated with a Grason–Stadler noise generator (model 1724). The noise was low-pass filtered at 3.5 kHz, using a set of Krohn-Hite filters (model 3202R) with a roll-off of 24 dB per octave, and passed through a set of adjustable attenuators. The masking noise was presented binaurally through the headphones. Subjects wore the headphones during the entire experiment.

Subjects read the words on the test lists under four conditions: quiet, 80, 90, or 100 dB of masking noise in their earphones. The quiet condition measured from 33- to 37-dB SPL background noise with the attenuators set to their maximum setting. Measurements of the noise were made with a B&K sound level meter and artificial ear connected to the earphones.

After the headset was adjusted and the subject became familiar with the environment, a sheet of written instructions was provided to explain the procedures that would be followed. Subjects were informed that they would be reading English words from a list and that they should say each word clearly with a pause of about 1–2 s between words. They were told that masking noise at various levels of intensity would be presented over their headphones during the course of the experiment and that their task was to read each word as clearly as possible into the microphone. They were also told that the experimenter would be listening to their speech outside the booth while the recording was being made. Before the actual recordings were made, both subjects were given about 15 min of practice reading lists of the vocabulary with no noise over the headphones. This was done to familiarize the subjects with the specific vocabulary and the general procedures to be used in making the audiotapes.

Data were collected from subjects reading the lists under all four noise conditions. The noise levels were randomized within each block of four lists with the restriction that over the entire experimental session, every noise level was followed by every other noise level except itself. Subjects took about 40 s to read each list. After each list was read, the masking noise was turned off for about 40 s during which the subjects sat in silence. Each list of 15 words was read in each of the four masking conditions five times, for a total of 300 responses from each subject. Recordings were made on an Ampex AG-500 tape recorder running at $7 \frac{1}{2}$ ips.

4. Speech signal processing

Productions of the digits “zero,” “one,” “two,” “three,” “four,” “five,” “six,” “seven,” “eight,” and “nine” were analyzed using digital signal processing techniques. These 400 utterances (ten words × five repetitions × four noise levels × two talkers) were digitized using a VAX 11/750 computer. The utterances were first low-pass filtered at 4.8 kHz and then sampled at a rate of 10 kHz using a 16-bit A/D converter (Digital Sound Corporation model 2000). Each utterance was then digitally edited using a cursor-controlled waveform editor and assigned a file name. These waveform files were then used as input to several digital signal processing analyses.

Linear predictive coding (LPC) analysis was performed on each waveform file. LPC coefficients were calculated every 12.8 ms using the autocorrelation method with a 25.6-ms Hamming window. Fourteen linear prediction coefficients were used in the LPC analyses. The LPC coefficients were used to calculate the short-term spectrum and overall power level of each analysis frame (window). Formant frequencies, bandwidths, and amplitudes were also calculated for each frame from the LPC coefficients. In addition, a pitch extraction algorithm was employed to determine if a given frame was voiced or voiceless and, for voiced frames, to estimate the fundamental frequency (F0).

Total duration for each utterance was determined by visual inspection and measurement from a CRT display that simultaneously presented the utterance waveform along with time-aligned, frame-by-frame plots of amplitude, F0 (for voiced frames), and formant parameters. Cursor controls were used to locate the onset and offset of each utterance. Following identification of utterance boundaries, a program stored the total duration, mean F0, and mean rms energy for each utterance. The onset and offset of the initial vowel of each utterance were also identified and labeled. For each utterance, mean formant frequencies from this vowel segment were also stored. In the case of the word “zero,” the initial vowel /i/ could not be reliably segmented apart from the following voiced segments; thus the entire /ire/ segment was used as the initial vowel for this utterance. Similarly, for the utterances “three” and “four,” the semivowel /r/ was included as part of the vowel during segmentation.

Finally, the peak amplitude frame (25.6-ms window) from the stressed vowel of each utterance was identified and a regression line was fit to the spectrum of this analysis frame. The slope of this regression line was taken as a measure of “spectral tilt,” to quantify the relative distribution of spectral energy at different frequencies.

B. Results and discussion

The influence of ambient noise on various acoustic characteristics of the test utterances is described below. In each case, an analysis of variance was used to determine whether noise level had a significant effect on a given acoustic measure. Separate analyses were carried out for the two talkers. The analyses used word (“zero” through “nine”) and noise level as independent variables. The presentation of results will focus on the effect of noise on the various acoustic measures. The “word” variable will be discussed only in cases where a significant word × noise interaction was observed.

1. Amplitude

Mean rms energy for utterances spoken at each noise level are shown for each talker in Fig. 1. The data are collapsed across utterances. For each talker, the measured amplitudes show a consistent increase with an increase in noise level at the talker’s ears. The largest increase occurred between the quiet condition and the 80-dB noise condition. Analyses of variance revealed that, for each talker, noise level had a significant effect on amplitude [F(3,160)=190.41, p < 0.0001 for talker MD, and F(3,160) =211.15, p < 0.0001 for talker SC]. Newman–Keuls multiple range analyses revealed that, for each talker, each increase in noise led to a significant increase in amplitude (all ps < 0.01). For talker MD, there was also a significant word × noise interaction [F(27,160) = 1.72, p < 0.03]. For both speakers, the pattern of increased masking noise producing an increase in amplitude was present for every word. The word × noise interaction for speaker MD is due to variability across words in the amount of amplitude increase.

2. Duration

Mean word durations for utterances spoken at each noise level are shown for each speaker in Fig. 2. The data are again collapsed across utterances. The pattern is similar to that observed for amplitude: Word duration shows a consistent increase with each increase in noise at the speakers’ ears. However, for speaker MD, the change in duration between the 80- and 90-dB conditions is very small (6 ms). For SC, there is only a slight (15-ms) change in duration across the 80-, 90-, and 100-dB noise conditions. Analyses of variance demonstrated that, for each speaker, noise had a significant effect on word duration [F(3,160) = 23.08, p < 0.0001 for speaker MD, and F(3,160) = 25.31, p < 0.0001 for speaker SC]. Newman–Keuls analyses revealed that, for speaker MD, word duration was significantly shorter in the quiet condition than in any of the other conditions (ps < 0.01), and significantly longer in the 100-dB condition than in the other conditions (p s < 0.01). Durations did not significantly differ in the 80- and 90-dB conditions for MD. For speaker SC, Newman–Keuls tests revealed that duration in the quiet condition was significantly shorter than in the other three conditions (ps < 0.01), but that duration did not significantly vary among the 80-, 90-, and 100-dB noise conditions.

3. Fundamental frequency

Mean fundamental frequencies for utterances spoken at each noise level are plotted separately for each speaker in Fig. 3. The data demonstrate a larger change in F0 across noise conditions for speaker SC than for MD. For MD, F0 showed a small increase as the noise increased from quiet to 80 dB to 90 dB, followed by a slight drop in F0 between the 90- and 100-dB conditions. For SC, a large jump in F0 occurred between the quiet and 80-dB noise conditions, followed by small additional increases in F0 in the 90- and 100-dB conditions. Analyses of variance showed a significant effect of noise on F0 for each speaker [F(3,160) = 3.53, p < 0.02 for speaker MD, and F(3,160) = 42.07, p < 0.0001 for speaker SC]. Newman–Keuls analyses revealed a significant change in F0 between the quiet and 90-dB condition for speaker MD (p < 0.05). For speaker SC, the Newman–Keuls tests showed that F0 in the quiet condition was significantly lower than in any of the other noise conditions (ps < 0,01).

4. Spectral tilt

As mentioned earlier, a regression line was fit to the spectrum of a representative frame from each token. The peak amplitude frame from the initial vowel was identified and used for these measurements. The slope of the regression line was taken as a measure of “spectral tilt” to index the relative energy at high versus low frequencies. Mean spectral tilt values for utterances spoken at each noise level are plotted for each speaker in Fig. 4. For each speaker, there was a decrease in spectral tilt accompanying each increase in noise. This decrease in tilt reflects a change in the relative distribution of spectral energy so that a greater proportion of energy is located in the high-frequency end of the spectrum when utterances are produced in noise. Analyses of variance demonstrated a significant change in spectral tilt across noise conditions for each speaker [F(3,160) = 56.82, p < 0.0001 for speaker MD, and F(3,160)= 23.85, p < 0.0001 for speaker SC]. Newman–Keuls analyses revealed a significant decrease in spectral tilt with each increase in noise for speaker MD (ps < 0.01). For SC, spectral tilt was significantly greater in the quiet condition than in any of the other noise conditions (ps < 0.01). In addition, tilt was significantly greater in the 80-dB noise condition than in the 100-dB condition (p < 0.05).

FIG. 4 — Mean spectral tilt values for words produced in quiet, 80, 90, and 100 dB of masking noise. Values are collapsed across utterances and presented separately for each speaker.

On first examination, it appears that the decrease in spectral tilt observed in the high-noise conditions may be due to the increases in F0 also observed in these conditions. However, a close examination of these two sets of results suggests that the relative increase in spectral energy at high frequencies in the high-noise conditions is not entirely due to increases in F0. For speaker MD, F0 did not change a great deal across noise conditions (see Fig. 3); the change in F0 was significant only in the quiet versus 90-dB comparison. Yet each increase in noise led to a significant decrease in spectral tilt for speaker MD. For speaker SC, the 80- and 100-dB noise conditions did not differ in the analysis of F0, yet a significant decrease in spectral tilt was obtained between these two conditions.

5. Formant frequencies

The influence of masking noise level on vowel formant frequencies was analyzed next. Mean F1 and F2 frequencies from the initial vowel of each utterance were examined. Noise had a consistent effect on the formant data for speaker SC and a less consistent effect for speaker MD. Mean F1 frequencies for utterances produced in each noise condition are shown in Fig. 5. The data for speaker MD appear in the left-hand portion of the figure and the data for speaker SC appear in the middle of the figure. A significant main effect of noise on F1 frequency was observed for speaker SC [F(3,160) = 14.91, p < 0.0001], along with a marginally significant noise × word interaction [F(27,160)=1.5, p < 0.07]. For this speaker, F1 frequency tended to increase as the noise level increased. Newman–Keuls tests revealed that, for this speaker, F1 was significantly lower in the quiet condition than in any of the other noise conditions. The marginally significant noise × word interaction for SC suggests that the pattern of an increase in F1 accompanying an increase in noise may not hold for all ten utterances. The consistency of this pattern can be seen by examining Fig. 6. This figure displays F1 and F2 frequency data for the quiet and 100-dB noise conditions for each of the ten utterances produced by SC. With the exception of the utterance “one,” F1 was greater in the 100-dB condition than in the quiet condition for all utterances.

FIG. 6 — Mean first and second formant frequencies for words produced in quiet and 100 dB of masking noise by speaker SC. Values are presented separately for each utterance.

The main effect of noise and the noise × word interaction did not reach significance in the analysis of F1 frequency for speaker MD. As Fig. 5 shows, the change in mean F1 frequency across noise conditions was less than 3 Hz for this speaker. The F1 and F2 data for speaker MD are broken down by utterance in Fig. 7. Although the noise × word interaction was not significant for MD, the pattern of results shown in this figure suggests that the presence of masking noise may have produced a compacting, or reduction, in the range of F1 for this speaker. In the majority of cases, utterances with low F1 frequencies showed an increase in F1 in noise, while utterances with high F1 frequencies showed a decrease in F1.

FIG. 7 — Mean first and second formant frequencies for words produced in quiet and 100 dB of masking noise by speaker MD. Values are presented separately for each utterance.

The mean values shown in Figs. 3 and 5 demonstrate a striking similarity between the F0 data and the F1 data for each speaker. For speaker MD, there was little change in F0 across noise conditions and no significant influence of noise on F1 frequency. For speaker SC, both F0 and F1 were significantly higher in the 80-, 90-, and 100-dB noise conditions than in the quiet condition. These data suggest a close relationship between F0 and F1; apparently, an increase in fundamental frequency leads to an increase in F1. We carried out one additional analysis to further test this conclusion.

In order to determine whether F0 and F1 were, in fact, directly related, a second analysis was run on speaker SC’s data. In this analysis, the effects of word and noise level on initial-vowel F1 frequency were again tested but with initial-vowel F0 entered as a covariate in the analysis. Mean F1 frequencies at each noise level based on the adjusted cell means from this analysis (in which F0 is covaried out) appear in the right-hand portion of Fig. 5. The results of this analysis were nearly identical to those observed in the original analysis off 1 frequency for SC. The main effect of noise on F1 frequency remained significant [F(3,159) = 5.32, p < 0.0017]. Also, as in the original analysis off F1 for speaker SC, the noise × word interaction fell short of significance [F(27,159)=1.44, p < 0.09]. Finally, Newman–Keuls tests comparing F1 frequencies in the various noise conditions revealed the identical pattern observed in the original analysis: F1 frequency was significantly lower in the quiet condition than in any of the other noise conditions. Thus, for speaker SC, it appears that noise had an influence on F1 frequency independent of its influence on F0.

Turning to the F2 data, masking noise did not produce a significant main effect on F2 frequency for speaker SC. However, a significant noise × word interaction was present [F(27,160) = 1.92, p < 0.008]. An examination of Fig. 6 suggests that the range of F2 frequencies was reduced in the presence of noise for speaker SC. Utterances containing high F2 frequencies showed a decrease in F2 in the 100-dB condition, while utterances with low F2 frequencies showed increases in F2 when noise was increased.

The main effect of noise and the noise × word interaction did not approach significance in the analysis of F2 frequency for speaker MD. An examination of Fig. 7 shows that, for most utterances, F2 showed little change between the quiet and 100-dB noise condition for this speaker.

Fundamental frequency, amplitude, and duration all tended to increase in the presence of noise. In addition, the results demonstrated consistent differences in the spectral characteristics of vowels produced in noise versus quiet. Vowels from utterances produced in noise had relatively flat spectra, with a relatively large proportion of their total energy occurring in higher frequency regions. Vowels from utterances produced in quiet had steeper spectra with relatively little energy present in high-frequency regions. First formant frequencies also appeared to be influenced by the presence of noise for at least one speaker. For SC, F1 frequencies were higher for vowels from utterances produced in the three noise conditions than for vowels produced in the quiet. There was little change in F2 frequencies across noise conditions for either speaker.

The present results demonstrated several clear differences in the acoustic characteristics of speech produced in quiet compared to speech produced in noise. Previous research by Dreher and O’Neill (1957) suggests that the changes in the spectral and temporal properties of speech which accompany the Lombard effect improve speech intelligibility. We carried out two separate perceptual experiments to verify their earlier conclusions.

II. PERCEPTUAL ANALYSES—EXPERIMENT I

In experiment I, subjects identified utterances from the quiet condition and the 90-dB masking noise condition in a forced-choice identification task. Utterances from the quiet and 90-dB noise condition were mixed with broadband noise at equivalent S/N ratios and presented to listeners for identification. If Dreher and O’Neill’s conclusion concerning the intelligibility of speech produced in noise versus quiet is correct, subjects should identify utterances produced in the 90-dB noise condition more accurately than utterances produced in the quiet condition.