Author manuscript; available in PMC: 2012 Feb 1.
Published in final edited form as: Speech Commun. 2011 Feb 1;53(2):195–209. doi: 10.1016/j.specom.2010.09.001

Perception of Place of Articulation for Plosives and Fricatives in Noise

Abeer Alwan 1,*, Jintao Jiang 1,*, Willa Chen 1
PMCID: PMC3076800  NIHMSID: NIHMS239989  PMID: 21499546

Abstract

This study aims at uncovering perceptually-relevant acoustic cues for the labial versus alveolar place of articulation distinction in syllable-initial plosives {/b/,/d/,/p/,/t/} and fricatives {/f/,/s/,/v/,/z/} in noise. Speech materials consisted of naturally-spoken consonant-vowel (CV) syllables from four talkers where the vowel was one of {/a/,/i/,/u/}. Acoustic analyses using logistic regression show that formant frequency measurements, relative spectral amplitude measurements, and burst/noise durations are generally reliable cues for labial/alveolar classification. In a subsequent perceptual experiment, each pair of syllables with the labial/alveolar distinction (e.g., /ba,da/) was presented to listeners in various levels of signal-to-noise-ratio (SNR) in a 2-AFC task. A threshold SNR was obtained for each syllable pair using sigmoid fitting of the percent correct scores. Results show that the perception of the labial/alveolar distinction in noise depends on the manner of articulation, the vowel context, and interaction between voicing and manner of articulation. Correlation analyses of the acoustic measurements and threshold SNRs show that formant frequency measurements (such as F1 and F2 onset frequencies and F2 and F3 frequency changes) become increasingly important for the perception of labial/alveolar distinctions as the SNR degrades.

Keywords: speech perception, place of articulation, plosives, fricatives, noise, psychoacoustics

1. Introduction

Research on speech perception and human auditory processes, particularly in the presence of background noise, helps to improve and calibrate such practical applications as noise-robust automatic speech recognition systems (e.g., Hermansky, 1990; Strope and Alwan, 1997) and aids for the hearing impaired (e.g., Shannon et al., 1995). The present study examines the contributions of various acoustic characteristics to the perceptual distinction between labial and alveolar places of articulation in syllable-initial plosive and fricative consonant-vowel (CV) syllables in quiet conditions and in the presence of additive white Gaussian noise.

The focus is on labial/alveolar syllable pairs that differ in manner of articulation (plosives {/b,d/,/p,t/} versus fricatives {/v,z/,/f,s/}) and voicing (voiced {/b,d/,/v,z/} versus voiceless {/p,t/,/f,s/}) in the vowel contexts {/a/,/i/,/u/}. Plosive consonants are produced by first forming a complete closure in the vocal tract via a constriction at the place of articulation, during which there is generally no sound. The vocal tract is then opened suddenly, releasing the pressure built up behind the constriction; this is characterized acoustically by a transient source and/or a short-duration noise burst (Stevens, 1998). The period between the release and the vowel onset is called the voice onset time (VOT), during which there is silence and/or aspiration noise. In contrast, fricatives are characterized by turbulence noise generated in the region of maximum constriction in the vocal tract. The excitation source is noise for voiceless fricatives, while it is noise and a quasi-periodic source for voiced fricatives. Labials and alveolars have noise source energy concentrations at different frequency regions due to differences in the location of the maximum constriction in the vocal tract.

1.1. Plosive consonants

Formant frequencies have been examined extensively in acoustic studies of the place of articulation in naturally-spoken plosives (e.g., Potter et al., 1947; Fant, 1973; Kewley-Port, 1982). Fant (1973) analyzed spectrograms of six Swedish plosives in nine vowel contexts and concluded that F2 and F3 formant transitions did not sufficiently reflect place of articulation. Kewley-Port (1982) measured the F1, F2, and F3 transitions for voiced plosives (/b,d,g/) in eight vowel contexts and found that F2 and F3 transition onset values were not sufficient to cue place of articulation.

Other studies have focused on the characteristics of the noisy burst of the plosives (e.g., Zue, 1976; Blumstein and Stevens, 1979; Stevens and Blumstein, 1978). Zue (1976) found that an alveolar burst had a broad spectral peak in the high-frequency region, while a velar burst had a compact peak in the mid-frequency region when followed by a front vowel and in the lower frequency region when followed by a back vowel. However, for labials, the study did not find consistent burst characteristics. Stevens and Blumstein (Blumstein and Stevens, 1979; Stevens and Blumstein, 1978) defined labial bursts to be "diffuse falling" (widespread spectral energy with a concentration in the low- to mid-frequency region), alveolar bursts to be "diffuse rising" (widespread spectral energy with a concentration in the high-frequency region), and velars to have a compact mid-frequency spectral peak.

More recently, researchers have suggested that the spectral amplitude of the consonant portion of a CV syllable relative to that of the vowel onset cues place of articulation (e.g., Stevens et al., 1999; Suchato, 2004). In (Stevens et al., 1999), three relative spectral quantities were measured, as well as F1 and F2 frequencies. The first relative measure was the peak spectrum amplitude of the burst in the frequency range above 3500 Hz for female talkers and 3000 Hz for male talkers (Ahi) relative to the average of spectral peaks in the F2 and F3 range in the burst (A23), denoted as Ahi-A23, measuring the spectral tilt of the burst. The second quantity was the spectrum amplitude of the F1 prominence in the vowel onset (Av) relative to Ahi, denoted as Av-Ahi, measuring the burst amplitude relative to the vowel amplitude. The third quantity was the difference between Av and the peak spectrum amplitude in the F2 to F3 range (pA23) in the burst, denoted as Av-pA23, measuring the mid-frequency spectral prominence. These measurements were performed on a number of syllable-initial plosives, 15 tokens each, drawn from 100 sentences spoken by two male and two female talkers. The results showed that Ahi was smaller than A23 for labials but relatively similar to A23 for alveolars; Av-Ahi measurements indicated that the labial burst in high frequencies was weaker than the alveolar burst; Av-pA23 was the best indicator of velars because velars had the most prominent mid-frequency peak; and place of articulation classification with these measurements showed the effects of talker and vowel context. Suchato (2004) found that attributes relating to the burst spectrum in relation to that of the vowel were most effective for automatically classifying place of articulation, while attributes relating to formant transitions were somewhat less effective.

Perceptual experiments with synthetic plosives modeling a male adult voice have also been conducted to find perceptually relevant acoustic cues for the place of articulation (e.g., Liberman et al., 1954; Delattre et al., 1955; Ohde and Stevens, 1983). Liberman et al. (1954) found that the F2 transition cued the place of articulation for plosives. Delattre et al. (1955) provided further specification of the F2 onset frequencies (720 Hz for /b/ and 1800 Hz for /d/). Hedrick and colleagues (Hedrick and Jesteadt, 1996; Hedrick et al., 1995) varied the burst amplitude in the F4-F5 region relative to the vowel onset amplitude and the F2 and F3 onset frequencies in synthetic voiceless plosive CV syllables. They showed that increasing the relative presentation level of the burst yielded more alveolar responses, that the increase in alveolar responses also co-varied with the F2 and F3 onset frequencies, and that burst amplitude relative to vowel onset amplitude in the F4-F5 region seemed to cue voiceless labial/alveolar place of articulation.

Locus equations have also been examined as cues for place of articulation for plosives (Sussman et al., 1991, 1993, 1995; Fruchter and Sussman, 1997). These equations are linear regressions of the F2 onset frequency on F2 vowel frequency (midvowel nucleus) for a single consonant across a range of vowels. The derived slope and intercept values have been used as predictors of place of articulation. Sussman et al. (1991) investigated locus equations in naturally-spoken voiced syllable-initial plosives. The authors found linear regression functions with distinct slopes and intercepts as a function of place. Fruchter and Sussman (1997) comprehensively sampled the F2 onset-F2 vowel acoustic space in the vicinity of /b,d,g/ locus equations using synthetic CV stimuli. The authors found that locus equations serve as important perceptual cues for place of articulation.
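A locus equation is simply an ordinary least-squares line relating F2 onset frequency to F2 mid-vowel frequency for one consonant across vowels. As a rough sketch (the F2 values below are invented for illustration, not data from the studies cited):

```python
import numpy as np

# Hypothetical F2 measurements (Hz) for one consonant across several vowels:
# F2 at the mid-vowel nucleus and F2 at the transition onset.
f2_vowel = np.array([1100.0, 1800.0, 2300.0, 900.0, 1500.0])
f2_onset = np.array([1250.0, 1700.0, 2000.0, 1150.0, 1520.0])

# The locus equation is the least-squares line F2_onset = slope * F2_vowel + intercept;
# the (slope, intercept) pair serves as a place-of-articulation predictor.
slope, intercept = np.polyfit(f2_vowel, f2_onset, 1)

print(f"slope = {slope:.3f}, intercept = {intercept:.0f} Hz")
```

Slopes near 1 indicate strong coarticulation with the vowel; distinct (slope, intercept) pairs across consonants are what make the regression useful as a place cue.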

In summary, the relative spectral amplitudes, formant transitions, and burst characteristics have been found to be important cues to place of articulation for plosive consonants in acoustic and perceptual studies.

1.2. Fricative consonants

You (1979) found that the duration of frication noise varied with place of articulation. Shadle and Mair (1996) measured spectral moments, dynamic amplitude, and spectral slope in fricatives with different effort levels and vowel contexts. The authors found that spectral moments varied significantly by frequency ranges.

Perceptual experiments with fricatives were also conducted to find perceptually relevant acoustic cues for place of articulation. Harris (1958) and Heinz and Stevens (1961) used natural and synthetic tokens, respectively, and showed that spectral properties of frication noise were critical perceptual attributes for place of articulation. Heinz and Stevens (1961) varied the initial frequencies for the fricatives and F2 onset frequencies of the vowel and then varied the amplitude of the fricative noise relative to the vowel. The results showed that stimuli with resonance frequencies of 6500 to 8000 Hz usually produced /f/ and /θ/ responses, but these responses only began to emerge when the fricative noise was −15 and −25 dB relative to the vowel. Guerlekian (1981) used several synthesized stimuli with conflicting cues and found that low and high amplitude of noise relative to the vowel was perceived as /fa/ and /sa/, respectively, by both Spanish and English listeners. Jongman (1988) edited the frication noise duration in naturally-spoken CV syllables to include 20 to 70 ms in 10-ms steps as well as the entire frication noise. Perceptual results indicated that the listeners did not require the entire fricative-vowel syllable in order to correctly perceive a fricative and that perception of fricative place of articulation was much more affected by a decrease in frication duration than perception of voicing or manner of articulation.

Other perceptual studies suggested the importance of the amplitude of noise relative to that of the vowel onset at different frequency regions (Hedrick and Ohde, 1993; Stevens, 1985). Hedrick and Ohde (1993) showed that the amplitude of the frication noise relative to the vowel in the F3-F5 region affected perception of place across different vowel contexts and frication durations. Overall, labial and alveolar fricatives seemed to have weaker and stronger noise relative to the vowel, respectively. However, Behrens and Blumstein (1988) found that perception of place of articulation of fricatives was generally not influenced by overall frication amplitude. Nevertheless, the authors suggested that the relevant property of amplitude may not be the “overall” amplitude of the frication, but rather a change in amplitude of the fricative noise relative to the vowel in a specific region.

As with plosive consonants, relative spectral amplitudes have been found in both acoustic and perceptual studies to be important cues to place of articulation for fricative consonants; frication characteristics (especially frication duration) have also been found to be effective place cues.

1.3. Speech perception in noise

The studies discussed above were all conducted in quiet environments; however, speech is often heard in the presence of background noise. Finding perceptually salient acoustic cues in noise has important practical implications for ASR systems and hearing aids.

One of the earliest studies on perceptual confusions between consonants in the presence of noise was conducted by Miller and Nicely (1955). Their study used 200 naturally-spoken utterances, each consisting of one of 16 consonants followed by the vowel /a/, in varying levels of white noise and bandpass filtering conditions. The selected consonants varied along five articulatory features (voicing, nasality, affrication, duration, and place of articulation). Their work showed that place information was difficult to distinguish at SNRs less than +6 dB. They also found that perception of plosives was much less robust than that of fricatives in a noisy environment. Other researchers used an information-theoretic approach to model confusion matrices of speech in noise (Soli and Arabie, 1979; Wang and Bilger, 1973). These studies attempted to find out which cues account for perceptual results in noise by analyzing confusion matrices statistically. For example, Soli and Arabie (1979) analyzed the consonant confusion data from (Miller and Nicely, 1955) and suggested (qualitatively) that consonant confusion data could be better explained by the acoustic properties of the consonants than by phonetic features.

The perceptual effect of noise on place of articulation cues, however, is not clear and has not been systematically investigated. Several studies have comprehensively examined physical measures that could account for the changes in the perception of phonological features in the presence of background noise for a wide range of consonants (Farar et al., 1987; Hant and Alwan, 2000, 2003; Jiang et al., 2006; Hedrick and Younger, 2007; Parikh and Loizou, 2005). Farar et al. (1987) adopted an approach to quantify perceptual confusions in noise by incorporating speech into psychoacoustic masking models. Using stationary broad-band noises with spectral shapes resembling certain plosives, the authors measured the discrimination thresholds for different plosive burst pairs as a function of burst duration. The results showed that discrimination thresholds decreased nearly 20 dB as the bursts' durations increased from 10 to 300 ms. However, they were unable to model the data to predict these durational effects. Alwan (1992) conducted discrimination experiments with synthetic /ba,da/ stimuli while masking their F2 trajectories with a bandpass noise. The discrimination results suggested that high-frequency cues (such as relative spectral amplitude differences in the F3 to F4 region) can be used as place cues since subjects were able to identify the consonants when F2 was completely masked. The (Alwan, 1992) study only examined /ba,da/ syllables. In (Hant and Alwan, 2000, 2003; Hant, 2000), the authors developed a general, time/frequency detection model to fit the noise-masked thresholds of bandpass noises which varied in noise duration, bandwidth, and center-frequency. The model predicted well the discrimination of synthetic voiced plosive CV syllables in perceptually flat and speech-shaped noise. Their perceptual experiments and model showed that formant transitions are more perceptually salient in noise than the plosive burst. Jiang et al. (2006) conducted voicing discrimination experiments using stimuli consisting of naturally-spoken CV syllables by four talkers in various levels of additive white Gaussian noise. Their results indicate that the onset frequency of the first formant is critical in perceiving voicing in syllable-initial plosives in additive white Gaussian noise, while the VOT duration is not. Parikh and Loizou (2005) used multi-talker babble and speech-shaped noise to examine the acoustic and perceptual influence of noise on plosive consonant cues in VCV syllables. Plosive consonant recognition remained high even at −5 dB despite the disruption of burst cues due to additive noise. The authors speculated that listeners must be relying on other cues, perhaps formant transitions, to identify plosives. The (Parikh and Loizou, 2005) study employed plosive consonant identification rather than place discrimination, and there were no correlation analyses between identification scores and acoustic measurements for plosives. Hedrick and Younger (2007) investigated whether there were different perceptual weightings to cues for the /p,t/ place of articulation in speech-shaped noise versus reverberant listening conditions. The authors used synthetic /pa/ and /ta/ stimuli with varying amplitude of the spectral peak in the F4-F5 frequency region of the burst relative to the adjacent vowel peak amplitude in the same frequency region and F2/F3 formant transition onset frequencies. Results with normal-hearing listeners showed that the weightings of relative spectral amplitudes and transition cues depended on the listening condition (quiet, speech-shaped noise, or reverberation). That is, normal-hearing listeners reduced their weighting of formant transitions in speech-shaped noise, while they had little difficulty using the formant transition cues in the reverberant listening condition. The (Hedrick and Younger, 2007) study only examined /pa,ta/ syllables.

Noise characteristics influence the perception of speech sounds. Hant and Alwan (2000) examined the perceptual confusion of synthetic plosives in noise and found that there was a 5 to 10 dB drop in threshold SNRs (for which place of articulation was just perceptually salient) between speech-shaped noise and perceptually flat noise, suggesting that adult native English listeners might be using high-frequency cues to discriminate plosives in speech-shaped noise, while those cues were unavailable in perceptually flat noise. The perceptually flat noise had equal energy per Equivalent Rectangular Bandwidth of the auditory filter (Glasberg and Moore, 1990). Nittrouer et al. (2003) showed clear differences in adults' perception of consonants in white versus speech-shaped noise, while there was no difference in children's perception. Another type of masker consists of background talkers. Simpson and Cooke (2005) demonstrated that a single competing talker or amplitude-modulated noise is a far less effective masker than multi-talker babble or speech-shaped noise for consonant identification in VCV syllables and that babble-modulated noise is a less effective masker than natural babble when there are more than two talkers in the noise. Similar results were found by Engen and Bradlow (2007) and by Lecumberri and Cooke (2006). Engen and Bradlow (2007) found that in two-talker babble, native English listeners were more adversely affected by English babble than by Mandarin Chinese babble for sentence recognition. Lecumberri and Cooke (2006) showed that English listeners performed better when the competing speech was Spanish.

A number of studies have demonstrated that speech perception in noise depends on context information (Benkí, 2003; Bradlow and Alexander, 2007; Cutler et al., 2008). Benkí (2003) showed that the perception of CVC words in noise depends on the lexical status, word frequency, and neighborhood density as context effects. Bradlow and Alexander (2007) examined the semantic and phonetic enhancements for speech perception in noise by native and non-native listeners. The authors found that non-native listeners' final-word recognition improved only when both semantic and acoustic enhancements were available. In contrast, the native listeners benefited from each source of enhancement separately and in combination. Redford and Diehl (1999) found that initial consonants were significantly more identifiable than final consonants for CVC syllables embedded in frame sentences.

Listener differences have also been investigated (Cutler et al., 2004; Lecumberri and Cooke, 2006; Bradlow and Alexander, 2007; Cutler et al., 2008; Cooke et al., 2008). Cutler et al. (2004) examined English phoneme confusions by native and non-native listeners in CV and VC syllables embedded in multi-talker babble. Although non-native listeners performed less accurately than native listeners at all noise levels, the effects of language background and noise did not interact. That is, there were no differential effects of noise on non-native listening. Lecumberri and Cooke (2006) studied the identification of American English consonants in /aCa/ context with noise being a single competing talker, speech-shaped noise, or eight-talker babble. The authors showed that non-native listeners were more adversely affected by noise than native listeners. In a follow-up study, Cutler et al. (2008) presented the (Lecumberri and Cooke, 2006) experiment to listeners from the population in (Cutler et al., 2004) in the quiet and multi-talker babble conditions. Larger noise effects on consonant identification emerged for non-native listeners than for native listeners, suggesting that task factors (consonant identification in CV and VC syllables vs. in /aCa/ syllables) rather than non-native population differences (Dutch vs. Spanish) underlie the discrepancy between the (Cutler et al., 2004) and (Lecumberri and Cooke, 2006) studies. Cooke et al. (2008) studied native and non-native listeners' recognition of keywords in English sentences presented in quiet or masked by either speech-shaped noise or a competing talker. The authors showed that non-native listeners suffered more from increasing levels of noise.

1.4. The present study

In the present study, we examine the relationship between the acoustic properties of speech signals and the results from perceptual experiments conducted in the presence of additive white Gaussian noise. Our overall goal is to discover the perceptual effect of noise on acoustic cues for place of articulation and to develop a deeper understanding of place of articulation perception in noise. First, measurements of a number of acoustic properties from a set of CV utterances were made (in quiet) and analyzed for possible place-of-articulation cues using logistic regression analyses. Second, perceptual experiments were conducted using the speech tokens mixed with varying amounts of white Gaussian noise. Finally, the acoustic measurements were examined in conjunction with the results from the perceptual experiments to determine which cues could possibly account for the perception of place of articulation in noise. This was done by performing correlation analyses between the acoustic measurements and the place-of-articulation discrimination threshold SNRs.
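The final step, correlating each acoustic measurement with the discrimination threshold SNRs, can be sketched as follows. The measurement values and threshold SNRs below are hypothetical placeholders, not data from this study:

```python
import numpy as np

# Hypothetical data: one acoustic measurement (say, F2 frequency change in Hz)
# and the place-discrimination threshold SNR (dB) for several CV pairs.
f2_change = np.array([650.0, 820.0, 300.0, 910.0, 480.0, 720.0])
threshold_snr = np.array([-8.0, -11.0, -3.0, -12.5, -6.0, -9.5])

# Pearson correlation between cue magnitude and threshold SNR: a strong
# negative r would mean that pairs with a larger cue difference remain
# discriminable at lower (worse) SNRs.
r = np.corrcoef(f2_change, threshold_snr)[0, 1]
print(f"r = {r:.2f}")
```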

The present study will contribute to the literature in two ways: (1) it studies a comprehensive set of acoustic cues relevant to place of articulation, as reported in various papers, using a single context (CV syllables) with three vowels, and (2) it examines the perceptual relevance of these cues in quiet and in noise across a range of consonants (plosives and fricatives). Most of the cues were implicated in separate prior studies, and it is important to investigate their noise robustness in a single context. The noise robustness of these cues for place of articulation perception is examined across plosives and fricatives rather than within each manner of articulation, which should yield more general and consistent findings. As a first step in this research direction, the present study uses naturally-spoken CV syllables, for which higher-level factors such as lexical frequency or contextual information are irrelevant.

2. Acoustic analysis

2.1. Stimuli

Stimuli consisted of isolated, naturally-spoken CV utterances, where C was from the set {/b/,/d/,/p/,/t/,/f/,/s/,/v/,/z/} and V was from the set {/a/,/i/,/u/}, for a total of 24 syllables. Speech signals were recorded in a sound-attenuating room using a headset microphone and were sampled at a rate of 16 kHz with a 16 bits per sample representation. Four talkers (two males, two females; age range 18 to 36 years), all native speakers of American English, were recorded. Each talker produced eight tokens for each CV, while only four of them were used for the present study (the first three tokens and the last one were discarded), resulting in a total of 16 tokens per CV syllable. Syllables were sorted in labial/alveolar pairs (such as /ba/ and /da/), such that manner of articulation, voicing, and vowel context were identical, and the two syllables in each pair differed only in the place-of-articulation dimension. Thus, there were a total of 12 CV pairs (see Table 1).

Table 1.

CV pairs used in this study.

             voiced                     voiceless
        plosives   fricatives     plosives   fricatives
/a/     /ba,da/    /va,za/        /pa,ta/    /fa,sa/
/i/     /bi,di/    /vi,zi/        /pi,ti/    /fi,si/
/u/     /bu,du/    /vu,zu/        /pu,tu/    /fu,su/

2.2. Acoustic measurements

All tokens were normalized such that the peak amplitude of the entire sampled waveform was set to the same level. Acoustic measurements were made for the speech tokens in quiet. The total set of measured properties is described in Table 2.

Table 2.

Acoustic measurements. Fi can be F1, F2, or F3. Superscripts f or p indicate that the measures were made only for fricatives or plosives, respectively. The “v” letter indicates that measures were made for the vowel spectrum. Those without asterisks were intermediate measures that were used to make relative spectral amplitude measurements.

Name                Description
*Fib/Fie/Fis/FiD    Fi onset/offset/steady-state frequency/transition duration
*FibA/FieA/FisA     Fi onset/offset/steady-state amplitude
*Fidf/FidA          Fi frequency/amplitude change
*votD/bstDp/nDf     VOT/burst/noise duration
Ahi                 Peak amplitude of burst/noise spectrum in high frequencies
                    (female: above 3.5 kHz; male: above 3 kHz)
Av/Av4              Peak amplitude of vowel spectrum at the F1/F4 prominence
A23/A45             Average amplitude of burst/noise spectrum in F2-F3/F4-F5
pA23/pA45           Peak amplitude of burst/noise spectrum in F2-F3/F4-F5
Am/Avm              Average amplitude of burst/vowel onset spectrum at
                    mid-frequencies (3.2-4.8 kHz)
Ans                 Average amplitude of the entire noise spectrum
*Ahi-A23            Spectral tilt of the burst/noise
*Av-Ahi             Peak spectral amplitude of burst/noise in high frequencies
                    relative to that of vowel at F1
*Av4-A45            Peak spectral amplitude of vowel at F4 relative to the
                    average spectral amplitude of burst/noise in F4-F5
*Av4-pA45           Peak spectral amplitude of vowel at F4 relative to that of
                    burst/noise in F4-F5
*Av-pA23p           Mid-frequency spectral prominence for plosives
*Am-Avmp            Difference between burst and vowel spectral amplitude at
                    mid-frequencies
*Av-Ansf            Average spectral amplitude of noise relative to the peak of
                    vowel at F1

2.2.1. Formant frequency and amplitude measurements

Formant measurements (frequency and amplitude) were made from the time waveforms, wideband spectrograms, LPC (Linear Predictive Coding) spectra, and short-time DFT (Discrete Fourier Transform) spectra using Matlab. To obtain a spectrum, a 20 ms (for tokens from male talkers) or 15 ms (for tokens from female talkers) Hamming window was applied to define an analysis segment.

Each segment was zero-padded for a 1024-point FFT analysis, and the frame shift was half the Hamming window length. For an LPC analysis, no zero padding was applied, the frame shift was 2.5 ms for all talkers, and the LPC order was between 8 and 12 (depending on the variance of the prediction error). Vowel measurements included the first three formants (F1, F2, and F3). The three formants were located by examining the LPC spectra (Fig. 1a) and spectrograms. Three landmark points were defined for each formant: onset, offset, and steady state (Fig. 1c). F1, F2, and F3 onsets, chosen manually, were defined as the center point of the frame that exhibited the following characteristic: a sudden spectral change in the corresponding frequency range, particularly the introduction of a sharp spectral peak. The end of a formant transition (offset), chosen automatically, was defined as the frame during which the rate of change of the formant frequency fell to less than 5 Hz per 2.5 ms, and the average rate of change for the next 12.5 ms was also less than 5 Hz per 2.5 ms (Kewley-Port, 1982, see Fig. 1d). The steady-state point was centered at 95 ms after the onset, and the steady-state measurements were averaged over five frames. At the formant transition onset, offset, and steady-state points, formant frequency (in Hz), and formant amplitude (in dB) were recorded based on the LPC spectrum. From these measurements, formant frequency and amplitude changes were measured between the formant transition onset and steady state. Formant transition duration was defined as the time difference between the formant transition offset and onset.
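The automatic offset criterion described above (frame-to-frame change below 5 Hz per 2.5 ms, sustained on average over the next 12.5 ms) can be sketched as follows. The function name is ours and the formant track is synthetic; this is a sketch of the criterion, not the authors' Matlab code:

```python
import numpy as np

def transition_offset(f_track, frame_ms=2.5, rate_hz=5.0, look_ms=12.5):
    """Return the index of the first frame where the formant change drops
    below rate_hz per frame AND stays below it on average over the next
    look_ms. f_track holds one formant frequency (Hz) per analysis frame,
    with frames spaced frame_ms apart."""
    look = int(look_ms / frame_ms)           # frames in the look-ahead window
    diffs = np.abs(np.diff(f_track))         # Hz change per frame
    for i in range(len(diffs) - look):
        if diffs[i] < rate_hz and diffs[i:i + look].mean() < rate_hz:
            return i + 1                     # offset frame index
    return None                              # no offset found (transition never settles)

# Synthetic F1 track: a rising transition followed by a steady state at 700 Hz.
track = np.concatenate([np.linspace(400.0, 700.0, 20), np.full(10, 700.0)])
print(transition_offset(track))
```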

Figure 1.

Figure 1

(a) LPC spectrum of a /ta/ token during the vowel, (b) DFT spectrum of a /ta/ token during the burst, (c) formant transition measurements, and (d) illustration of the determination of formant transition offset (in this case, F1 frequencies obtained using LPC analyses) when the change in frequency drops below 5 Hz per 2.5 ms.

2.2.2. Duration and relative spectral amplitude measurements

The burst, frication noise, and VOT measurements were made by visually inspecting the time waveforms and wideband spectrograms of the tokens using the software CoolEdit Pro. Wideband spectrograms were calculated using a 6.4 ms Hamming window with a frame shift of one sample. The burst was defined as the short segment characterized by a sudden, sharp vertical line in the spectrogram. If multiple bursts were present, the burst duration was measured (in ms) from the beginning of the first burst to the end of the last. The spectrum of the combined transient and burst was estimated using Welch's averaged periodogram method (Stevens et al., 1999). That is, the signal was divided into overlapping sections of specified window length. If the burst duration was shorter than 9 ms, then a 3 ms window with 1.5 ms overlap was used; otherwise, a 6 ms window with a 3 ms overlap was used. The spectrum was obtained using a 256 point FFT (Fast Fourier Transform) method. VOT duration in plosives was measured from the end of the burst to the beginning of the vowel, which was also the beginning of the first waveform period. VOT duration in fricatives was measured from the consonant release to the beginning of the vowel, including noise duration and aspiration.
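The burst-spectrum estimation can be sketched with Welch's averaged periodogram and the window rule described above (3 ms windows with 1.5 ms overlap for bursts shorter than 9 ms, otherwise 6 ms windows with 3 ms overlap). SciPy's `welch` stands in for whatever implementation the authors used, and the burst here is synthetic noise:

```python
import numpy as np
from scipy.signal import welch

fs = 16000  # sampling rate used in the study (16 kHz)

# Hypothetical burst segment: 12 ms of noise standing in for a plosive burst.
rng = np.random.default_rng(0)
burst = rng.standard_normal(int(0.012 * fs))

# Window length per the rule above; overlap is half the window in both cases.
win = int((0.003 if len(burst) < 0.009 * fs else 0.006) * fs)
freqs, psd = welch(burst, fs=fs, nperseg=win, noverlap=win // 2, nfft=256)

spectrum_db = 10 * np.log10(psd + 1e-12)  # spectrum level in dB
```

With a 256-point FFT this yields a 129-bin one-sided spectrum from 0 Hz to fs/2, from which the amplitude measures below can be read off.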

Ahi represents the peak amplitude of the burst/noise spectrum in the frequency range above 3500 Hz for female talkers and 3000 Hz for male talkers. A23 and A45 are the average amplitudes of the burst/noise spectrum in the F2-F3 and F4-F5 regions, respectively. Av and Av4 represent the peak amplitudes of the vowel spectrum at the F1 and F4 prominence, respectively. The pA23 and pA45 measures represent the peak amplitudes of the burst/noise spectrum in the F2-F3 and F4-F5 regions, respectively. Am and Avm are the average amplitudes of the burst and vowel onset spectrum at mid frequencies (between 3200 Hz and 4800 Hz), respectively. Ans represents the average amplitude of the entire noise spectrum. All these measures were in dB. For the vowels /a/ and /u/, F2-F3 and F4-F5 formant frequency regions are 1000-3000 and 3000-5000 Hz, respectively. For vowel /i/, F2-F3 and F4-F5 formant frequency regions are 1500-3500 and 4000-6000 Hz, respectively. The definitions of Ahi, A23, and pA23 are illustrated in Fig. 1b.

From these measurements, a set of relative spectral amplitude measures was constructed: (1) Ahi-A23 characterizes the spectral tilt of the burst/noise; (2) Av-Ahi is the high-frequency burst/noise spectral amplitude relative to F1 amplitude in the vowel; (3) Av-pA23 is calculated only for plosives (Stevens et al., 1999) to determine a mid-frequency spectral prominence; (4) Av4-A45 characterizes the relative spectral amplitude of the vowel versus the burst/noise in the F4-F5 region; (5) Av4-pA45 is very similar to Av4-A45 except that the peak amplitude of the burst/noise in the F4-F5 region is calculated; (6) Am-Avm characterizes the difference between burst and vowel spectral amplitude at the mid-frequency range for plosives (Stevens et al., 1999); and (7) Av-Ans quantifies the overall amplitude of the noise relative to spectral amplitude of the vowel at the F1 prominence for fricatives (Hedrick and Ohde, 1993).
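All of these relative measures reduce to band-limited means and peaks of a dB spectrum. A minimal sketch with a made-up spectrum (the helper name is ours; band edges follow the /a/ and /u/ convention given below, with the male-talker 3 kHz cutoff for Ahi):

```python
import numpy as np

def band_stats(freqs, spec_db, lo, hi):
    """Average and peak spectrum amplitude (dB) within [lo, hi] Hz."""
    band = spec_db[(freqs >= lo) & (freqs <= hi)]
    return band.mean(), band.max()

# Hypothetical burst spectrum (dB) on a 0-8 kHz frequency axis.
freqs = np.linspace(0, 8000, 129)
spec_db = 40 - 0.003 * freqs + 5 * np.sin(freqs / 500)  # made-up shape

# For /a/ and /u/ contexts: F2-F3 = 1000-3000 Hz, F4-F5 = 3000-5000 Hz.
A23, pA23 = band_stats(freqs, spec_db, 1000, 3000)
A45, pA45 = band_stats(freqs, spec_db, 3000, 5000)
Ahi = band_stats(freqs, spec_db, 3000, 8000)[1]  # peak above 3 kHz (male talkers)

tilt = Ahi - A23  # spectral tilt of the burst/noise (Ahi-A23)
```

Subtracting the corresponding vowel-onset quantities (Av, Av4, Avm) then gives the starred relative measures in Table 2.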

The measurements Av-Ahi, Ahi-A23, and Av-pA23 were inspired by Stevens et al. (1999), except for two differences in calculating these noise measures. First, the burst segment in Stevens et al. (1999) did not include aspiration in voiceless plosives. Second, Stevens et al. (1999) used the same window length for both the vowel onset and the burst segment. In addition, the present study also examined the noise properties in the F4 to F5 regions that had been suggested for place of articulation distinction (Hedrick and Jesteadt, 1996; Hedrick et al., 1995). Furthermore, the average of the entire noise spectrum was measured for fricatives.

2.3. Place-of-articulation classification based on acoustic measurements

The acoustic measurements were analyzed using logistic regression, where the quiet speech tokens were classified as either labial or alveolar according to a single acoustic property measured without the addition of the white Gaussian noise. A separate logistic regression model was applied to each acoustic variable for each CV pair,

log[prob / (1 − prob)] = α + β · Mea + e (1)

where prob is the probability of a token being labial, α is a constant, β is a weighting coefficient, Mea is one acoustic feature (measurement), and e is the error term. For each token, the consonant was either labial (coded as 1) or alveolar (coded as 0). After logistic regression, the decision boundary α + β · Mea = 0 was used for classification, and the predicted labels were compared against the true labels to obtain percent correct scores. Table 3 lists the results in terms of percent correct classification based on logistic regression using the tokens from all talkers. Only acoustic measures with 79% or higher correct classification are listed, sorted from high to low for each CV pair.
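A single-feature logistic fit and the sign-based decision rule can be sketched as below. The gradient-ascent fitting is a stand-in for whatever statistics package the authors used (the paper does not name one); the data in the usage note are synthetic.

```python
import math

def fit_logistic(meas, labels, lr=0.5, epochs=2000):
    """Single-feature logistic regression, log[p/(1-p)] = alpha + beta*Mea.

    labels are 1 for labial and 0 for alveolar tokens.  Fitted here by
    plain gradient ascent on the log-likelihood (an illustrative choice).
    """
    # standardize the feature so one learning rate works for any scale
    mu = sum(meas) / len(meas)
    sd = (sum((m - mu) ** 2 for m in meas) / len(meas)) ** 0.5 or 1.0
    z = [(m - mu) / sd for m in meas]

    a, b = 0.0, 0.0
    for _ in range(epochs):
        ga = gb = 0.0
        for zi, yi in zip(z, labels):
            p = 1.0 / (1.0 + math.exp(-(a + b * zi)))
            ga += yi - p
            gb += (yi - p) * zi
        a += lr * ga / len(z)
        b += lr * gb / len(z)
    # undo the standardization so alpha + beta*Mea applies on the raw scale
    alpha, beta = a - b * mu / sd, b / sd
    return alpha, beta

def percent_correct(alpha, beta, meas, labels):
    """Classify by the sign of alpha + beta*Mea and score against labels."""
    hits = sum((alpha + beta * m > 0) == (y == 1) for m, y in zip(meas, labels))
    return 100.0 * hits / len(meas)
```

On well-separated synthetic measurements (e.g., alveolar tokens near 1-3 and labial tokens near 10-12), the fitted boundary classifies every token correctly.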

Table 3.

Percent correct classification (shown as a superscript) of the quiet speech tokens (from all talkers) based on a single acoustic property measured without the addition of the white Gaussian noise.

/ba,da/ /bi,di/ /bu,du/ /pa,ta/ /pi,ti/ /pu,tu/ /va,za/ /vi,zi/ /vu,zu/ /fa,sa/ /fi,si/ /fu,su/
F2b(100) Av-Ahi(94) F3df(91) bstD(97) Ahi-A23(94) Av-Ahi(100) nD(100) Av-Ans(100) Av-Ans(91) F2df(94) Av4-pA45(88) Av-Ans(100)
F2df(100) F1bA(84) Av-Ahi(91) Ahi-A23(84) Av-Ahi(81) Ahi-A23(91) F2df(97) votD(84) Av4-pA45(84) F3df(94) F2e(81)
F1b(94) F2bA(84) F2b(88) Av4-pA45(81) Av4-pA45(91) Av-Ans(94) nD(84) F1e(81) F1b(91) Av4-pA45(81)
F3df(94) Am-Avm(84) F2e(84) bstD(81) F2e(88) F1b(88) F3df(81) Av-Ans(91)
Av-Ahi(88) F2dA(84) F1e(81) F2b(84) F2b(81) F2b(84)
F2D(84) Av4-A45(84) F2e(81) F1df(81)
Av4-pA45(81)
bstD(81)

Of the 37 recorded measurements in Table 2, a number of acoustic measurements do not appear in Table 3. That is, these acoustic measurements were not prominent in classifying the labial/alveolar place-of-articulation distinction. These non-prominent measurements include formant steady-state frequencies and amplitudes, formant offset amplitudes, F3 onset and offset frequencies, F3 onset amplitude, F1 and F3 amplitude change, F1 and F3 transition duration, and Av-pA23. Several other acoustic measurements, although they appear in Table 3, produced moderate place of articulation classification performance for only one or two CV pairs. Such measurements include F1 offset frequency (F1e, 81% for /bu,du/ and /vu,zu/), F1 frequency change (F1df, 81% for /fa,sa/), F1 and F2 onset amplitude and F2 amplitude change (F1bA, F2bA, and F2dA, 84% for /bi,di/), F2 transition duration (F2D, 84% for /ba,da/), VOT duration (votD, 84% for /vi,zi/), Av4-A45 (84% for /pu,tu/), and Am-Avm (84% for /bi,di/). A first generalization from these non-prominent acoustic measures is that formant amplitudes, steady-state frequencies, and offset frequencies were not discriminative for labial/alveolar place of articulation classification. An exception is that the F2 offset frequency (F2e) yielded moderate classification performance for several CV pairs (84% for /bu,du/, 84% for /pu,tu/, 81% for /va,za/, and 81% for /fu,su/). A second generalization is that, with the exception of the noise/burst duration measurements, the voicing feature measurements (e.g., VOT) were not reliable cues for labial/alveolar place of articulation.

Several formant frequency measurements, F1 and F2 onset frequencies and F2 and F3 frequency changes (F1b, F2b, F2df, and F3df), were distinctive mostly for the /a/-context labial/alveolar pairs (11 out of 16 cases), moderately for the /u/-context ones (4 out of 16 cases), but not for the /i/-context ones. Labials had a higher F1 onset frequency than alveolars except for /pi,ti/ and /pu,tu/. F2 onset frequency was lower for labials than for alveolars by 200-400 Hz except in the /i/ context, where the onsets were approximately the same. The F2 frequency change was smaller in magnitude for labials than for alveolars, and this difference was most prominent for the /a/-context pairs and least prominent for the /i/-context ones (see Fig. 2). The F3 frequency change was a distinctive cue for labial/alveolar place of articulation for plosives in the /a/ and /i/ contexts but not in the /u/ context.

Figure 2.

Figure 2

Histograms of F2 frequency change (F2df) for the 12 labial/alveolar pairs, with the labial and alveolar tokens counted separately. The histogram bin centers range from −820 to 540 Hz in 80 Hz steps. F2df values below −860 Hz and above 580 Hz are counted in the −820 Hz and 540 Hz bins, respectively. F2df was a reliable cue for the vowel /a/ pairs except for /pa,ta/. Asterisks next to a CV pair name indicate 79% or higher correct classification of place of articulation.

The relative spectral amplitude measurements, Ahi-A23, Av-Ahi, and Av4-pA45, were reliable cues for labial/alveolar place of articulation for both plosives and fricatives to varying degrees. Ahi-A23 was higher in alveolars than in labials for plosives (by about 4 dB for voiced plosives and 14 dB for voiceless plosives), but approximately the same for fricatives (with a difference of about 2 dB). However, in the classification analyses, the measure was only reliable for voiceless plosives. Av-Ahi values in labials were, on average, about 23 dB and 4 dB higher than those in alveolars for plosives and fricatives, respectively. Therefore, Av-Ahi produced relatively high labial/alveolar place of articulation classification for plosives (e.g., 100% for /pu,tu/) except for /pa,ta/. Six out of the 12 pairs were reliably classified by Av4-pA45, whose values in labials were, on average, about 16 dB and 30 dB higher than those in alveolars for plosives and fricatives, respectively (see Fig. 3). In fact, for voiceless plosives and fricatives in the /i/ and /u/ contexts, classification was above 80% correct using only Av4-pA45. Av-Ans appeared to be the most reliable cue for classifying place of articulation for five out of six fricative pairs. That is, it resulted in 100% classification for /vi,zi/ and /fu,su/, 94% for /va,za/, and 91% for /vu,zu/ and /fa,sa/. Av-Ans measurements were higher for labial fricatives than for alveolar ones (by about 17 dB on average). The place-of-articulation distinctions appeared to be more prominent in the higher frequency ranges (i.e., Av4-A45, Av4-pA45, and Av-Ans) for fricatives than for plosives. In plosives, burst duration (bstD) signaled labial/alveolar place of articulation for /ba,da/, /pa,ta/, and /pi,ti/. In fricatives, noise duration (nD) appeared to be a cue for labial/alveolar place of articulation for /va,za/ and /vi,zi/. The noise duration was about 40 ms longer for alveolars than for labials.

Figure 3.

Figure 3

Histograms of Av4-pA45 for the 12 labial/alveolar pairs, with the labial and alveolar tokens counted separately. The histogram bin centers range from −55 to 55 dB in 5 dB steps. Av4-pA45 values below −57.5 dB and above 57.5 dB are counted in the −55 dB and 55 dB bins, respectively. Av4-pA45 was distinctive for /ba,da/, /pi,ti/, /pu,tu/, /vu,zu/, /fi,si/, and /fu,su/. Asterisks next to a CV pair name indicate 79% or higher correct classification of place of articulation.

In summary, formant amplitudes, steady-state frequencies, offset frequencies (except F2 offset frequency), and voicing feature measurements (except noise/burst duration measurements) were generally not discriminative for labial/alveolar place of articulation classification. Several formant frequency measurements (F1 and F2 onset frequencies and F2 and F3 frequency changes) were somewhat distinctive for labial/alveolar place of articulation classification (mostly in the /a/ context and moderately in the /i/ context). The relative spectral amplitude measurements in the higher frequency ranges, Ahi-A23, Av-Ahi, and Av4-pA45, were reliable cues for labial/alveolar place of articulation for both plosives and fricatives to varying degrees, consistent with the results in (Stevens et al., 1999) on Av-Ahi and Ahi-A23 and the results in (Hedrick et al., 1995) on Av4-pA45. Av-Ans was the most reliable cue for classifying place of articulation for fricatives. The burst and noise duration measurements, bstD and nD, appeared to be a moderate cue for labial/alveolar place of articulation for plosives and fricatives, respectively. A summary of the relevance of several acoustic properties in classifying labial/alveolar place of articulation for plosives and fricatives is shown in Table 4. (Threshold SNRs from the perception experiment are also given; see Section 3.4.) Generally speaking, formant frequency measurements (F1b, F2b, F2df, F3df), relative spectral amplitude measurements (Ahi-A23, Av4-pA45, Av-Ahi, Av-Ans), and noise/burst duration were reliable cues for labial/alveolar place of articulation (marked by asterisks in Table 4).

Table 4.

A summary of acoustic features that each yielded 79% or above correct classification of place of articulation. Threshold SNRs are listed beneath each CV pair. Asterisks are added next to the measures that were discussed at the end of Sec. 2.3.

/ba,da/ /bu,du/ /pu,tu/ /pi,ti/ /bi,di/ /pa,ta/ /fu,su/ /fa,sa/ /va,za/ /fi,si/ /vu,zu/ /vi,zi/
−4.4 −1.8 0 0.6 4.4 6.0 −5.2 −4.9 −4.5 −3.7 −3.2 −1.4
*F1b
F1e
F1df
F1bA
*F2b
F2e
*F2df
F2D
F2bA
F2dA
*F3df
*Av-Ans
*Av-Ahi
*Av4-pA45
Av4-A45
*Ahi-A23
Am-Avm
*bstD
votD
*nD

3. Perceptual study

3.1. Stimuli

All 384 CV tokens described in Sec. 2.1 were used as stimuli for the perceptual study. The masking noise used in the perceptual experiments was a 1250-ms segment of white Gaussian noise. At the beginning of each experimental session, 32 Gaussian noise sources were generated. During the presentation of each stimulus, a noise masker was randomly selected from the 32 Gaussian noise sources. The SNR was defined as the ratio of the maximum root mean square (RMS) value in the CV to the RMS value of the noise token [20 log10(max_RMS_CV) − 20 log10(RMS_noise)]. A post-hoc examination of the acoustic portions with maximum RMS energy indicated that the maximum occurred in the vowel part for most of the CV tokens. The maximum RMS energy of a token was computed using a 30 ms rectangular window so as to exclude acoustic spikes. The use of maximum RMS energy was consistent with the approach in (Miller and Nicely, 1955), where SNR was set based on the peak deflection of the VU needle. The RMS energy of the noise was based on the entire noise segment. Hence, the SNR did not depend on the duration of the speech token.
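The SNR definition above can be sketched directly: the speech term is the maximum RMS over a sliding 30 ms rectangular window, and the noise term is the RMS of the whole noise segment. The 16 kHz sampling rate is an assumption for illustration.

```python
import math

def rms(x):
    """Root mean square of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def snr_db(cv, noise, fs=16000, win_ms=30):
    """SNR as defined in the text:
    20*log10(max_RMS_CV) - 20*log10(RMS_noise).

    The CV term is the maximum RMS over a sliding 30 ms rectangular
    window (so brief acoustic spikes do not set the level); the noise
    term is the RMS of the entire noise segment, so the SNR does not
    depend on the duration of the speech token.
    """
    win = int(round(win_ms * fs / 1000.0))
    max_rms = max(rms(cv[i:i + win]) for i in range(0, len(cv) - win + 1))
    return 20.0 * math.log10(max_rms) - 20.0 * math.log10(rms(noise))
```

For example, a token whose loudest 30 ms window has twice the RMS of the noise comes out at about 6 dB SNR.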

3.2. Participants

Listening experiments were conducted with four participants (two males, two females; age range 18 to 36 years; none of whom were the talkers), all native speakers of American English who passed a hearing screening (hearing thresholds at or below 10 dB HL from 250 Hz to 8 kHz).

3.3. Procedure

Perceptual testing took place in a sound-attenuating room. Digital speech stimuli were played via an Ariel Pro Port 656 board digital-to-analog converter (16 bits at a rate of 16 kHz). The resulting analog waveforms were amplified by a Sony 59ES DAT recorder and were then presented binaurally via Telephonics TDH49P headphones. The system was calibrated within 0.5 dB (from 125 to 7500 Hz at third octave intervals) using a 6-cc coupler and a Larson Davis 800B sound level meter (with the “A” weighting scale and a slow response) prior to each experiment.

Each signal (without noise) was played at 60 dB SPL, and the accompanying noise level was adjusted. The SPL of each speech signal was set based on its maximum RMS energy in a 30 ms rectangular window around the maximum level of the CV. The SPL of the white Gaussian noise was adjusted based on its RMS energy to produce the different SNRs. The speech signal was added to a 1250 ms noise (or silence) segment such that it was centered in the middle of the segment.

Participants made two-alternative forced choices (2-AFC). Utterances were played in blocks of 64 tokens of a single CV pair (32 tokens × 2 presentations). When an utterance was played, subjects were asked to label the sound heard as either the labial or alveolar consonant (e.g., /b/ or /d/). A computer program was developed to record participants' responses from their keyboard inputs. No feedback was given at any time. The test was then repeated at different SNR levels. The order of SNR conditions was: quiet, 10 dB, 5 dB, 0 dB, −5 dB, −10 dB, and −15 dB (same order for all listeners). The CV pairs were presented in the order of /ba,da/, /bi,di/, /bu,du/, /pa,ta/, /pi,ti/, /pu,tu/, /fa,sa/, /fi,si/, /fu,su/, /va,za/, /vi,zi/, and /vu,zu/. To counterbalance the effects of talker and token order, the order of presenting the 64 tokens within each CV pair was pseudo-randomized. Participants were required to take a break after each CV pair and were instructed to take at least one break every hour. They were also allowed to take voluntary breaks if they felt tired while listening to a CV pair. Each session lasted about one hour but no longer than two hours to prevent fatigue. On Day 1, each participant had a one-hour training session. During training, the experimenter explained and demonstrated the experimental procedure; participants read written instructions; and participants then completed a set of practice trials during which they could ask the experimenter questions.

3.4. Results

3.4.1. Percent correct classification and threshold SNRs for place of articulation in noise

The percentage of correct place of articulation judgments was computed and listed as a function of SNR, manner of articulation, voicing, and vowel context (see Table 5). The percent correct values were calculated using all the data collected from the perceptual experiments, including all listeners and all talkers. Each data entry thus represents 256 responses from four listeners for a CV pair at a specific SNR condition (4 talkers × 4 listeners × 4 tokens × 2 presentations × 2 consonants).

Table 5.

Percent correct judgments as a function of SNR (dB), manner of articulation, voicing, and vowel

SNR (dB)  /b,d/: /a/ /i/ /u/ /a,i,u/   /p,t/: /a/ /i/ /u/ /a,i,u/   /v,z/: /a/ /i/ /u/ /a,i,u/   /f,s/: /a/ /i/ /u/ /a,i,u/
21 100 100 100 100 99 96 99 98 100 98 100 99 100 99 100 100
10 98 92 93 94 81 98 100 93 97 98 100 98 99 100 100 100
5 96 81 98 92 78 88 98 88 97 95 96 96 99 97 96 97
0 87 65 86 79 70 78 79 76 91 84 90 88 95 93 96 95
−5 79 57 64 67 64 64 56 61 78 66 72 72 79 72 80 77
−10 66 50 60 59 54 52 52 53 61 54 52 56 56 51 59 55
−15 51 41 56 49 47 51 48 49 58 52 52 54 42 50 53 48


Most of the 12 CV pairs had close to 100% correct place of articulation judgments in the absence of noise. The listeners appeared to have had a particularly difficult time classifying the /pa,ta/ pair (with 81% correct place of articulation judgments even when the SNR was 10 dB). However, for CV pairs other than /pa,ta/, the percent correct was 92% or above when the SNR was 10 dB. Among the 12 CV pairs, the /f,s/ pairs appeared to have the best place of articulation judgments (93-96% correct) when the SNR was 0 dB. For SNRs of −10 dB and below, place of articulation judgments for all 12 pairs were dramatically affected by noise (below 70% correct). When the SNR was −15 dB, the percent correct of place of articulation judgments was about 50%, which is chance performance.

In order to analyze how the acoustic properties account for the perceptual results, a single SNR value for each CV pair was needed to represent the robustness of that CV pair in the presence of noise. That value, or threshold SNR (in dB), was the point along the SNR continuum at which the percent correct of responses was 79% (Levitt, 1971). The perceptual results for the 12 CV pairs were arranged into plots as shown in Fig. 4, where percent correct is plotted versus SNR. For the quiet conditions, the SNR was estimated as 21 dB. A sigmoid was then fitted to each plot, described by the following equation:

y = c + [(d − c)/2] · {1 + [1 − e^(−(x−b)/a)] / [1 + e^(−(x−b)/a)]} (2)

where x and y represent SNR and percent correct, respectively; d and c are the maximum and minimum values of percent correct, respectively; and a and b are parameters that adjust the slope and position of the transition of the sigmoid function between the top and bottom flat regions. In this study, c was set to 50%, the chance performance, and d was set to 100%. Parameters a and b were varied systematically to obtain the best-fit sigmoid by minimizing the mean squared error. From the best-fit sigmoid, the threshold SNR level corresponding to 79% correct responses was obtained. Thus, a single threshold SNR value for each of the 12 pairs of labial/alveolar CV syllables was calculated to represent the perceptual robustness of that pair. A lower threshold SNR corresponded to better perceptual results (more robust to noise).
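The fitting and threshold extraction can be sketched as follows. Note that Eq. (2) algebraically reduces to the standard logistic form y = c + (d − c)/(1 + e^(−(x−b)/a)), which the sketch uses; the grid search over a and b mirrors the "varied systematically to minimize mean squared error" procedure, with grid ranges chosen here for illustration.

```python
import math

def sigmoid(x, a, b, c=50.0, d=100.0):
    """Eq. (2) in its reduced logistic form: rises from c (chance) to d
    (perfect), with slope parameter a and midpoint b."""
    return c + (d - c) / (1.0 + math.exp(-(x - b) / a))

def fit_threshold(snrs, pcs, target=79.0):
    """Grid-search a (slope) and b (midpoint) to minimize the mean
    squared error against the observed percent-correct scores, then
    invert the best-fit sigmoid at the target percent correct.
    Returns the threshold SNR in dB.  Grid ranges are illustrative."""
    best = None
    for a10 in range(1, 101):            # a in 0.1 .. 10.0 dB
        for b10 in range(-200, 201):     # b in -20.0 .. 20.0 dB
            a, b = a10 / 10.0, b10 / 10.0
            mse = sum((sigmoid(x, a, b) - y) ** 2 for x, y in zip(snrs, pcs))
            if best is None or mse < best[0]:
                best = (mse, a, b)
    _, a, b = best
    # invert: target = 50 + 50/(1 + exp(-(x-b)/a))  =>  x = b - a*ln(1/frac - 1)
    frac = (target - 50.0) / (100.0 - 50.0)
    return b - a * math.log(1.0 / frac - 1.0)
```

Feeding it the seven SNR conditions of the experiment (with quiet coded as 21 dB) yields one threshold per CV pair, directly comparable to the values in Table 4.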

Figure 4.

Figure 4

A sigmoid fitting (solid line) of percent correct scores as a function of SNR (dB) for the 12 labial/alveolar pairs. For each pair, the 79% threshold line is drawn, and the threshold SNR value is labeled. The average percent correct scores (from the four listeners) are in circles. The error bars represent the minimum and maximum numbers among the four listeners.

The sigmoid fitting was applied to perceptual results for each CV pair and each listener. The obtained threshold SNRs were submitted to an omnibus repeated-measures analysis of variance with manner of articulation (2), voicing (2), and vowel (3) as within-subjects factors. The only reliable interaction was between manner of articulation and voicing [F(1,3)=23.9, p=.016]. That is, for plosives, voiced CV syllables yielded better place of articulation classification than voiceless ones (by about 2.4 dB on average, with the /i/ context as an exception), while the opposite was true for fricatives (by about 1.5 dB on average). Note that in (Miller and Nicely, 1955), the authors suggested the relative independence between the perception of manner of articulation and that of voicing. This inconsistency might result from the task difference between the two studies (i.e., open-set identification vs. 2-AFC on place of articulation). The main effect of manner of articulation was reliable [F(1,3)=61.1, p=.004], with fricatives (mean threshold SNR = −3.9 dB) being more robust than plosives (mean threshold SNR = 0.9 dB), agreeing with the results from (Miller and Nicely, 1955). A possible reason is the difference in noise spectra between labial and alveolar CV syllables for plosives and fricatives. Generally speaking, fricatives have longer duration, and thus their noise spectral differences for place of articulation can be more easily perceived than those for plosives. The main effect of vowel context was marginally significant [F(2,2)=16.8, p=.056]. The vowel /i/ context yielded higher threshold SNRs (less robust) than the /a/ [F(1,3)=9.8, p=.052] and /u/ [F(1,3)=29.0, p=.013] contexts, but there was no significant difference between the /a/ and /u/ contexts [F(1,3)=0.7, p=.452]. This agrees with (Hant, 2000), where /bi,di/ was the least robust while /ba,ga/ was the most robust.
This vowel effect on threshold SNRs may result from the fact that formant frequency measurements were distinctive in /a/ and /u/ contexts in quiet (except for /pa,ta/), but not for /i/ contexts (see Sec. 2.3). The mean threshold SNRs were −1.9 dB, 0 dB, and −2.6 dB for the /a/, /i/, and /u/ pairs, respectively. The /pa,ta/ pair was an exception for the vowel effect. The main effect of voicing was not significant [F(1,3)=1.6, p=.297]. As a demonstration, the threshold SNR levels at 79% correct for all CV pairs are shown in Fig. 4, where percent correct scores were averaged over all talkers and all listeners.

Table 4 lists the threshold SNRs and the relevance of several acoustic properties in classifying labial/alveolar place of articulation for the 12 CV pairs in quiet conditions. For plosives, three pairs (/ba,da/, /bu,du/, and /pu,tu/), which had formant frequencies in addition to noise measurements as cues, were relatively more robust in noise for the labial/alveolar place of articulation distinction. The other three plosive pairs (/pi,ti/, /bi,di/, and /pa,ta/) did not have formant frequency cues, and correspondingly their threshold SNRs were 0 dB or above. Fricative pairs in general had lower threshold SNRs (less than −1 dB) than the plosive ones. For fricative pairs, both the formant frequencies and the relative spectral amplitude measurements appeared to be responsible for the low threshold SNRs.

In summary, the perception of labial/alveolar place of articulation in noise depended on the interaction between voicing and manner of articulation, manner of articulation, and vowel context. Fricatives were generally more robust than plosives. The labial/alveolar distinction was not robust in the vowel /i/ context.

4. Correlations between threshold SNR values and absolute acoustic differences of the means

Correlations were computed between the 12 threshold SNR values from the perceptual experiments and the absolute differences of the mean values of each measured acoustic property for the 12 labial/alveolar pairs. The mean value of each acoustic measurement for every CV syllable was calculated from 16 tokens (4 talkers × 4 tokens of the same syllable). The correlation is defined as

r = corr(|Mea¯la − Mea¯al|, 10 − SNRt) (3)

where corr represents the Pearson correlation function, la represents labial tokens, al represents alveolar tokens, Mea represents one type of acoustic measurement, the bar over Mea represents the mean operation, SNRt represents the threshold SNR values, and 10 − SNRt indicates how much the threshold SNRs were below 10 dB. The assumption is that if an acoustic property is an important cue for place of articulation, then a larger absolute difference between the means would correspond to better performance (a lower threshold SNR), while a smaller absolute difference between the means would correspond to poorer performance (a higher threshold SNR). A negative correlation coefficient indicates that larger differences correspond to higher (worse) threshold SNRs, which is opposite to a normal psychoacoustic relationship. Also, to evaluate how the correlations vary with perceptual performance levels, the threshold SNRs were re-estimated at a number of perceptual thresholds between 71% and 84% using Equation 2.
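Equation (3) can be sketched as below: for one acoustic measure, the absolute differences of the labial and alveolar means across the 12 CV pairs are correlated with (10 − threshold SNR), so a positive r means larger acoustic separation goes with more noise-robust (lower-threshold) pairs. This is an illustrative re-implementation, not the authors' code.

```python
import math

def pearson(xs, ys):
    """Plain Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sxy / (sx * sy)

def cue_correlation(mean_labial, mean_alveolar, thresholds):
    """Eq. (3): correlate |mean_labial - mean_alveolar| for one acoustic
    measure across the CV pairs with (10 - threshold SNR), so that
    positive r = larger acoustic difference accompanies lower (better)
    threshold SNRs."""
    diffs = [abs(l, ) if False else abs(l - a) for l, a in zip(mean_labial, mean_alveolar)]
    return pearson(diffs, [10.0 - t for t in thresholds])
```

With means that separate more for the pairs that also have lower thresholds, r approaches +1; with the opposite pattern, r goes negative, matching the interpretation in the text.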

Pearson product-moment correlation coefficients were obtained only for those acoustic properties that appear in Table 4. Figure 5 shows the results of correlating threshold SNRs with the absolute differences of the means of several acoustic properties for the 12 CV pairs along different perceptual performance levels. Acoustic properties that had negative correlation coefficients are not displayed. The correlation coefficients shown in Fig. 5 were not significant after Bonferroni correction because of the low N. They were therefore not definitive in and of themselves (because of the risk of false positives), but they were considered useful aids for interpreting the core results given in Secs. 2.3 and 3.4. For this reason, the correlations for each acoustic property were examined across different threshold percent-correct levels and in terms of their pattern (i.e., increasing or decreasing with the threshold percent correct).

Figure 5.

Figure 5

Correlation coefficients between threshold SNRs and acoustic measures (distances between means) across all talkers as a function of the threshold percent correct (71%-84%). Acoustic measures that produced negative correlations were not displayed.

There were 10 acoustic properties that had positive correlations with threshold SNRs. They are F1b, F1e, F1df, F2b, F2e, F2df, F2D, F3df, votD, and Av4-pA45. Their correlations with threshold SNRs were between 0.15 and 0.70. Among the 10 acoustic properties, the correlation for votD and Av4-pA45 became lower with decreasing perceptual performance. That is, votD and Av4-pA45 became less effective when there was more noise (lower performance levels). At all SNRs, votD was more effective than Av4-pA45. In contrast, the eight formant measures became more effective when there was more noise (lower performance levels). The order of these formant measures in terms of correlations (from high to low) was: F3df, F2b, F2D, F2df, F2e, F1b, F1df, and F1e. The correlations for F3df, F2b, and votD were at about the same level.

Consistent with the results in Sec. 2.3, formant amplitudes produced negative correlations, indicating that formant amplitudes did not contribute to lowering the threshold SNRs of labial/alveolar distinction in noise. In Sec. 2.3, F1 and F2 onset frequencies (F1b and F2b) and F2 and F3 frequency changes (F2df and F3df) resulted in a high percentage of labial/alveolar classification for several quiet CV pairs. These formant properties also contributed to the place of articulation distinction in noise. For example, F3df, F1b, F2b, and F2df yielded relatively high correlations between the perceptual SNR thresholds and acoustic measures. Other formant properties (e.g., F1e and F2e frequencies) also showed positive correlations, although they did not classify labial/alveolar well in quiet conditions.

Relative spectral amplitude measurements (e.g., Ahi-A23, Av-Ahi, Av-Ans, bstD, and Am-Avm), which were found in Sec. 2.3 to be acoustically distinctive in terms of place of articulation, usually produced negative correlations with the perceptual measures. This might be due to the relative spectral amplitude measurements being easily corrupted in the presence of additive noise. The acoustic property Av4-pA45 produced positive but low correlations. Although the VOT duration was not a distinctive acoustic feature for labial/alveolar place of articulation in quiet conditions, it resulted in relatively high correlations between acoustic and perceptual measures. The VOT duration was longer for fricatives than for plosives, while fricatives usually had lower threshold SNRs than plosives. Therefore, the relatively high correlations for VOT duration mainly resulted from the differences between fricatives and plosives.

In summary, formant frequency properties were more noise robust than the relative spectral amplitude measurements. Because the /a/ context resulted in larger absolute differences in the F3df, F2b, F2df, and F1b measurements between the labial and alveolar pairs than the /i/ and /u/ contexts, the place of articulation judgments in the /a/ context were more robust than those in the /i/ and /u/ contexts.

5. General discussion

The present study examines the acoustic correlates and the perception in noise of place of articulation in naturally-spoken syllable-initial plosive and fricative consonants. Both formant frequency and relative spectral amplitude measurements were the cues most predictive of place of articulation decisions in quiet conditions, but the relative spectral amplitude measurements appeared to be masked at low SNRs, whereas formant frequency measurements remained better place of articulation cues at low SNRs. Specifically, in quiet conditions, all of the 12 CV pairs were correctly classified at or near 100% with formant frequency or relative spectral amplitude measurements. Nevertheless, no single cue showed high classification for both fricatives and plosives across all vowel contexts.

In the presence of noise, listeners could still make correct labial/alveolar place of articulation judgments even when the SNR level was −5 dB. However, for an SNR of −15 dB, listeners' responses were equivalent to random guesses (chance performance). The present study showed that fricatives, in general, had lower threshold SNRs than plosives, agreeing with (Miller and Nicely, 1955). Similar to that in (Miller and Nicely, 1955), this study showed that voiceless fricatives, in particular, were slightly more robust than the voiced ones.

For place of articulation classification in noise, the vowel effect was significant in the sense that the vowel /a/ context, except for /pa,ta/, yielded lower threshold SNRs than the /u/ context, which in turn was more robust than the /i/ context. The reason might be that the distinctive acoustic features (e.g., F1b, F2b, F2df, and F3df) were most prominent for the /a/-context pairs and least prominent for the /i/-context pairs (see Sec. 2.3); this was confirmed by the correlation analyses, in which the high correlation coefficients usually resulted from the vowel differences in the formant frequency measurements. The vowel effect is consistent with the (Parikh and Loizou, 2005) study, in which acoustic analyses indicated that F1 was detected more reliably than F2 and correlation analyses indicated that vowel identification scores were highly correlated with acoustic parameter values at an SNR of −5 dB. Interestingly, the formant frequency measurements for /pa,ta/ were not discriminative compared to other /a/-context pairs. The effect of manner of articulation was also reliable, which could be attributed to the noise durations in plosives and fricatives (Jongman, 1988).

Relative spectral amplitude measurements, although acoustically distinctive in quiet conditions, usually had low or negative correlations with the threshold SNRs (except for votD and Av4-pA45). These results indicate that the formant frequency measurements were more important for the perception of place of articulation at low SNRs than the relative spectral amplitude measurements, agreeing with acoustic analysis and perceptual results from (Parikh and Loizou, 2005). Compared to formant frequency measurements, relative spectral amplitude measurements are easily corrupted by noise, especially broadband noise. The higher correlations between threshold SNR values and formant measurements at lower percent-correct thresholds are consistent with the glimpsing model of speech perception in noise, according to which listeners rely on spectro-temporal regions in which the target signal is least affected by the background (Hant and Alwan, 2003; Cooke, 2006; Li and Loizou, 2007). That is, listeners use whatever cues are available, and those cues crucially depend on the nature of the noise masker. Therefore, the effect of the type of noise masker should also be taken into account. One speculation is that speech-shaped or multi-talker babble noise might also corrupt the formant frequency measurements. For example, in (Hedrick and Younger, 2007), the authors showed that for the perception of place of articulation in the plosive consonants /p,t/, normal-hearing listeners reduced their weighting of formant transitions and relied more on the relative spectral amplitude cues in speech-shaped noise than in the quiet condition. If the present study were carried out with speech-shaped or multi-talker babble noise, the correlations between threshold SNR values and relative spectral amplitude measurements at lower percent-correct thresholds might be higher.

A limitation of the correlation analyses is that within- and between-talker variation was not examined, owing to the limited number of tokens and perceptual responses. One possibility is that the correlations were driven by data from a single talker (between-talker variation) or by specific tokens within one talker (within-talker variation), which would limit the generalizability of the present results. However, the distributions of all acoustic properties (e.g., Figs. 2 and 3) were examined and found to be unremarkable. Nevertheless, evaluating within- and between-talker variation remains an interesting topic for future work.

In summary, for white Gaussian noise, formant frequency measurements are more dominant cues for the labial/alveolar place of articulation distinction than relative spectral amplitude measurements; the perception of place of articulation depends on vowel context, manner of articulation, and the interaction of voicing and manner of articulation; and no single acoustic feature by itself could cue the perception of place of articulation. These results could eventually be useful for hearing aids, cochlear implant processing algorithms, and noise-robust automatic speech recognition. For hearing aids, for example, better noise-reduction algorithms could be designed by enhancing noise-level-dependent salient acoustic cues.

In future studies, experiments will be conducted using a larger dataset and synthetic stimuli to construct acoustic continua and to control interactions among a limited number of acoustic properties (e.g., independently varying the F2 onset frequency, the F2 frequency change, and the F3 frequency change). The perceptual experiments can be expanded by masking the stimuli with different types of noise (e.g., perceptually flat noise, speech-shaped noise, multi-talker babble, and car noise). In addition, perceptual experiments can be performed with cochlear implant listeners to help understand which speech cues they rely on in noisy environments.

Acknowledgments

This work was supported in part by NIH-NIDCD grant 1R29-DC02033-01A1, the NSF, and a Fellowship from the Radcliffe Institute to Abeer Alwan. We thank Marcia Chen for her help in data analysis and Wendy Espeland, Marwa Elshakry, and Christine Stansell for commenting on an earlier version of this manuscript. Thanks also to Steven Lulich for constructive comments. The views expressed here are those of the authors and do not necessarily represent those of the NSF.

Footnotes

Portions of this work were presented at ICSLP 2000 and ICPhS 2003.

References

  1. Alwan A. The role of F3 and F4 in identifying the place of articulation for stop consonants. Proceedings of the International Conference on Spoken Language Processing; Banff, Canada; 1992. pp. 1063–1066.
  2. Behrens S, Blumstein SE. On the role of the amplitude of the fricative noise in the perception of place of articulation in voiceless fricative consonants. J. Acoust. Soc. Am. 1988;84(3):861–867. doi: 10.1121/1.396655.
  3. Benkí JR. Quantitative evaluation of lexical status, word frequency, and neighborhood density as context effects in spoken word recognition. J. Acoust. Soc. Am. 2003;113(3):1689–1705. doi: 10.1121/1.1534102.
  4. Blumstein SE, Stevens KN. Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants. J. Acoust. Soc. Am. 1979;66(4):1001–1017. doi: 10.1121/1.383319.
  5. Bradlow AR, Alexander JA. Semantic and phonetic enhancements for speech-in-noise recognition by native and non-native listeners. J. Acoust. Soc. Am. 2007;121(4):2339–2349. doi: 10.1121/1.2642103.
  6. Cooke M. A glimpsing model of speech perception in noise. J. Acoust. Soc. Am. 2006;119(3):1562–1573. doi: 10.1121/1.2166600.
  7. Cooke M, Lecumberri MLG, Barker J. The foreign language cocktail party problem: Energetic and informational masking effects in non-native speech perception. J. Acoust. Soc. Am. 2008;123(1):414–427. doi: 10.1121/1.2804952.
  8. Cutler A, Lecumberri MLG, Cooke M. Consonant identification in noise by native and non-native listeners: Effects of local context. J. Acoust. Soc. Am. 2008;124(2):1264–1268. doi: 10.1121/1.2946707.
  9. Cutler A, Weber A, Smits R, Cooper N. Patterns of English phoneme confusions by native and non-native listeners. J. Acoust. Soc. Am. 2004;116(6):3668–3678. doi: 10.1121/1.1810292.
  10. Delattre PC, Liberman AM, Cooper FS. Acoustic loci and transitional cues for consonants. J. Acoust. Soc. Am. 1955;27(4):769–773.
  11. Engen KJV, Bradlow AR. Sentence recognition in native- and foreign-language multi-talker background noise. J. Acoust. Soc. Am. 2007;121(1):519–526. doi: 10.1121/1.2400666.
  12. Fant G. Stops in CV-syllables. In: Fant G, editor. Speech Sounds and Features. MIT Press; Cambridge, MA: 1973. pp. 110–139.
  13. Farar CL, Reed CM, Ito Y, Durlach NI, Delhorne LA, Zurek PM, Braida LD. Spectral-shape discrimination. I. Results from normal-hearing listeners for stationary broadband noises. J. Acoust. Soc. Am. 1987;81:1085–1092. doi: 10.1121/1.394628.
  14. Fruchter D, Sussman HM. The perceptual relevance of locus equations. J. Acoust. Soc. Am. 1997;102(5):2997–3008. doi: 10.1121/1.421012.
  15. Glasberg BR, Moore BC. Derivation of auditory filter shapes from notched-noise data. Hear. Res. 1990;47(1–2):103–138. doi: 10.1016/0378-5955(90)90170-t.
  16. Guerlekian JA. Recognition of the Spanish fricatives /s/ and /f/. J. Acoust. Soc. Am. 1981;70:1624–1627.
  17. Hant JJ. A computational model to predict human perception of speech in noise. University of California; Los Angeles, CA: 2000. Ph.D. dissertation.
  18. Hant JJ, Alwan A. Predicting the perceptual confusion of synthetic plosive consonants in noise. Proceedings of the International Conference on Spoken Language Processing; Beijing, China: 2000. pp. 941–944.
  19. Hant JJ, Alwan A. A psychoacoustic-masking model to predict the perception of speech-like stimuli in noise. Speech Commun. 2003;40(3):291–313.
  20. Harris KS. Cues for discrimination of American English fricatives in spoken syllables. Lang. Speech. 1958;1:1–17.
  21. Hedrick M, Ohde RN. Effect of relative amplitude of frication on perception of place of articulation. J. Acoust. Soc. Am. 1993;94(4):2005–2026. doi: 10.1121/1.407503.
  22. Hedrick MS, Jesteadt W. Effect of relative amplitude, presentation level and vowel duration on perception of voiceless stop consonants by normal and hearing-impaired listeners. J. Acoust. Soc. Am. 1996;100(5):3398–3407. doi: 10.1121/1.416981.
  23. Hedrick MS, Schulte L, Jesteadt W. Effect of relative and overall amplitude on perception of voiceless stop consonants by listeners with normal and impaired hearing. J. Acoust. Soc. Am. 1995;98(3):1292–1303. doi: 10.1121/1.413466.
  24. Hedrick MS, Younger MS. Perceptual weighting of stop consonant cues by normal and impaired listeners in reverberation versus noise. J. Speech Lang. Hear. Res. 2007;50(2):254–269. doi: 10.1044/1092-4388(2007/019).
  25. Heinz JM, Stevens KN. On the properties of voiceless fricative consonants. J. Acoust. Soc. Am. 1961;33:589–596.
  26. Hermansky H. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 1990;87(4):1738–1752. doi: 10.1121/1.399423.
  27. Jiang J, Chen M, Alwan A. On the perception of voicing in syllable-initial plosives in noise. J. Acoust. Soc. Am. 2006;119(2):1092–1105. doi: 10.1121/1.2149841.
  28. Jongman A. Duration of frication noise required for identification of English fricatives. J. Acoust. Soc. Am. 1988;85(4):1718–1725. doi: 10.1121/1.397961.
  29. Kewley-Port D. Measurement of formant transitions in naturally produced stop consonant-vowel syllables. J. Acoust. Soc. Am. 1982;72(2):379–389. doi: 10.1121/1.388081.
  30. Lecumberri MLG, Cooke M. Effect of masker type on native and non-native consonant perception in noise. J. Acoust. Soc. Am. 2006;119(4):2445–2454. doi: 10.1121/1.2180210.
  31. Levitt H. Transformed up-down methods in psychoacoustics. J. Acoust. Soc. Am. 1971;49(2B):467–477.
  32. Li N, Loizou PC. Factors influencing glimpsing of speech in noise. J. Acoust. Soc. Am. 2007;122(2):1165–1172. doi: 10.1121/1.2749454.
  33. Liberman AM, Delattre PC, Cooper FS, Gerstman LJ. The role of consonant-vowel transitions in the perception of the stop and nasal consonants. Psychol. Monogr. 1954;68(8):1–13.
  34. Miller GA, Nicely PE. An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Am. 1955;27(2):338–352.
  35. Nittrouer S, Wilhelmsen M, Shapley K, Bodily K, Creutz T. Two reasons not to bring your children to cocktail parties. J. Acoust. Soc. Am. 2003;113:2254.
  36. Ohde RN, Stevens KN. Effect of burst amplitude on the perception of stop consonant place of articulation. J. Acoust. Soc. Am. 1983;74:706–714. doi: 10.1121/1.389856.
  37. Parikh G, Loizou PC. The influence of noise on vowel and consonant cues. J. Acoust. Soc. Am. 2005;118(6):3874–3888. doi: 10.1121/1.2118407.
  38. Potter RK, Kopp GA, Green H. Visible Speech. Van Nostrand; Princeton, NJ: 1947.
  39. Redford MA, Diehl RL. The relative perceptual distinctiveness of initial and final consonants in CVC syllables. J. Acoust. Soc. Am. 1999;106(3):1555–1565. doi: 10.1121/1.427152.
  40. Shadle CH, Mair SJ. Quantifying spectral characteristics of fricatives. Proceedings of the International Conference on Spoken Language Processing; Philadelphia, PA: 1996. pp. 1521–1524.
  41. Shannon RV, Zeng F-G, Kamath V, Wygonski J, Ekelid M. Speech recognition with primarily temporal cues. Science. 1995;270(5234):303–304. doi: 10.1126/science.270.5234.303.
  42. Simpson SA, Cooke M. Consonant identification in N-talker babble is a nonmonotonic function of N. J. Acoust. Soc. Am. 2005;118(5):2775–2778. doi: 10.1121/1.2062650.
  43. Soli SD, Arabie P. Auditory versus phonetic accounts of observed confusions between consonant phonemes. J. Acoust. Soc. Am. 1979;66(1):46–59. doi: 10.1121/1.382972.
  44. Stevens KN. Evidence for the role of acoustic boundaries in the perception of speech sounds. In: Fromkin VA, editor. Phonetic Linguistics: Essays in Honor of Peter Ladefoged. Academic Press; New York, NY: 1985. pp. 243–255.
  45. Stevens KN. Acoustic Phonetics. MIT Press; Cambridge, MA: 1998.
  46. Stevens KN, Blumstein SE. Invariant cues for place of articulation in stop consonants. J. Acoust. Soc. Am. 1978;64:1358–1368. doi: 10.1121/1.382102.
  47. Stevens KN, Manuel SY, Matthies M. Revisiting place of articulation measures for stop consonants: Implications for models of consonant production. Proceedings of the International Congress of Phonetic Sciences; San Francisco, CA: 1999. pp. 1117–1120.
  48. Strope B, Alwan A. A model of dynamic auditory perception and its application to robust word recognition. IEEE Trans. Speech Audio Process. 1997;5:451–464.
  49. Suchato A. Classification of stop consonant place of articulation. Massachusetts Institute of Technology; Cambridge, MA: 2004. Ph.D. dissertation.
  50. Sussman HM, Fruchter D, Cable A. Locus equations derived from compensatory articulation. J. Acoust. Soc. Am. 1995;97(5):3112–3124. doi: 10.1121/1.411873.
  51. Sussman HM, Hoemeke KA, Ahmed FS. A cross-linguistic investigation of locus equations as a phonetic descriptor for place of articulation. J. Acoust. Soc. Am. 1993;94(3):1256–1268. doi: 10.1121/1.408178.
  52. Sussman HM, McCaffrey HA, Matthews SA. An investigation of locus equations as a source of relational invariance for stop place categorization. J. Acoust. Soc. Am. 1991;90(3):1309–1325.
  53. Wang MD, Bilger RC. Consonant confusion in noise: A study of perceptual features. J. Acoust. Soc. Am. 1973;54:1248–1265. doi: 10.1121/1.1914417.
  54. You HY. An acoustical and perceptual study of English fricatives. University of Edmonton; Edmonton, Canada: 1979. Master's thesis.
  55. Zue V. Acoustic characteristics of stop consonants: A controlled study. Massachusetts Institute of Technology; Cambridge, MA: 1976. Ph.D. dissertation.