Abstract
Although some cochlear implant (CI) listeners can show good word recognition accuracy, it is not clear how they perceive and use the various acoustic cues that contribute to phonetic perceptions. In this study, the use of acoustic cues was assessed for normal-hearing (NH) listeners in optimal and spectrally degraded conditions, and also for CI listeners. Two experiments tested the tense/lax vowel contrast (varying in formant structure, vowel-inherent spectral change, and vowel duration) and the word-final fricative voicing contrast (varying in F1 transition, vowel duration, consonant duration, and consonant voicing). Identification results were modeled using mixed-effects logistic regression. These experiments suggested that under spectrally degraded conditions, NH listeners decrease their use of formant cues and increase their use of durational cues. Compared to NH listeners, CI listeners showed decreased use of spectral cues such as formant structure, formant change, and consonant voicing, and showed greater use of durational cues (especially for the fricative contrast). The results suggest that although NH and CI listeners may show similar accuracy on basic tests of word, phoneme, or feature recognition, they may be using different perceptual strategies in the process.
INTRODUCTION
In view of the remarkable success of the cochlear implant (CI) as a prosthetic device (Zeng et al., 2008), and in the context of a continually growing body of research on cochlear implants, the literature on phonetic cue perception must be expanded to acknowledge the abilities of individuals fitted with these devices. It is well known that a major obstacle to accurate speech understanding with electric hearing (the use of a CI) is the poor spectral resolution offered by these devices, owing to the limited number of independent spectral processing channels (Fishman et al., 1997; Friesen et al., 2001), interactions between the electrodes that carry information from those channels (Chatterjee and Shannon, 1998), as well as a distorted tonotopic map (Fu and Shannon, 1999). Thus, the subtle fine-grained spectral differences perceptible to those with normal hearing are not reliably distinguished by those who use CIs (Kewley-Port and Zheng, 1998; Loizou and Poroy, 2001; Henry et al., 2005). In view of some of these studies, it is presumed that phonetic cues driven by spectral contrasts would be most challenging for CI listeners. Although numerous studies have explored word, phoneme, and feature recognition in various kinds of degraded conditions, few have explored the use of acoustic cues that contribute to these perceptions.
Not all sound components are compromised in electric hearing; temporal processing can be as good as or better than that of normal-hearing (NH) listeners, as evidenced by temporal modulation transfer functions (Shannon, 1992) and gap detection tasks (Shannon, 1989). Thus, although some phonetic cues are obscured by spectral degradation, it is expected that CI listeners should be able to use nonspectral cues in speech, which might be carried by the temporal amplitude envelope or segment duration. Fittingly, experiments have revealed a large number of errors on place-of-articulation perception (which relies primarily upon spectral cues in the signal, such as spectral peak frequencies and formant transitions), while the manner-of-articulation and voicing features are rarely misperceived, because they can be transmitted via temporal cues, which are well-maintained in electric hearing (Dorman et al., 1991). Similar results of poor place perception and excellent voicing perception have been shown repeatedly for NH listeners listening to simulations of cochlear implants (e.g., Shannon et al., 1995).
Trading relations in phonetic feature perception
Phonetic contrasts are signaled by various acoustic dimensions in the temporal and spectral domains. Those dimensions that are used perceptually to identify speech sounds are called “phonetic cues”; they are acoustic cues that contribute to phonetic categorization. For example, the first formant (F1) of a vowel sound corresponds to the height of that vowel; as the vowel height decreases, F1 increases. Hence, F1 serves as a phonetic cue for contrastive vowel height. There are multiple co-occurring phonetic cues for any particular contrast, which creates a high amount of redundancy in the signal. A classic example is the contrast between voiced and voiceless stops in word-medial position, which has been claimed to contain at least 16 different acoustic cues (Lisker, 1978). A wealth of literature has revealed that changes in one acoustic dimension can be compensated by conflicting changes in another dimension (for multiple examples, see Repp, 1982). For example, trading relations can be observed in the integration of cues for syllable-initial stop consonant voicing; changes in voice-onset-time that signal voicing can be somewhat offset by changes in the pitch domain that signal voicelessness (Whalen et al., 1993). As these and other cues covary in natural speech, the listener must integrate them in a way that yields reliable and accurate identification of the incoming information. It has been shown that the use of acoustic cues for phonetic contrasts is affected by the developmental age (Nittrouer, 2004, 2005) as well as language background (Morrison, 2005) of a listener. Perhaps it is also affected by spectral resolution in a way that is useful for understanding the experience of CI listeners relative to NH listeners.
Perception of acoustic dimensions such as duration, formant frequencies, or the time-varying amplitude envelope all depend on the fidelity of the stimulus. Trading relations between temporal and spectral signal fidelity have been observed for the perception of English consonants and vowels (Xu et al., 2005) as well as for Mandarin lexical tones (Xu and Pfingst, 2003). In those studies, as the degree of spectral resolution was decreased, the level of temporal resolution played a larger role in listeners’ perceptual accuracy. These experiments were carried out using noise-band vocoding (NBV) (to be described in detail later) based on that used by Shannon et al. (1995), which is commonly used to simulate electric hearing. The current study takes a similar approach to ask a different question—beyond showing correct and incorrect performance on word and phoneme recognition tasks, what can we learn about the avenue that listeners take to achieve this performance? There is reason to believe that listeners will adapt to an altered stimulus input by changing the relative importance of signal components (Francis et al., 2000, 2008, among many examples). Perhaps cochlear implant listeners and normal-hearing listeners in degraded conditions can adopt new strategies that would suit the challenges and residual abilities available to them.
Readers will recognize the central issue in this paper as one of cue-trading/cue-weighting. There are several models of cue-weighting present in the literature, and the current study was not designed to explicitly test or challenge any of them. Of particular interest, however, are those accounts which specifically acknowledge the reliability with which the signal is represented. The cue weighting-by-reliability model of Toscano and McMurray (2010) suggests that the weighting of acoustic cues in phonetic perception can be predicted by their distributional properties in the input; the basic theme of this research permeates other work as well, such as that of Holt and Idemaru (2011). Specifically, a cue is more reliable (and hence should be more heavily weighted) if the contrastive level means are far apart and have low variance. In the current study, it could be argued that spectral degradation (whether simulated or via electric hearing) would diminish the reliability of spectral cues like vowel formants as well as formant transitions, since there is no clear placement of these peaks in the degraded spectrum. The temporal dimensions (duration, time-varying amplitude envelope), however, should remain relatively unchanged.
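To make the reliability idea concrete, the sketch below (not part of the original study; all values are hypothetical) computes a d′-like reliability index, the separation of the two category means scaled by their pooled standard deviation, for a spectral cue under clear versus degraded conditions and for a duration cue.

```r
# Hypothetical illustration of weighting-by-reliability: a cue is "reliable" when
# the category means are far apart relative to their variability.
cue_reliability <- function(mean_a, mean_b, sd_a, sd_b) {
  abs(mean_a - mean_b) / sqrt((sd_a^2 + sd_b^2) / 2)
}

# Spectral cue (e.g., F2 in Hz) for /i/ vs /I/: variability effectively inflates when
# the spectrum is smeared, so reliability drops.
cue_reliability(2350, 2000, sd_a = 80,  sd_b = 80)    # clear spectrum: high reliability
cue_reliability(2350, 2000, sd_a = 400, sd_b = 400)   # spectrally degraded: low reliability

# A duration cue (ms) is carried in the temporal domain and is largely unaffected:
cue_reliability(145, 85, sd_a = 20, sd_b = 20)
```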
In summary, the current experiments were conducted to explore whether spectral degradation affects listeners’ use of various acoustic cues in phonetic identification. It was hypothesized that if spectral resolution were poor, listeners would be less affected by phonetic cues in the spectral domain, and more affected by phonetic cues in the temporal domain. This hypothesis would be supported by two kinds of results: (1) normal-hearing listeners using phonetic cues differently when spectral resolution is artificially degraded and (2) cochlear implant listeners using phonetic cues in a way that is different from normal-hearing listeners. The hypothesis was tested using two different phonetic contrasts, described below.
EXPERIMENT 1: THE LAX/TENSE VOWEL DISTINCTION
Review of acoustic cues
The first experiment explored the high-front lax/tense vowel contrast (/I/ and /i/) in English, which distinguishes word pairs such as hit/heat, fill/feel, hid/heed, and bin/bean. The cues that contribute to this distinction include the spectral dimensions of formant structure and vowel-inherent spectral change (VISC), as well as vowel duration. Formant structure has long been known to correspond to vowel categorization, albeit with a considerable amount of overlap between categories (Hillenbrand et al., 1995). Still, this cue is extremely powerful; using only steady-state formants synthesized from measurements taken at a single timepoint in a vowel, human listeners identify vowels with roughly 75% accuracy (Hillenbrand and Gayvert, 1993). Automatic pattern classifiers show similar performance with just one sample of formant structure (i.e., a spectral snapshot) (Hillenbrand et al., 1995).
VISC refers to the “relatively slowly varying changes in formant frequencies associated with vowels themselves, even in the absence of consonantal context” (Nearey and Assmann, 1986). Throughout production of the lax vowel /I/, F1 increases and F2 decreases; the tense vowel /i/ is relatively steady-state by comparison, with only a negligible amount of formant movement, if any (Hillenbrand et al., 1995). VISC plays a role in vowel classification, as indicated by at least four kinds of data: (1) measurement of dynamic formant values from production data (Nearey and Assmann, 1986; Hillenbrand et al., 1995), (2) results of pattern classifiers show better performance when spectral change is included as a factor (Zahorian and Jagharghi, 1993; Hillenbrand et al., 1995), (3) listeners reliably identify vowels with only snapshots of the onset and offset (with silent or masked center portions) (Jenkins et al., 1983; Parker and Diehl, 1984; Nearey and Assmann, 1986), and (4) human listeners show improved identification results when vowels include natural patterns of spectral change; there is generally a 23%–26% decline in accuracy for vowels whose formant structure lacks spectral change (Hillenbrand and Nearey, 1999; Assmann and Katz, 2005). When VISC is neutralized, there is a significant decline in /I/ recognition, while the vowel /i/ is identified virtually perfectly (Assmann and Katz, 2005), consistent with the acoustics of these vowels.
Tense vowels tend to be longer than lax vowels by roughly 33%–80%, depending on the particular contrast and context (House, 1961; Hillenbrand et al., 1995). However, the role of duration in vowel perception has not always been clear; it appears to be driven at least in part by the fidelity of the stimulus. Ainsworth (1972) showed that duration can modulate identification of vowels synthesized with two steady-state formants. Bohn and Flege (1990) and Bohn (1995) revealed a small effect of duration for the i/I contrast when using three steady-state formants. However, these results are challenged by other studies that preserved relatively richer spectral detail, including time-varying spectral information (Hillenbrand et al., 1995, 2000; Zahorian and Jagharghi, 1993). Using modified natural speech, Hillenbrand et al. (2000) reported that duration-based misidentifications of the I/i contrast were especially rare (with an error rate of less than 1%). An emergent theme from Hillenbrand et al. (2000), Nittrouer (2004), and Assmann and Katz (2005) is that the use of acoustic cues in vowels is affected by signal fidelity, to the extent that commonly used formant synthesizers are likely to underestimate the role of time-varying spectral cues, and to overestimate the role of durational cues. That is, listeners use phonetic cues differently depending on the quality with which the sound is presented.
Although considerable improvements in speech synthesis and manipulation have improved the quality of signals in perceptual experiments, signal degradation is inescapable for individuals with cochlear implants. Iverson et al. (2006) remarked, “It would be surprising if exactly the same cues were used when recognizing vowels via cochlear implants and normal hearing, because the sensory information provided by acoustic and electric hearing differ substantially.” Despite the aforementioned trend observed in spectral and temporal signal fidelity, Iverson et al. (2006) did not find evidence to suggest that duration was more heavily used by CI listeners or NH listeners in degraded conditions. In fact, as spectral resolution was degraded from 8 to 4 to 2 channels (each representing progressively worse resolution, to be explained further in Sec. 2), NH listeners showed less recovery of duration information in the signal. This counterintuitive result may have arisen because of the methods by which duration cue use was assessed. The experimenters used information transfer analysis (ITA) (Miller and Nicely, 1955) to track phonetic features that were recovered or mistaken in the identification tasks. Although these features are commonly thought to correspond regularly to acoustic dimensions (i.e., vowel height as variation in F1 frequency, vowel advancement as variation in F2 frequency, lax/tense as duration), ITA by itself does not reveal the mechanisms (cues) by which the features are recovered. This is particularly important for the duration cue; most dialects of English do not contain vowel pairs that contrast exclusively by duration. Thus, for any long or short vowel in English (as coded in ITA), there are accompanying covarying spectral cues. If a listener relies on these spectral cues (as would be predicted on the basis of aforementioned work), then it is not surprising that “duration” information transmission declined as spectral resolution decreased. In the ITA sort of analysis, “duration” could be merely a different name for spectral information, unless the latter has been specifically controlled. The question remains then, as to whether changes in vowel duration play a greater role in vowel identification when spectral resolution is degraded.
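Because ITA figures prominently in this discussion, a minimal sketch of the computation is given below (illustrative only; the confusion counts are invented). Relative information transmitted for a binary feature is the mutual information between stimulus and response categories divided by the stimulus entropy (Miller and Nicely, 1955); crucially, nothing in this computation identifies which acoustic cue carried the feature.

```r
# Minimal sketch of information transfer analysis (ITA) for a binary feature.
# The confusion matrix is hypothetical: rows = stimulus category,
# columns = response category (e.g., "lax" vs "tense").
conf <- matrix(c(80, 20,
                 15, 85), nrow = 2, byrow = TRUE)

p_xy <- conf / sum(conf)                 # joint probabilities
p_x  <- rowSums(p_xy)                    # stimulus marginals
p_y  <- colSums(p_xy)                    # response marginals

mi  <- sum(p_xy * log2(p_xy / outer(p_x, p_y)), na.rm = TRUE)  # mutual information (bits)
h_x <- -sum(p_x * log2(p_x))                                   # stimulus entropy (bits)
mi / h_x   # relative information transmitted for the feature (0 to 1)
```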
Despite the limitations of the ITA-based analysis, the work by Iverson et al. (2006) is to be commended for laying the groundwork for studying the role of varying acoustic cues with varying degrees of temporal and spectral resolution. This approach has been only sparingly applied to the problem of speech perception by CI listeners (Dorman et al., 1991, is a rare example), and it is the aim of the present paper to explore it further using two contrasts that have been shown to involve both spectral and temporal cues. Many previous experiments (Hillenbrand and Nearey, 1999; Hillenbrand et al., 2000; Iverson et al., 2006) have assessed the role of multiple cues by retaining them or neutralizing them in an all-or-none fashion. The current experiment seeks to expand upon this work by manipulating acoustic cues gradually and orthogonally, so as to assess their effects in a more fine-grained way that is not feasible in experiments that test many vowels and consonants concurrently.
Some prior work indicates that listeners with hearing impairment do exhibit altered use of acoustic cues in speech perception. In a place-of-articulation identification task, Dorman et al. (1991) showed that, compared to NH listeners, CI listeners were affected more heavily by the spectral tilt of a stop consonant; NH listeners relied instead on formant transitions. Kirk et al. (1992) found that CI listeners were able to make use of static formant cues in vowels, but did not take advantage of the formant transition contrasts used by NH listeners. This would suggest that the dynamic formant cue for lax vowels may be compromised in degraded conditions. Accordingly, Dorman and Loizou (1997) indicated that CI listeners identified the lax vowel /I/ with accuracy similar to that of NH listeners in conditions where VISC is neutralized (Hillenbrand and Gayvert, 1993). We therefore expected the perception of speech sounds by CI listeners to fall in line with predictions informed by the aforementioned work that implicates signal degradation as an influential force on the use of durational cues. We thus predicted that as spectral resolution became poorer, the use of formant cues would decline, the use of VISC cues would decline (if present at all), and the use of temporal cues would increase.
Methods
Participants
Participants included 15 adult listeners (14 between the ages of 19 and 26, mean age 22.7 years, and one 63-year-old) with normal hearing, defined as having pure-tone thresholds ≤20 dB HL from 250–8000 Hz in both ears (ANSI, 2004). A second group of participants included seven adult (age 50–73; mean age 63.5 years) recipients of cochlear implants. CI listeners were all post-lingually deafened. Six were users of the Cochlear Freedom or N24 devices; one used the Med-El device. See Table I for demographic information and speech processor parameters for each CI user. All participants were native speakers of American English and were screened for fluency in languages for which vowel duration is a phonemic feature (e.g., Finnish, Hungarian, Arabic, Vietnamese), to ensure that no participant entered with an a priori bias toward durational feature sensitivity. Normal-hearing participants 01 (the first author) and 02 were highly familiar with the stimuli, having been involved in pilot testing and the construction of the materials. It should be noted that the age difference between the normal-hearing and cochlear implant listener groups is substantial, and can influence auditory processing in ways that are relevant to this study. Specifically, auditory temporal processing is known to be deficient in older listeners (Gordon-Salant and Fitzgibbons, 1999). The current study explores whether auditory cues in the temporal domain can overcome those that are compromised in the spectral domain. These listeners may or may not experience deficiencies in the temporal domain that could complicate this matter. Aside from this, there also exists variability in the durations and etiologies of deafness among the impaired listener group (as is the case in virtually all studies that use CI listeners). For these reasons, direct statistical comparisons between the normal-hearing listeners and cochlear implant listeners are limited in their utility and thus omitted from this paper.
TABLE I.
| ID No. | Gender | Etiology of HL | Duration of HL | Age at testing | Age at implantation | Device | Pulse rate (pps) |
|---|---|---|---|---|---|---|---|
C1 | F | Unknown | Unknown | 66 | 63 | Freedom | 900 |
C2 | F | Genetic | 10 years | 66 | 63 | Freedom | 1800 |
C3 | M | Unknown | 22 years | 64 | 57 | N 24 | 900 |
C4 | M | Labyrinthitis | 11 years | 50 | 40 | N 24 | 720 |
C5 | M | Unknown | Unknown | 56 | 54 | Med-El | 1515 |
C6 | F | Measles | 59 years | 71 | 66 | Freedom | 1800 |
C7 | F | Unknown | 4 years | 73 | 69 | Freedom | 2400 |
Stimuli
Speech synthesis.
Words were synthesized to resemble “hit” and “heat.” The vowels in these words varied by formant structure (in seven steps, with the first four formants all simultaneously varying), vowel-inherent spectral change (in five steps, with the first three formants all varying dynamically), and vowel duration (in seven steps). See Table II for a detailed breakdown of the levels for each parameter. This 7 × 7 × 5 continuum of words was synthesized using HLSYN (Hanson et al., 1997; Hanson and Stevens, 2002). Formant structure was based on values reported in the online database of Hillenbrand et al. (1995); it was expanded beyond the average values in their corresponding publication to represent a realistic natural range of production. Formant continuum steps were interpolated using the Bark frequency scale (Zwicker and Terhardt, 1980) to reflect the nonlinear frequency spacing in the human auditory system. Levels in Bark frequency were converted to Hz in this article to facilitate ease of interpretation. A second dimension of stimulus construction varied by the amount and direction of vowel-inherent spectral change (VISC). Although there are various ways of modeling this cue (Morrison and Nearey, 2007), it is represented here in terms of the difference in the F1, F2, or F3 frequency (in Hz) from the 20% to the 80% timepoints in the vowel. The first three formants were changed in accordance with data from Hillenbrand et al. (1995); the fourth formant was kept constant. The penultimate items in this VISC continuum were modeled after typical lax and tense vowels, and the continuum endpoints were expanded along this parameter, again to account for productions outside the means reported by Hillenbrand et al. See Table II for a detailed breakdown of this parameter, and Fig. 1 for a schematic illustration of its effects on formant structure. Vowel durations were modeled from characteristic durations of /i/ and /I/ (before voiceless stop sounds) reported by House (1961), and linearly interpolated (see Table II). Word-initial [h] was 60 ms of steady voiceless/aperiodic formant structure that matched that at the onset of the vowel; necessarily, the initial consonant also varied as a result of the formant continuum. Word-final [t] transition targets for F1, F2, F3, and F4 were 300, 2000, 2900, and 3500 Hz, respectively, as used by Bohn and Flege (1990). These transitions all began at the 80% timepoint in the vowel (although this decision resulted in slightly different transition speeds depending on overall duration, it was necessary to ensure that the entire 20%–80% VISC trajectory could be realized). The formant transition was followed by 65 ms of silent stop closure, followed by a 65 ms diffuse high-frequency [t] burst. Vowel pitch began at 120 Hz, rose to 125 Hz at the 33% timepoint of the vowel, and fell to 100 Hz by vowel offset.
TABLE II.
| | | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 | Step 6 | Step 7 |
|---|---|---|---|---|---|---|---|---|
| Formants (Hz) | F1 | 446 | 418 | 403 | 389 | 375 | 362 | 335 |
| | F2 | 1993 | 2078 | 2122 | 2167 | 2213 | 2260 | 2357 |
| | F3 | 2657 | 2717 | 2747 | 2778 | 2809 | 2841 | 2905 |
| | F4 | 3599 | 3618 | 3628 | 3637 | 3647 | 3657 | 3677 |
| VISC (change in Hz) | F1 | 49 | 33 | 16 | 0 | −16 | | |
| | F2 | −287 | −191 | −96 | 0 | 96 | | |
| | F3 | −33 | −22 | −11 | 0 | 11 | | |
| | F4 | 0 | 0 | 0 | 0 | 0 | | |
| Duration (ms) | | 85 | 100 | 108 | 115 | 122 | 130 | 145 |
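As an illustration of the Bark-scale interpolation described above, the analytic Zwicker and Terhardt (1980) Hz-to-Bark expression can be used to convert the endpoints, interpolate linearly in Bark, and convert back. This is only a sketch, not the original HLSYN synthesis code, and the equal-Bark spacing shown here ignores the expanded endpoints of the actual continuum in Table II.

```r
# Sketch of Bark-scale interpolation between /I/-like and /i/-like F1 endpoints
# (endpoint values taken from Table II; spacing here is strictly equal in Bark).
hz_to_bark <- function(f) 13 * atan(0.00076 * f) + 3.5 * atan((f / 7500)^2)

bark_to_hz <- function(z) {
  # numerical inverse of the Zwicker & Terhardt expression
  sapply(z, function(zi) uniroot(function(f) hz_to_bark(f) - zi,
                                 interval = c(1, 20000))$root)
}

f1_endpoints_hz <- c(446, 335)                      # lax-like and tense-like F1 (Table II)
z <- seq(hz_to_bark(f1_endpoints_hz[1]),
         hz_to_bark(f1_endpoints_hz[2]), length.out = 7)
round(bark_to_hz(z))                                # a 7-step F1 continuum, equally spaced in Bark
```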
Spectral degradation: Noise-band vocoding.
Spectral resolution was degraded using noise-band vocoding (NBV), which has become a common way to simulate a cochlear implant (see Shannon et al., 1995). This was accomplished using online signal processing within the icast stimulus delivery software (version 5.04.02; Fu, 2006). Stimuli were bandpass filtered into four or eight frequency bands using sixth-order Butterworth filters (24 dB/octave). These numbers of bands were chosen to best approximate the performance of CI listeners (Friesen et al., 2001). The temporal envelope in each band was extracted by half-wave rectification and low-pass filtering with a 200-Hz cutoff frequency, which is sufficient for good speech understanding (Shannon et al., 1995). The envelope of each band was used to modulate the corresponding bandpass-filtered noise. Specific band frequency cutoff values were determined assuming a 35 mm cochlear length (Greenwood, 1990) and are listed in Table III below. The lowest frequency of all analysis bands (141 Hz, approximately 31 mm from the base) was selected to approximate those commonly used in modern CI speech processors. The highest frequency used (6000 Hz, approximately 9 mm from the base) was selected to be within the normal limits of hearing for all listeners, and to correspond with the upper limit of the frequency output of HLSYN. No spectral energy above this frequency was available to listeners in the unprocessed condition. Spectrograms of the word “hit” in the unprocessed (regularly synthesized), eight-channel NBV and four-channel NBV versions are illustrated in Fig. 2. The images show that specific formant frequency bands are no longer easily recoverable; the spectral fine structure is replaced by coarse/blurred sampling. Formant differences that remain unresolved within the same spectral channel are coded by the relative level of the noise band carrying that channel, as well as the time-varying amplitude (i.e., beating) owing to the interaction of multiple frequencies added together.
TABLE III.
| Channel number (4-channel) | 1 | | 2 | | 3 | | 4 | |
|---|---|---|---|---|---|---|---|---|
| Channel number (8-channel) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| High-pass (Hz) | 141 | 275 | 471 | 759 | 1181 | 1801 | 2710 | 4044 |
| Low-pass (Hz) | 275 | 471 | 759 | 1181 | 1801 | 2710 | 4044 | 6000 |
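For readers who wish to experiment with the manipulation, a minimal noise-band vocoder along these lines can be sketched in R. This assumes the signal package and is a simplified stand-in for the icast processing actually used; filter orders and normalization are illustrative only.

```r
# Minimal noise-band vocoder sketch (simplified; not the software used in the study).
library(signal)

vocode <- function(x, fs, cutoffs) {
  # cutoffs: vector of band edges in Hz, e.g. the 8-channel edges in Table III
  out <- numeric(length(x))
  env_lp <- butter(2, 200 / (fs / 2), type = "low")     # 200-Hz envelope smoother
  for (i in seq_len(length(cutoffs) - 1)) {
    bp <- butter(3, c(cutoffs[i], cutoffs[i + 1]) / (fs / 2), type = "pass")
    band    <- filtfilt(bp, x)                          # analysis band
    env     <- filtfilt(env_lp, pmax(band, 0))          # half-wave rectify + low-pass
    carrier <- filtfilt(bp, rnorm(length(x)))           # band-limited noise carrier
    out <- out + env * carrier                          # envelope-modulated noise
  }
  out / max(abs(out))                                   # simple peak normalization
}

# 8-channel band edges from Table III:
edges <- c(141, 275, 471, 759, 1181, 1801, 2710, 4044, 6000)
# Example use (assuming a waveform vector and its sampling rate):
# vocoded <- vocode(stimulus_waveform, fs = 22050, cutoffs = edges)
```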
Procedure
All speech recognition testing was conducted in a double-walled sound-treated booth. Stimuli were presented at 65 dBA in the free field through a single loudspeaker. Each token was presented once, and listeners subsequently used a computer mouse to select one of two word choices (“heat” or “hit”) to indicate their perception. Stimuli were presented in blocks organized by degree of spectral resolution (unprocessed, eight-channel or four-channel). Ordering of blocks was randomized, and presentation of tokens within each block was randomized. In this self-paced task, the 245 stimuli were each heard 5 times in each condition of spectral resolution.
Analysis
Categorical responses were fit using logistic regression, in accordance with recent trends in perceptual analysis (Morrison and Kondaurova, 2009). Listeners' binary responses (tense or lax) were fit using a generalized linear (logistic) mixed-effects model (GLMM). This was done in the R software environment (R Development Core Team, 2010), using the lme4 package (Bates and Maechler, 2010). A random effect of participant was used, and the fixed effects were the stimulus factors described above. The binomial family call function was used because the probability of a "tense" response could not logically exceed 100% or fall below 0%. This resulted in the use of the logit link function, and an assumption that variance increased with the mean according to the binomial distribution. Parameter levels were centered around 0, since the R GLM call function sets "0" as the default level while estimating other parameters. Thus, since the median duration was 115 ms, a stimulus with a duration of 85 ms was coded as −30, and one with a duration of 122 ms was coded as +7. All factors and interactions were added via a forward-selection hill-climbing process. The model began with the intercept, and candidate factors (e.g., the inclusion of duration as a response predictor) competed one at a time; the one that yielded the highest significance was kept. Subsequent factors (or factor interactions) were retained in the model if they significantly improved the model without unnecessarily over-fitting. The ranking metric was the Akaike information criterion (AIC) (Akaike, 1974), as it has become a popular method for evaluating mixed-effects models (Vaida and Blanchard, 2005; Fang, 2011). This criterion measures the relative goodness of fit of competing models by balancing accuracy and complexity of the model. This method contrasts with backward-elimination approaches in which models would be judged according to the Wald statistic. The goal of each model was similar to that used by Peng et al. (2009); it tested whether the coefficient of the resulting estimating equation for an acoustic cue was different from 0 and, crucially, whether the coefficient was different across conditions of spectral resolution.
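A simplified sketch of this fitting procedure is given below. The data frame, column names, and simulated responses are hypothetical, and the candidate order is abbreviated; the actual analysis used the full stimulus design and the listeners' real responses.

```r
# Sketch of the mixed-effects logistic regression and AIC-based forward selection
# (illustrative only; with simulated data the random-effect variance may be near zero).
library(lme4)

set.seed(1)
d <- expand.grid(participant = factor(1:6),
                 duration = c(85, 100, 108, 115, 122, 130, 145),
                 formant  = 1:7,
                 SR       = c("clear", "nbv"))
d$duration_c <- d$duration - 115          # center: 122 ms -> +7, 85 ms -> -30
d$formant_c  <- d$formant - 4
d$tense <- rbinom(nrow(d), 1,
                  plogis(0.05 * d$duration_c + 0.8 * d$formant_c))   # fake responses

m0 <- glmer(tense ~ 1 + (1 | participant), data = d, family = binomial)
m1 <- update(m0, . ~ . + duration_c)                       # candidate cue added
m2 <- update(m1, . ~ . + formant_c)
m3 <- update(m2, . ~ . + formant_c:SR + duration_c:SR)     # cue-by-resolution interactions
AIC(m0, m1, m2, m3)   # keep a step only if it lowers AIC (better fit without over-fitting)
```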
Previous literature suggested that 4 or 8 is a suitable number of channels in a noise-band vocoder as a simulation of a cochlear implant. Both of these were tested in this experiment, not for a regression of cue usage against spectral degradation, but instead to find the best proxy value to simulate electric hearing for the problem at hand. Inspection of the psychometric functions of the NH listeners and CI listeners revealed that the eight-channel simulation was the best model of electric hearing, in accordance with previous assessment of better-performing CI listeners (Dorman and Loizou, 1998; Friesen et al., 2001). Furthermore, the amount of variability in the four-channel condition made it difficult to draw firm conclusions about how listeners perceived the signals. A small number of listeners demonstrated non-monotonic effects of spectral degradation on the use of the phonetic cues (i.e., they showed greater use of formant cues in four-channel compared to eight-channel conditions, but sometimes reported hearing neither the /i/ nor the /I/ vowel), suggesting that reducing the number of channels below 8 did not necessarily change the resolution in a meaningful way vis-à-vis this experimental task. In the four-channel case, the reduced spectral resolution was likely accompanied by increased availability of temporal envelope cues in voiced portions (because of increased numbers of harmonics falling into the broader filters), which may have been accessed/utilized differentially by different participants, depending on the precision of their temporal resolution. Some were able to capitalize on this, while some were not. Although variation in this ability is an interesting consideration in the use of noise-band vocoded signals, it is outside the scope of this investigation. Subsequent analysis of the data discarded the four-channel condition, yielding two sets of data models: (1) normal-hearing listeners in both listening conditions (unprocessed and degraded using an eight-channel NBV) and (2) cochlear implant listeners hearing the unprocessed stimuli.
Results
Identification functions along the three parameter continua are shown in Figs. 3–5. The following models were found to describe the data optimally:

(1) Perception by NH listeners in different conditions:

(2) Perception by CI listeners:
For these two models, the interaction between two factors A and B is indicated by A:B. Independent factors are indicated by “+.” “SR” refers to spectral resolution (normal or degraded/NBV), and (1|Participant) is a random effect of participant.
For both models, all three main cues were significant (all p < 0.001), and interactions between each cue and spectral resolution were also significant for the normal-hearing listeners (all p < 0.001). The parameter estimates all went in the predicted direction, and are listed in Table IV. Results suggest that when spectral resolution was degraded, normal-hearing listeners' responses were affected less by formants, less by VISC, and more by duration, compared to when spectral resolution was intact. The CI simulations were predictive of the CI listeners' results (smaller effect of formants and VISC, greater effect of duration), although direct statistical comparison between the NH and CI groups was not conducted (to be discussed further in the summary and discussion). Surprisingly, there were no significant interactions between cues. Typically, one would expect the effects of VISC and duration to be strongest in an ambiguous range of formant values; the raw data suggested this, but the interaction did not reach significance in the model. Although error bars were omitted from the group psychometric functions (Figs. 3–5), variability in the use of acoustic cues is presented in Table IV.
TABLE IV.
| | Formants: NH | Formants: NBV | Formants: CI | VISC: NH | VISC: NBV | VISC: CI | Duration: NH | Duration: NBV | Duration: CI |
|---|---|---|---|---|---|---|---|---|---|
| Group est. | 0.026 | 0.011 | 0.010 | 0.011 | 0.004 | 0.004 | 0.046 | 0.061 | 0.052 |
| Intercept | −1.368 | −0.22 | −0.773 | −1.368 | −0.22 | −0.773 | −1.368 | −0.22 | −0.773 |
| Indiv. ests. | | | | | | | | | |
| 01 | 0.034 | 0.014 | 0.008 | 0.016 | 0.007 | 0.003 | 0.074 | 0.091 | 0.047 |
| 02 | 0.024 | 0.020 | 0.015 | 0.013 | 0.007 | 0.007 | 0.099 | 0.096 | 0.040 |
| 03 | 0.028 | 0.019 | 0.003 | 0.012 | 0.006 | 0.001 | 0.041 | 0.042 | 0.036 |
| 04 | 0.027 | 0.015 | 0.009 | 0.010 | 0.005 | 0.006 | 0.033 | 0.045 | 0.067 |
| 05 | 0.034 | 0.016 | 0.015 | 0.019 | 0.006 | 0.006 | 0.039 | 0.031 | 0.056 |
| 06 | 0.025 | 0.014 | 0.011 | 0.012 | 0.004 | 0.004 | 0.038 | 0.038 | 0.045 |
| 07 | 0.027 | 0.013 | 0.007 | 0.015 | 0.004 | 0.004 | 0.058 | 0.049 | 0.073 |
| 08 | 0.037 | 0.009 | | 0.012 | 0.003 | | 0.057 | 0.067 | |
| 09 | 0.024 | 0.003 | | 0.008 | 0.001 | | 0.052 | 0.049 | |
| 10 | 0.016 | 0.006 | | 0.010 | 0.002 | | 0.045 | 0.040 | |
| 11 | 0.019 | 0.006 | | 0.013 | 0.002 | | 0.029 | 0.050 | |
| 12 | 0.023 | 0.010 | | 0.006 | 0.002 | | 0.023 | 0.069 | |
| 13 | 0.024 | 0.008 | | 0.003 | 0.001 | | 0.031 | 0.097 | |
| 14 | 0.025 | 0.012 | | 0.007 | 0.002 | | 0.035 | 0.085 | |
| 15 | 0.018 | 0.005 | | 0.004 | 0.002 | | 0.027 | 0.065 | |
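To illustrate how the group estimates in Table IV translate into response probabilities (a back-of-the-envelope reading of the reported coefficients, not an additional analysis), the logit model implies that the probability of a "tense" response changes with duration according to the inverse-logit function, with the other centered cues held at their midpoints:

```r
# Reading the group estimates in Table IV: probability of a "tense" response as a
# function of vowel duration (centered at 115 ms), other centered cues held at 0.
# Intercepts and slopes are taken directly from Table IV.
p_tense <- function(intercept, slope, duration_ms) {
  plogis(intercept + slope * (duration_ms - 115))
}

p_tense(-1.368, 0.046, c(85, 115, 145))   # NH, unprocessed: modest shift across the continuum
p_tense(-0.220, 0.061, c(85, 115, 145))   # NH, eight-channel NBV: a larger duration effect
```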
Although direct statistical comparison is not valid for the groups in this study, the CI listener data is encouraging, as it falls along the same general trend as the NH listeners in the simulated conditions. The individual variability is apparently not limited to one group or the other; just as NH listeners have variations in listening strategies, so do the CI listeners, and both groups fall within similar ranges.
Conclusions
In this experiment, listeners were presented with words whose vowels varied along three acoustic dimensions. Normal-hearing listeners heard these words with clear unprocessed spectral resolution and also through eight- and four-channel noise-band vocoding schemes; the eight-channel condition was a better match to the CI listeners’ performance. Cochlear implant listeners heard only the unprocessed words.
In conditions that are thought to simulate the use of a cochlear implant, normal-hearing listeners showed decreased use of spectral cues (formant structure and vowel-inherent spectral change), and showed increased use of vowel duration when identifying tense and lax vowels. Results from CI listeners suggested that they may be affected less by formant and VISC cues, and may be affected more by duration cues compared to NH listeners. Although this experiment tested only one phonetic contrast, it appears to suggest that the NBV simulations hold some predictive value in determining the use of phonetic cues by CI users.
In view of previous studies using synthesized speech, it is possible, despite the high quality of the speech synthesized by HLSYN, that the role of duration for NH listeners in the unprocessed condition was overestimated. Previous work suggests that duration is largely neglected by NH listeners for this vowel contrast when natural speech quality is preserved (Hillenbrand et al., 2000). Thus, the differences in the use of duration by NH listeners in different conditions (and possibly the differences in the use of duration by NH listeners and CI listeners) may be larger than what these data suggest. Another important consideration is the relatively advanced age of the CI user group, which will be discussed later in the summary and discussion section.
EXPERIMENT 2: THE WORD-FINAL S/Z CONTRAST
Review of acoustic cues
A second phonetic contrast was explored to supplement the first experiment. The second experiment explored the final consonant voicing contrast, which distinguishes /s/ and /z/ in word pairs such as bus-buzz, grace-graze, and loss-laws. The cues that contribute to this distinction include (but are not limited to) the offset frequency/transition of the first formant of the preceding vowel, the duration of the preceding vowel, the duration of the consonant, and the amount of voicing (low-frequency energy/amplitude modulation) within that consonant. Vowel duration has received the most consideration in the literature; vowels are longer before voiced sounds than before voiceless ones (House and Fairbanks, 1953; House, 1961). Chen (1970) and Raphael (1972) suggested that this duration difference is an essential perceptual cue for this distinction. However, just as for the aforementioned study by Ainsworth (1972), the limited spectral integrity of Raphael’s stimuli (three steady-state synthesized formants) may have caused an overestimation of the effect of vowel duration. Furthermore, stimuli in Raphael’s study that contained vowels of intermediate duration were contrasted reliably by the presence or absence of a vowel-offset F1 transition (F1T). When the F1T appeared at the end of the vowel, listeners tended to hear the following consonant as voiced.
Warren and Marslen-Wilson (1989) also suggested vowel duration to be an essential cue for consonant voicing. Their experiment used a gating paradigm, whereby a signal is truncated before completion; listeners attempted to identify the complete word. This method is problematic for this contrast, however, because it confounds the cues of vowel duration and F1T. When a signal is truncated before the F1T, the duration is shortened and the F1T is removed; the contributions of each cue are not recoverable in this paradigm. When truncation points fell before the region of the F1T, perception of voicing dramatically declined, but perhaps because of the absence of F1T rather than because of the shortened vocalic duration. Virtually no effect of vowel duration is observed when vowel portions are deleted from the middle (Revoile, 1982) or beginning (Wardrip-Fruin, 1982) of the segment; only when portions were deleted from the offset (area of F1T) does the perception change from voiced to voiceless (Hogan and Rozsypal, 1980; Wardrip-Fruin, 1982; Hillenbrand et al., 1984; Warren and Marslen-Wilson, 1989). Hillenbrand et al. (1984) noted that compressing the duration of vowels before voiced stops does not significantly alter listeners’ perceptions. Similar findings were reported by Wardrip-Fruin (1982), who showed that a falling F1T signaled voicing across the whole range of vowel durations tested, while syllables without this transition yielded no more than 60% voiced responses even at the longest vowel duration. Summers (1988) suggested that F1T differences are not limited to vowel offset; F1 is lower before voiced consonants at earlier-occurring times in the vowel as well. The importance of F1 is also underscored by the results of Hogan and Rozsypal (1980), who observed that excising the vowel offset had a smaller effect on high vowels; for these segments, the F1 is already low and therefore a less-useful cue since there is no room for transition.
A meta-analysis by Walsh and Parker (1984) suggests that vowel length exhibits an effect only for “artificial or abnormal circumstances.” For example, Revoile (1982) showed that vowel duration was used as a voicing cue by individuals with hearing impairment, but not those with normal hearing. Wardrip-Fruin (1985) observed vowel duration effects for words presented in low-pass filtered noise, but not in quiet (Wardrip-Fruin, 1982). In experiments by Nittrouer (2004, 2005), vowel duration served as a voicing cue for synthetic speech, but this effect was strongly reduced and overpowered by the F1T cue when natural speech tokens were used. Thus, just as for previous experiments with vowels, the effect of duration on perceptual judgments appears to be driven at least partly by spectral fidelity of the signal.
Not surprisingly, there are acoustic cues that correspond to the voicing contrast within the fricative consonant itself. Voiceless fricatives are longer than voiced ones (Denes, 1955; Haggard, 1978), further increasing the vowel:consonant duration ratio (VCR) for voiced fricatives. VCR and duration of voicing within the fricative noise were shown by Hogan and Rozsypal (1980) to be reliable cues for the perception of voicing in an experiment where extension of vowel duration by itself did not force a change in voicing perception. Voicing during the consonant is not thought to be essential for perception of the voicing feature, since voiced fricatives are routinely devoiced in natural speech (Klatt, 1976; Haggard, 1978). Listeners reliably perceive voicing despite this apparent omission (Hogan and Rozsypal, 1980).
There are even more cues to the s/z contrast than are discussed here, but the aforementioned cues have been given the most consideration in the literature, and are thought to play a crucial role in perception of this contrast. The second experiment in this paper was designed to assess the use of these acoustic cues in listening conditions similar to those used in experiment 1. It was hypothesized that when spectral resolution was degraded, the F1 transition cue would be used less, and the durational cues (vowel and consonant duration, or a ratio of the vowel and consonant durations) would be used more. It was not clear whether the voicing duration cue would be used more or less, since it is implemented in both the spectral domain (via low-frequency energy) and the temporal domain (as temporal amplitude modulations of varying duration).
Methods
Participants
Participants for experiment 2 comprised 11 adult (ages 18–37; mean age 28.9 years) listeners with normal hearing, defined as having pure-tone thresholds ≤20 dB HL from 250–8000 Hz in both ears (ANSI, 2004), and seven cochlear implant listeners whose demographics were the same as those for experiment 1 (see Table I). Four of the NH listeners and all seven CI listeners also participated in experiment 1. Normal-hearing participants 01 (the first author) and 02 were highly familiar with the stimuli, having been involved in pilot testing and the construction of the materials.
Stimuli
Natural speech manipulation.
Stimuli for the second experiment were constructed using modified natural recordings of the words “loss” and “laws.” The stimulus set consisted of 126 items that varied in four dimensions: presence/absence of a vowel-offset falling F1 transition (two levels), vowel duration (seven levels), duration of the fricative (three levels), and duration of voicing within that fricative (three levels). See Table V for a detailed breakdown of the levels for each parameter. A single /l/ segment was chosen as the onset of all stimuli in the experiment, to neutralize it as a cue for final voicing (see Hawkins and Nguyen, 2004). The low-back vowel in “laws” was chosen because the F1 transition cue present in low vowels has been hypothesized to be compromised or absent in high vowels (Summers, 1988). The vowel was segmented from a recording of “laws,” and thus contained a “voiced” F1 offset transition from roughly 635 Hz at vowel steady-state to 450 Hz at vowel offset, which is in the range of transitions observed in natural speech by Hillenbrand et al. (1984). A “voiceless” offset transition was created by deleting the final five pitch periods of the vowel in “laws” (maintaining a flat 635 Hz F1 offset) and expanding the duration to the original value using the pitch-synchronous overlap-add (PSOLA) function in the praat software (Boersma and Weenink, 2010). Rather than using recordings from “loss” and “laws” separately, this manipulation was preferable in order to maintain consistent volume, phonation quality, and other cues that may have inadvertently signaled the feature in question. In other words, it permitted the attribution of influence directly to the F1 offset level, since earlier portions of the vowel were consistent across different levels of this parameter. A uniform decaying amplitude envelope was applied to the final 60 ms of all vowels, as in Flege (1985); it resembled a contour intermediate to those observed in the natural productions, and was used to neutralize offset amplitude decay as a cue for voicing (see Hillenbrand et al., 1984). Vowel durations were manipulated using PSOLA to create a seven-step continuum between 175 and 325 ms, based on values from natural production reported by House (1961) and Stevens (1992), and used by Flege (1985) in perceptual experiments. All vowels were manipulated using PSOLA to contain the same falling pitch contour (which started at 96 Hz and ended at 83 Hz), to neutralize pitch as a cue for final fricative voicing (see Derr and Massaro, 1980; Gruenenfelder and Pisoni, 1980). A 250 ms segment of frication noise was extracted from a natural /s/ segment. An amplitude contour was applied to the fricative offset to create a 50 ms rise time and a 30 ms decay time. Two other durations (100 and 175 ms) of frication noise were created by applying the offset envelope at correspondingly earlier times. The resulting frication durations, ranging from 100 to 250 ms, resembled those used by Soli (1982) and Flege and Hillenbrand (1985). Voicing was added to these fricatives by replacing 30 or 50 ms onset portions with equivalently long onset portions of a naturally produced voiced /z/ segment. These three levels of voicing thus varied in the range of 0–50 ms, which resembles the range used in perceptual experiments by Stevens (1992). These fricatives were appended to all 14 of the aforementioned vowel segments with onset /l/.
For fricatives with onset voicing, the first pitch period of voiced fricative noise was blended with the last pitch period of the vowel (each at 50% volume) to produce a smooth transition between segments. Although the stimuli were not designed explicitly to vary the vowel-consonant duration ratio, this ratio naturally changed as a function of each independently varied duration factor.
TABLE V.
| Parameter | Levels | | | | | | |
|---|---|---|---|---|---|---|---|
| First formant offset (Hz) | 450 | 635 | | | | | |
| Vowel duration (ms) | 175 | 200 | 225 | 250 | 275 | 300 | 325 |
| Consonant duration (ms) | 100 | 175 | 250 | | | | |
| Voicing duration (ms) | 0 | 30 | 50 | | | | |
Spectral degradation: Noise-band vocoding.
Noise-band vocoding was accomplished using the same procedure described for experiment 1 (described earlier in Sec. 2), except that the upper limit of the analysis and filter bands was changed from 6 to 7 kHz, to ensure that a substantial amount of frication noise was represented within the spectrally degraded output. Analysis/carrier band cutoff frequencies for experiment 2 are displayed in Table VI.
TABLE VI.
Channel: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
High-pass (Hz) | 141 | 283 | 495 | 812 | 1285 | 1994 | 3052 | 4634 |
Low-pass (Hz) | 283 | 495 | 812 | 1285 | 1994 | 3052 | 4634 | 7000 |
Procedure
The procedure for experiment 2 was the same as that for experiment 1 (described earlier in Sec. 2), with minor modifications to account for the different stimulus set. Visual word choices were “loss” and “laws,” and the 126-item stimulus set was presented in alternating blocks of unprocessed and eight-channel noise-band vocoder conditions. In view of the results of the first experiment, no four-channel NBV condition was used for experiment 2. The 126 stimulus items were heard five times in both conditions of spectral resolution. Cochlear implant listeners heard only the natural (unprocessed) items, five times each.
Analysis
Listeners’ binary responses (voiced or voiceless) were fit using a generalized linear (logistic) mixed-effects model (GLMM), using the same procedure as in experiment 1 (see Sec. 2). This experiment produced two sets of data: (1) NH listeners in both conditions of spectral resolution and (2) CI listeners listening to the modified natural sounds with intact spectral resolution.
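A schematic sketch of how the derived vowel:consonant ratio and the cue-by-resolution interactions enter such a model is given below. The column names and simulated responses are hypothetical, and the terms shown mirror those reported in the Results rather than necessarily reproducing the exact final model.

```r
# Sketch of the experiment 2 model structure (illustrative; not the original script).
library(lme4)
set.seed(2)
d2 <- expand.grid(participant = factor(1:6),
                  f1_trans    = c(0, 1),                 # falling F1 transition absent/present
                  vowel_dur   = seq(175, 325, by = 25),  # ms
                  cons_dur    = c(100, 175, 250),        # ms
                  voicing_dur = c(0, 30, 50),            # ms
                  SR          = c("clear", "nbv"))
d2$VCRatio     <- d2$vowel_dur / d2$cons_dur             # derived vowel:consonant duration ratio
d2$vowel_dur_c <- d2$vowel_dur - 250                     # center durations, as in experiment 1
d2$cons_dur_c  <- d2$cons_dur - 175
d2$voiced      <- rbinom(nrow(d2), 1, plogis(1.5 * (d2$VCRatio - 1.5)))  # fake responses

m_nh <- glmer(voiced ~ VCRatio + vowel_dur_c + f1_trans + voicing_dur + cons_dur_c +
                f1_trans:SR + voicing_dur:SR + vowel_dur_c:SR +
                VCRatio:vowel_dur_c + VCRatio:voicing_dur +
                (1 | participant),
              data = d2, family = binomial)
# With simulated data the random-effect variance may be estimated near zero.
```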
Results
Identification functions along the four parameter continua are shown in Figs. 6–9. Although the vowel:consonant duration ratio was not explicitly planned in stimulus construction, it was easily calculated and included as a separate factor in the model (since this factor was not fully crossed with the others, listeners' responses were not plotted for this cue). The following models were found to describe the data optimally:

(1) Perception by NH listeners in different conditions:

(2) Perception by CI listeners:
For these two models, the interaction between two factors A and B is indicated by A:B. Independent factors are indicated by “+.” “VCRatio” refers to the ratio of vowel duration to consonant duration. “SR” refers to spectral resolution (normal or degraded/NBV), and (1|Participant) is a random effect of participant. Predictors are listed in the order in which they were added to the model (this was determined by the AIC metric). Parameter estimates for the groups and for each participant are listed in Tables VII and VIII.
TABLE VII.
| | V:C ratio: NH | V:C ratio: NBV | V:C ratio: CI | Vowel duration: NH | Vowel duration: NBV | Vowel duration: CI | F1 transition: NH | F1 transition: NBV | F1 transition: CI |
|---|---|---|---|---|---|---|---|---|---|
| Group est. | 0.783 | 0.910 | −0.036 | 0.015 | 0.015 | 0.032 | 0.017 | 0.005 | 0.003 |
| Intercept | 0.190 | −0.137 | −0.307 | 0.213 | −0.163 | 0.295 | 0.213 | −0.163 | 0.295 |
| Indiv. ests. | | | | | | | | | |
| 01 | −1.36 | 1.30 | −0.48 | 0.053 | 0.066 | 0.030 | 0.030 | 0.005 | 0.011 |
| 02 | 1.47 | 1.45 | −0.18 | 0.018 | 0.022 | 0.075 | 0.028 | 0.010 | 0.030 |
| 03 | 2.32 | 1.08 | −0.33 | 0.025 | 0.025 | 0.027 | 0.022 | 0.007 | −0.001 |
| 04 | 0.35 | 1.82 | −0.10 | 0.057 | 0.058 | 0.104 | 0.020 | 0.003 | 0.011 |
| 05 | 0.95 | 1.09 | 1.15 | 0.015 | 0.018 | 0.079 | 0.019 | 0.006 | 0.016 |
| 06 | 2.71 | 4.11 | −0.30 | 0.046 | 0.034 | 0.101 | 0.027 | 0.008 | 0.026 |
| 07 | 1.30 | 2.30 | 0.17 | 0.032 | 0.030 | 0.014 | 0.027 | 0.015 | 0.007 |
| 08 | 1.31 | 0.10 | | 0.024 | 0.023 | | 0.014 | 0.005 | |
| 09 | −0.14 | 1.28 | | 0.040 | 0.061 | | 0.018 | 0.005 | |
| 10 | 0.46 | −0.22 | | 0.021 | 0.017 | | 0.007 | 0.001 | |
| 11 | 0.56 | 1.19 | | 0.018 | 0.025 | | 0.012 | 0.004 | |
TABLE VIII.
| | Voicing duration: NH | Voicing duration: NBV | Voicing duration: CI | Consonant duration: NH | Consonant duration: NBV | Consonant duration: CI |
|---|---|---|---|---|---|---|
| Group est. | 0.037 | 0.018 | 0.020 | −0.008 | −0.006 | −0.008 |
| Intercept | 0.213 | −0.163 | 0.295 | 0.190 | −0.137 | −0.295 |
| Indiv. ests. | | | | | | |
| 01 | 0.128 | 0.063 | −0.083 | −0.055 | −0.022 | −0.028 |
| 02 | 0.055 | 0.037 | −0.144 | −0.007 | −0.008 | 0.019 |
| 03 | 0.081 | 0.045 | −0.049 | 0.005 | −0.003 | −0.025 |
| 04 | 0.045 | 0.014 | −0.011 | −0.008 | 0.008 | 0.044 |
| 05 | 0.063 | 0.044 | 0.102 | −0.007 | −0.002 | −0.061 |
| 06 | 0.054 | 0.047 | 0.045 | −0.014 | −0.001 | 0.027 |
| 07 | 0.072 | 0.030 | 0.003 | −0.020 | −0.004 | −0.038 |
| 08 | 0.059 | 0.045 | | 0.000 | −0.017 | |
| 09 | 0.032 | 0.015 | | −0.023 | −0.004 | |
| 10 | 0.010 | 0.007 | | −0.003 | −0.010 | |
| 11 | 0.027 | 0.006 | | −0.005 | 0.009 | |
For the NH listener model, all five main factors (including VCRatio) were significant (all p < 0.001), although consonant duration was by far the least powerful main factor in the model, according to the AIC metric. Spectral resolution significantly interacted with F1 transition (p < 0.001; F1 transition was a weaker cue in the degraded condition), and with voicing duration (p < 0.001; voicing duration was a weaker cue in the degraded condition), but not with VCRatio or consonant duration. The interaction between spectral resolution and vowel duration did not reach statistical significance (p = 0.11; vowel duration was a slightly stronger cue in the degraded condition), but its inclusion improved the model according to the AIC metric. The effect of VCRatio changed slightly in the expected direction in the degraded condition, but this interaction did not reach significance (p = 0.60), and was not included in the model. Although the raw data suggested that the interaction between consonant duration and spectral resolution would go in the expected direction, the model did not confirm this; it did not reach significance (p = 0.42), and was not included in the model. There were significant interactions between VCRatio and vowel duration (p < 0.001), and between VCRatio and voicing duration (p = 0.005), indicating a complex interdependence of multiple cues for this contrast.
CI listeners were able to use the F1 transition cue (p < 0.001), but apparently not to the same extent as NH listeners (the parameter estimate was lower for the CI group). CI listeners showed use of the vowel duration cue (p < 0.001) that appears to be greater than that by NH listeners (the parameter estimate was higher for the CI group). The effect of consonant duration was significant (p < 0.001) and appears to be similar to that observed in the NH group. F1 transition significantly interacted with vowel duration (p < 0.001), with voicing duration (p < 0.001), and with consonant duration (p = 0.017). Although the effect of VCRatio did not reach significance, vowel duration significantly interacted with consonant duration (p = 0.019). There was a three-way interaction between vowel duration, consonant duration, and voicing duration that did not reach significance (p = 0.15), but its inclusion produced a significant improvement in the model, according to the AIC metric. Just as for NH listeners, the listeners with hearing impairment showed complex interdependence of cues for this contrast. A large amount of variability was seen in the CI listener group for all cues, especially for voicing duration and VCRatio, where several individuals' parameter estimates actually went in the reverse direction.
Conclusions
In experiment 2, listeners were presented with words that varied along four acoustic dimensions. In conditions that are thought to roughly simulate the use of a cochlear implant, normal-hearing listeners maintained use of all four cues, but showed decreased use of the F1 transition and consonant voicing cues. Reliance upon vowel duration did not change significantly when the resolution was degraded. The effect of vowel duration for NH listeners was larger than what was expected based on previous literature (perhaps because early-occurring spectral information in the vowel was neutralized).
Statistical comparisons were not made between NH and CI listeners, but a rough qualitative assessment of the data suggests that CI listeners made less use of the F1 transition and consonant voicing cues, and made more use of the vowel duration cue. These results are in agreement with experiment 1, namely, that listeners alter their use of phonetic cues when spectral resolution is degraded, and that CI listeners may use phonetic cues differently than NH listeners. It should be noted, however, that the use of the F1 transition cue is probably dependent on vowel environment. The F1 cue would be less useful for consonants following the /i/ or /u/ vowels; the F1 value in these segments is already low, so any F1 movement would be subtle, if at all present (Hogan and Rozsypal, 1980; Hillenbrand et al., 1984). It is thus possible that durational cues might already be more dominant in these contexts, and therefore not demand significantly different perceptual strategies by CI listeners or NH listeners in degraded conditions.
SUMMARY AND DISCUSSION
In these experiments, listeners categorized speech tokens that varied in multiple dimensions. The influence of each of those dimensions was modulated by the degree of spectral resolution with which the signal was delivered, or by whether the listener used a cochlear implant. We offer the following general conclusions.
(1) As spectral resolution was degraded, spectral cues (such as formant structure, vowel-inherent spectral change, and a vowel-offset formant transition) played a smaller role, and some temporal cues played a larger role, in normal-hearing listeners' phonetic identifications.

(2) Cochlear implant listeners appeared to show less use of spectral cues, and greater use of temporal cues, for phonetic identification compared to normal-hearing listeners. This effect was more pronounced for the final consonant voicing contrast than for the lax/tense vowel contrast.

(3) There was a high amount of variability in the individual data; some normal-hearing listeners showed different use of cues in degraded conditions while others did not. Similarly, some cochlear implant listeners showed patterns similar to the normal-hearing group, while others showed distinctively different patterns. It is not yet known whether either of these patterns can be associated with more general success in speech perception.

(4) Under conditions of normal redundancy of acoustic cues, a normal-hearing listener and a CI user can thus potentially achieve the same performance on a speech recognition task (word recognition, phoneme recognition, confusion matrix/information transfer analysis), but through the use of different acoustic cues.
More generally, this work accords with previous literature indicating greater use of vowel duration by listeners with hearing impairment (Revoile et al., 1982), and adds a new layer to work comparing the use of cues in natural and synthesized signals (Assmann and Katz, 2005; Nittrouer, 2004, 2005). The variability in the data is problematic for drawing general conclusions, but it may prove a fruitful avenue of exploration. A small number of CI listeners in this study appeared to rely heavily on the same cues used by NH listeners, while the others were relatively more influenced by other cues. While auditory prostheses and amplification devices are generally designed to transmit the acoustic cues used by normal-hearing listeners, not all listeners use those cues in the same way. Thus, effort might be wasted in delivering information to listeners who subsequently discard it. It is not known whether successful CI listeners are those who are able to extract and decode spectral cues despite device limitations, or whether they divert attention and resources away from those cues in favor of cues that remain intact in the temporal domain.
It should be noted that there are various limitations to generalizing from the CI simulations to real CI listeners. Among these are that (1) noise-band vocoding is only a crude approximation of the experience of electric hearing, (2) the two-alternative forced-choice task is an atypical listening scenario, lacking top-down influences such as contextual cues and visual information that could resolve perceptual ambiguity, (3) the NH participants listening to the simulations were generally much younger than the CI listeners, and (4) the simulated conditions approximate an initial device activation rather than everyday experience; most NH listeners in this experiment had no prior experience with noise-band vocoding, whereas the CI listeners had all been wearing their devices for multiple years. It is thus possible that the degraded conditions simulated the novelty of a cochlear implant but not eventual everyday performance.
The issue of age differences between the NH and CI groups introduces some complications in the analysis of the current data. It has been shown numerous times that older listeners show deficiencies in auditory temporal processing, both in basic psychophysical tasks (Gordon-Salant and Fitzgibbons, 1993, 1999) and in tasks involving perception of temporal phonetic cues (Gordon-Salant et al., 2006). They therefore might be less able to capitalize on the duration cue available in this study and in natural speech. Furthermore, older listeners have been shown to experience more difficulty with spectrally degraded speech in general (Schvartz et al., 2008). If one presumes that psychophysical capabilities and deficiencies influence behavior in this identification task, the trend shown by the CI listeners in this study is opposite to that which might be predicted by their age: they showed increased use of durational cues compared to the young NH listeners. However, capability is evidently not entirely predictive of cue usage; the CI listeners in this study did not use the fricative voicing cue even though this population has been shown to exhibit very fine sensitivity to temporal modulations. Perhaps younger CI listeners, with hypothetical advantages in temporal processing, would show more reliable use of the vowel duration and/or voicing cues than the older listeners in this study. The one older NH listener (n04 in Experiment 1) does not provide a sufficient basis for an age-matched group comparison, but it is reassuring that this listener’s data were not markedly different from the NH group mean (Table IV). Young postlingually deafened CI listeners are relatively scarce, however, and were not available at the time of this experiment; the question of the role of aging in the use of phonetic cues in electric hearing invites future work.
It could be argued that the difference in cue weighting or cue usage makes no difference to the “bottom line” of word recognition. After all, if a listener correctly perceives a word, he or she might not care how it was done. However, it is not clear whether all perceptual cue-weighting strategies are equally reliable, efficient, or taxing for the listener. The data in this paper cannot speak to any potential differences in processing speed, efficiency, or listening effort, but it should be noted that if normal-hearing listeners tend to rely on a particular cue for a contrast, there is probably a reason for that tendency (it may be explained by acoustic reliability; see Holt and Lotto, 2006; Toscano and McMurray, 2010). Future work might address this issue by exploring neurological responses to multidimensional speech stimuli (see Pakarinen et al., 2007, 2009), or by other, more sensitive measures.
The concept of trading relations between spectral and temporal information is not a new one. Although the hypotheses supported in this paper are not particularly novel or unexpected, they have been largely neglected in previous literature on listeners with hearing impairment. The reader is encouraged to distinguish carefully between phonetic feature recovery and phonetic cue use; a demonstration that a listener with hearing impairment recovers the “lax/tense,” “voicing,” or other features does not imply that recovery was achieved via the same perceptual cues used by normal-hearing listeners. In view of the multiple acoustic cues available for any particular phonetic segment, the contrasts explored in this study may represent just a fraction of those for which CI listeners could employ alternative perceptual strategies. Thus, caution should be used when comparing results of NH listeners and CI listeners on the same tasks; similar performance does not guarantee similar perception or perceptual processes.
ACKNOWLEDGMENTS
The authors would like to thank the participants for their time and willingness to contribute to this study, as well as Rochelle Newman, Shu-Chen Peng, Nelson Lu, Ewan Dunbar, and Shannon Barrios for their helpful comments and expertise. We also express appreciation to Mitchell Sommers for his patience and helpful suggestions on this manuscript and to two anonymous reviewers for helpful comments on an earlier version of this manuscript. We are grateful to Qian-Jie Fu for the software used for the experiment. This research was supported by NIH Grant No. R01 DC004786 to M.C. M.B.W. was supported by NIH Grant No. T32 DC000046-17 (PI: Arthur N. Popper).
This paper includes some material that appeared previously in M. Winn’s doctoral dissertation. Portions of this work were presented in “Phonetic cues are weighted differently when spectral resolution is degraded,” the American Auditory Society Annual Meeting, Scottsdale, AZ, March 2010 and in “Modulation of phonetic cue-weighting in adverse listening conditions,” 34th Mid-Winter meeting of the Association for Research in Otolaryngology, February 2011.
References
- Ainsworth, W. A. (1972). “Duration as a cue in the recognition of synthetic vowels,” J. Acoust. Soc. Am. 51, 648–651. 10.1121/1.1912889 [DOI] [Google Scholar]
- Akaike, H. (1974). “A new look at the statistical model identification,” IEEE Trans. Autom. Control 19(6), 716–723. 10.1109/TAC.1974.1100705 [DOI] [Google Scholar]
- ANSI (2004). ANSI S3.6-2004, American National Standard Specification for Audiometers (American National Standards Institute, New York). [Google Scholar]
- Assmann, P. F., and Katz, W. F. (2005). “Synthesis fidelity and time-varying spectral change in vowels,” J. Acoust. Soc. Am. 117, 886–895. 10.1121/1.1852549 [DOI] [PubMed] [Google Scholar]
- Bates, D., and Maechler, M. (2010). “lme4: Linear mixed-effects models using S4 classes,” R package version 0.999375-37, http://CRAN.R-project.org/package=lme4 (Last viewed January 9, 2011).
- Boersma, P., and Weenink, D. (2010). “Praat: doing phonetics by computer (Version 5.1.23) [Computer program],” http://www.praat.org/ (Last viewed January 1, 2010).
- Bohn, O.-S. (1995). “Cross-language speech perception in adults: First language transfer doesn’t tell it all,” in Speech Perception and Linguistic Experience: Issues in Cross-Language Research, edited by Strange, W. (York Press, Baltimore, MD), pp. 279–304. [Google Scholar]
- Bohn, O.-S., and Flege, J. E. (1990). “Interlingual identification and the role of foreign language experience in L2 vowel perception,” Appl. Psycholing. 11, 303–328. 10.1017/S0142716400008912 [DOI] [Google Scholar]
- Chang, Y.-P., and Fu, Q.-J. (2006). “Effects of talker variability on vowel recognition in cochlear implants,” J. Speech Lang. Hear. Res. 49, 1331–1341. 10.1044/1092-4388(2006/095) [DOI] [PubMed] [Google Scholar]
- Chatterjee, M., and Shannon, R. (1998). “Forward masked excitation patterns in multielectrode cochlear implants,” J. Acoust. Soc. Am. 103, 2565–2572. 10.1121/1.422777 [DOI] [PubMed] [Google Scholar]
- Chen, M. (1970). “Vowel length variation as a function of the voicing of the consonant environment,” Phonetica 22, 129–159. 10.1159/000259312 [DOI] [Google Scholar]
- Denes, P. (1955). “Effect of duration on the perception of voicing,” J. Acoust. Soc. Am. 27, 761–764. 10.1121/1.1908020 [DOI] [Google Scholar]
- Derr, M. A., and Massaro, D. W. (1980). “The contribution of vowel duration, F0 contour, and fricative duration as cues to the /juz/-/jus/ distinction,” Percept. Psychophys. 27, 51–59. 10.3758/BF03199906 [DOI] [PubMed] [Google Scholar]
- Dorman, M. F., and Loizou, P. C. (1997). “Mechanisms of vowel recognition for Ineraid patients fit with continuous interleaved sampling processors,” J. Acoust. Soc. Am. 102, 581–587. 10.1121/1.419731 [DOI] [PubMed] [Google Scholar]
- Dorman, M., and Loizou, P. (1998). “The identification of consonants and vowels by cochlear implant patients using a 6-channel continuous interleaved sampling processor and by normal-hearing subjects using simulations of processors with two to nine channels,” Ear Hear. 19, 162–166. 10.1097/00003446-199804000-00008 [DOI] [PubMed] [Google Scholar]
- Dorman, M., Dankowski, K., McCandless, G., Parkin, J., and Smith, L. (1991). “Vowel and consonant recognition with the aid of a multichannel cochlear implant,” Q. J. Exp. Psych. Sect. 43A, 585–601. 10.1080/14640749108400988 [DOI] [PubMed] [Google Scholar]
- Fang, Y. (2011). “Asymptotic equivalence between cross-validations and Akaike Information Criteria in mixed-effects models,” J. Data Sci. 9, 15–21. [Google Scholar]
- Fishman, K., Shannon, R., and Slattery, W. (1997). “Speech recognition as a function of the number of electrodes used in the SPEAK cochlear implant speech processor,” J. Speech Lang. Hear. Res. 40, 1201–1215. [DOI] [PubMed] [Google Scholar]
- Flege, J., and Hillenbrand, J. (1985). “Differential use of temporal cues to the /s/-/z/ contrast by non-native speakers of English,” J. Acoust. Soc. Am. 79, 508–517. 10.1121/1.393538 [DOI] [PubMed] [Google Scholar]
- Francis, A. L., Baldwin, K., and Nusbaum, H. C. (2000). “Effects of training on attention to acoustic cues,” Percept. Psychophys. 62, 1668–1680. 10.3758/BF03212164 [DOI] [PubMed] [Google Scholar]
- Francis A., Kaganovich, N., and Driscoll-Huber, C. (2008). “Cue-specific effects of categorization training on the relative weighting of acoustic cues to consonant voicing in English,” J. Acoust. Soc. Am. 124, 1234–1251. 10.1121/1.2945161 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Friesen, L., Shannon, R., Başkent, D., and Wang, X. (2001). “Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants,” J. Acoust. Soc. Am. 110, 1150–1163. 10.1121/1.1381538 [DOI] [PubMed] [Google Scholar]
- Fu, Q.-J. (2006). “Internet-based computer-assisted speech training (iCAST)” [Computer program], TigerSpeech Technology, version 5.04.02, http://www.tigerspeech.com/tst_icast.html (Last viewed February 8, 2010).
- Fu, Q.-J., and Shannon, R. V. (1999). “Recognition of spectrally degraded and frequency-shifted vowels in acoustic and electric hearing,” J. Acoust. Soc. Am. 105, 1889–1900. 10.1121/1.426725 [DOI] [PubMed] [Google Scholar]
- Gordon-Salant, S., and Fitzgibbons, P. (1993). “Temporal factors and speech recognition performance in young and elderly listeners,” J. Speech Hear Res. 36, 1276–1285. [DOI] [PubMed] [Google Scholar]
- Gordon-Salant, S., and Fitzgibbons, P. (1999). “Profile of auditory temporal processing in older listeners,” J. Speech Lang. Hear. Res. 42, 300–311. [DOI] [PubMed] [Google Scholar]
- Gordon-Salant, S., Yeni-Komshian, G., Fitzgibbons, P., and Barrett, J. (2006). “Age-related differences in identification and discrimination of temporal cues in speech segments,” J. Acoust. Soc. Am. 119, 2455–2466. 10.1121/1.2171527 [DOI] [PubMed] [Google Scholar]
- Greenwood, D. D. (1990). “A cochlear frequency-position function for several species—29 years later,” J. Acoust. Soc. Am. 87, 2592–2605. 10.1121/1.399052 [DOI] [PubMed] [Google Scholar]
- Gruenenfelder, T., and Pisoni, D. (1980). “Fundamental frequency as a cue to postvocalic consonantal voicing: Some data from perception and production,” Percept. Psychophys. 28, 514–520. 10.3758/BF03198819 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Haggard, M. (1978). “The devoicing of voiced fricatives,” J. Phonetics 6, 95–102. [Google Scholar]
- Hanson, H. M., Stevens, K. N., and Beaudoin, R. E. (1997). “New parameters and mapping relations for the HLsyn speech synthesizer,” J. Acoust. Soc. Am. 102, 3163. 10.1121/1.420760 [DOI] [Google Scholar]
- Hanson, H. M., and Stevens, K. N. (2002). “A quasiarticulatory approach to controlling acoustic source parameters in a Klatt-type formant synthesizer using HLsyn,” J. Acoust. Soc. Am. 112, 1158–1182. 10.1121/1.1498851 [DOI] [PubMed] [Google Scholar]
- Hawkins, S., and Nguyen, N. (2004). “Influence of syllable-coda voicing on the acoustic properties of syllable-onset /l/ in English,” J. Phonetics 32, 199–231. 10.1016/S0095-4470(03)00031-7 [DOI] [Google Scholar]
- Henry, B. A., Turner, C. W., and Behrens, A. (2005). “Spectral peak resolution and speech recognition in quiet: Normal hearing, hearing impaired, and cochlear implant listeners,” J. Acoust. Soc. Am. 118, 1111–1121. 10.1121/1.1944567 [DOI] [PubMed] [Google Scholar]
- Hillenbrand, J., Getty, L., Clark, M., and Wheeler, K. (1995). “Acoustic characteristics of American English vowels,” J. Acoust. Soc. Am. 97, 3099–3111. 10.1121/1.411872 [DOI] [PubMed] [Google Scholar]
- Hillenbrand, J. M., Clark, M. J., and Houde, R. A. (2000). “Some effects of duration on vowel recognition,” J. Acoust. Soc. Am. 108, 3013–3022. 10.1121/1.1323463 [DOI] [PubMed] [Google Scholar]
- Hillenbrand, J. M., and Gayvert, R. T. (1993). “Identification of steady-state vowels synthesized from the Peterson–Barney measurements,” J. Acoust. Soc. Am. 94, 668–674. 10.1121/1.406884 [DOI] [PubMed] [Google Scholar]
- Hillenbrand, J. M., and Nearey, T. M. (1999). “Identification of resynthesized /hVd/ utterances: Effects of formant contour,” J. Acoust. Soc. Am. 105, 3509–3523. 10.1121/1.424676 [DOI] [PubMed] [Google Scholar]
- Hillenbrand, J., Ingrisano, D., Smith, B., and Flege, J. (1984). “Perception of the voiced-voiceless contrast in syllable-final stops,” J. Acoust. Soc. Am. 76, 18–26. 10.1121/1.391094 [DOI] [PubMed] [Google Scholar]
- Hogan, J., and Rozsypal, A. (1980). “Evaluation of vowel duration as a cue for the voicing distinction in the following word-final consonant,” J. Acoust. Soc. Am. 67, 1764–1771. 10.1121/1.384304 [DOI] [PubMed] [Google Scholar]
- Holt, L. L., and Idemaru, K. (2011). “Generalization of dimension-based statistical learning,” Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong, China, pp. 882–885.
- Holt, L., and Lotto, A., (2006). “Cue weighting in auditory categorization: Implications for first and second language acquisition,” J. Acoust. Soc. Am. 119, 3059–3071. 10.1121/1.2188377 [DOI] [PubMed] [Google Scholar]
- House, A. (1961). “On vowel duration in English,” J. Acoust. Soc. Am. 33, 1174–1178. 10.1121/1.1908941 [DOI] [Google Scholar]
- House, A., and Fairbanks, G. (1953). “The influence of consonant environment upon the secondary acoustical characteristics of vowels,” J. Acoust. Soc. Am. 25, 105–113. 10.1121/1.1906982 [DOI] [Google Scholar]
- Iverson, P., Smith, C., and Evans, B. (2006). “Vowel recognition via cochlear implants and noise vocoders: Effects of formant movement and duration,” J. Acoust. Soc. Am. 120, 3998–4006. 10.1121/1.2372453 [DOI] [PubMed] [Google Scholar]
- Jenkins, J., Strange, W., and Edman, T. (1983). “Identification of vowels in ‘vowelless’ syllables,” Percept. Psychophys. 34, 441–450. 10.3758/BF03203059 [DOI] [PubMed] [Google Scholar]
- Kewley-Port, D., and Zheng, Y. (1998). “Modeling formant frequency discrimination for isolated vowels.” J. Acoust. Soc. Am. 103, 1654–1666. 10.1121/1.421264 [DOI] [PubMed] [Google Scholar]
- Kirk, K. I., Tye-Murray, N., and Hurtig, R. R. (1992). “The use of static and dynamic vowel cues by multichannel cochlear implant users,” J. Acoust. Soc. Am. 91, 3487–3498. 10.1121/1.402838 [DOI] [PubMed] [Google Scholar]
- Klatt, D. (1976). “Linguistic uses of segmental duration in English: Acoustic and perceptual evidence,” J. Acoust. Soc. Am. 59, 1208–1221. 10.1121/1.380986 [DOI] [PubMed] [Google Scholar]
- Lisker, L. (1978). “Rapid vs. rabid: A catalogue of acoustic features that may cue the distinction,” Haskins Lab. Status Rep. Speech Res. SR-54, 127–132. [Google Scholar]
- Loizou, P., and Poroy, O. (2001). “Minimum spectral contrast needed for vowel identification by normal hearing and cochlear implant listeners,” J. Acoust. Soc. Am. 110, 1619–1627. 10.1121/1.1388004 [DOI] [PubMed] [Google Scholar]
- Miller, G. A., and Nicely, P. E. (1955). “An analysis of perceptual confusions among some English consonants,” J. Acoust. Soc. Am. 27, 338–352. 10.1121/1.1907526 [DOI] [Google Scholar]
- Morrison, G., and Kondaurova, M. (2009). “Analysis of categorical response data: Use logistic regression rather than endpoint-difference scores or discriminant analysis,” J. Acoust. Soc. Am. 126, 2159–2162. 10.1121/1.3216917 [DOI] [PubMed] [Google Scholar]
- Morrison, G., and Nearey, T. (2007). “Testing theories of vowel inherent spectral change.” J. Acoust. Soc. Am. 122, EL15–EL22. 10.1121/1.2739111 [DOI] [PubMed] [Google Scholar]
- Morrison, G. (2005). “An appropriate metric for cue weighting in L2 speech perception: Response to Escudero & Boersma (2004),” Stud. Sec. Lang. Acq. 27, 597–606. 10.1017/S0272263105050266 [DOI] [Google Scholar]
- Nearey, T. M., and Assmann, P. (1986). “Modeling the role of vowel inherent spectral change in vowel identification,” J. Acoust. Soc. Am. 80, 1297–1308. 10.1121/1.394433 [DOI] [Google Scholar]
- Nittrouer, S. (2004). “The role of temporal and dynamic signal components in the perception of syllable-final stop voicing by children and adults,” J. Acoust. Soc. Am. 115, 1777–1790. 10.1121/1.1651192 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nittrouer, S. (2005) “Age-related differences in weighting and masking of two cues to word-final stop voicing in noise,” J. Acoust. Soc. Am. 118, 1072–1088. 10.1121/1.1940508 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pakarinen, S., Takegata, R., Rinne, T., Huotilainen, M., and Näätänen, R. (2007). “Measurement of extensive auditory discrimination profiles using the mismatch negativity (MMN) of the auditory event-related potential (ERP),” Clin. Neurophysiol. 118, 177–185. 10.1016/j.clinph.2006.09.001 [DOI] [PubMed] [Google Scholar]
- Pakarinen, S., Lovio, R., Huotilainen, M., Alku, P., Näätänen, R., and Kujala, T. (2009). “Fast multi-feature paradigm for recording several mismatch negativities (MMNs) to phonetic and acoustic changes in speech sounds,” Biol. Psychol. 82, 219–226. 10.1016/j.biopsycho.2009.07.008 [DOI] [PubMed] [Google Scholar]
- Parker, E. M., and Diehl, R. L. (1984). “Identifying vowels in CVC syllables: Effects of inserting silence and noise,” Percept. Psychophys. 36, 369–380. 10.3758/BF03202791 [DOI] [PubMed] [Google Scholar]
- Peng, S-C., Lu, N., and Chatterjee, M. (2009). “Effects of cooperating and conflicting cues on speech intonation recognition by cochlear implant users and normal hearing listeners,” Audiol. Neurotol. 14, 327–337. 10.1159/000212112 [DOI] [PMC free article] [PubMed] [Google Scholar]
- R Development Core Team (2010). “R: A language and environment for statistical computing,” R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org/ (Last viewed January 9, 2011).
- Raphael, L. (1972). “Preceding vowel duration as a cue to the perception of the voicing characteristic of word-final consonants in American English,” J. Acoust. Soc. Am. 51, 1296–1303. 10.1121/1.1912974 [DOI] [PubMed] [Google Scholar]
- Repp, B. (1982). “Phonetic trading relations and context effects: New experimental evidence for a speech mode of perception,” Psychol. Bul. 92, 81–110. 10.1037/0033-2909.92.1.81 [DOI] [PubMed] [Google Scholar]
- Revoile, S., Pickett, J., and Holden, L. (1982). “Acoustic cues to final stop voicing for impaired- and normal-hearing listeners,” J. Acoust. Soc. Am. 72, 1145–1154. 10.1121/1.388324 [DOI] [PubMed] [Google Scholar]
- Schvartz, K., Chatterjee, M., and Gordon-Salant, S. (2008). “Recognition of spectrally degraded phonemes by younger, middle-aged, and older normal-hearing listeners,” J. Acoust. Soc. Am. 124, 3972–3988. 10.1121/1.2997434 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shannon, R. (1989). “Detection of gaps in sinusoids and pulse trains by patients with cochlear implants,” J. Acoust. Soc. Am. 85, 2587–2592. 10.1121/1.397753 [DOI] [PubMed] [Google Scholar]
- Shannon, R. V. (1992). “Temporal modulation transfer functions in patients with cochlear implants,” J. Acoust. Soc. Am. 91, 2156–2164. 10.1121/1.403807 [DOI] [PubMed] [Google Scholar]
- Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science 270, 303–304. 10.1126/science.270.5234.303 [DOI] [PubMed] [Google Scholar]
- Soli, S. (1982). “Structure and duration of vowels together specify fricative voicing,” J. Acoust. Soc. Am. 72, 366–378. 10.1121/1.388080 [DOI] [PubMed] [Google Scholar]
- Stevens, K., Blumstein, S., Glicksman, L., Burton, M., and Kurowski, K. (1992). “Acoustic and perceptual characteristics of voicing in fricatives and fricative clusters,” J. Acoust. Soc. Am. 91, 2979–3000. 10.1121/1.402933 [DOI] [PubMed] [Google Scholar]
- Summers, W. V. (1988). “F1 structure provides information for final consonant voicing,” J. Acoust. Soc. Am. 84, 485–492. 10.1121/1.396826 [DOI] [PubMed] [Google Scholar]
- Toscano, J., and McMurray, B. (2010). “Cue integration with categories: Weighting acoustic cues in speech using unsupervised learning and distributional statistics,” Cogn. Sci. 34, 434–464. 10.1111/j.1551-6709.2009.01077.x [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vaida, F., and Blanchard, S. (2005). “Conditional Akaike information for mixed-effects models,” Biometrika 92, 351–370. 10.1093/biomet/92.2.351 [DOI] [Google Scholar]
- Walsh, T., and Parker, F. (1984). “A review of the vocalic cues to [± voice] in post-vocalic stops in English,” J. Phonetics 12, 207–218. [Google Scholar]
- Wardrip-Fruin, C. (1982). “On the status of temporal cues to phonetic categories: Preceding vowel duration as a cue to voicing in final stop consonants,” J. Acoust. Soc. Am. 71, 187–195. 10.1121/1.387346 [DOI] [Google Scholar]
- Wardrip-Fruin, C. (1985). “The effect of signal degradation on the status of cues to voicing in utterance-final stop consonants,” J. Acoust. Soc. Am. 77, 1907–1912. 10.1121/1.391833 [DOI] [Google Scholar]
- Warren, P., and Marslen-Wilson, W. (1989). “Cues to lexical choice: Discriminating place and voice,” Percept. Psychophys. 43, 21–30. 10.3758/BF03208969 [DOI] [PubMed] [Google Scholar]
- Whalen, D. H., Abramson, A. S., Lisker, L., and Mody, M. (1993). “F0 gives voicing information even with unambiguous voice onset times,” J. Acoust. Soc. Am. 93, 2152–2159. 10.1121/1.406678 [DOI] [PubMed] [Google Scholar]
- Xu, L., and Pfingst, B. (2003). “Relative importance of temporal envelope and fine structure in lexical-tone perception,” J. Acoust. Soc. Am. 114, 3024–3027. 10.1121/1.1623786 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xu, L., Thompson, K., and Pfingst, B. (2005). “Relative contributions of spectral and temporal cues for phoneme recognition,” J. Acoust. Soc. Am. 117, 3255–3267. 10.1121/1.1886405 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zahorian, S. A., and Jagharghi, A. J. (1993). “Spectral-shape features versus formants as acoustic correlates for vowels,” J. Acoust. Soc. Am. 94, 1966–1982. 10.1121/1.407520 [DOI] [PubMed] [Google Scholar]
- Zeng, F -G., Rebscher, S., Harrison, W., Sun, X., and Feng, H. (2008). “Cochlear implants: System design, Integration and Evaluation,” IEEE Rev. Biomed. Eng. 1, 115–142. 10.1109/RBME.2008.2008250 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zwicker, E., and Terhardt, E. (1980). “Analytical expressions for critical-band rate and critical bandwidth as a function of frequency,” J. Acoust. Soc. Am. 68, 1523–1525. 10.1121/1.385079 [DOI] [Google Scholar]