Abstract
Hearing-impaired (HI) listeners have been shown to exhibit increased fusion of dichotic vowels, even with different fundamental frequency (F0), leading to binaural spectral averaging and interference. To determine if similar fusion and averaging occurs for consonants, four natural and synthesized stop consonants (/pa/, /ba/, /ka/, /ga/) at three F0s of 74, 106, and 185 Hz were presented dichotically—with ΔF0 varied—to normal-hearing (NH) and HI listeners. Listeners identified the one or two consonants perceived, and response options included /ta/ and /da/ as fused percepts. As ΔF0 increased, both groups showed decreases in fusion and increases in percent correct identification of both consonants, with HI listeners displaying similar fusion but poorer identification. Both groups exhibited spectral averaging (psychoacoustic fusion) of place of articulation but phonetic feature fusion for differences in voicing. With synthetic consonants, NH subjects showed increased fusion and decreased identification. Most HI listeners were unable to discriminate the synthetic consonants. The findings suggest smaller differences between groups in consonant fusion than vowel fusion, possibly due to the presence of more cues for segregation in natural speech or reduced reliance on spectral cues for consonant perception. The inability of HI listeners to discriminate synthetic consonants suggests a reliance on cues other than formant transitions for consonant discrimination.
I. INTRODUCTION
Speech perception in background noise is mediated by grouping and segregation of sounds into different auditory streams (Darwin, 2008). The cues for grouping and segregation have been studied extensively in normal-hearing (NH) listeners but are not yet well understood in hearing-impaired (HI) listeners, who often have greater difficulty understanding speech in background noise.
The formation of discrete auditory streams from multiple sources requires two processes: grouping or “fusion” of the acoustic components associated with each source into a single object and, conversely, segregation or “fission” of the acoustic components associated with the different sources into separate streams (Moore and Gockel, 2012; Shinn-Cunningham et al., 2017). Generally, a listener is more likely to perceptually fuse or group acoustic information from a single source with acoustically similar components than information from different sources with acoustically dissimilar components. For instance, frequency cues, such as fundamental frequency (F0) or voice pitch (i.e., the perception of F0), as well as timing cues, such as onset synchrony, are used for grouping and segregation (Oxenham, 2008). Similarity of F0 increases perceptual fusion of sounds presented with different spectral content to the two ears (Ladefoged and Broadbent, 1957; Reiss and Molis, 2021). Such fusion is more likely, though, when spectral content is overlapping (Darwin and Hukin, 2004). However, fusion can still occur even with nonoverlapping spectral content, such as when there are spectral gaps across ears, and allow for speech identification (Yoon and Morgan, 2022). Conversely, differences in F0 or onset time decrease fusion and increase segregation, as shown for vowel sounds (Darwin, 1984; Reiss and Molis, 2021; Eddolls et al., 2022).
However, for HI listeners, the processes involved in fusion and segregation, particularly those dependent on F0 differences, are likely impaired. Previously, it was shown that compared with NH individuals, HI individuals exhibit increased perceptual fusion of tones differing by as much as 1–3 octaves in frequency across ears (Reiss et al., 2017). This abnormally broad fusion leads to a perceptual averaging of the pitches evoked by these disparate tone frequencies (Oh and Reiss, 2017). Broad fusion also leads to averaging of spectral shape across ears for more complex stimuli such as vowels: the vowel percept with two ears differs from that for either ear alone when pitch mismatches are present, such as in HI listeners with diplacusis or in bimodal cochlear implant users who wear a cochlear implant in one ear and a hearing aid (HA) in the other (Reiss et al., 2016). Reiss and Molis (2021) further demonstrated that individuals with abnormally broad fusion also have difficulty segregating dichotic vowels, even those with differing F0. In their work, dichotic vowels were presented with F0 varied between ears, and NH and HI listeners were asked to identify the single vowel or two vowels that were heard. Unlike in previous studies of concurrent or dichotic vowels (Zwicker, 1984; Summerfield and Assmann, 1991; Arehart et al., 2005), subjects were not informed that two vowels were always presented because Kwon and Perry (2014) showed that such information biases subjects toward responding with two vowels even if only one vowel was perceived. The results from Reiss and Molis (2021) demonstrated that whereas NH listeners were able to improve segregation and, thus, identification of both vowels with increasing F0 differences, HI listeners with broad tone fusion were not able to benefit from F0 differences. Instead, HI listeners continued to fuse and perceive only one vowel even for large F0 differences, and they often identified this vowel incorrectly as one of the vowels not presented in the pair. Often, this fused vowel percept was a spectral average of the two vowels presented. More recently, it was demonstrated that, for both NH and HI listeners, the broader the binaural tone fusion, the smaller the benefit of voice gender differences for speech recognition amid background talkers (Oh et al., 2022). Thus, abnormally broad fusion could interfere with the segregation of multiple speech streams based on F0 differences.
It is not clear if dichotic consonant perception is similarly affected by abnormally broad fusion in HI listeners as a result of the greater reliance on temporal cues and reduced dependence on spectral cues for consonants compared to vowels (Shannon et al., 1995). Consonant perception is generally thought to occur in two stages with peripheral frequency analysis followed by feature analysis (Shankweiler and Studdert-Kennedy, 1967). Manipulations, such as spectral filtering and the addition of masking noise, showed that consonants are discriminated based on features such as voicing, place, and manner of articulation, which are putatively processed independently (Miller and Nicely, 1955). Although place of articulation is dependent on spectral cues, voicing and manner are more dependent on temporal envelope cues, requiring only two spectral channels for 95% correct recognition (Shannon et al., 1995). Broad binaural fusion is known to affect perception of spectral cues but may have less impact on perception of temporal cues and, thus, consonant perception.
A few studies have examined fusion of synthetic consonants under dichotic presentation in NH listeners. Cutting (1976) demonstrated that dichotic synthetic stop consonants fuse according to a predictable pattern based on their linguistic features (e.g., place of articulation) and, presumably, on the acoustic cues conveying those features. Specifically, Cutting (1976) showed that the dichotic presentation of synthetic /ba/ and /ga/ to opposite ears at the same F0 can be fused in NH listeners to create a spectrally intermediate percept /da/. This process, called psychoacoustic fusion, theoretically involves averaging the formant transitions of both consonants centrally to produce the intermediate percept (Cutting, 1975) and is similar, conceptually, to the vowel averaging demonstrated in Reiss and Molis (2021). Repp (1976) also demonstrated fusion of dichotic consonant pairs selected from a continuum of synthesized stimuli from /ba/ to /ga/, presented at the same F0. Notably, as in Reiss and Molis (2021), participants were not informed that two sounds were presented and were instructed to select one sound for each dichotic pair. Hence, dichotic consonant fusion has been demonstrated, at least for synthetic stimuli. Cutting (1976) also described phonetic feature fusion, in which phonetic features can be shared between dichotically presented consonants such that /ba/ and /ka/ may be perceived as /pa/ or /ga/ based on the sharing of voicing or place of articulation. However, dichotic consonant fusion has not yet been demonstrated for naturally spoken consonants, which have more spectro-temporal cues for segregation, nor has it been studied in HI listeners.
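To make the two fusion types concrete, the following is a small sketch of our own (not from any of the cited studies) that encodes each stop as a (voicing, place) feature pair and enumerates the phonetic-feature-fusion candidates for a dichotic pair, i.e., the recombinations that borrow one feature from each ear:

# Feature coding of the six stops: (voicing, place)
FEATURES = {"pa": ("voiceless", "labial"), "ba": ("voiced", "labial"),
            "ta": ("voiceless", "alveolar"), "da": ("voiced", "alveolar"),
            "ka": ("voiceless", "velar"), "ga": ("voiced", "velar")}
SYLLABLE = {v: k for k, v in FEATURES.items()}

def feature_fusion_candidates(left, right):
    """Recombinations that take voicing from one ear and place from the other."""
    (v1, p1), (v2, p2) = FEATURES[left], FEATURES[right]
    return {SYLLABLE[(v1, p2)], SYLLABLE[(v2, p1)]} - {left, right}

print(feature_fusion_candidates("ba", "ka"))  # {'pa', 'ga'}, as in Cutting (1976)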
Here, similar to the study by Reiss and Molis (2021), we explored how ΔF0 plays a role in dichotic consonant fusion and identification in NH and HI listeners using natural and synthetic speech. We hypothesized that consonant fusion would decrease with increased ΔF0 in NH listeners but not in HI listeners, and fused percepts would correspond to spectral averages of the original stimuli presented, which is similar to the results obtained for vowels in Reiss and Molis (2021). We also hypothesized that both listener groups would have increased fusion and decreased identification of synthetic consonants compared to natural consonants, presumably a result of the reduced availability of spectro-temporal cues for segregation in synthetic stimuli.
II. METHODS
A. Subjects
These studies were conducted according to the guidelines for the protection of human subjects as set forth by the Institutional Review Board (IRB) of Oregon Health and Science University (OHSU), and the methods employed were approved by that IRB. Nine adult subjects with normal hearing (eight females and one male) and nine adult subjects with moderate-severe sensorineural hearing loss (HL; seven females and two males) participated in this study. All subjects were self-identified native speakers of American English. Subjects were screened for normal cognitive function using the Mini Mental Status Examination (MMSE) with a minimum score greater than 25 out of 30 required to qualify (Folstein et al., 1975; Souza et al., 2007). Tympanometry and otoscopy were also conducted for all subjects to verify normal middle ear function and normal external auditory canal, respectively.
The nine NH subjects ranged in age from 24 to 62 years [mean and std (standard deviation) = 44.2 ± 14.2 years]. All NH subjects were screened for audiometric thresholds within normal limits (thresholds ≤ 20 dB HL from 250 to 4000 Hz). Group average audiograms are shown in Figs. 1(A) and 1(B) for right and left ears, respectively.
FIG. 1.
Individual and mean audiograms are shown for right (A) and left (B) ears in NH and HI listeners (only group mean is displayed for NH listeners).
The HI subjects' demographic data, including age, gender, duration of moderate-severe or worse HL, duration of HA use, daily hours of HA use, and HA model(s), are shown in Table I.
TABLE I.
Demographic information for HI listeners.
Subject identification | Age (yr) | Gender | Duration of HL (yr) | Duration of HA use (yr) | Daily HA use (h/day) | HA model (ear if only one worn) |
---|---|---|---|---|---|---|
HI16 | 46 | F | 26 | 12 | 9 | Phonak Bolero (Stäfa, Switzerland) |
HI17 | 67 | F | 2 | 0.3 | 0 | Oticon Nera® (Copenhagen, Denmark) |
HI22 | 56 | M | 6 | 7 | 16 | Rexton RIC (Plymouth, MN) |
HI25 | 69 | F | 30 | 31 | 14 | Phonak Versata P (Stäfa, Switzerland) |
HI38 | 73 | F | 21 | 21 | 16 | Phonak Audeo V90 (Stäfa, Switzerland) |
HI41 | 39 | M | 39 | 7 | 10 | Resound Linx 3D-9 (Ballerup, Denmark) |
HI43 | 29 | F | 29 | 5 | 14 | Phonak Naída Q (Stäfa, Switzerland) |
HI44 | 60 | F | 15 | 4 | 17 | Phonak Bolero (Stäfa, Switzerland) |
HI45 | 38 | F | 33 | 33 | 16 | Siemens Signia (Munich, Germany) |
The subject ages ranged from 29 to 73 years (mean and std = 53.0 ± 15.6 years), the duration of HL ranged from 2 to 39 years (mean and std = 22.3 ± 12.5 years), and the duration of HA use ranged from 0.3 to 33 years (mean and std = 13.4 ± 12.1 years). Individual and group average audiograms are displayed in Figs. 1(A) and 1(B) for right and left ears, respectively. Age distributions did not differ significantly between NH and HI groups (Z = 99.5, p = 0.23, Wilcoxon rank-sum two-tailed test).
B. Stimuli
The stimuli consisted of six different consonants: /pa/ (as in “palm”), /ba/ (as in “ball”), /ka/ (as in “karma”), /ga/ (as in “gall”), /ta/ (as in “talk”), and /da/ (as in “dawn”) at three F0s of 74, 106, or 185 Hz. The spectrograms of /ba/, /da/, and /ga/ are shown for F0 = 74 Hz in Fig. 2; the spectrograms for /pa/, /ta/, and /ka/ have similar formant transitions albeit different voice onset times. Arrows indicate the F2 positions. It can be observed from the spectrograms that the F2 formant transition for /da/ is intermediate in frequency between the F2 formant transitions for /ba/ and /ga/. Presumably, if spectral averaging occurs during binaural fusion of /ba/ and /ga/, an intermediate formant transition corresponding to /da/ will be perceived and reflected in subject responses.
FIG. 2.
Spectrograms of natural (top) and synthetic (bottom) consonants /ba/ (left), /da/ (middle), and /ga/ (right). F2 values are indicated by arrows.
Natural consonants were modified from recordings in a study by Shannon et al. (1999) of two male (F0 = 74 and 106 Hz) and one female (F0 = 185 Hz) speakers of American English (see the supplementary material1). The natural consonants were produced with a sampling rate of 44.1 kHz and amplitude normalized to the steady-state portion of the vowel (Shannon et al., 1999). Stimulus durations were 375 ms. Synthetic consonants were produced using a Klatt synthesizer design (Klatt, 1980; see the supplementary material1 and the Appendix for details on stimulus synthesis). For unvoiced consonant-to-vowel stimuli, the amplitude of aspiration (AH) started at a value of 55. Then, at the start of the transition at 80 ms, the amplitude of voicing (AV) transitioned toward the voiced-consonant-to-vowel trajectory, whereas the AH transitioned to zero; the transition was complete at 100 ms. Finally, after the transition, the AV trajectory was identical to the voiced-consonant-to-vowel case, and the AH remained at zero. The formant frequency targets were based on the reported phonemic categorical formant frequencies for the stop consonants, as shown in Table II (Olive et al., 1993).
TABLE II.
Linguistic features and formants of stop consonant stimuli.
Place of articulation | Voiceless | Voiced | F2 (Hz) | F3 (Hz) |
---|---|---|---|---|
Labial | /p/ | /b/ | 1312 | 2348 |
Alveolar | /t/ | /d/ | 1772 | 3026 |
Velar | /k/ | /g/ | 2234 | 2018 |
For example, the formant frequency targets for /b/ were 300, 1000, and 2600 Hz. The /a/ formant frequency targets were 750, 1300, and 2600 Hz. The frequency transition between these targets was modeled as an exponential trajectory. Throughout the synthetic stimulus, F0 was modeled as a linearly falling trajectory from 120 to 70 Hz; a falling pitch contour was used to ensure sampling of the spectrum at frequencies other than multiples of F0 and to elicit a more natural sound (Bush and Kain, 2014). Stimulus durations were 500 ms. A sampling rate of 10 kHz was sufficient for the synthetic stimuli because the highest F3 trajectory does not exceed 4.5 kHz, which is below the Nyquist frequency of 5 kHz.
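To make the trajectory shapes concrete, the following is a minimal Python sketch (ours, not the authors' synthesis code). The formant targets, F0 contour, duration, and sampling rate are taken from the text, whereas the time constant tau is an illustrative assumption; the final lines check the spectral-averaging prediction against the Table II F2 values:

import numpy as np

fs = 10_000                        # sampling rate (Hz), as stated in the text
t = np.arange(int(0.5 * fs)) / fs  # 500-ms stimulus

# Onset targets for /b/ and steady-state targets for /a/ (F1, F2, F3) from the text
f_b = np.array([300.0, 1000.0, 2600.0])
f_a = np.array([750.0, 1300.0, 2600.0])

# Exponential approach from the consonant onset targets to the vowel targets;
# tau is an illustrative assumption, not a value reported in the paper
tau = 0.03
formants = f_a + (f_b - f_a) * np.exp(-t[:, None] / tau)  # shape (n_samples, 3)

# Linearly falling F0 contour from 120 to 70 Hz, as described in the text
f0 = np.linspace(120.0, 70.0, t.size)

# Spectral-averaging check with the Table II F2 values: the alveolar F2 lies
# almost exactly midway between the labial and velar F2s, consistent with a
# fused /ba/-/ga/ pair being heard as /da/
print((1312 + 2234) / 2)  # 1773.0 Hz, vs 1772 Hz for the alveolar F2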
C. Procedure
All experiments were conducted in a double-walled, sound-attenuated booth (Industrial Acoustic Company, IAC). Signals were presented using matlab software (MathWorks, Natick, MA; version R2010b), processed through an ESI Juli sound card (Leonberg, Germany), TDT PA5 digital attenuator (Alachua, FL), HB7 headphone buffer, and Sennheiser HD-25 headphones (Wedemark, Germany). The headphones' frequency response was equalized using calibration measurements obtained with a Brüel and Kjaer sound level meter with a 1-in. microphone in an artificial ear (Naerum, Denmark). Stimuli were presented through headphones at a comfortable level for all listeners—68 dB sound pressure level (SPL) for NH listeners and customized levels for HI listeners, who did not wear their HAs during the experiment. Customized levels for HI listeners were set to NAL-NL2 (National Acoustic Labs-Nonlinear fitting procedure, version 2) prescriptive targets based on the audiogram.
Prior to testing, subjects underwent familiarization with consonants at all F0s under binaural presentation, first for natural and then for synthetic stimuli. During familiarization with each consonant set, each subject was provided with a printed reference sheet of the six consonant sounds used in the study (as listed previously) with an example word for each consonant. Subjects were informed that some consonants “may sound a little different” (in reference to synthetic consonants) but were not specifically informed that natural or synthetic speech stimuli were being presented.
The subjects were next screened for the ability to perceive the correct consonants in each ear under monaural presentation using a six-alternative forced-choice procedure. Each run consisted of 72 trials, with a single consonant first presented diotically for practice and then monaurally to each ear. The 72 trials represent 6 consonants at 3 F0s with 4 repeats. In each trial, the subject was instructed to identify the consonant that they heard using a touchscreen. Feedback was provided. Each monaural run was repeated up to three times until subjects identified greater than 70% of consonants correctly in the left and right monaural conditions; only subjects who achieved greater than 70% correct were included in the study. The 70% criterion was chosen because achieving this score is challenging for NH subjects with synthetic consonants and for HI subjects with natural consonants (chance performance, 16.67%). For natural consonants in NH subjects, left ear scores ranged from 98.61% to 100% (mean and std = 99.85% ± 0.46%), and right ear scores were all 100%. For natural consonants in HI subjects, left ear scores ranged from 70.83% to 100% (mean and std = 92.28% ± 8.82%), and right ear scores ranged from 86.11% to 100% (mean and std = 92.13% ± 5.06%). For synthetic consonants in NH subjects, left ear scores ranged from 76.39% to 95.83% (mean and std = 86.46% ± 5.93%), and right ear scores ranged from 73.61% to 90.28% (mean and std = 86.29% ± 5.27%). Most of the HI subjects were unable to discriminate synthetic consonants at the required 70% level consistently.
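As a concrete illustration of the screening design (a sketch under our own assumptions, not the authors' code), one run's 72 trials can be generated as all combinations of 6 consonants and 3 F0s with 4 repeats and scored against the 70% inclusion criterion:

import itertools
import random

consonants = ["pa", "ba", "ka", "ga", "ta", "da"]
f0s = [74, 106, 185]  # F0 values in Hz
repeats = 4

# 6 consonants x 3 F0s, each repeated 4 times -> 72 trials per run
trials = list(itertools.product(consonants, f0s)) * repeats
random.shuffle(trials)
assert len(trials) == 72

def passes_screening(responses, answers, criterion=0.70):
    """Inclusion rule: greater than 70% correct (chance = 1/6, i.e., 16.67%)."""
    n_correct = sum(r == a for r, a in zip(responses, answers))
    return n_correct / len(answers) > criterion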
All NH subjects who passed the monaural screening were randomly assigned to complete dichotic testing with either natural or synthetic consonants first, followed by the other stimulus type. HI subjects were tested dichotically with natural consonants only. Dichotic consonant testing was conducted with consonant pairs always differing across the ears. Unlike in the familiarization and monaural testing, only four of the six original consonants were presented: /pa/, /ba/, /ka/, and /ga/. These four were chosen so that subjects could respond with an intermediate percept corresponding to a consonant in the English language, which is possible between /pa/ and /ka/ and between /ba/ and /ga/ but not for other combinations. Participants were not told that /ta/ and /da/ would never be presented, so that they would still consider these response options when they matched a perceived sound. Unique pairs of /pa/-/ka/, /pa/-/ga/, /ba/-/ka/, and /ba/-/ga/ were presented such that each stimulus pair and its converse (ears swapped) occurred with every combination of the three F0 possibilities; F0 was the same or different across the ears.
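A minimal sketch (ours, not the authors' code) of how such a dichotic trial list can be enumerated: with each of the four pairs presented in both ear orders and all nine F0 combinations, 72 unique dichotic trials result, and the three F0s yield the ΔF0 values of 0, 32, 79, and 111 Hz analyzed below:

import itertools

pairs = [("pa", "ka"), ("pa", "ga"), ("ba", "ka"), ("ba", "ga")]
f0s = [74, 106, 185]  # Hz

trials = []
for c_left, c_right in pairs + [(b, a) for (a, b) in pairs]:  # pair and its converse
    for f0_left, f0_right in itertools.product(f0s, f0s):
        trials.append(((c_left, f0_left), (c_right, f0_right)))

print(len(trials))  # 8 ear orders x 9 F0 combinations = 72 dichotic trials

# Delta-F0 values arising from the three F0s
print(sorted({abs(a - b) for a in f0s for b in f0s}))  # [0, 32, 79, 111]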
Again, a six-alternative forced-choice procedure was used. Similar to testing dichotic vowels in Reiss and Molis (2021), subjects could respond by selecting one or two consonants, then selecting a “done” button once one or more consonant selections were made. A “repeat” button to repeat the stimulus was also provided for subjects to use as needed.
Subjects were not informed that two sounds were always presented or that /ta/ and /da/ would not be presented. Five runs were completed for each consonant set. During these runs, feedback was not provided.
D. Analysis
Confusion matrices were generated for each subject by tallying the perceived consonant(s) for each stimulus pair and plotting the perceived consonant(s) versus the actual consonants presented. The percentage of single-consonant responses (fusions) and the percentage of trials in which both consonants were identified correctly were calculated.
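A sketch of how these two outcome measures can be computed from trial records (the data layout is our own assumption, not the authors' code):

def score_trials(trials):
    """Each trial is (presented_pair, responses): a frozenset of the two
    presented consonants and a set of one or two selected consonants."""
    n = len(trials)
    n_single = sum(len(resp) == 1 for _, resp in trials)
    n_both_correct = sum(resp == set(pair) for pair, resp in trials)
    pct_single = 100 * n_single / n        # fusion measure
    pct_both = 100 * n_both_correct / n    # identification measure
    return pct_single, pct_both

# Example: /ba/-/ga/ heard as a fused /da/ (single response), and
# /pa/-/ka/ correctly segregated and identified
example = [(frozenset({"ba", "ga"}), {"da"}),
           (frozenset({"pa", "ka"}), {"pa", "ka"})]
print(score_trials(example))  # (50.0, 50.0)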
A two-way repeated measures analysis of variance (ANOVA) was first performed on the natural consonant data with ΔF0 (0, 32, 79, or 111 Hz) as a within-subject factor and group (NH or HI) as a between-subject factor. Then, a two-way repeated measures ANOVA was performed on the consonant data in NH listeners with ΔF0 and stimulus type (natural or synthetic) as within-subjects factors. Greenhouse-Geisser corrections (Greenhouse and Geisser, 1959) were applied to the degrees of freedom in cases where the assumption of sphericity was rejected by Mauchly's test (Mauchly, 1940). Post hoc two-tailed t-tests were conducted when significant main effects were present with Bonferroni corrections as applicable. For these analyses, all percent single consonant responses and percent correct scores were transformed into rationalized arcsine units (rau; Studebaker, 1985).
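The rationalized arcsine transform has a closed form (Studebaker, 1985), sketched below; the mixed ANOVA call uses the pingouin package as one possible stand-in for the authors' statistics software (our assumption), with illustrative DataFrame column names:

import numpy as np

def rau(x, n):
    """Rationalized arcsine units (Studebaker, 1985): x correct out of n trials."""
    theta = np.arcsin(np.sqrt(x / (n + 1))) + np.arcsin(np.sqrt((x + 1) / (n + 1)))
    return 146.0 / np.pi * theta - 23.0

print(rau(36, 72))  # approximately 50 rau for 50% correct

# Mixed-design ANOVA on the transformed scores; `df` is assumed to be a
# long-format pandas DataFrame with columns 'subject', 'group', 'dF0',
# and 'rau_score':
# import pingouin as pg
# aov = pg.mixed_anova(data=df, dv='rau_score', within='dF0',
#                      between='group', subject='subject')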
III. RESULTS
Figure 3 shows the percent correct identification of both consonants in the dichotic pair as a function of ΔF0 for the groups (NH and HI) and stimulus conditions (natural and synthetic).
FIG. 3.
Mean percent identification of natural and synthetic consonants for NH listeners and natural consonants for HI listeners. Data points are offset intentionally to improve readability with vertical dotted lines indicating actual corresponding ΔF0 values. Error bars indicate standard deviation.
Each line indicates percent both correct when two consonants were perceived and selected. For NH subjects listening to natural consonants (Fig. 3, solid black line/circles), identification performance increased as the ΔF0 increased, particularly from ΔF0 = 0 to ΔF0 > 0. Figure 4 shows that when the consonants differed in voicing and place of articulation (e.g., /pa/-/ga/ and /ba/-/ka/), percent correct identification of both consonants was generally higher than when the consonants differed in only place of articulation (e.g., /pa/-/ka/ and /ba/-/ga/) across all ΔF0.
FIG. 4.
Percent correct identification of both natural consonants by NH listeners, which are plotted by consonant pair. The heavy gray line indicates mean over all pairs.
This difference is also apparent in the group-averaged confusion matrices for each ΔF0 [supplemental Figs. 1(a)–1(d)],1 where 2-consonant responses are depicted in the bottom 15 rows of each matrix. Although most responses are correct (darkest shading along the diagonal in rows 7–10), occasional feature mixing is apparent for those pairs differing in two features (lighter shading of /ta/-/ka/ in response to /pa/-/ka/ and lighter shading of /ka/-/ga/ in response to /pa/-/ga/), especially for ΔF0 = 0.
HI listeners demonstrated a similar trend to NH listeners for natural consonants (Fig. 3, dotted black line/triangles), where increased identification of both consonants was associated with increasing ΔF0. Pairwise analyses (Fig. 5) and confusion matrices [supplemental Figs. 2(a)–2(d)]1 showed slightly better performance for consonant pairs that differed in multiple features.
FIG. 5.
Percent correct identification of both natural consonants by HI listeners, which are plotted by consonant pair. The heavy gray line indicates mean over all pairs.
A repeated measures ANOVA showed significant main effects on percent correct identification of both consonants (in rau) for both ΔF0 [F(1.93,30.89) = 31.37, p < 0.001] and group [F(1,16) = 30.29, p < 0.001]. Differences in F0 across the two ears resulted in more correct identifications, and NH subjects more often identified both consonants of a pair correctly than did HI subjects. There was no interaction of ΔF0 × group [F(1.93,30.89) = 1.69, p = 0.20], meaning that the trends with F0 were similar for the two groups. A post hoc test of within-subjects contrasts indicated significant differences between ΔF0 = 0 and all other ΔF0s and between ΔF0 = 32 and ΔF0 = 79 [t(17) = –4.89, –6.67, –6.53, and –4.35, respectively, all p < 0.001] but not for other comparisons.
Figure 6 shows the percent single responses (i.e., fusion) for all groups and conditions.
FIG. 6.
Mean single responses of natural and synthetic consonants for NH listeners and natural consonants for HI listeners. Data points are offset intentionally to improve readability. Error bars indicate standard deviation.
For NH subjects listening to natural consonants (solid black line/circles), there was a decrease in the selection of a single response as the ΔF0 increased, which corresponds with the increase in percent correct identification of both consonants. As displayed in Fig. 7, the percent single response (i.e., fused response) for natural consonants in NH subjects was generally higher for consonant pairs that differed in only one feature than for pairs that differed in two features.
FIG. 7.
Percent single response for natural consonants by NH listeners, which are plotted by consonant pair. The heavy gray line indicates mean over all pairs.
This is also apparent in the group-averaged confusion matrices [supplemental Figs. 1(a)–1(d)],1 where the top six rows of each confusion matrix show the single consonant responses, especially for ΔF0 = 0. Across all ΔF0, when /pa/-/ka/ and /ba/-/ka/ were presented, the predominant consonant chosen was /ka/. When /ba/-/ga/ was presented, the predominant consonant chosen was /ga/.
HI listeners showed a similar trend for natural consonants of decreased single responses with increasing ΔF0 (Fig. 6, dotted black line/triangles). As shown in Fig. 8, greater fusion, again, was visible for consonants that differed in only the place feature [see also supplemental Figs. 2(a)–2(d)].1
FIG. 8.
Percent single response for natural consonants by HI listeners, which are plotted by consonant pair. Heavy gray lines indicate mean over all pairs.
A repeated measures ANOVA showed a significant main effect on percent single responses (in rau) for ΔF0 [F(1.27,20.34) = 25.12, p < 0.001] but not group [F(1,16) = 4.09, p = 0.20]. Differences in F0 across the two ears resulted in fewer single responses, but there were no differences between NH and HI groups. There was, again, no interaction of ΔF0 × group [F(1.27,20.34) = 0.40, p = 0.58], meaning that the trends with F0 were similar for the two groups. A post hoc test of within-subjects contrasts indicated significant differences in percent single responses between ΔF0 = 0 and all other ΔF0s only [t(17) = 7.58, 5.70, and 5.00, respectively, all p < 0.001].
Figure 3 also shows percent correct identification results for synthetic consonants in NH listeners (dashed gray line/diamonds). With increased ΔF0, the identification of both consonants generally improved, although performance never exceeded 50% correct. Figure 9 shows that, as with natural consonants in NH listeners, percent correct identification was higher for consonants that differed in more than one feature than for consonants that differed in the place feature only, although this effect was more apparent for the /ba/-/ka/ pair than the /pa/-/ga/ pair [see also supplemental Figs. 3(a)–3(d)].1
FIG. 9.
Percent correct identification of both synthetic consonants by NH listeners, which are plotted by consonant pair. The heavy gray line indicates mean over all pairs.
Only three out of nine HI listeners (as compared with all NH listeners) could successfully complete some of the monaural conditions for synthetic consonants, and those subjects did not show significant changes, i.e., no benefit, with ΔF0 for synthetic consonant testing (not shown).
A repeated measures ANOVA was performed to investigate main effects of ΔF0 and stimulus type (natural versus synthetic) in NH listeners and revealed significant main effects on percent correct identification of both consonants (in rau) for ΔF0 [F(1.59,11.15) = 133.11, p < 0.001] and stimulus type [F(1,7) = 191.82, p < 0.001]. Increasing differences in F0 across the two ears resulted in more correct identifications for natural and synthetic consonants, and identification was higher for natural than synthetic consonants. There was no interaction of ΔF0 × stimulus type [F(1.27,8.91) = 2.86, p = 0.29], meaning that the trends with F0 were similar for natural and synthetic consonants. A post hoc test of within-subjects contrasts indicated significant differences between all ΔF0 comparisons (p < 0.01) except for ΔF0 = 79 versus ΔF0 = 111 (p > 0.05) and significant differences between natural and synthetic consonants for all ΔF0s [t(7) = –6.33, –12.69, –12.81, and –17.72, all p < 0.001, for ΔF0 = 0, 32, 79, and 111 Hz, respectively].
Figure 6 also shows the percent single response (fusion) for synthetic consonants in NH listeners (gray dashed lines/diamonds). When ΔF0 was 0 Hz, synthetic consonants were almost completely fused, which is depicted by a nearly 100% single consonant response rate versus the near-zero correct identification at this ΔF0. Although there was a smaller increase in percent correct identification of both consonants with increased ΔF0 than for natural consonants, there was a profound decrease in perception of a single response with increased ΔF0. Figure 10 again shows that consonants differing in only the place feature were more likely to be fused [see also supplemental Figs. 3(a)–3(d)].1
FIG. 10.
Percent single response for synthetic consonants by NH listeners, which are plotted by consonant pair. The heavy gray line indicates mean over all pairs.
A repeated measures ANOVA showed a significant main effect on percent single responses (in rau) for both ΔF0 [F(1.36,9.50) = 75.96, p < 0.001] and stimulus type [F(1,7) = 44.45, p < 0.001]. Increasing differences in F0 across the two ears resulted in fewer single responses, and there were more single responses for synthetic than natural consonants. There was a significant interaction of ΔF0 × stimulus type [F(1.09,7.54) = 6.84, p = 0.03], meaning that the trends with F0 differed for the two stimulus types, reflected in the greater decrease in fusion from ΔF0 = 0 to ΔF0 = 32 for synthetic consonants. A post hoc test of within-subjects contrasts indicated significant differences between all ΔF0 comparisons (p < 0.005 for all comparisons except ΔF0 = 79 versus ΔF0 = 111, where p < 0.05) and significant differences between natural and synthetic consonants for all ΔF0s [t(7) = 5.67, 5.15, 3.93, and 2.98, with p < 0.001, p = 0.001, p = 0.006, and p = 0.021 for ΔF0 = 0, 32, 79, and 111 Hz, respectively].
Analyses with age added showed no significant interactions with age except for percent single responses for synthetic consonants in NH listeners, where a significant interaction of age was observed with ΔF0 [F(1.74,10.47) = 8.76, p = 0.007], suggesting age differences in effects of ΔF0 on fusion of synthetic consonants.
Further comparisons of the fusion and identification trends by consonant pair show that the type of fusion depends on the pair, especially in the synthetic consonant condition. For consonant pairs differing in place of articulation, NH listeners generally showed psychoacoustic fusions, or the selection of a spectrally intermediate consonant not among the two original stimuli presented [e.g., selection of /ta/ or /da/ for the fused consonant pairs /pa/-/ka/ or /ba/-/ga/, respectively; see supplemental Figs. 1(a)–1(d) and Figs. 3(a)–3(d)].1
For consonants differing in voicing, NH listeners showed more phonetic feature fusions [e.g., selection of /pa/ or /ga/ when presented with the pair /ba/-/ka/; see supplemental Figs. 1(a)–1(d) and Figs. 3(a)–3(d)],1 although the velar consonant was generally favored (e.g., /ga/ was favored over /pa/). With natural consonants, phonetic feature fusion occurred less frequently.
With natural consonants, HI listeners were more likely than NH listeners to experience psychoacoustic fusion and choose the intermediate percept [see supplemental Figs. 1(a)–1(d) and 2(a)–2(d)].1 Unlike NH listeners, HI listeners did not show velar predominance for voiceless consonants, although there was a preference for velar consonants in the voiced pairs.
IV. DISCUSSION
A. The role of F0 differences in dichotic consonant fusion and identification
As hypothesized, F0 appears to play a key role in the processing of two auditory streams of consonants. When ΔF0 = 0, NH and HI listeners had increased perception of a single (i.e., fused) response and, concomitantly, more difficulty in identifying both consonants correctly. When ΔF0 > 0, NH and HI listeners had decreased perception of a fused response and improved identification of both consonants, which is consistent with prior data on the role of F0 in separating speech streams as well as with the previous findings on dichotic vowel perception (Reiss and Molis, 2021). HI listeners benefited less than NH listeners from F0 differences for identification, but there were no significant group differences in consonant fusion. This lack of a difference in consonant fusion differs from the previous findings on vowel fusion. There are three possible interpretations of these findings. First, the previous study used only synthetic vowels, whereas in the current study, HI listeners were successfully tested only on natural consonants. Second, this was a different sample of HI listeners, who may not have had fusion as broad as the HI listeners in the previous study. Third, abnormal binaural spectral fusion may not have as much impact on the segregation of consonants as on that of vowels, owing to reduced reliance on spectral cues for consonants (Shannon et al., 1995), especially for pairs differing in more than one linguistic feature. More data are needed with synthetic consonants that HI listeners are able to discriminate, repeated binaural tone fusion measurements, and/or reduction of available temporal cues to narrow down these possible interpretations.
The benefit of F0 was consistent for natural and synthetic consonants in NH listeners, with similar trends of decreasing fusion and increasing identification performance as ΔF0 increased. However, NH listeners experienced less fusion and performed better on identification of both consonants with the natural consonants, suggesting a benefit of natural voice cues for promoting segregation over fusion. This benefit may be derived from natural F0 fluctuations and trajectories that are not present in synthetic speech (van Wieringen and Wouters, 1999). Interestingly, most HI listeners were unable to identify the synthetic consonants, implying that they rely on cues other than those provided for discriminating consonants (e.g., steady-state formant energy rather than formant transitions; Kent and Read, 2002); this is consistent with studies of the consonants /d/, /n/, and /l/, in which HI listeners could discriminate the natural but not the synthetic consonants when only transition segment cues were available (Revoile et al., 1991). Because the synthesizer used makes several simplifying assumptions, it could also have removed some of the fluctuations or other cues in natural speech needed by HI listeners. In theory, an articulatory synthesizer could provide better results for future study.
B. The role of consonant features in fusion
For both NH and HI listeners, a greater number of linguistic feature differences between the two consonants was protective against fusion, consistent with prior research by Cutting (1976) showing that linguistic features, such as place of articulation and voicing, play a significant role in preventing fusion.
Interestingly, an increased number of linguistic feature differences between the presented consonants appears to reduce fusion more for natural consonants than for synthetic consonants. The low-pass filtering studies performed by Miller and Nicely (1955) may help explain this phenomenon. Perhaps the high-frequency information encoded in natural speech, beyond the main formants traditionally used to describe these stop consonants, also plays a role in the segregation of these sounds. That is, whereas the lower-frequency information is sufficient to identify the signals in isolation, as evidenced by appropriate identification of the stimuli during screening for natural and synthetic consonants in NH listeners, dichotic presentation may require the higher-frequency information for the segregation of signals. Another possibility is that F0 fluctuations and other spectro-temporal cues to segregation in natural consonants (van Wieringen and Wouters, 1999) enhance the protective effects of linguistic features.
C. Patterns of consonant fusion
Psychoacoustic and phonetic feature fusion occurred for dichotic consonants in this study. Psychoacoustic fusion—the perception of a single consonant with linguistic qualities averaged from two presented consonants (e.g., /da/ from /ba/-/ga/)—was more common in HI listeners than NH listeners, even though the stimuli corresponding to the averaged percepts were never actually presented, which is consistent with the spectral averaging observed for vowels (Reiss and Molis, 2021). This suggests that although single response rates did not differ significantly between the groups, fusion may have led to greater averaging and effects on percepts and, thus, identification in HI listeners.
Phonetic feature fusion—the perception of at least one consonant having borrowed linguistic features from two different consonants, particularly those differing in voicing in this study (e.g., /ba/ from /pa/ and /ga/)—occurred more frequently with synthetic than natural consonants in NH listeners. In cases without psychoacoustic or phonetic feature fusion, NH listeners showed a predominance of the velar consonant over the labial consonant with natural consonants. In HI listeners, despite their predominance of intermediate percepts, there was no velar predominance for voiceless pairs, although there was a preference for velar consonants in the voiced pairs.
D. Conclusion and future directions
As hypothesized, fusion decreases and identification improves for dichotic consonants with increasing ΔF0 in NH and HI listeners, indicating a benefit of F0 differences for segregation and identification. With synthetic consonants, fusion increases and identification worsens compared to natural consonants in NH listeners, indicating a benefit of natural voice cues for segregation. The findings indicate that as for vowels, HI listeners have a deficit in identification of consonants compared to NH listeners, even when there are F0 differences.
ACKNOWLEDGMENTS
This study was supported by National Institutes of Health-National Institute on Deafness and Other Communication Disorders (NIH-NIDCD) Grant No. R01 DC013307. We appreciate the subjects who participated and members of the Reiss Laboratory at OHSU for their assistance and support with this project.
APPENDIX: SYNTHETIC STIMULUS SYNTHESIS
For synthetic vowels, we first determined the average root mean square (RMS) energy of each vowel type (E_avg-vowel), calculated over all natural vowel tokens for that type. Then, for natural and synthesized stimuli in that vowel class, we determined their original vowel RMS energy (E_stim) and, finally, multiplied the signal by the gain

$E_{\mathrm{avg\text{-}vowel}} / E_{\mathrm{stim}}$.  (A1)
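A sketch of this normalization in Python (ours, not the authors' code; here the E terms are taken to be RMS amplitudes, so scaling by the ratio in Eq. (A1) sets the stimulus RMS to the class average):

import numpy as np

def rms(x):
    """Root mean square amplitude of a signal."""
    return np.sqrt(np.mean(np.square(x)))

def normalize_to_vowel_class(stim, e_avg_vowel):
    """Scale `stim` by Eq. (A1) so its RMS matches the vowel-class average."""
    return stim * (e_avg_vowel / rms(stim))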
To produce synthetic consonants, the AV and AH were manipulated; voiced stops have minimal AH with nominal AV, whereas unvoiced stops have minimal AV with energy from AH.
Thus, for voiced consonant-to-vowel stimuli, the AV fell exponentially from 60 to 0 dB using the following code in Python (in discrete time):

import numpy as np
N = 5000  # number of samples (500 ms at the 10-kHz sampling rate)
AV = np.linspace(1, 0, N) ** 0.1 * 60  # amplitude of voicing, 60 -> 0 dB
This is equivalent to (in continuous time)

$\mathrm{AV}(t) = 60\,(1 - t/T)^{0.1}, \quad 0 \le t \le T$,  (A2)

where T is the stimulus duration corresponding to the N samples; meanwhile, AH was zero.
For unvoiced consonant-to-vowel stimuli, AV started at zero and then linearly transitioned to the decaying trajectory above, whereas AH started at 55 and then linearly transitioned to zero, using the following code (applied after initializing AV as above):

Nv1 = 800   # start of unvoiced-voiced transition (80 ms)
Nv2 = 1000  # end of unvoiced-voiced transition (100 ms)
AH = np.zeros(N)  # amplitude of aspiration
AV[:Nv1] = 0
AH[:Nv1] = 55
AV[Nv1:Nv2] = np.linspace(0, AV[Nv2], Nv2 - Nv1)  # ramp AV up to the voiced trajectory
AH[Nv1:Nv2] = np.linspace(55, 0, Nv2 - Nv1)  # ramp AH down to zero
This is equivalent to

$\mathrm{AV}(t) = 0, \quad 0 \le t < N_{v1}$,  (A3)

$\mathrm{AV}(t) = 60\,(1 - N_{v2}/T)^{0.1}\,\frac{t - N_{v1}}{N_{v2} - N_{v1}}, \quad N_{v1} \le t < N_{v2}$,  (A4)

$\mathrm{AV}(t) = 60\,(1 - t/T)^{0.1}, \quad N_{v2} \le t \le T$,  (A5)

and

$\mathrm{AH}(t) = 55, \quad 0 \le t < N_{v1}$,  (A6)

$\mathrm{AH}(t) = 55\,\frac{N_{v2} - t}{N_{v2} - N_{v1}}, \quad N_{v1} \le t < N_{v2}$,  (A7)

$\mathrm{AH}(t) = 0, \quad N_{v2} \le t \le T$,  (A8)
where, for a sample rate of 10 kHz, $N_{v1}$ corresponds to 80 ms, $N_{v2}$ to 100 ms, and $T$ is the stimulus duration.
Portions of this work were presented in “Fusion and identification of dichotic consonants in normal-hearing and hearing-impaired listeners,” 2019 Meeting of the Association for Research in Otolaryngology, Baltimore, MD, USA, February 2019.
Footnotes
See supplementary material at https://doi.org/10.1121/10.0024245 for confusion matrices for consonant presentation and selection, details of synthetic consonant generation, and sound files of natural and synthetic consonants used in the study.
References
- 1. Arehart, K. H., Katz-Rossi, J., and Prutsman, J. S. (2005). "Double-vowel perception in listeners with cochlear hearing loss: Differences in fundamental frequency, ear of presentation, and relative amplitude," J. Speech Lang. Hear. Res. 48, 236–252. 10.1044/1092-4388(2005/017)
- 2. Bush, B. O., and Kain, A. (2014). "Modeling coarticulation in continuous speech," in 15th Annual Conference of the International Speech Communication Association, Singapore, September 14–18, 2014, pp. 193–197.
- 3. Cutting, J. E. (1975). "Aspects of phonological fusion," J. Exp. Psychol. Hum. Percept. Perform. 1(2), 105–120. 10.1037/0096-1523.1.2.105
- 4. Cutting, J. E. (1976). "Auditory and linguistic processes in speech perception: Inferences from six fusions in dichotic listening," Psychol. Rev. 83(2), 114–140. 10.1037/0033-295X.83.2.114
- 5. Darwin, C. J. (1984). "Perceiving vowels in the presence of another sound: Constraints on formant perception," J. Acoust. Soc. Am. 76(6), 1636–1647. 10.1121/1.391610
- 6. Darwin, C. J. (2008). "Listening to speech in the presence of other sounds," Philos. Trans. R. Soc. B 363(1493), 1011–1021. 10.1098/rstb.2007.2156
- 7. Darwin, C. J., and Hukin, R. W. (2004). "Limits to the role of a common fundamental frequency in the fusion of two sounds with different spatial cues," J. Acoust. Soc. Am. 116(1), 502–506. 10.1121/1.1760794
- 8. Eddolls, M. S., Molis, M. R., and Reiss, L. A. J. (2022). "Onset asynchrony: Cue to aid dichotic vowel segregation in listeners with normal hearing and hearing loss," J. Speech Lang. Hear. Res. 65(7), 2709–2719. 10.1044/2022_JSLHR-21-00411
- 9. Folstein, M. F., Folstein, S. E., and McHugh, P. R. (1975). "'Mini-mental state': A practical method for grading the cognitive state of patients for the clinician," J. Psychiatr. Res. 12(3), 189–198. 10.1016/0022-3956(75)90026-6
- 10. Greenhouse, S. W., and Geisser, S. (1959). "On methods in the analysis of profile data," Psychometrika 24, 95–112. 10.1007/BF02289823
- 11. Kent, R. D., and Read, C. (2002). The Acoustic Analysis of Speech, 2nd ed. (Thomson Learning, Albany, NY).
- 12. Klatt, D. H. (1980). "Software for a cascade/parallel formant synthesizer," J. Acoust. Soc. Am. 67(3), 971–995. 10.1121/1.383940
- 13. Kwon, B. J., and Perry, T. T. (2014). "Identification and multiplicity of double vowels in cochlear implant users," J. Speech Lang. Hear. Res. 57(5), 1983–1996. 10.1044/2014_JSLHR-H-12-0410
- 14. Ladefoged, P., and Broadbent, D. E. (1957). "Information conveyed by vowels," J. Acoust. Soc. Am. 29, 98–104. 10.1121/1.1908694
- 15. Mauchly, J. W. (1940). "Significance test for sphericity of a normal n-variate distribution," Ann. Math. Stat. 11, 204–209. 10.1214/aoms/1177731915
- 16. Miller, G. A., and Nicely, P. E. (1955). "An analysis of perceptual confusions among some English consonants," J. Acoust. Soc. Am. 27(2), 338–352. 10.1121/1.1907526
- 17. Moore, B. C., and Gockel, H. E. (2012). "Properties of auditory stream formation," Philos. Trans. R. Soc. B 367, 919–931. 10.1098/rstb.2011.0355
- 18. Oh, Y., Hartling, C. L., Srinivasan, N. K., Diedesch, A. C., Gallun, F. J., and Reiss, L. A. J. (2022). "Factors underlying masking release by voice-gender differences and spatial separation cues in multi-talker listening environments in listeners with and without hearing loss," Front. Neurosci. 16, 1059639. 10.3389/fnins.2022.1059639
- 19. Oh, Y., and Reiss, L. A. (2017). "Binaural pitch fusion: Pitch averaging and dominance in hearing-impaired listeners with broad fusion," J. Acoust. Soc. Am. 142(2), 780–791. 10.1121/1.4997190
- 20. Olive, J. P., Greenwood, A., and Coleman, J. (1993). Acoustics of American English Speech: A Dynamic Approach (Springer, New York).
- 21. Oxenham, A. J. (2008). "Pitch perception and auditory stream segregation: Implications for hearing loss and cochlear implants," Trends Amplif. 12(4), 316–331. 10.1177/1084713808325881
- 22. Reiss, L., and Molis, M. R. (2021). "An alternative explanation for difficulties with speech in background talkers: Abnormal fusion of vowels across fundamental frequency and ears," J. Assoc. Res. Otolaryngol. 22(4), 443–461. 10.1007/s10162-021-00790-7
- 23. Reiss, L. A., Eggleston, J. L., Walker, E. P., and Oh, Y. (2016). "Two ears are not always better than one: Mandatory vowel fusion across spectrally mismatched ears in hearing-impaired listeners," J. Assoc. Res. Otolaryngol. 17(4), 341–356. 10.1007/s10162-016-0570-z
- 24. Reiss, L. A., Shayman, C. S., Walker, E. P., Bennett, K. O., Fowler, J. R., Hartling, C. L., Glickman, B., Lasarev, M., and Oh, Y. (2017). "Binaural pitch fusion is broader in hearing-impaired than normal-hearing listeners," J. Acoust. Soc. Am. 141(3), 1909–1920. 10.1121/1.4978009
- 25. Repp, B. H. (1976). "Identification of dichotic fusions," J. Acoust. Soc. Am. 60(2), 456–469. 10.1121/1.381103
- 26. Revoile, S. G., Pickett, J. M., and Kozma-Spytek, L. (1991). "Spectral cues to perception of /d, n, l/ by normal- and impaired-hearing listeners," J. Acoust. Soc. Am. 90(2), 787–798. 10.1121/1.401948
- 27. Shankweiler, D., and Studdert-Kennedy, M. (1967). "Identification of consonants and vowels presented to left and right ears," Q. J. Exp. Psychol. 19(1), 59–63. 10.1080/14640746708400069
- 28. Shannon, R. V., Jensvold, A., Padilla, M., Robert, M. E., and Wang, X. S. (1999). "Consonant recordings for speech testing," J. Acoust. Soc. Am. 106(6), L71–L74. 10.1121/1.428150
- 29. Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). "Speech recognition with primarily temporal cues," Science 270(5234), 303–304. 10.1126/science.270.5234.303
- 30. Shinn-Cunningham, B., Best, V., and Lee, A. K. C. (2017). "Auditory object formation and selection," in The Auditory System at the Cocktail Party, edited by J. C. Middlebrooks, J. Z. Simon, A. N. Popper, and R. R. Fay (Springer International, Cham, Switzerland), pp. 7–40.
- 31. Souza, P. E., Boike, K. T., Witherell, K., and Tremblay, K. (2007). "Prediction of speech recognition from audibility in older listeners with hearing loss: Effects of age, amplification, and background noise," J. Am. Acad. Audiol. 18(1), 054–065. 10.3766/jaaa.18.1.5
- 32. Studebaker, G. A. (1985). "A 'rationalized' arcsine transform," J. Speech Hear. Res. 28(3), 455–462. 10.1044/jshr.2803.455
- 33. Summerfield, Q., and Assmann, P. F. (1991). "Perception of concurrent vowels: Effects of harmonic misalignment and pitch-period asynchrony," J. Acoust. Soc. Am. 89, 1364–1377. 10.1121/1.400659
- 34. van Wieringen, A., and Wouters, J. (1999). "Natural vowel and consonant recognition by Laura cochlear implantees," Ear Hear. 20(2), 89–103. 10.1097/00003446-199904000-00001
- 35. Yoon, Y. S., and Morgan, D. (2022). "Dichotic spectral integration range for consonant recognition in listeners with normal hearing," Front. Psychol. 13, 1009463. 10.3389/fpsyg.2022.1009463
- 36. Zwicker, U. T. (1984). "Auditory recognition of diotic and dichotic vowel pairs," Speech Commun. 3, 265–277. 10.1016/0167-6393(84)90023-2