Abstract
Cochlear-implant (CI) listeners experience signal degradation, which leads to poorer speech perception than that of normal-hearing (NH) listeners. In the present study, difficulty with word segmentation, the process of perceptually parsing the speech stream into separate words, is considered as a possible contributor to this decrease in performance. CI listeners were compared to a group of NH listeners (presented with unprocessed speech and eight-channel noise-vocoded speech) in their ability to segment phrases with word segmentation ambiguities (e.g., “an iceman” vs “a nice man”). The results showed that CI listeners and NH listeners presented with vocoded speech were worse at segmenting words than NH listeners presented with unprocessed speech. When viewed at a broad level, all of the groups used cues to word segmentation in similar ways. Detailed analyses, however, indicated that the two processed speech groups weighted top-down knowledge cues to word boundaries more heavily, and acoustic cues to word boundaries less heavily, than NH listeners presented with unprocessed speech.
I. INTRODUCTION
Cochlear implants (CIs) are auditory prostheses that can partially restore hearing. They are used by individuals who have severe to profound hearing loss and minimal functional benefit from hearing aids. In quiet settings, CI users can show mostly accurate speech perception (Dorman et al., 2002; Rødvik et al., 2018; Tajudeen et al., 2010); however, CI listeners still demonstrate less successful speech perception than normal-hearing (NH) listeners. A great deal of research with CI listeners has looked at their ability to identify or recognize individual words. For example, when identifying words in quiet within Hearing in Noise Test (HINT) sentences, postlingually deafened adult CI listeners were accurate about 76% of the time compared to 96% of the time for the NH listeners (Gifford and Revit, 2010). In sentences from the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database, which are relatively difficult because of their complexity and lack of predictable semantic context, sentence recognition scores were approximately 60% for a group of CI listeners in quiet (Schvartz-Leyzac et al., 2017). Inside and outside of the laboratory, successful word identification also depends on the ability to separate those words from continuous speech (i.e., segmentation), a process that has received less attention in this population.
Our goal is to better understand how CI listeners successfully segment continuous speech into separate words. For example, to successfully interpret the sequence of sounds [ənajsmæn], listeners must parse it into its component words, either [ən#ajsmæn] (“an iceman”) or [ə#najs#mæn] (“a nice man”). Word segmentation is challenging for any listener. Spoken languages, unlike many written languages, do not include consistent cues to separate sentences into words. Speech sometimes includes pauses or other overt cues to word boundaries, but such instances are rare in conversational speech (Cutler and Butterfield, 1990). Instead, listeners aggregate evidence from a wide range of probabilistic cues to parse a sentence into words.
Word segmentation depends on many acoustic cues, a large number of which may be impacted by CI processing. Aspects of both the temporal envelope of speech (slower amplitude modulations across long timescales) and the temporal fine structure (faster, more abrupt changes at shorter timescales) can be used as cues to word boundaries (Rosen, 1992), as can existing linguistic knowledge. A list of cues considered in the present experiment, which should not be considered exhaustive, is presented in Table I. CIs provide listeners with a degraded signal relative to normal acoustic hearing, including poorer spectral resolution and impoverished temporal fine structure information (Berg et al., 2019; Croghan et al., 2017; D'Alessandro et al., 2018; Friesen et al., 2001; O'Neill et al., 2019a). It may therefore be harder for CI listeners to segment speech than for NH listeners, and CI listeners may use a different combination of cues to do so. If so, this may be one source of the performance differences between CI and NH listeners in speech perception.
TABLE I.
Cues to word segmentation, indicating the expected relationships between those cues and the perception of whether segments are word-final or word-initial.
| Cue | Cue type | Word-initial segments | Word-final segments | Sample citations |
|---|---|---|---|---|
| Duration, including voice onset time (VOT) | Envelope | Longer than word-medial | Longer than word-medial | Cutler and Butterfield, 1992; Cutler and Norris, 1988; Edwards and Beckman, 1988; Goldwater et al., 2009; Klatt, 1976; van Kuijk and Boves, 1999; Lieberman, 1960; Nakatani and Schaffer, 1978; Oller, 1973; Rietveld, 1980; Rietveld et al., 2004; Shatzman and McQueen, 2006; Turk and Shattuck-Hufnagel, 2014; Umeda, 1977 |
| Intensity, signaling lexical stress | Envelope | More intense than word-final | Less intense than word-initial | van Kuijk and Boves, 1999; Lieberman, 1960 |
| Δ Intensity | Envelope | Increase at start of segments | Decrease at end of segments | Heffner et al., 2013; Hillenbrand and Houde, 1996 |
| F0, signaling lexical stress | Primarily fine structure | Higher than word-medial | Lower than word-medial | Cutler and Butterfield, 1992; Cutler and Norris, 1988; van Kuijk and Boves, 1999; Lieberman, 1960; Nakatani and Schaffer, 1978; Pierrehumbert, 1979; Spinelli et al., 2010 |
| Δ F0 | Primarily fine structure | Increase at start of segments | Decrease at end of segments | Heffner et al., 2013; Hillenbrand and Houde, 1996; Spinelli et al., 2010 |
| Silent periods | Envelope | Preceded by pauses | Followed by pauses | Duez, 1982; Fisher and Tokura, 1996; Swerts, 1997 |
| Lexical frequency and context | Knowledge | Favor more-common parse | Favor more-common parse | Grosjean and Itzler, 1984; Marslen-Wilson, 1987; Shi and Lepage, 2008 |
If a degraded signal makes segmentation more difficult, such differences might also be seen when NH listeners hear speech degraded by vocoding, which is a type of signal processing that simulates aspects of CI processing (Loizou, 2006; Shannon et al., 1995). In vocoding, an acoustic signal is subdivided into different frequency bands, and the temporal envelope modulations from the band-limited signals are used to modulate either sine waves or narrowband noises. This preserves much of the timing and amplitude information in the original acoustic signal while degrading the spectral content in a way that is believed to mimic the degradation of the signal in a CI.
The use of these cues should depend on whether a listener is hearing processed speech. Some word segmentation cues should still be available to listeners hearing processed speech (whether through CI use or vocoding), including temporal envelope cues (such as intensity and duration). In other domains of speech perception, CI listeners compensate for challenges with other cues by using these envelope cues more than NH listeners hearing unprocessed speech (Donaldson et al., 2015; Kong et al., 2016; Marx et al., 2015; Moberly et al., 2014; Peng et al., 2012; Winn et al., 2012), although this is not uniformly the case (Winn et al., 2016). Just as temporal envelope cues have been suggested to be accessible to processed speech listeners, top-down knowledge of the language being spoken should not strongly differ between those groups so long as CI listeners experienced hearing loss postlingually (Davis et al., 2005; Hawthorne, 2018; Sheldon et al., 2008). Comparable levels of top-down knowledge might, in fact, be used more by CI listeners than NH listeners (Gianakas and Winn, 2019; O'Neill et al., 2019b). Other cues are likely to be less accessible, including temporal fine structure cues such as fundamental frequency (F0) contours that cue intonational patterns (Chatterjee and Peng, 2008; Everhardt et al., 2020). The nature of signal processing in CI technology eliminates many of the temporal fine structure cues that NH listeners use to perceive voicing (Heng et al., 2011; Moore, 2008), resulting in poorer perception of F0 contours for both CI listeners and NH listeners presented with vocoded speech (Kalathottukaren et al., 2017; Marx et al., 2015; Peng et al., 2012; Souza et al., 2011).
Given the confluence of cues required for effective word segmentation, previous studies have indicated that word segmentation may be difficult for CI listeners. Studies comparing accuracy at discriminating individual speech sound contrasts with accuracy at perceiving phrases differing minimally in word segmentation have found that word segmentation tasks are much more difficult for CI listeners than for NH listeners presented unprocessed speech. This was true both for Swedish CI listeners distinguishing compound words from two-word phrases, for example, tekniker, “technician,” vs teknik är, “technology is” (Morris et al., 2013), and for a case study of a French CI listener distinguishing different locations for word boundaries, for example, l'affiche, “the poster,” vs la fiche, “the sheet” (Basirat, 2017). CI listeners and NH listeners presented vocoded speech also struggle to exploit the statistical regularities in syllables that can be used to learn to segment words in the speech stream (Deocampo et al., 2018; Grieco-Calub et al., 2017). These studies provide evidence that word segmentation should challenge listeners presented with processed speech.
In the present experiment, we explored the word segmentation abilities of CI listeners and NH listeners presented vocoded and unprocessed speech as well as the cues that both listener groups used to segment words. Based on the extant literature, there were two possibilities for how the processed speech groups may differ from the NH listeners presented unprocessed speech. Plausibly, cues such as F0 contours, which are much less accessible via CI processing (Chatterjee and Peng, 2008; Everhardt et al., 2020), might be used less by CI listeners than by NH listeners presented unprocessed speech. CI listeners attend less to differences in pitch and more to differences in intensity and duration when distinguishing between questions and statements (Marx et al., 2015; Peng et al., 2012); the same may therefore be true for word segmentation. Counterintuitively, some studies have suggested that poor signal quality (e.g., from noise) can encourage listeners to focus more on the acoustic signal than on top-down knowledge (Mattys et al., 2009). That is, listeners shift attentional resources to glean as much information as possible from the cues that are challenging to perceive. By this logic, listeners presented a processed signal might allocate more attention to degraded cues, such as F0, thereby downweighting unaffected cues such as duration or top-down information. We explore these two possibilities by assessing the word segmentation abilities of CI listeners and NH listeners presented both unprocessed and vocoded speech.
II. METHODS
A. Listeners
There were two groups of listeners in this study: 16 adult CI listeners and 16 adult NH listeners who were age-matched at the group level (i.e., the two groups closely matched in mean age, standard deviation, and age range; see below). Four CI listeners who were prelingually deafened were tested but excluded from later analyses. This experiment was approved by the University of Maryland, College Park's Institutional Review Board (IRB).
On average, the CI listeners were 59.4 years old [median = 61 yr; standard deviation (SD) = 16.7 yr], with ages ranging from 24 to 80 years. The age at hearing loss onset ranged from 4 to 74 years with a mean of 29.8 years (SD = 24.8 yr). Their age at implantation (or first implantation for CI listeners with two CIs) ranged from 8 to 74 years (M = 49.3 yr, SD = 16.7 yr). The CI listeners were self-reported native English speakers. Some participants wore hearing aids between hearing loss onset and implantation; however, because this information was not complete for every participant, it is not included in our analyses. Detailed demographic information for the CI listeners is available in Table II.
TABLE II.
Demographic information for CI listeners. The bottom two rows give the mean and standard deviation (SD) for each numeric column.
| Listener | Gender | Implant brand | Age at testing (yr) | Age at hearing loss onset (yr) | Age at (first) implantation (yr) |
|---|---|---|---|---|---|
| S1 | M | Cochlear | 24 | 5 | 8 |
| S2 | M | Cochlear | 29 | 10 | 18 |
| S3 | F | Cochlear | 36 | 11 | 20 |
| S4 | F | Cochlear | 57 | 16 | 50 |
| S5 | F | Cochlear | 57 | 22 | 50 |
| S6 | M | Cochlear | 58 | 5 | 52 |
| S7 | F | Cochlear | 59 | 48 | 48 |
| S8 | F | Advanced Bionics | 60 | 50 | 55 |
| S9 | F | Cochlear | 62 | 4 | 57 |
| S10 | F | Cochlear | 65 | 38 | 51 |
| S11 | M | Advanced Bionics | 69 | 5 | 54 |
| S12 | M | Cochlear | 69 | 57 | 57 |
| S13 | M | Advanced Bionics | 72 | 49 | 53 |
| S14 | M | Cochlear | 76 | 13 | 63 |
| S15 | F | Cochlear | 78 | 69 | 69 |
| S16 | M | Cochlear | 80 | 74 | 74 |
| Mean | | | 59.4 | 29.8 | 49.3 |
| SD | | | 16.7 | 24.8 | 16.7 |
For the NH listeners, the average age was 54.3 years (median = 62 yr; SD = 17.7 yr), with ages ranging from 21 to 74 years. Precise one-to-one age matching to the CI listeners was not attempted; instead, the average, standard deviation, and range of ages in the NH listener group closely matched those of the CI listener group. The NH listeners were self-reported native English speakers. The NH listeners' hearing thresholds were tested at octave frequencies between 250 and 8000 Hz; to qualify for the study, hearing thresholds were required to be less than or equal to 25 dB hearing level (HL) at all octave frequencies between 250 and 4000 Hz. Furthermore, the difference in thresholds between the ears could not exceed 15 dB HL at any tested frequency. One NH listener had thresholds of 30 dB HL in the left ear at 500 and 4000 Hz. A summary of audiometric thresholds for the NH listeners, averaged across ears, is available in the supplementary material.1
B. Stimuli
To examine whether and how CI and NH listeners might differ in their word segmentation, we compiled a set of 185 potentially ambiguous stimulus clusters taken from prior literature on speech segmentation (Cole et al., 1980; Gow and Gordon, 1995; Heffner et al., 2017; Lehiste, 1960; Nakatani and Dukes, 1977; Repp et al., 1978; Turk and Shattuck-Hufnagel, 2000), consisting of two, three, or four stimuli differing minimally in the presence or location of a word boundary. These clusters were labeled pairs (153 pairs; e.g., “fair grounds” vs “fairgrounds”), triads (22 triads; e.g., “salmon tents” vs “Sam in tents” vs “Sam intense”), or tetrads (10 tetrads; e.g., “warfare” vs “war fair” vs “wharf air” vs “wharf fair”) based on whether there were two, three, or four different possible responses, respectively. All stimuli were recorded by author C.C.H., a 24-year-old male native speaker of American English. They were recorded in a sound-attenuated booth using a Shure SM51 microphone (Niles, IL) at a 44.1-kHz sampling rate.
Stimulus recordings generally matched the papers they were taken from in both the sentence context and potential parses of the clusters involved. For example, in Turk and Shattuck-Hufnagel (2000), the authors had speakers record three possible parses of each sequence without any sentence context, and we did likewise. The one exception was for stimuli taken from Heffner et al. (2017), which were sometimes recorded with more parses than were present in the original study because raw transcriptions of the sentences from naive listeners were available. When sequences were recorded in sentence contexts, the ambiguous words were excised from the sentence context for presentation. For a full list of stimuli, see the supplementary material.1 All of the sequences in the supplementary material were presented to listeners,1 but seven of these sequences were removed from analysis because they either (1) were not homophonous in the talker's dialect or (2) had contrasts that could not be coded consistently with other items.
Two types of stimuli were used in this experiment: unprocessed and vocoded stimuli. The unprocessed stimuli contained all spectral information and were normalized to the same average intensity. The vocoded stimuli were processed to simulate many aspects of CI processing (Shannon et al., 1995). In the current study, vocoded stimuli contained either four or eight channels of spectral information. CI users vary in the number of functional channels of spectral information they receive, depending on the specifics of the technology and electrode placement (Berg et al., 2019; Croghan et al., 2017; Friesen et al., 2001), but many listeners do not benefit from as many channels as their device putatively contains. We chose to use eight channels for a simulation of standard performance. A four-channel condition was included to allow for the simulation of a listener with poorer CI outcomes, although those results are not presented for the sake of simplicity in the data analysis and interpretation. Original stimuli were bandpass filtered into four or eight channels using third-order forward-backward Butterworth filters (slopes = −36 dB/octave). The channels were contiguous and logarithmically spaced with the lower frequency boundary of the lowest band set to 200 Hz and the upper frequency boundary of the highest band set to 8000 Hz. Envelopes were extracted from these bands, half-wave rectified, and second-order low-pass forward-backward filtered with a cutoff frequency of 400 Hz. Noise carriers with bandwidths that matched the analysis filter bandwidths were then modulated by the corresponding envelope. Sample spectrograms for unprocessed and eight-channel filtered stimuli are available in the supplementary material.1
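As a concrete illustration of this pipeline, the sketch below implements an n-channel noise vocoder in Python with numpy/scipy (not the software used to prepare the actual stimuli). The band edges, filter orders, and 400-Hz envelope cutoff follow the description above; details such as the random seed, clipping of the filtered envelope, output level normalization, and the input file name are illustrative assumptions.

```python
# Minimal noise-vocoder sketch following the processing described above.
# Assumes numpy/scipy; names and some details are illustrative, not the
# scripts actually used in the study.
import numpy as np
from scipy.signal import butter, sosfiltfilt
from scipy.io import wavfile


def noise_vocode(signal, fs, n_channels=8, f_lo=200.0, f_hi=8000.0,
                 env_cutoff=400.0):
    """Return an n-channel noise-vocoded version of `signal`."""
    # Contiguous, logarithmically spaced analysis bands from 200 to 8000 Hz.
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)
    rng = np.random.default_rng(0)
    out = np.zeros_like(signal, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Third-order Butterworth bandpass, applied forward-backward.
        sos_band = butter(3, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos_band, signal)
        # Envelope: half-wave rectification, then 400-Hz second-order
        # forward-backward low-pass filtering.
        sos_env = butter(2, env_cutoff, btype="lowpass", fs=fs, output="sos")
        env = np.maximum(sosfiltfilt(sos_env, np.maximum(band, 0.0)), 0.0)
        # Noise carrier band-limited to match the analysis filter bandwidth.
        carrier = sosfiltfilt(sos_band, rng.standard_normal(len(signal)))
        out += env * carrier
    # Match the RMS level of the input (the study normalized average intensity).
    out *= np.sqrt(np.mean(signal ** 2) / np.mean(out ** 2))
    return out


fs, x = wavfile.read("a_nice_man.wav")  # hypothetical stimulus file
y = noise_vocode(x.astype(float), fs, n_channels=8)
```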
C. Testing environment
Hearing screenings for NH listeners were conducted with an audiometer (Kimmetrics, Inc., MA-41) in a double-walled sound-attenuating booth (Industrial Acoustics, Inc., Naperville, IL). Experimental testing for all of the listeners was conducted in the same booth. The stimuli were presented from a pair of loudspeakers at 65 dBA. The loudspeakers were located at ±45° approximately 1 m from the seated listener. The sound level was calibrated prior to testing. The experiment was run through matlab on a desktop computer, and listeners made responses by selecting their answers on a computer monitor using a mouse. Paper surveys about demographics and language experience were completed in a quiet area.
D. Procedure
For each trial, the listeners pressed a button to play a sequence. This sequence was accompanied by four boxes appearing on the screen in a two-by-two grid, providing up to four possible parses of the critical fragment. In the pair and triad stimuli, the unused boxes were disabled. The possible responses were assigned randomly to the four boxes on the screen. The experiment was self-paced; after listeners indicated what they heard within the trial, they clicked a button to initiate the next trial. The NH listeners completed a hearing screening prior to the experimental testing.
For the NH listeners, the trials were distributed across three blocks. The order of trials within each block was determined randomly for each block independently of the others. In the first block, listeners were given ten practice trials that could vary in word segmentation or in other aspects of their acoustic signal (e.g., the identity of an ambiguous sound) to learn the task. These practice trials all involved four- and eight-channel vocoded stimuli. Although the first block presented to the CI listeners included vocoded trials, this block was primarily intended for building familiarity with the task. In the second block, the NH listeners were given four- and eight-channel vocoded versions of each sequence in each cluster. For example, for a triad such as “salmon tents”/“Sam in tents”/“Sam intense,” the listeners heard both four- and eight-channel versions of each recording for a total of six trials, randomly distributed throughout the second block. With a total of 412 sequences (306 from the 153 pairs, 66 from the 22 triads, and 40 from the 10 tetrads), each presented with both four- and eight-channel vocoding, there was a total of 824 trials in the second block. In the third and final block, the listeners were presented unprocessed versions of each sequence without any vocoding applied for a total of 412 trials to measure typical performance. The CI listeners were only presented the first and third blocks as the second block (with vocoded sentences) was not relevant. Finally, all of the listeners in the study were given a form to allow them to indicate their relevant language background and demographic information. Most of the NH listeners took approximately 1 h and 30 min to complete the experiment, whereas the CI listeners took about 40 min to complete the experiment.
E. Coding
The listener responses were analyzed to measure the accuracy with which each listener perceived the word boundaries and the extent to which the listeners were using each cue enumerated above to segment words. For accuracy, the responses were judged as accurate if they were perceived in line with what the speaker intended. Thus, if the speaker intended to say “president's peach,” then the perception of “president's peach” was correct and the perception of “president's speech” was not and vice versa. For cue use, because the set of responses was limited to between two and four options, the mapping of cues to perception had to also reflect the responses that were available to the listener as well as the responses given. For example, in the triad “pop a pose”/“poppa pose”/“pop oppose”, the [ə] in the middle of the triad could be perceived as solely word-final (in “poppa”), solely word-initial (in “oppose”), or both (as “a”). As the NH listeners participated in two different conditions, these will be referred to as NH-8 (eight-channel vocoding) and NH-Unprocessed.
We developed a coding scheme that allowed for multiple responses to count toward perceiving an individual segment as word-initial or word-final, and this is outlined in the supplementary material.1 Briefly, we attributed segmentation cues according to whether it was possible for a segment adjacent to an ambiguous word boundary to be grammatically perceived as word-initial, word-final, or both. For example, the second [p] in the triad above could be perceived as word-final (in “pop”) but not as word-initial. The following [ə] could be perceived as either word-final (in “a” and “poppa”) or as word-initial (in “a” and “oppose”). For cases in which a segment could be perceived as both (such as the [ə] in this triad), the segment was treated as having two halves, one of which was word-initial and the other of which was word-final. This coding scheme allowed for a fair comparison of the cues to word-initial and word-final segments across all of the responses that were available. At the same time, the coding scheme did have the drawback of erroneously localizing some cues, such as duration, to one part of the vowel (unlike, for example, changes in intensity, which are generally found on the edge of segments). These compromises were necessary to obtain an internally coherent labeling scheme. For example, the [ə] sound in the triad above was split into two parts. Each part was assigned half of the duration of the vowel, and aspects such as the change in intensity were computed by comparing the intensity at the edge of each half of the vowel to the intensity at the midpoint of the vowel.
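To make this split-half convention concrete, the hypothetical sketch below codes a segment that can be parsed as either word-final or word-initial (such as the [ə] above) as two half-duration observations, with the change in intensity for each half computed between the segment midpoint and one edge. The function, field names, example values, and the pairing of each half with a particular edge are illustrative assumptions; the actual coding scheme (see the supplementary material) tracked many more cues and segment types.

```python
# Illustrative coding of one doubly parseable segment under the split-half
# convention described above. Field names and values are hypothetical.
def code_ambiguous_vowel(duration_ms, intensity_onset_db, intensity_mid_db,
                         intensity_offset_db):
    """Split a segment that can be word-final or word-initial into two halves."""
    return [
        {   # Word-final half (e.g., [@] parsed as the end of "poppa" or "a").
            "position": "word-final",
            "duration_ms": duration_ms / 2.0,
            # Change in intensity from the segment midpoint to its offset edge.
            "delta_intensity_db": intensity_offset_db - intensity_mid_db,
        },
        {   # Word-initial half (e.g., [@] parsed as the start of "oppose" or "a").
            "position": "word-initial",
            "duration_ms": duration_ms / 2.0,
            # Change in intensity from the segment onset edge to its midpoint.
            "delta_intensity_db": intensity_mid_db - intensity_onset_db,
        },
    ]


rows = code_ambiguous_vowel(duration_ms=110.0, intensity_onset_db=68.0,
                            intensity_mid_db=71.0, intensity_offset_db=64.0)
```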
III. ANALYSIS AND RESULTS: ACCURACY
A. Analysis
The listeners heard phrases that differed in the speaker's intended word boundary placement within each phrase. The average accuracy across trials was compared between listener groups and conditions to assess variation in raw word segmentation abilities through a combination of t-test comparisons to chance and model comparison. Broadly speaking, model comparison involves comparing more complex models with more factors against simpler models with fewer factors. Comparing models with different combinations of fixed factors allowed for an assessment of the importance of each factor. When the comparison is nonsignificant, the simpler model is preferred; when the comparison is significant, this indicates that a factor or set of factors explains significant variance in the outcome measure. We report Bayes factors (BFs) for the null effects within the model comparison in the accuracy data, computed using the bayesfactor_models function within the bayestestR package (Makowski et al., 2019). See the supplementary material for descriptions of the best-fitting models.1
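For frequentist (non-Bayesian) models such as these, bayesfactor_models relies on a BIC approximation to the BF, essentially BF ≈ exp(ΔBIC/2). A minimal Python sketch of that computation is below; the R package handles this internally, and plugging in the rounded BIC values reported in Sec. III B reproduces the order of magnitude of the BF of 3.69 × 10⁷ reported there (the small discrepancy reflects rounding of the BICs).

```python
import math


def bic_bayes_factor(bic_model_1, bic_model_2):
    """Approximate BF in favor of model 2 over model 1 from their BICs."""
    return math.exp((bic_model_1 - bic_model_2) / 2.0)


# Rounded BICs reported in Sec. III B for the NH accuracy models:
# full model (condition + age + HFPTA + interactions) vs condition-only model.
bf = bic_bayes_factor(11158, 11123)  # ~4e7: strong support for the simpler model
```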
B. Results
Although chance performance was different for each type of cluster, listener performance did not necessarily map to the number of responses available (mean performance across group conditions for pairs = 71.5%; for triads = 60.1%; for tetrads = 83.7%). Performance appeared to be best in the tetrads, for example, which may relate to the specific characteristics of the handful of items in the set. All group-level accuracies in Fig. 1 are significantly greater than chance. To compare accuracies to chance across each type of cluster, we transformed the data by adjusting for the different chance levels for each cluster type, subtracting the chance level from the observed level and dividing that by the chance probability of an incorrect answer. For example, an accuracy of 60% for an item from a pair would have an adjusted accuracy of 20% (60% – 50% = 10%, and 10% divided by 50% is 20%), whereas an accuracy of 60% for an item from a tetrad would have an adjusted accuracy of 47% (60% – 25% = 35%, and 35% divided by 75% is 47%). In each case, the accuracy is transformed to be relative to 0% (representing chance) with positive numbers indicating that performance is above chance and negative values indicating that performance is below chance. Using transformed accuracies, weighted by the number of items in each cluster type, all three listener groups had accuracy above chance. The NH-Unprocessed listeners at 57.5% [t(15) = 21.3, p < 0.001], NH-8 listeners at 41.9% [t(15) = 17.4, p < 0.001] and CI listeners at 39.0% [t(15) = 16.4, p < 0.001] all had transformed accuracies greater than chance (0% transformed) at the group level.
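Equivalently, the transformation can be written as (observed − chance)/(1 − chance). A minimal sketch reproducing the two worked examples above (function and variable names are ours):

```python
def transformed_accuracy(observed, n_alternatives):
    """Chance-adjusted accuracy: 0 = chance, 1 = perfect, negative = below chance."""
    chance = 1.0 / n_alternatives
    return (observed - chance) / (1.0 - chance)


transformed_accuracy(0.60, 2)  # pair:   (0.60 - 0.50) / 0.50 = 0.20
transformed_accuracy(0.60, 4)  # tetrad: (0.60 - 0.25) / 0.75 ~ 0.47
```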
FIG. 1.
The mean percent correct with error bars indicating ±1 standard error by listener for each combination of listener group, processing condition, and cluster type. The dashed lines indicate chance performance for each type of stimulus.
Despite the differences in the group averages, there were strong correlations between groups in the items that were challenging. The inter-group similarity was assessed by comparing the transformed accuracy across the participant groups. All three groups were compared in a pairwise fashion. The results are presented in Fig. 2. In general, the performance in one listener condition predicted the performance in the others. This was true when regressing accuracy by sequence for the CI listeners on the NH-Unprocessed condition, r = 0.84; for the NH-8 condition on the NH-Unprocessed condition, r = 0.87; and for the CI listeners on the NH-8 condition, r = 0.87. There was a strong relationship between the performance on individual items by different listener groups and conditions.
FIG. 2.
The pairwise correlations of average transformed sequence accuracy between different conditions. The (A) NH-Unprocessed condition and CI listeners, (B) NH-Unprocessed and NH-8 conditions, and (C) NH-8 condition and CI listeners are shown. Each point is the average accuracy for that sequence in each condition while the regression line is the best-fit line for a linear approximation of the data. As values are transformed, the negative values reflect levels lower than chance and positive values reflect levels higher than chance.
For the NH listeners, neither of the measured demographic characteristics predicted accuracy across items. To test for effects of demographic characteristics on accuracy, mixed-effects models, including random intercepts for listener and item, were constructed. The random intercepts for item helped to adjust for the differences between pairs, triads, and tetrads in their accuracy, as random intercepts allow items to vary in their baseline accuracy. We compared a model with fixed effects of condition (NH-Unprocessed vs NH-8), age (continuous, rescaled to center on zero), and high-frequency pure tone averages (HFPTAs), as well as the interactions of condition with age and of condition with HFPTA [Bayesian information criterion (BIC)=11 158], to a model with just a fixed effect of condition (BIC=11 123) and found no significant difference between the two, χ2(4) = 3.10, p = 0.54. In other words, adding age and HFPTAs to the model did not improve the model fit. Comparison of the BFs, computed using the BIC, indicates that there is substantially stronger evidence for the model without the effects of age and HFPTAs, BF = 3.69 × 10⁷. The remaining model, using the NH-Unprocessed condition as the baseline, included just an intercept (b = 2.02, z = 13.3, p < 0.001) and a coefficient for the NH-8 condition (b = −0.709, z = −14.2, p < 0.001), indicating that the accuracy of the NH participants was significantly lower in the NH-8 condition than in the NH-Unprocessed condition.
We also tested for significant predictors of success for the CI listeners. We were interested in four predictors of accuracy: age, age at onset of hearing loss, years since first implantation (i.e., duration of experience with CI listening), and duration of hearing loss. However, for the CI listeners in the current experiment, there were strong correlations between age, years since first implantation, and duration of hearing loss, preventing model convergence. As such, just two fixed factors were added into a mixed-effects model explaining the variance in accuracy with random intercepts for the listeners and item: age and years since first implantation (BIC = 8587). Removing the fixed factor of age from this model (BIC = 8562) led to a significant decrease in the model fit, χ2(1) = 11.9, p < 0.001, whereas removing the fixed factor of years since first implantation (BIC = 8582) did not lead to a significant decrease in the model fit, χ2(1) = 3.52, p = 0.06 (BF = 15.7). This suggests that age explains significant variance in accuracy for the CI users. Older CI users tended to do worse than younger ones at the task. Given the relatively small BF for the years since first implantation, strong conclusions about the effects of this factor should be approached with caution.
As can be seen in Fig. 1, the mean accuracy differed among listener groups and conditions. The performance in the NH-Unprocessed condition was consistently higher than the performance in the NH-8 condition and for the CI listeners. Figure 2 shows pairwise correlations in accuracy between the different listener groups. A mixed-effects model with a single fixed factor of group (CI, NH-Unprocessed, and NH-8), random intercepts for listener and item, and a dependent variable of accuracy was used as a basis of comparison (BIC=33 196). This model fit better than the model with just a fixed intercept, χ2(2) = 544, p < 0.001 (BIC = 33 153). Post hoc comparisons of the estimated means were used to examine comparisons between the CI listeners and NH-Unprocessed and NH-8 listeners, using the Tukey method to account for multiple comparisons. The CI listeners and NH-8 listeners did not have significantly different performances from one another, z = −0.972, p = 0.59. However, both the CI listeners, z = –8.01, p < 0.001, and NH-8 listeners, z = –22.8, p < 0.001, performed significantly worse than the NH-Unprocessed listeners.
IV. ANALYSIS AND RESULTS: CUE USE
A. Analysis
The phrases used in the study were evaluated according to a variety of acoustic and lexical properties, including duration, F0, intensity, presence or absence of silence, and lexical frequency. This allowed us to determine the acoustic and lexical cues that were most informative for word segmentation, as well as how those cues interacted with the listener group and condition. The supplementary material includes information about how the potential temporal envelope, temporal fine structure, and knowledge-based cues to word segmentation were coded.1 The dataset was split into ten parts, with the word-final and word-initial cues for each segment class (vowel, approximant, nasal, fricative, and stop) treated separately.
Up to nine cues were used for each combination of class and position. These included the cues discussed directly in Table I as well as attempts to systematize the cues depicted there: temporal envelope cues (Duration; Intensity; change in intensity, which we labeled Δ Intensity; Silence; and Voicing), temporal fine structure cues (F0 and change in F0, which we labeled Δ F0), and knowledge-based cues (Lexical Frequency and Word Co-occurrence). Given that the primary acoustic correlates of lexical stress include higher intensity, F0, and duration (van Kuijk and Boves, 1999; Lieberman, 1960), we expected the influence of lexical stress on word segmentation to be captured by some combination of these three cues. Similarly, the row labeled “Silent periods” in Table I was split into two separate cues of “Silence” and “Voicing,” reflecting two types of low-intensity information present in our dataset. The two knowledge-based cues, meanwhile, were used as different ways to instantiate the cue of lexical frequency and context in the model. We used the Corpus of Contemporary American English (Davies, 2008) to estimate Word Co-occurrence and Lexical Frequency. Some cues were not considered for certain segment classes because they were not relevant to those classes. Some Silence and Voicing cue contrasts were also left out when including these cues prevented model convergence; generally, this occurred in cases in which the number of sequences with an explicit period of silence or voicing was low. A series of iterated generalized linear mixed-effects models was constructed for each segment class, which was then used to assess the significance of each factor on the likelihood of perceiving the word boundaries.
Once the dataset was coded, it was split into ten parts, corresponding to the different segment classes and positions. The word-initial and word-final contrasts were coded separately from each other and crossed with segment class (vowels, approximants—including both liquids and glides—nasals, fricatives, and stops). Each combination of major class and position was modeled separately because the cues that signal word boundaries are not identical across these groupings. A full list of which cues were used for which segment class is given in Table III.
TABLE III.
Modeled cue use by segment class. Each row indicates a potential cue while each column shows the segment class. The main column headings refer to the major classes (vowel, approximant, nasal, fricative, and stop) and whether a segment was word-initial (I) or word-final (F). The cells are shaded according to whether the cue was considered for each segment class; white cells were put into the initial models, whereas gray cells were left out. For example, the initial models for word-final approximants, word-initial stops, and both vowel positions included Voicing as a potential cue.
Because of model convergence concerns, a simple random effect structure was used, with only random intercepts by listener and sequence (i.e., for each individual production of a segment). Although simple, this random effect structure still allowed for random variation among listeners in their likelihood to perceive a word boundary adjacent to each segment, as well as among sequences in their likelihood to be perceived as adjacent to a boundary. Many model comparisons were used to develop the final model for each segment class; as such, individual comparisons are not described. Instead, when discussing the results, model coefficients for the NH-Unprocessed condition, as well as the post hoc comparisons between contrasts of interest in the strength of each cue, are given in the text. The post hoc comparisons were performed using the emmeans package in R (Lenth, 2016) and are computed in terms of differences in estimated marginal means (EMMs). Each one is Tukey-adjusted for multiple comparisons. The full model specifications are available in the supplementary material.1 Detailed information about the model comparison is available from the authors.
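Schematically, each of these analyses can be summarized as a logistic mixed-effects model of the probability that listener i perceives a boundary at the relevant edge of the segment in sequence j. This is a notational paraphrase of the description above, not a formula taken from the analysis scripts; the exact cue set, and which cue-by-group interactions were retained, varied by segment class and position:

$$
\operatorname{logit}\bigl[P(\text{boundary}_{ij})\bigr] \;=\; \beta_0 \;+\; \alpha_{g(i)} \;+\; \sum_{k}\bigl(\beta_k + \gamma_{k,\,g(i)}\bigr)\,\text{cue}_{kj} \;+\; u_i \;+\; v_j,
$$

where g(i) indexes the listener group/condition (with NH-Unprocessed as the reference level, so α and γ are zero for that group), β_k is the cue weight reported in the text for NH-Unprocessed listeners, γ_{k,g} is the group-specific adjustment probed in the post hoc EMM comparisons, and u_i and v_j are normally distributed random intercepts for listener and sequence.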
B. Results
Table IV shows the use of cues to word segmentation in the present dataset with the white cells showing positive relationships between the levels of that cue and word boundaries and light gray cells showing the negative relationships. Tables V–VII all show post hoc comparisons between those cues and listener group/condition with the numeric values in each table indicating significant differences in the strength of each cue between each pair of conditions. To give an example of interpretation, consider the case of the word-final vowel duration. The cell in Table IV for word-final vowel duration is shaded white with a z score of 5.22, indicating a positive relationship in the word-final position between duration and perception of a word boundary (i.e., vowels were more likely to be perceived as word-final when they were longer) for the NH-Unprocessed listeners. In Table V, which compares the NH-Unprocessed listeners to CI listeners, the cell is shaded white, showing that this relationship was still present in the CI listeners. However, there is a negative z score in that cell, indicating that the relationship between the duration and word boundary placement was less positive for the CI listeners than it was for the NH-Unprocessed listeners. This compares to Table VI (showing the NH-Unprocessed vs NH-8 comparison), in which the cell is shaded white but there is no number in the cell. This indicates that the relationship between duration and word-final boundary perception is also positive for the NH-8 listeners. Moreover, because there is no z score in the cell, there is no significant difference between these two listener conditions. Thus, although the NH listeners continued to use duration in this context even for a degraded (NH-8) signal, the CI listeners used it to a lesser extent than the NH-Unprocessed listeners.
TABLE IV.
The NH-Unprocessed listeners' cue use (rows) by segment class (columns). The main column headings refer to the major classes (vowel, approximant, nasal, fricative, and stop) and whether a segment was word-initial (I) or word-final (F). The dark gray cells indicate that those cues were not considered for that particular segment class. The cells shaded black indicate cues that did not explain significant variation for segmentation. The white cells indicate that higher levels of that cue were associated with a greater probability of perceiving word boundaries, whereas the light gray cells indicate that higher cue levels were associated with a lower probability of perceiving word boundaries. Each cell contains the z score for that coefficient for NH-Unprocessed listeners.
TABLE V.
The CI listeners' cue use (rows) by segment class (columns) relative to the NH-Unprocessed listeners, shaded similarly to those in Table IV. Each cell contains the z score for that coefficient for CI listeners relative to the NH-Unprocessed listeners if that difference was significant; the blank cells showed no significant difference in the pairwise comparison. Positive z values indicate that the relationship between that cue and word segmentation was more positive (for white cells) or less negative (for light gray cells). Negative z values indicate that the relationship between that cue and word segmentation was more negative (for light gray cells) or less positive (for white cells). The cells with an exclamation point indicate a cell where the sign for that coefficient flipped when comparing the CI listeners to NH-Unprocessed listeners.
TABLE VI.
The NH-8 listeners' cue use (rows) by segment class (columns) relative to the NH-Unprocessed listeners, shaded similarly to those in Table IV. Each cell contains the z score for that coefficient for the NH-8 listeners relative to the NH-Unprocessed listeners if that difference was significant; the blank cells showed no significant difference in the pairwise comparison. Positive z values indicate that the relationship between that cue and word segmentation was more positive (for white cells) or less negative (for light gray cells). Negative z values indicate that the relationship between that cue and word segmentation was more negative (for light gray cells) or less positive (for white cells). The cell with an exclamation point indicates a cell where the sign for that coefficient flipped when comparing the NH-8 listeners to NH-Unprocessed listeners.
TABLE VII.
The CI listeners' cue use (rows) by segment class (columns) relative to the NH-8 listeners, arranged similarly to those in Table IV. Each cell contains the z score for that coefficient for the CI listeners relative to NH-8 listeners if that difference was significant; the blank cells showed no significant difference in the pairwise comparison. Positive z values indicate that the relationship between that cue and word segmentation was more positive (for white cells) or less negative (for light gray cells). Negative z values indicate that the relationship between that cue and word segmentation was more negative (for light gray cells) or less positive (for white cells).
1. Durational cues
The duration of segments was consistently used as a cue to the presence of word boundaries for word-final segments but was less consistently used for word-initial segments, where it was significant only for fricatives and stops. Longer segments were more likely to be heard as adjacent to a boundary. Across the segment classes, the use of Duration was often weaker in CI listeners than in listeners in the NH-Unprocessed condition (see Table V), although there were generally no differences between the NH-Unprocessed and NH-8 conditions (see Table VI). For instance, the best-fitting model for the word-final vowels included a significant coefficient for Duration for listeners in the NH-Unprocessed condition (b = 0.229, z = 5.22, p < 0.001), which was used as the reference level. For a listener in the NH-Unprocessed condition, a potentially word-final vowel with a length of 75 ms, for example, was heard as word-final in a model-estimated 19% of cases, whereas a similar vowel with a length of 175 ms was heard as word-final 44% of the time. In word-final vowels, Duration was weighted significantly less positively for CI listeners than for listeners in the NH-Unprocessed condition (b = −0.0527, z = −3.18, p = 0.004) and for listeners in the NH-8 condition (b = −0.0423, z = −2.50, p = 0.03), but there was no significant difference between the listeners in the NH-Unprocessed condition and those in the NH-8 condition (b = 0.0104, z = 0.605, p = 0.82).
2. Intensity
Intensity was sometimes a significant predictor of word boundary perception, and when it was, the relationship for NH-Unprocessed listeners was negative. As intensity increased, NH-Unprocessed listeners were less likely to posit a word boundary before word-initial vowels, approximants, and stops. But the directionality depended on the listener group and condition: CI listeners showed a positive relationship between intensity and word segmentation before vowels and stops (see Table V). For example, for a participant in the NH-Unprocessed condition, the coefficient for Intensity was negative in the best-fitting model for the word-initial approximants (b = −0.248, z = −4.94, p < 0.001). To the NH-Unprocessed participants, an approximant with an average intensity of 62 dB sound pressure level (SPL) was heard as word-initial in about 54% of cases, whereas an approximant with an average intensity of 72 dB SPL was heard as word-initial in about 11% of cases, according to the best-fitting model. This negative relationship between the perception of word boundaries and Intensity was attenuated or sometimes even reversed in the processed speech groups. For the word-initial approximants, the CI listeners and NH-Unprocessed listeners did not differ significantly in their propensity to use Intensity as a cue, b = 0.0742, z = 1.98, p = 0.12, but NH-8 listeners had a significantly less negative coefficient for Intensity than both the CI listeners, b = 0.100, z = 2.88, p = 0.01, and the NH-Unprocessed listeners, b = 0.175, z = 4.79, p < 0.001.
Similarly, Δ Intensity (a change in intensity within a segment) often predicted significant variance in the perceptions of word boundaries but, again, this depended on the listener group. High levels of the cue were negatively associated with segmentation in the NH-Unprocessed condition for word-final vowels, approximants, and nasals, and word-initial nasals. Once more, however, the strength of these correlations was attenuated, or even the sign flipped, in the processed speech groups. For instance, the coefficient for Δ Intensity was negative for the word-final nasals for participants in the NH-Unprocessed condition (b = −0.228, z = −2.84, p = 0.004). This estimated coefficient was not significantly different between the NH-8 condition and CI listeners (b = −0.0114, z = −0.162, p = 0.99). It was, however, significantly different—in fact, different enough to be flipped in sign—between the NH-Unprocessed condition and NH-8 condition (b = 0.237, z = 3.23, p = 0.004; note that this suggests an estimated NH-8 coefficient for the Δ intensity of 0.009) and CI listeners (b = 0.248, z = 3.32, p = 0.003; an estimated CI coefficient for Δ Intensity of 0.0204).
3. Frequency-based cues
F0 also occasionally related to word boundary perception. For the word-initial vowels, higher F0 values were positively correlated with the likelihood of perceiving the vowel as adjacent to a word boundary. For example, for an NH-Unprocessed listener, an average F0 of 90 Hz was associated with a likelihood of perceiving a word-initial boundary of 14%, whereas an average F0 of 130 Hz led to a likelihood of about 42%. This corresponds to a significant and positive coefficient for F0 in the NH-Unprocessed condition in the best-fitting model (b = 0.169, z = 2.16, p = 0.03). However, the opposite was true for the word-initial approximants, where there was a negative correlation between F0 and the likelihood of perceiving the approximant as word-initial, corresponding to a significant and negative coefficient in the best-fitting model (b = −0.253, z = −3.87, p < 0.001). F0 only rarely interacted with the participant group and condition; this cue was, for the most part, used no less by the processed speech groups than by listeners in the NH-Unprocessed condition (see Tables V and VI). For instance, the interaction between the listener group and F0 for word-initial approximants did not significantly improve the model fit over a model without this interaction, χ2(2) = 5.95, p = 0.05, suggesting that there were no significant differences between the listener groups in their use of this cue for these segments.
The cue of Δ F0 (a change in F0 within a segment) explained significant variation in three of the six segment types tested: Δ F0 was positively correlated with the likelihood of perceiving vowels as word-initial and of perceiving approximants and nasals as word-final. Again, the use of this cue did not differ greatly between the processed speech groups and the NH-Unprocessed group. An example of these trends is seen in the coefficients for Δ F0 in the word-final approximants. There was a significant positive coefficient for Δ F0 in the final model (b = 0.122, z = 2.69, p = 0.007), but comparing models with and without an interaction between Δ F0 and the listener group suggested that the interaction explained no significant variance in word boundary perception [χ2(2) = 1.54, p = 0.46].
4. Silence and voicing
Silence was often a good predictor of the presence of boundaries, particularly for the word-final segments. When enough sequences with silence were available to make an analysis meaningful, Silence was a significant predictor of the tendency to hear a segment as boundary adjacent in about half of the cases (all vowels, word-final fricatives, and word-final stops). Silence did not add significant explanatory power to the models for the word-final nasals and approximants or the word-initial fricatives and stops. When significant, the coefficients were large and positive: participants were much more likely to hear a word boundary adjacent to silent intervals. For example, for NH-Unprocessed listeners, the coefficient for the word-initial vowels was significant and positive (b = 5.51, z = 9.19, p < 0.001). Vowels preceded by silence were heard as word-initial approximately 99% of the time, whereas those not preceded by silence were heard as word-initial only 25% of the time. For the most part, the use of Silence did not differ between the listener groups, as can be seen from the many blank (nonsignificant) cells in Tables V–VII. In cases in which there was a significant interaction, the pairwise differences were marginal: for word-initial vowels, although the interaction between the listener group and Silence was significant, none of the comparisons between the participant groups were significant in pairwise, post hoc testing (all p > 0.17).
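As a rough arithmetic check on these model-estimated probabilities (treating the 25% no-silence case as the baseline and holding the other predictors fixed, which is an approximation on our part), adding the Silence coefficient on the logit scale recovers the reported value:

$$
\operatorname{logit}^{-1}\bigl[\operatorname{logit}(0.25) + 5.51\bigr] \;=\; \operatorname{logit}^{-1}(-1.10 + 5.51) \;=\; \operatorname{logit}^{-1}(4.41) \;\approx\; 0.99.
$$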
Voicing, on the other hand, was not a frequent cue to word boundaries. Voicing explained a significant amount of variation for the word-initial vowels but not for the word-final approximants, word-initial stops, or word-final vowels. For example, for the word-initial stops, the difference in model fit between a model with Voicing and one without was not significant [χ2(3) = 3.81, p = 0.28], allowing us to forgo its use in the final model. Although adding the interaction between Voicing and the listener group and condition explained significant variance in the model fit, no significant differences were observed in the post hoc pairwise comparisons (all p > 0.09).
5. Top-down cues
Lexical Frequency and Word Co-occurrence were meant to model similar, if not entirely overlapping, lexical cues. Generally, one cue or the other explained significant variation in listener responses. Lexical Frequency almost never significantly improved the model fit. When Word Co-occurrence did explain significant variance (for word-final approximants, word-final nasals, all fricatives, and word-initial stops), there was a significant, positive relationship between that cue and the perceived presence of a word boundary. In other words, listeners tended to parse sequences in a way that led to the perception of frequent phrases but did not necessarily parse sequences in a way that led to the perception of more frequent words. For instance, the best-fitting model estimated that word-final fricatives with a negative word co-occurrence ratio were heard as word-final in only about 20% of cases, whereas those with a positive word co-occurrence ratio were heard as word-final in about 39% of cases. This corresponds to a positive, though individually nonsignificant, model coefficient for the NH-Unprocessed reference level in the final model (b = 0.454, z = 1.63, p = 0.10). For both top-down cues, their use (whether positive or negative in direction) was enhanced in the processed speech groups relative to the NH-Unprocessed listeners, particularly for Word Co-occurrence (see Tables V and VI). For the word-final fricatives, for instance, both listeners in the NH-8 condition (b = 0.378, z = 3.39, p = 0.002) and CI listeners (b = 0.552, z = 4.89, p < 0.001) had significantly larger model coefficients for Word Co-occurrence than their NH-Unprocessed peers. The comparison between the NH-8 condition and CI listeners (b = −0.174, z = −1.56, p = 0.26) was nonsignificant.
V. GENERAL DISCUSSION
We examined word segmentation abilities of the CI listeners and NH listeners presented unprocessed or eight-channel vocoded speech. The goal was to assess both the accuracy of the perception of phrases with ambiguous segmentation as well as the cues used when segmenting words for those listener groups. There were two main findings. First, both CI listeners and NH listeners presented vocoded speech could segment words at above chance performance, although both groups presented processed speech performed significantly worse than the same NH listeners presented unprocessed speech. This can be seen particularly in Fig. 1, where all three bars indicating listener groups and conditions are above the dashed lines representing chance.
Second, both CI listeners and NH listeners presented vocoded speech used similar sets of cues to segment the signal into words. This can be seen particularly in Tables V and VI. There, the rows corresponding to the top-down cues (Word Co-occurrence and Lexical Frequency) are often shaded white and contain positive z scores: these cues were positively associated with word boundary presence across groups, and the pairwise comparisons show that the processed speech groups used these cues more than the NH listeners hearing unprocessed speech. Meanwhile, the pairwise comparisons for the signal-based cues, when significant, showed attenuation of those cues even when they should still be present in a degraded signal (e.g., duration). Positive z scores in the light gray cells indicate cues that were less negatively associated with the perception of word boundaries for listeners presented processed speech than for NH listeners presented unprocessed speech, whereas negative z scores in the white cells indicate cues that were less positively associated with word boundary perception for the listeners presented processed speech.
A. Word segmentation performance
First, examining the overall accuracy (Fig. 1), deficits in the accurate perception of word boundaries existed among CI listeners and NH listeners presented vocoded speech. The task was more difficult for the listeners presented processed speech than for the NH listeners presented unprocessed speech. Although the CI listeners and NH-8 listeners appeared to have more difficulty segmenting fluent speech in general, they struggled with a similar set of items as the NH-Unprocessed listeners. This is in line with the idea that the processed speech groups had a small and uniform difficulty with word segmentation rather than one limited to only a subset of the sequences. Additionally, performance from the two processed speech groups was strongly correlated on an item-by-item basis. This close match supports the idea that vocoded speech is an acceptable simulation of the challenges faced by CI listeners in the context of word segmentation.
In finding that CI listeners and NH listeners presented vocoded speech were challenged in their segmentation of words, our study is in line with other studies of word segmentation (and the use of prosody) in this population. For example, Morris et al. (2013) presented CI and NH listeners with stimuli that were ambiguous between a single word and a compound word. Although the task was easy with average accuracy among NH listeners at approximately 95%, the CI listeners showed an average accuracy of about 82%. The 13-percentage-point decrease is comparable to that shown in our study (10-percentage-point decrease). The case study of a CI listener reported by Basirat (2017) with a 19-percentage-point decrease in accuracy is in the range of the individual variation found in our study.
B. Cue use for word segmentation
Above and beyond its importance for clinical populations, the present study contributes important insights into the cues used by listeners for word segmentation in speech perception. Most previous studies have examined speech production or the perception of a limited number of cues. The present study, in contrast, examined a large set of phrases to determine the extent to which natural variation in those cues drives word segmentation decisions in perception.
Temporal envelope cues often affected word segmentation in the present study. Duration was perhaps one of the most consistent cues to word boundaries in the present dataset; longer segments were more likely to be perceived as adjacent to boundaries. This was more consistently true for word-final segments than for word-initial segments. This is contrary to some previous studies of word segmentation (Nakatani and Schaffer, 1978; Quené, 1993) but is more in line with studies of phrase segmentation (Edwards and Beckman, 1988; Rietveld, 1980; Scott, 1982; Wightman et al., 1992). The listeners may have distinguished between the sequences according to their perception of phrase boundaries rather than word boundaries. Silence, and occasionally voicing, also played a role in perception; silent intervals led adjoining segments to be more likely to be perceived as adjacent to a boundary (Cutler and Butterfield, 1990; Duez, 1982, 1985; Vaissière, 1983).
The use of other temporal envelope cues differed from the predictions that could be generated from previous studies. Higher intensity was, in general, negatively associated with word-initial word boundaries. This was unexpected given that stressed syllables, which tend to have higher intensity, are often word-initial (Cutler and Butterfield, 1992; Cutler and Norris, 1988). Although many syllables have a more intense vowel in the center and a less intense consonant near the word boundary, this cannot explain the current pattern, because the negative relationship between intensity and the perception of word boundaries held even within segmental classes. For Δ Intensity, meanwhile, there were significant negative coefficients for most relationships between the cue and word segmentation. This is consistent with expectations for word-final vowels, approximants, and nasals, given that word boundaries tend to be heard after locations of decreasing intensity (Heffner et al., 2013; Hillenbrand and Houde, 1996). Yet it is the opposite of what would be expected for word-initial approximants, given that increases in intensity should be related to the likelihood of perceiving a segment as word-initial.
The use of temporal fine structure cues also depended strongly on the segments being considered. The complexity of these results may reflect something about the use of F0 cues by CI listeners and NH listeners presented vocoded speech, given the radical changes to periodicity cues that result from CI signal processing (Schvartz-Leyzac and Chatterjee, 2015; Townshend et al., 1987). Furthermore, previous studies of F0 have largely depended on vowels rather than other sonorants (Heffner et al., 2013; Hillenbrand and Houde, 1996; Tyler and Cutler, 2009); in this study, there was a positive relationship between F0 and the perception of vowels as word-initial. Changes in F0 were not consistently associated with boundary perception even for vowels, although vowels with large increases in F0 were more likely to be perceived as word-initial, which is largely in line with the published literature (Heffner et al., 2013; Hillenbrand and Houde, 1996; Spinelli et al., 2010).
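One simple way to quantify a change-in-F0 cue of the kind discussed here is sketched below: the rise or fall across a segment is expressed in semitones between the means of the initial and final portions of its voiced pitch track. The windowing choice and the pitch values are illustrative assumptions, not the measure used in this study.

```python
import numpy as np

def f0_change_semitones(f0_hz):
    """Change in F0 across a segment, expressed in semitones between the
    means of the initial and final thirds of the voiced pitch track."""
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[f0 > 0]                      # drop unvoiced frames (coded as 0)
    third = max(len(f0) // 3, 1)
    start, end = f0[:third].mean(), f0[-third:].mean()
    return 12 * np.log2(end / start)     # positive = rising F0

# Placeholder pitch track (Hz) for a word-initial vowel with rising F0.
print(f0_change_semitones([0, 0, 180, 184, 190, 197, 205, 0]))
```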
The assessed top-down cues had a split nature: Word Co-occurrence was generally positively associated with word boundary perception, whereas Lexical Frequency was generally unrelated to it, even though both were expected to have a positive relationship. Although word frequency has been shown to influence the perception of word boundaries (Grosjean and Itzler, 1984; Shi and Lepage, 2008), studies of its influence are less common than, say, studies of the influence of sentence context (Mattys et al., 2005). Sentence context might have been a more regular predictor of word segmentation had it been included in this study.
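For concreteness, the sketch below shows one way such top-down cues could be operationalized from corpus counts: lexical frequency as a log per-million-word rate and word co-occurrence as a log conditional probability of one word given the previous word. The counts are placeholders, and the exact formulas used in the present study may differ (e.g., counts drawn from a corpus such as COCA; Davies, 2008).

```python
import math

# Placeholder corpus statistics (token counts); a real analysis would use
# counts from a large corpus such as COCA (Davies, 2008).
unigram = {"a": 10_000_000, "nice": 150_000, "man": 400_000,
           "an": 1_800_000, "iceman": 900}
bigram = {("a", "nice"): 40_000, ("nice", "man"): 9_000, ("an", "iceman"): 300}
corpus_size = 450_000_000

def log_frequency(word):
    """Log10 lexical frequency per million words."""
    return math.log10(unigram[word] / corpus_size * 1_000_000)

def log_cooccurrence(w1, w2):
    """Log10 probability of w2 given w1, a simple word co-occurrence measure."""
    return math.log10(bigram[(w1, w2)] / unigram[w1])

print(log_frequency("iceman"), log_cooccurrence("a", "nice"))
```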
C. Cue use in processed speech
Regarding the differences between the listener groups in their use of word segmentation cues, listeners who were presented processed speech downweighted acoustic cues and upweighted top-down cues. This was true regardless of whether those cues were negatively or positively associated with the perception of a word boundary in a certain location and regardless of whether the processed speech arrived via CI or via the process of noise-vocoding. Table VII shows that the use of specific cues was only sometimes different between the processed speech groups (CI listeners and listeners in the NH-8 condition). Certainly, differences between the processed speech groups were less common than, say, the differences between CI listeners and listeners in the NH-Unprocessed condition (Table V).
Some of this is expected: CI use and vocoding both lead to the perception of a very different acoustic signal compared to unprocessed speech. More surprisingly, the relative downweighting of acoustic cues by the listeners presented processed speech was stronger for the temporal envelope cues that are said to be left relatively intact (e.g., duration, intensity) than for temporal fine structure cues traditionally said to be more affected by signal processing (e.g., F0). For example, the processed speech groups downweighted duration when compared to NH listeners presented unprocessed speech (as shown by the negative z scores in the white cells in Tables V and VI), meaning that a cue sometimes used to compensate for decrements in F0 in other areas of prosody (Donaldson et al., 2015; Kong et al., 2016; Winn et al., 2012) was, in fact, used less by the CI listeners to segment words. F0, meanwhile, was less affected by vocoding when it came to word segmentation, despite widely demonstrated challenges on the part of CI listeners in perceiving it in general (as shown by the blank cells in the rows corresponding to F0 in Tables V and VI, indicating no significant pairwise differences between the processed speech groups and the unprocessed speech group). The persistent use of this cue by both populations in this study suggests that the factors labeled F0 in the present study may index cues not assessed elsewhere. Silence also rarely interacted with the listener group and condition, which is more expected: silent intervals should be transmitted effectively through processing.
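The contrast between envelope and fine structure cues can be illustrated with a minimal channel-vocoder sketch, assuming a monaural signal array and its sampling rate: each band's slowly varying envelope is retained and used to modulate band-limited noise, while the fine structure within each band is discarded. This is a simplified illustration, not the vocoder used to create the NH-8 stimuli.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(signal, fs, n_channels=8, lo=100.0, hi=7000.0):
    """Keep each band's temporal envelope but replace its fine structure
    with noise, illustrating why envelope cues should survive vocoding."""
    edges = np.geomspace(lo, hi, n_channels + 1)   # log-spaced band edges
    rng = np.random.default_rng(0)
    out = np.zeros_like(signal, dtype=float)
    for low, high in zip(edges[:-1], edges[1:]):
        sos = butter(3, [low, high], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(sos, signal)            # analysis band
        env = np.abs(hilbert(band))                # temporal envelope
        carrier = sosfiltfilt(sos, rng.standard_normal(len(signal)))
        out += env * carrier                       # envelope-modulated noise
    return out / (np.max(np.abs(out)) + 1e-12)     # simple normalization

# Placeholder input: 200 ms of a 300-Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(0, 0.2, 1 / fs)
vocoded = noise_vocode(np.sin(2 * np.pi * 300 * t), fs)
print(vocoded.shape)
```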
NH listeners presented vocoded speech may simply be less certain than NH listeners presented unprocessed speech, suggesting that difficulties with word segmentation may lie at a decision level rather than at the level of actual acoustic processing. This idea echoes previous studies (Jaekel et al., 2017; Winn et al., 2016) that examined VOT judgments (which reflect, in part, duration) and found that CI listeners' VOT judgments were less categorical. Some of the differences between CI and NH listeners in those results may reflect differences in the processing of the acoustic cues to voicing themselves. However, given that duration is relatively less affected by the signal processing found in CIs than other acoustic cues, some of the group contrasts may also reflect differences in later decision-making about the sounds being heard. Such decision-making effects could have led to the result obtained here.
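The notion of "less categorical" judgments can be made concrete by fitting a logistic psychometric function to identification data and comparing slopes: a shallower slope implies a less categorical mapping from the acoustic cue to the decision. The response proportions below are placeholders, not data from the cited studies.

```python
import numpy as np
from scipy.optimize import curve_fit

def psychometric(x, x0, k):
    """Logistic psychometric function; k is the slope (categoricalness)."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

vot_ms = np.array([0, 10, 20, 30, 40, 50, 60], dtype=float)

# Placeholder proportions of "voiceless" responses along a VOT continuum.
nh_resp = np.array([0.02, 0.05, 0.15, 0.55, 0.90, 0.97, 0.99])
ci_resp = np.array([0.10, 0.20, 0.35, 0.55, 0.70, 0.82, 0.90])

(nh_x0, nh_k), _ = curve_fit(psychometric, vot_ms, nh_resp, p0=[30.0, 0.2])
(ci_x0, ci_k), _ = curve_fit(psychometric, vot_ms, ci_resp, p0=[30.0, 0.2])
print(f"slope NH = {nh_k:.2f}, slope CI = {ci_k:.2f}  (shallower = less categorical)")
```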
The top-down cues, meanwhile, were generally used more strongly by CI listeners than by NH listeners presented unprocessed speech. This is shown by the positive z scores in many of the white cells in Tables V and VI, indicating that those cues, which are positively associated with word boundaries across conditions in general, were used even more by the processed speech groups. This suggests that when acoustic cues to segmentation are weaker, listeners are more likely to rely on prior knowledge. Although this seems logical, it contradicts a hypothesis that could be generated from Mattys et al. (2009), namely that listeners in acoustically challenging contexts would pay more attention to the aspects of the acoustic signal that remain intact. Instead, listeners seemed to ignore certain aspects of the degraded acoustic information and attended more to some top-down cues to word segmentation, which is in line with previous studies of listeners presented processed speech in other contexts (Davis et al., 2005; Hawthorne, 2018; Sheldon et al., 2008). Subsequent studies are necessary to probe the extent to which this reliance on top-down cues could be explained by the fact that the CI listeners in the present study were postlingually deafened and to explore the role of early linguistic experience in these top-down factors.
The differences between the NH-8 listeners and CI listeners, meanwhile, were relatively small, although surprisingly consistent. Table VII shows the pairwise comparisons between the CI listeners and NH-8 listeners. There was only one significant pairwise difference in top-down cue use between the processed speech listener groups, and there were no significant pairwise differences in the use of the silence or voicing cues, indicating that these cues were all used roughly comparably by the processed speech groups. The use of the temporal fine structure cues of F0 and change in F0 was also largely comparable between the groups, although CI listeners were less affected by the change in F0 for word-final nasals than the NH-8 group.
There were consistent differences between the groups for the envelope-based cues: intensity was consistently used more by CI listeners than NH-8 listeners (as shown by the positive values in the white cells and negative values in the light gray cells in Table VII), and duration was consistently used less by CI listeners than NH-8 listeners (as shown by the negative values in the white cells in Table VII). Why these two cues should trade off in this fashion for these two listener groups is not clear. Studies that have examined temporal envelope cues in both CI listeners and NH listeners presented processed speech have usually examined duration or intensity in isolation, not together. In one study examining the distinction between statements and questions in CI listeners and NH listeners presented processed speech (Peng et al., 2012), the CI listeners were less affected by distinctions in both intensity and duration than NH listeners presented noise-vocoded speech. Thus, it is not clear why those cues were split in the present study.
D. Limitations
The small and heterogeneous sample of participants is one important limitation of this study. There were 16 postlingually deafened CI listeners included in the present study, who were age-matched at the group level to a set of 16 NH listeners. These sample sizes are not particularly large and were influenced more by the availability of participants than by considerations of statistical power. It is likely that some of the effects observed (or that we failed to observe) stem from the small sample size, which is one reason why the discussion focuses more on finding patterns across segment classes and positions than on interrogating each datapoint in depth. Of primary concern is the heterogeneity of the CI sample: the participants were aged 24–80 years, reported that their hearing loss started anywhere between the ages of 4 and 74 years, and experienced deafness prior to first implantation for anywhere from less than 1 year to 54 years.
This variability makes it challenging to assess, for example, whether listeners whose hearing loss began earlier in life would show weaker use of top-down cues than listeners whose hearing loss began later. Although years since first implantation did not predict significant variance in the overall accuracy scores, the correlations between this factor and the others in our dataset prevented more detailed exploration of this idea. This limits the clinical implications of our study, as such implications would extrapolate from information that is too variable or absent from our sample. For example, it would be helpful to have other measures of speech perception ability and to correlate those measures with individual participants' word segmentation abilities. Age, on the other hand, did predict overall accuracy for CI listeners but not for NH listeners, suggesting that it may be a more promising avenue for pursuing the clinical applications of word segmentation research. Although it would have been statistically unwise with just 16 participants in the CI listener group, including scores on other speech perception tasks as factors in our analysis would have been helpful in contextualizing these findings. If word segmentation is an integral part of the challenges that CI listeners have with speech perception, then CI listeners who are particularly challenged by word segmentation tasks should also be challenged by other speech perception tasks.
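The collinearity problem noted here can be illustrated with a short sketch: when participant-level factors such as age, age at onset of hearing loss, and years of CI experience are strongly intercorrelated in a sample of 16, their individual contributions to overall accuracy cannot be cleanly separated. All values below are simulated placeholders, not participant data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 16  # matches the CI group size; all values below are placeholders

age = rng.uniform(24, 80, n)
onset_age = np.clip(age - rng.uniform(5, 40, n), 4, None)
years_implanted = np.clip(age - onset_age - rng.uniform(0, 10, n), 0.5, None)
accuracy = np.clip(0.9 - 0.003 * age + rng.normal(0, 0.05, n), 0, 1)

df = pd.DataFrame({"age": age, "onset_age": onset_age,
                   "years_implanted": years_implanted, "accuracy": accuracy})

# Strong correlations among predictors make it hard to attribute variance
# in accuracy to any single demographic factor.
print(df.corr().round(2))
```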
It is also important to note aspects of the experimental design that limit the conclusions that can be drawn. Consider, for instance, the differences in the number of items across the segment classes and positions used. The supplementary material includes a table giving the number of items per segment class and position, which ranged from 56 (for word-final stops) to 208 (for word-initial fricatives).1 Although these item numbers are large enough to alleviate some concerns about dramatic effects of individual item outliers, the strength of our conclusions remains contingent on the number of items in each segment class. This study also used only a single talker, meaning that listeners may have had the chance to adapt to that talker's particular style of speech (Cutler and Butterfield, 1990) as well as to the distribution of cues in that talker's voice that signaled word boundaries (Lehiste, 1960). To the extent that the patterns observed for the acoustic cues that signaled word boundaries differed from those of previous studies, it may simply be that this individual talker did not show the expected patterns. However, adapting to individual talkers is something listeners are expected to do frequently in everyday speech contexts. Perhaps more pressingly, the study was observational rather than experimental, meaning that some of the correlations observed may stem from incidental aspects of the stimuli rather than from the true influence of any particular cue on the perception of word boundaries. Additional experimental studies are necessary to confirm or dispel the conclusions reached here.
VI. CONCLUSIONS
Word segmentation is an important component of speech perception. In the present study, we examined the word segmentation abilities of CI listeners, NH listeners presented vocoded speech, and NH listeners presented unprocessed speech. Our results suggest that, for this particular talker, cues including duration, silence, changes in F0, and top-down cues about lexical frequency all consistently affected word segmentation. Longer segments, segments next to periods of silence, segments with rising F0 (for word-initial segments), and segments that led to more frequent word strings were all more likely to be perceived as adjacent to a word boundary than their equivalents with different cue levels. In general, the same sequences that were challenging for one combination of condition and listener group were challenging for the other two, with a small, significant, and uniform decrease in performance for both the CI listeners and NH listeners presented vocoded speech relative to NH listeners presented unprocessed speech.
Listeners presented processed speech generally downweighted the acoustic cues and upweighted the top-down cues relative to NH listeners presented unprocessed speech. This was true regardless of whether the acoustic cues were those thought to be degraded (e.g., F0) or unaffected (e.g., duration) by CI signal processing. This could have ramifications for broader speech perception deficits in these populations, as it suggests that listeners presented processed speech may be generally less confident or less proficient at using even intact acoustic cues. It could also shape our understanding of the clinical consequences of CI use, as even some acoustic cues believed to be intact for CI listeners were used less than by NH listeners presented unprocessed speech. Although we limited our analysis to postlingually deafened CI listeners, which may complicate the generalizability of these findings, the use of top-down cues was stronger for the CI listeners. This suggests that listeners may compensate for challenges with acoustic cues to word segmentation by substituting knowledge-based cues instead.
ACKNOWLEDGMENTS
This work was supported by a National Science Foundation (NSF) Graduate Research Fellowship award, a NSF SBE Postdoctoral Research Fellowship award (No. 1714858), an Acoustical Society of America Hunt Postdoctoral Research Fellowship, and a University of Maryland, College Park Graduate School Flagship Fellowship to C.C.H., as well as a National Institutes of Health (NIH) F31 award to B.N.J. (Award No. 1F31DC017362) and a NIH R01 award to M.J.G. (Award No. R01AG051603). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We would like to thank Shelby Creelman, Hannah Cohen, Laura Goudreau, Hannah Johnson, Kelly Miller, Hallie Saffeir, Adelia Witt, Calli Yancey, and Erica Younkin for their help in testing the listeners for this study.
Footnotes
See supplementary material at https://www.scitation.org/doi/suppl/10.1121/10.0006448 for additional information about participants, materials, stimulus coding, and full specifications of final models.
References
- 1. Basirat, A. (2017). “ Word segmentation in phonemically identical and prosodically different sequences using cochlear implants: A case study,” Clin. Linguist. Phon. 31, 478–485. 10.1080/02699206.2017.1283708 [DOI] [PubMed] [Google Scholar]
- 2. Berg, K. A. , Noble, J. H. , Dawant, B. M. , Dwyer, R. T. , Labadie, R. F. , and Gifford, R. H. (2019). “ Speech recognition as a function of the number of channels in perimodiolar electrode recipients,” J. Acoust. Soc. Am. 145, 1556–1564. 10.1121/1.5092350 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Chatterjee, M. , and Peng, S. C. (2008). “ Processing F0 with cochlear implants: Modulation frequency discrimination and speech intonation recognition,” Hear. Res. 235, 143–156. 10.1016/j.heares.2007.11.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Cole, R. A. , Jakimik, J. , and Cooper, W. E. (1980). “ Segmenting speech into words,” J. Acoust. Soc. Am. 67, 1323–1332. 10.1121/1.384185 [DOI] [PubMed] [Google Scholar]
- 5. Croghan, N. B. H. , Duran, S. I. , and Smith, Z. M. (2017). “ Re-examining the relationship between number of cochlear implant channels and maximal speech intelligibility,” J. Acoust. Soc. Am. 142, EL537–EL543. 10.1121/1.5016044 [DOI] [PubMed] [Google Scholar]
- 6. Cutler, A. , and Butterfield, S. (1990). “ Durational cues to word boundaries in clear speech,” Speech Commun. 9, 485–495. 10.1016/0167-6393(90)90024-4 [DOI] [Google Scholar]
- 7. Cutler, A. , and Butterfield, S. (1992). “ Rhythmic cues to speech segmentation: Evidence from juncture misperception,” J. Mem. Lang. 31, 218–236. 10.1016/0749-596X(92)90012-M [DOI] [Google Scholar]
- 8. Cutler, A. , and Norris, D. (1988). “ The role of strong syllables in segmentation for lexical access,” J. Exp. Psychol. Hum. Percept. Perform. 14, 113–121. 10.1037/0096-1523.14.1.113 [DOI] [Google Scholar]
- 9. D'Alessandro, H. D. , Ballantyne, D. , Boyle, P. J. , De Seta, E. , DeVincentiis, M. , and Mancini, P. (2018). “ Temporal fine structure processing, pitch, and speech perception in adult cochlear implant recipients,” Ear Hear. 39, 679–686. 10.1097/AUD.0000000000000525 [DOI] [PubMed] [Google Scholar]
- 10. Davies, M. (2008). “ The Corpus of Contemporary American English: 450 million words, 1990–present,” available at: http://corpus.byu.edu/coca/ (Last viewed 07/26/2021).
- 11. Davis, M. H. , Johnsrude, I. S. , Hervais-Adelman, A. , Taylor, K. , and McGettigan, C. (2005). “ Lexical information drives perceptual learning of distorted speech: Evidence from the comprehension of noise-vocoded sentences,” J. Exp. Psychol. Gen. 134, 222–241. 10.1037/0096-3445.134.2.222 [DOI] [PubMed] [Google Scholar]
- 12. Deocampo, J. A. , Smith, G. N. L. , Kronenberger, W. G. , Pisoni, D. B. , and Conway, C. M. (2018). “ The role of statistical learning in understanding and treating spoken language outcomes in deaf children with cochlear implants,” Lang. Speech. Hear. Serv. Sch. 49, 723–739. 10.1044/2018_LSHSS-STLT1-17-0138 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Donaldson, G. S. , Rogers, C. L. , Johnson, L. B. , and Oh, S. H. (2015). “ Vowel identification by cochlear implant users: Contributions of duration cues and dynamic spectral cues,” J. Acoust. Soc. Am. 138, 65–73. 10.1121/1.4922173 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Dorman, M. F. , Loizou, P. C. , Spahr, A. J. , and Maloff, E. (2002). “ Factors that allow a high level of speech understanding by patients fit with cochlear implants,” Am. J. Audiol. 11, 119–123. 10.1044/1059-0889(2002/014) [DOI] [PubMed] [Google Scholar]
- 15. Duez, D. (1982). “ Silent and non-silent pauses in three speech styles,” Lang. Speech 25, 11–28. 10.1177/002383098202500102 [DOI] [PubMed] [Google Scholar]
- 16. Duez, D. (1985). “ Perception of silent pauses in continuous speech,” Lang. Speech 28, 377–389. 10.1177/002383098502800403 [DOI] [PubMed] [Google Scholar]
- 17. Edwards, J. , and Beckman, M. E. (1988). “ Articulatory timing and the prosodic interpretation of syllable duration,” Phonetica 45, 156–174. 10.1159/000261824 [DOI] [Google Scholar]
- 18. Everhardt, M. K. , Sarampalis, A. , Coler, M. , Başkent, D. , and Lowie, W. (2020). “ Meta-analysis on the identification of linguistic and emotional prosody in cochlear implant users and vocoder simulations,” Ear Hear. 41, 1092–1102. 10.1097/AUD.0000000000000863 [DOI] [PubMed] [Google Scholar]
- 19. Fisher, C. , and Tokura, H. (1996). “ Acoustic cues to grammatical structure in infant-directed speech: Cross-linguistic evidence,” Child Dev. 67, 3192–3218. 10.2307/1131774 [DOI] [PubMed] [Google Scholar]
- 20. Friesen, L. M. , Shannon, R. V. , Baskent, D. , and Wang, X. (2001). “ Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants,” J. Acoust. Soc. Am. 110, 1150–1163. 10.1121/1.1381538 [DOI] [PubMed] [Google Scholar]
- 21. Gianakas, S. P. , and Winn, M. B. (2019). “ Lexical bias in word recognition by cochlear implant listeners,” J. Acoust. Soc. Am. 146, 3373–3383. 10.1121/1.5132938 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Gifford, R. H. , and Revit, L. J. (2010). “ Speech perception for adult cochlear implant recipients in a realistic background noise: Effectiveness of preprocessing strategies and external options for improving speech recognition in noise,” J. Am. Acad. Audiol. 21, 441–451. 10.3766/jaaa.21.7.3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Goldwater, S. , Griffiths, T. L. , and Johnson, M. (2009). “ A Bayesian framework for word segmentation: Exploring the effects of context,” Cognition 112, 21–54. 10.1016/j.cognition.2009.03.008 [DOI] [PubMed] [Google Scholar]
- 24. Gow, D. W. , and Gordon, P. C. (1995). “ Lexical and prelexical influences on word segmentation: Evidence from priming,” J. Exp. Psychol. Hum. Percept. Perform. 21, 344–359. 10.1037/0096-1523.21.2.344 [DOI] [PubMed] [Google Scholar]
- 25. Grieco-Calub, T. M. , Simeon, K. M. , Snyder, H. E. , and Lew-Williams, C. (2017). “ Word segmentation from noise-band vocoded speech,” Lang. Cogn. Neurosci. 32, 1344–1356. 10.1080/23273798.2017.1354129 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Grosjean, F. , and Itzler, J. (1984). “ Can semantic constraint reduce the role of word frequency during spoken-word recognition?,” Bull. Psychon. Soc. 22, 180–182. 10.3758/BF03333798 [DOI] [Google Scholar]
- 27. Hawthorne, K. (2018). “ Prosody-driven syntax learning is robust to impoverished pitch and spectral cues,” J. Acoust. Soc. Am. 143, 2756–2767. 10.1121/1.5031130 [DOI] [PubMed] [Google Scholar]
- 28. Heffner, C. C. , Dilley, L. C. , McAuley, J. D. , and Pitt, M. A. (2013). “ When cues combine: How distal and proximal acoustic cues are integrated in word segmentation,” Lang. Cogn. Process. 28, 1275–1302. 10.1080/01690965.2012.672229 [DOI] [Google Scholar]
- 29. Heffner, C. C. , Newman, R. S. , and Idsardi, W. J. (2017). “ Support for context effects on segmentation and segments depends on the context,” Atten., Percept., Psychophys. 79, 964–988. 10.3758/s13414-016-1274-5 [DOI] [PubMed] [Google Scholar]
- 30. Heng, J. , Cantarero, G. , Elhilali, M. , and Limb, C. J. (2011). “ Impaired perception of temporal fine structure and musical timbre in cochlear implant users,” Hear. Res. 280, 192–200. 10.1016/j.heares.2011.05.017 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Hillenbrand, J. M. , and Houde, R. A. (1996). “ Role of F0 and amplitude in the perception of intervocalic glottal stops,” J. Speech Hear. Res. 39, 1182–1190. 10.1044/jshr.3906.1182 [DOI] [PubMed] [Google Scholar]
- 32. Jaekel, B. N. , Newman, R. S. , and Goupell, M. J. (2017). “ Speech rate normalization and phonemic boundary perception in cochlear-implant users,” J. Speech, Lang. Hear. Res. 60, 1398–1416. 10.1044/2016_JSLHR-H-15-0427 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. Kalathottukaren, R. T. , Purdy, S. C. , and Ballard, E. (2017). “ Prosody perception and production in children with hearing loss and age- and gender-matched controls,” J. Am. Acad. Audiol. 28, 283–294. 10.3766/jaaa.16001 [DOI] [PubMed] [Google Scholar]
- 34. Klatt, D. H. (1976). “ Linguistic uses of segmental duration in English: Acoustic and perceptual evidence,” J. Acoust. Soc. Am. 59, 1208–1221. 10.1121/1.380986 [DOI] [PubMed] [Google Scholar]
- 35. Kong, Y.-Y. , Winn, M. B. , Poellmann, K. , and Donaldson, G. S. (2016). “ Discriminability and perceptual saliency of temporal and spectral cues for final fricative consonant voicing in simulated cochlear-implant and bimodal hearing,” Trends Hear. 20, 1–15. 10.1177/2331216516652145 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Lehiste, I. (1960). “ An acoustic-phonetic study of internal open juncture,” Phonetica 5, 5–56. 10.1159/000258062 [DOI] [Google Scholar]
- 37. Lenth, R. (2016). “ lsmeans: Least-squares means,” available at https://cran.r-project.org/package=lsmeans (Last viewed 07/26/2021).
- 38. Lieberman, P. (1960). “ Some acoustic correlates of word stress in American English,” J. Acoust. Soc. Am. 32, 451–454. 10.1121/1.1908095 [DOI] [Google Scholar]
- 39. Loizou, P. C. (2006). “ Speech processing in vocoder-centric cochlear implants,” in Cochlear and Brainstem Implants, edited by Møller A. R. ( Karger, Basel: ), pp. 109–143. [DOI] [PubMed] [Google Scholar]
- 40. Makowski, D. , Ben-Shachar, M. S. , and Lüdecke, D. (2019). “ bayestestR: Describing effects and their uncertainty, existence and significance within the Bayesian framework,” J. Open Source Softw. 4, 1541. 10.21105/joss.01541 [DOI] [Google Scholar]
- 41. Marslen-Wilson, W. D. (1987). “ Functional parallelism in spoken word-recognition,” Cognition 25, 71–102. 10.1016/0010-0277(87)90005-9 [DOI] [PubMed] [Google Scholar]
- 42. Marx, M. , James, C. , Foxton, J. , Capber, A. , Fraysse, B. , Barone, P. , and Deguine, O. (2015). “ Speech prosody perception in cochlear implant users with and without residual hearing,” Ear Hear. 36, 239–248. 10.1097/AUD.0000000000000105 [DOI] [PubMed] [Google Scholar]
- 43. Mattys, S. L. , Brooks, J. , and Cooke, M. (2009). “ Recognizing speech under a processing load: Dissociating energetic from informational factors,” Cogn. Psychol. 59, 203–243. 10.1016/j.cogpsych.2009.04.001 [DOI] [PubMed] [Google Scholar]
- 44. Mattys, S. L. , White, L. , and Melhorn, J. F. (2005). “ Integration of multiple speech segmentation cues: A hierarchical framework,” J. Exp. Psychol. Gen. 134, 477–500. 10.1037/0096-3445.134.4.477 [DOI] [PubMed] [Google Scholar]
- 45. Moberly, A. C. , Lowenstein, J. H. , Tarr, E. , Caldwell-Tarr, A. , Welling, D. B. , Shahin, A. J. , and Nittrouer, S. (2014). “ Do adults with cochlear implants rely on different acoustic cues for phoneme perception than adults with normal hearing?,” J. Speech, Lang. Hear. Res. 23, 530–545. 10.1044/2014_JSLHR-H-12-0323 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Moore, B. C. J. (2008). “ The role of temporal fine structure processing in pitch perception, masking, and speech perception for normal-hearing and hearing-impaired people,” J. Assoc. Res. Otolaryngol. 9, 399–406. 10.1007/s10162-008-0143-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Morris, D. , Magnusson, L. , Faulkner, A. , Jönsson, R. , and Juul, H. (2013). “ Identification of vowel length, word stress, and compound words and phrases by postlingually deafened cochlear implant listeners,” J. Am. Acad. Audiol. 24, 879–890. 10.3766/jaaa.24.9.11 [DOI] [PubMed] [Google Scholar]
- 48. Nakatani, L. H. , and Dukes, K. D. (1977). “ Locus of segmental cues for word juncture,” J. Acoust. Soc. Am. 62, 714–719. 10.1121/1.381583 [DOI] [PubMed] [Google Scholar]
- 49. Nakatani, L. H. , and Schaffer, J. A. (1978). “ Hearing ‘words’ without words: Prosodic cues for word perception,” J. Acoust. Soc. Am. 63, 234–245. 10.1121/1.381719 [DOI] [PubMed] [Google Scholar]
- 50. Oller, D. K. (1973). “ The effect of position in utterance on speech segment duration in English,” J. Acoust. Soc. Am. 54, 1235–1247. 10.1121/1.1914393 [DOI] [PubMed] [Google Scholar]
- 51. O'Neill, E. R. , Kreft, H. A. , and Oxenham, A. J. (2019a). “ Speech perception with spectrally non-overlapping maskers as measure of spectral resolution in cochlear implant users,” J. Assoc. Res. Otolaryngol. 20, 151–167. 10.1007/s10162-018-00702-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. O'Neill, E. R. , Kreft, H. A. , and Oxenham, A. J. (2019b). “ Cognitive factors contribute to speech perception in cochlear-implant users and age-matched normal-hearing listeners under vocoded conditions,” J. Acoust. Soc. Am. 146, 195–210. 10.1121/1.5116009 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53. Peng, S.-C. , Chatterjee, M. , and Lu, N. (2012). “ Acoustic cue integration in speech intonation recognition with cochlear implants,” Trends Amplif. 16, 67–82. 10.1177/1084713812451159 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Pierrehumbert, J. (1979). “ The perception of fundamental frequency declination,” J. Acoust. Soc. Am. 66, 363–369. 10.1121/1.383670 [DOI] [PubMed] [Google Scholar]
- 55. Quené, H. (1993). “ Segment durations and accent as cues to word segmentation in Dutch,” J. Acoust. Soc. Am. 94, 2027–2035. 10.1121/1.407504 [DOI] [PubMed] [Google Scholar]
- 56. Repp, B. H. , Liberman, A. M. , Eccardt, T. , and Pesetsky, D. (1978). “ Perceptual integration of acoustic cues for stop, fricative, and affricate manner,” J. Exp. Psychol. Hum. Percept. Perform. 4, 621–637. 10.1037/0096-1523.4.4.621 [DOI] [PubMed] [Google Scholar]
- 57. Rietveld, A. C. M. (1980). “ Word boundaries in the French language,” Lang. Speech 23, 289–296. 10.1177/002383098002300306 [DOI] [Google Scholar]
- 58. Rietveld, T. , Kerkhoff, J. , and Gussenhoven, C. (2004). “ Word prosodic structure and vowel duration in Dutch,” J. Phon. 32, 349–371. 10.1016/j.wocn.2003.08.002 [DOI] [Google Scholar]
- 59. Rødvik, A. K. , Torkildsen, J. von Koss , Wie, O. B. , Storaker, M. A. , and Silvola, J. T. (2018). “ Consonant and vowel identification in cochlear implant users measured by nonsense words: A systematic review and meta-analysis,” J. Speech, Lang. Hear. Res. 61, 1023–1050. 10.1044/2018_JSLHR-H-16-0463 [DOI] [PubMed] [Google Scholar]
- 60. Rosen, S. (1992). “ Temporal information in speech: Acoustic, auditory, and linguistic aspects,” Philos. Trans. R. Soc. London B Biol. Sci. 336, 367–373. 10.1098/rstb.1992.0070 [DOI] [PubMed] [Google Scholar]
- 61. Schvartz-Leyzac, K. C. , and Chatterjee, M. (2015). “ Fundamental-frequency discrimination using noise-band-vocoded harmonic complexes in older listeners with normal hearing,” J. Acoust. Soc. Am. 138, 1687–1695. 10.1121/1.4929938 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62. Schvartz-Leyzac, K. C. , Zwolan, T. A. , and Pfingst, B. E. (2017). “ Effects of electrode deactivation on speech recognition in multichannel cochlear implant recipients,” Cochlear Implants Int. 18, 324–334. 10.1080/14670100.2017.1359457 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63. Scott, D. R. (1982). “ Duration as a cue to the perception of a phrase boundary,” J. Acoust. Soc. Am. 71, 996–1007. 10.1121/1.387581 [DOI] [PubMed] [Google Scholar]
- 64. Shannon, R. V. , Zeng, F.-G. , Kamath, V. , Wygonski, J. , and Ekelid, M. (1995). “ Speech recognition with primarily temporal cues,” Science 270, 303–304. 10.1126/science.270.5234.303 [DOI] [PubMed] [Google Scholar]
- 65. Shatzman, K. B. , and McQueen, J. M. (2006). “ Segment duration as a cue to word boundaries in spoken-word recognition,” Percept. Psychophys. 68, 1–16. 10.3758/BF03193651 [DOI] [PubMed] [Google Scholar]
- 66. Sheldon, S. , Pichora-Fuller, M. K. , and Schneider, B. A. (2008). “ Priming and sentence context support listening to noise-vocoded speech by younger and older adults,” J. Acoust. Soc. Am. 123, 489–499. 10.1121/1.2783762 [DOI] [PubMed] [Google Scholar]
- 67. Shi, R. , and Lepage, M. (2008). “ The effect of functional morphemes on word segmentation in preverbal infants,” Dev. Sci. 11, 407–413. 10.1111/j.1467-7687.2008.00685.x [DOI] [PubMed] [Google Scholar]
- 68. Souza, P. E. , Arehart, K. , Miller, C. W. , and Muralimanohar, R. K. (2011). “ Effects of age on F0 discrimination and intonation perception in simulated electric and electroacoustic hearing,” Ear Hear. 32, 75–83. 10.1097/AUD.0b013e3181eccfe9 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69. Spinelli, E. , Grimault, N. , Meunier, F. , and Welby, P. (2010). “ An intonational cue to word segmentation in phonemically identical sequences,” Atten., Percept., Psychophys. 72, 775–787. 10.3758/APP.72.3.775 [DOI] [PubMed] [Google Scholar]
- 70. Swerts, M. (1997). “ Prosodic features at discourse boundaries of different strength,” J. Acoust. Soc. Am. 101, 514–521. 10.1121/1.418114 [DOI] [PubMed] [Google Scholar]
- 71. Tajudeen, B. A. , Waltzman, S. B. , Jethanamest, D. , and Svirsky, M. A. (2010). “ Speech perception in congenitally deaf children receiving cochlear implants in the first year of life,” Otol. Neurotol. 31, 1254–1260. 10.1097/MAO.0b013e3181f2f475 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72. Townshend, B. , Cotter, N. , Van Compernolle, D. , and White, R. L. (1987). “ Pitch perception by cochlear implant subjects,” J. Acoust. Soc. Am. 82, 106–115. 10.1121/1.395554 [DOI] [PubMed] [Google Scholar]
- 73. Turk, A. E. , and Shattuck-Hufnagel, S. (2000). “ Word-boundary-related duration patterns in English,” J. Phon. 28, 397–440. 10.1006/jpho.2000.0123 [DOI] [Google Scholar]
- 74. Turk, A. E. , and Shattuck-Hufnagel, S. (2014). “ Timing in talking: What is it used for, and how is it controlled?,” Philos. Trans. R. Soc. London B Biol. Sci. 369, 1–13. 10.1098/rstb.2013.0395 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75. Tyler, M. D. , and Cutler, A. (2009). “ Cross-language differences in cue use for speech segmentation,” J. Acoust. Soc. Am. 126, 367–376. 10.1121/1.3129127 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76. Umeda, N. (1977). “ Consonant duration in American English,” J. Acoust. Soc. Am. 61, 846–858. 10.1121/1.381374 [DOI] [PubMed] [Google Scholar]
- 77. Vaissière, J. (1983). “ Language-independent prosodic features,” in Prosody: Models and Measurements, edited by Cutler A. ( Springer, Berlin: ), pp. 53–66. [Google Scholar]
- 78. van Kuijk, D. , and Boves, L. (1999). “ Acoustic characteristics of lexical stress in continuous telephone speech,” Speech Commun. 27, 95–111. 10.1016/S0167-6393(98)00069-7 [DOI] [Google Scholar]
- 79. Wightman, C. W. , Shattuck-Hufnagel, S. , Ostendorf, M. , and Price, P. J. (1992). “ Segmental durations in the vicinity of prosodic phrase boundaries,” J. Acoust. Soc. Am. 91, 1707–1717. 10.1121/1.402450 [DOI] [PubMed] [Google Scholar]
- 80. Winn, M. B. , Chatterjee, M. , and Idsardi, W. J. (2012). “ The use of acoustic cues for phonetic identification: Effects of spectral degradation and electric hearing,” J. Acoust. Soc. Am. 131, 1465–1479. 10.1121/1.3672705 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81. Winn, M. B. , Won, J. H. , and Moon, I. J. (2016). “ Assessment of spectral and temporal resolution in cochlear implant users using psychoacoustic discrimination and speech cue categorization,” Ear Hear. 37, e377–e390. 10.1097/AUD.0000000000000328 [DOI] [PMC free article] [PubMed] [Google Scholar]


