Abstract
Spectral degradation reduces access to the acoustics of spoken language and compromises how learners break into its structure. We hypothesised that spectral degradation disrupts word segmentation, but that listeners can exploit other cues to restore detection of words. Normal-hearing adults were familiarised to artificial speech that was unprocessed or spectrally degraded by noise-band vocoding into 16 or 8 spectral channels. The monotonic speech stream was pause-free (Experiment 1), interspersed with isolated words (Experiment 2), or slowed by 33% (Experiment 3). Participants were tested on segmentation of familiar vs. novel syllable sequences and on recognition of individual syllables. As expected, vocoding hindered both word segmentation and syllable recognition. The addition of isolated words, but not slowed speech, improved segmentation. We conclude that syllable recognition is necessary but not sufficient for successful word segmentation, and that isolated words can facilitate listeners’ access to the structure of acoustically degraded speech.
Keywords: Word segmentation, noise-band vocoding, spectral degradation, speech rate, isolated words
Introduction
Language is replete with structure, and normal-hearing listeners are equipped to detect it. For decades, researchers have been drawn to understanding how learners discover words in continuous speech, an inherently challenging task given that connected speech has no reliable pause-defined cues to word boundaries (Romberg & Saffran, 2010; Saffran, Aslin, & Newport, 1996). One cue that listeners can use to segment words is the co-occurrence relation between sounds and syllables, often referred to as transitional probability (TP). For example, given syllables X and Y, learners are sensitive to the probability with which X will transition to Y (and vice versa), and this domain-general sensitivity to TPs has been demonstrated in several perceptual domains (Aslin, Saffran, & Newport, 1998; Fiser & Aslin, 2002; Kirkham, Slemmer, & Johnson, 2002; Lew-Williams, Pelucchi, & Saffran, 2011). This mechanism is posited not only to enable language learning in infants but also to facilitate word segmentation in adults.
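For concreteness, the forward TP of an adjacent syllable pair XY is standardly computed as the conditional probability of Y given X (following Saffran, Aslin, & Newport, 1996):

\[ \mathrm{TP}(Y \mid X) = \frac{\text{frequency of } XY}{\text{frequency of } X} \]

Syllable pairs that occur within the same word therefore have high TPs, whereas pairs that span a word boundary have lower TPs, and this contrast is the statistic that listeners are hypothesised to exploit.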
An underlying assumption in previous research is that successful word segmentation from contiguous speech hinges both on accurate recognition of individual speech units, such as phonemes and syllables, and on tracking of syllable sequences over time. Insufficient spectral fidelity, however, compromises successful discrimination of speech. Environmental factors (e.g. background noise), biological differences (e.g. hearing loss), and the use of hearing devices (e.g. cochlear implants) restrict access to spectral cues that are important for speech unit recognition (Donaldson & Kreft, 2006; Gordon-Salant, Yeni-Komshian, Fitzgibbons, & Cohen, 2015; Munson, Donaldson, Allen, Collison, & Nelson, 2003; Xu & Pfingst, 2008; Zhou, Xu, & Lee, 2010). When recognition is impaired, there are consequences for processing both within and beyond the domain of language. A range of studies using behavioural, physiological, and neuroimaging methods provide robust evidence that encoding degraded speech is also cognitively demanding (Davis & Johnsrude, 2007; Mattys, Davis, Bradlow, & Scott, 2012; Rönnberg et al., 2013). For example, the use of dual-task paradigms has documented robust declines in secondary task performance as speech becomes more degraded in a primary task (e.g. Broadbent, 1958; Downs & Crum, 1978; Grieco-Calub, Ward, & Brehm, 2017; Pals, Sarampalis, & Başkent, 2013; Pichora-Fuller, Schneider, & Daneman, 1995; Rabbitt, 1966; Rakerd, Seitz, & Whearty, 1996; Sarampalis, Kalluri, Edwards, & Hafter, 2009; Ward, Shen, Souza, & Grieco-Calub, 2017). Pupillometry studies have also shown increased cognitive effort associated with processing of degraded speech input (e.g. Winn, Edwards, & Litovsky, 2015). These findings are supported by neuroimaging work showing more distributed neural activation during tasks that involve recognition of degraded vs. unprocessed speech (Hervais-Adelman, Carlyon, Johnsrude, & Davis, 2012; Obleser, Wise, Dresner, & Scott, 2007; Wild et al., 2012). Given the cognitive demands of listening to degraded speech, and given that listeners have different levels of access to the acoustic subtleties in speech, there may be important individual differences in the ability to track relations between speech units (such as syllables) across time.
Following the prediction that spectrally degraded speech will interfere with word segmentation, the question arises as to whether acoustic clarity is a prerequisite for successful detection of co-occurrence, or whether listeners can rely on other cues in the input. In natural speech, TPs are just one cue to structure, and other cues are readily available (Johnson & Jusczyk, 2001). For example, speakers often produce isolated words, which affect the prosody and time course of incoming information by providing silent pauses and reducing the rate of incoming speech. Inserting pauses in continuous speech isolates certain sequences in time, providing an overt word boundary that could mitigate difficulty with segmentation. Previous studies have shown that isolated words are a common feature of child-directed speech and facilitate language learning (Aslin, Woodward, LaMendola, & Bever, 1996; Brent & Siskind, 2001; Church, Bernhardt, Shi, & Pichora-Fuller, 2005; Jusczyk, 1999; Jusczyk & Aslin, 1995; Lew-Williams et al., 2011). A study by Brent and Siskind (2001) showed that 9% of mothers’ child-directed utterances contained isolated words, and that the frequency of hearing a word in isolation was a significant and unique predictor of later knowledge of that word. While isolated words are not required for speech segmentation, they may serve as a temporal and/or prosodic cue that enhances a listener’s ability to track sequential statistics across time (Bortfeld, Morgan, Golinkoff, & Rathbun, 2005; Cunillera, Càmara, Laine, & Rodríguez-Fornells, 2010b; Cunillera, Laine, & Rodríguez-Fornells, 2016; Lew-Williams et al., 2011). We predicted that isolated words would highlight sequences in an otherwise continuous stream of spectrally degraded speech and, thereby, support segmentation.
Another temporal cue that may benefit listeners in degraded listening conditions is speech rate, or the number of morphemes or words produced per unit of time. Recent work by Palmer and Mattys (2016) demonstrated that slower syllable rates improved performance on a segmentation task in adults, even when they controlled for the total duration of the speech stream. They showed that adults who were familiarised to an artificial language at a slow rate (i.e. 2.27 syllables per second) correctly segmented a greater proportion of sequences than adults who were familiarised to the language at a normal or fast rate (4.17 or 7.45 syllables per second, respectively). In a follow-up experiment, adults performed either a phonological or visual two-back task while being familiarised to the artificial language. They found that the inclusion of either task eliminated the benefit of the slower speech rate, suggesting that increased cognitive load impaired segmentation. Given these findings, we predicted that reducing the speech rate would support listeners’ abilities to represent individual units in degraded listening conditions and, in turn, support the tracking of units over time.
The present study was designed to test the hypotheses that successful word segmentation is contingent on full access to acoustic speech cues, and that temporal cues – such as isolated words and reduced speech rate – aid learning from spectrally degraded speech. In Experiment 1, normal-hearing adults participated in word segmentation and syllable recognition tasks using speech that was either unprocessed or spectrally degraded by a 16-channel (16-ch) or 8-channel (8-ch) noise-band vocoder. Adults listened to an artificial language consisting of four trisyllabic nonsense words (Lew-Williams & Saffran, 2012), and were then asked to distinguish between previously heard trisyllabic sequences vs. trisyllabic sequences that never occurred in the speech stream. We predicted that (1) successful word segmentation from the artificial speech stream would depend on spectral fidelity; (2) successful segmentation would rely on accurate recognition of individual speech units; and (3) the addition of temporal cues to the speech stream would facilitate segmentation. Together, these experiments provide insight into the effects of degraded speech on the detection of patterns, the cues that support word segmentation, and the scalability of word segmentation tasks to a previously untested dimension of natural learning conditions.
Experiment 1
Methods
Participants
Participants were 60 native English-speaking adults (mean = 22.0 years, range = 18–33 years). All participants reported normal hearing and no significant medical or otologic history. Participants completed an informed consent process prior to participation and were compensated for their time. Five additional participants were tested but excluded from analyses due to equipment malfunction (2), the presence of tinnitus (1), previous exposure to the stimuli (1), and performing more than 3 standard deviations below the mean on the word segmentation task (1). All procedures were approved by the Institutional Review Board of Northwestern University.
Stimuli
A native English-speaking female recorded 24 CV syllables. Twelve syllables (/bi/, /bu/, /da/, /do/, /go/, /ku/, /la/, /pa/, /pi/, /ro/, /ti/, /tu/) were concatenated to generate a pause-free, monotone, artificial speech stream. The same speech stimuli were used in Lew-Williams and Saffran (2012). To maintain natural coarticulation, each syllable was recorded in the middle of a three-syllable sequence, in every possible coarticulation context. Middle syllables were spliced using Praat (Boersma & Weenink, 2009) to generate four trisyllabic nonsense words (Table 1) that were repeated in quasi-random order to create a continuous speech stream with consistent rate (3.1 syllables/second) and pitch (F0 = 196 Hz). Successive syllables within the four trisyllabic words had TPs of 1.0. Two of the words were high-frequency words, appearing twice as often as the two low-frequency words (70 vs. 35 tokens, respectively). The concatenated speech stream consisted of 210 words, lasted 3 minutes 15 seconds, and contained no acoustic cues to word boundaries. The test stimuli were the two low-frequency words from the familiarisation phase (TP = 1.0); two frequency-matched part-words, consisting of the last syllable of one high-frequency word and the first two syllables of the other high-frequency word (TP = 0.5); and two trisyllabic non-words, consisting of syllables that were used in the familiarisation language but never co-occurred (tirodo, robaku, lagupi, dolati; TP = 0). There were two counterbalanced artificial languages, such that each test item was a word for half of participants and a part-word for the other half (Table 1). The remaining 12 syllables (/bo/, /du/, /ga/, /gu/, /ka/, /ki/, /li/, /lo/, /po/, /ri/, /ru/, /ta/) were included to ensure that each consonant and vowel occurred an equal number of times during the syllable recognition task, which contained all 24 syllables.
Table 1.
Word segmentation task.
| Familiarisation | | Test | |
|---|---|---|---|
| Language 1 | Language 2 | Language 1 | Language 2 |
| pabiku^a | tudaro^a | pabiku^c | pabiku^d |
| tibudo^a | pigola^a | tibudo^c | tibudo^d |
| golatu^b | bikuti^b | tudaro^d | tudaro^c |
| daropi^b | budopa^b | pigola^d | pigola^c |

Note: Pronunciation: /a/ = “ah”; /i/ = “ee”; /o/ = “oh”; /u/ = “oo”.
^a Low-frequency. ^b High-frequency. ^c Word. ^d Part-word.
Noise-band vocoding
Spectral degradation of the speech stimuli was accomplished by noise-band vocoding using TigerCIS software (publicly available). Noise-band vocoding provides a way to systematically vary the amount of spectral information by specifying the number of independent frequency channels while preserving the slowly varying temporal and amplitude features of the speech waveform. To create the noise-band vocoded stimuli for the present study, the auditory stimuli were pre-filtered to include frequencies between 200 and 7000 Hz and then divided into 16 or 8 independent frequency channels using the Greenwood function, which approximates the frequency distribution of the basilar membrane of the inner ear (Greenwood, 1990). The selection of the 16-ch and 8-ch conditions was based on both prior work and extensive pilot testing. Prior work suggests that normal-hearing adults can encode speech with as few as four spectral channels (Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995), but such studies often involve extended training and speech materials that are already known to the listener. Pilot testing was consistent with this limitation: listeners were unable to segment sequences from an artificial language with a spectral fidelity of four channels. Better performance was observed with eight channels, which also approximates the average spectral resolution of most cochlear implant users. A 16-ch condition was included to provide more spectral fidelity, short of the fine structure that improves speech perception. Within each frequency channel (16 or 8), the temporal envelope was extracted using half-wave rectification and low-pass filtering at 400 Hz (24 dB/octave slope), a process that strips the temporal fine structure from the signal. The extracted envelope from each channel was then multiplied by bandpass noise with the same frequency bandwidth as the frequency channel. Finally, the channel-specific outputs were summed and converted to a digital signal.
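The study used TigerCIS for this processing; purely as an illustration of the pipeline just described, the sketch below implements a comparable noise-band vocoder in R with the signal package. The sampling rate, filter orders, and level matching are our own assumptions rather than TigerCIS settings; only the pre-filter range (200 and 7000 Hz), Greenwood spacing, 400-Hz envelope cutoff, and channel counts follow the description above.

```r
## Minimal noise-band vocoder sketch (illustrative only; the study used TigerCIS).
library(signal)   # butter(), filtfilt()

greenwood     <- function(x) 165.4 * (10^(2.1 * x) - 0.88)   # cochlear position -> Hz
greenwood_inv <- function(f) log10(f / 165.4 + 0.88) / 2.1   # Hz -> cochlear position

vocode <- function(x, fs, n_channels = 8, f_lo = 200, f_hi = 7000) {
  # Greenwood-spaced channel edges between the pre-filter limits
  edges <- greenwood(seq(greenwood_inv(f_lo), greenwood_inv(f_hi),
                         length.out = n_channels + 1))
  nyq     <- fs / 2
  carrier <- runif(length(x), -1, 1)               # broadband noise carrier
  out     <- numeric(length(x))
  for (ch in seq_len(n_channels)) {
    bp   <- butter(2, c(edges[ch], edges[ch + 1]) / nyq, type = "pass")
    band <- filtfilt(bp, x)                        # analysis band
    # Half-wave rectify, then low-pass at 400 Hz to extract the temporal envelope
    # (a low filter order is used here for simplicity; the study reports 24 dB/octave)
    env  <- filtfilt(butter(2, 400 / nyq, type = "low"), pmax(band, 0))
    # Modulate band-limited noise with the envelope and roughly match the band level
    mod  <- env * filtfilt(bp, carrier)
    out  <- out + mod * sqrt(mean(band^2)) / max(sqrt(mean(mod^2)), .Machine$double.eps)
  }
  out / max(abs(out))                              # normalise the summed channels
}

# e.g. stream_8ch <- vocode(stream, fs = 44100, n_channels = 8)   # hypothetical usage
```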
Procedure
Participants were randomly assigned to one of three listening conditions: unprocessed speech, 16-ch noise-band vocoded speech, or 8-ch noise-band vocoded speech. In each listening condition, participants performed a word segmentation task and a syllable recognition task. Participants were seated at a computer, in front of Genelec 8030A loudspeakers.
Word segmentation
Before the task, participants were told to listen to a “set of sounds” presented through the loudspeakers. During familiarisation to the artificial language, speech was presented at 65 dB SPL, and the text, “Listen to the sound clip”, was presented on the computer monitor. The artificial speech stream was repeated twice for a total of 6 minutes 30 seconds. At test, participants heard a pair of isolated trisyllabic words, presented sequentially with 500 milliseconds of silence between each word, and had 3 seconds to press a button indicating the word that they perceived to be from the artificial language. In a two-alternative forced-choice task, participants were tested on words vs. non-words (12 trials) and part-words vs. non-words (12 trials), for a total of 24 trials presented in two blocks of 12 trials, with trial order counterbalanced. In several previous investigations of word segmentation using artificial languages, participants were tested directly on words vs. part-words. However, pilot testing in both the unprocessed and degraded listening conditions revealed that participants were consistently at chance performance when these two word types were directly compared. Thus, in our forced-choice test phase, words were tested against non-words, and part-words were tested against non-words, thereby assessing whether participants could discriminate sequences that had vs. had not been heard previously. Trials were quasi-randomly presented, with a 1-min break between each block of 12 test trials. This design is consistent with published methods that revealed successful word segmentation from clear speech in normal-hearing adults (Saffran, Newport, & Aslin, 1996).
Syllable recognition task
The computer screen displayed a custom graphical user interface with an 8 × 3 grid of pushbuttons designed in Matlab. Each pushbutton was assigned a label corresponding to one of the 24 syllables (see Stimuli). On each trial, participants heard a syllable and were asked to select the pushbutton that corresponded to the syllable. Presentation order was randomised without replacement. Each syllable was presented twice, resulting in 48 trials.
Statistical analysis
Accuracy on each trial in the word segmentation and syllable recognition tasks was binary (correct and incorrect responses were coded as 1 and 0, respectively) and is reported as percent correct in the Results. Logistic mixed effects modelling using the lme4 package in R (R Core Team, 2012) was used to statistically evaluate accuracy, the dependent variable, in each task. For the word segmentation task, the fixed effects included condition (categorical variable: unprocessed, 16-ch, 8-ch) and word type (categorical variable: part-words, words). The random effects structure was designed to account for variability associated with participants and test items across conditions. Specifically, we included random intercepts for participants and test items, as well as random slopes of condition and word type for test items (N = 8; 4 test sequences × 2 non-word competitors). For the syllable recognition task, condition was the only fixed effect, and the random effects structure included random intercepts for participants and test items as well as random slopes of condition for test items (N = 24). In each model, condition was coded to test successive differences using the contr.sdif function: the first contrast compared the unprocessed and 16-ch conditions; the second contrast compared the 16-ch and 8-ch conditions.
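For concreteness, the sketch below shows how models with this structure could be specified in R with lme4; the file names and column names (seg, syl, accuracy, condition, word_type, participant, item) are placeholders of ours, not the authors’ code.

```r
## Sketch of the models described above; file and column names are placeholders.
library(lme4)   # glmer()
library(MASS)   # contr.sdif() for successive-difference contrasts

seg <- read.csv("segmentation_trials.csv")   # hypothetical trial-level data
seg$condition <- factor(seg$condition, levels = c("unprocessed", "16ch", "8ch"))
seg$word_type <- factor(seg$word_type, levels = c("part-word", "word"))
contrasts(seg$condition) <- contr.sdif(3)    # unprocessed vs. 16-ch, then 16-ch vs. 8-ch

# Word segmentation: by-participant intercepts; by-item intercepts and slopes
m_seg <- glmer(accuracy ~ condition + word_type +
                 (1 | participant) + (1 + condition + word_type | item),
               data = seg, family = binomial)
summary(m_seg)

# Syllable recognition: condition is the only fixed effect (24 syllable items)
syl <- read.csv("syllable_trials.csv")       # hypothetical trial-level data
syl$condition <- factor(syl$condition, levels = c("unprocessed", "16ch", "8ch"))
contrasts(syl$condition) <- contr.sdif(3)
m_syl <- glmer(accuracy ~ condition + (1 | participant) + (1 + condition | item),
               data = syl, family = binomial)
```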
Results and discussion
Participants were tested on their word segmentation ability by selecting familiar vs. novel trisyllabic words. On average, participants who were exposed to the speech stream with full spectral representation (unprocessed condition) distinguished familiar from novel words at accuracies that were significantly above chance performance (unprocessed: 71.4% ± 4.9%, mean ± SE; chance = 50%, t[19] = 4.4, p < .001; Figure 1(A)). Mean accuracy was not statistically different from chance performance in the 16-ch condition (59% ± 5%; t[19] = 1.8, p = .09) or the 8-ch condition (55.2% ± 3.7%; t[19] = 1.4, p = .18). The results of logistic mixed effects modelling showed that participants in the unprocessed condition segmented words statistically better than participants in the 16-ch condition (β = −0.68, z = −2.05, p < .05). Participants in the 16-ch and 8-ch conditions did not differ in their word segmentation ability (β = −0.18, z = −0.54, p = .59). Across the conditions, participants segmented the two word types, words and part-words, equivalently (β = 0.10, z = 0.68, p = .49), suggesting that higher TPs within the trisyllabic sequences (i.e. TP = 1 vs. TP = 0.5) did not confer an additional benefit for segmentation (Figure 2(A)).
Figure 1.
Mean (±SE) accuracy of word segmentation. Accuracy is defined as the percent of correctly selected syllable sequences from the artificial language on a two-alternative forced-choice (2-AFC) task. 16-channel: 16-channel noise-band vocoded; 8-channel: 8-channel noise-band vocoded. The dotted line represents chance performance.
Figure 2.
Mean (±SE) accuracy for segmenting words (dark grey) and part-words (light grey). Accuracy is defined as the percent of correctly selected syllable sequences from the artificial language on a two-alternative forced-choice (2-AFC) task. The dotted line represents chance performance.
Consistent with prior work (e.g. Shannon et al., 1995), syllable recognition varied systematically with spectral fidelity (Figure 3(A)). The results of the logistic mixed effects model showed that participants in the unprocessed condition identified significantly more syllables than participants in the 16-ch condition (β = −1.94, z = −3.1, p = .002). Additionally, participants in the 16-ch condition identified a significantly greater number of syllables than participants in the 8-ch condition (β = −2.59, z = −6.18, p < .001).
Figure 3.
Mean (±SE) accuracy of syllable recognition. Accuracy is defined as the percent of correctly selected syllables on a 24-AFC task.
Central to our investigation is whether syllable recognition supports word segmentation. To statistically test the relation between participants’ syllable recognition and word segmentation, Pearson correlations were computed within each listening condition. Results showed that participants’ ability to recognise individual syllables did not statistically relate to their ability to segment words from the artificial language (unprocessed: Pearson’s R = −.124, p = .60; 16-ch: Pearson’s R = .12, p = .62; 8-ch: Pearson’s R = −.16, p = .50). In the unprocessed condition, participants were at ceiling performance on the syllable recognition task, and all but 3 (17/20) participants segmented words from the artificial language at percentages that were statistically greater than chance performance. Participants in the 8-ch condition showed poor accuracy in both syllable recognition and word segmentation. In contrast, performance on the two tasks diverged in the 16-ch condition: although the majority of participants recognised syllables accurately (e.g. 16/20 participants had accuracies of >87.5%), they were unsuccessful at segmenting words from the speech stream. These results highlight the fact that although individual syllable recognition is necessary for word segmentation (as evidenced in the 8-ch condition), it is not sufficient. Additionally, the results raise the possibility that impaired word segmentation under degraded conditions may reflect not a failure to recognise individual syllables, but rather an inability to track the statistics of the artificial language because of the cognitive load imposed by listening to degraded auditory input (e.g. Mattys et al., 2012; Rönnberg et al., 2013; Wild et al., 2012).
The objective of Experiment 1 was to determine whether word segmentation is dependent on spectral fidelity. The results from the noise-band vocoded conditions suggest that spectral degradation disrupts adults’ abilities to segment words from contiguous speech. This result provides the first evidence that the ability to track syllable sequences is impaired in degraded spectral conditions, even in the presence of intact syllable recognition. The dissociation between “low-level” syllable recognition and “higher-level” word segmentation in the 16-ch condition suggests that the locus of difficulty when listening to degraded speech lies in tracking units over time, not in recognising individual syllables. The results also raise the possibility that the cognitive processes involved in resolving degraded speech overlap with those involved in segmenting recurring syllable sequences. This finding is relevant to young children with hearing loss, whose performance on clinical tests of speech perception may not reflect their ability to track novel syllable sequences in natural discourse. Ultimately, this may be a source of individual variability in spoken language outcomes in this population, because individual differences in word segmentation contribute to variability in processing of higher-order structures in language (Misyak & Christiansen, 2012).
There are two primary ways of interpreting the findings from Experiment 1. One possibility is that adults are unable to segment words from degraded speech. This is unlikely, as adults have been shown to adapt to vocoded speech over time and demonstrate successful learning (Hervais-Adelman, Davis, Johnsrude, Taylor, & Carlyon, 2011). Alternatively, adults may find it taxing to segment words from degraded speech, but be able to rely on other features of natural speech in the presence of degraded input to support the detection of structure in language. One such feature is the presence of isolated words, which provide a salient cue to word boundaries in natural speech. In Experiment 2, we inserted silent pauses before and after a subset of tokens of low-frequency words in the speech stream, thus providing a prosodic/temporal cue that could facilitate successful segmentation.
Experiment 2
Methods
Participants
Participants were 60 native English-speaking adults (mean = 22.4 years, range = 18–34 years). Inclusion criteria and consent procedures were consistent with those described in Experiment 1.
Stimuli
Experiment 2 utilised the same artificial speech stream as Experiment 1. After generating the pause-free speech stream, 20% of the low-frequency sequences (i.e. TP = 1.0; sequences that served as “words” in the test phase) were preceded and followed by 500-millisecond silent pauses. This provided overt word boundaries (i.e. isolated words) for a subset of sequences (N = 14/70 words) in the stream. The targeted low-frequency words were quasi-randomly selected to ensure that they were distributed throughout the entire language, thus providing no consistent rhythmic pattern. The rationale for isolating low-frequency words (vs. other sequences in the speech stream) was to test if listeners could use their successful segmentation of the low-frequency words to segment other sequences from the speech stream. Alternatively, segmentation of low-frequency words alone would suggest that any benefits of this temporal cue were specific to the targeted sequences.
A range of pause lengths has been used in prior research on word segmentation (e.g. Lew-Williams et al., 2011; Peña, Bonatti, Nespor, & Mehler, 2002; Saffran & Thiessen, 2003), and in creating our stimuli, we selected 500 milliseconds as the optimal length for acoustically demarcating the low-frequency words. With the inclusion of pauses, the duration of the concatenated speech stream was 3 minutes 31 seconds, which is 16 seconds longer than the speech stream in Experiment 1. The speech stream was repeated twice for a total duration of 7 minutes 2 seconds.
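Purely for illustration, the sketch below shows one way such silences could be spliced into a digitised stream, assuming the onset and offset sample indices of the to-be-isolated word tokens are already known; the function, argument names, and sampling rate are hypothetical rather than taken from the authors’ procedure.

```r
## Hypothetical sketch: inserting 500-ms silences around selected word tokens.
insert_pauses <- function(x, fs, onsets, offsets, pause_dur = 0.5) {
  gap    <- numeric(round(pause_dur * fs))   # 500 ms of digital silence
  pieces <- list()
  prev   <- 1
  for (k in seq_along(onsets)) {
    before <- if (onsets[k] > prev) x[prev:(onsets[k] - 1)] else numeric(0)
    # pause before the token, the token itself, then a pause after it
    pieces <- c(pieces, list(before, gap, x[onsets[k]:offsets[k]], gap))
    prev   <- offsets[k] + 1
  }
  tail_x <- if (prev <= length(x)) x[prev:length(x)] else numeric(0)
  unlist(c(pieces, list(tail_x)))
}

# e.g. stream_iso <- insert_pauses(stream, fs = 44100, onsets, offsets)
```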
Procedure
Participants were randomly assigned to one of the three conditions: unprocessed, 16-ch, and 8-ch. The word segmentation and syllable recognition tasks were identical to those described in Experiment 1.
Statistical analysis
Analyses were similar to those used in Experiment 1.
Results and discussion
As in Experiment 1, participants were tested on their ability to recognise familiar vs. novel trisyllabic words. Participants segmented words from the artificial language at levels that were statistically above chance performance in the unprocessed condition (77.9% ± 4.4%, t[19] = 6.4, p < .001). Unlike Experiment 1, however, participants also segmented words at levels that were statistically above chance in both the 16-ch condition (78.1% ± 3.6%, t[19] = 7.7, p < .001) and the 8-ch condition (63.5% ± 3.4%, t[19] = 4.0, p < .001; Figure 1(B)). The results of the logistic mixed effects modelling showed that participants in the 16-ch condition segmented the artificial language as well as participants in the unprocessed condition (β = −0.09, z = −0.22, p = .82), but better than participants in the 8-ch condition (β = −1.1, z = −2.8, p < .01). Finally, across conditions, participants had statistically better segmentation of words than part-words (β = −2.2, z = −5.5, p < .001; Figure 2(B)). Taken together, the results suggest that pauses, resulting in a small number of isolated words, supported word segmentation from spectrally degraded speech, but were mainly beneficial for segmenting the specific trisyllabic sequences that they bordered in the speech stream (i.e. words), as opposed to the other sequences. Consistent with this idea, participants in the 16-ch and 8-ch conditions segmented words, but not part-words, at levels that were statistically above chance (16-ch: words = 93.8% ± 1.8%, t[19] = 24.3, p < .001 and part-words = 62.5% ± 6.5%, t[19] = 1.9, p = .07; 8-ch: words = 80% ± 3.7%, t[19] = 8.0, p < .001 and part-words = 47.1% ± 5.1%, t[19] = −0.57, p = .57; Figure 2(B)).
Next, we tested whether participants in Experiment 2 performed differently from those in Experiment 1 by subjecting participants’ accuracy on the word segmentation task to logistic mixed effects modelling. We were specifically interested in understanding the nuanced differences in performance across the three conditions of each experiment. Therefore, the model included the interactions between the fixed effects of experiment (categorical variable: Experiment 1, Experiment 2), condition, and word type. The random effects structure included intercepts of participants and test items as well as slopes of condition and word type for test items. The fixed effects of condition and word type were contrasted as described previously. The results revealed three significant interactions. First, there was an experiment by word type interaction (β = 2.1, z = 10.1, p < .001), suggesting that the difference in participants’ accuracy for words and part-words was larger in Experiment 2 than in Experiment 1. Second, there was a word type by condition contrast (16-ch vs. 8-ch) interaction (β = 0.72, z = 2.5, p < .05), suggesting that the difference in accuracy for words and part-words was larger for the 16-ch condition than for the 8-ch condition. Finally, there was a three-way interaction between experiment, word type, and condition contrast (16-ch vs. 8-ch; β = −1.5, z = −3.0, p < .01), suggesting that the interaction between word type and the 16-ch vs. 8-ch contrast was smaller in Experiment 2. Taken together, the logistic mixed effects model supports our prediction: silent pauses facilitated word segmentation from vocoded speech. This effect was larger for words than for part-words, and larger in the 16-ch than in the 8-ch condition.
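A sketch of how this cross-experiment model could be specified, continuing the placeholder naming used in the sketch of the Experiment 1 analysis (the combined data file and column names are assumed):

```r
## Sketch of the cross-experiment comparison; file and column names are placeholders.
library(lme4)
library(MASS)

both <- read.csv("exp1_exp2_trials.csv")    # hypothetical combined trial-level data
both$experiment <- factor(both$experiment)  # Experiment 1 vs. Experiment 2
both$condition  <- factor(both$condition, levels = c("unprocessed", "16ch", "8ch"))
contrasts(both$condition) <- contr.sdif(3)  # successive-difference contrasts, as above

m_cross <- glmer(
  accuracy ~ experiment * condition * word_type +
    (1 | participant) + (1 + condition + word_type | item),
  data = both, family = binomial
)
summary(m_cross)   # experiment x word type and three-way interactions reported above
```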
The improved segmentation of words in the 16-ch and 8-ch conditions of Experiment 2 occurred, however, in the absence of improved recognition of syllables. Specifically, participants in the unprocessed condition identified a greater proportion of syllables than participants in the 16-ch condition (β = −3.3, z = −2.2, p < .05), and participants in the 16-ch condition identified a greater proportion of syllables than participants in the 8-ch condition (β = −2.84, z = −5.2, p < .001; Figure 3(B)). Finally, syllable recognition did not predict word segmentation (unprocessed: Pearson’s R = −.14, p = .55; 16-ch: Pearson’s R = .22, p = .34; 8-ch: Pearson’s R = −.19, p = .43), consistent with the idea that decoding individual speech sounds is not the sole determinant of word segmentation.
Results from Experiment 2 suggest that intermittent inclusion of silent pauses, which effectively served as overt word boundaries, improved participants’ abilities to segment the speech stream. We observed differential improvement in the segmentation of words over part-words, in both the 16-ch and 8-ch noise-band vocoded conditions. In both cases, participants successfully recognised the units that had been bookended with pauses during familiarisation, that is, low-frequency words, but they did not recognise part-words at above-chance levels. The results generally support the main interpretation of Experiment 1, namely that noise-band vocoding impairs the processes involved in tracking the structure of the input, even in the presence of intact syllable recognition. But the results of Experiment 2 also indicate that the inclusion of brief pauses in an otherwise continuous artificial speech stream supported segmentation. This may have resulted from the break in speech provided by isolated words themselves. However, it could also have resulted from the fact that the speech stream was slightly slower overall in Experiment 2 than in Experiment 1, which provided the participants with more time to track syllable patterns. To investigate these ideas further, we evaluated segmentation using a more pervasive temporal modification to the speech stream. In Experiment 3, the artificial language was slowed down by 33%. By reducing the rate of syllable presentation in the artificial speech stream, we investigated whether slower speech would support adults’ tracking of its inherent structure.
Experiment 3
Methods
Participants
Participants were 60 native English-speaking adults (mean = 22.0 years; range = 19–34 years). Inclusion criteria and consent procedures were consistent with those described in Experiments 1 and 2.
Stimuli
Experiment 3 used the same artificial speech stream as the previous experiments. The pause-free speech stream was slowed by 33%, with no change in pitch, using Adobe Audition® (2.6 syllables/second; Liu & Zheng, 2006). To assess whether this manipulation had equivalent effects on different syllables, we compared the durations of 10 samples of each syllable type in Language 1 across the original and slowed speech streams. The mean percent change was 33.0% ± 0.21% (mean ± SD, range: 32.7% to 33.5%). The duration of the speech stream was 4 minutes 19 seconds, which is 1 minute 4 seconds longer than the speech stream in Experiment 1. The speech stream was repeated twice for a total duration of 8 minutes 38 seconds.
Procedure
Participants were randomly assigned to one of the three conditions: unprocessed, 16-ch, and 8-ch. The word segmentation and syllable recognition tasks were identical to those described in Experiment 1.
Results and discussion
As in Experiments 1 and 2, participants were tested on their ability to recognise familiar vs. novel words. Participants segmented words from the artificial language at levels that were statistically greater than chance performance in the unprocessed condition (77% ± 1.9%, t[19] = 6.1, p < .001) but not in the 16-ch or 8-ch conditions (16-ch: 61.2% ± 5.7%, t[19] = 2.0, p = .06; 8-ch: 56.0% ± 4.8%, t[19] = 1.2, p = .22; Figure 1(C)). The results of the logistic mixed effects modelling showed that participants in the unprocessed condition segmented sequences from the artificial language statistically better than participants in the 16-ch condition (β = −1.07, z = −2.1, p < .05), and participants in the 16-ch and 8-ch conditions did not differ in their word segmentation abilities (β = −0.32, z = −0.71, p = .48; Figure 1(C)). Participants also segmented words and part-words equally well (β = 0.16, z = 0.63, p = .53; Figure 2(C)).
The pattern of results in Experiment 3 does not mirror that of Experiment 2, suggesting that slowing the speech rate, a more pervasive temporal modification, is a weaker cue than silent pauses for segmenting words. To evaluate this idea statistically, participants’ accuracy across Experiments 2 and 3 was evaluated with a logistic mixed effects model that included the interactions between the fixed effects of experiment (categorical variable: Experiment 2, Experiment 3), condition, and word type. The random effects structure included intercepts of participants and test items as well as slopes of condition and word type for test items. The fixed effects of condition and word type were contrasted as described previously. Two statistically significant interactions were identified. First, there was an experiment by word type interaction (β = −2.1, z = −9.8, p < .001), suggesting that the difference in accuracy for words and part-words was smaller in Experiment 3 than in Experiment 2. Second, there was a three-way interaction between experiment, word type, and condition contrast (16-ch vs. 8-ch; β = 1.2, z = 2.4, p < .05), suggesting that the interaction between word type and the 16-ch vs. 8-ch contrast was larger in Experiment 3. Thus, this model is consistent with our interpretation that participants’ word segmentation from vocoded speech was facilitated only for the trisyllabic sequences that were flanked by silent pauses in Experiment 2, and not by a more universal slowing of the speech stream.
Finally, the results of the logistic mixed effects modelling of the syllable recognition task in Experiment 3 were consistent with what was observed in Experiments 1 and 2. Specifically, participants in the unprocessed condition identified a greater proportion of syllables than participants in the 16-ch condition (β = −3.6, z = −2.6, p < .01), and participants in the 16-ch condition identified a greater proportion of syllables than participants in the 8-ch condition (β = −2.14, z = −5.1, p < .001; Figure 3(C)). Syllable recognition did not predict word segmentation (unprocessed: Pearson’s R = −.42, p = .06; 16-ch: Pearson’s R = −.03, p = .91; 8-ch: Pearson’s R = .05, p = .82).
General discussion
The present studies tested the hypotheses that spectral degradation interferes with word segmentation and that salient temporal cues can restore successful segmentation. Results from three experiments affirm these hypotheses. In Experiment 1, participants segmented words from unprocessed, acoustically rich speech in an artificial language, but they did not segment words with above-chance performance when speech was filtered into either 16 or 8 spectral channels. In the 16-ch condition, they failed to segment despite robust skill in recognising individual syllables, suggesting that resolution of speech units did not account for impaired word segmentation. In Experiment 2, providing intermittent silent pauses within the speech stream (akin to overt word boundaries) aided segmentation in the two vocoded conditions. This contrasts with results from Experiment 3, which showed that slowing the rate of speech – a pervasive temporal modification – did not improve segmentation of vocoded speech. Alongside participants’ consistent accuracy in recognising individual syllables across the three 16-ch conditions, the findings of Experiments 1 and 3 indicate that vocoding disrupts the ability to track relations between syllables. The findings of Experiment 2 suggest that salient supportive cues – in this case, isolated words and/or the pauses that flank them – can partially restore successful segmentation of degraded speech.
This investigation provides new insight into our understanding of how degraded input disrupts listeners’ processing of language. Previous research has documented that acoustically degraded speech affects adults’ use of semantic cues to predict subsequent words (Sohoglu, Peelle, Carlyon, & Davis, 2012; Winn, 2016), as well as adults’ word recognition and later memory for successfully heard words (McCoy et al., 2005). Recognition of known words is also slower and less accurate in two-year-old children who use cochlear implants (Grieco-Calub, Saffran, & Litovsky, 2009). Here, we begin to uncover the effects of degraded input on a different – and foundational – process in language learning: the tracking of patterns between sounds and syllables across time. Even when adults’ recognition of individual syllables was intact, their segmentation of word-like units was less successful when listening to vocoded speech relative to unprocessed speech. Given that word segmentation feeds into other aspects of language learning, such as word learning (Estes, Evans, Alibali, & Saffran, 2007), word recognition (Lany, Shoaib, Thompson, & Graf Estes, 2016), and reading (Arciuli & Simpson, 2012; Spencer, Kaschak, Jones, & Lonigan, 2015), difficulty at the level of detecting recurring syllable sequences in degraded speech could give rise to the disrupted language processes documented in previous research.
Relatedly, our investigation – particularly Experiment 2 – contributes to what is currently known about the component processes involved in breaking into the structure of speech. A range of cues can be used by adult and/or infant listeners to help solve the segmentation problem, including transition statistics (Aslin et al., 1998; Graf Estes & Lew-Williams, 2015), isolated words (Lew-Williams et al., 2011), phrasal or sentence-level prosody (Johnson, Seidl, Tyler, & Berwick, 2014; Shukla, Nespor, & Mehler, 2007; Soderstrom, Seidl, Nelson, & Jusczyk, 2003), sentence position (Seidl & Johnson, 2006), language-specific patterns of word formation (Brent & Cartwright, 1996; Toro, Pons, Bion, & Sebastián-Gallés, 2011), lexical stress (Houston, Jusczyk, Kuijpers, Coolen, & Cutler, 2000; Johnson & Jusczyk, 2001; Thiessen & Saffran, 2003), and redundant visual cues (Cunillera, Càmara, Laine, & Rodríguez-Fornells, 2010a). Researchers are only beginning to understand how these cues to structure, separately or in combination, optimally support word segmentation across the lifespan. Here, we make progress in understanding how isolated words factor into adults’ abilities to segment words when acoustical cues are less reliable.
Previous data have revealed that interspersed isolated words improve the ability to discover structure in otherwise fluent speech (Lew-Williams et al., 2011), and that the presence of isolated words uniquely predicts toddlers’ later production of those words (Brent & Siskind, 2001). The data presented in Experiment 2 reveal that isolated words help listeners segment spectrally degraded speech in an artificial language, evidenced primarily by the overall improvement in accuracy. There are two primary explanations for this general enhancement in segmentation. First, the insertion of pauses in the speech stream could act as a temporal cue that gives listeners incrementally more time to process incoming speech. However, the pervasive temporal modification used in Experiment 3 renders this possibility unlikely, because, as in Experiment 1, accuracy in segmentation was not above chance in the vocoded conditions. Therefore, a second and more likely possibility is that isolated words – and the pauses that surrounded them – served as salient “anchors” that provided clear exemplars of recurring units in the speech stream (Cunillera et al., 2010b, 2016). Various investigations have proposed that the perceptual salience of familiar units, such as a high-frequency word or an isolated word, can help listeners segment sequences from the speech stream (Bortfeld et al., 2005; Dahan & Brent, 1999; Lew-Williams et al., 2011; Mersad & Nazzi, 2012). In Experiment 2, hearing occasional pauses before and after low-frequency words during familiarisation (which then served as “word” items during test) may have provided listeners with clear exemplars of the relevant units to track over time. Notably, in both of the vocoded conditions, listeners showed above-chance performance for word items but not for part-word items. This suggests that isolated words supported adults’ segmentation of the particular units in the speech stream that were bookended by silent pauses, but not their neighbours. By studying vocoded speech, we show for the first time that pauses act as a compensatory cue for segmenting words in adverse listening conditions, especially in speech that maintains a minimum level of spectral fidelity.
The idea that isolated words supported learning by providing the listener with clear exemplars of the target sequences, rather than providing a pure temporal cue, is supported by the main result of Experiment 3: that slowing the presentation rate of the familiarisation language was insufficient to restore word segmentation from vocoded speech. This finding contrasts with results from Palmer and Mattys (2016), which showed improvements in segmentation when the speech stream was slowed. A plausible explanation for this inconsistency is that listeners may only be able to take advantage of slower speech under clear, unencumbered listening conditions. The results from Experiment 3 suggest that the burden of encoding spectrally degraded input may outweigh any benefits afforded by the presence of slower speech. These findings can be used as a springboard for future work investigating how real-time processing of incoming perceptual input interacts with a range of adverse listening conditions.
By studying word segmentation from noise-band vocoded speech, our three experiments provide insight into the difficulties facing hearing-impaired individuals with cochlear implants. There has been growing interest in understanding the scalability of segmentation tasks to natural learning conditions (e.g. Frank, Tenenbaum, Gibson, & Snyder, 2013; Graf Estes & Lew-Williams, 2015; Lew-Williams & Saffran, 2012; Pelucchi, Hay, & Saffran, 2009), including investigations of how learners process speech in the presence of background noise (Fu & Nogaki, 2005; Mattys, 2004; McMillan & Saffran, 2016). While listeners in quiet conditions can exploit transition statistics and many of the cues listed above, listeners in noisy conditions and listeners with limited access to acoustic hearing are less able to recruit this suite of cues. Mattys (2004) showed that listeners’ ability to rely on stress and co-articulation cues when segmenting speech units from a real language varies depending on the presence or absence of background noise. Analogous to our simulations of noise-band vocoded speech, hearing loss and hearing devices obscure many natural acoustic cues, either by rendering them inaudible or, in the case of cochlear implants, by removing the fast spectral transitions that support phoneme recognition. By embracing the variability in signal quality that is inherent in various naturalistic speech contexts, our investigation begins to unravel how diverse groups of listeners break into structure over time and become proficient listeners.
In order to understand the scalability of our experiments to individuals with hearing loss, who have varying levels of experience processing degraded speech, it is important to consider two nuances of the experimental design. First, the laboratory simulation of word segmentation used in our experiments was modelled on previous studies of statistical language learning, which used approximately 3 minutes of exposure to artificial speech (Aslin et al., 1998; Lew-Williams & Saffran, 2012). Here, exposure to the speech stream (in the absence of temporal modifications) was somewhat longer: 6 minutes 30 seconds. While this was a sufficient duration of exposure for adults to identify recurring syllable sequences in the unprocessed condition, adults in the vocoded conditions may have needed additional exposure time to successfully segment the syllable sequences. Consistent with this idea, perceptual adaptation to noise-band vocoded speech occurs with increased exposure (e.g. Davis, Johnsrude, Hervais-Adelman, Taylor, & McGettigan, 2005; Hervais-Adelman, Davis, Johnsrude, & Carlyon, 2008). Thus, future investigations will need to manipulate exposure time and stimulus complexity in order to understand the scalability of our findings to natural listening conditions. Perceptual adaptation to degraded speech over the course of minutes – or even years – may decrease the processing burden on hearing-impaired listeners, enabling increasingly successful tracking of sounds and syllables in tandem with increases in language exposure.
A second limitation of the experimental design is that we did not assess how “passive” vs. “active” listening may have contributed to adults’ success or failure in word segmentation. While the forced-choice test was unquestionably an active task, we do not know whether participants listened to the speech stream in a passive or active manner during familiarisation. Key to this uncertainty is that the cognitive demands of listening to 16-ch or 8-ch vocoded speech may automatically require more “active” listening relative to unprocessed speech, because listeners have reduced access to the sounds comprising the speech stream. Moreover, participants’ anticipation of a test following the familiarisation phase may have led them to engage actively with the incoming speech stream. Regardless, the laboratory context does not reflect language use in real-time communication, which engages a complex combination of active and passive processes (Batterink & Neville, 2013; Morgan-Short, Steinhauer, Sanz, & Ullman, 2012; Norris & Ortega, 2000). We do know that passive exposure in infancy is sufficient for successful word segmentation in lab tasks (Aslin et al., 1998; Graf Estes & Lew-Williams, 2015), but it remains unclear whether passive exposure is sufficient for word segmentation from degraded speech.
In summary, our experiments show that adults fail to segment sequences from spectrally degraded speech, and suggest possible means through which they can recover the ability to do so, namely when provided with salient cues such as isolated words. We conclude that listeners’ ability to discover patterns in speech is influenced by the spectral fidelity of the incoming input and by the availability of additional cues that support segmentation. Exploring how listeners find structure in non-optimal listening conditions is central to understanding the nature of language learning, both in typical populations and in listeners with impaired access to the acoustic subtleties of speech.
Acknowledgments
We would like to thank Paul Reinhart, Valerie Tiscareno, and Klinton Bicknell for invaluable help with experiment design, data collection, and data analysis.
Funding
This work was supported by the American Hearing Research Foundation and the National Institute of Child Health and Human Development (R03HD079779).
Footnotes
Disclosure statement
No potential conflict of interest was reported by the authors.
References
- Arciuli J, Simpson IC. Statistical learning is related to reading ability in children and adults. Cognitive Science. 2012;36:286–304. doi: 10.1111/j.1551-6709.2011.01200.x. [DOI] [PubMed] [Google Scholar]
- Aslin RN, Saffran JR, Newport EL. Computation of conditional probability statistics by 8-month-old infants. Psychological Science. 1998;9:321–324. doi: 10.1111/1467-9280.00063. [DOI] [Google Scholar]
- Aslin RN, Woodward JZ, LaMendola NP, Bever TG. Models of word segmentation in fluent maternal speech to infants. Signal to Syntax: Bootstrapping from Speech to Grammar in Early Acquisition. 1996:117–134. [Google Scholar]
- Batterink L, Neville H. Implicit and explicit second language training recruit common neural mechanisms for syntactic processing. Journal of Cognitive Neuroscience. 2013;25(6):936–951. doi: 10.1162/jocn_a_00354. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Boersma P, Weenink D. Praat: Doing phonetics by computer (Version 5.1. 10) [Computer program] 2009 Retrieved July 8, 2009. [Google Scholar]
- Bortfeld H, Morgan JL, Golinkoff RM, Rathbun K. Mommy and me familiar names help launch babies into speech-stream segmentation. Psychological Science. 2005;16(4):298–304. doi: 10.1111/j.0956-7976.2005.01531.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Brent MR, Cartwright TA. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition. 1996;61(1):93–125. doi: 10.1016/S0010-0277(96)00719-6. [DOI] [PubMed] [Google Scholar]
- Brent MR, Siskind JM. The role of exposure to isolated words in early vocabulary development. Cognition. 2001;81(2):B33–B44. doi: 10.1016/S0010-0277(01)00122-6. [DOI] [PubMed] [Google Scholar]
- Broadbent DE. Perception and communication. Elmsford, NY: Pergamon Press; 1958. [DOI] [Google Scholar]
- Church R, Bernhardt B, Shi R, Pichora-Fuller K. Infant-directed speech: Final syllable lengthening and rate of speech. The Journal of the Acoustical Society of America. 2005;117(4.2):2429–2430. doi: 10.1121/1.4786663. [DOI] [Google Scholar]
- Cunillera T, Càmara E, Laine M, Rodríguez-Fornells A. Speech segmentation is facilitated by visual cues. The Quarterly Journal of Experimental Psychology. 2010a;63(2):260–274. doi: 10.1080/17470210902888809. [DOI] [PubMed] [Google Scholar]
- Cunillera T, Càmara E, Laine M, Rodríguez-Fornells A. Words as anchors. Experimental Psychology. 2010b;57(2):134–141. doi: 10.1027/1618-3169/a000017. [DOI] [PubMed] [Google Scholar]
- Cunillera T, Laine M, Rodríguez-Fornells A. Headstart for speech segmentation: A neural signature for the anchor word effect. Neuropsychologia. 2016;82:189–199. doi: 10.1016/j.neuropsychologia.2016.01.011. [DOI] [PubMed] [Google Scholar]
- Dahan D, Brent MR. On the discovery of novel word like units from utterances: An artificial-language study with implications for native-language acquisition. Journal of Experimental Psychology: General. 1999;128(2):165–185. doi: 10.1037/0096-3445.128.2.165. [DOI] [PubMed] [Google Scholar]
- Davis MH, Johnsrude IS. Hearing speech sounds: Top-down influences on the interface between audition and speech perception. Hearing Research. 2007;229(1–2):132–147. doi: 10.1016/j.heares.2007.01.014. [DOI] [PubMed] [Google Scholar]
- Davis MH, Johnsrude IS, Hervais-Adelman AG, Taylor K, McGettigan C. Lexical information drives perceptual learning of distorted speech: Evidence from the comprehension of noisevocoded sentences. Journal of Experimental Psychology: General. 2005;134(2):222–241. doi: 10.1037/0096-3445.134.2.222. [DOI] [PubMed] [Google Scholar]
- Donaldson GS, Kreft HA. Effects of vowel context on the recognition of initial and medial consonants by cochlear implant users. Ear and Hearing. 2006;27(6):658–677. doi: 10.1097/01.aud.0000240543.31567.54. [DOI] [PubMed] [Google Scholar]
- Downs DW, Crum MA. Processing demands during auditory learning under degraded listening conditions. Journal of Speech, Language, and Hearing Research. 1978;21(4):702–714. doi: 10.1044/jshr.2104.702. [DOI] [PubMed] [Google Scholar]
- Estes KG, Evans JL, Alibali MW, Saffran JR. Can infants map meaning to newly segmented words? Statistical segmentation and word learning. Psychological Science. 2007;18(3):254–260. doi: 10.1111/j.1467-9280.2007.01885.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fiser J, Aslin RN. Statistical learning of higher-order temporal structure from visual shape sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2002;28(3):458–467. doi: 10.1037/0278-7393.28.3.458. [DOI] [PubMed] [Google Scholar]
- Frank MC, Tenenbaum JB, Gibson E, Snyder J. Learning and long-term retention of large-scale artificial languages. PLoS One. 2013;8(1):e52500. doi: 10.1371/journal.pone.0052500. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fu Q-J, Nogaki G. Noise susceptibility of cochlear implant users: The role of spectral resolution and smearing. Journal of the Association for Research in Otolaryngology. 2005;6(1):19–27. doi: 10.1007/s10162-004-5024-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gordon-Salant S, Yeni-Komshian GH, Fitzgibbons PJ, Cohen JI. Effects of age and hearing loss on recognition of unaccented and accented multisyllabic words. The Journal of the Acoustical Society of America. 2015;137(2):884– 897. doi: 10.1121/1.4906270. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Graf Estes K, Lew-Williams C. Listening through voices: Infant statistical word segmentation across multiple speakers. Developmental Psychology. 2015;51(11):1517–1528. doi: 10.1037/a0039725. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Greenwood DD. A cochlear frequency-position function for several species – 29 years later. The Journal of the Acoustical Society of America. 1990;87(6):2592–2605. doi: 10.1121/1.399052. [DOI] [PubMed] [Google Scholar]
- Grieco-Calub TM, Saffran JR, Litovsky RY. Spoken word recognition in toddlers who use cochlear implants. Journal of Speech Language and Hearing Research. 2009;52(6):1390–1400. doi: 10.1044/1092-4388(2009/08-0154). doi:0.1044/1092-4388(2009/08-0154) [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grieco-Calub TM, Ward KM, Brehm L. Multitasking during degraded speech recognition in school-Age children. Trends Hear. 2017;21:1–14. doi: 10.1177/2331216516686786. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hervais-Adelman A, Davis MH, Johnsrude IS, Carlyon RP. Perceptual learning of noise vocoded words: Effects of feedback and lexicality. Journal of Experimental Psychology: Human Perception and Performance. 2008;34(2):460– 474. doi: 10.1037/0096-1523.34.2.460. [DOI] [PubMed] [Google Scholar]
- Hervais-Adelman AG, Carlyon RP, Johnsrude IS, Davis MH. Brain regions recruited for the effortful comprehension of noise-vocoded words. Language and Cognitive Processes. 2012;27(7–8):1145–1166. doi: 10.1080/01690965.2012.662280.
- Hervais-Adelman AG, Davis MH, Johnsrude IS, Taylor KJ, Carlyon RP. Generalization of perceptual learning of vocoded speech. Journal of Experimental Psychology: Human Perception and Performance. 2011;37(1):283–295. doi: 10.1037/a0020772.
- Houston DM, Jusczyk PW, Kuijpers C, Coolen R, Cutler A. Cross-language word segmentation by 9-month-olds. Psychonomic Bulletin & Review. 2000;7(3):504–509. doi: 10.3758/BF03214363.
- Johnson EK, Jusczyk PW. Word segmentation by 8-month-olds: When speech cues count more than statistics. Journal of Memory and Language. 2001;44(4):548–567. doi: 10.1006/jmla.2000.2755.
- Johnson EK, Seidl A, Tyler MD, Berwick RC. The edge factor in early word segmentation: Utterance-level prosody enables word form extraction by 6-month-olds. PLoS One. 2014;9(1):e83546. doi: 10.1371/journal.pone.0083546.
- Jusczyk PW. How infants begin to extract words from speech. Trends in Cognitive Sciences. 1999;3(9):323–328. doi: 10.1016/S1364-6613(99)01363-7.
- Jusczyk PW, Aslin RN. Infants’ detection of the sound patterns of words in fluent speech. Cognitive Psychology. 1995;29(1):1–23. doi: 10.1006/cogp.1995.1010.
- Kirkham NZ, Slemmer JA, Johnson SP. Visual statistical learning in infancy: Evidence for a domain general learning mechanism. Cognition. 2002;83(2):B35–B42. doi: 10.1016/S0010-0277(02)00004-5.
- Lany J, Shoaib A, Thompson A, Graf Estes K. Is statistical-learning ability related to real-time language processing? Proceedings of the 40th Annual Boston University Conference on Language Development; Somerville, MA: Cascadilla Press; 2016. pp. 203–215.
- Lew-Williams C, Pelucchi B, Saffran JR. Isolated words enhance statistical language learning in infancy. Developmental Science. 2011;14(6):1323–1329. doi: 10.1111/j.1467-7687.2011.01079.x.
- Lew-Williams C, Saffran JR. All words are not created equal: Expectations about word length guide infant statistical learning. Cognition. 2012;122(2):241–246. doi: 10.1016/j.cognition.2011.10.007.
- Liu S, Zheng F. Temporal properties in clear speech perception. The Journal of the Acoustical Society of America. 2006;120(1):424–432. doi: 10.1121/1.2208427.
- Mattys SL. Stress versus coarticulation: Toward an integrated approach to explicit speech segmentation. Journal of Experimental Psychology: Human Perception and Performance. 2004;30(2):397–408. doi: 10.1037/0096-1523.30.2.397.
- Mattys SL, Davis MH, Bradlow AR, Scott SK. Speech recognition in adverse conditions: A review. Language and Cognitive Processes. 2012;27(7–8):953–978. doi: 10.1080/01690965.2012.705006.
- McCoy SL, Tun PA, Cox LC, Colangelo M, Stewart RA, Wingfield A. Hearing loss and perceptual effort: Downstream effects on older adults’ memory for speech. The Quarterly Journal of Experimental Psychology Section A. 2005;58:22–33. doi: 10.1080/02724980443000151.
- McMillan BTM, Saffran JR. Learning in complex environments: The effects of background speech on early word learning. Child Development. 2016;87(6):1841–1855. doi: 10.1111/cdev.12559.
- Mersad K, Nazzi T. When mommy comes to the rescue of statistics: Infants combine top-down and bottom-up cues to segment speech. Language Learning and Development. 2012;8:303–315. doi: 10.1080/15475441.2011.609106.
- Misyak JB, Christiansen MH. Statistical learning and language: An individual differences study. Language Learning. 2012;62(1):302–331. doi: 10.1111/j.1467-9922.2010.00626.x.
- Morgan-Short K, Steinhauer K, Sanz C, Ullman MT. Explicit and implicit second language training differentially affect the achievement of native-like brain activation patterns. Journal of Cognitive Neuroscience. 2012;24(4):933–947. doi: 10.1162/jocn_a_00119.
- Munson B, Donaldson GS, Allen SL, Collison EA, Nelson DA. Patterns of phoneme perception errors by listeners with cochlear implants as a function of overall speech perception ability. The Journal of the Acoustical Society of America. 2003;113(2):925–935. doi: 10.1121/1.1536630.
- Norris JM, Ortega L. Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning. 2000;50(3):417–528. doi: 10.1111/0023-8333.00136.
- Obleser J, Wise RJS, Dresner MA, Scott SK. Functional integration across brain regions improves speech perception under adverse listening conditions. Journal of Neuroscience. 2007;27(9):2283–2289. doi: 10.1523/jneurosci.4663-06.2007.
- Palmer SD, Mattys SL. Speech segmentation by statistical learning is supported by domain-general processes within working memory. The Quarterly Journal of Experimental Psychology. 2016:1–12. doi: 10.1080/17470218.2015.1112825.
- Pals C, Sarampalis A, Başkent D. Listening effort with cochlear implant simulations. Journal of Speech Language and Hearing Research. 2013;56(4):1075–1084. doi: 10.1044/1092-4388(2012/12-0074).
- Pelucchi B, Hay JF, Saffran JR. Statistical learning in a natural language by 8-month-old infants. Child Development. 2009;80:674–685. doi: 10.1111/j.1467-8624.2009.01290.x.
- Peña M, Bonatti LL, Nespor M, Mehler J. Signal-driven computations in speech processing. Science. 2002;298(5593):604–607. doi: 10.1126/science.1072901.
- Pichora-Fuller MK, Schneider BA, Daneman M. How young and old adults listen to and remember speech in noise. The Journal of the Acoustical Society of America. 1995;97(1):593–608. doi: 10.1121/1.412282.
- Rabbitt PM. Recognition: Memory for words correctly heard in noise. Psychonomic Science. 1966;6(8):383–384. doi: 10.3758/bf03330948.
- Rakerd B, Seitz P, Whearty M. Assessing the cognitive demands of speech listening for people with hearing losses. Ear and Hearing. 1996;17(2):97–106. doi: 10.1097/00003446-199604000-00002.
- R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; Vienna, Austria: 2012. URL http://www.R-project.org/
- Romberg AR, Saffran JR. Statistical learning and language acquisition. Wiley Interdisciplinary Reviews: Cognitive Science. 2010;1(6):906–914. doi: 10.1002/wcs.78.
- Rönnberg J, Lunner T, Zekveld A, et al. The ease of language understanding (ELU) model: Theory, data, and clinical implications. Frontiers in Systems Neuroscience. 2013;7:1–17. doi: 10.3389/fnsys.2013.00031.
- Saffran JR, Aslin RN, Newport EL. Statistical learning by 8-month-old infants. Science. 1996;274:1926–1928. doi: 10.1126/science.274.5294.1926.
- Saffran JR, Newport EL, Aslin RN. Word segmentation: The role of distributional cues. Journal of Memory and Language. 1996;35(4):606–621. doi: 10.1006/jmla.1996.0032.
- Saffran JR, Thiessen ED. Pattern induction by infant language learners. Developmental Psychology. 2003;39(3):484–494. doi: 10.1037/0012-1649.39.3.484.
- Sarampalis A, Kalluri S, Edwards B, Hafter E. Objective measures of listening effort: Effects of background noise and noise reduction. Journal of Speech Language and Hearing Research. 2009;52(5):1230–1240. doi: 10.1044/1092-4388(2009/08-0111).
- Seidl A, Johnson EK. Infant word segmentation revisited: Edge alignment facilitates target extraction. Developmental Science. 2006;9(6):565–573. doi: 10.1111/j.1467-7687.2006.00534.x.
- Shannon RV, Zeng FG, Kamath V, Wygonski J, Ekelid M. Speech recognition with primarily temporal cues. Science. 1995;270(5234):303–304. doi: 10.1126/science.270.5234.303.
- Shukla M, Nespor M, Mehler J. An interaction between prosody and statistics in the segmentation of fluent speech. Cognitive Psychology. 2007;54(1):1–32. doi: 10.1016/j.cogpsych.2006.04.002.
- Soderstrom M, Seidl A, Nelson DGK, Jusczyk PW. The prosodic bootstrapping of phrases: Evidence from pre-linguistic infants. Journal of Memory and Language. 2003;49(2):249–267. doi: 10.1016/S0749-596X(03)00024-X.
- Sohoglu E, Peelle JE, Carlyon RP, Davis MH. Predictive top-down integration of prior knowledge during speech perception. Journal of Neuroscience. 2012;32(25):8443–8453. doi: 10.1523/JNEUROSCI.5069-11.2012.
- Spencer M, Kaschak MP, Jones JL, Lonigan CJ. Statistical learning is related to early literacy-related skills. Reading and Writing. 2015;28(4):467–490. doi: 10.1007/s11145-014-9533-0.
- Thiessen ED, Saffran JR. When cues collide: Use of stress and statistical cues to word boundaries by 7- to 9-month-old infants. Developmental Psychology. 2003;39(4):706–716. doi: 10.1037/0012-1649.39.4.706.
- Toro JM, Pons F, Bion RA, Sebastián-Gallés N. The contribution of language-specific knowledge in the selection of statistically-coherent word candidates. Journal of Memory and Language. 2011;64(2):171–180. doi: 10.1016/j.jml.2010.11.005.
- Ward KM, Shen J, Souza PE, Grieco-Calub TM. Age-related differences in listening effort during degraded speech recognition. Ear and Hearing. 2017;38(1):74–84. doi: 10.1097/AUD.0000000000000355.
- Wild CJ, Yusuf A, Wilson DE, Peelle JE, Davis MH, Johnsrude IS. Effortful listening: The processing of degraded speech depends critically on attention. Journal of Neuroscience. 2012;32(40):14010–14021. doi: 10.1523/jneurosci.1528-12.2012.
- Winn M. Rapid release from listening effort resulting from semantic context, and effects of spectral degradation and cochlear implants. Trends in Hearing. 2016;20:1–17. doi: 10.1177/2331216516669723.
- Winn MB, Edwards JR, Litovsky RY. The impact of auditory spectral resolution on listening effort revealed by pupil dilation. Ear and Hearing. 2015;36(4):e153–e165. doi: 10.1097/AUD.0000000000000145.
- Xu L, Pfingst BE. Spectral and temporal cues for speech recognition: Implications for auditory prostheses. Hearing Research. 2008;242(1):132–140. doi: 10.1016/j.heares.2007.12.010.
- Zhou N, Xu L, Lee C-Y. The effects of frequency-place shift on consonant confusion in cochlear implant simulations. The Journal of the Acoustical Society of America. 2010;128(1):401–409. doi: 10.1121/1.3436558.



