Top-Down Influences of Written Text on Perceived Clarity of Degraded Speech

Ediz Sohoglu; Jonathan E Peelle; Robert P Carlyon; Matthew H Davis

doi:10.1037/a0033206

. 2013 Jun 10;40(1):186–199. doi: 10.1037/a0033206

Top-Down Influences of Written Text on Perceived Clarity of Degraded Speech

Ediz Sohoglu ^1,^*, Jonathan E Peelle ², Robert P Carlyon ³, Matthew H Davis ^3,^*

Editor: James T Enns

PMCID: PMC3906796 PMID: 23750966

Abstract

An unresolved question is how the reported clarity of degraded speech is enhanced when listeners have prior knowledge of speech content. One account of this phenomenon proposes top-down modulation of early acoustic processing by higher-level linguistic knowledge. Alternative, strictly bottom-up accounts argue that acoustic information and higher-level knowledge are combined at a late decision stage without modulating early acoustic processing. Here we tested top-down and bottom-up accounts using written text to manipulate listeners’ knowledge of speech content. The effect of written text on the reported clarity of noise-vocoded speech was most pronounced when text was presented before (rather than after) speech (Experiment 1). Fine-grained manipulation of the onset asynchrony between text and speech revealed that this effect declined when text was presented more than 120 ms after speech onset (Experiment 2). Finally, the influence of written text was found to arise from phonological (rather than lexical) correspondence between text and speech (Experiment 3). These results suggest that prior knowledge effects are time-limited by the duration of auditory echoic memory for degraded speech, consistent with top-down modulation of early acoustic processing by linguistic knowledge.

Keywords: prior knowledge, predictive coding, top-down, vocoded speech, echoic memory

An enduring puzzle is how we understand speech despite sensory information that is often ambiguous or degraded. Whether listening to a speaker with a foreign accent or in a noisy room, we recognize spoken language with accuracy that outperforms existing computer recognition systems. One explanation for this considerable feat is that listeners are highly adept at exploiting prior knowledge of the environment to aid speech perception.

Prior knowledge from a variety of sources facilitates speech perception in everyday listening. Previous studies have shown that lip movements, which typically precede arriving speech signals by ∼150 ms (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009), improve speech intelligibility in noise (Ma, Zhou, Ross, Foxe, & Parra, 2009; Ross, Saint-Amour, Leavitt, Javitt, & Foxe, 2007; Sumby, 1954). Another strong source of prior knowledge is the linguistic context in which an utterance is spoken. Listeners are quicker to identify phonemes located at the ends of words rather than nonwords (Frauenfelder, Segui, & Dijkstra, 1990). For sentences presented in noise, word report is more accurate when the sentences are syntactically and semantically constrained (Boothroyd & Nittrouer, 1988; Kalikow, 1977; Miller & Isard, 1963).

Although the influence of prior knowledge on speech perception is widely acknowledged, there is a longstanding debate about the underlying mechanism. Much of this controversy has centered on one particular effect of prior knowledge: the influence of lexical context on phonological judgments in phonetic categorization and phoneme monitoring tasks (e.g., Frauenfelder et al., 1990; Ganong, 1980; Warren, 1970). One explanation for this phenomenon is that it reflects top-down modulation of phonological processing by higher-level lexical knowledge (McClelland & Elman, 1986; McClelland, Mirman, & Holt, 2006). There are alternative accounts, however, that do not invoke top-down processing. According to these strictly bottom-up accounts, lexical information is combined with phonological information only at a late decision stage where the phonological judgment is formed (Massaro, 1989; Norris, McQueen, & Cutler, 2000).

In the current study, we used a novel experimental paradigm to assess top-down and bottom-up accounts of prior knowledge effects on speech perception. Listeners’ prior knowledge was manipulated by presenting written text before acoustically degraded spoken words. Previous studies have shown that this produces a striking effect on the reported clarity of speech (Sohoglu, Peelle, Carlyon, & Davis, 2012; Wild, Davis, & Johnsrude, 2012; see also Mitterer & McQueen, 2009) that some authors have interpreted as arising from a decision process (Frost, Repp, & Katz, 1988). This is because written text was found to modulate signal detection bias rather than perceptual sensitivity. However, modeling work has shown that signal detection theory cannot distinguish between bottom-up and top-down accounts of speech perception (Norris, 1995; Norris et al., 2000). Hence, an effect of written text on signal detection bias could arise at a late decision stage in a bottom-up fashion or at an early sensory level in a top-down manner. In the current study, we used written text to manipulate both when higher-level knowledge becomes available to listeners and the degree of correspondence between prior knowledge and speech input. We will argue that these manipulations more accurately distinguish between top-down and bottom-up accounts.

In the experiments described below, speech was degraded using a noise-vocoding procedure, which removes its temporal and spectral fine structure while preserving low-frequency temporal information (Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995). Vocoded speech has been a popular stimulus with which to study speech perception because the amount of sensory detail (both spectral and temporal) can be carefully controlled to explore the low-level acoustic factors contributing to speech intelligibility (e.g., Deeks & Carlyon, 2004; Loizou, Dorman, & Tu, 1999; Roberts, Summers, & Bailey, 2011; Rosen, Faulkner, & Wilkinson, 1999; Whitmal, Poissant, Freyman, & Helfer, 2007; Xu, Thompson, & Pfingst, 2005). Furthermore, vocoded speech is widely believed to approximate the information available to deafened individuals who have a cochlear implant (see Shannon et al., 1995). Hence, findings from studies employing vocoded speech not only have implications for understanding the cognitive processes subserving speech perception in normal hearing individuals, but also in the hearing impaired.

When written text is presented before vocoded speech, listeners report that the amount of acoustic degradation is reduced (Wild et al., 2012; see also Goldinger, Kleider, & Shelley, 1999; Jacoby, Allan, Collins, & Larwill, 1988). This suggests that written text modifies listeners’ judgments about the low-level acoustic characteristics of speech. In analogy to the top-down and bottom-up explanations of lexical effects on phonological judgments, there are two mechanisms that could enable prior knowledge from written text to modulate listeners’ judgments about the perceived clarity of vocoded speech. One possibility is that abstract (lexical or phonological) knowledge obtained from text has the effect of modulating early acoustic processing, giving rise to enhanced perceptual clarity (top-down account, see Figure 1a). Alternatively, information from written and spoken sources could be combined at a late decision stage, where the clarity judgment is formed, without modulating early acoustic processing (bottom-up account, see Figure 1b). We now describe the three experiments we have conducted to test these competing accounts of written text effects.

Two competing accounts of how prior lexical or phonological knowledge from written text influences decisions about perceived (acoustic) clarity of vocoded speech. Gray colored boxes indicate the representations that are that potentially modified by written text in each account. A) Top-down account: prior knowledge influences early acoustic processing prior to the decision stage (as in McClelland & Elman, 1986). B) Bottom-up account: prior knowledge and acoustic information are combined at a late decision stage without modulating early acoustic processing (as in Norris et al., 2000).

Experiment 1

Experiment 1 introduces the paradigm that we used to assess the impact of prior knowledge from written text on the perception of vocoded speech. Listeners were presented with vocoded spoken words that varied in the amount of sensory detail, and were asked to rate the perceived clarity of speech. Listeners’ prior knowledge of speech content was manipulated by presenting matching, mismatching, or neutral text before each spoken word. We first characterized the effect of manipulating prior knowledge by asking whether the rated clarity of vocoded speech can be modified not only by the amount of sensory detail conveyed by the vocoder, but also by written text. We assessed both positive and negative effects of prior knowledge on the rated clarity of speech by comparing clarity ratings obtained from matching and mismatching contexts with those from the neutral condition.

One situation that can potentially distinguish between top-down and bottom-up accounts is when knowledge of speech content comes not before but after speech has been heard. This is because there is good evidence that memory for low-level acoustic information (auditory echoic memory) has a limited duration. Although estimates of the duration of echoic memory vary depending on the paradigm used (e.g., Crowder & Morton, 1969; Massaro, 1970, 1974; Sams, Hari, Rif & Knuutila, 1993), a consensus has emerged on an early auditory store that preserves unanalyzed acoustic information for around 200–300 ms (see Massaro, 1972; Cowan, 1984; Loveless, Levänen, Jousmäki, Sams, & Hari, 1996).¹ In contrast, nonsensory information in working memory is widely believed to have a much longer duration (several seconds or longer). It therefore follows that if effects of prior knowledge are attributable to a top-down component that modulates acoustic processing, written text will be less effective in influencing speech perception when presented after the 200−300 ms duration of auditory echoic memory. On the other hand, if effects of prior knowledge arise from a strictly bottom-up mechanism, the influence of written text should be apparent even when presented several seconds after speech. This is because the critical stage of processing in such an account is at the decision stage (where speech information and higher-level knowledge converges). Here, information has been abstracted from the sensory input and hence can be maintained without decay in working memory over several seconds.

To test these two accounts, the critical manipulation in Experiment 1 involved varying the timing of written text and speech so that on some trials text was presented 800 ms before speech onset (the before condition) and on other trials text was presented 800 ms after speech onset (the after condition). We used monosyllabic spoken words (lasting around 600 ms) in order for acoustic representations of speech to decay by the time of text presentation in the after condition. As described above, this should reduce the influence of text only if attributable to a top-down component.

Materials and Methods

Participants

Nineteen participants were tested after being informed of the study’s procedure, which was approved by the Cambridge Psychology Research Ethics Committee. All were native speakers of English, aged between 18 and 40 years and reported no history of hearing impairment or neurological disease.

Stimuli and procedure

A total of 360 monosyllabic words were presented in spoken or written format. The spoken words were 16-bit, 44.1 kHz recordings of a male speaker of southern British English and their duration ranged from 372 to 903 ms (M = 600, SD = 83).

Written text was presented 800 ms before or after the onset of each spoken word (see Figure 2). Written text contained a word that was the same (matching) or different (mismatching) to the spoken word, or a string of x characters (neutral). Written words for the mismatching condition were obtained by permuting the word list for their spoken form. As a result, each written word in the mismatching condition was also presented as a spoken word and vice versa. Mean string length was equated across conditions. Written text was composed of black lowercase characters presented for 200 ms on a gray background.

Stimulus characteristics in Experiment 1. A) Example written-spoken word pairs used for matching, mismatching, and neutral conditions. B) Order of events in the before condition. C) Order of events in the after condition.

The amount of sensory detail in speech was varied using a noise-vocoding procedure (Shannon et al., 1995), which superimposes the temporal envelope from separate frequency regions in the speech signal onto corresponding frequency regions of white noise. This allows parametric variation of spectral detail, with increasing numbers of channels associated with increasing perceptual clarity. Vocoding was performed using a custom Matlab (MathWorks Inc.) script, using 1, 2, 4, 8, or 16 spectral channels logarithmically spaced between 70 and 5,000 Hz. Envelope signals in each channel were extracted using half-wave rectification and smoothing with a second-order low-pass filter with a cut-off frequency of 30 Hz. The overall RMS amplitude was adjusted to be the same across all audio files. Pilot data showed that mean word identification accuracy (across participants) for speech with 2, 4 and 8 channels of sensory detail is 3.41% (SD = 1.93), 17.05% (SD = 1.98) and 68.18% (SD = 2.77), respectively. Identification accuracy for 1 channel and 16 channel speech was not tested because it is known from previous studies that for open-set assessment of word recognition, speech with these amounts of sensory detail are entirely unintelligible and perfectly intelligible, respectively (e.g., Obleser, Eisner, & Kotz, 2008; Sheldon, Pichora-Fuller, & Schneider, 2008a).

Manipulations of written text timing (before/after), congruency (matching/mismatching/neutral) and speech sensory detail (1/2/4/8/16 channels) were fully crossed, resulting in a 2 × 3 × 5 factorial design with 12 trials in each condition. Trials were randomly ordered during each of two presentation blocks of 180 trials. The words assigned to each sensory detail and congruency condition were randomized over participants. Given that words were randomly assigned to each participant, we only report the outcome of standard analyses by participants because analyses by items are unnecessary with randomized or counterbalanced designs (Raaijmakers, Schrijnemakers, & Gremmen, 1999).

Stimulus delivery was controlled with E-Prime 2.0 software (Psychology Software Tools, Inc.). Participants were instructed to rate the clarity of each spoken word on a scale from 1 (Not clear) to 8 (Very clear). To prompt participants to respond, a response cue consisting of a visual display of the rating scale was presented 1,200 ms after the onset of the spoken word (see Figure 2). Participants used a keyboard to record their response and had no time limit to do so. Subsequent trials began 1,000 ms after participants entered their responses. Prior to the experiment, participants completed a practice session of 30 trials containing all conditions but using a different set of words to those used in the main experiment.

Results

Ratings of perceived clarity are shown in Figure 3. As expected, a repeated measures ANOVA revealed that increasing sensory detail significantly enhanced clarity ratings (F_(4,72) = 277, MS = 494, η_p² = .939, p < .001). The congruency of written text also enhanced clarity ratings (F_(2,36) = 43.8, MS = 23.2, η_p² = .709, p < .001). Critically, there was a significant interaction between the congruency and timing of written text (F_(8,144) = 6.78, MS = 1.37, η_p² = .274, p < .001), indicating that the effect of written text on clarity ratings was most apparent when written text appeared before speech onset.

Rated speech clarity in Experiment 1. The provision of increasing sensory detail and prior knowledge from matching text both led to an enhancement in rated speech clarity. The effect of matching text was most pronounced when text appeared before speech onset. Error bars represent SEM across participants corrected for between-participants variability (see Loftus & Masson, 1994).

To further characterize the influence of written text on clarity ratings, we performed planned contrasts testing for positive effects of matching text and negative effects of mismatching text on clarity relative to neutral text (Δ clarity). As can be seen in Figure 4, matching text significantly enhanced clarity ratings compared with neutral text (F_(1,18) = 52.9, MS = 26.3, η_p² = .746, p < .001). There was also a significant interaction between written text congruency (matching/neutral) and the amount of speech sensory detail (F_(4,72) = 5.89, MS = .859, η_p² = .246, p < .001). We determined the nature of this interaction by conducting a trend analysis on the difference between matching and neutral ratings in the before condition only (i.e., when the effect of written text was most apparent). There was a significant quadratic trend (F_(1,18) = 14.7, MS = 4.18, η_p² = .450, p < .01), indicating that the influence of matching text on clarity ratings was most pronounced for speech with an intermediate amount of sensory detail.

Rated speech clarity in Experiment 1 relative to the neutral condition (Δ clarity). Whereas matching text enhanced speech clarity, mismatching text reduced clarity. Light horizontal gray lines represent no difference in clarity from neutral condition (i.e., zero Δ clarity). Error bars represent SEM across participants corrected for between-participants variability (see Loftus & Masson, 1994). M = Matching; MM = MisMatching; N = Neutral.

Whereas matching text enhanced clarity ratings, mismatching text significantly reduced clarity ratings relative to neutral text (F_(1,18) = 7.89, MS = 1.76, η_p² = .305, p < .05). This reduction effect was also dependent on the amount of speech sensory detail as there was a significant interaction between written text congruency (mismatching/neutral) and the amount of speech sensory detail (F_(4,72) = 4.06, MS = .843, η_p² = .184, p < .01). As with our previous analysis for matching text, we conducted a trend analysis on the difference between mismatching and neutral ratings in the before condition to examine how the influence of mismatching text varied with sensory detail. In contrast to matching text, there was a significant linear (and not quadratic) trend (F_(1,18) = 7.90, MS = 4.78, η_p² = .305, p < .05), suggesting that the reduction of clarity ratings in response to mismatching text varied in a monotonically increasing manner for speech with a greater amount of sensory detail.

A final analysis examined whether the influence of written text on clarity ratings was apparent for the extreme cases of 1 channel speech (unintelligible without support from written text) and 16 channel speech (highly intelligible). As before, we restricted this analysis to the before condition data. Clarity ratings were significantly greater in the matching relative to mismatching conditions for both 1 channel (one-tailed t₍₁₈₎ = 2.82, η² = .306, p < .01) and 16 channel speech (one-tailed t₍₁₈₎ = 5.74, η² = .647, p < .001). This suggests that matching text can enhance ratings of speech clarity over a wide range of conditions (i.e., for unintelligible as well as intelligible speech) even though the extent of this enhancement may differ depending on the amount of sensory detail present (as shown by the congruency by sensory detail interaction). The pattern was different for mismatching text; although there was a significant reduction in clarity ratings in the mismatching relative to neutral conditions for 16 channel speech (one-tailed t₍₁₈₎ = −2.04, η² = .188, p < .05), there was no significant reduction for 1 channel speech (one-tailed t₍₁₈₎ = .938, p = .18). This pattern was confirmed by a significant interaction between congruency (mismatching/neutral) and sensory detail (1/16 channels) (F_(1,18) = 4.59, MS = 1.13, η_p² = .203, p < .05). One explanation for the absence of an effect of mismatching text for 1 channel speech is that clarity ratings were at floor for this amount of sensory detail and therefore could not be reduced further by mismatching text.

Discussion

The results from Experiment 1 demonstrate that prior knowledge of speech content from written text has a measurable effect on the rated clarity of vocoded speech, which replicates previous findings from studies that used a similar paradigm to the one employed here (Goldinger et al., 1999; Jacoby et al., 1988; Sohoglu et al., 2012; Wild et al., 2012).

Our results also suggest that prior knowledge can have both facilitatory and inhibitory effects on speech clarity ratings. Relative to the neutral condition in which prior knowledge of speech content was absent, matching text enhanced clarity ratings, whereas mismatching text reduced ratings. Although the magnitude of these effects varied with the amount of speech sensory detail, a striking finding is that facilitatory effects of written text occurred across the entire range of sensory detail tested (i.e., from entirely unintelligible 1 channel speech to completely intelligible 16 channel speech). This finding is consistent with the study of Frost et al. (1988), who reported that listeners are able to detect correspondence between text and speech even when speech is presented as signal correlated noise (containing only the temporal envelope of speech, i.e., similar to the 1 channel condition here). Such findings indicate that prior knowledge can influence perception with only a minimal amount of sensory information. Nonetheless, for prior knowledge to have adaptive value, its influence must be restricted to auditory signals that contain some speech information to minimize the occurrence of misperceptions. Indeed, Frost et al. also demonstrated that the influence of written text does not extend to white noise that completely lacks speech envelope information.

Finally, the most revealing result for existing accounts of speech perception is our observation that the effects of written text on the rated clarity of vocoded speech were less pronounced when written text was presented 800 ms after speech onset compared to when it was presented 800 ms before speech onset. This finding is readily predicted by a top-down account whereby abstract (lexical or phonological) knowledge obtained from written text modifies lower-level acoustic representations of speech. An important prediction of the model is that for prior knowledge to be effective in modifying speech perception, acoustic representations of speech must persist long enough to permit direct interaction with lexical or phonological representations from written text. Because the majority of the spoken words in Experiment 1 had a duration of ∼600 ms and because previous findings indicate that auditory echoic memory has a limited duration of around 200–300 ms (Cowan, 1984; Loveless et al., 1996; Massaro, 1970, 1974), acoustic representations of speech would have mostly decayed in the condition when text was presented 800 ms after speech onset. As a result, written text would have been less effective in modifying speech clarity. In contrast, it is less obvious how a purely bottom-up account would explain this finding. In such an account, prior knowledge and sensory information are combined at a late decision stage where information has been abstracted from auditory and visual inputs and where information is easily maintained over the period that was required here.

Experiment 2

In Experiment 2, we sought further evidence that written text influences the rated clarity of vocoded words by modifying auditory echoic traces of speech. As noted above, previous work has estimated the duration of auditory echoic memory to be around 200–300 ms. We therefore manipulated the timing of written text in a finer-grained manner in order to determine whether the duration of auditory echoic memory is precisely reflected in the timecourse of prior knowledge effects. Speech was presented with matching or mismatching text only as this comparison yielded the largest effect of prior knowledge in Experiment 1. The stimulus onset asynchrony (SOA) between written text and speech was varied gradually from −1,600 ms (text before speech onset) to +1,600 ms (text after speech onset) to sample the underlying timecourse of clarity enhancement by prior knowledge.

If the duration of echoic memory is reflected in the timecourse relating clarity enhancement to SOA, two predictions follow. First, in conditions when text is presented before speech onset (negative SOAs), the influence of written text should be maximal and not vary with SOA. This is because in these conditions, abstract lexical or phonological representations from text will be able to modulate acoustic input immediately upon speech arrival and therefore without being constrained by echoic memory decay. Second, in conditions when text is presented after speech onset (positive SOAs), the influence of written text should start to decay only for SOAs longer than 200–300 ms (after echoic memory decay). Note the assumption here is that echoic memory stores acoustic information corresponding to sublexical portions of speech. This is necessarily the case as the 200–300 ms duration of echoic memory is shorter than the typical ∼600 ms duration of monosyllabic words employed in the current study.