Author manuscript; available in PMC: 2006 Nov 2.
Published in final edited form as: J Acoust Soc Am. 2006 Jun;119(6):4016–4026. doi: 10.1121/1.2195119

Speech categorization in context: Joint effects of nonspeech and speech precursors

Lori L Holt 1,a)
PMCID: PMC1633715  NIHMSID: NIHMS13205  PMID: 16838544

Abstract

The extent to which context influences speech categorization can inform theories of pre-lexical speech perception. Across three conditions, listeners categorized speech targets preceded by speech context syllables. These syllables were presented as the sole context or paired with nonspeech tone contexts previously shown to affect speech categorization. Listeners’ context-dependent categorization across these conditions provides evidence that speech and nonspeech context stimuli jointly influence speech processing. Specifically, when the spectral characteristics of speech and nonspeech context stimuli are mismatched such that they are expected to produce opposing effects on speech categorization, the influence of nonspeech contexts may undermine, or even reverse, the expected effect of adjacent speech context. Likewise, when spectrally matched, the cross-class contexts may collaborate to increase effects of context. Similar effects are observed even when natural speech syllables, matched in source to the speech categorization targets, serve as the speech contexts. Results are well-predicted by spectral characteristics of the context stimuli.

I. INTRODUCTION

Context plays a critical role in speech categorization. Acoustically identical speech stimuli may be perceived as members of different phonetic categories as a function of the surrounding acoustic context. Mann (1980), for example, has shown that listeners’ categorization of a series of speech stimuli ranging perceptually from /ga/ to /da/ is shifted toward more “ga” responses when these target syllables are preceded by /al/. The same stimuli are more often categorized as “da” when /ar/ precedes them. Such context-dependent phonetic categorization is a consistent finding in speech perception (e.g., Lindblom and Studdert-Kennedy, 1967; Mann and Repp, 1981; see Repp, 1982 for review).

Consideration of how to account for context-dependent speech perception highlights larger theoretical issues of how best to characterize the basic representational currency and processing characteristics of speech perception. Relevant to this interest, an avian species (Japanese quail, Coturnix coturnix japonica) has been shown to exhibit context-dependent responses to speech (Lotto et al., 1997). Birds operantly trained to peck a lighted key in response to a /ga/ stimulus peck more robustly in later tests when test syllables are preceded by /al/. Correspondingly, birds trained to peck to /da/ peck most vigorously to test stimuli when they are preceded by /ar/. Thus, birds exhibit shifts in pecking behavior contingent on preceding context analogous to context-dependent human speech categorization. The birds had no previous experience with speech, so their behavior cannot be explained on the basis of learned covariation of acoustic attributes across contexts or on the basis of existing phonetic categories. It is also unlikely that quail have access to specialized speech processes or knowledge of the human vocal tract. The parallels between quail and human behavior suggest a possible role for general auditory processing, not specific to speech or dependent upon extensive experience with the speech signal, in context-dependent speech perception.

In accord with the hypothesis that general, rather than speech-specific, processes play a role in context-dependent speech perception, there is evidence that nonspeech acoustic contexts affect speech categorization by human listeners. Following the findings of Mann (1980), Lotto and Kluender (1998) synthesized two sine-wave tones, one with a higher frequency corresponding to the third formant (F3) offset frequency of /al/ and the other with a lower frequency corresponding to the /ar/ F3 offset frequency. When these non-speech stimuli preceded a /ga/ to /da/ target stimulus series like that studied by Mann (1980), speech categorization was influenced by the precursor tones. Listeners more often categorized the syllables as “ga” when they were preceded by the higher-frequency sine-wave tone modeling /al/. The same stimuli were more often categorized as “da” when the tone modeling /ar/ preceded them. Thus, nonspeech stimuli mimicking very limited spectral characteristics of speech contexts also influence speech categorization.

Nonspeech-elicited context effects on speech categorization appear to be a general phenomenon. Holt (1999; Holt and Lotto, 2002) reports that sine-wave tones or single formants situated at the second formant (F2) frequency of /i/ versus /u/ shift categorization of syllables ranging perceptually from /ba/ to /da/ in the same manner as the vowels they model. Likewise, flanking nonspeech frequency-modulated glides that follow the F2 formant trajectories of /bVb/ and /dVd/ syllables influence categorization of the intermediate vowel (Holt et al., 2000). A number of other studies demonstrate interactions of nonspeech context and speech perception (Fowler et al., 2000; Kluender et al., 2003; Watkins and Makin, 1994, 1996a, 1996b) and the effects appear to be reciprocal. Stephens and Holt (2003) report that preceding /al/ and /ar/ syllables modulate perception of following non-speech stimuli. Follow-up studies have demonstrated that listeners are unable to relate the sine-wave tone precursors typical of these studies to the phonetic categories the tones model (Lotto, 2004); context-dependent speech categorization is elicited even with nonspeech precursors that are truly perceived as nonspeech events.

There is evidence that even temporally nonadjacent non-speech precursors can influence speech categorization. Holt (2005) created “acoustic histories” composed of 21 sine-wave tones sampling a distribution defined in the acoustic frequency dimension. The acoustic histories terminated in a neutral-frequency tone that was shown to have no effect on speech categorization. In this way, the context immediately adjacent to the speech target in time was constant across conditions. The mean frequency of the acoustic histories differentiated conditions, with distribution means approximating the tone frequencies of Lotto and Kluender (1998). Despite their temporal nonadjacency with speech targets, the nonspeech acoustic histories had a significant effect on categorization of members of a following /ga/ to /da/ speech series. In line with previous findings, the higher-frequency acoustic histories resulted in more “ga” responses whereas the lower-frequency acoustic histories led to more “da” responses. These effects were observed even when as much as 1.3 s of silence or 13 repetitions of the neutral tone separated the acoustic histories and the speech targets in time.

In each of the cases for which effects of nonspeech contexts on speech categorization have been observed, the non-speech contexts model limited spectral characteristics of the speech contexts. As simple pure tones or glides, they do not possess structured information about articulatory gestures. Moreover, even the somewhat richer acoustic history tone contexts of Holt (2005) are far removed from the stimuli that may be perceived as speech in sine-wave speech studies (e.g., Remez et al. 1994). The commonality shared between the tones composing the acoustic histories and sine-wave speech is limited to the fact that both make use of sinusoids. The tonal sine-wave speech stimuli are composed of three or four concurrent time-varying sinusoids, each mimicking the center frequency and amplitude of a natural vocal resonance measured from a real utterance. Thus, the sine-wave replicas that may give rise to speech percepts possess an overall acoustic structure that much more closely mirrors the spectrum of the speech they model. By contrast, the single sine-waves of, for example, Lotto and Kluender (1998) or the sequences of sine waves of Holt (2005) are much further removed from the precise time-varying characteristics of speech. The tones composing the acoustic histories of Holt (2005) are single sinusoids of equal amplitude, separated in time (not continuous), and randomized on a trial-by-trial basis. The nonspeech contexts provide neither acoustic structure consistent with articulation nor acoustic information sufficient to support phonetic labeling (see Lotto, 2004). What they do share with the speech contexts they model is a very limited resemblance to the spectral information that differentiates, for example, the /al/ from /ar/ contexts that have been shown to influence speech categorization (Mann, 1980).

The directionality of the context-dependence is likewise predictable from this spectral information. Across the observations of context-dependent speech categorization for speech and nonspeech contexts, the pattern of context-dependent categorization is spectrally contrastive (Holt, 2005; Lotto et al., 1997; Lotto and Kluender, 1998); precursors with acoustic energy in higher frequency regions (whether speech or nonspeech, e.g., /al/ or nonspeech sounds modeling the spectrum of /al/) shift categorization toward the speech category characterized by lower-frequency acoustic energy (i.e., /ga/) whereas lower-frequency precursors (/ar/ or nonspeech sounds modeling /ar/) shift categorization toward the higher-frequency alternative (i.e., /da/). The auditory perceptual system appears to be operating in a manner that serves to emphasize spectral change in the acoustic signal. Contrastive mechanisms are a fundamental characteristic of perceptual processing across modalities. General mechanisms of auditory processing that produce spectral contrast may give rise to the results observed for speech and non-speech contexts in human listeners with varying levels and types of language expertise (Mann, 1986; Fowler et al., 1990) and in quail subjects (Lotto et al., 1997). Neural adaptation and inhibition are simple examples of neural mechanisms that exaggerate contrast in the auditory system (Smith, 1979; Sutter et al., 1999), but others exist at higher levels of auditory processing (see e.g., Delgutte, 1996; Ulanovsky et al., 2003; 2004) that produce contrast without a loss in sensitivity (Holt and Lotto, 2002). 
The observation of nonspeech context effects on speech categorization when context and target are presented to opposite ears (Holt and Lotto, 2002; Lotto et al., 2003) and findings demonstrating effects of non-adjacent nonspeech context on speech categorization (Holt, 2005) indicate that the mechanisms are not solely sensory.1 Moreover, there is evidence that mechanisms producing spectral contrast may operate over multiple time scales (Holt, 2005; Ulanovsky et al., 2003, 2004).
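The contrastive direction of these effects can be made concrete with a toy two-channel gain-adaptation model. This is purely illustrative: the two-channel structure, the divisive gain rule, and the constant `k` are assumptions for the sketch, not a model drawn from the cited literature.

```python
def contrastive_shift(context_low, context_high, target_low, target_high, k=0.5):
    """Toy spectral-contrast sketch: each frequency channel's gain is
    divisively reduced in proportion to the energy the preceding context
    placed in that channel (a crude stand-in for adaptation/inhibition)."""
    gain_low = 1.0 / (1.0 + k * context_low)
    gain_high = 1.0 / (1.0 + k * context_high)
    # Positive values: target's effective spectrum tilts high; negative: low.
    return target_high * gain_high - target_low * gain_low

# A context with high-frequency energy (like /al/) tilts an ambiguous
# target's effective spectral balance toward low frequencies, and vice versa.
after_high_context = contrastive_shift(0.0, 1.0, 1.0, 1.0)
after_low_context = contrastive_shift(1.0, 0.0, 1.0, 1.0)
```

The sign flip mirrors the contrastive pattern described above: high-frequency precursors push an ambiguous target toward the lower-frequency category (/ga/), and low-frequency precursors push it toward /da/.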

By this general perceptual account, speech- and nonspeech-elicited context effects emerge from common processes that are part of general auditory processing. These mechanisms are broadly described as spectrally contrastive in that they emphasize spectral change in the acoustic signal, independent of its classification as speech or nonspeech or whether the signal carries information about speech articulation. So far, observed effects have been limited to the influence of speech or nonspeech contexts on speech categorization (or, conversely, the effects of speech contexts on nonspeech perception, Stephens and Holt, 2003). However, an account that relies upon spectral contrast makes strong directional predictions about context-dependent speech categorization in circumstances in which both speech and non-speech contexts are present. Specifically, this account predicts that when both speech and nonspeech are present as context, their effects on speech categorization will be dictated by their spectral characteristics such that they may either cooperate or conflict in their direction of influence on speech categorization as a function of how they are paired. If the speech and nonspeech contexts are matched in the distribution of spectral energy that they possess such that they are expected to shift speech categorization in the same direction, then nonspeech may collaborate with speech to produce greater effects of context than observed for speech contexts alone. Conversely, when nonspeech and speech contexts possess spectra that push speech categorization in opposing directions, nonspeech contexts should be expected to lessen the influence of speech contexts on speech categorization. As a means of empirically examining the hypotheses arising from this account, the present experiments examine speech categorization when both speech and nonspeech signals serve as acoustic context, specifically investigating the degree to which they may jointly influence speech categorization.

II. EXPERIMENT 1

The aim of this study, then, is to assess the relative influence of speech and jointly presented nonspeech contexts on speech categorization. Experiment 1 examines speech categorization of a /ga/ to /da/ syllable series across three context conditions: (1) preceding /al/ and /ar/ syllables; (2) the same speech syllables paired with spectrally matched nonspeech acoustic histories (as described by Holt, 2005) that shift speech categorization in the same direction (e.g., High Mean acoustic histories paired with /al/); (3) the same speech syllables paired with spectrally mismatched nonspeech acoustic histories that shift speech categorization in opposing directions (e.g., Low Mean acoustic histories paired with /al/). Whereas the speech contexts remain consistent across conditions, the nonspeech contexts vary. Thus, if speech and non-speech contexts fail to jointly influence speech categorization, there will be no significant differences in speech categorization across conditions and, as in previous studies, speech targets preceded by /al/ will be more often categorized as “ga” than the same targets preceded by /ar/. If, however, the two sources of acoustic context mutually influence speech categorization as predicted by a general perceptual/cognitive account of context effects in speech perception, then the observed context effects will vary across conditions and the relative influence of each context source on speech categorization can be assessed.

A. Methods

1. Participants

Ten adult monolingual English listeners recruited from the Carnegie Mellon University community participated in return for a small payment or course credit. All participants reported normal hearing.

2. Stimuli

Stimulus design is schematized in Fig. 1. Each stimulus consisted of an acoustic history composed of 21 sine-wave tones, followed by a speech context syllable, a 50-ms silent interval, and a speech target drawn from a stimulus series varying perceptually from /ga/ to /da/.

FIG. 1.


At the top, an illustration displays the elements of each stimulus. Representative spectrograms (on time × frequency axes) below show example stimuli from Experiment 1 conditions. Stimuli from the Cooperating condition (top row) are composed of spectrally matched speech and non-speech contexts that have been shown previously to shift speech categorization in the same direction. Examples of Conflicting condition stimuli for which spectrally mismatched non-speech and speech precursors have opposing effects on speech categorization are shown in the bottom row.

a. Speech

Speech target stimuli were identical to those described previously (Holt, 2005; Wade and Holt, 2005). Natural tokens of /ga/ and /da/ spoken in isolation were digitally recorded from an adult male monolingual English speaker (CSL, Kay Elemetrics; 20-kHz sample rate, 16-bit resolution). From a number of natural productions, one /ga/ and one /da/ token were selected that were nearly identical in spectral and temporal properties except for the onset frequencies of F2 and F3. LPC analysis was performed on each of the tokens and a nine-step sequence of filters was created (Analysis-Synthesis Laboratory, Kay Elemetrics) such that the onset frequencies of F2 and F3 varied approximately linearly between /g/ and /d/ endpoints. These filters were excited by the LPC residual of the original /ga/ production to create an acoustic series spanning the natural /ga/ and /da/ end points in approximately equal steps. Each stimulus was 589 ms in duration. The series was judged by the experimenter to comprise a gradual shift between natural-sounding /ga/ and /da/ tokens and this impression was confirmed by regular shifts in phonetic categorization across the series by participants in the Holt (2005) and Wade and Holt (2005) studies. These speech series members served as categorization targets for each experimental condition. Spectrograms of odd-numbered series stimuli are shown in Fig. 2.

FIG. 2.


Spectrograms of the odd-numbered stimuli along the nine-step /ga/ to /da/ series that served as categorization targets in Experiments 1 and 2.

In addition, there were two speech context stimuli. These 250-ms syllables corresponded perceptually to /al/ and /ar/ and were composed of a 100-ms steady-state vowel followed by a 150-ms linear formant transition. Stimuli were synthesized using the cascade branch of the Klatt (1980) synthesizer. These stimuli were identical to those shown in earlier reports to produce spectrally contrastive context effects on perception of speech (Lotto and Kluender, 1998) and non-speech (Stephens and Holt, 2003). Lotto and Kluender (1998) provide full details of stimulus synthesis.

b. Nonspeech

Acoustic histories were created as described by Holt (2005). Each acoustic history was composed of twenty-one 70-ms sine-wave tones, each with a unique frequency, separated by 30-ms silent intervals. The distributions’ mean frequencies (1800 and 2800 Hz) were chosen based on the findings of Lotto and Kluender (1998), who demonstrated that single 1824 versus 2720 Hz tones produce a spectrally contrastive context effect on speech categorization targets varying perceptually from /ga/ to /da/. “Low Mean” acoustic histories were composed of 1300–2300 Hz tones (M = 1800 Hz, 50-Hz steps). “High Mean” acoustic histories possessed tones sampling 2300–3300 Hz (M = 2800 Hz, 50-Hz steps).

To minimize effects elicited by any particular tone ordering, acoustic histories were created by randomizing the order of the 21 tones on a trial-by-trial basis. Each trial was unique; acoustic histories within a condition were distinctive in surface acoustic characteristics, but were statistically consistent with other stimuli drawn from the distribution defining the nonspeech context. Thus, any influence of acoustic histories on speech categorization is indicative of listeners’ sensitivity to the long-term spectral distribution of the acoustic history and not merely to the simple acoustic characteristics of any particular segment (for further discussion see Holt, 2005).

Tones comprising the acoustic histories were synthesized with 16-bit resolution and sampled at 10 kHz using MATLAB (Mathworks, Inc.). Linear onset/offset amplitude ramps of 5 ms were applied to all tones. Target speech stimuli were digitally down-sampled from their 20-kHz recording rate to 10 kHz, and both tones and speech tokens were digitally matched to the rms energy of the /da/ end point of the target speech series.
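The nonspeech synthesis steps above can be sketched as follows. Only the sample rate, tone and gap durations, ramp length, 50-Hz frequency spacing, and distribution ranges come from the text; the function and variable names are invented for the sketch, the random seed is arbitrary, and rms matching to the /da/ end point is omitted.

```python
import numpy as np

FS = 10_000                       # Hz, synthesis sample rate from the text
TONE_MS, GAP_MS, RAMP_MS = 70, 30, 5

def make_history(lo_hz, hi_hz, rng):
    """One 'acoustic history': 21 tones sampling lo_hz..hi_hz in 50-Hz
    steps, order randomized per trial, 5-ms linear onset/offset amplitude
    ramps, and 30-ms silences between successive tones."""
    freqs = rng.permutation(np.arange(lo_hz, hi_hz + 1, 50))
    n_tone, n_gap, n_ramp = (FS * ms // 1000 for ms in (TONE_MS, GAP_MS, RAMP_MS))
    env = np.ones(n_tone)
    env[:n_ramp] *= np.linspace(0.0, 1.0, n_ramp)   # onset ramp
    env[-n_ramp:] *= np.linspace(1.0, 0.0, n_ramp)  # offset ramp
    t = np.arange(n_tone) / FS
    pieces = []
    for i, f in enumerate(freqs):
        pieces.append(env * np.sin(2 * np.pi * f * t))
        if i < len(freqs) - 1:                      # silence between tones only
            pieces.append(np.zeros(n_gap))
    return np.concatenate(pieces)

rng = np.random.default_rng(0)                  # hypothetical seed
low_mean = make_history(1300, 2300, rng)        # M = 1800 Hz distribution
high_mean = make_history(2300, 3300, rng)       # M = 2800 Hz distribution
```

Randomizing the tone order per call mirrors the trial-by-trial randomization described above: each generated history is unique on the surface but statistically consistent with its distribution.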

As discussed in Sec. I, a very broad interpretation of the kinds of acoustic energy that may carry articulatory information might raise the concern that the High and Low mean acoustic histories could serve as information about articulatory events and perhaps lead listeners to identify the nonspeech acoustic histories phonetically. To address this concern, 10 monolingual English participants who reported normal hearing completed a pilot stimulus test. These participants did not serve as listeners in any of the reported experiments and had not participated in experiments of this sort before. They identified the High and Low mean acoustic histories as “al” or “ar” in the context of the speech syllables, described above, that followed them. If the limited spectral information that the acoustic histories model from the /al/ and /ar/ contexts serves as information about articulatory events, High mean acoustic histories should elicit more “al” responses and Low mean acoustic histories more “ar” responses. This was not the case: listeners labeled the High mean acoustic histories as “al” no more often (MHigh = 51.1, SE=0.52) than the Low mean acoustic histories (MLow = 51.0, SE=1.19; t<1 in a paired-samples t-test).

c. Stimulus construction

Two sets of stimuli were constructed from these elements. To create the hybrid nonspeech/speech contexts preceding the speech targets, each of the nine /ga/ to /da/ target stimuli was appended to the /al/ and /ar/ speech contexts with a 50-ms silent interval separating the syllables. Each of the resulting 18 disyllables was appended to two nonspeech contexts, one an acoustic history defined by the High Mean distribution and the other an acoustic history with a Low Mean. This pairing of disyllables with acoustic histories was repeated 10 times, with a different acoustic history for each repetition. This resulted in 360 unique stimuli, exhaustively pairing /al/ and /ar/ speech contexts with High and Low mean nonspeech contexts and the nine target speech series stimuli across 10 repetitions. A second set of stimuli with only speech contexts preceding the speech targets also was created; /al/ and /ar/ stimuli were appended to each of the speech target series members with a 50-ms interstimulus silent interval for a total of 18 stimuli. These stimuli were presented 10 times each during the experiment.
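The pairing arithmetic above (2 speech contexts × 2 acoustic-history distributions × 9 targets × 10 repetitions = 360 hybrid stimuli, plus 2 × 9 = 18 speech-only stimuli) can be checked with a short enumeration; the labels are invented for the sketch.

```python
from itertools import product

speech_contexts = ["al", "ar"]
histories = ["High", "Low"]       # acoustic-history distributions
targets = range(1, 10)            # nine /ga/-/da/ series steps
repetitions = range(10)           # ten unique acoustic histories per pairing

hybrid_stimuli = list(product(speech_contexts, histories, targets, repetitions))
speech_only_stimuli = list(product(speech_contexts, targets))
# 2 * 2 * 9 * 10 = 360 hybrid stimuli; 2 * 9 = 18 speech-only stimuli
```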

3. Design and procedure

The pairing of speech and nonspeech contexts in stimulus creation yielded the two experimental conditions illustrated in Fig. 1. Stimuli making up the Conflicting condition possessed acoustic histories and speech context syllables that have been shown to have opposing effects on speech categorization (Holt, 2005; Lotto and Kluender, 1998; Mann, 1980). The Cooperating condition was made up of stimuli possessing speech and nonspeech precursor contexts that shift speech categorization in the same direction. Note that these pairings can also be described in terms of the spectral characteristics of the component context stimuli because spectral characteristics well-predict the directionality of context effects on speech categorization (e.g., Holt, 2005; Lotto and Kluender, 1998). For example, High Mean acoustic histories were matched with /al/ (also possessing greater high-frequency acoustic energy) in the spectrally matched Cooperating condition and with /ar/ (with greater low-frequency energy) in the spectrally mismatched Conflicting condition.

Seated in individual sound-attenuated booths, listeners categorized the speech target of each stimulus by pressing electronic buttons labeled “ga” and “da.” Listeners completed two blocks in a single session; the order of the blocks was counterbalanced. In one block, the hybrid nonspeech plus speech contexts preceded the speech targets. In this block, stimulus presentation was mixed across the Conflicting and Cooperating conditions. In the other (Speech Only) block, participants heard only /al/ or /ar/ preceding the speech targets. Thus, each listener responded to stimuli from all three conditions.

Acoustic presentation was under the control of Tucker Davis Technologies System II hardware; stimuli were converted from digital to analog, low-pass filtered at 4.8 kHz, amplified and presented diotically over linear headphones (Beyer DT-150) at approximately 70 dB SPL(A).

B. Results

Results were analyzed in terms of average percent “ga” responses across stimulus repetitions and are plotted in the top row of Fig. 3. The nonoverlapping categorization curves illustrated in each of the top panels of Fig. 3 are indicative of an influence of context for each condition (see also the marginal means plotted in Fig. 4). Critically, although the immediately preceding speech context was constant across conditions, the observed context effects were not identical. Repeated-measures analysis of variance results are described in the following. Probit boundary analysis (Finney, 1971) of participants’ category boundaries across conditions reveals the same pattern of results. The results of these analyses are provided in Table I.

FIG. 3.


Mean “ga” responses to speech series stimuli for Experiment 1 (top panel) and Experiment 2 (bottom panel). The “Speech Only” panels present categorization data for /al/ and /ar/ contexts. The other two panels illustrate categorization when the same stimuli are preceded by High and Low Mean acoustic histories and the /al/ or /ar/ precursors. In the “Cooperating” condition, speech and nonspeech precursors are expected to shift categorization in the same direction (High+ /al/, Low + /ar/). In the “Conflicting” condition, acoustic histories and speech precursors exert opposite effects on speech categorization (Low+ /al/, High+ /ar/).

FIG. 4.


Marginal means across condition and experiment.

TABLE I.

Category boundaries were estimated for each participant’s response to each condition of the experiment. The mean probit boundary across participants is presented in terms of the stimulus step across the nine-step /ga/ to /da/ categorization target series. The results parallel those of the ANOVA analyses across the speech stimulus series reported in the text.

Experiment  Condition    Precursor   Mean probit boundary  Standard error  t-test
1           Speech Only  /al/        7.0                   0.21            t(9) = 3.13, p=0.01
                         /ar/        6.46                  0.27
            Cooperating  High+/al/   7.16                  0.23            t(9) = 3.71, p=0.005
                         Low+/ar/    5.96                  0.29
            Conflicting  Low+/al/    6.12                  0.21            t(9) = 3.76, p=0.005
                         High+/ar/   6.82                  0.25
2           Speech Only  /al/        6.79                  0.24            t(9) = 3.59, p=0.01
                         /ar/        5.97                  0.36
            Cooperating  High+/al/   7.21                  0.23            t(9) = 5.94, p<0.0001
                         Low+/ar/    5.98                  0.24
            Conflicting  Low+/al/    6.70                  0.22            t(9) = 0.3, p=0.77
                         High+/ar/   6.64                  0.19
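The boundaries in Table I come from Finney’s (1971) maximum-likelihood probit analysis. A simplified least-squares stand-in conveys the idea: probit-transform the “ga” proportions, fit a line over stimulus step, and solve for the 50% point. The clipping constant and function name are assumptions of this sketch, not part of Finney’s method.

```python
from statistics import NormalDist

def probit_boundary(ga_proportions, steps):
    """Estimate a category boundary: z-transform the proportion of 'ga'
    responses at each stimulus step, fit a line by least squares, and
    return the step at which the fitted p('ga') crosses 0.5 (z = 0)."""
    nd = NormalDist()
    # Clip proportions away from 0/1 so the inverse CDF stays finite.
    z = [nd.inv_cdf(min(max(p, 0.01), 0.99)) for p in ga_proportions]
    n = len(steps)
    mx, mz = sum(steps) / n, sum(z) / n
    slope = (sum((s - mx) * (v - mz) for s, v in zip(steps, z))
             / sum((s - mx) ** 2 for s in steps))
    intercept = mz - slope * mx
    return -intercept / slope      # z = 0  <=>  p('ga') = 0.5
```

With idealized data generated from a cumulative Gaussian centered at a given step, the fitted boundary recovers that step.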

1. Speech Only condition

The average percent “ga” responses across participants were submitted to a 2×9 (Context×Target Speech Stimulus) repeated measures ANOVA. This analysis revealed a significant effect of Context, F(1,9) = 12.12, p=0.007, ηp2 =0.574. Consistent with earlier findings (Lotto and Kluender, 1998; Mann, 1980), listeners categorized speech targets preceded by /al/ as “ga” significantly more often (M = 60.44, SE=2.86; here and henceforth, means refer to “ga” responses averaged across target speech stimuli and participants) than the same targets preceded by /ar/ (M = 55.22, SE=2.57). These data confirm that, on their own, the speech context precursors have a significant effect on categorization of neighboring speech targets. Probit boundary values are presented in Table I.
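For a two-level within-subject factor like Context, the repeated-measures F for the main effect reduces to the square of a paired t computed on participants’ per-condition means (collapsed across target stimuli). A minimal sketch, with hypothetical data; this is the standard equivalence F(1, n−1) = t², not a reconstruction of the reported analysis.

```python
from math import sqrt
from statistics import mean, stdev

def rm_anova_two_level(cond_a, cond_b):
    """One-way repeated-measures ANOVA with two within-subject levels.
    Equivalent to a paired t-test on per-participant differences:
    F(1, n-1) = t**2. Returns the F statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t * t, (1, n - 1)
```

For example, each participant’s mean percent-“ga” under /al/ versus /ar/ would be passed as `cond_a` and `cond_b`.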

2. Cooperating condition

A 2×9 (Context×Target Speech Stimulus) repeated measures ANOVA revealed that there was also a significant effect of Cooperating nonspeech/speech contexts on speech categorization, F(1,9) = 40.22, p<0.0001, ηp2 =0.817. As would be expected from the influence that speech and non-speech contexts elicit independently (Lotto and Kluender, 1998; Holt, 2005), the effect observed in the Cooperating condition was spectrally contrastive; categorization was shifted in the same direction as in the Speech Only condition. When listeners heard speech targets preceded by High Mean acoustic histories paired with /al/, they more often categorized the targets as “ga” (M = 62.22, SE=2.05) than when the same targets were preceded by Low Mean acoustic histories paired with /ar/ (M = 49.11, SE=2.34).

The primary aim of this study was to examine potential joint effects of speech and nonspeech acoustic contexts in influencing speech target categorization. A 2×2×9 (Condition×Context×Target Speech Stimulus) repeated measures ANOVA of the categorization patterns of the Speech Only condition versus those of the Cooperating condition indicates that when speech and nonspeech contexts are spectrally matched such that they are expected to influence speech categorization similarly, they collaborate to produce an even greater context effect on speech target categorization (MHigh+/al/ = 62.22 vs MLow+/ar/ = 49.11) than do the speech targets on their own (M/al/ = 60.44 vs M/ar/ = 55.22), as indicated by a significant Context by Condition interaction, F(1,9) = 6.42, p=0.03, ηp2 =0.416.

3. Conflicting Condition

A 2×9 (Context×Target Speech Stimulus) repeated measures ANOVA of responses to Conflicting condition stimuli revealed that when the spectra of speech and non-speech contexts predicted opposing effects on speech categorization, there was also a significant effect of context, F(1,9) = 25.97, p=0.001, ηp2 =0.743. Note, however, the direction of this effect. Listeners more often categorized target syllables as “ga” when they were preceded by the High Mean acoustic histories paired with /ar/ speech precursors (% “ga” responses: MHigh+/ar/ = 59.89, SE=2.41 vs MLow+/al/ = 49.11, SE=2.34). In this example, the /ar/ speech context independently predicts more “da” responses (Mann, 1980) whereas the High Mean nonspeech acoustic histories independently predict more “ga” responses (Holt, 2005). Listeners more often responded “ga,” following the expected influence of the nonspeech context rather than that of the speech context that immediately preceded the speech targets. These results indicate that when the spectra of nonspeech and speech contexts are put in conflict, the influence of temporally nonadjacent nonspeech context may be robust enough even to undermine the expected influence of temporally adjacent speech contexts.

Of note, a 2×2×9 (Condition×Context×Target Speech Stimulus) repeated measures ANOVA comparing the Conflicting condition to the Speech Only condition revealed no main effect of Context, F(1,9) = 2.98, p=0.119, ηp2 =0.249, but a significant Condition by Context interaction, F(8,72) = 83.17, p<0.0001, ηp2 =0.902. This indicates that the context effect produced by the speech contexts plus conflicting nonspeech contexts was statistically equivalent in magnitude, although opposite in direction, to that produced by the speech contexts alone.

4. Comparison of Cooperating vs Conflicting conditions

The relative contributions of speech and nonspeech contexts can be assessed with a 2×2×9 (Condition×Context×Target Speech Stimulus) repeated measures ANOVA comparing the effects of nonspeech/speech hybrid contexts across Cooperating and Conflicting conditions. This analysis reveals an overall main effect of Context (context was coded in terms of the nonspeech segment of the precursor), F(1,9) = 37.207, p<0.0001, ηp2 =0.805, such that listeners more often labeled speech targets as “ga” when nonspeech precursors were drawn from the High Mean acoustic history distribution (M = 61.06, SE=2.01) than the Low Mean distribution (M = 51.50, SE=2.09). The contribution of the speech contexts to target syllable categorization is reflected in this analysis by the significant Condition by Acoustic History interaction, F(1,9) = 9.69, p=0.01, ηp2 =0.518. With /al/ precursors, targets were somewhat more likely to be categorized as “ga” (M = 58.056, SE=1.9) whereas with /ar/ precursors the same stimuli were less likely to be categorized as “ga” (M = 54.50, SE=2.13). Thus, across conditions there is evidence of the joint influence of speech and nonspeech contexts. Moreover, the directionality of the observed effects is well-predicted by the spectral characteristics of the speech and nonspeech contexts.

C. Discussion

The percept created by the Experiment 1 hybrid nonspeech/speech stimuli is one of rapidly presented tones preceding a bi-syllabic speech utterance. One could easily describe these nonspeech precursors as extraneous to the task of speech categorization and, indeed, listeners were not required to make any explicit judgments about them during the perceptual task. The task in this experiment was speech perception. Yet, even in these circumstances nonspeech contexts contributed to speech categorization. Speech does not appear to have a privileged status in producing context effects on speech categorization, even when afforded the benefit of temporal adjacency with the target of categorization.

Although general perceptual/cognitive accounts of speech perception are most consistent with these effects and can account for the directionality of the observed context effects, it is nonetheless surprising even from this theoretical perspective that the effect of nonspeech contexts is so robust. The results run counter to modular accounts that would suggest that there are special-purpose mechanisms for processing speech that are informationally encapsulated and therefore impenetrable to influence by nonlinguistic information (Liberman et al., 1967; Liberman and Mattingly, 1985). The sine-wave tones that comprised the nonspeech contexts are among the simplest of acoustic signals. To consider them information for speech perception by a speech-specific module would require a module so broadly tuned as to be indistinguishable from more interactive processing schemes. The results of Experiment 1 are also difficult to reconcile with a direct realist perspective on speech perception. The direct realist interpretation of the categorization patterns observed in the Speech Only condition is that the speech contexts provide information relevant to parsing the dynamics of articulation (Fowler, 1986; Fowler and Smith, 1986; Fowler et al., 2000). It is unclear from a direct realist perspective why, in the presence of clear speech contexts providing information about articulatory gestures, listeners would be influenced by nonspeech context sounds at all, let alone be more influenced by the nonspeech contexts than the speech contexts in the Conflicting condition. It does not appear that context must carry structured information about articulation to have an impact on speech processing.

III. EXPERIMENT 2

The stimuli created for Experiment 1 were constructed as a compromise among stimuli used in previous experiments investigating speech and nonspeech context effects. The /ga/ to /da/ speech target series of Holt (2005) was chosen for its naturalness in an effort to provide the most conservative estimate of context-dependence (synthesized or otherwise degraded speech signals are typically thought to be more susceptible to contextual influence). The synthetic /al/ and /ar/ contexts were taken from the stimulus materials of Lotto and Kluender (1998) because they produce a robust influence on speech categorization along a /ga/ to /da/ series (see also Stephens and Holt, 2003). Nonetheless, there are stimulus differences originating from the synthetic nature of the /al/ and /ar/ speech contexts of Experiment 1 and the more natural characteristics of the speech targets. This could lead the two sets of speech materials to be perceived as originating from different sources. If this were the case, the independence of the sources should reduce or eliminate articulatory gestural information relevant to compensating for intraspeaker effects of coarticulation (a within-speaker phenomenon) via gestural parsing. Although previous research has provided evidence of cross-speaker phonetic context effects (Lotto and Kluender, 1998), it may nonetheless be argued that Experiment 1 does not provide the most conservative test of nonspeech/speech context effects because of the possible perceived difference in speech source across syllables.

Therefore, Experiment 2 was conducted in the same manner as Experiment 1, but using natural /al/ and /ar/ productions recorded from the same speaker that produced the end point stimuli of the /ga/ to /da/ speech target stimulus series. The experiment thus serves as both a replication of the findings of Experiment 1 and an opportunity to investigate whether the influence of nonspeech context on speech categorization is robust enough to persist even when speech contexts and targets originate from the same source.

A. Methods

1. Participants

Ten adult monolingual English listeners, none of whom participated in Experiment 1, received a small payment or course credit for volunteering. All participants were recruited from the Carnegie Mellon University community and reported normal hearing.

2. Stimuli

Stimulus design was identical to that of Experiment 1, except that the speech context stimuli were digitally recorded (20-kHz sample rate, 16-bit resolution) natural utterances of /al/ and /ar/ spoken in isolation by the same speaker who recorded the natural speech end points of the target stimulus series. The 350-ms syllables were down-sampled to 10 kHz and matched in rms energy to the /da/ end point of the target stimulus series. These syllables served as the speech contexts in the stimulus construction protocol described for Experiment 1.

3. Design and Procedure

The design, procedure, and apparatus were identical to those of Experiment 1.

B. Results

The results of Experiment 2 are shown in the bottom row of Fig. 3. Marginal means are plotted in Fig. 4. Probit boundary values are presented in Table I.
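The probit boundaries referenced here (following Finney, 1971) correspond to the 50% crossover of a cumulative Gaussian fitted to the identification function across the nine-step series. A hedged sketch of this fitting procedure, using invented identification percentages rather than the experiment's data, would look like:

```python
# Illustrative probit-boundary fit: percent "ga" identification across a
# hypothetical 9-step /ga/-/da/ series (invented values, not Table I data).
# The category boundary is the 50% point (mu) of a fitted cumulative Gaussian.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

steps = np.arange(1, 10)  # nine target stimuli along the series
pct_ga = np.array([98, 95, 90, 78, 60, 35, 18, 8, 3]) / 100.0

def probit(x, mu, sigma):
    # Descending identification function: "ga" responses fall across the series.
    return 1.0 - norm.cdf(x, mu, sigma)

(mu, sigma), _ = curve_fit(probit, steps, pct_ga, p0=[5.0, 1.0])
print(f"probit boundary at stimulus step {mu:.2f} (slope sigma = {sigma:.2f})")
```

A context effect then appears as a shift in the fitted boundary (mu) between context conditions.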

1. Speech Only condition

Consistent with the findings of Experiment 1, there was a significant influence of preceding /al/ and /ar/ on speech target categorization. A 2×9 (Context×Target Speech Stimulus) repeated measures ANOVA confirmed that listeners categorized speech targets preceded by /al/ as “ga” significantly more often (M = 60.89, SE=2.28) than the same targets following /ar/ (M = 51.00, SE=3.4), F(1,9) = 18.426, p=0.002, ηp2 =0.672. Thus, natural /al/ and /ar/ recordings matched to the target source produced a significant context effect on categorization of the speech targets.

One potential concern about the use of synthesized speech contexts in Experiment 1 was that a perceived change in talker may have reduced the observed effects of speech context. However, a cross-experiment 2×2×9 (Experiment×Context×Target Speech Stimulus) mixed model ANOVA with Experiment as a between-subjects factor did not reveal a significant difference between the context effects produced by the synthesized /al/ and /ar/ stimuli of Experiment 1 and the naturally produced stimuli of Experiment 2, F(1,18) = 2.88, p=0.11, ηp2 =0.138.

2. Cooperating condition

The primary question of interest is whether nonspeech contexts influence speech categorization even in the presence of adjacent speech signals originating from the same source. A 2×9 (Context×Target Speech Stimulus) repeated measures ANOVA supports what is illustrated in the bottom row of Fig. 3. There was a significant spectrally contrastive effect of the cooperating, spectrally matched, speech and nonspeech contexts, F(1,9) = 76.21, p<0.0001, ηp2 =0.894, such that listeners more often categorized speech targets as “ga” when High Mean nonspeech precursors and /al/ preceded them (M = 64.00, SE=2.19) than when Low Mean nonspeech precursors and /ar/ preceded them (M = 50.56, SE=2.54).

An additional 2×2×9 (Condition×Context×Target Speech Stimulus) repeated measures ANOVA examined the context effects across the Speech Only and Cooperating conditions of Experiment 2. Of note, although the mean context effect was greater in the Cooperating condition (MHigh+/al/ − MLow+/ar/ = 13.44%) than in the Speech Only condition (M/al/ − M/ar/ = 9.89%), this difference was not statistically reliable, F(1,9) = 2.52, p=0.147, ηp2 =0.219. This differs from Experiment 1, for which speech and nonspeech contexts collaborated in the Cooperating condition to produce a greater effect of context on speech categorization than did the speech contexts alone.

3. Conflicting condition

An analogous analysis was conducted across the Speech Only and Conflicting conditions, revealing that the categorization patterns observed for the Conflicting condition were significantly different from those found for the Speech Only condition, F(1,9) = 18.63, p=0.002, ηp2 =0.674. A 2×9 (Context×Target Speech Stimulus) repeated measures ANOVA showed that, contrary to the robust effect of speech contexts in the Speech Only condition, there was no effect of hybrid nonspeech/speech contexts in the Conflicting condition, F(1,9) = 3.29, p=0.103, ηp2 =0.267 (MLow+/al/ = 57.00, SE=2.52 vs MHigh+/ar/ = 59.11, SE=2.52). The presence of spectrally mismatched nonspeech contexts effectively neutralized the influence of the natural speech precursors.

4. Comparing Cooperating and Conflicting conditions

A comparison of the patterns of categorization for the hybrid nonspeech/speech context conditions with a 2×2×9 (Condition×Context×Target Speech Stimulus) repeated measures ANOVA revealed a main effect of Context (entered into the analysis in terms of the nonspeech characteristics of the context) such that listeners more often labeled the speech targets as “ga” when the nonspeech context was drawn from a distribution with a High Mean frequency (M = 60.50, SE=2.26) than when it was drawn from a distribution with a Low Mean frequency (M = 54.83, SE=2.47), F(1,9) = 25.61, p=0.001, ηp2 =0.740. In this analysis, the contribution of the speech contexts to speech target categorization was reflected by a significant Condition by Context interaction, F(1,9) = 99.10, p<0.0001, ηp2 =0.917, such that listeners more often labeled targets as “ga” when the precursor syllable was /al/ (M = 61.56, SE=2.25) than when it was /ar/ (M = 53.78, SE=2.42). Thus, both speech and nonspeech contexts contributed to the categorization responses observed in the hybrid context conditions of Experiment 2.

C. Discussion

The overall pattern of results of Experiment 2 confirms that speech and nonspeech contexts jointly influenced speech categorization, even when the natural speech contexts were matched to the categorization targets in source. Of note, however, the influence of the nonspeech contexts in the presence of the natural speech contexts was less dramatic than were the effects observed when the same nonspeech contexts were paired with synthesized speech syllables in Experiment 1. Contrary to the findings of Experiment 1, the nonspeech precursors did not collaborate with the natural speech contexts of Experiment 2 to produce a context effect significantly greater than that elicited by the natural speech syllables alone. Moreover, although there was strong evidence of joint nonspeech/speech context effects in the Experiment 2 Conflicting condition, the influence of the nonspeech was not so strong as to overpower the natural speech context and reverse the observed context effect as it did in Experiment 1. These more modest patterns of interaction may be due to the somewhat stronger effect of context elicited by the natural speech syllables. This difference, evident in the shift in mean “ga” responses across speech contexts in the Speech Only conditions (the difference in mean percent “ga” responses for /al/ vs /ar/ contexts was 5.22% in Experiment 1 and 9.89% in Experiment 2), was not consistent enough to be statistically reliable across experiments. Nonetheless, the pattern of effect sizes suggests that the natural speech syllables may have contributed a greater overall influence to target speech categorization. This is simply to say that the speech contexts of Experiment 2 may have contributed more to the resulting target percept, relative to the strong influence of the nonspeech contexts, than did the synthesized syllables of Experiment 1.

To examine this possibility more closely, an additional statistical analysis was conducted to determine the relative contribution of speech contexts in the hybrid nonspeech/speech conditions across experiments as speech context type (synthesized, natural) varied. A 2×2×2×9 (Experiment×Condition×Context×Target Speech Stimulus) mixed model ANOVA with Experiment as a between-subjects factor compared the relative influence of speech contexts in the Cooperating versus Conflicting conditions across experiments. A significant difference is reflected by the three-way Experiment×Condition×Context interaction, F(1,18) = 56.83, p<0.0001, ηp2 =0.759. When nonspeech contexts were present, the relative influence of synthesized versus natural speech contexts differed. Computing the difference in mean “ga” responses in the hybrid nonspeech/speech conditions conditioned on the speech context illustrates why this is so. The categorization shift attributable to the synthesized speech contexts of Experiment 1 (M/al/ − M/ar/ = 58.06 − 54.50 = 3.56) is significantly less than that of the natural speech contexts of Experiment 2 (M/al/ − M/ar/ = 61.56 − 53.78 = 7.78). Many factors may have contributed to the relatively greater effect of context produced by the natural syllables including, but not limited to, the richer acoustic characteristics of natural speech, the closer spectral correspondence of the natural syllables with the target speech syllables, perception of the two syllables as originating from the same talker, amplitude relationships of the spectral energy from the two precursors, and auditory grouping by common acoustic characteristics. Whatever caused the natural syllables to be relatively stronger contributors to the effect on speech categorization, the results of the Conflicting condition nevertheless provide strong evidence of perceptual contributions from both nonspeech and speech contexts even for natural speech contexts. Moreover, the statistical analyses of the Experiment 2 Cooperating versus Conflicting conditions provide corroborating evidence that both the speech and nonspeech contexts contributed to the observed pattern of results.

IV. GENERAL DISCUSSION

A spectral contrast account of context-dependent speech perception makes strong directional predictions about context-dependent speech categorization in circumstances in which both speech and nonspeech contexts are present. Specifically, it is expected that the effect of joint speech/ nonspeech context on speech categorization will be dictated by the spectral characteristics of each source of context such that the speech and nonspeech contexts may either cooperate or conflict in their direction of influence on speech categorization as a function of how they are paired. The results of two experiments demonstrate that speech and nonspeech contexts do jointly influence speech categorization. When hybrid nonspeech/speech context stimuli were spectrally matched in Experiment 1, they collaborated to produce a bigger effect of context on speech categorization than did the same speech contexts on their own. A context effect on speech categorization was also observed in this condition in Experiment 2 (for which natural utterances provided speech context), but this effect was not significantly greater than that observed for the natural speech contexts alone.

When the spectra of the hybrid nonspeech/speech contexts were spectrally mismatched such that they predicted opposing influences on speech categorization, the observed context effects differed from the context effect produced independently by the speech contexts. In Experiment 1, the context effect observed in the Conflicting condition was of equal magnitude, but in the opposite direction of that observed for solitary speech contexts. The direction of the context effect was predicted, not by the adjacent speech contexts, but instead by the spectral characteristics of the temporally nonadjacent nonspeech contexts. A qualitatively similar, although less dramatic, effect was observed for the spectrally conflicting speech and nonspeech contexts of Experiment 2; the nonspeech contexts neutralized the effect of speech context such that no context-dependent shift in target speech categorization was observed. Overall, the effects observed for the hybrid context conditions of Experiment 2, with natural speech contexts matched in source to the target syllables, were relatively more modest than those observed in Experiment 1. This may have been due to the somewhat larger effect of context exerted by the natural speech contexts. Most important to the aims of the study, however, both experiments provided evidence that linguistic and nonlinguistic sounds jointly contribute to observed context effects on speech categorization. The sum of the results is consistent with general auditory/cognitive approaches with an emphasis on the shared characteristics of the acoustic signal and the general processing of these elements, in this case, spectral distributions of energy. The spectral characteristics of the context stimuli, whether the stimuli were speech or non-speech, predicted the effects upon the speech categorization targets.

Mechanistically, an important issue that remains is whether such general auditory representations common to speech and nonspeech govern the joint effects of speech and nonspeech contexts on speech categorization or whether independent representations of the context stimuli exert an influence on speech categorization at a later decision stage. This is a thorny issue to resolve in any domain. Some theorists suggest that if processes share common resources or hardware, they can be expected to interfere or otherwise interact with one another, whereas if they are distinct, they should not. The present results meet this criterion for indication of common resources or hardware, but further investigation will be required to hold this question to a strict test. Nevertheless, whether nonspeech contexts operate on common representations or are integrated at a decision stage, the information that is brought to bear on speech categorization is clearly not dependent on the signal carrying information about articulation per se. An account cognizant of the spectral distributions of acoustic energy possessed by the context stimuli, as postulated by a general auditory/cognitive account under the term spectral contrast, makes the only clear predictions of what happens to speech categorization when speech and nonspeech are jointly present in the preceding input, and these predictions are supported by the results.

With respect to spectral contrast, there is an element of these experiments that may seem puzzling. Considering that previous research has demonstrated that adjacent nonspeech context influences speech categorization (e.g., Lotto and Kluender, 1998), one may wonder why the nonspeech contexts of the present experiments exerted their influence on the nonadjacent speech targets rather than the adjacent speech contexts. To understand why this should be so, it is useful to think about speech categorization as drawing from multiple sources of information.2 Context is merely one source of information; the acoustic signal corresponding to the target of perception is another. If the acoustic signal greatly favors one speech category alternative over another, then context exerts very little effect. This is the case, for example, for the more limited effects of context that emerge (here, and in other experiments) at the end points of the target speech categorization stimulus series, where acoustic information is unambiguous with respect to category membership. However, when acoustic signals are partially consistent with multiple speech categories, context has a role in categorization. In the present experiments, the speech target syllables were acoustically manipulated to create a series varying perceptually from /ga/ to /da/. Thus, by their very design the intermediate stimuli along this series were acoustically ambiguous and partially consistent with both /ga/ and /da/. Context was thus afforded an opportunity to exert an influence. In contrast, the acoustic structure of the speech context stimuli in the present experiments overwhelmingly favored either /al/ or /ar/; they were perceptually unambiguous, and context therefore could exert little influence on them. The results of the present experiments demonstrate that when the speech contexts are acoustically unambiguous, they contribute to the effects of context rather than themselves reflecting the influence of the nonspeech precursors.
Although it may seem surprising that the nonspeech context stimuli should influence perception of nonadjacent speech targets, recent research has demonstrated that the auditory system is willing to accept context information as evidence by which to shift a categorization decision even when it occurs more than a second prior and even when multiple acoustic signals intervene (Holt, 2005). By these standards, the nonadjacency of the nonspeech contexts and the speech targets in the present experiments is relatively modest.

In sum, the joint influence of speech and nonspeech acoustic contexts on speech categorization can most simply be accounted for by postulating common general perceptual origins. Previous research has highlighted parallels between phonetic context effects and those observed between purely nonspeech sounds (e.g., Diehl and Walsh, 1989), but these results have been challenged on the grounds that perception of nonspeech analogs to speech cannot be directly compared to speech perception, since speech has a clear, identifiable environmental source whereas nonspeech analogs to speech (pure tones, for example) do not (Fowler, 1990). A response to this challenge is that nonspeech contexts influence perception of speech (e.g., Lotto and Kluender, 1998). This is a stronger test in that it identifies the information sufficient for influencing speech categorization; when nonspeech stimuli model limited acoustic characteristics of the speech stimuli that produce context effects on speech targets, these nonspeech sounds likewise elicit context effects on speech categorization. The present experiments introduce a new paradigm to test the joint effects of speech and nonspeech context stimuli on speech categorization. This paradigm is perhaps even stronger in that it allows investigation of the influence of nonspeech signals on speech categorization in the presence of speech context signals that also exert context effects. The present results demonstrate the utility of this tool in pursuing the theoretical question of how best to account for the basic representation and processing of speech.

Acknowledgments

The author thanks Dr. A. J. Lotto, Dr. R. L. Diehl, Dr. C. Fowler, and Dr. T. Wade for helpful discussions of these results and C. Adams for her essential role in conducting the research. The work was supported by grants from the James S. McDonnell Foundation (Bridging Mind Brain and Behavior, 21st Century Scientist Award) and the National Institutes of Health (2 RO1 DC004674-04A2).

Footnotes

1

Although dichotic presentation of single tones and speech targets has been shown to produce context effects (Lotto et al., 2003), investigation of the influence of multiple-tone acoustic history contexts on speech categorization under dichotic presentation conditions has not been reported to date. However, the long time course (>1 s) over which effects of tonal acoustic histories on speech categorization persist and the observation that tonal acoustic histories influence speech categorization even when as many as 13 neutral tones intervene between the acoustic history and speech target argue that central (i.e., not purely sensory) auditory mechanisms play an important role (Holt, 2005).

2

This analysis is consistent with the work of a rational Bayesian decision maker whereby the optimal policy is to combine information from different sources to assign posterior probabilities to possible interpretations of the input and choose the alternative with the highest posterior probability. This approach is amenable to speech perception in that stochastic versions of the TRACE model of speech perception (McClelland, 1991) implement optimal Bayesian inference (Movellan and McClelland, 2001). Moreover, recent theoretical discussions have highlighted how Bayesian analysis may be fruitfully applied to issues in speech perception (Geisler and Diehl, 2002, 2003).
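The footnote's point can be made concrete with a minimal numerical sketch. Assuming invented likelihoods (the numbers below are illustrative, not derived from the experiments), a Bayesian listener multiplies evidence from independent sources — the target acoustics and the preceding context — and normalizes to obtain posterior probabilities over the category alternatives:

```python
# Minimal sketch of optimal Bayesian cue combination for an ambiguous
# mid-series target. All numbers are invented for illustration.
import numpy as np

categories = ["ga", "da"]
prior = np.array([0.5, 0.5])          # no a priori category bias
lik_target = np.array([0.55, 0.45])   # ambiguous target acoustics
lik_context = np.array([0.70, 0.30])  # context evidence favoring "ga"

# Combine independent sources multiplicatively, then normalize.
posterior = prior * lik_target * lik_context
posterior /= posterior.sum()

# Choose the alternative with the highest posterior probability.
choice = categories[int(np.argmax(posterior))]
print(posterior, choice)
```

On this scheme, context shifts the decision exactly when the target likelihoods are near-equivocal, mirroring the larger context effects observed at the ambiguous middle of the stimulus series.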

References

  1. Delgutte B. Auditory neural processing of speech. In: Hardcastle WJ, Laver J, editors. The Handbook of Phonetic Sciences. Blackwell; Oxford: 1996. pp. 505–538. [Google Scholar]
  2. Diehl RL, Walsh MA. An auditory basis for the stimulus-length effect in the perception of stops and glides. J Acoust Soc Am. 1989;85:2154–2164. doi: 10.1121/1.397864. [DOI] [PubMed] [Google Scholar]
  3. Finney DJ. Probit Analysis. Cambridge University Press; Cambridge: 1971. [Google Scholar]
  4. Fowler CA. An event approach to the study of speech perception from a direct-realist perspective. J Phonetics. 1986;14:3–28. [Google Scholar]
  5. Fowler CA. Sound-producing sources as objects of perception: Rate normalization and nonspeech perception. J Acoust Soc Am. 1990;88:1236–1249. doi: 10.1121/1.399701. [DOI] [PubMed] [Google Scholar]
  6. Fowler CA, Best CT, McRoberts GW. Young infants’ perception of liquid coarticulatory influences on following stop consonants. Percept Psychophys. 1990;48:559–570. doi: 10.3758/bf03211602. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Fowler CA, Brown JM, Mann VA. Contrast effects do not underlie effects of preceding liquids on stop-consonant identification by humans. J Exp Psychol Hum Percept Perform. 2000;26:877–888. doi: 10.1037//0096-1523.26.3.877. [DOI] [PubMed] [Google Scholar]
  8. Fowler CA, Smith MR. Speech perception as ‘vector analysis:’ An approach to the problems of invariance and segmentation. In: Perkell JS, Klatt DH, editors. Invariance and Variability in Speech Processes. Erlbaum; Hillsdale, NJ: 1986. pp. 123–139. [Google Scholar]
  9. Geisler WS, Diehl RL. Bayesian natural selection and the evolution of perceptual systems. Philos Trans R Soc London, Ser B. 2002;357:419–448. doi: 10.1098/rstb.2001.1055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Geisler WS, Diehl RL. A Bayesian approach to the evolution of perceptual and cognitive systems. Cogn Sci. 2003;118:1–24. [Google Scholar]
  11. Holt LL. Auditory constraints on speech perception: An examination of spectral contrast. Diss Abstr Int, B. 1999;61:556. [Google Scholar]
  12. Holt LL. Temporally non-adjacent non-linguistic sounds affect speech categorization. Psychol Sci. 2005;16:305–312. doi: 10.1111/j.0956-7976.2005.01532.x. [DOI] [PubMed] [Google Scholar]
  13. Holt LL, Kluender KR. General auditory processes contribute to perceptual accommodation of coarticulation. Phonetica. 2000;57:170–180. doi: 10.1159/000028470. [DOI] [PubMed] [Google Scholar]
  14. Holt LL, Lotto A. Behavioral examinations of the level of auditory processing of speech context effects. Hear Res. 2002;167:156–169. doi: 10.1016/s0378-5955(02)00383-0. [DOI] [PubMed] [Google Scholar]
  15. Holt LL, Lotto AJ, Kluender KR. Neighboring spectral content influences vowel identification. J Acoust Soc Am. 2000;108:710–722. doi: 10.1121/1.429604. [DOI] [PubMed] [Google Scholar]
  16. Klatt DH. Software for a cascade/parallel formant synthesizer. J Acoust Soc Am. 1980;67:971–990. [Google Scholar]
  17. Kluender KR, Coady JA, Kiefte M. Sensitivity to change in perception of speech. Speech Commun. 2003;41:59–69. doi: 10.1016/S0167-6393(02)00093-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Liberman AM, Cooper FS, Shankweiler DP, Studdert-Kennedy M. Perception of the speech code. Psychol Rev. 1967;74:431–461. doi: 10.1037/h0020279. [DOI] [PubMed] [Google Scholar]
  19. Liberman AM, Mattingly IG. The motor theory of speech perception revised. Cognition. 1985;21:1–36. doi: 10.1016/0010-0277(85)90021-6. [DOI] [PubMed] [Google Scholar]
  20. Lindblom BEF, Studdert-Kennedy M. On the role of formant transitions in vowel recognition. J Acoust Soc Am. 1967;42:830–843. doi: 10.1121/1.1910655. [DOI] [PubMed] [Google Scholar]
  21. Lotto AJ. Perceptual compensation for coarticulation as a general auditory process. In: Agwuele A, Warren W, Park S-H, editors. Proceedings of the 2003 Texas Linguistics Society Conference. Cascadilla Proceedings Project; Sommerville, MA: 2004. pp. 42–53. [Google Scholar]
  22. Lotto AJ, Kluender KR. General contrast effects of speech perception: Effect of preceding liquid on stop consonant identification. Percept Psychophys. 1998;60:602–619. doi: 10.3758/bf03206049. [DOI] [PubMed] [Google Scholar]
  23. Lotto AJ, Kluender KR, Holt LL. Perceptual compensation for coarticulation by Japanese quail (Coturnix coturnix japonica) J Acoust Soc Am. 1997;102:1134–1140. doi: 10.1121/1.419865. [DOI] [PubMed] [Google Scholar]
  24. Lotto AJ, Sullivan S, Holt LL. Central locus for nonspeech context effects on phonetic identification. J Acoust Soc Am. 2003;113:53–56. doi: 10.1121/1.1527959. [DOI] [PubMed] [Google Scholar]
  25. Mann VA. Distinguishing universal and language dependent levels of speech perception: Evidence from Japanese listeners’ perception of English “l” and “r”. Cognition. 1986;24:169–196. doi: 10.1016/s0010-0277(86)80001-4. [DOI] [PubMed] [Google Scholar]
  26. Mann VA. Influence of preceding liquid on stop-consonant perception. Percept Psychophys. 1980;28:407–412. doi: 10.3758/bf03204884. [DOI] [PubMed] [Google Scholar]
  27. Mann VA, Repp BH. Influence of preceding fricative on stop consonant perception. J Acoust Soc Am. 1981;69:548–558. doi: 10.1121/1.385483. [DOI] [PubMed] [Google Scholar]
  28. McClelland JL. Stochastic interactive processes and the effect of context on perception. Cogn Psychol. 1991;23:1–44. doi: 10.1016/0010-0285(91)90002-6. [DOI] [PubMed] [Google Scholar]
  29. McClelland JL, Elman JL. The TRACE model of speech perception. Cogn Psychol. 1986;18:1–86. doi: 10.1016/0010-0285(86)90015-0. [DOI] [PubMed] [Google Scholar]
  30. Movellan JR, McClelland JL. The Morton-Massaro law of information integration: Implications for models of perception. Psychol Rev. 2001;108:113–148. doi: 10.1037/0033-295x.108.1.113. [DOI] [PubMed] [Google Scholar]
  31. Remez RE, Rubin PE, Berns SM, Pardo JS, Lang JM. On the perceptual organization of speech. Psychol Rev. 1994;101:129–156. doi: 10.1037/0033-295X.101.1.129. [DOI] [PubMed] [Google Scholar]
  32. Repp BH. Phonetic trading relations and context effects: New experimental evidence for a speech mode of perception. Psychol Bull. 1982;92:81–110. [PubMed] [Google Scholar]
  33. Smith RL. Adaptation, saturation, and physiological masking in single auditory-nerve fibers. J Acoust Soc Am. 1979;65:166–178. doi: 10.1121/1.382260. [DOI] [PubMed] [Google Scholar]
  34. Stephens JDW, Holt LL. Preceding phonetic context affects perception of non-speech sounds. J Acoust Soc Am. 2003;114:3036–3039. doi: 10.1121/1.1627837. [DOI] [PubMed] [Google Scholar]
  35. Sutter ML, Schreiner CE, McLean M, O’Connor KN, Loftus WC. Organization of inhibitory frequency receptive fields in cat primary auditory cortex. J Neurophysiol. 1999;82:2358–2371. doi: 10.1152/jn.1999.82.5.2358. [DOI] [PubMed] [Google Scholar]
  36. Ulanovsky N, Las L, Farkas D, Nelken I. Multiple time scales of adaptation in auditory cortex neurons. J Neurosci. 2004;24:10440–10453. doi: 10.1523/JNEUROSCI.1905-04.2004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Ulanovsky N, Las L, Nelken I. Processing of low-probability sounds by cortical neurons. Nat Neurosci. 2003;6:391–398. doi: 10.1038/nn1032. [DOI] [PubMed] [Google Scholar]
  38. Wade T, Holt LL. Effects of later-occurring non-linguistic sounds on speech categorization. J Acoust Soc Am. 2005;118:1701–1710. doi: 10.1121/1.1984839. [DOI] [PubMed] [Google Scholar]
  39. Watkins AJ, Makin SJ. Perceptual compensation for speaker differences and for spectral-envelope distortion. J Acoust Soc Am. 1994;96:1263–1282. doi: 10.1121/1.410275. [DOI] [PubMed] [Google Scholar]
  40. Watkins AJ, Makin SJ. Some effects of filtered contexts on the perception of vowels and fricatives. J Acoust Soc Am. 1996a;99:588–594. doi: 10.1121/1.414515. [DOI] [PubMed] [Google Scholar]
  41. Watkins AJ, Makin SJ. Effects of spectral contrast on perceptual compensation for spectral-envelope distortion. J Acoust Soc Am. 1996b;99:3749–3757. doi: 10.1121/1.414981. [DOI] [PubMed] [Google Scholar]