Lexical, Syntactic, and Stress-Pattern Cues for Speech Segmentation

Lisa D Sanders; Helen J Neville

doi:10.1044/jslhr.4306.1301

Author manuscript; available in PMC: 2008 Oct 23.

Published in final edited form as: J Speech Lang Hear Res. 2000 Dec;43(6):1301–1321. doi: 10.1044/jslhr.4306.1301

PMCID: PMC2572147 NIHMSID: NIHMS51846 PMID: 11193954

Abstract

Many sources of segmentation information are available in speech. Previous research has shown that one or another segmentation cue is used by listeners under certain circumstances. However, it has also been shown that none of the cues are absolutely reliable. Therefore, it is likely that people use a combination of segmentation cues when listening to normal speech. This study addresses the issue of how young adults use multiple segmentation cues (lexical, syntactic, and stress-pattern) in combination to break up continuous speech. Evidence that people use more than one cue at a time was found. Furthermore, the results suggest that people can use segmentation cues flexibly such that remaining cues are relied upon more heavily when other information is missing.

Speech comprehension requires breaking continuous streams of sounds into units that can be recognized. Most listeners solve the problem of dividing long streams of phonemes into linguistically meaningful units effortlessly, but it is unclear how this is done. The lack of knowledge about how listeners segment continuous speech is evident in the difficulties of creating automatic speech recognition software that parses speech as humans do (Bernstein & Franco, 1996; Brent, 1999).

Lexical, syntactic, and acoustic information are all available in speech and may be helpful in segmentation. However, all of the cues that have been shown to be possible sources of segmentation information are misleading under some circumstances. For example, word recognition itself provides segmentation information. Successful recognition of one word in a speech stream, which can sometimes be achieved even before the word has ended (Marslen-Wilson & Welsh, 1978), would allow a listener to predict both the rest of the word and the subsequent word boundary. It has been suggested that, along with contextual information, lexical recognition may play a primary role in segmentation (Quene, 1992). However, there are circumstances under which lexical recognition would fail as a segmentation cue, such as when a listener encounters short words that cannot be recognized until after their acoustic offset or words that are embedded in other words (Frauenfelder, 1985; Luce, 1986). These conditions occur frequently enough in normal speech that relying on lexical information alone would lead to inaccurate speech segmentation.

Little research has been conducted on the role of syntactic information in speech segmentation, but knowledge about phrase structure and parts of speech could be useful (Cole, Jakimik, & Cooper, 1980; Tyler & Wessels, 1983). For example, knowing that a speaker is using an adverb would allow a listener to consider an 'ly' ending as part of the current word rather than the beginning of the next. However, syntactic structure is often not obvious until well after the acoustic offsets of words, and even when it is obvious it would not always provide useful segmentation information.

Many types of acoustic information have been shown to play a role in speech segmentation. For example, people can use phonotactic constraints to parse speech between phonemes which never occur in combination within a word, but do occur in combination across word boundaries (Brent, 1997; Brent & Cartwright, 1996). Allophonic variation, variability in the way phonemes are pronounced, can serve as a segmentation cue when it correlates with the position of phonemes in words (Church, 1987; Umeda & Coker, 1974). In languages with clear syllable boundaries such as French, listeners have been shown to segment speech between each syllable (Cutler, Mehler, Norris, & Segui, 1986; Mehler, Dommergues, Frauenfelder, & Segui, 1981). Japanese speakers can use morae, a unit that sometimes but not always corresponds to a syllable, to segment speech (Cutler & Otake, 1994; Otake, Hatano, Cutler, & Mehler, 1993). From a corpus of spontaneous British speech, Cutler and Carter (1987) found that 90% of the words began with strong stress. After factoring in frequency, they determined that 75% of the strong stresses encountered in English speech are word initial. Native English speakers have been shown to take advantage of this typical stress pattern to segment speech (Cutler & Butterfield, 1992; Cutler & Norris, 1988; McQueen, Norris, & Cutler, 1994; Norris, McQueen, & Cutler, 1995). However, each of these cues is either sometimes misleading (stress pattern), only occasionally present (phonotactic and allophonic cues), or would result in the segmentation of speech into many more units than is necessary (syllable and mora).

Another possible segmentation cue is the transitional probabilities of syllables within and between words. Harris (1955) suggested that the probability of hearing any two syllables in succession is higher if the syllables both occur within a word (for example, 'ba-by') than if the syllables occur as the last sound in one word and the first sound in the next (for example, '-by doll'). Others have shown that adults and infants can use this type of statistical information to segment speech if they are given artificial languages made of nonsense words or environmental sounds in which the transitional probabilities of syllables within words are very high and the transitional probabilities between words are very low (Cowan, 1991; Hayes & Clark, 1970; Saffran, Aslin, & Newport, 1996; Saffran, Newport, & Aslin, 1996). However, these probabilities are not known for any non-artificial languages, so it is not clear that the information is present or useful under normal circumstances. Furthermore, word boundaries would not be marked by a dip in transitional probabilities when the last syllable of one word and the first syllable of the next could be contained within a single word (for example, 'bay be' and 'baby'). Therefore, transitional probabilities will also be a misleading or an absent segmentation cue in some examples of speech.
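
The notion of a transitional probability can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the toy syllable stream, the function name `transitional_probabilities`, and the estimate P(next | current) = count(current, next) / count(current). It is not the procedure used in any of the studies cited above.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate P(next syllable | current syllable) from one syllable stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count each syllable only where it has a successor, so that the
    # probabilities leaving any syllable sum to 1.
    first_counts = Counter(syllables[:-1])
    return {
        (a, b): n / first_counts[a]
        for (a, b), n in pair_counts.items()
    }

# Hypothetical toy stream built from three made-up "words":
# ba-by, dol-ly, kit-ten.
stream = "ba by dol ly ba by kit ten dol ly kit ten ba by dol ly".split()
tp = transitional_probabilities(stream)

print(tp[("ba", "by")])   # within-word transition
print(tp[("by", "dol")])  # across-word transition
```

In this toy stream the within-word transition 'ba' to 'by' always occurs, whereas 'by' is followed by more than one syllable, so the across-word transition has a lower estimated probability. Natural speech would not be expected to separate this cleanly.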

Clearly, there are many sources of information people can use to segment speech, although none of them seem to be absolutely reliable. Therefore, it is likely that people use multiple cues rather than limiting themselves to a single, imperfect one. However, most segmentation studies have focused on the use of only one cue. Therefore, little is known about how the use of different sources of segmentation information may interact to produce the very accurate speech segmentation people demonstrate every day. Can people flexibly use multiple sources of information such that available cues become more important when others are absent? Is there a single cue which is relied upon most heavily such that other cues are only used when it is misleading or absent? These questions can only be addressed by measuring the effects of different segmentation cues within the same experiment.

Those studies which have considered multiple sources of segmentation information (including: Norris, McQueen, & Cutler, 1995; Quene, 1992; Vroomen, van Zon, & de Gelder, 1996) have used degraded stimuli such as one or two word utterances that encourage the use of some cues (usually acoustic) and discourage the use of other cues (usually contextual, lexical, and syntactic). To get an accurate measure of the relative use of different cues in normal speech processing, it is necessary to use examples of continuous speech that contain all of the usual semantic, syntactic, and acoustic information encountered in speech. The purpose of this study was to assess the use of multiple segmentation cues using full sentences presented as continuous speech. Although it was not possible to measure the effects of all the cues described here, two types of linguistic cues, lexical and syntactic, and one acoustic cue, stress pattern, were manipulated.

Another issue that must be addressed in studying speech segmentation is the type of task used. It is desirable to use a task that relates to where speech streams are segmented in a transparent way, for example, by asking people where they perceive word-onsets or the lack of word-onsets. The first experiment used this approach by asking subjects to detect target phonemes in sentences and to report whether the targets were word-initial or word-medial. This type of task has the additional advantage that all potential segmentation cues may be made available in the stimuli by using natural speech. However, it is also a difficult task in which decisions are made well after a target is presented. This makes it impossible to determine whether the information used to decide if sounds were word-medial or word-initial influenced on-line segmentation, played a role in reparsing any missegmented sounds, affected only the meta-linguistic decision, or some combination of these.

Therefore, it is also desirable to use segmentation tasks that can be performed quickly enough to be certain they are directly tapping into on-line segmentation. Tasks such as phoneme monitoring and syllable monitoring result in faster reaction times than phoneme localization, suggesting that these decisions can be made at an earlier point in the segmentation process than localization decisions. Converging evidence from this type of experiment would add support to the idea that information used to localize phonemes was involved in segmentation as opposed to meta-linguistic decisions. However, this type of task requires the additional assumption, for which there is little evidence, that units that correspond to a segment or fall at the beginning of a segment will be detected more quickly. Furthermore, these tasks require subjects to pay greater attention to the surface features of speech than may be typical. Experiment II of this study employs a phoneme monitoring task using the same stimuli as Experiment I so the results of the two can be compared.

Other tasks such as gating and shadowing can be used to ensure that subjects have no additional information after the point of interest and require no meta-linguistic knowledge. However, they cannot be conducted without disrupting the presentation of normal continuous speech. In an attempt to understand how continuous speech is segmented, it will be necessary to use converging evidence from a variety of techniques. Since the focus of this study was to explore how multiple segmentation cues are used in normal speech processing, it was important to be certain that all possible segmentation cues were available in the stimuli. Therefore, a phoneme localization task and a phoneme detection task were used.

Experiment I - Phoneme Localization

Method

Participants

Sixteen monolingual English speakers participated in Experiment I (M age = 20.9 years, 11 females). All were right-handed university students who were paid $7 per hour.

Stimuli

A total of 900 sentences were created to vary the amount of lexical, syntactic, and useful stress-pattern information available to the listener. Lexical and syntactic information was varied by replacing all of the content words, or all of the words, in a sentence with pronounceable nonwords. Stress pattern information was varied by using words that contained targets in different positions and had strong stress on different syllables.

Thirteen single phonemes and eight phoneme combinations were chosen as targets. All of the targets were consonant sounds that occur in both word-initial and word-medial positions in English.

The following five types of words were selected for use in the experiment: (a) began with a target and had strong stress on the first syllable, (b) began with a target and had weak stress on the first syllable, (c) contained a word-medial target and had strong stress on the syllable in which the target occurred, (d) contained a word-medial target and had weak stress on the syllable in which the target occurred, and (e) did not contain a target. Examples of each type of word can be seen in Table 1 and a full list of all target-containing words can be found in Appendix A.

Table 1.

Examples of Words Which Contain or Do Not Contain Target Phonemes

Condition	Target sound	Word
Target Present
Strong stress, Initial position	/b/	bottles
Strong stress, Medial position	/b/	tobacco
Weak stress, Initial position	/b/	balloon
Weak stress, Medial position	/b/	timber
Target Absent	/b/	afghan

All of the words were two, three, or four syllables long. Word-medial targets were chosen to be the first or first two phonemes of the second syllable of a word. Syllabification in English is not always clear-cut. For example, the /v/ in “gavels” could be considered the last sound in the first syllable (gav-els) or the first sound in the second syllable (ga-vels). Both the sonority sequencing principle and the principle of maximizing onsets would predict that the targets chosen for this experiment would in fact be considered the beginning of the second syllables (e.g. ga-vels) with the exception of /s/ and /st/. An item analysis was used to determine whether or not these (or any other small group of stimuli) were producing the results reported.

Sixty words from each of the five categories (for a total of 300 words) were chosen to be matched on the targets they contained, part of speech, written and spoken word frequency, and word length. Every target phoneme and phoneme combination was equally represented in the five groups. Words which have an infrequent English stress pattern (weak stress on the first syllable, strong stress on the second) tended to be of low frequency (M = 21.38, range = 0 to 267; Kucera & Francis, 1967), so low frequency words with the typical English stress pattern (strong stress on the first syllable, weak on the second) were used as well (M = 20.73, range = 0 to 290). This selection resulted in no significant differences in written or spoken frequencies (M = 2.23, range = 0 to 55; Brown, 1984) across the word types. Furthermore, there were no reliable differences in the number of letters in words which contained target phonemes or their non-target matches across conditions (M = 7.72, range = 4 to 11).

Sentences were then constructed around these selected words such that a target never occurred anywhere in a sentence other than in the selected word. It was important that the words that might contain targets could not be predicted before they were actually heard. Therefore, the cloze probability of each of the selected words in its sentence was measured by giving 40 naive subjects all of the words in the sentences up to the critical one, and asking them to write down the word they thought would come next. If more than 25% of the subjects chose the same word to continue a sentence (even if the word they chose was not the one to be used in the experiment), the sentence was excluded. This procedure resulted in a low cloze probability for the selected words in the sentences that were used (M = .032, range = .000 to .215). Additionally, target phonemes never occurred in the first three or the last three words of a sentence (M = 9th word), and there were no significant differences in target position in the sentences across word type. (See Appendix B for means and ranges describing each of these aspects of the words and sentences.)
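
The exclusion rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure as implemented: the function names and the response lists are hypothetical, and lowercasing and trimming the written responses is an assumption about how answers would be compared.

```python
from collections import Counter

def max_completion_share(responses):
    """Proportion of respondents who produced the single most common word."""
    counts = Counter(word.strip().lower() for word in responses)
    return max(counts.values()) / len(responses)

def cloze_probability(responses, target):
    """Proportion of respondents who produced the intended target word."""
    target = target.lower()
    hits = sum(word.strip().lower() == target for word in responses)
    return hits / len(responses)

def keep_sentence(responses, threshold=0.25):
    """Keep a sentence unless more than `threshold` of respondents agree on one word."""
    return max_completion_share(responses) <= threshold

# Made-up responses from 40 hypothetical respondents for two sentence frames.
predictable = ["cans"] * 12 + ["paper"] * 8 + ["bottles"] * 2 + [f"other{i}" for i in range(18)]
unpredictable = ["cans"] * 10 + ["paper"] * 8 + ["bottles"] * 1 + [f"other{i}" for i in range(21)]

print(keep_sentence(predictable))                   # 12 of 40 agree, above 25%
print(keep_sentence(unpredictable))                 # 10 of 40 agree, exactly 25%
print(cloze_probability(unpredictable, "bottles"))  # 1 of 40
```

Note that the rule looks at the most common continuation, whatever it is, so a sentence can be excluded even when nobody guesses the word actually used in the experiment.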

The 300 normal English sentences (semantic sentences) that were created contained extensive semantic, syntactic, and prosodic information. To make sentences that maintained syntactic and prosodic information, but had less meaning (syntactic sentences), all of the content words in these sentences were replaced with pronounceable nonwords. Morphemes such as 'ed', 'ing', and 'ly' were kept when replacing words which contained these units with nonwords. To ensure that the resulting sentences were pronounceable, all nonwords were created by replacing every consonant with one from the same class (stop, fricative, or nasals and liquids) and every vowel with another vowel randomly. The only exceptions to this were: 1) when the resulting group of sounds created another English word, the procedure was repeated until a nonword was made; 2) the target phoneme or phoneme combination for a sentence was excluded from the group of sounds used in the replacement; and 3) the target phoneme or combination was not changed.
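
The replacement procedure can be illustrated with a rough sketch. It rests on several assumptions that are not in the text: a simplified phoneme inventory written in ASCII, a single-phoneme target, a caller-supplied `is_english_word` check standing in for whatever lexicon was consulted, and a rule that every replaced sound must differ from the original. The function name `make_nonword` is hypothetical.

```python
import random

# Simplified, hypothetical phoneme inventory (ASCII stand-ins, not the
# authors' actual symbol set).
CLASSES = {
    "stop": ["p", "b", "t", "d", "k", "g"],
    "fricative": ["f", "v", "s", "z", "sh", "th", "h"],
    "nasal_liquid": ["m", "n", "l", "r"],
    "vowel": ["a", "e", "i", "o", "u"],
}
CLASS_OF = {ph: name for name, members in CLASSES.items() for ph in members}

def make_nonword(phonemes, target, is_english_word, rng, max_tries=1000):
    """Swap each phoneme for another of its class, leaving the target untouched."""
    for _ in range(max_tries):
        result = []
        for ph in phonemes:
            if ph == target:
                result.append(ph)  # exception 3: the target is never changed
                continue
            # Exception 2: the target is never introduced by a replacement.
            pool = [p for p in CLASSES[CLASS_OF[ph]] if p not in (ph, target)]
            result.append(rng.choice(pool))
        if not is_english_word(result):  # exception 1: retry on real words
            return result
    raise RuntimeError("no nonword found")

rng = random.Random(0)
word = ["b", "o", "t", "l", "z"]      # rough stand-in for "bottles"
lexicon = {("b", "o", "t", "l", "z")}  # hypothetical one-word lexicon
nonword = make_nonword(word, "b", lambda w: tuple(w) in lexicon, rng)
print(nonword)
```

Because each sound is replaced within its class, the result keeps the consonant-vowel skeleton of the original word, which is what keeps the nonword pronounceable.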

To create a group of sentences that contained normal English prosody but which had less grammatical structure than the syntactic sentences (acoustic sentences), the remaining words and morphemes were replaced in the same manner described above. A set of 5 sentences in each of the three sentence forms is shown in Table 2.

Table 2.

Examples of Semantic, Syntactic, and Acoustic Sentences for All Conditions

Condition	Type	Sentence
Target Present
SI	Semantic	In order to recycle bottles you have to separate them.
	Syntactic	In order to lefatal bokkers you have to thagamate them.
	Acoustic	Ah ilgen di lefatal bokkerth ha maz di thagamate fon.
SM	Semantic	If the only thing in it were tobacco it wouldn't cause so much harm.
	Syntactic	If the ilmy shord in it were dobatty it wouldn't gaff so much hilm.
	Acoustic	Os fa ilmy shord el ok hon dobatty ag hapsel gaff sha nes hilm.
WI	Semantic	The child stopped crying when a balloon was given to her.
	Syntactic	The ferp trepped plawing when a barreal was kaffen to her.
	Acoustic	Sa ferp trepp plawel ron i barreal hof kaffem gi wem.
WM	Semantic	I saved money since lowgrade timber worked for this project.
	Syntactic	I cheft rono since miltrok delber meld for this plassig.
	Acoustic	O cheft rono zalf miltrok delber meld sith foch plassig.
Target Absent	Semantic	Try looking under the afghan for the toy you lost.
	Syntactic	Qui medding under the ithdon for the kay you moft.
	Acoustic	Qui medden amkel fa ithdon sal cha kay wa moft.

Note. All example sentences use /b/ as the target phoneme which is indicated by italics in the sentences. SI = Strong stress, Initial position; SM = Strong stress, Medial position; WI = Weak stress, Initial position; WM = Weak stress, Medial position.

Each sentence was digitized (22 kHz sampling rate, 16 bit) using Goldwave software on a Pentium PC by a female native English speaker at a normal speaking rate (M = 4.26 words per second). The speaker was aware of the purpose of this study and knew to record the syntactic and acoustic sentences with the same prosody as the semantic sentences they were created from. Silence at the beginning and end of all sentences was removed from the sound files, and the highest amplitude of each sound was normalized to 1. The auditory versions of the semantic, syntactic, and acoustic form of each sentence were close in total length (M difference = 49.42ms, range = 0ms to 288ms). This confirmed that the three forms of each sentence were successfully recorded at similar speech rates. Furthermore, there were no overall differences in sentence length (M = 3438ms) or position of the target (M = 1616ms) among conditions. (See Appendix C for means and ranges of these measurements.)
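
As an illustration of the two clean-up steps, the sketch below trims leading and trailing silence and then rescales a signal so that its peak amplitude is 1. It is not the Goldwave processing itself: the function names, the sample values, and in particular the silence threshold are assumptions, since the text does not say how silence was identified.

```python
def trim_silence(samples, threshold):
    """Drop leading and trailing samples whose absolute value is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return list(samples[loud[0]:loud[-1] + 1])

def normalize_peak(samples):
    """Scale samples so that the largest absolute amplitude equals 1."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # all-zero input: nothing to scale
    return [s / peak for s in samples]

# Made-up samples on a -1 to 1 scale; 0.01 is a hypothetical threshold.
raw = [0.0, 0.001, 0.2, -0.5, 0.25, 0.0]
trimmed = trim_silence(raw, threshold=0.01)
normalized = normalize_peak(trimmed)
print(trimmed)
print(normalized)
```

Peak normalization puts every recording on the same zero-to-one scale, which is what allows maximum amplitudes to be compared across sentences.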

Two measures of stress, length of the target phoneme with its following vowel and maximum amplitude over that range, were used to ensure that stress was consistent across the different types of sentences. Amplitude was measured on a relative scale from zero to one. The targets in the semantic sentences had an average maximum amplitude of .56 (range = .12 to 1) and an average length of 137ms (range = 22 to 263ms). The syntactic (maximum amplitude = .57 (range = .11 to 1), length = 135ms (range = 38 to 267ms)) and acoustic (maximum amplitude = .55 (range = .10 to 1), length = 133ms (range = 46 to 252ms)) versions of the sentences did not differ in maximum amplitude or length of the targets. There were some differences in stress across the different target conditions. Target phonemes in the middle of words tended to be slightly louder (F(1,708) = 22.75, p < .01) and slightly longer (F(1,708) = 53.46, p < .01) than those at the beginnings of words. However, these differences were fairly small in magnitude (difference in mean maximum amplitude = .06, difference in mean length = 23ms). As expected, targets that received strong stress were louder (F(1,708) = 722.78, p < .01) and longer (F(1,708) = 450.86, p < .01) than targets that were weakly stressed (difference in mean maximum amplitude = .40, difference in mean length = 66ms).

Furthermore, pitch contours were examined across entire sentences and fundamental frequency changes over the syllables in which targets occurred were measured. For each set of three versions (semantic, syntactic, and acoustic) of the same sentence the pitch contours were judged to be equivalent. An example of the spectrograph and pitch analysis of a set of sentences can be seen in Figure 1. The three versions of each sentence had similar fundamental frequencies measured at the target onset (M difference = 27Hz, range = 1 to 91Hz) and at the end of the syllables in which a target occurred (M difference = 34Hz, range = 1 to 103Hz). In all of the sentences the frequency change between the onset of the target and offset of the syllable in which the target occurred was in the same direction for the three versions. There was no effect of sentence type or interaction of sentence type with stress or position for either of the frequency measurements or the difference between the two. Neither stress nor position had a significant effect on the first frequency measurement. However, stress affected the change in frequency (F(1,708) = 814.31, p < .01) such that fundamental frequency increased over strongly stressed syllables (M change = +60Hz) but tended to stay the same over weakly stressed syllables (M change = +4Hz). Furthermore, stress and position interacted (F(1,708) = 114.1, p < .01) such that frequency tended to increase for the weak-initial syllables (M change = +29Hz) and decrease for the weak-medial syllables (M change = −21Hz). Means and ranges of the changes in frequency, amplitude, and length of the syllables in which targets occurred can be found in Appendix D.

Figure 1.

Spectrograph and Pitch Analysis of the Semantic, Syntactic, and Acoustic Versions of One of the Test Sentences: Visual inspections across entire sentences and statistical analyses of pitch and pitch change across the syllables which contained targets revealed no differences in pitch contours across sentence type.

An additional test of the acoustic similarity of the semantic, syntactic, and acoustic sentences was performed by giving a group of people who did not know English a task which involved all three types of sentences. Native Spanish speakers (N = 9) who had little or no exposure to English were asked to detect target phonemes or phoneme combinations in the sentences. There was no effect of sentence type on detection accuracy (F(2,16) = .58, p = .571) or on reaction times (F(2,16) =.01, p = .993). These results suggest that any differences in performance with the different types of sentences found for other groups must be dependent on experience with English, not acoustic differences in the sentences which could be detected by anyone with auditory language experience.

Examples of the target sounds pronounced in isolation were used, in part, to indicate which sound subjects were to listen for in each sentence. Each target followed by /ə/ (e.g., /b/ became /bə/, /fr/ became /frə/) was pronounced by the same speaker who recorded the sentences, and was digitized in the same way as the sentences.

Procedure

Participants were brought in for two 1.5 hour sessions that were at least three days, but not more than two weeks, apart. During each session they completed 60 practice trials and 450 test trials. They sat in a sound-attenuated room with a computer monitor 55 inches away, and headphones on. All sounds were presented binaurally at approximately 60 dB above normal hearing threshold.

During each trial, participants first heard the sound of the target phoneme or phoneme combination for that sentence and simultaneously saw a letter or letters representing that target appear on the screen. Subjects were instructed to listen for the target sound (not the presence of the letter or the /ə/ sound) in the sentence that followed 1100ms after the end of the target sound. The letter that represented the target was left on the screen for the entire trial.

For this experiment, participants had a choice of three responses. They were asked to press one button if they heard a target phoneme and believed it was at the beginning of a word or nonword, to press a different button if they heard the target in the middle of a word or nonword, and to not respond at all if they did not hear a target. They were given examples of targets that occur at the beginning of words, such as the /d/ in 'devil', 'destroy', 'dossly', and 'daclin', and targets that occur in the middle of words, such as the /d/ in 'wisdom', 'tradition', 'blomder', and 'padell'. Participants were reminded that the different buttons were to be used to indicate the two different positions a phoneme might have in a word, and had nothing to do with the position of the word in the sentence. Furthermore, they were asked to press a button as soon as they heard the target sound in the sentence and were instructed not to wait for the sentence to end to make their response. Both accuracy and speed of the responses were emphasized. The next trial began 1500ms after the end of a sentence regardless of whether or when the subject responded.

After 60 practice trials, participants were given feedback about their performance. They were given estimates of the percentage of trials on which they correctly detected a target, the percentage of trials on which they correctly determined where in a word a target occurred, and the average amount of time it took them to respond after a target occurred. During the test trials, participants were offered a break after every 20 sentences. Both the experimenter and the participant were required to press a button in order to continue after these breaks.

The order of presentation for the three forms (semantic, syntactic, and acoustic) of each sentence was balanced. Different forms of the same sentence were not presented with fewer than eighty trials in between. Within these constraints, sentences were randomized and presented in the same order for all subjects. Half of the participants were asked to use the left-most button to indicate a target was word-initial; half were asked to use the right-most button to indicate a target was word-initial.

Sounds were presented on a Pentium PC using a Data Translation (DT 2821) D-to-A converter. Sounds were low-pass filtered at 7500Hz to prevent aliasing. Presentation and responses were controlled and recorded by C++ programs. Reaction times could be measured accurately within +/− 4ms.

Results

Localization accuracy was measured by dividing the number of trials on which subjects successfully detected a target phoneme and determined whether it was word-initial or word-medial by the number of trials on which subjects correctly detected the target. Reaction times to perform this task were measured from the point at which the target phoneme or phonemes were presented.

A 3 (sentence type) by 2 (stress) by 2 (target position) repeated measures ANOVA was performed. Additionally, planned comparisons between sentence types (semantic and syntactic, syntactic and acoustic) and stress patterns (normal stress pattern and infrequent stress pattern) were performed. Item analyses (percentage of subjects who responded correctly for each item and mean reaction time for each item) were used to determine if the results were generalizable across words or if they were driven by a small subset of the stimuli. Concerns about violations of linearity with percentage data were addressed by using a natural log transformation of the proportions (score = ln(proportion correct / (1 − proportion correct))). Finally, the data were split into groups according to the type of targets subjects were listening for: consonant clusters, voiceless stops, voiced stops, fricatives, and nasals. Means and standard errors of percent correct and reaction times can be found in Appendix E. ANOVA tables and t-tests for analysis by subject and by item can be seen in Appendix F.
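
The two scores used in these analyses can be written out as a short sketch, reading the transformation as ln(p / (1 − p)) where p is the proportion correct. The function names and the counts are hypothetical. The text does not say how proportions of exactly 0 or 1 were handled, for which the transformation is undefined, so the sketch simply raises an error in that case.

```python
import math

def localization_accuracy(n_correct_position, n_detected):
    """Correctly localized targets as a proportion of detected targets."""
    if n_detected == 0:
        raise ValueError("no detected targets")
    return n_correct_position / n_detected

def log_odds(p):
    """Natural-log odds, ln(p / (1 - p)); undefined at p = 0 and p = 1."""
    if not 0 < p < 1:
        raise ValueError("log odds are only defined for 0 < p < 1")
    return math.log(p / (1 - p))

# Made-up counts for one hypothetical participant in one condition.
accuracy = localization_accuracy(n_correct_position=48, n_detected=60)
print(accuracy)            # 48 of 60 detected targets localized correctly
print(log_odds(accuracy))  # close to ln(4)
```

Dividing by detected targets rather than by all trials means that a missed target does not count against localization accuracy.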

Sentence Type

Sentence type affected localization accuracy (F(2,30) = 385.3, p < .01) such that performance was better for semantic sentences (M = 98%) than syntactic sentences (M = 80%; t(15) = 15.9, p < .01) and for syntactic sentences than acoustic sentences (M = 67%; t(15) = 13.48, p < .01), as can be seen in Figure 2. Sentence type also affected reaction times (F(2,30) = 16.54, p < .01) such that responses were faster to semantic sentences (M = 1363ms) than to syntactic sentences (M = 1512ms; t(15) = 4.88, p < .01). For each of the sentence types, performance across position and stress was better than chance (semantic: t(15) = 110.7, p < .01; syntactic: t(15) = 21.8, p < .01; acoustic: t(15) = 12.49, p < .01). Each of these effects was confirmed by both the item analysis and the analysis on the natural log transformed data.

Stress Pattern

There was a stress by position interaction (F(1,15) = 42.49, p < .01) on phoneme localization. When the data were grouped to compare the normal English stress pattern (strong-initial and weak-medial) to an infrequent English stress pattern (weak-initial and strong-medial), it was found that people were more accurate with the normal pattern (M normal = 87%, M infrequent = 76%, F(1,15) = 60.99, p < .01). Furthermore, stress pattern interacted with sentence type (F(2,30) = 17.1, p < .01) such that performance was better with the normal stress pattern than the infrequent stress pattern for all sentence types (semantic: M normal = 99%, M infrequent = 96%, F(1,15) = 16.73, p < .01; syntactic: M normal = 85%, M infrequent = 75%, F(1,15) = 59.07, p < .01; acoustic: M normal = 77%, M infrequent = 58%, F(1,15) = 91.39, p < .01), but this effect was larger when less information was available in the sentence, as can be seen in Figure 3.

Additionally, the effect of stress pattern could be seen by comparing the four combinations of stress (strong and weak) and position (initial and medial) directly. For the semantic sentences, subjects were more likely to correctly identify the location of a strongly stressed phoneme when it was in the word-initial position (M = 99%) than when it was in the word-medial position (M = 96%; t(15) = 3.52, p < .01). Furthermore, in this same sentence type, people were better able to localize a weakly stressed phoneme when it was in the word-medial position (M = 99%) than when it was in the word-initial position (M = 97%; t(15) = 2.40, p < .01). For the syntactic sentences, accuracy was higher with the weakly stressed phoneme in the word-medial position (M = 92%) than in the word-initial position (M = 77%; t(15) = 10.17, p < .01). The means were in the predicted direction for the strongly stressed targets (M initial = 78%, M medial = 73%), but this difference was not significant. The same was true for the acoustic sentences in that the difference in word-initial and word-medial localization for the strong stress was not significant (M initial = 60%, M medial = 53%), but the difference in position for the weak stress was (M initial = 62%, M medial = 93%; t(15) = 10.61, p < .01). By examining these means, it is also clear that the sentence by stress by position interaction (F(2,30) = 24.46, p < .01) is driven by the greater difference between weak-initial and weak-medial phonemes in the acoustic sentences than in the syntactic sentences and in the syntactic sentences than in the semantic sentences.
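
The relation between the four cell means and the normal versus infrequent stress-pattern means can be checked with simple arithmetic. The sketch below uses the rounded cell means reported in this section and assumes an unweighted average of two cells; the published pattern means were presumably computed from unrounded data, so small discrepancies of about one percentage point are expected. The condition codes follow the note to Table 2, and the function name is hypothetical.

```python
# Cell means (percent correct) as reported in the text of this section.
# SI = strong-initial, SM = strong-medial, WI = weak-initial, WM = weak-medial.
cell_means = {
    "semantic":  {"SI": 99, "SM": 96, "WI": 97, "WM": 99},
    "syntactic": {"SI": 78, "SM": 73, "WI": 77, "WM": 92},
    "acoustic":  {"SI": 60, "SM": 53, "WI": 62, "WM": 93},
}

def pattern_means(cells):
    """Unweighted means for the normal (SI, WM) and infrequent (WI, SM) patterns."""
    normal = (cells["SI"] + cells["WM"]) / 2
    infrequent = (cells["WI"] + cells["SM"]) / 2
    return normal, infrequent

for sentence_type, cells in cell_means.items():
    normal, infrequent = pattern_means(cells)
    print(sentence_type, normal, infrequent, normal - infrequent)
```

The normal-minus-infrequent difference grows from the semantic to the syntactic to the acoustic sentences, which is the pattern behind the stress pattern by sentence type interaction.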

Although there was a stress by position interaction on reaction times (F(1,15) = 12.83, p < .01), no other stress-pattern comparisons were significant. The ln transformation and item analysis of localization accuracy confirmed all of the significant results found in the ANOVA and t-tests described above. When targets were grouped by phoneme class, the following pattern of means held across all groups: semantic > syntactic > acoustic and normal stress pattern > infrequent stress pattern.

Discussion

Subjects were extremely accurate (98%) at determining the position of the targets they detected in the normal English sentences. Their performance with the syntactic sentences was high (80%), but the lack of full semantic and lexical information was associated with decreased accuracy. This suggests that word recognition played an important role in subjects’ determining whether targets were word-initial or word-medial. Furthermore, people were more accurate at the phoneme localization task with the syntactic sentences than with the acoustic sentences (67%). This suggests that the remaining lexical information in the syntactic sentences or the presence of more syntactic information in the form of morphemes and function words aided subjects in the phoneme localization task.

Localization accuracy was measured as the number of trials on which target position was correctly determined divided by the number of trials on which a target was successfully detected. Differences in localization accuracy on the three sentence types therefore reflected differences in the ability of subjects to determine where in the speech stream word-onsets occurred rather than differences in detectability of the targets in the different contexts. However, even when only detected targets are included, it remains possible that differences in attention to, or memory for, the different types of sentences affected localization performance. For example, it is likely that people were better able to remember the normal English sentences than the non-word sentences. In a meta-linguistic task like phoneme localization, better memory could allow for more accurate responses. From this experiment it is not clear whether the varying amounts of lexical and syntactic information affected on-line segmentation, or later reparsing and meta-linguistic decisions, which were likely to be influenced by many different processes including memory and attention. However, it is clear that the amount of lexical and syntactic information affected the assignment of targets to word-initial or word-medial positions, whether this effect was direct through segmentation or more indirect through memory or attention.
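The accuracy measure defined above can be written as a small sketch. The trial record format used here (the keys "detected", "position", and "response") is a hypothetical representation chosen for illustration, not the format used in the original study.

```python
def localization_accuracy(trials):
    """Proportion of detected targets whose position was judged correctly.

    Each trial is a dict with hypothetical keys:
      "detected": True if the subject responded to the target,
      "position": actual target position ("initial" or "medial"),
      "response": position reported by the subject (None if undetected).
    Undetected targets are excluded from the denominator, so the
    measure is not affected by how detectable the target was.
    Returns None when no targets were detected.
    """
    detected = [t for t in trials if t["detected"]]
    if not detected:
        return None
    correct = sum(1 for t in detected if t["response"] == t["position"])
    return correct / len(detected)


if __name__ == "__main__":
    example_trials = [
        {"detected": True, "position": "initial", "response": "initial"},
        {"detected": True, "position": "medial", "response": "initial"},
        {"detected": True, "position": "medial", "response": "medial"},
        {"detected": False, "position": "initial", "response": None},
    ]
    # Three targets detected, two localized correctly.
    print(localization_accuracy(example_trials))
```

Conditioning on detection in this way is what separates localization from detectability, although, as noted above, it does not rule out influences of memory or attention.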

It is also important to recognize that even though the syntactic sentences had more grammatical information than the acoustic sentences they also had more lexical items. It is possible that the intact function words in the syntactic sentences were recognized as lexical items and used as lexical information in the phoneme localization task. To determine if the larger amount of lexical information in the syntactic sentences was responsible for better performance, accuracy on trials that had a word immediately preceding the target was compared to accuracy on trials that had a nonword immediately preceding the target. The fact that there were no significant differences in performance for these two types of syntactic sentences suggests it was differences in grammatical structure, and not differences in the number of lexical items, that drove the better performance for the syntactic sentences than the acoustic sentences.

The fact that performance was better for words with normal English stress pattern (strong at the beginning, weak in the middle) than for words with infrequent English stress pattern (weak at the beginning, strong in the middle) for all sentence types suggests that stress pattern plays an important role in determining where word-onsets occur. Although this effect was quite small for the normal English sentences (3% difference in accuracy), it suggests stress-pattern has an effect even in normal continuous speech with all of the semantic, lexical, and acoustic information intact.

Furthermore, stress pattern seems to have a larger effect when other sources of information are absent, as is suggested by the stress-pattern by sentence type interaction. Although it is possible that the overall interaction was influenced by a ceiling effect with the semantic sentences, the interaction is still strong when only syntactic and acoustic sentences are included in the analysis. This suggests that people can flexibly use the segmentation cues available to them by relying more heavily on what is present when the number of segmentation cues is decreased. Accuracy was still high (86%) when the stress pattern gave misleading segmentation information, but only if lexical and syntactic information was present.

Strong stresses were more easily localized to word-initial positions than weak stresses were across sentence types. However, this effect was small in comparison to the difference between weak and strong stress for the word-medial targets. People were very accurate at localizing weak stresses in word-medial positions (99%, 92%, and 93% for semantic, syntactic, and acoustic sentences respectively) and less accurate at localizing strong stresses to word-medial positions (96%, 73%, and 53% for semantic, syntactic, and acoustic sentences respectively). Although this difference was not expected, one possible explanation is that people place more emphasis on one part of a stress-pattern segmentation strategy than the other. It is possible that people more strictly follow the pattern that weak stresses fall in the middle of words than the pattern that strong stresses fall at the beginnings of words. One motivation for weighting the different parts of a stress-pattern strategy unequally may be that there is a higher cost associated with segmenting speech in places where it should not be segmented than there is in failing to segment speech where it should be. Perhaps when speech is segmented in the middle of a word based on stress pattern, it is difficult to apply other cues and realize the units need to be considered together. However, when a listener fails to segment speech where it should be segmented, it may still be possible to consider other cues (such as lexical recognition and syntax) to make that break. Although this is a possible explanation for the pattern of data found for these experiments, more direct tests of the idea would be necessary to support or refute it.

It is also important to recognize that people were able to determine if targets were word-initial or word-medial even when they had no lexical or syntactic information and when stress-pattern cues were as likely to be misleading as helpful. Above-chance performance with the acoustic sentences (across the different location and stress levels) indicates that people were able to use some sources of segmentation information in addition to lexical, syntactic, and stress-pattern cues. Perhaps phonotactic and allophonic information as well as transitional probabilities were used in addition to the cues that were directly manipulated in this experiment.

Another important aspect of the localization task was the long response times. On average, it took people almost a second and a half to press one of the buttons after a target was presented. The long reaction times are not surprising considering the difficulty of detecting a target phoneme in over three seconds of continuous speech and determining whether that target was word-initial or word-medial. However, they also serve as an indication that phoneme localization was a difficult meta-linguistic task that is likely to reflect many types of processing in addition to on-line segmentation. Therefore, an easier phoneme detection task was given using the same stimuli.