Toddlers use speech disfluencies to predict speakers’ referential intentions

Celeste Kidd; Katherine S White; Richard N Aslin

doi:10.1111/j.1467-7687.2011.01049.x

Author manuscript; available in PMC: 2012 Jul 1.

Published in final edited form as: Dev Sci. 2011 Apr 14;14(4):925–934. doi: 10.1111/j.1467-7687.2011.01049.x

PMCID: PMC3134150 NIHMSID: NIHMS280129 PMID: 21676111

Abstract

The ability to infer the referential intentions of speakers is a crucial part of learning a language. Previous research has uncovered various contextual and social cues that children may use to do this. Here we provide the first evidence that children also use speech disfluencies to infer speaker intention. Disfluencies (e.g. filled pauses ‘uh’ and ‘um’) occur in predictable locations, such as before infrequent or discourse-new words. We conducted an eye-tracking study to investigate whether young children can make use of this distributional information in order to predict a speaker’s intended referent. Our results reveal that young children (ages 2;4 to 2;8) reliably attend to speech disfluencies early in lexical development and are able to use the disfluencies in online comprehension to infer speaker intention in advance of object labeling. Our results from two groups of younger children (ages 1;8 to 2;2 and 1;4 to 1;8) suggest that this ability emerges around age 2.

Introduction

Inferring a speaker’s intention is crucial to successful language learning. For example, mapping a spoken word to the appropriate object in the world requires understanding to which object the speaker intends to refer (e.g. Preissler & Carey, 2005). Though some labeling contexts are unambiguous (e.g. holding a cookie and saying ‘cookie’), most contexts involve multi-word utterances and multiple objects in the child’s visual field, making the mapping problem a difficult one.

Previous work has explored various extra-linguistic cues learners can use for determining speaker intention. Social cues include joint visual attention, pointing, and eye gaze (e.g. Baldwin, 1991; Butterworth & Cochran, 1980; Southgate, van Maanen & Csibra, 2007; Yu, Ballard & Aslin, 2005). There are also contextual cues, such as object presence, object-word co-occurrence statistics (e.g. Smith & Yu, 2008), and discourse context (Frank, Goodman, Tenenbaum & Fernald, 2009). In addition to these externally available cues, young children appear to use certain heuristics that facilitate rapid lexical development. One heuristic of particular relevance is the principle of contrast (e.g. Bolinger, 1977; Clark, 1987, 1990; Markman, 1990; Markman, Wasow & Hansen, 2003). Experimental evidence suggests that young word learners make use of the fact that words tend to contrast in meaning, and thus exhibit a bias for a novel referent when encountering a novel word. Use of the principle of contrast for inferring a novel word’s referent has been observed in learners as young as 15 months of age (Halberda, 2003).

Here we investigate a previously unexplored cue for inferring speaker intention: speech disfluencies. Disfluencies (e.g. filled pauses ‘uh’ and ‘um’) occur in highly predictable locations – for example, before unfamiliar or infrequent words, and before words that have not been previously mentioned in the discourse. Since disfluencies occur before an object is labeled, they could enable children to anticipate upcoming referents. Thus, speech disfluencies could enable a young word learner to narrow the pool of possible referents that she considers in a given discourse context. Anticipating the referent could also facilitate processing by enhancing the speed of spoken word recognition, and by allowing cognitive resources to be more quickly reassigned to new learning material following the label (Marchman & Fernald, 2008).

Disfluencies are a reliable property of speech between adults. Fox Tree (1995) estimated that about six disfluencies occur per 100 words, excluding pauses (which are not necessarily disfluencies). Shriberg (1996) estimated that disfluencies occur on average every seven to 15 words in conversation between adults.¹ The rate of disfluency varies as a function of several factors, including the speakers’ familiarity with one another, utterance length, and speech rate (Shriberg, 1996). Disfluencies include pauses, repeated words, lengthened syllables, abandoned phrases, inserted filler phrases, and speech errors. We focus here on the most common type of disfluency, the filled pause, ‘uh’ and ‘um’ in English (Shriberg, 1996). This type of disfluency is characteristic of planning problems, such as the lexical retrieval difficulties associated with producing infrequent and discourse-new words (Arnold & Tanenhaus, in press; Clark & Fox Tree, 2002). Consider the following example of a filled pause from the Sachs corpus in CHILDES (MacWhinney, 2000):

(1) CHILD: Telephone?

MOTHER: No, that wasn’t the telephone, honey. That was the, uh, timer.

The filled pause, ‘uh’, occurs before ‘timer’, a word that is infrequent and previously unmentioned in the discourse. Low frequency and discourse-new lexical items like this require more processing time due to the delay involved in lexical retrieval. Disfluencies before these hard-to-retrieve words function to provide the speaker with time to retrieve the word while simultaneously signaling to the listener that the speaker is having difficulty (Clark & Fox Tree, 2002; Fox Tree & Clark, 1997).

One important note on these types of filled-pause disfluencies concerns the determiner that precedes them. The word ‘the’ has two alternative pronunciations: ‘thuh’ (i.e. rhymes with ‘duh’) and ‘thee’ (i.e. rhymes with ‘bee’). The full, unreduced form ‘thee’ is far more likely to be produced in conjunction with other evidence of processing difficulties, such as before delays (unfilled pauses) and fillers (e.g. ‘thee, uh’, ‘thee, you know, thee’), and during repeats (‘thee, thee’) (Clark & Wasow, 1998; Clark & Fox Tree, 2002; Fox Tree & Clark, 1997). This cannot be said of the more common, reduced form, ‘thuh’. Although ‘thuh’ is more common overall in spontaneous speech, it is far less likely to precede an immediate suspension of speech. Fox Tree and Clark (1997) reported that in their analysis, 81% of the instances of ‘thee’ were followed by a suspension of speech, compared to only 7% of a matched sample of instances of ‘thuh’. Thus, the unreduced form, ‘thee’, is highly predictive of a subsequent disfluency. As a result, upon hearing an unreduced form, listeners might assume that a retrieval-induced disfluency is forthcoming.
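To make the size of this contrast concrete, the two proportions quoted from Fox Tree and Clark (1997) can be turned into a relative likelihood. This is only a worked arithmetic sketch; the variable names are illustrative and nothing here goes beyond the two percentages given above.

```python
# Proportions reported by Fox Tree and Clark (1997), as quoted in the text.
p_suspension_given_thee = 0.81  # 'thee' followed by a suspension of speech
p_suspension_given_thuh = 0.07  # matched 'thuh' followed by a suspension

# How many times more likely a suspension is after 'thee' than after 'thuh'.
relative_likelihood = p_suspension_given_thee / p_suspension_given_thuh
print(f"{relative_likelihood:.1f}")
```

On these figures a suspension of speech is roughly eleven to twelve times more likely after ‘thee’ than after ‘thuh’, which is the sense in which the unreduced form is predictive.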

Indeed, previous research with adults suggests that disfluencies facilitate online sentence comprehension: In a series of eye-tracking experiments, adults showed a bias to look at discourse-new or unfamiliar objects when labels were preceded by the types of filled-pause disfluency discussed above (Arnold, Tanenhaus, Altmann & Fagnano, 2004; Arnold, Fagnano & Tanenhaus, 2003; Arnold, Hudson Kam & Tanenhaus, 2007).

Here we ask whether toddlers are able to detect and use disfluencies during online spoken word recognition. In particular, we explore whether young children predict that words preceded by disfluencies will refer to unfamiliar or discourse-new referents. There are two reasons why we might expect children to use disfluencies predictively in this manner. First, there is growing evidence that determiners play an important role in children’s language processing: Young children recognize words more rapidly when preceded by an appropriate and informative function word (Kedar, Casasola & Lust, 2006; Lew-Williams & Fernald, 2007; Zangl & Fernald, 2007), and can even use function words to identify the lexical category of unfamiliar labels (Bernal, Millotte & Christophe, 2007). Second, children are sensitive to the prosody of disfluent speech (Soderstrom & Morgan, 2007). Thus, we test whether children, like adults, use indicators of processing difficulties, such as lengthening of the determiner and a subsequent filled-pause (‘thee uh’), in order to anticipate likely referents for an upcoming noun. Given the demands of word learning, sensitivity to the informativeness of disfluencies could be highly advantageous for the young word learner.

Experiment 1

Methods

Participants

Sixteen parents volunteered their toddlers for the study. Parents were recruited through mailings, posters, and web ads. The toddlers ranged in age from 2;4 to 2;8 (M = 2;6), had no reported hearing deficits, and were from monolingual, English-speaking homes. Participants received either $10 or a toy as compensation.

Stimuli

The stimuli consisted of 32 non-animate objects and their labels.² Half of these items were familiar objects (e.g. ‘ball’) whose labels were the earliest acquired words listed in the MacArthur-Bates Communicative Development Inventories (Dale & Fenson, 1996). The other half were novel objects, matched by adult judgments to the familiar items in brightness and visual complexity. A novel word was created for each novel object. The set of novel words matched the familiar words in syllable lengths, word onsets, and stress patterns (see Appendix 1 for word lists). The items were divided into 16 familiar-novel object pairs, such that each pair contained one familiar and one novel item.

Procedure

Each child was seated on a parent’s lap with the child’s eyes approximately 63 cm from the 17-inch LCD monitor of a Tobii 1750 eye-tracker. The auditory and visual stimuli were presented from a host Macintosh computer using PsyScope X software (Cohen, MacWhinney, Flatt & Provost, 1993). Calibration of the eye-tracker was performed using Clearview software. The calibration involved the child fixating a shrinking dot located successively at each of five different screen locations. The parent wore headphones playing music to mask the auditory stimuli and was asked to direct their gaze downward during the experiment to prevent influencing their child’s behavior. The experiment consisted of 16 trials, each featuring a unique familiar-novel object pair. Each trial was initiated only when the child attended to a small, animated attention-getter (a video of a laughing baby) presented in the center of the Tobii display.

On each trial, the objects from one of the 16 familiar-novel object pairs were presented side-by-side three times in succession (Figure 1). The objects’ locations within a given trial were fixed. During the first two presentations, the familiar object was labeled, first with the carrier phrase ‘I see the X!’, then with the phrase ‘Oooh! What a nice X!’ During the first two presentations, objects appeared on the screen 500 ms before the carrier phrase, and remained on the screen until 2 seconds after the onset of the familiar target word. The first two presentations were separated by a 1 second pause, during which the screen was blank. During the third presentation, children were instructed to look at either the familiar/mentioned object or the novel/unmentioned object, and the instruction was either produced fluently or contained a filled pause (i.e. ‘thee uh’). The disfluency was preceded by the full, unreduced pronunciation of the determiner, ‘thee’, because this form occurs more commonly before suspensions of speech (Fox Tree & Clark, 1997) and could thus be considered most natural. Similarly, ‘uh’ was chosen as the filled pause because it is more common than ‘um’ in natural speech. Table 1 displays the phrase used for each of these four different trial types. Disfluencies were equally likely to precede familiar and novel targets, thus preventing children from learning any relationship between disfluencies and target familiarity over the course of the experiment. During the third presentation, objects did not appear until 2 seconds before the target object was labeled (which corresponded to the period of disfluency during disfluent trials). On this third presentation, objects remained on the screen for 3 seconds after the onset of the target label.

Figure 1.

Each of the 16 familiar-novel object pairs was presented three times in succession. The familiar object was always labeled during the first two presentations. On the third presentation, children were instructed to look either at the familiar or novel object with either a fluent or disfluent command. The window of analysis used was the 2 seconds before the onset of the final object label (the period of disfluency in disfluent trials).

Table 1.

Trial type examples

	Familiar target	Novel target
Fluent	Look! Look at the ball!	Look! Look at the wug!
Disfluent	Look! Look at thee, uh, ball!	Look! Look at thee, uh, wug!

Critically, at the onset of the third presentation, one of the objects was both novel and previously unmentioned in the discourse. Because it was unclear whether toddlers would be sensitive to either of these factors, we jointly manipulated novelty and discourse-new status in order to maximize the chance of observing an effect (i.e. to determine whether toddlers can use disfluencies predictively in any capacity).
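The crossed design summarized in Table 1 can be sketched as a trial-list generator. This is an illustration only: the text says that disfluencies were equally likely to precede familiar and novel targets, and the sketch assumes this means four pairs per trial type. The function name `build_trials`, the fixed random seed, and all labels other than ‘ball’ and ‘wug’ are hypothetical.

```python
import random
from collections import Counter

# Carrier phrases for the third presentation, taken from Table 1.
CARRIERS = {
    "fluent": "Look! Look at the {label}!",
    "disfluent": "Look! Look at thee, uh, {label}!",
}

def build_trials(pairs, seed=0):
    """Assign each familiar-novel pair to one of the four trial types.

    Assumption: every trial type is used equally often (4 of 16 pairs).
    """
    if len(pairs) % 4 != 0:
        raise ValueError("number of pairs must be divisible by 4")
    conditions = [(fluency, target)
                  for fluency in ("fluent", "disfluent")
                  for target in ("familiar", "novel")]
    # Repeat the four conditions equally often, then shuffle their order.
    schedule = conditions * (len(pairs) // 4)
    random.Random(seed).shuffle(schedule)

    trials = []
    for (familiar, novel), (fluency, target) in zip(pairs, schedule):
        label = familiar if target == "familiar" else novel
        trials.append({
            "pair": (familiar, novel),
            "fluency": fluency,
            "target": target,
            "utterance": CARRIERS[fluency].format(label=label),
        })
    return trials

# Placeholder labels; only 'ball' and 'wug' come from Table 1.
pairs = [(f"familiar{i}", f"novel{i}") for i in range(1, 17)]
pairs[0] = ("ball", "wug")

trials = build_trials(pairs)
counts = Counter((t["fluency"], t["target"]) for t in trials)
for condition, n in sorted(counts.items()):
    print(condition, n)
```

Because fluency and target type are fully crossed and balanced, a disfluency carries no information about the target within the experiment itself, which is the property the design relies on.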

Window of analysis

If children do use disfluencies predictively, we would expect to see more looks to the novel/unmentioned object during the period of disfluency. In the disfluent trials, the earliest sign of the disfluency is at the determiner: ‘thee’ in disfluent trials versus ‘the’ in fluent speech. Thus, the determiner was chosen as the onset of the window of analysis in disfluent trials. Because our focus was on anticipatory looking to the target, the window of analysis ended at the onset of the target word, 2 seconds later. Young children require an estimated 250 ms to program and initiate saccades in response to a stimulus (Canfield, Smith, Brezsnyak & Snow, 1997). To compensate for this stimulus response latency, we shifted the window of analysis forward by 250 ms. Using this shifted 2-second window, we compared children’s anticipatory fixations across disfluent and fluent trials.
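The shifted window described above can be sketched in a few lines. This is a minimal illustration, not the authors’ analysis code: the gaze-sample format (a timestamp in ms from trial start plus an area-of-interest code), the 20 ms sample spacing, the example onset time, and the function names are all assumptions made for the example. Only the 2-second window and the 250 ms shift come from the text.

```python
SACCADE_SHIFT_MS = 250  # shift reported in the text
WINDOW_MS = 2000        # determiner onset to target-word onset

def analysis_window(determiner_onset_ms):
    """Return (start, end) of the shifted window, in ms from trial start."""
    start = determiner_onset_ms + SACCADE_SHIFT_MS
    return start, start + WINDOW_MS

def novel_looking(samples, window, sample_ms=20):
    """Return (ms spent on the novel object, proportion of object looks to novel).

    samples: iterable of (timestamp_ms, aoi), aoi in {"novel", "familiar", None}.
    sample_ms: assumed spacing between samples (hypothetical value).
    """
    start, end = window
    in_window = [aoi for t, aoi in samples if start <= t < end]
    novel = sum(aoi == "novel" for aoi in in_window)
    familiar = sum(aoi == "familiar" for aoi in in_window)
    on_objects = novel + familiar
    proportion = novel / on_objects if on_objects else float("nan")
    return novel * sample_ms, proportion

def fake_aoi(t):
    """Synthetic gaze record, invented purely to exercise the functions."""
    if 2250 <= t < 3450:
        return "novel"
    if 3450 <= t < 4250:
        return "familiar"
    return None

samples = [(t, fake_aoi(t)) for t in range(0, 5000, 20)]
window = analysis_window(determiner_onset_ms=2000)
novel_ms, novel_prop = novel_looking(samples, window)
print(window, novel_ms, novel_prop)
```

Returning both the raw time and the proportion mirrors the two measures reported in the Results: total looking time to the novel object, and looking to the novel object as a share of looking to either object.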

Due to the nature of disfluencies, the disfluent utterance is longer; consequently, the linguistic material in the window of analysis varied across fluent and disfluent trials (see Figure 1 and timecourse plots in Figures 2 and 3). To compensate for this difference, the command ‘Look!’ was repeated in all trials. Thus, in all trials, children had been instructed to look at the screen before the third presentation of the object pair and the onset of the window of analysis. The first ‘Look!’ instruction was successful in directing children’s attention to the screen: on 88.4% of trials, children were looking at the screen immediately prior to the onset of the window of analysis, with no significant difference between fluent and disfluent trials.

Figure 2.

Proportion of looks to the novel/unmentioned object over the course of the third picture presentation for trials with novel/unmentioned targets (shifted by 250 ms to compensate for saccade latency). During the two-second window of analysis (the period of disfluency in disfluent trials) children’s proportional looking to the novel/unmentioned object was higher overall (p < .007). After the target is labeled (just after 4000 ms), looks increase to the target (the novel/unmentioned object in these trials).

Figure 3.

Proportion of looks to the novel/unmentioned object over the course of the third picture presentation for trials with *familiar*/*previously mentioned* targets (shifted by 250 ms to compensate for saccade latency). During the two-second window of analysis, children’s proportional looking to the novel/unmentioned object was higher overall (p < .006). After the target is labeled, looks increase to the target (the *familiar*/*previously mentioned* object).

Results

To ensure that children looked reliably at the appropriate object after it was named, we first calculated for each trial type (fluent, disfluent) the proportion of time the child looked at the target object during the 2-second period after the target was labeled.³ On trials in which the target was familiar, the mean proportion of looking to the target during this window was 0.77. A Wilcoxon signed-rank test found this value to be significantly different from chance (V = 131, p < .0003), suggesting that children reliably mapped familiar words to familiar objects. During trials in which the target was novel, the mean proportion of looking to the target was 0.74, which was also significantly different from chance (V = 132, p < .0002). This result suggests that children used the principle of contrast to infer that the novel label referred to the novel object. Finally, the proportions with which children fixated the target did not differ for novel-target and familiar-target trials (V = 78, p > .63). Taken together, these results suggest that children consistently arrived at the target object, regardless of the trial type.
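For readers unfamiliar with the V statistic, the following sketch shows how a Wilcoxon signed-rank test against chance (0.5) can be computed. It is not the authors’ analysis: the 16 proportions are invented for illustration, the function names are hypothetical, and the exact p-value routine assumes no tied or zero differences. The text does not say whether the reported p-values were exact or approximate, so this example is not expected to reproduce them.

```python
def signed_rank_v(values, mu=0.0):
    """Return (V, n): the sum of ranks of positive differences from mu,
    and the number of nonzero differences."""
    diffs = [x - mu for x in values if x != mu]
    magnitudes = [abs(d) for d in diffs]
    if len(set(magnitudes)) != len(magnitudes):
        raise ValueError("tied magnitudes: the exact test below does not apply")
    # Rank differences by absolute size, smallest first (rank 1).
    order = sorted(range(len(diffs)), key=lambda i: magnitudes[i])
    v = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    return v, len(diffs)

def exact_two_sided_p(v, n):
    """Exact two-sided p-value for V under the null hypothesis (no ties)."""
    max_v = n * (n + 1) // 2
    # counts[s] = number of subsets of ranks {1..n} whose sum is s.
    counts = [0] * (max_v + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_v, rank - 1, -1):
            counts[s] += counts[s - rank]
    total = 2 ** n
    upper = sum(counts[v:]) / total       # P(V >= v)
    lower = sum(counts[: v + 1]) / total  # P(V <= v)
    return min(1.0, 2 * min(upper, lower))

# Hypothetical per-child proportions of looking to the target (not study data).
proportions = [0.91, 0.85, 0.62, 0.78, 0.44, 0.73, 0.88, 0.69,
               0.81, 0.57, 0.95, 0.66, 0.72, 0.83, 0.47, 0.79]

v, n = signed_rank_v(proportions, mu=0.5)
p = exact_two_sided_p(v, n)
print(v, n, p)
```

With 16 nonzero differences, V can range from 0 to 16 × 17 / 2 = 136, so values in the low 130s indicate that nearly every child fell on the same side of chance. A paired comparison, such as fluent versus disfluent trials, can use the same functions by passing each child’s difference score with `mu=0`.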

Next, we calculated the proportion of looks to the novel object at each time point during the critical 2-second window of analysis before the onset of the target word (the period of disfluency in disfluent trials). Figure 2 shows the resulting timecourse plot for trials in which the target was novel. As predicted by our hypothesis, children looked more towards the novel object during the 2 seconds before the onset of the target word when there was a disfluency present (i.e. in disfluent trials). This suggests that the disfluency served as a cue that led children to expect that the upcoming referent would be novel/unmentioned. Figure 3 shows the timecourse plot for trials in which the target was the familiar object. In these trials also, children looked more towards the novel/unmentioned object during the pre-target window of analysis in the disfluent trials than in the fluent ones. This again suggests that the disfluency prompted children to anticipate that the novel/unmentioned referent would be labeled, though in these familiar-target trials, that expectation was ultimately violated. Both Figures 2 and 3 also show that, after the onset of the target word, children correctly identified the target picture.

The timecourse plots suggest that children were sensitive to the presence of the disfluency and were biased to interpret that disfluency as signaling that the upcoming word would refer to the novel/previously unmentioned referent. To test that hypothesis, we compared looks to the novel/unmentioned object across fluent and disfluent trials in the 2-second window of analysis before the onset of the target word. During disfluent trials, children looked at the novel object for 1158 ms. During fluent trials, children looked at the novel object for 893 ms. A Wilcoxon signed-rank test found this difference to be highly significant (V = 125, p < .002). This result suggests that children are sensitive to disfluencies and use them predictively to infer that an upcoming referent is likely to be novel and/or previously unmentioned.

A possible alternative explanation for this result is that children simply paid more attention overall to the display (both objects) during disfluencies. To further examine whether disfluencies cause a selective increase in looking to the novel/unmentioned object, we compared the average proportion of total looking time to the novel object during the same temporal window of analysis. The proportion of time children looked at the novel object was 0.68 in the disfluent trials, as opposed to 0.54 in the fluent trials. A Wilcoxon signed-rank test found this difference to be significant (V = 20, p < .01). Further, the proportion of looking time to the novel object was significantly above chance in the disfluent trials (V = 132, p < .0003), whereas in the fluent trials, children’s looking to the two objects did not differ significantly from chance (V = 87, p = .34). These results demonstrate that disfluencies cause a selective increase in attention to novel and/or previously unmentioned objects, suggesting that children use disfluencies online to create expectations about the speaker’s intended referent.

Experiment 2

The results of Experiment 1 demonstrated that by 2;6, young children can use disfluencies predictively to anticipate the visual referent of a forthcoming word. In Experiment 2 we repeated the experimental procedure with two groups of younger children to explore at what age this ability emerges. We chose 16 months as a lower limit because 15 months is the youngest age at which children have been demonstrated to use the principle of contrast to map novel words to novel referents. Given the nature of our task, an ability to use the principle of contrast to map novel words may be a prerequisite for observing an effect of disfluencies. The design was identical to Experiment 1.