Prediction in a visual language: real-time sentence processing in American Sign Language across development

Amy M Lieberman; Arielle Borovsky; Rachel I Mayberry

doi:10.1080/23273798.2017.1411961

Author manuscript; available in PMC: 2019 Jan 1.

Published in final edited form as: Lang Cogn Neurosci. 2017 Dec 8;33(4):387–401. doi: 10.1080/23273798.2017.1411961

PMCID: PMC5909983 NIHMSID: NIHMS940846 PMID: 29687014

Abstract

Prediction during sign language comprehension may enable signers to integrate linguistic and non-linguistic information within the visual modality. In two eyetracking experiments, we investigated American Sign Language (ASL) semantic prediction in deaf adults and children (aged 4–8 years). Participants viewed ASL sentences in a visual world paradigm in which the sentence-initial verb was either neutral or constrained relative to the sentence-final target noun. Adults and children made anticipatory looks to the target picture before the onset of the target noun in the constrained condition only, showing evidence for semantic prediction. Crucially, signers alternated gaze between the stimulus sign and the target picture only when the sentential object could be predicted from the verb. Signers therefore engage in prediction by optimizing visual attention between divided linguistic and referential signals. These patterns suggest that prediction is a modality-independent process, and theoretical implications are discussed.

Keywords: American Sign Language, deaf, semantic processing, prediction, eye-tracking, visual world

Introduction

It is well established that spoken language processing is incremental and dynamic. As listeners perceive and process linguistic input, they actively use the information in the input to anticipate what will come next (see Huettig, Rommers, & Meyer, 2011 for a review). When perceiving a visual scene, listeners can use linguistic input incrementally to narrow down the possible visual referents until sufficient information has been given to identify one unique referent. However, theoretical accounts of how visual and linguistic information is integrated over time are based entirely on a framework where linguistic and non-linguistic information are segregated by sensory modality. That is, spoken language is primarily perceived via the auditory modality, while the accompanying referential information is typically presented via a visual scene (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995). During the mapping of linguistic input onto a visual scene, linguistic and non-linguistic input are neatly separated by sensory modality, allowing for uninterrupted on-line processing of auditory linguistic and visual nonverbal information simultaneously.

Not all languages allow for such segregation of linguistic and non-linguistic information by sensory modality, however. In contrast to the typical situation where auditory language comprehension and non-linguistic visual information processing are sensorially segregated, individuals who communicate using a sign language such as American Sign Language (ASL) face a different task. ASL is produced manually (using the hands, body, facial expressions and other markers) and perceived visually. To comprehend ASL, signers must focus their visual gaze on the linguistic signal (Emmorey, Thompson, & Colvin, 2008).

The question we investigate here is whether linguistic prediction modulates signers’ focus on the linguistic signal relative to the integration of non-linguistic information when both signals occur within the same sensory modality, vision. The demands of on-line, visual sentence processing might preclude the additional uptake of non-linguistic information co-occurring in the visual modality until comprehension is complete. This would suggest that linguistic and non-linguistic information cannot be as readily integrated during on-line comprehension when they occur within the same sensory modality as compared to when they are segregated by sensory modality. An alternative possibility, tested here, is that linguistic prediction during on-line comprehension enables the integration of linguistic and non-linguistic information when they occur within the same sensory modality. If so, this would suggest that prediction is a fundamental and modality-independent factor in language comprehension.

Investigating gaze patterns during ASL sentence comprehension is crucial for obtaining a full theoretical explanation of how prediction affects the integration of linguistic and non-linguistic information. Additionally, identifying signers’ gaze patterns during language comprehension can inform theoretical accounts of how listeners may prioritize visual and linguistic information (Knoeferle & Crocker, 2007), as this provides a unique test case where language and referential processing occur over the same sensory channel. Before describing the study, we first turn to studies of prediction during language comprehension and the importance of eye gaze in sign language processing in both adults and children.

Prediction during language comprehension

The predictive nature of language processing is demonstrated clearly in experiments where semantic information early in the sentence can be used to predict an upcoming target word at the end of the sentence. Altmann & Kamide (1999) first demonstrated this phenomenon using a visual world eye-tracking paradigm. Participants viewed a scene of an agent (e.g. a boy) and a set of concrete objects (e.g. a cake, a train set, a toy car, and a balloon), and then heard sentences that either constrained the target object based on the verb (e.g. “The boy will eat the cake,” in which the cake was the only edible object in the scene), or had no such constraining information (e.g. “The boy will move the cake,” in which at least four objects could be moved). Using this paradigm, Altmann & Kamide found that adult participants were faster to look towards a target in the constrained condition than in the neutral condition. This effect has been replicated in a variety of tasks (Kamide, Altmann, & Haywood, 2003; Nation, Marshall, & Altmann, 2003) and with children as young as age 2 (Fernald, Zangl, Portillo, & Marchman, 2008; Mani & Huettig, 2012), suggesting that predictive processes emerge early in development and may be a central mechanism in language processing (but cf. Huettig & Mani, 2016).

While anticipatory eye-movements and prediction effects during linguistic processing emerge from an early age, there are also important individual differences that appear to influence aspects of prediction ability. Prior studies have found that online predictive abilities correlate with a variety of skills in younger and older children including receptive vocabulary in 3- to 10-year-old children (Borovsky & Creel, 2014; Borovsky, Elman & Fernald, 2012), productive vocabulary in 2-year-olds (Mani & Huettig, 2012), reading ability in 8-year-olds (Mani & Huettig, 2014), literacy skills in adults (Mishra, Singh, Pandey, & Huettig, 2012), and working memory in adults (Huettig & Janse, 2016). Thus, there are individual differences in the degree to which listeners of all ages engage in prediction, and these differences correlate with a variety of abilities tied to linguistic experience and skill.

Gaze and information processing in sign language

In sign language, the task of processing multiple sources of information through vision is acquired early and used routinely to navigate the world. Deaf children learning ASL develop sophisticated strategies for alternating gaze between linguistic information and objects and people in the environment, which enables them to achieve coordinated visual attention with their caregivers (Harris, Clibbens, Chasin & Tibbits, 1989; Waxman & Spencer, 1997). By age two, deaf children with deaf parents show frequent shifts in gaze during mother-child interaction (Lieberman, Hatrak, & Mayberry, 2014). Thus, frequent and meaningful gaze shifts are a natural component of sign language comprehension that develops from an early age.

Despite the modality differences between sign and spoken language, native signers interpret lexical signs much as listeners process spoken words (Bosworth & Emmorey, 2010; MacSweeney et al., 2006; Mayberry, Chen, Witcher, & Klein, 2011). Previous studies of sign lexical processing typically have employed paradigms in which signs are either presented with no visual referents (Emmorey & Corina, 1990; Carreiras, Gutierrez-Sigut, Baquero, & Corina, 2008; Morford & Carlson, 2011), or where the signs and their referents are presented sequentially, such as in priming or picture-matching studies (Bosworth & Emmorey, 2010; Ormel, Hermans, Knoors, & Verhoeven, 2009). To date, there has been little work examining how signers manage visual attention to both a sign stimulus and a concurrent visual scene when the sign stimulus unfolds within a sentence context over time. The current study is a first step in filling this gap.

The current study

Studies of sign language acquisition and processing in typical learners largely suggest modality-independent mechanisms are at play (MacSweeney et al., 2006; Mayberry et al., 2011). Despite these broad similarities, there are also important possible differences in how signers might integrate visual referents when they must navigate between referents and a linguistic signal within the same modality. While prior studies have explored this question using single words (Lieberman, Borovsky, Hatrak & Mayberry, 2015), they do not address the more fundamental question of how signers engage in on-line sentence comprehension in relation to concurrent non-linguistic information in the visual modality. Given the robustness of predictive processing in auditory language comprehension, and the consistency with which adults and children have been shown to make anticipatory looks to a target picture when the target can be predicted from semantic information presented in the auditory modality, the visual-world paradigm serves as an ideal test case for the current study. Specifically, by contrasting gaze behavior during the comprehension of ASL sentences either with or without a semantically constraining verb, we investigate the role prediction may play in the integration of linguistic and non-linguistic information within the same visual modality.

The visual nature of ASL processing creates competition for visual attention. When the visual system must do “double duty” to recognise information in both the referential and linguistic streams, it is unknown how signers will direct their visual gaze. If signers apply a strategy of waiting until a discrete unit of linguistic information (i.e. a single sentence) is complete, then they will not shift gaze to a referent until the end of the signed sentence, irrespective of the semantic relation between an earlier verb and later sentential object. This conservative “wait and see” strategy could prove optimal, given recent theoretical proposals that individuals may modulate their language processing strategies based on information available in the moment (Kuperberg & Jaeger, 2016). Alternatively, signers may use prediction the same way as spoken language comprehenders and strategically direct anticipatory looks to the target picture when it can be predicted from the verb (Nation et al., 2003; Mani & Huettig, 2012). This strategy may underlie the visual gaze behavior of deaf children, who have been observed to rapidly shift gaze between linguistic input and visual referents during signed discourse (Lieberman et al., 2014).

Using a stimulus set and paradigm modeled after spoken language semantic prediction studies (Mani & Huettig, 2012), we ask whether semantically constraining information from a verb that predicts an upcoming noun modulates gaze toward a non-linguistic visual target. We compare signers’ eye movements while perceiving ASL sentences with a constraining verb such as “EAT¹” in which the verb EAT constrains the possible target words to one edible object pictured on the screen, to sentences that begin with a sign such as “SEE” in which the verb contains no such constraining information. If visual language processing requires a “wait and see” strategy during on-line sentence comprehension, then signers’ gaze patterns to non-linguistic visual targets should not be modulated by the prediction created when the verb semantically constrains the upcoming noun. Alternatively, as we hypothesise, if prediction allows for the integration of linguistic and non-linguistic information within the same modality, as it does when this information is segregated by modality, then signers’ gaze patterns should vary as a function of the presence or absence of the semantic constraint created for the upcoming noun from the preceding verb. In Experiment 1, we test this hypothesis in adult deaf signers. In Experiment 2, we ask whether the gaze patterns observed in adult signers during this sentence processing task are evident in deaf children between the ages of 4 and 8 years old, to determine if there are developmental changes in the timing and consistency of these gaze fixations.

Experiment 1: ASL sentence processing in adult deaf signers

Methods

Participants

Seventeen deaf adults between the ages of 19 and 61 years (M = 32) participated. There were seven females. All of the adults reported using ASL as their primary means of communication. Nine participants had deaf parents and had been exposed to ASL from birth. The remaining eight participants had hearing parents and were first exposed to ASL at various ages: before the age of 2 (n=6), at the age of 5 (n=1) and at the age of 11 (n=1). All participants had been using ASL as their primary form of communication for at least 19 years. One additional adult was tested but was unable to complete the eye-tracking task.

Eye-tracking materials

The stimulus display consisted of four pictures on a screen and one ASL sentence. The stimulus pictures were colorful photo-realistic images presented on a white background square measuring 300 by 300 pixels. The ASL signs were presented on a black background square also measuring 300 by 300 pixels. The pictures and signs were presented on a 17-inch LCD display with a black background, with one picture in each quadrant of the monitor and the sign positioned in the middle of the display, equidistant from the pictures (Figure 1).

Figure 1. Layout of pictures and signed video stimuli

Eight sets of four pictures served as the stimuli for the prediction task. Each picture set contained four objects, each of which could be paired with either a neutral verb or a unique semantically-constraining verb. For example, one set consisted of a jug of milk, a baby doll, a book, and a cake, which were paired, respectively, with the verbs POUR, HUG, READ, and EAT. During the experiment, each set of pictures was presented twice—once in the neutral condition and once in the constrained condition, for a total of 16 critical trials. Each picture in the set was equally likely to serve as a target across versions of the stimuli sets. Thus for each participant, one picture from the set of four served as a target in the constrained condition and a different picture served as the target in the neutral condition; each participant saw 16 of the 32 possible target signs. Target items were counterbalanced across participants so that each participant saw eight target nouns produced in the neutral condition and eight different target nouns produced in the constrained condition. The arrangement of pictures was counterbalanced such that each picture was equally likely to appear in any position, and the same picture never occurred in the same location twice. Finally, the order of trials was pseudo-randomised such that the same picture set never appeared in two consecutive trials.
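The ordering constraint described above (the same picture set never appears in two consecutive trials) can be illustrated with a short sketch. This is not the authors' procedure, which the text does not specify; the function name `pseudo_randomise`, the rejection-sampling approach, and the fixed seed are all assumptions made for illustration.

```python
import random

def pseudo_randomise(trials, key, rng, max_tries=10_000):
    """Reshuffle until no two consecutive trials share key(trial).

    Hypothetical helper: simple rejection sampling, assumed here
    because the source does not say how the order was generated.
    """
    order = list(trials)
    for _ in range(max_tries):
        rng.shuffle(order)
        if all(key(a) != key(b) for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid order found")

# 8 picture sets x 2 conditions = 16 critical trials, as in the text.
trials = [(picture_set, condition)
          for picture_set in range(1, 9)
          for condition in ("neutral", "constrained")]

# The seed is arbitrary; it only makes the sketch reproducible.
order = pseudo_randomise(trials, key=lambda t: t[0], rng=random.Random(0))
```

Rejection sampling is adequate at this scale because a sizeable share of random shuffles of 16 trials already satisfies the constraint; a constructive algorithm would only be needed for much tighter constraints.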

The linguistic stimulus consisted of an ASL sentence that directed the participant towards one of the target pictures. Each sentence was composed of the structure VERB WHAT TARGET. This sentence structure was chosen to be comparable to spoken language sentences typically used in this paradigm such as “See the cake,” in which there is a determiner separating the verb and the noun. In ASL syntax, there is no article before a noun, and use of a pronoun or determiner (e.g. THAT) would necessitate use of a directional cue (i.e. the determiner would be articulated with spatial modification and a non-manual marker). Pragmatically, the WHAT sign following the verb is not intended as a true question or even a rhetorical question; rather, it is a syntactic device to focus the constituent (Wilbur, 1996). While adding the sign WHAT to the sentence created a longer verb window, it enabled us to make a clear delineation between the verb and noun windows for analysis.

In the neutral condition, the verb at sentence onset paired equally well with each of the four pictures on the screen. Four different semantically neutral verbs were used -- FIND, LOOK-FOR, SEE, and SHOW-ME. In contrast, in the constrained condition the verb at sentence onset limited the possible target such that only one picture represented an item that was semantically related to the verb. For example, the ASL sentence “POUR WHAT MILK” appeared with only one “pour-able” item (milk) and three unrelated objects that were implausible as objects of the verb POUR (see appendix for a full list of constraining verb-target pairs).

To create the stimulus ASL sentences, a deaf native signer was filmed as she produced each sentence in a natural, child-directed style. The sentences were then edited using Adobe Premiere software to eliminate extraneous frames such that the verb phrase “VERB WHAT” was of uniform length across sentences (2000ms). This allowed us to compare looking time across trials and conditions using a noun onset point of 2000ms following sentence onset. To control for co-articulation effects in the transition from the sign WHAT to the onset of the target word, we chose a conservative definition of sign onset, in which sign onset was defined as the first frame in which the signer’s hands left the final position of the sign WHAT. This approach enabled us to account for variation in transition time from the articulation of the previous sign to the initial position of the target sign, such as the difference in time it takes to move the hands to the torso vs. the face. The length of the final noun was 1000ms (+/− 33ms). The video ended on the frame following the final movement of the sign. All sign editing was completed by the first author, a proficient ASL signer, and by a research assistant who is a deaf, native ASL signer.

Experimental task

After obtaining consent, participants were brought into the testing room and seated in front of the LCD display and eye-tracking camera. The stimuli were presented using a PC running Eyelink Experiment Builder software (SR Research). Instructions were presented in ASL on a pre-recorded video, in a child-directed sign register, and explained to participants that they would be playing a game, where they would see pictures followed by an ASL sign, and that they should click on the picture that matches the sign. Participants were given two practice trials before the start of the experiment. Next a 5-point calibration and validation sequence was conducted. In addition, a single-point drift check was performed before each trial. The 16 experimental trials were then presented in a single block. After the trial block, participants were given a break during which they watched a short, engaging animated video, and then viewed trials of a different nature as part of a separate experiment.

On each trial, the pictures were first presented on the four quadrants of the monitor. Following a 750ms preview period, a central fixation cross appeared. When the participant fixated gaze on the cross, this triggered the onset of the video sentence stimulus. After the ASL sentence was presented, it disappeared and, following a 500ms interval, a small cursor appeared in the center of the screen. The pictures remained on the screen until the participant clicked on a picture, which ended the trial (Figure 2). The mouse clicking behavior served as confirmation that participants understood the task, but as the cursor did not appear until after the relevant analysis windows, mouse clicking should not influence gaze behavior.

Figure 2. Schematic of time periods used for statistical analysis

Eye-movement recording

Eye movements were recorded using an Eyelink 1000 remote eye-tracker with remote arm configuration (SR Research) at 500 Hz. The position of the display was adjusted manually such that the display and eye-tracking camera were placed 580–620 mm from the participant’s face. Eye movements were tracked automatically using a target sticker affixed to the participant’s forehead. Fixations were recorded on each trial beginning at the initial presentation of the picture sets and continuing until the participant clicked on the selected picture. Offline, the data were binned into 50-ms intervals.
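As an illustration of the offline binning step, the sketch below groups 500 Hz gaze samples (one every 2 ms) into 50-ms bins and counts samples per interest area. It is a minimal example under stated assumptions, not the authors' pipeline: the `(time_ms, aoi)` sample format, the labels `"video"` and `"target"`, and the function name `bin_samples` are hypothetical.

```python
from collections import Counter, defaultdict

SAMPLE_RATE_HZ = 500   # recording rate reported in the text
BIN_MS = 50            # bin width reported in the text

def bin_samples(samples, bin_ms=BIN_MS):
    """Map (time_ms, aoi) samples to {bin_start_ms: Counter of AOI labels}."""
    bins = defaultdict(Counter)
    for time_ms, aoi in samples:
        bin_start = int(time_ms // bin_ms) * bin_ms
        bins[bin_start][aoi] += 1
    return dict(bins)

# Toy data (invented): 100 ms of gaze, on the sign video for the
# first 60 ms and on the target picture afterwards.
step_ms = 1000 // SAMPLE_RATE_HZ
samples = [(t, "video" if t < 60 else "target")
           for t in range(0, 100, step_ms)]
binned = bin_samples(samples)
```

At 500 Hz each 50-ms bin holds 25 samples, so per-bin counts convert directly to proportions of looking time.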

Results

Accuracy on the experimental task

All adult participants completed all 16 trials with near 100% accuracy. That is, all participants selected the correct target picture on all 16 trials with the exception of one participant who made one error. This trial was removed from all analyses.

Approach to eye-tracking analysis

The primary goal of the analysis was to determine whether there was an overall effect of prediction, defined as increased looks to the target picture in the constrained condition relative to the neutral condition in the time window that occurs before the onset of the target sign. We then conducted a finer-grained inspection of gaze patterns over the time course of the sentence to determine when and how this prediction effect was manifested.

We divided the time course into two discrete windows of interest for statistical analysis. These windows were derived from the timing of the noun and verb in the linguistic stimulus, as follows. The verb window was defined as the portion of the sentence beginning at verb onset and continuing until noun onset. To be consistent with prior spoken language studies (Mani & Huettig, 2012), the analysis began 300ms following verb onset, and extended for 1700ms which was the point of noun onset. During this verb window, any increase in looks to the target in the constrained relative to the neutral condition would be evidence for anticipatory prediction effects. The second window, defined as the noun window, began at 300ms following noun onset and continued for 1700ms. This end point corresponded to 4000ms from sentence onset, at which point participants had largely either ended the trial or looked away from the target picture. Figure 2 illustrates the windows of analysis.
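The two analysis windows can be written down concretely. The sketch below encodes the verb window (300 to 2000 ms from sentence onset) and the noun window (2300 to 4000 ms) and computes the proportion of target looks in each for one toy trial. Treating the windows as half-open intervals, and the sample format and names used here, are assumptions for illustration only.

```python
# Window boundaries in ms from sentence onset, taken from the text.
# Half-open [start, end) intervals are an assumption of this sketch.
WINDOWS = {"verb": (300, 2000), "noun": (2300, 4000)}

def window_of(time_ms):
    """Return the name of the window containing time_ms, or None."""
    for name, (start, end) in WINDOWS.items():
        if start <= time_ms < end:
            return name
    return None

def target_proportion(samples, window):
    """Proportion of samples in `window` that fall on the target picture."""
    in_window = [aoi for t, aoi in samples if window_of(t) == window]
    if not in_window:
        return None
    return sum(aoi == "target" for aoi in in_window) / len(in_window)

def toy_aoi(t):
    # Invented gaze pattern: an early look to the target, a return to
    # the sign video, then a final look to the target.
    if 1000 <= t < 1500 or t >= 2500:
        return "target"
    return "video"

samples = [(t, toy_aoi(t)) for t in range(0, 4000, 50)]
verb_prop = target_proportion(samples, "verb")
noun_prop = target_proportion(samples, "noun")
```

In this toy trial, any target looks counted in the verb window occur before the noun is signed, which is what makes them interpretable as anticipatory.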

To analyze gaze patterns in these discrete time windows, we calculated the proportion of gaze samples in each time window on which participants fixated on the target, and aggregated together all trials within each condition for each participant. To visually examine gaze patterns, we plotted proportion of looks to the target in each window. This measure parallels prior eye-movement analyses in language paradigms using a similar design with children (e.g. Mani & Huettig, 2012), and therefore facilitates comparison more directly across spoken and sign language designs (Figure 3a). However, there are some important methodological differences in our current design compared to spoken visual-world tasks which necessitate a measure that takes into account not only the influence of the target relative to the static pictures, but also accounts for looks to the video of the signer. To address this issue, we transformed the mean proportion of fixations to the target using an empirical logit function (Barr, 2008) calculated by the EyetrackingR package in R (Dink & Ferguson, 2015). The empirical logit allows us to derive a measure that takes into account target fixations with respect to all other interest areas on the screen, including both the dynamic sign video and the static pictures.
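For readers unfamiliar with the transformation, the sketch below shows the commonly cited form of the empirical logit (Barr, 2008): the log of target samples plus 0.5 over non-target samples plus 0.5. The function name and the toy counts are hypothetical, and the code is a plain-Python illustration rather than the package's implementation.

```python
import math

def empirical_logit(target_samples, total_samples):
    """Empirical logit of target looks.

    total_samples counts looks to every interest area (sign video,
    target, and distractors), so the denominator reflects all
    non-target looks. The 0.5 terms keep the value finite when the
    target receives none or all of the samples.
    """
    non_target = total_samples - target_samples
    return math.log((target_samples + 0.5) / (non_target + 0.5))

# Toy counts (invented): 850 samples in a window.
mostly_video = empirical_logit(100, 850)   # negative: target looks are a minority
mostly_target = empirical_logit(750, 850)  # positive: target looks are a majority
```

Unlike a raw proportion, this measure is unbounded and symmetric around zero, which suits it better to linear regression.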

Figure 3. Adults’ mean fixations (s.e.) to target by time window and condition; a) proportion and b) empirical logit (Elog) of target looks.

We carried out statistical analyses on the empirical logit of target fixations using a linear mixed-effects regression model (Barr, 2008) using the lme4 package in R (Version 3.3.1), with random effects for participants and items (Baayen, Davidson, & Bates, 2008). The model included fixed effects of time window (verb vs. noun, sum-coded and centered), condition (neutral vs. constrained, sum-coded and centered), and the interaction between window and condition. Following the window analysis, we generated a time course visualization of looks to the target, video stimulus, and distractor pictures throughout the 4000ms beginning at sentence onset. Visual inspection of the time course enabled us to observe differences in gaze patterns in the constrained and neutral conditions as the sentence unfolded.
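The predictor coding can be sketched as follows. The text says only that predictors were sum-coded and centered; the specific values (±0.5), the choice of which level is positive, and the helper name `centered_sum_code` are assumptions. The sketch covers the coding step alone and does not fit the mixed-effects model.

```python
def centered_sum_code(labels, positive_level):
    """Code a two-level factor as +0.5 / -0.5, then subtract the mean.

    Centering matters when cells are unbalanced (for example, after a
    trial is excluded); with balanced data the mean is already zero.
    """
    raw = [0.5 if label == positive_level else -0.5 for label in labels]
    mean = sum(raw) / len(raw)
    return [value - mean for value in raw]

# Toy design (invented): one observation per cell of the 2 x 2 design.
conditions = ["neutral", "constrained", "neutral", "constrained"]
windows = ["verb", "verb", "noun", "noun"]

condition_c = centered_sum_code(conditions, "constrained")
window_c = centered_sum_code(windows, "noun")
# The interaction predictor is the product of the two coded factors.
interaction = [c * w for c, w in zip(condition_c, window_c)]
```

With this kind of coding, each main effect is estimated at the average of the other factor's levels rather than at a reference level.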

Eye-tracking results

Window analysis

We calculated the mean proportion of looks to the target in the verb and noun time windows in the constrained and neutral conditions (Figure 3). As described above, we fit a linear mixed-effects regression to compare the empirical logit transformation of target fixations by condition (constrained, neutral) and time window (verb, noun). Table 1 outlines the results of this analysis, which revealed main effects of time window and condition, and a significant interaction between time window and condition. Planned pairwise comparisons revealed that participants looked significantly more at the target in the constrained condition versus the neutral condition in the verb window (p = .003) but not in the noun window (p = .6). Thus the data show the expected prediction effect as evidenced by greater (anticipatory) looks to the target during the verb window in the constrained condition only.

Table 1.

Parameter estimates from the best-fitting mixed-effects regression model of the effects of time window and condition on empirical logit of adults’ looks to the target picture

Fixed effects	Estimate	Std. Error	t value	p-value
(Intercept)	3.40	.20	17.12	<.001
Condition	−2.23	.31	−7.08	<.001
Time window	−2.74	.26	−10.56	<.001
Condition x Time window	−3.17	.52	−6.13	<.001

Time course of fixations across the sentence

Figure 4 plots the time course of fixations to the target for the first 4000ms following sentence onset. As predicted, the proportion of looks to the target picture increased earlier in the constrained condition than in the neutral condition. Importantly, we also observed an additional gaze pattern unique to sign processing. Specifically, following the initial gaze to the target picture, signers directed a shift back to the linguistic signal, as evidenced by the decline in looks to the target and rise in looks to the ASL sentence between 1000–2000ms following sentence onset. This suggests that signers use a strategy of rapid alternation of gaze between linguistic input and the visual scene to integrate information from both sources of visual input. The robustness of the observed rise and fall in looks to the target during this time window indicates that signers were remarkably consistent in the timing of their gaze shifts from the target picture back to the sentence video. This pattern appears to be an adaptation to dividing attention between visual language and visual referents, a task that is not necessary for listeners perceiving spoken language.

Figure 4. Time course of adults’ mean fixations to the sign video, target picture, and distractor pictures from 0–4000ms following verb onset. The vertical line at 2000ms marks the end of the sign WHAT and the onset of the target noun.

Discussion of Experiment 1

Adult ASL signers showed prediction effects as evidenced by early and sustained looks to the target picture in the semantically constrained condition before the target noun was presented. After the target noun was presented, however, there was no difference in looking time to the target based on condition. Adults clearly used semantically constraining information at the onset of an ASL sentence to predict the target noun occurring at the end of the sentence. Furthermore, adults shifted gaze back to the video following an initial gaze to the target picture in the constrained condition. Together these results suggest that deaf adult signers with native or early exposure to ASL rely on prediction as a means to integrate linguistic and non-linguistic information within the same modality, and thus process semantic information incrementally in a way that is highly similar to spoken language processing. Like spoken language processing, ASL processing involves a shift to the target only when prediction enables gaze away from the linguistic signal to the non-linguistic one. Unique to sign language, however, is the fact that this gaze shift to the target necessitates a gaze shift away from the linguistic signal. Further, following gaze at the target, signers then gaze back at the linguistic stimulus. As discussed below, this may be driven by a desire to ensure that the actual target noun matches the previously predicted target noun, or it may be driven by the visual salience of the dynamic video. Having established that prediction modulates gaze patterns of adult deaf signers during on-line sentence comprehension, we now ask whether the same pattern is observed across development in deaf children.

Experiment 2: ASL sentence processing in child signers

While Experiment 1 demonstrated predictive processing in adult signers, in Experiment 2 we investigated processing in child signers between the ages of 4 and 8. Specifically, we ask whether children in this age range show evidence for predictive processing with the identical sentence comprehension task and stimuli as in Experiment 1. We further ask whether there are individual differences in deaf children’s ability to exploit semantically constraining information based on age and/or ASL knowledge. Hearing children processing spoken language make predictions based on semantic information from the age of two; thus we might expect older deaf children to perform in a manner parallel to adults. However, given the increased cognitive demands on visual attention arising from the need to divide attention between linguistic and referential information, children perceiving sign language may not show the same robust anticipatory looking behavior as that exhibited in studies of young spoken language learners.