Look at that: Spatial deixis reveals experience-related differences in prediction

Tracy Reuter; Mia Sullivan; Casey Lew-Williams

doi:10.1080/10489223.2021.1932905

Author manuscript; available in PMC: 2023 Jan 1.

Published in final edited form as: Lang Acquis. 2021 Jul 30;29(1):1–26. doi: 10.1080/10489223.2021.1932905

PMCID: PMC8916748 NIHMSID: NIHMS1714235 PMID: 35281590

Abstract

Prediction-based theories posit that interlocutors use prediction to process language efficiently and to coordinate dialogue. The present study evaluated whether listeners can use spatial deixis (i.e., this, that, these, and those) to predict the plurality and proximity of a speaker’s upcoming referent. In two eye-tracking experiments with varying referential complexity (N = 168), native English-speaking adults, native English-learning 5-year-olds, and non-native English-learning adults viewed images while listening to sentences with or without informative deictic determiners, e.g., Look at the/this/that/these/those wonderful cookie(s). Results showed that all groups successfully exploited plurality information. However, they varied in using deixis to anticipate the proximity of the referent; specifically, L1 adults showed more robust prediction than L2 adults, and L1 children did not show evidence of prediction. By evaluating listeners with varied language experiences, this investigation helps refine proposed mechanisms of prediction, and suggests that linguistic experience is key to the development of such mechanisms.

Keywords: language processing, deixis, prediction, simulation, association, covert imitation

A number of recent theories claim that prediction can support language processing and learning (Chang, Dell, & Bock, 2006; Christiansen & Chater, 2016; Pickering & Gambi, 2018; Pickering & Garrod, 2007, 2013). For example, Pickering and Garrod (2007; 2013) propose that comprehenders predict upcoming speech in order to coordinate dialogue. Supporting this view, a number of studies demonstrate that both adults and children can generate predictions during language processing (for review see Kutas, DeLong, & Smith, 2011). Despite the rapid pace of spoken language, listeners can use a variety of linguistic and nonlinguistic cues to predict upcoming information in speech. For example, in an eye-tracking experiment, Lukyanenko and Fisher (2016) found that two- and three-year-old children used number markings (is and are) to predict singular or plural referents in sentences such as “Where is the good apple?” and “Where are the good cookies?” This finding, among many others, supports the central claim in a number of contemporary psycholinguistic theories that prediction occurs during language processing.

However, language scientists and developmental scientists have continued to debate a range of issues, such as the extent to which prediction supports everyday language processing (Huettig & Mani, 2016; Kutas, DeLong, & Smith, 2011), and whether prediction supports language acquisition (Phillips & Ehrenhofer, 2015; Rabagliati, Gambi, & Pickering, 2016). A particularly central issue is understanding how prediction occurs at the intersection of language comprehension and production. Pickering and Garrod (2013) proposed that comprehenders can generate predictions via two routes: association or simulation. The association route relies on comprehension mechanisms, whereas the simulation route relies on production mechanisms. Importantly, for prediction to occur via simulation, the listener “must be able to represent what the speaker would say, not what he himself would say, and to do this, he needs to take into account the context” (Pickering & Garrod, 2013, p. 341). That is, accurate prediction via simulation requires some consideration for the speaker’s perspective, such as their visual viewpoint within the referential context, their state of mind, their age, or their knowledge of the conversational topic. Comprehenders can either estimate what their conversational partner would be likely to produce within a particular conversational context (i.e., simulation), or they can generate predictions based strictly on their own perspective (i.e., association).

While Pickering and Garrod (2007; 2013) largely focus on the state of prediction in adulthood, they do speculate as to the development of predictive mechanisms, suggesting that children, ostensibly due to overall differences in language proficiency as compared to native-speaking adults, may not predict via simulation and may instead rely on the association route. Indeed, there are a number of reasons to expect that simulation of a speaker’s upcoming utterances may be challenging for children. First, prior findings suggest that while five-year-old children can take a speaker’s perspective into account, they still make egocentric errors regarding what is or is not common ground among interlocutors (Clark, 1992; Nilsen & Graham, 2009). If taking the speaker’s perspective during language processing is generally challenging for children, then they may have difficulty simulating speakers’ perspectives and upcoming words, either deterministically or under certain circumstances (Pickering & Garrod, 2013; Pickering & Gambi, 2018).

Another reason to think that children may not predict via simulation is that they often have difficulty using contextual factors to guide their real-time language processing, such as the number of available referents in a shared visual context, the relations between agents and objects, or the order in which speakers mention referents. Whereas adults are adept in using surrounding visual and discourse context to rapidly resolve linguistic ambiguities, children have difficulty doing so for ambiguous syntax (Hurewitz et al., 2000; Snedeker & Trueswell, 2004; Trueswell, Sekerina, Hill, & Logrip, 1999) and ambiguous pronouns (Arnold, Brown-Schmidt, & Trueswell, 2007). For example, Arnold and colleagues (2007) found that adults combined grammatical gender cues with pragmatic reliance on order-of-mention cues to resolve ambiguous pronouns and accurately identify a speaker’s intended referent, but 3- to 5-year-old children failed to take order-of-mention cues into account. Children’s relative difficulty in integrating surrounding contextual information could derail rapid, accurate simulation of speakers’ upcoming productions.

On the other hand, there are reasons to suspect that children may be capable of generating predictions via simulation. A limited number of prior findings suggest that children can, under some circumstances, incorporate information about the speaker and the surrounding visual and linguistic context when generating predictions. For example, children can generate accurate predictions on the basis of a speaker’s disfluencies (Kidd, White, & Aslin, 2011), as well as the speaker’s identity (Borovsky & Creel, 2014). Thus, while Pickering and Garrod (2013) speculate that children may be limited to the association route, some results suggest that children (like adults) could use both association and simulation routes for prediction. Additionally, as a general point, children have shown the ability to predict in dozens of prior studies on language processing, suggesting broad prowess in using diverse cues to rapidly interpret referential contexts (Borovsky, Elman, & Fernald, 2012; Borovsky & Creel, 2014; Fernald, Zangl, Portillo, & Marchman, 2008; Havron, de Carvalho, Fiévet, & Christophe, 2018; Kedar, Casasola, Lust, & Parmet, 2017; Kidd, White, & Aslin, 2011; Lew-Williams, 2017; Lew-Williams & Fernald, 2007; Lukyanenko & Fisher, 2016; Mani & Huettig, 2012; Reuter, Borovsky, & Lew-Williams, 2019; Waxman, 1999; Waxman, Lidz, Braun, & Lavin, 2009; Ylinen, Bosseler, Junttila, & Huotilainen, 2016; Yurovsky, Case, & Frank, 2017). While these studies do not provide evidence for prediction via simulation – specifically, they do not provide evidence that comprehenders take a speaker’s visual perspective into account or simulate their word productions – they also do not contain evidence against prediction via simulation. They do suggest at least some capacity to use whatever linguistic or visual cues are available to efficiently interpret incoming speech. 
It is more likely, however, that children in these studies relied on phonological, morphosyntactic, and/or semantic associations between related words to generate real-time predictions, such as the use of informative verbs to anticipate upcoming nouns (Mani & Huettig, 2012).

Second language (L2) learners provide an interesting test case for understanding the emergence of prediction and the possibility of prediction via simulation more specifically. Much like children acquiring their first language, L2 adults could face difficulties in generating accurate predictions as they navigate referential contexts using their second language. While L2 adults have mature perspective-taking abilities, which are often needed for accurate prediction via simulation, they necessarily have less total experience with their L2, as compared to L1 adults, and may therefore have an overall reduced ability to generate predictions during language processing. Indeed, a number of prior findings indicate that L2 adults’ predictions are attenuated and more variable than those of L1 adults, a pattern described by the Reduced Ability to Generate Expectations (RAGE) hypothesis (Grüter, Lew-Williams, & Fernald, 2012; Grüter, Rohde, & Schafer, 2017; Kaan, 2014; Lew-Williams, 2017; Lew-Williams & Fernald, 2010; Mitsugi & MacWhinney, 2016). For example, Lew-Williams and Fernald (2010) found that L2 Spanish-learning adults did not reliably use the grammatical gender of definite articles (i.e., el and la) to accurately anticipate upcoming referents in an eye-tracking task. However, much like the aforementioned developmental findings, existing evidence for prediction among L2 adults does not definitively determine whether they use association or simulation as a basis for generating those predictions. Although prior investigations for L1 children and L2 adults do not clearly differentiate which predictive mechanisms (i.e., association or simulation) may underlie the observed effects, they do suggest that everyday language experience is important for language learners of any age to develop the abilities necessary for rapid and accurate predictions – whether via association, via simulation, or both.

The present study aimed to unite this previous literature on L1 adults, L1 children, and L2 adults in order to (1) further what is known about the mechanisms supporting prediction during real-time language processing, and (2) examine how differences in listeners’ language experiences shape the diverse ways in which they could generate predictions. To evaluate prediction via simulation, we selected a particularly apt feature of language: spatial deixis, which includes words such as this, that, these, and those. Spatial deixis is a particularly useful test case for two reasons. First, this and that are singular, whereas these and those are plural. Deictic determiners convey both morphosyntactic and lexical semantic cues to number information, which in combination may support prediction, as found in prior research (Lew-Williams, 2017; Lew-Williams & Fernald, 2009; Lukyanenko & Fisher, 2016). Second, this and these typically indicate referents proximal to a speaker, whereas that and those typically indicate distal referents. Although the specific spatial interpretations of deictic determiners are not encoded morphosyntactically and can vary based on the particular referential context (Clark & Sengul, 1978; Diessel, 1999; Fillmore, 1997; see Levinson, 2004 for review), the lexical semantic information encoded in deictic determiners may in fact allow listeners to anticipate proximal or distal referents in real time. Critically, using spatial deixis to predict the referent’s proximity to the speaker would require taking the speaker’s perspective into account (i.e., simulation), because deictic words are anchored on the speaker’s perspective. Whereas using deictic determiners to predict the plurality of the upcoming referent could be achieved via association, using deictic determiners to predict the proximity of the upcoming referent could only be achieved via simulation – taking the speaker’s visual perspective into account and adjusting predictions accordingly.

In two experiments, we evaluated listeners’ comprehension of spatial deixis with three groups of participants: native English-speaking adults, native English-learning 5-year-olds, and non-native English-learning adults. We refer to these groups as L1 adults, L1 children, and L2 adults, respectively, although there was notable heterogeneity within the L2 adult group. Five-year-old children were targeted for several reasons. Although the learning of deictic words may be slow and error-prone, prior evidence suggests that children reliably comprehend and produce deictic terms beginning around 4 years of age (Clark & Sengul, 1978; Tanz, 1980). Relatedly, previous findings indicate that children between the ages of 3 and 6 years old can use morphosyntactic number markings as a basis for predictions (Lew-Williams, 2017; Lew-Williams & Fernald, 2009; Lukyanenko & Fisher, 2016); therefore, 5-year-olds could potentially use deictic determiners’ number markings to anticipate the speaker’s upcoming referent. However, evidence from referential communication tasks indicates that children’s ability to accommodate a speaker’s differing perspective is still developing at this age (Epley, Morewedge, & Keysar, 2004; Nadig & Sedivy, 2002; Nilsen & Graham, 2009). Given that perspective-taking is one critical component of accurate prediction via simulation, it is possible that 5-year-olds may only be able to generate predictions via association, as claimed by Pickering and Garrod (2013). Based on prior developmental research, we expected that L1 children might use number marking cues during real-time language processing (which do not require perspective-taking), but fail to show reliable evidence of using proximity cues (which do require perspective-taking) as a basis for generating predictions.
Relatedly, we examined L2 adults based on Pickering and Garrod’s (2013) claim that comprehenders with less language experience – both L1 children and L2 adults – rely on the simpler association route for generating predictions. Prior findings do indicate more attenuated and variable predictions among L2 adults, as compared to L1 adults (Grüter et al., 2012; see Kaan, 2014 for a review), but these results do not address whether association or simulation might underlie the observed differences. The present study aims to examine nuanced, experience-based differences in comprehenders’ predictive mechanisms (i.e., association vs. simulation). Specifically, by comparing predictive language processing among L1 adults, L1 children, and L2 adults, the present study aims to better understand how different predictive mechanisms may arise from listeners’ varying language experiences.

Two eye-tracking tasks evaluated each group’s comprehension of spatial deixis terms under conditions of referential complexity (Experiment 1) and reduced referential complexity (Experiment 2). Specifically, a cartoon speaker with the opposite perspective of the participants used various deictic determiners to refer to objects. Some referents were closer to the speaker, and some were closer to the participant; some were singular, and some were plural. The main hypothesis was that L1 adults would predict via simulation, taking the speaker’s perspective into account during real-time language processing. In contrast, L1 children and L2 adults – theoretically due to their relative lack of experience comprehending and producing English sentences – may not predict via simulation. Importantly, L1 children and L2 adults may fail to do so for varying reasons: difficulty comprehending spatial deictic terms (Clark & Sengul, 1978; Tanz, 1980), difficulty taking the speaker’s perspective and inhibiting their own, opposite perspective (Clark, 1992; Nilsen & Graham, 2009), or difficulty incorporating the surrounding linguistic context (Snedeker & Trueswell, 2004). More specifically, we predicted that L1 adults would rapidly use morphosyntactic number marking cues and lexical semantic proximity cues to predict the likely referent from the speaker’s perspective, e.g., use this to look at a referent close to the speaker, or use those to look at referents close to themselves. Additionally, we expected that L1 children and L2 adults would successfully form predictions using morphosyntactic number marking cues, as in prior studies (Lew-Williams, 2017; Lew-Williams & Fernald, 2009; Lukyanenko & Fisher, 2016), but show relatively poorer abilities to predict using lexical semantic proximity cues. 
Together, by using spatial deixis as a lens to evaluate listeners’ prediction abilities and by comparing three groups of participants with contrasting histories of language experience, these experiments further what is known about how prediction occurs during real-time processing.

Experiment 1

Method

Participants

Participants were 28 native English-speaking adults (11 male), 28 non-native English-learning adults (13 male), and 28 children (10 male) from monolingual English-speaking households. We refer to these groups as L1 adults, L2 adults, and L1 children, respectively. L1 adults and L2 adults were all members of the Princeton University campus community. L1 adults were 18 to 34 years old (M = 20.54 years, SD = 4.23 years), L2 adults were 18 to 34 years old (M = 23.14 years, SD = 4.97 years), and L1 children were 60 to 71 months old (M = 64.9 months, SD = 3.8 months). L1 adults were significantly younger, on average, than L2 adults (t(52) = −2.11, p = 0.039). However, according to self-report measures, L2 adults had significantly fewer years of English exposure than L1 adults (L1 adults: M = 20.5, SD = 4.2, L2 adults: M = 16.7, SD = 4.9, t(53) = 3.14, p = 0.003). Note that ‘L1’ refers to any individual who learned English from birth, and ‘L2’ refers to any individual who learned English later. L2 adults varied in the age at which they reportedly began learning English (M = 6 years, SD = 3 years, range = 1–12 years). Two L2 adults in Experiment 1 learned English prior to age 3, and therefore may not meet traditional criteria for being an L2 learner, but including versus excluding them did not change the outcome of any statistical analysis. Analyses that exclude these two participants are available in Supplementary Materials on the Open Science Framework (OSF), along with all deidentified data and additional descriptive information about L2 adults.

Given that we were interested in general experiential differences between L1 and L2 adults, and that we did not aim to evaluate a specific L1-L2 pairing, L2 adult participants represented a wide range of native languages: Bulgarian (2), Cantonese (6), German, Hebrew, Indonesian, Italian, Japanese, Kinyarwanda, Korean (4), Mandarin, Modern Greek, Nepalese, Portuguese, Russian (2), Spanish (2), Telugu, and Vietnamese. We tested but excluded one child participant (male, 61 months old) from all analyses due to the caregiver talking during the experiment. The research protocol was approved by the Princeton University Institutional Review Board (IRB record number 7117) and conformed to all guidelines for ethical treatment of participants.

Stimuli

Auditory stimuli were pre-recorded sentences, including instructions, practice trials, test trials, and filler trials. Instructions introduced the computerized narrator of the task, the spatial context of the task, and the goal of the task: “Hi, I’m Sally! I have a computer game for you. We’ll see some things on this big, long table. Some things will be close to me, over here. Other things will be close to you, over there. I’ll name something I see on the table. Try to find it with your eyes as fast as you can.” Two practice trials occurred immediately after instructions and further reinforced the spatial context of the task by providing a direct juxtaposition of two deictic demonstratives: “Look at that happy cow over there. Now look at this happy cow over here. Look at this pretty horse over here. Now look at that pretty horse over there.” Test trials allowed us to evaluate whether participants could exploit deixis to predict an upcoming referent. Each test sentence was composed of a single command (Look at), one of five determiners (the neutral article the, or one of the deictic determiners this, that, these, those), one of two adjectives (beautiful, wonderful), and a singular or plural target noun (baby, doggy, kitty, turtle, apple, cookie, truck, bike). Finally, filler trials included simple, affirmative statements (e.g., “Wow! You’re doing great!”).

A female, native speaker of English recorded auditory stimuli, using child-directed intonation. We used Praat (Boersma & Weenink, 2017) to normalize the duration and intensity of the stimuli, such that each test sentence had a total duration of 2364 ms and a mean intensity of 65 dB. We aimed to assess whether listeners could use deictic determiners (e.g., this) to predict upcoming target nouns (e.g., cookie), so we also used Praat to identify the mean determiner onset (569 ms, range = 420 ms to 710 ms) and mean target noun onset (1639 ms, range = 1429 ms to 1787 ms). Thus, on average, deixis onset occurred 1070 ms before the onset of the target noun.
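The lead time reported above follows directly from the two mean onsets. The short Python sketch below reproduces that arithmetic and also estimates the earliest moment, relative to noun onset, at which a look could plausibly be driven by the determiner. The 200 ms saccade latency is an approximation assumed here for illustration, and the function and constant names are hypothetical rather than taken from the authors' materials.

```python
# Mean onsets reported above, in ms from sentence onset.
DETERMINER_ONSET_MS = 569
NOUN_ONSET_MS = 1639
# Assumed approximate time needed to launch an eye movement.
SACCADE_LATENCY_MS = 200


def lead_time(determiner_onset: int, noun_onset: int) -> int:
    """Interval between determiner onset and noun onset."""
    return noun_onset - determiner_onset


def earliest_cue_driven_look(determiner_onset: int, noun_onset: int,
                             latency: int = SACCADE_LATENCY_MS) -> int:
    """Earliest time (relative to noun onset) at which a look could
    reflect the determiner, given the assumed saccade latency."""
    return determiner_onset + latency - noun_onset


lead = lead_time(DETERMINER_ONSET_MS, NOUN_ONSET_MS)
earliest = earliest_cue_driven_look(DETERMINER_ONSET_MS, NOUN_ONSET_MS)
print(lead, earliest)
```

Under these assumptions, listeners have roughly a second of informative signal before the noun begins, most of which remains usable after allowing for saccade planning.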

Visual stimuli were a subset of images from a prior eye-tracking study (Lukyanenko & Fisher, 2016). Images included singular and plural versions of each target noun (e.g., one cookie, two cookies). The versions of the images were matched in size (for details on this approach, see Lukyanenko & Fisher, 2016). Visual stimuli also included an image of a female cartoon speaker. The speaker was positioned behind an image of a table that included depth perspective cues. Specifically, the table was wider at the bottom and narrower at the top of the image. The image of the speaker and table served as a backdrop for the four referent images (Figure 1).

Figure 1. Sample test trial for Experiment 1. During each test trial, participants heard a sentence referring to one of the four images (e.g., Look at the/that wonderful cookie).

During each test trial, four referents appeared. Two of the referents were plural, and two were singular. Two of the referents were proximal to the speaker, and two were distal to the speaker. Each trial included a plural referent and a singular referent proximal to the speaker, and a plural referent and a singular referent distal to the speaker, such that plurality and proximity were not conflated. Referents were visible for 2 seconds prior to the onset of the auditory stimuli.

Trials appeared in one of four quasi-randomized orders. Each order included instruction sentences, two practice trials, 32 test trials (16 with a deictic sentence and 16 with a neutral sentence), and four filler trials. Filler trials occurred every eight trials. Target side (left, right of the speaker), target plurality (singular, plural), and target proximity (proximal, distal to the speaker) were counterbalanced for each target noun. Target side, target plurality, and target proximity did not repeat for more than four consecutive trials. Visual stimuli, auditory stimuli, and experimental designs are available on the Open Science Framework (OSF).
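One way to picture the ordering constraint is as a shuffle that is repeated until no factor runs too long. The Python sketch below is an illustration only: the paper does not describe how the orders were actually generated, so the rejection-sampling approach, the function names, and the simplification of ignoring target nouns, sentence type, and filler placement are all assumptions.

```python
import random
from itertools import product

SIDES = ("left", "right")
PLURALITY = ("singular", "plural")
PROXIMITY = ("proximal", "distal")


def max_run(values):
    """Length of the longest run of identical consecutive values."""
    longest = run = 0
    previous = object()  # sentinel that never equals a real value
    for value in values:
        run = run + 1 if value == previous else 1
        previous = value
        longest = max(longest, run)
    return longest


def is_valid(order, limit=4):
    """True if no factor repeats for more than `limit` consecutive trials."""
    return all(max_run([trial[i] for trial in order]) <= limit
               for i in range(3))


def make_order(repeats=4, limit=4, seed=0):
    """Shuffle the fully crossed design until the run-length limit holds.

    8 cells (side x plurality x proximity) x 4 repeats = 32 test trials.
    """
    rng = random.Random(seed)
    trials = list(product(SIDES, PLURALITY, PROXIMITY)) * repeats
    while True:
        rng.shuffle(trials)
        if is_valid(trials, limit):
            return list(trials)


order = make_order()
print(order[:4])
```

Because every cell of the crossed design appears equally often, side, plurality, and proximity stay balanced regardless of which shuffle is accepted.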

Procedure

The experiment took place at the Princeton Baby Lab, in a sound-attenuated study room. The experimenter sat opposite the participant. Participants sat in a chair, approximately 60 cm from the eye-tracker. Child participants sat in a booster seat. In order for the eye-tracker to measure eye movements, participants wore a small target sticker on their face. The experimenter used EyeLink Experiment Builder software (SR Research, Mississauga, Ontario, Canada) and controlled the task from a Mac host computer. The experimenter first calibrated the eye-tracker for each participant using a standard 5-point calibration procedure. Participants listened to pre-recorded task instructions, then completed the eye-tracking task. Throughout the task, participants viewed stimuli on a 17-inch LCD monitor. An EyeLink 1000 Plus remote eye-tracker, sampling at a rate of 500 Hz, recorded participants’ eye movements. The total duration of the eye-tracking task, including calibration, was approximately 4 minutes. Immediately following the eye-tracking task, L2 adults completed a questionnaire about their language background (see Appendix) in order to explore possible relations between prediction abilities and self-reported language proficiency.

Results

During the experiment, the eye-tracker automatically recorded participants’ fixations every 2 ms (500 Hz). We analyzed samples recorded within a 400×200 pixel area surrounding each visual referent and eliminated any samples that were outside of these visual areas of interest (2,973,231 of 8,420,660 samples, 35%) prior to aggregating data within 100-ms time-bins. Further inspection of the data indicated high quality: We evaluated the number of time-bins that each participant contributed per trial, and found that, on average, participants provided data for 77% of time-bins (M = 12.5 bins, SD = 4.43 bins). If a participant did not contribute any data for a trial, then that trial was necessarily excluded from analyses. However, missing trials were rare. Participants contributed data for the vast majority of trials (M = 31 trials, SD = 1.17 trials, range = 26–32 trials), contributing data for 2638 of 2688 total trials (98%). That is, only 2% of trials were necessarily excluded due to missing data. We used R software (version 3.6.0) for all analyses. Deidentified data and R code for reproducible analyses are available on the Open Science Framework (OSF).
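The preprocessing described above (dropping samples outside the areas of interest, then aggregating into 100-ms bins) can be sketched as follows. This is a Python illustration rather than the authors' R code: the function names, the data layout, and the example screen coordinates are hypothetical, and the areas of interest are assumed to be rectangles centered on each referent.

```python
from collections import defaultdict

AOI_WIDTH, AOI_HEIGHT = 400, 200  # pixels, as reported
BIN_MS = 100


def in_aoi(x, y, center):
    """True if a gaze sample falls inside the rectangle around `center`."""
    cx, cy = center
    return abs(x - cx) <= AOI_WIDTH / 2 and abs(y - cy) <= AOI_HEIGHT / 2


def bin_samples(samples, aoi_centers, target):
    """Aggregate gaze samples into 100-ms bins.

    samples: iterable of (time_ms, x, y), with time relative to noun onset.
    Returns {bin_start_ms: proportion of in-AOI samples on `target`}.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [on target, all in-AOI]
    for t, x, y in samples:
        hit = next((name for name, center in aoi_centers.items()
                    if in_aoi(x, y, center)), None)
        if hit is None:
            continue  # sample outside every area of interest is dropped
        bin_start = (t // BIN_MS) * BIN_MS  # floor also works for negatives
        counts[bin_start][1] += 1
        counts[bin_start][0] += hit == target
    return {b: on / total for b, (on, total) in sorted(counts.items())}


# Hypothetical two-referent layout and six samples spaced 2 ms apart.
centers = {"a": (200, 200), "b": (800, 200)}
samples = [(-100, 200, 200), (-98, 210, 190), (-96, 800, 200),
           (-94, 500, 500), (0, 800, 250), (2, 200, 200)]
binned = bin_samples(samples, centers, target="a")
print(binned)
```

A bin with no in-AOI samples simply does not appear in the output, which mirrors how bins without usable data contribute nothing to the analysis.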

We completed three main analyses to evaluate listeners’ task performance. First, we analyzed participants’ looking behaviors during practice trials. If participants understood the spatial context of the task, then we expected them to reliably identify the target referent during practice trials. Next, we evaluated participants’ looking behaviors during singular and plural deictic test sentences (comparing this/that to these/those). We then evaluated participants’ looking behaviors during proximal and distal deictic test sentences (comparing this/these to that/those).¹ If participants use the plurality and proximity information conveyed by deictic determiners to predict the speaker’s referent, then we expected to observe significant differences in looking behaviors before the onset of the target noun, indicating that participants launched anticipatory eye movements in response to the deictic determiners.

We used mixed-effects logistic regression models and cluster-based permutation analyses, detailed below, to evaluate participants’ looking behaviors during deictic test sentences. Mixed-effects models are commonly used to analyze eye-tracking data (for reviews see: Barr, 2008; Barr, Levy, Scheepers, & Tily, 2013; Huettig, Rommers, & Meyer, 2011), and simultaneously account for fixed effects (i.e., variance attributable to the experimental conditions) and random effects (i.e., variance attributable to particular subjects and items). However, mixed-effects model results (namely, effects of time as a factor) must be interpreted cautiously because data from sequential time points are not independent. For example, where a participant is looking at 100 ms is largely dependent on where they were looking at 0 ms. Cluster-based permutation analyses can address this limitation and provide a useful analytical follow-up for mixed-effects models. These nonparametric analyses are commonly used for analyzing neurophysiological time course data (see Maris & Oostenveld, 2007 for a review), and are becoming increasingly common for analyzing eye-tracking data as well (Borovsky, 2017; Borovsky et al., 2015; Borovsky et al., 2016; Chan et al., 2018; Dautriche, Swingley, & Christophe, 2015; Hahn, Snedeker, & Rabagliati, 2015; Oakes et al., 2013; Reuter et al., 2021; Von Holzen & Mani, 2012; Wittenberg, Khan, & Snedeker, 2017). Together, mixed-effects models and cluster-based permutation analyses aid in evaluating whether there may be differences in the time course of prediction among L1 adults, L1 children, and L2 adults.
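To make the logic of a cluster-based permutation analysis concrete, the Python sketch below implements one common within-subject variant: compute a t statistic per time bin on subject-level difference scores, group adjacent supra-threshold bins into clusters, and compare each cluster's summed t value against a null distribution built by randomly flipping the sign of each subject's difference scores. This is a simplified illustration, not the authors' R analysis; the threshold values, the sign-flipping scheme, the function names, and the synthetic data are all assumptions.

```python
import math
import random


def t_values(diffs):
    """One-sample t statistic per time bin for subject-level differences."""
    n = len(diffs)
    out = []
    for b in range(len(diffs[0])):
        col = [row[b] for row in diffs]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        out.append(mean / math.sqrt(var / n) if var > 0 else 0.0)
    return out


def clusters(ts, threshold):
    """Group adjacent supra-threshold bins of the same sign.

    Returns a list of (first_bin, last_bin, summed_t).
    """
    found, start = [], None
    for i, t in enumerate(ts + [0.0]):  # sentinel closes a trailing cluster
        active = abs(t) > threshold
        if start is not None and (not active or (t > 0) != (ts[start] > 0)):
            found.append((start, i - 1, sum(ts[start:i])))
            start = None
        if active and start is None:
            start = i
    return found


def max_mass(ts, threshold):
    """Largest absolute cluster mass in one set of t values."""
    return max((abs(m) for _, _, m in clusters(ts, threshold)), default=0.0)


def cluster_permutation_test(diffs, threshold=2.05, n_perm=1000, seed=1):
    """Return (first_bin, last_bin, mass, p) for each observed cluster.

    threshold=2.05 approximates the two-tailed critical t for df = 27.
    """
    rng = random.Random(seed)
    observed = clusters(t_values(diffs), threshold)
    null = []
    for _ in range(n_perm):
        flipped = []
        for row in diffs:
            sign = rng.choice((-1, 1))  # swap condition labels for a subject
            flipped.append([sign * v for v in row])
        null.append(max_mass(t_values(flipped), threshold))
    return [(s, e, m, sum(x >= abs(m) for x in null) / n_perm)
            for s, e, m in observed]


# Synthetic example: 12 subjects, 8 bins, an effect only in bins 3 to 5.
data = [[((7 * s + 3 * b) % 5 - 2) * 0.02 + (0.3 if 3 <= b <= 5 else 0.0)
         for b in range(8)] for s in range(12)]
results = cluster_permutation_test(data, threshold=2.2)
print(results)
```

Because significance is assessed at the level of whole clusters rather than individual bins, the dependence between neighboring time points does not inflate the number of tests.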

Practice Trials

We first confirmed that participants understood the spatial context of the eye-tracking task by analyzing their looking behavior during practice trials. Practice trials used deixis contrastively and emphasized the proximity information encoded in deictic terms by pairing the proximal and distal deictic terms with “over here” and “over there” respectively. We analyzed participants’ proportion of target looks during a time window from 200 ms after the exact onset of the deictic determiner to 2000 ms after the exact onset of the target noun using one-tailed one-sample t-tests to compare target looks to chance performance (50%). We found that all groups reliably looked to the target referent during practice trials (L1 adults: t(27) = 11.35, p < 0.001, Cohen’s d = 2.14; L1 children: t(27) = 3.75, p < 0.001, Cohen’s d = 0.71; L2 adults: t(27) = 9.27, p < 0.001, Cohen’s d = 1.75). These findings indicate that participants reliably comprehended deictic terms when used contrastively. Importantly, they also indicate that participants understood the spatial context of the eye-tracking task, which imitated a 3-dimensional conversational setting with 2-dimensional images, with some objects proximal to the speaker and some distal to the speaker.
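For a one-sample test, Cohen's d and the t statistic are linked by d = t / sqrt(n), so the reported effect sizes can be recovered from the t values and the group size (n = 28, given 27 degrees of freedom). The Python sketch below shows that relationship along with a plain one-sample t statistic against chance; the function names are hypothetical and this is not the authors' analysis code.

```python
import math


def one_sample_t(values, mu=0.5):
    """t statistic comparing the mean of `values` to chance level `mu`."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return (mean - mu) / (sd / math.sqrt(n))


def cohens_d_from_t(t, n):
    """One-sample Cohen's d recovered from a t statistic: d = t / sqrt(n)."""
    return t / math.sqrt(n)


# t values reported above for the practice trials (df = 27, so n = 28).
reported_t = {"L1 adults": 11.35, "L1 children": 3.75, "L2 adults": 9.27}
effect_sizes = {group: round(cohens_d_from_t(t, 28), 2)
                for group, t in reported_t.items()}
print(effect_sizes)
```

The recovered values match the effect sizes reported for all three groups, which is a quick internal consistency check on the statistics.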

Deictic Test Trials

In order to assess how participants used deictic determiners (e.g., this) to predict upcoming nouns (e.g., cookie), we analyzed participants’ looks to referents during a time window from 1000 ms before target noun onset to 500 ms after noun onset. If participants use deictic determiners to predict the speaker’s upcoming referent, we expected that effects would emerge at some point during this time. Importantly, if participants use deictic determiners to anticipate the plurality and proximity of the upcoming referent, then we should expect to observe the emergence of effects before the onset of the target noun (0 ms), with consideration in our analyses of the time it takes (approximately 200 ms) to initiate an eye movement (Matin, Shao, & Boff, 1993). Although it is possible that later effects could also reflect prediction (i.e., listeners could be more efficient in comprehending the target noun if it is preceded by an informative deictic determiner), we defined prediction as eye movements initiated prior to noun onset, as is common in prior research (e.g., Kidd, White, & Aslin, 2011; Mani & Huettig, 2012; Reuter et al., 2019). Our figures that summarize results by group and by experiment (Figure 2, Figure 3, and Figure 5) reveal how participants’ looking behaviors changed over time, with 0 representing the exact onset of the target noun. Time measures are not offset by 200 ms to account for the time it takes to launch an eye movement.

Figure 2. Results from Experiment 1. Proportion of looks to plural referents for L1 adults (n = 28), L1 children (n = 28), and L2 adults (n = 28) during plural deictic sentences (blue) and singular deictic sentences (purple). Line shading represents one standard error from the mean, averaged by subjects. Vertical dashed lines indicate noun onset. Area shading indicates significant effects from cluster-based permutation analyses (ps < 0.05). Results indicate that L1 adults, L1 children, and L2 adults used the plurality of deictic determiners to predict the plurality of the upcoming referent, as evidenced by anticipatory eye movements generated before the onset of the number-marked noun.

Figure 3: — Results from Experiment 1. Proportion of looks to proximal referents for L1 adults (n = 28), L1 children (n = 28), and L2 adults (n = 28) during proximal deictic sentences (blue) and distal deictic sentences (purple). Line shading represents one standard error from the mean, averaged by subjects. Vertical dashed lines indicate noun onset. Area shading indicates significant effects from cluster-based permutation analyses (ps < 0.05). Results indicate that only L1 adults used the proximity information encoded in deictic determiners to predict the proximity of the upcoming referent.

Figure 5: — Results from Experiment 2. Proportion of looks to proximal referents for L1 adults (n = 28), L1 children (n = 28), and L2 adults (n = 28) during proximal deictic sentences (blue) and distal deictic sentences (purple). Line shading represents one standard error from the mean, averaged by subjects. Vertical dashed lines indicate noun onset. Area shading indicates significant effects from cluster-based permutation analyses (ps < 0.05). Results indicate that L1 adults and L2 adults, but not L1 children, used deictic determiners to predict the proximity of the upcoming referent, as evidenced by anticipatory eye movements generated before the onset of the proximal or distal noun.

Deictic Test Trials: Plurality

We first evaluated whether or not participants used deixis to predict the plurality of the referent. We analyzed listeners’ proportion of looks to plural referents for singular deictic sentences (e.g., this/that cookie) and plural deictic sentences (e.g., these/those cookies) with a mixed-effects logistic regression model, using the lme4 package (version 1.1–21, Bates, Maechler, Bolker & Walker, 2015) and the lmerTest package (version 3.1–0, Kuznetsova, Brockhoff, & Christensen, 2017). The model included fixed effects for language group (treatment-coded contrasts: L1 adults, L1 children, L2 adults), condition (treatment-coded contrasts: plural terms = 0, singular terms = 1) and time (100-ms bins, −1000 to 500 ms from noun onset) and their interactions. The model also included random intercepts for subjects and items (Barr, Levy, Scheepers, & Tily, 2013).

As can be seen in Figure 2, model results (with L1 adults as the reference group) revealed a significant interaction of condition and time, indicating that L1 adults increasingly looked to the appropriate plural and singular referents over time (β = −0.96, z = −21.04, p < 0.001). Importantly, results also revealed three-way interactions of language group, condition, and time, indicating that the interaction between condition and time was more attenuated for L1 children than for L1 adults (β = 0.48, z = 7.46, p < 0.001) and likewise more attenuated for L2 adults than for L1 adults (β = 0.29, z = 4.54, p < 0.001). Together, results suggest that L1 adults, L1 children, and L2 adults differed in their patterns of looking behavior during plural and singular deictic sentences (Figure 2).
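To make the treatment-coded interpretation concrete, the short sketch below adds each three-way coefficient to the reference-group (L1 adults) condition-by-time coefficient. Under treatment coding, that sum is the implied condition-by-time term for the non-reference group, on the log-odds scale. The coefficients are those reported above. The derived sums are simple arithmetic shown for illustration only and were not reported by the authors, and the function name `implied_interactions` is hypothetical.

```python
# Coefficients as reported in the text (reference group: L1 adults).
CONDITION_BY_TIME_REF = -0.96                          # condition x time, L1 adults
THREE_WAY = {"L1 children": 0.48, "L2 adults": 0.29}   # group x condition x time

def implied_interactions(ref_beta, adjustments):
    """Under treatment coding, a non-reference group's condition x time term
    equals the reference term plus that group's three-way coefficient."""
    implied = {"L1 adults": ref_beta}
    for group, delta in adjustments.items():
        implied[group] = round(ref_beta + delta, 2)
    return implied

implied = implied_interactions(CONDITION_BY_TIME_REF, THREE_WAY)
print(implied)
```

A positive three-way coefficient moves the implied term toward zero, which is what "more attenuated" means here: the term keeps the same sign as for L1 adults but is smaller in magnitude.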

Figure 2 also conveys results from cluster-based permutation analyses (Maris & Oostenveld, 2007; Wittenberg, Khan, & Snedeker, 2017). For these analyses, we calculated participants’ mean proportion of looks to plural referents within each 100-ms time bin and performed a log-odds transformation on these proportions (Barr, 2008). Next, for each 100-ms time bin, we conducted a linear regression analysis on the log-odds of looking to plural referents. We identified clusters of time bins, defined as 2 or more adjacent time bins with t-values greater than 1.6 (a somewhat conservative value that has been used in prior eye-tracking research; Wittenberg, Khan, & Snedeker, 2017), and summed t-values within each cluster. We then permuted the data to create the null distribution: We randomly shuffled condition labels 1000 times for each time bin, sampling across all time bins, and repeated the cluster-finding procedure and summation of t-values with these permuted data. Finally, we calculated the p-value for each cluster, defined as the proportion of permuted cluster t-values that were greater than the observed cluster t-value. Findings revealed significant clusters for L1 adults (−500 to 500 ms, cluster t = 91.82, p < 0.001), L1 children (−400 to 500 ms, cluster t = 46.37, p < 0.001), and L2 adults (−800 to 500 ms, cluster t = 77.20, p < 0.001). The observed differences in participants’ looking behavior given plural versus singular deictic sentences, and critically the emergence of the effect prior to the onset of the number-marked noun, suggest that all groups used the morphosyntactic number marking of deictic determiners to anticipate the plurality of upcoming referents (Figure 2).
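The cluster-based permutation procedure can be sketched as follows. This is a simplified illustration under several stated assumptions and is not the authors' implementation: a paired t-statistic on per-subject condition differences stands in for the per-bin linear regression; condition labels are shuffled within subjects by flipping the sign of each subject's difference series; only the largest cluster is tested; and proportions are clamped by a hypothetical `eps` before the log-odds transformation. All function names are hypothetical. The test is one-sided, so differences should be coded so that the expected effect is positive.

```python
import math
import random
from statistics import mean, stdev

T_THRESHOLD = 1.6   # bin-level threshold reported in the text
MIN_BINS = 2        # a cluster needs at least 2 adjacent bins

def log_odds(p, eps=0.01):
    """Log-odds of a proportion, clamped away from 0 and 1 (eps is an assumption)."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def log_odds_differences(props_a, props_b):
    """Per-subject, per-bin difference in log-odds between two conditions."""
    return [[log_odds(a) - log_odds(b) for a, b in zip(sa, sb)]
            for sa, sb in zip(props_a, props_b)]

def paired_t(diffs):
    """One-sample t-statistic on per-subject condition differences."""
    sd = stdev(diffs)
    if sd == 0:
        return 0.0  # degenerate bin: treat as no evidence
    return mean(diffs) / (sd / math.sqrt(len(diffs)))

def bin_t_values(diff_by_subject):
    """diff_by_subject[s][b] -> list of t-values, one per time bin."""
    n_bins = len(diff_by_subject[0])
    return [paired_t([subj[b] for subj in diff_by_subject]) for b in range(n_bins)]

def largest_cluster(t_values):
    """Largest summed t over runs of >= MIN_BINS adjacent bins above threshold.
    Returns (summed_t, first_bin, last_bin), or (0.0, None, None) if none."""
    best = (0.0, None, None)
    run_start, run_sum = None, 0.0
    # A trailing sentinel closes any run that reaches the final bin.
    for i, t in enumerate(list(t_values) + [float("-inf")]):
        if t > T_THRESHOLD:
            if run_start is None:
                run_start, run_sum = i, 0.0
            run_sum += t
        elif run_start is not None:
            if i - run_start >= MIN_BINS and run_sum > best[0]:
                best = (run_sum, run_start, i - 1)
            run_start = None
    return best

def cluster_permutation_test(diff_by_subject, n_perm=1000, seed=0):
    """Return (observed_cluster, p), where p is the proportion of permuted
    cluster sums that exceed the observed cluster sum."""
    rng = random.Random(seed)
    observed = largest_cluster(bin_t_values(diff_by_subject))
    exceed = 0
    for _ in range(n_perm):
        # Swapping condition labels within a subject flips the sign of the difference.
        flipped = []
        for subj in diff_by_subject:
            sign = rng.choice((1, -1))
            flipped.append([sign * d for d in subj])
        if largest_cluster(bin_t_values(flipped))[0] > observed[0]:
            exceed += 1
    return observed, exceed / n_perm
```

In this sketch, a cluster's first bin index maps back to a time relative to noun onset through the bin width, which is how an onset such as −500 ms would be read off.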

Deictic Test Trials: Proximity

We next evaluated whether participants used deixis to predict the proximity of the referent, using a mixed-effects logistic regression model and cluster-based permutation analyses, repeating the above plurality analyses. Figure 3 illustrates results for proximity analyses. The regression model included fixed effects for language group (treatment-coded contrasts: L1 adults, L1 children, L2 adults), condition (treatment-coded contrasts: proximal terms = 0, distal terms = 1) and time (100-ms bins, −1000 to 500 ms from noun onset) as well as their interactions, and included random intercepts for subjects and items. As can be seen in Figure 3, model results (with L1 adults as the reference group) again revealed a significant interaction of condition and time for L1 adults, indicating that they increasingly looked to the appropriate proximal and distal referents over time (β = −0.82, z = −18.27, p < 0.001). Importantly, as illustrated by Figure 3, results again revealed three-way interactions of language group, condition, and time, indicating that the interaction between condition and time for L1 children was more attenuated than that of L1 adults (β = 0.53, z = 8.42, p < 0.001) and that the interaction effect for L2 adults was likewise more attenuated than that of L1 adults (β = 0.14, z = 2.19, p = 0.029). Moreover, findings from the cluster-based permutation analyses revealed significant clusters for L1 adults (−100 to 500 ms, cluster t = 50.68, p < 0.001), L1 children (300 to 500 ms, cluster t = 14.45, p < 0.001), and L2 adults (200 to 500 ms, cluster t = 32.09, p < 0.001).

The results summarized by Figure 3 collectively suggest that groups varied in using the proximity information of deictic determiners to predict the spatial location of the upcoming target referent. Whereas L1 adults’ condition-based differences in looking behavior diverged before noun onset, indicating anticipatory eye movements to the appropriate referents, L1 children and L2 adults’ looking behavior only diverged after noun onset, suggesting that they did not reliably use proximity information conveyed by the deictic determiner to anticipate the spatial location of the upcoming referent (Figure 3). The effect for L2 adults begins at approximately 200 ms following noun onset, and therefore likely reflects processing of the target noun itself, given that saccades are estimated to take 200 ms to initiate (Matin, Shao, & Boff, 1993). Together, this pattern of results suggests L1 adults quickly and accurately exploited proximity information as a basis for their predictions, whereas L1 children and L2 adults may not have generated predictions based on proximity at all, or may have done so slowly, inconsistently, or inaccurately (Figure 3).
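The reasoning about cluster onsets can be expressed as a small classification sketch. The onset values are the proximity cluster onsets reported above. The category labels and the function name `classify_onset` are assumptions introduced for this example. In particular, the middle category (after noun onset but within the approximately 200-ms saccade latency) is not a category the authors use, since prediction is defined here as effects emerging before noun onset.

```python
SACCADE_LATENCY_MS = 200  # approximate time to launch an eye movement

# Cluster onsets (ms from noun onset) for the proximity analysis, as reported above.
PROXIMITY_CLUSTER_ONSETS = {"L1 adults": -100, "L1 children": 300, "L2 adults": 200}

def classify_onset(onset_ms, latency_ms=SACCADE_LATENCY_MS):
    """Label a cluster onset relative to noun onset (0 ms)."""
    if onset_ms < 0:
        return "before noun onset"        # anticipatory by the study's definition
    if onset_ms < latency_ms:
        return "within saccade latency"   # hypothetical intermediate category
    return "consistent with noun processing"

labels = {group: classify_onset(onset)
          for group, onset in PROXIMITY_CLUSTER_ONSETS.items()}
```

With these inputs, only the L1 adult cluster begins before noun onset, and the L2 adult onset at 200 ms falls in the category consistent with processing of the noun itself.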

Neutral Test Trials

We also used cluster-based permutation analyses to evaluate participants’ looking behavior during neutral sentences in the same manner as for deictic sentences. The neutral determiner “the” does not provide information about the plurality or spatial location of the upcoming noun. We therefore expected to observe significant effects after the onset of the noun (0 ms), indicating that participants identified the appropriate referent once it was named. Comparing looks to plural referents for singular and plural neutral sentences, findings confirmed significant clusters for L1 adults (200 to 1000 ms, cluster t = 200.22, p < 0.001), L1 children (400 to 1000 ms, cluster t = 64.50, p < 0.001), and L2 adults (200 to 1000 ms, cluster t = 136.39, p < 0.001). Comparing looks to proximal referents for proximal and distal neutral sentences, findings again confirmed significant clusters for L1 adults (300 to 1000 ms, cluster t = 159.86, p < 0.001), L1 children (200 to 1000 ms, cluster t = 58.27, p < 0.001), and L2 adults (200 to 1000 ms, cluster t = 138.40, p < 0.001). These findings indicate that participants, upon hearing the neutral determiner “the”, oriented to the correct referent after it was named.

Language Questionnaire

Immediately after the eye-tracking task, L2 adults completed a questionnaire which included various questions about their language experience. For example, L2 adults reported the age at which they began learning English (M = 6.4 years, SD = 3.2 years, range = 1 to 12 years), and total years of exposure to English (M = 16.7, SD = 4.9, range = 7 to 25 years). L2 adults also reported their self-assessed proficiency in a number of domains using a scale (1 through 9) with 1 indicating low proficiency and 9 indicating high proficiency (Table 1). According to these self-report measures, L2 adults had a high level of English proficiency.

Table 1.

Experiment 1 descriptive statistics for L2 adults’ self-reported English proficiency measures

Self-Report Measure	Min	Max	Mean	SD
Proficiency in Speaking English	4	9	7.57	1.50
Proficiency in Understanding English	5	9	7.96	1.20
Proficiency in Reading English	6	9	8.21	0.96
Proficiency in Writing English	4	9	7.71	1.46
Accent when Speaking English	1	9	6.39	2.38
Comfort when Speaking English	3	9	7.25	1.80

We conducted exploratory analyses to correlate L2 adults’ looking behaviors during eye tracking with the questionnaire measures. Specifically, we quantified each L2 adult’s prediction measure as a difference score: their proportion of target looks during deictic trials minus their proportion of target looks during neutral trials, computed within a time window from 1000 ms before target noun onset to 200 ms after target noun onset. Participants with larger difference scores were therefore those who were better able to use deictic determiners to rapidly and accurately predict the speaker’s upcoming referent and generate anticipatory eye movements to the corresponding image. The window ends at 200 ms because target looks generated 200 ms or later after the target noun would reflect processing of the target noun itself, as it takes approximately 200 ms to generate a saccade (Matin, Shao, & Boff, 1993). We found that L2 adults’ prediction measures were not significantly correlated with the age at which they began learning English (r(26) = 0.23, p = 0.231), their total years of learning English (r(26) = −0.22, p = 0.255), their total years of English classes (r(26) = 0.14, p = 0.469), or their self-reported proficiency in understanding English (r(26) = 0.003, p = 0.778). Additional descriptive results from the language questionnaire are included in Supplementary Materials on the Open Science Framework (OSF).
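A minimal sketch of the difference score and the correlation follows. The function names are hypothetical, and the code assumes one deictic proportion, one neutral proportion, and one questionnaire value per participant. It computes Pearson's r and the corresponding t-statistic with n − 2 degrees of freedom, which is where the 26 in r(26) comes from for 28 participants. It does not compute p-values, since the Python standard library has no t distribution.

```python
import math

def difference_score(deictic_prop, neutral_prop):
    """Prediction measure: target looks on deictic trials minus neutral trials."""
    return deictic_prop - neutral_prop

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_to_t(r, n):
    """Convert r to a t-statistic; returns (t, degrees_of_freedom)."""
    df = n - 2
    return r * math.sqrt(df / (1 - r * r)), df
```

For example, `r_to_t(0.23, 28)` returns a t-statistic on 26 degrees of freedom, matching the r(26) notation used above.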

Discussion

Experiment 1 results suggest that only adults listening to their first language may be capable of prediction via simulation – taking the speaker’s perspective into account to rapidly and accurately simulate the speaker’s upcoming production (Pickering & Garrod, 2013). Specifically, findings indicate that only L1 adults used deictic determiners to predict the proximity of the speaker’s referent. That is, they were more likely to look towards a proximal referent when they heard “Look at this/these…” as compared to when they heard “Look at that/those…”, indicating consideration of the speaker’s perspective (opposite from their own) to predict the spatial location of the referent. Importantly, the proximity effect emerged before L1 adults could have processed the target noun (e.g., cookie). That is, significant clusters emerged before 0 ms, indicating that L1 adults used the proximity information of deictic determiners to look towards proximal/distal referents before the target was identified. In contrast, significant effects for L1 children and L2 adults only emerged after the target was identified. L1 adults therefore showed robust evidence for prediction, while L1 children and L2 adults did not. Importantly, this conclusion is not based on the fact that significant clusters simply emerged earlier for L1 adults than for L1 children and L2 adults. Rather, it is based on the presence vs. absence of significant clusters before noun onset. L1 children’s and L2 adults’ more attenuated condition effects (i.e., delayed or shallow differences between proximal and distal conditions) might reflect predictions which are absent, inefficient, inconsistent, or inaccurate.

The findings of Experiment 1 are in line with the view that language experience shapes how comprehenders generate predictions during language processing (Pickering & Garrod, 2013). Importantly, all groups were capable of navigating some of the referential complexities of the task; L1 adults, L1 children, and L2 adults all performed above chance during practice trials. Moreover, all groups used the morphosyntactic number markings of spatial deictic terms to predict the plurality of the speaker’s upcoming referent, with significant condition effects emerging before the onset of the target noun (0 ms) for all three groups. This converges with prior results indicating that children can use morphosyntactic number marking cues (i.e., is vs. are) to anticipate the plurality of upcoming referents (Lukyanenko & Fisher, 2016). However, only L1 adults were able to use the lexical semantic proximity information of deictic determiners to anticipate the proximity of the referent.

Why did L1 children and L2 adults fail to show evidence for use of the spatial information conveyed by deictic determiners? Several linguistic and/or cognitive processes might explain the observed pattern of results. One possibility is that L1 children and L2 adults focused only on plurality cues (either inadvertently or deliberately) as a basis for generating predictions and failed to consider proximity cues. Presumably, proximity cues are more semantically ambiguous: Whereas this consistently refers to a singular referent, the spatial location of this changes based on the conversational context (i.e., the speaker’s visual perspective). L1 children and L2 adults may have relied on a relatively less ambiguous cue to meaning – plurality – to generate predictions. In line with this view, L2 adults’ results suggest they very rapidly distinguished plural and singular deictic terms, with a significant effect emerging 300 ms before that of L1 adults. Thus, L1 children and L2 adults may have identified morphosyntactic number markings as an unambiguous predictive cue and failed to consider more semantically ambiguous proximity cues as an additional way to narrow the scope of reference.

A related explanation for the observed pattern of results is that L1 children and L2 adults had difficulty rapidly integrating the two cues to meaning conveyed by deictic terms during real-time processing. That is, L1 children and L2 adults may be capable of using plurality cues or proximity cues to identify the speaker’s upcoming referent, but may not be able to rapidly and accurately combine these two sources of information to generate predictions. Integrating these two cues to meaning may have taxed L1 children’s and L2 adults’ cognitive resources, such as working memory, such that they were unable to meet these task demands (Ito, Corley, & Pickering, 2018). The pattern of results among practice trials and test trials is consistent with this view: When deixis was used contrastively in practice trials, listeners did not need to contend with additional morphosyntactic plurality information, because referents in practice trials were all singular.

In Experiment 2, we aimed to reduce the task demands to determine whether a simplified referential context would facilitate L1 children’s and L2 adults’ use of deictic determiners to predict the proximity of the speaker’s upcoming referent. To do so, we changed the task design so that visual stimuli in Experiment 2 included only two visual referents per test trial, rather than four. Importantly, the two referents were always matched in plurality. If L1 children and L2 adults failed to predict the proximity of referents due to difficulties integrating plurality and proximity information, then this experimental design should facilitate their ability to use deixis to anticipate the spatial location of the speaker’s upcoming referent.