PLOS Computational Biology. 2024 Sep 25;20(9):e1012117. doi: 10.1371/journal.pcbi.1012117

Language models outperform cloze predictability in a cognitive model of reading

Adrielli Tina Lopes Rego 1,*, Joshua Snell 2, Martijn Meeter 1
Editor: Ronald van den Berg
PMCID: PMC11458034  PMID: 39321153

Abstract

Although word predictability is commonly considered an important factor in reading, sophisticated accounts of predictability in theories of reading are lacking. Computational models of reading traditionally use cloze norming as a proxy of word predictability, but what cloze norms precisely capture remains unclear. This study investigates whether large language models (LLMs) can fill this gap. Contextual predictions are implemented via a novel parallel-graded mechanism, where all predicted words at a given position are pre-activated as a function of contextual certainty, which varies dynamically as text processing unfolds. Through reading simulations with OB1-reader, a cognitive model of word recognition and eye-movement control in reading, we compare the model’s fit to eye-movement data when using predictability values derived from a cloze task against those derived from LLMs (GPT-2 and LLaMA). Root Mean Square Error between simulated and human eye movements indicates that LLM predictability provides a better fit than cloze. This is the first study to use LLMs to augment a cognitive model of reading with higher-order language processing while proposing a mechanism on the interplay between word predictability and eye movements.

Author summary

Reading comprehension is a crucial skill that is highly predictive of later success in education. One aspect of efficient reading is our ability to predict what is coming next in the text based on the current context. Although we know predictions take place during reading, the mechanism through which contextual facilitation affects oculomotor behaviour in reading is not yet well-understood. Here, we model this mechanism and test different measures of predictability (computational vs. empirical) by simulating eye movements with a cognitive model of reading. Our results suggest that, when implemented with our novel mechanism, a computational measure of predictability provides better fits to eye movements in reading than a traditional empirical measure. With this model, we scrutinize how predictions about upcoming input affects eye movements in reading, and how computational approaches to measuring predictability may support theory testing. Modelling aspects of reading comprehension and testing them against human behaviour contributes to the effort of advancing theory building in reading research. In the longer term, more understanding of reading comprehension may help improve reading pedagogies, diagnoses and treatments.

Introduction

Humans can read remarkably efficiently. What underlies efficient reading has been the subject of considerable interest in psycholinguistic research. A prominent hypothesis is that we can generally keep up with the rapid pace of language input because language processing is predictive, i.e., as reading unfolds, the reader anticipates some information about the upcoming input [1–3]. Despite general agreement that this is the case, it remains unclear how to best operationalize contextual predictions [3,4]. In current models of reading [5–9], the influence of prior context on word recognition is operationalized using cloze norming, which is the proportion of participants that complete a textual sequence with a given word. However, cloze norming has both theoretical and practical limitations, which are outlined below [4,10,11]. To address these concerns, in the present work we explore the use of Large Language Models (LLMs) as an alternative means to account for contextual predictions in computational models of reading. In the remainder of this section, we discuss the limitations of the current implementation of contextual predictions in models of reading, including the use of cloze norming, as well as the potential benefits of LLM outputs as a proxy of word predictability. We also offer a novel, parsimonious account of how these predictions gradually unfold during text processing.

Computational models of reading are formalized theories about the cognitive mechanisms that may take place during reading. The most prominent are models of eye-movement control in text reading (see [12] for a detailed overview), which attempt to explain how the brain guides the eyes by combining perceptual, oculomotor, and linguistic processes. Despite the success of these models in simulating some word-level effects on reading behaviour, the implementation of contextual influences on the recognition of incoming linguistic input is still largely simplified. Word predictability affects lexical access of the upcoming word by modulating either its recognition threshold (e.g. E-Z Reader [5] and OB1-reader [6]) or its activation (e.g. SWIFT [7]). This process is embedded in the “familiarity check” of the E-Z Reader model and in the “rate of activation” of the SWIFT model. One common assumption among models is that the effect of predictability depends on certain word-processing stages. In the case of the E-Z Reader model, the effect of predictability of word n on its familiarity check depends on the completion of “lexical access” of word n-1. That is, predictability of word n facilitates its processing only if the preceding word has been correctly recognized and integrated into the current sentence representation [13]. In the case of the SWIFT model, the modulation of predictability on the rate of activation of word n depends on whether the processing of word n is in its “parafoveal preprocessing” stage, where activation increases more slowly the higher the predictability, or in its “lexical completion” stage, where activation decreases more slowly the higher the predictability [12]. These models ignore the predictability of words that do not appear in the stimulus text, even though such words may have been predicted at a given text position, and they assume a one-to-one match between the input and the actual text when computing predictability values. Because the models do not provide a deeper account of language processing at the syntactic and semantic levels, they cannot allow predictability to vary dynamically as text processing unfolds. Instead, predictability is computed prior to the simulations and fixed.

What is more, the pre-determined, fixed predictability value in such models is conventionally operationalized with cloze norming [14]. Cloze predictability is obtained by having participants write continuations of an incomplete sequence, and then taking the proportion of participants that have answered a given word as the cloze probability of that word. The assumption is that the participants draw on their individual lexical probability distributions to fill in the blank, and that cloze reflects some overall subjective probability distribution. For example, “house” may be more probable than “place” to complete “I met him at my …” for participant A, but not for participant B. However, scientists have questioned this assumption [4,10,11]. The cloze task is an offline and untimed task, leaving ample room for participants to consciously reflect on sequence completion and adopt strategic decisions [4]. This may be quite different from normal reading, where only ~200ms is spent on each word [15]. Another issue is that cloze cannot provide estimates for low-probability continuations, in contrast with behavioural evidence showing predictability effects of words that never appear among cloze responses, based on other estimators, such as part-of-speech [10,16]. Thus, cloze completions likely do not perfectly match the rapid predictions that are made online as reading unfolds.

Predictability values generated by LLMs may be a suitable methodological alternative to cloze completion probabilities. LLMs are computational models whose task is to assign probabilities to sequences of words [17]. Such models are traditionally trained to accurately predict a token given its contextual sequence, similarly to a cloze task. An important difference, however, is that whereas cloze probability is an average across participants, probabilities derived from LLMs are relative to every other word in the model’s vocabulary. This allows LLMs to better capture the probability of words that rarely or never appear among cloze responses, potentially revealing variation in the lower range [18]. In addition, LLMs may offer a better proxy of semantic and syntactic contextual effects, as they computationally define predictability and how it is learned from experience. The model learns lexical knowledge from the textual data, which can be seen as analogous to the language experience of humans. The meaning of words are determined by the contexts in which they appear (distributional hypothesis [19]) and the consolidated knowledge is used to predict the next lexical item in a sequence [11]. The advantage of language models in estimating predictability is also practical: it has been speculated that millions of samples per context would be needed in a cloze task to reach the precision of language models in reflecting language statistics [20], which is hardly feasible. And even if such an extremely large sample would be reached, we would still need the assumption that cloze-derived predictions match real-time predictions in language comprehension, which is questionable [4,21].

Importantly, language models have been shown to perform as well as, or even outperform, predictability estimates derived from cloze tasks in fitting reading data. Shain and colleagues [20] found robust word predictability effects across six corpora of eye movements using surprisal estimates from various language models, with GPT-2 providing the best fit. The effect was qualitatively similar when using cloze estimates in the corpus for which they were available. Further compelling evidence comes from Hofmann and colleagues [11], who compared cloze completion probabilities with three different language models (an n-gram model, a recurrent neural network, and a topic model) in predicting eye movements. They tested the hypothesis that each language model is more suitable for capturing a different cognitive process in reading, which in turn is reflected in early versus late eye-movement measures. Item-level analyses showed that the correlations of each eye-movement measure were stronger with each language model than with cloze. In addition, fixation-event based analyses revealed that the n-gram model better captured lag effects on early measures (replicating the results from Smith and Levy [10]), while the recurrent neural network more consistently yielded lag effects on late measures. A more recent study [22] found neural evidence for the advantage of language models over cloze, by showing that predictions from LLMs (GPT-3, RoBERTa and ALBERT) matched N400 amplitudes more closely than cloze-derived predictions. Such evidence has led to the belief that language models may be suitable for theory development in models of eye-movement control in reading [11].

The present study marks an important step in exploring the potential of language models in advancing our understanding of the reading brain [23,24], and more specifically, of LLMs’ ability to account for contextual predictions in models of eye-movement control in reading [11]. We investigate whether a model of eye-movement control in reading can more accurately simulate reading behaviour using predictability derived from transformer-based LLMs or from cloze. We hypothesize that LLM-derived probabilities capture semantic and syntactic integration of the previous context, which in turn affects processing of upcoming bottom-up input. This effect is expected to be captured in the early reading measures (see Methods). Since predictability may also reflect semantic and syntactic integration of the predicted word with the previous context [18], late measures are also evaluated.

Importantly however, employing LLM-generated predictions is only one part of the story. A cognitive theory of reading also has to make clear how those predictions operate precisely: i.e., when, where, how, and why do predictions affect processing of upcoming text? The aforementioned models have been agnostic about this. Aiming to fill this gap, our answer, as implemented in the updated OB1-reader model, is as follows.

We propose that making predictions about upcoming words affects their recognition through graded and parallel activation. Predictability is graded because it modulates activation of all words predicted to be at a given position in the parafovea to the extent of each word’s likelihood. This means that higher predictability leads to a stronger activation of all words predicted to be at a given position in the parafovea. Predictability is also parallel, because predictions can be made about multiple text positions simultaneously. Note that this is in line with the parallel structure of the OB1-reader, and this is an important contrasting point with serial processing models, such as E-Z Reader, which assume that words are processed one at a time. The predictability mechanism as proposed here is thus, in principle, not compatible with serial models of word processing. With each processing cycle, this predictability-derived activation is summed to the activity resulting from visual processing of the previous cycle and weighted by the predictability of the previous word, which in turn reflects the prediction certainty up to the current cycle (see Methods for more detailed explanation). In this way, predictability gradually and dynamically affects words in parallel, including non-text words in the model’s lexicon.

Importantly, the account of predictability as predictive activation proposed here diverges from the proportional pre-activation account of predictability by Brothers and Kuperberg [25] in two ways. First, they define predictive pre-activation as the activation of linguistic features (orthographic, syntactic and semantic). However, the evidence is mixed as to whether we predict specific words [10] or more abstract categories [26]. Expectations are likely built about different levels of linguistic representations, but here predictive activation is limited to words, and this activation is roughly proportional to each word’s predictability (thus we agree with the proportionality suggested by Brothers and Kuperberg [25]). Second, in their account, pre-activation occurs prior to the word’s availability in the bottom-up input. We note that predictability without parafoveal preview is debatable. Most studies claim that predictability effects only occur with parafoveal preview [27], but Parker and colleagues [28] showed predictability effects without parafoveal preview using a novel experimental paradigm. In OB1-reader, predictions are made about words within parafoveal preview, which means that bottom-up input is available when predictions are made about words in the parafovea. Since OB1-reader processes multiple words in parallel, predictions are generated about the identity of all words in the parafovea while their orthographic input is being processed and their recognition has not yet been completed.

In sum, the model assumptions regarding predictability include predictability being: (i) graded, i.e. more than one word can be predicted at each text position; (ii) parallel, i.e. predictions can be made about multiple text positions simultaneously; (iii) parafoveal, i.e. predictions are made about the words in the parafovea; (iv) dynamic, i.e. predictability effects change according to the certainty of the predictions previously made; and (v) lexical, i.e. predictions are made about words as defined by text spaces and not about other abstract categories.

Assuming that predictability during reading is graded, parallel, parafoveal, dynamic, and lexical, we hypothesize that OB1-reader achieves a better fit to human oculomotor data with LLM-derived predictions than with cloze-derived predictions. To test this hypothesis, we ran reading simulations with OB1-reader either using LLM-derived predictability or cloze-derived predictability to activate words in the model’s lexicon prior to fixation. The resulting reading measures were compared with measures derived from eye-tracking data to evaluate the model’s fit to human data. To our knowledge, this is the first study to combine a language model with a computational cognitive model of eye-movement control in reading to test whether the output of LLMs is a suitable proxy for word predictability in such models.

Results

For the reading simulations, we used OB1-reader [6] (see Model Description in Methods for more details on this model). In each processing cycle of OB1-reader’s reading simulation, the predictability values were used to activate the predicted words in the upcoming position (see Predictability Implementation in Methods for a detailed description). Each simulation consisted of processing all 55 passages from the Provo Corpus [29] (see Eye-tracking and Cloze Norming in Methods for more details on the materials). The predictability values were derived from three different estimators: cloze, GPT-2 [30] and LLaMA [31]. The cloze values were taken from the Provo Corpus (see Eye-tracking and Cloze Norming in Methods for more details on cloze norming). The LLM values were generated from GPT-2 and LLaMA. We compare the performance of a simpler transformer-based language model, GPT-2, with a more complex one, LLaMA. Both models are transformer-based, auto-regressive LLMs, which differ in size and next-word prediction accuracy, among other aspects. The version of GPT-2 used has 124 million parameters and a 50k-token vocabulary. The version of LLaMA used has a much higher number of parameters, 7 billion, but a smaller vocabulary of 32k tokens. Importantly, LLaMA yields a higher next-word prediction accuracy on the Provo passages, 76% against 64% for GPT-2 (see Language Models in Methods for more details on these models).

We ran 100 simulations per condition in a “3x3 + 1” design: three predictability estimators (cloze, GPT-2 and LLaMA), three predictability weights (low = 0.05, medium = 0.1, and high = 0.2) and a baseline (no predictability). For the analysis, we considered eye-movement measures at the word level. The early eye-movement measures of interest were (i) skipping, i.e. the proportion of participants who skipped the word on first pass; (ii) first fixation duration, i.e. the duration of the first fixation on the word; and (iii) gaze duration, i.e. the sum of fixations on the word before the eyes move forward. The late eye-movement measures of interest were (iv) total reading time, i.e. the sum of fixation durations on the word; and (v) regression, i.e. the proportion of participants who fixated the word after the eyes had already passed the text region in which the word is located.
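For illustration, the condition grid can be expressed as a simple loop; run_simulation below is a hypothetical stand-in for a full OB1-reader run over the 55 passages, not the model’s actual interface.

```python
from itertools import product

ESTIMATORS = ["cloze", "gpt2", "llama"]              # predictability sources
WEIGHTS = {"low": 0.05, "medium": 0.1, "high": 0.2}  # predictability weight values
N_SIMULATIONS = 100                                  # simulations per condition


def run_simulation(estimator, weight):
    """Hypothetical stand-in for one OB1-reader run over all 55 Provo passages."""
    return {"estimator": estimator, "weight": weight}  # placeholder output


def run_design():
    # Baseline: predictive activation switched off.
    results = {"baseline": [run_simulation(None, 0.0) for _ in range(N_SIMULATIONS)]}
    # 3 estimators x 3 predictability weights.
    for estimator, (label, weight) in product(ESTIMATORS, WEIGHTS.items()):
        results[f"{estimator}_{label}"] = [
            run_simulation(estimator, weight) for _ in range(N_SIMULATIONS)
        ]
    return results
```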

To evaluate the model simulations, we used the reading time data from the Provo corpus [29] and computed the Root Mean Squared Error (RMSE) between each eye-movement measure from each simulation by OB1-reader and the corresponding measure from the Provo corpus averaged over participants. To check whether the simulated eye movements across predictability estimators were significantly different (p ≤ .05), we ran the Wilcoxon signed-rank test from the SciPy Python library.
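As a rough sketch of this evaluation step (the exact standardization used in the paper may differ; the arrays below are toy data, and scipy.stats.wilcoxon implements the paired Wilcoxon signed-rank test):

```python
import numpy as np
from scipy.stats import wilcoxon


def standardized_rmse(simulated, human):
    """RMSE between simulated and human word-level values, scaled by the human
    standard deviation so that measures on different scales are comparable.
    This mirrors the paper's standardization only approximately."""
    simulated = np.asarray(simulated, dtype=float)
    human = np.asarray(human, dtype=float)
    rmse = np.sqrt(np.mean((simulated - human) ** 2))
    return rmse / np.std(human)


# Toy comparison: RMSE scores of two conditions across 100 simulations each.
rng = np.random.default_rng(0)
human_gd = rng.normal(250, 50, size=500)   # toy "human" gaze durations (ms)
rmse_cloze = [standardized_rmse(human_gd + rng.normal(30, 20, 500), human_gd)
              for _ in range(100)]
rmse_llama = [standardized_rmse(human_gd + rng.normal(20, 20, 500), human_gd)
              for _ in range(100)]

# Paired Wilcoxon test on the per-simulation RMSE scores (p <= .05 taken as significant).
stat, p = wilcoxon(rmse_cloze, rmse_llama)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p:.3g}")
```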

In addition, we conducted a systematic analysis of the hits and failures of the simulation model across predictability conditions to better understand what drives the differences in model fit between LLM and cloze predictability. The analysis consisted of comparing simulated and empirical eye movements on different word-based features, namely length, frequency, predictability, word id (position in the sentence), and word type (content, function, and other). In particular, word type was defined according to the word’s part-of-speech tag. For instance, verbs, adjectives, and nouns were considered content words, whereas articles, particles, and pronouns were considered function words (see Reading Simulations in Methods for more details).
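A minimal sketch of such a classification, assuming Universal POS tags; the exact tagger and mapping used in the paper may differ.

```python
# Hypothetical mapping from Universal POS tags to the three word-type classes
# used in the error analysis; the paper's exact tagger and mapping may differ.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}
FUNCTION_TAGS = {"DET", "ADP", "PRON", "AUX", "CCONJ", "SCONJ", "PART"}


def word_type(pos_tag):
    """Classify a word as content, function, or other based on its POS tag."""
    if pos_tag in CONTENT_TAGS:
        return "content"
    if pos_tag in FUNCTION_TAGS:
        return "function"
    return "other"


# Example usage with pre-tagged tokens.
tagged = [("the", "DET"), ("watch", "NOUN"), ("has", "AUX"), ("a", "DET"), ("microphone", "NOUN")]
print([(w, word_type(t)) for w, t in tagged])
```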

Fit across eye-movement measures

In line with our hypotheses, OB1-reader simulations were closer to the human eye movements with LLM predictability than with cloze predictability. Fig 1 shows the standardized RMSE of each condition averaged over eye-movement measures and predictability weights. To assess the predictability implementation proposed in OB1-reader, we compared the RMSE scores between the predictability conditions and the baseline. All predictability conditions reduced the error relative to the baseline, which confirms the favourable effect of word predictability on fitting word-level eye-movement measures in OB1-reader. When comparing the RMSE scores across predictability conditions, the larger language model, LLaMA, yielded the least error. When comparing the error among predictability weights (see Fig 2), LLaMA yielded the least error in all weight conditions, while GPT-2 produced less error than cloze only with the low predictability weight. These results suggest that language models, especially those with a higher parameter count and prediction accuracy, are good word predictability estimators for modelling eye movements in reading [32]. Note that the model’s average word recognition accuracy was stable across predictability conditions (cloze = .91; GPT-2 = .92; LLaMA = .93). We now turn to the results for each individual type of eye movement (Fig 3).

Fig 1. Standardized RMSE scores of OB1 Reader simulations for a baseline without using word predictions, for cloze-norm predictions and predictions from the GPT-2 and LLaMA LLMs.


RMSE scores are standardized using the human averages and standard deviations as reference. The minimum RMSE value is 1, meaning no difference between eye movements from the corpus and eye movements from the simulations. Each data point here represents the RMSE score of one simulation averaged over words.

Fig 2. Standardized RMSE scores of OB1-reader simulations per condition and predictability weight.


RMSE scores averaged over eye movement measures. * means p-value ≤ .05, ** means p-value ≤ .01, and *** means p-value ≤ .001.

Fig 3. Standardized RMSE scores of OB1-reader simulations per condition for each eye movement measure.


On the y-axis, eye movement measures are represented with the abbreviations SK (skipping), FFD (first fixation duration), GD (gaze duration), TRT (total reading time), and RG (regression).

First fixation duration

RMSE scores for item-level first fixation duration revealed that predictability from LLaMA yielded the best fit compared to GPT-2 and cloze. LLaMA also yielded the least error in each weight condition (see Fig 4A). When comparing simulated and observed values (S1 Fig), first fixation durations are consistently longer in the model simulations. As for predictability effects (S1 Fig), the relation between predictability and first fixation duration seemed to be weakly facilitatory, with higher predictability leading to slightly shorter first fixation durations in both the Provo Corpus and the OB1-reader simulations. This relation occurred in all predictability conditions, suggesting that the LLMs capture a similar relation between predictability and eye movements as cloze norming, and that this relation also exists for eye movements in the Provo Corpus.

Fig 4. Standardized RMSE scores of OB1-reader simulations per condition, eye movement measure and predictability weight.


* means p-value ≤ .05, ** means p-value ≤ .01, and *** means p-value ≤ .001. (a) RMSE scores for first fixation duration. (b) RMSE scores for gaze duration. (c) RMSE scores for skipping rates. (d) RMSE scores for total reading time. (e) RMSE scores for regression rates.

Our systematic analysis showed that, across predictability conditions, the model generated more error for longer, infrequent, and more predictable words, as well as for the initial words of a passage compared to the final words (S2 Fig). More importantly, the advantage of LLaMA relative to the other predictability conditions in fitting first fixation duration seems to stem from LLaMA providing better fits for highly predictable words. When comparing simulated and human first fixation durations (S3 Fig), we observed that the difference (i.e. simulated durations are longer) is more pronounced for longer and infrequent words. Another observation is that, across predictability conditions, the model fails to capture wrap-up effects (i.e. longer reading times towards the end of the sequence), which seem to occur in the human data, but not in the model data.

Gaze duration

LLaMA produced the least averaged error in fitting gaze duration. GPT-2 produced either similar fits to cloze or a slightly worse fit than cloze (see Fig 4B). All predictability conditions reduce error compared to the baseline, confirming the benefit of word predictability for predicting gaze duration. Higher predictability shortened gaze duration (S4 Fig) in both the model simulations (OB1-reader) and in the empirical data (Provo Corpus), and, similarly to first fixation duration, simulated gaze durations were consistently longer than the observed gaze durations. Also consistent with first fixation durations, more error is observed for longer, infrequent words. However, differently from the pattern observed with first fixation duration, gaze durations are better fit by the model for more predictable words and initial words in a passage. LLMs, especially LLaMA, generate slightly less error across predictability values, and LLaMA is slightly better at fitting gaze durations of words positioned closer to the end of the passages (S5 Fig). Simulated values are longer than human values, especially with long words (S6 Fig).

Skipping

Unexpectedly, skipping rates showed increasing error with predictability compared to the baseline. RMSE scores were higher in the LLaMA condition for all weights (see Fig 4C). These results show no evidence of skipping being more accurately simulated with any of the predictability estimators tested in this study. While OB1-reader produced sizable predictability effects on skipping rates, these effects seem to be very slight in the empirical data (S7 Fig). Another unexpected result was a general trend for producing more error for short, frequent and predictable words, with LLaMA generating more error in fitting highly predictable words than GPT-2 and cloze (S8 Fig). Moreover, the model generated more error in fitting function words than content words, which is the inverse of the trend for reading times, where more error is observed with content words than function words (S9 Fig). A closer inspection of this pattern reveals that the model generally skips less often than humans, especially longer, infrequent, and final content words. However, the reverse is seen for function words, which the model skips more often than humans do (S10 Fig).

Total reading time

Improvement in RMSE relative to the baseline was seen for total reading time in all conditions. LLaMA showed the best performance, with lower error compared to cloze and GPT-2, especially in the low and high weight conditions (see Fig 4D). Higher predictability led to shorter total reading time, and this was reproduced by OB1 reader in all conditions. LLaMA showed trend lines for the relation between predictability and total reading time that parallel those seen in the data, suggesting a better qualitative fit than for either cloze or GPT-2 (S11 Fig). Similarly to the error patterns with gaze duration, the model generated more error with longer, infrequent, less predictable and final words across predictability conditions (S12 Fig). Also consistent with the results for gaze duration, total reading times from the simulations are longer than those of humans, particularly for longer and infrequent words (S13 Fig).

Regression

Lastly, RMSE for regression was reduced in all predictability conditions compared to the baseline. The lowest error is generated with LLaMA-generated word predictability across predictability weights (see Fig 4E). Predictability effects on regression are in the expected direction, with higher predictability associated with lower regression rate, but this effect is amplified in the simulations (OB1-reader) relative to the empirical data (Provo Corpus) as indicated by the steeper trend lines in the simulated regression rates (S14 Fig). Similarly to the error patterns for skipping, the model generated more error with shorter, frequent and initial words. In contrast, error decreases as predictability increases in the simulations by the LLMs, especially by LLaMA, which generated less error with highly predictable words (S15 Fig). LLaMA is also slightly better at fitting regression to function words (S9 Fig). Furthermore, fitting simulated regression rates to word length, frequency and position showed similar trends as fitting the human regression rates, with steeper trend lines for the simulated values (S16 Fig).

Discussion

The current paper is the first to reveal that large language models (LLMs) can complement cognitive models of reading at the functional level. While previous studies have shown that LLMs provide predictability estimates that can fit reading behaviour as well as or better than cloze norming, here we go a step further by showing that language models may outperform cloze norming when used in a cognitive model of reading and tested in terms of simulation fits. Our results suggest that LLMs can provide the basis for a more sophisticated account of syntactic and semantic processes in models of reading comprehension.

Word predictability from language models improves fit to eye-movements in reading

Using predictability values generated from LLMs (especially LLaMA) to regulate word activation in a cognitive model of eye-movement control in reading (OB1-reader) reduced error between the simulated eye-movements and the corpus eye-movements, relative to using no predictability or to using cloze predictability. Late eye-movement measures (total reading time and regression) showed the most benefit in reducing error in the LLM predictability conditions, with decreasing error the higher the predictability weight. One interpretation of this result is that predictability reflects the ease of integrating the incoming bottom-up input with the previously processed context, with highly predictable words being more readily integrated with the previously processed context than highly unpredictable words [33]. We emphasize, however, that we do not implement a mechanism for word integration or sentence processing in OB1-reader, and so cannot support this interpretation from the findings at that level.

Notably, the benefit of predictability is less clear for early measures (skipping, first fixation duration and gaze duration) than for late measures across conditions. The more modest beneficial effect of predictability on simulating first-pass reading may be explained by comparing simulated and observed values. We found that OB1-reader consistently provides longer first-fixation and gaze durations. Slow first-pass reading might be due to free parameters of the model not yet being optimally fit. Estimating parameters in computational cognitive models is not a simple task, particularly for models with intractable likelihood for which computing the data likelihood requires integrating over various rules. Follow-up research should apply more sophisticated techniques for model parameter fitting, for instance using Artificial Neural Networks [34].

Moreover, predictability did not improve the fit to skipping. Even though adding predictability increases the average skipping rate (which causes the model to better approximate the average skipping rate of human readers), there nonetheless appears to be a mismatch between the model and the human data in terms of which individual words are skipped. One potential explanation involves the effect of word predictability on skipping being larger in OB1-reader than in human readers. The model skips highly predictable words more often than humans do. The high skipping rate seems to be related to the relatively slow first-pass reading observed in the simulations. The longer the model spends fixating a certain word, the more activity parafoveal words may receive, and thus the higher the chance that these are recognized while in the parafovea and are subsequently skipped. This effect can be more pronounced for highly predictable words, which receive more predictive activation under parafoveal preview. Thus, given the model’s assumption that early (i.e. in parafoveal preview) word recognition largely drives word skipping, predictive activation may lead OB1-reader to recognize highly predictable words in the parafovea and skip them. Parafoveal recognition either does not occur as often in humans, or does not cause human readers to skip those words as reliably as occurs in the model. It is also plausible that lexical retrieval prior to fixation is not the only factor driving skipping behaviour. More investigation is needed into the interplay between top-down feedback processes, such as predictability, and perception processes, such as visual word recognition, and the role of this interaction in saccade programming.

To better understand the potential differences between model and empirical data, as well as the factors driving the higher performance of simulations using LLM-based predictability, we compared simulated and human eye movements in relation to word-based linguistic features. Across predictability conditions, we found reverse trends in simulating reading times and saccade patterns: while the reading times of longer, infrequent, and content words were more difficult for the model to simulate, more error was observed in fitting the skipping and regression rates of shorter, frequent, and function words. A closer inspection of the raw differences between simulated and human data showed that the model was markedly slower than humans at first-pass reading of longer and infrequent words. It also skipped and regressed to longer, infrequent and content words less often than humans. The model thus seems to read more difficult words more “statically” (one-by-one) than humans do, with less room for “dynamic” reading (reading fast, skipping, and regressing for remembering or correcting).

When comparing LLMs to cloze, simulations using LLaMA-derived predictability showed an advantage at simulating gaze duration, total reading time and regression rates of highly predictable words, and a disadvantage at simulating skipping rates of highly predictable words. One potential explanation is that highly predictable words from the LLM are read faster, and thus are closer to human reading times, because LLaMA-derived likelihoods for highly predictable words are higher than those derived from GPT-2 and cloze (S17 Fig), providing more predictive activation to those words. LLaMA is also more accurate at predicting the next word in the Provo passages, which may allow the model to provide more predictive activation to the correct word in the passage (S1 Appendix). Next to faster reading, simulations using LLaMA may also skip highly predictable words more often, leading to increasing mismatch with the human data. This process was put forward before as the reason why the model may exaggerate the skipping of highly predictable words, and, since LLaMA provides higher values for highly predictable words and it is more accurate at predicting, the process is more pronounced in the LLaMA condition.

All in all, RMSE between simulated eye movements and corpus eye movements across eye-movement measures indicated that LLMs can provide word predictability estimates which are better than cloze norming at fitting eye movements with a model of reading. Moreover, the least error across eye-movement simulations occurred with predictability derived from a more complex language model (in this case, LLaMA), relative to a simpler language model (GPT-2) and cloze norming. Previous studies using transformer-based language models have shown mixed evidence for a positive relation between model quality and the ability of the predictability estimates to predict human reading behaviour [20,35–37]. Our results align with studies that have found language model word prediction accuracy, commonly operationalized as perplexity or cross-entropy, and model size, commonly operationalized as the number of parameters, to positively correlate with the model’s psychometric predictive power [32,35]. Note that the number of parameters and next-word prediction accuracy are not the only differences between the models used. Further investigation is needed comparing more language models, and the same language model with one setting varied systematically (e.g. all settings equal except the parameter count), to help determine which language models and settings are best for estimating predictability in reading simulations. Our results suggest that more complex pre-trained LLMs are more useful to this end.

Language models may aid our understanding of the cognitive mechanisms underlying reading

Improved fits aside, the broader, and perhaps more important, question is whether language models may provide a better account of the higher-order cognition involved in language comprehension. Various recent studies have claimed that deep language models offer a suitable “computational framework”, or “deeper explanation”, for investigating the neurobiological mechanisms of language [11,23,24], based on the correlation between model performance and human data. However, correlation between model and neural and behavioural data does not necessarily mean that the model is performing cognition, because the same input-output mappings can be performed by wholly different mechanisms (this is the “multiple realizability” principle) [38]. Consequently, claiming that our results show that LLMs constitute a “deeper explanation” for predictability in reading would be a logical fallacy. At best, it is a failed falsification attempt; that is, we failed to show that language models are unsuitable for complementing cognitive models of reading. Our results rather suggest that language models might be useful in the search for explanatory theories about reading. Caution remains important when drawing parallels between separate implementations, such as between language models and human cognition [39].

The question is then how we can best interpret language models for cognitive theory building. If they resemble language processing in the human brain, how so? One option is to frame LLMs as good models of how language works in the brain, which implies that LLMs and language cognition are mechanistically equivalent. This is improbable however, given that LLMs are tools built to perform language tasks efficiently, with no theoretical, empirical or biological considerations about human cognition. It is highly unlikely that language processing in the human brain resembles a Transformer implemented on a serial processor. Indeed, some studies explicitly refrain from making such claims, despite referring to language models as a “deeper explanation” or “suitable computational framework” for understanding language cognition [11,23].

Another interpretation is that LLMs resemble the brain by performing the same task, namely predicting upcoming linguistic input before it is perceived. Prediction as the basic mechanism underlying language is the core idea of Predictive Coding, a prominent theory in psycholinguistics [3] and in cognitive neuroscience [40,41]. However, shared tasks do not necessarily imply shared algorithms. For instance, it has been shown that higher accuracy on next-word prediction was associated with worse encoding of brain responses, contrary to what the theory of predictive coding would imply [39].

Yet another possibility is that LLMs resemble human language cognition at a more abstract level: both systems encode linguistic features which are acquired through statistical learning on the basis of linguistic data. The similarities are then caused not by the algorithm, but by the features in the input which both systems learn to optimally encode. The capability of language models to encode linguistic knowledge has been taken as evidence that language—including grammar, morphology and semantics—may be acquired solely through exposure, without the need for, e.g., an in-built sense of grammar [42]. How humans acquire language has been a continuous debate between two camps: the proponents of universal grammar argue for the need of an innate, rule-based, domain-specific language system [43,44], whereas the proponents of usage-based theories emphasize the role of domain-general cognition (e.g. statistical learning, [45], and generalization, [46]) in learning from language experience. Studying large language models can only enlighten this debate if those models are taken to capture the essence of human learning from linguistic input.

In contrast, some studies criticize the use of language models to understand human processing altogether. Having found a linear relationship between predictability and reading times instead of a logarithmic relationship, Smith and Levy [10] and Brothers and Kuperberg [25] speculated that the discrepancy was due to the use of n-gram language models instead of cloze estimations. One argument was that language models and human readers are sensitive to distinct aspects of the previous linguistic context, and that the limited interpretability and causal inference of language models are substantial drawbacks. However, language models have become more powerful in causal inference and provide a more easily interpretable measure of predictability than does cloze. Additionally, contextualized word representations show that the previous linguistic context can be better captured by state-of-the-art language models than by simpler architectures such as n-gram models. More importantly, neural networks allow for internal (e.g. architecture, representations) and external (e.g. input and output) probing: when certain input or architectural features can be associated with hypotheses about cognition, testing whether these features give rise to observed model behaviour can help adjudicate among different mechanistic explanations [24]. All in all, language models show good potential to be a valuable tool for investigating higher-level processing in reading. Combining language models, which are built with engineering goals in mind, with models of human cognition might be a powerful method to test mechanistic accounts of reading comprehension. The current study is the first to apply this methodological strategy.

Finally, we emphasize that the LLM’s success is likely not only a function of the LLM itself, but also of the way in which its outputs are brought to bear in the computational model. The cognitive mechanism that we proposed, in which predictions gradually affect multiple words in parallel, may align better with LLMs than with cloze norms, because the outputs of the former are based on a continuous consideration of multiple words in parallel, while the outputs of the latter may be driven by a more serial approach. More investigation is needed into the extent to which the benefit of LLMs is independent of the cognitive theory into which they are embedded. Comparing the effect of LLM-derived predictability in other models of reading, especially serial ones (e.g. E-Z Reader), could provide a clearer understanding of the generalizability of this approach.

Another potential avenue for future investigation is whether transformer-based LLMs can account for early and late cognitive processes during reading by varying the size of the LLM’s context window. Hofmann et al. [11] investigated a similar question, but using language model architectures which differ in how much of the context is considered, without including transformer-based architectures or performing reading simulations. We emphasize that such an investigation would require careful thinking about how to align the context window of the LLM with that of the reading simulation model. Follow-up work may address this gap.

The optimal method to investigate language comprehension may be to combine the ability of language models to functionally capture higher-order language cognition with the ability of cognitive models to mechanistically capture lower-order language perception. Computational models of reading, and more specifically of eye-movement control in reading, are built as a set of mathematical constructs to define and test explanatory theories or mechanism proposals regarding language processing during reading. As such, they are more interpretable and bear closer resemblance to theoretical and neurobiological accounts of cognition than LLMs. However, they often lack functional generalizability and accuracy. In contrast, large language models are built to efficiently perform natural language processing tasks, with little to no emphasis on neurocognitive plausibility and interpretability. Interestingly, despite prioritizing performance over explanatory power and cognitive plausibility, LLMs have been shown to capture various aspects of natural language, in particular at levels of cognition considered higher-order by brain and language researchers (e.g. semantics and discourse) and which models of eye-movement control in reading often lack. This remarkable ability of LLMs suggests that they offer a promising tool for expanding cognitive models of reading.

Methods

Eye-tracking and cloze norming

We use the full cloze completion and reading time data from the Provo corpus [29]. This corpus consists of data from 55 passages (2689 words in total) with an average of 50 words (range: 39–62) and 2.5 sentences (range: 1–5) per passage, taken from various genres, such as online news articles, popular science and fiction (see the example passage below). The Provo corpus has several advantages over other corpora. Sentences are presented as part of a multi-line passage instead of in isolation [47], which is closer to natural, continuous reading. In addition, Provo provides predictability norms for each word in the text, instead of only the final word [48], which is ideal for studies in which every word is examined. Finally, other cloze corpora tend to contain many highly constrained contexts (which are actually rare in natural reading), while this corpus provides a more naturalistic cloze probability distribution [29].

  1. There are now rumblings that Apple might soon invade the smart watch space, though the company is maintaining its customary silence. The watch doesn’t have a microphone or speaker, but you can use it to control the music on your phone. You can glance at the watch face to view the artist and title of a song.

In an online survey, 470 participants provided a cloze completion for each word position in each passage. Each participant was randomly assigned to complete 5 passages, resulting in 15 unique continuations filled in by 40 participants on average. All participants were native English speakers, ages 18–50, with at least some college experience. Another 85 native English-speaking university students read the same 55 passages while their eyes were tracked with a high-resolution EyeLink 1000 eye-tracker.

The cloze probability of each continuation in the upcoming word position was equivalent to the proportion of participants that provided that continuation in the corresponding word position. Since at most 43 participants completed any given sequence, the minimum cloze probability of a continuation was 0.023 (i.e. the value obtained if each participant gave a different continuation). Words in a passage which did not appear among the provided continuations received a cloze probability of 0. The cloze probabilities of each word in each passage and the corresponding continuations were used in the model to pre-activate each predicted word, as further explained in the sub-section “Predictability Implementation” under Methods.
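A minimal sketch of how these probabilities follow from the completion counts (toy responses; the corpus itself ships the values precomputed):

```python
from collections import Counter


def cloze_probabilities(responses):
    """Proportion of participants producing each continuation for one word position."""
    counts = Counter(response.lower() for response in responses)
    n = len(responses)
    return {word: count / n for word, count in counts.items()}


# Toy example with 43 participants: a continuation given by a single participant
# gets the minimum probability of 1/43 ~= 0.023; words never produced implicitly get 0.
responses = ["watch"] * 30 + ["phone"] * 12 + ["clock"]
print(cloze_probabilities(responses))
```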

The main measure of interest in this study is eye movements. During reading, our eyes make continuous and rapid movements, called saccades, separated by pauses in which the eyes remain stationary, called fixations. Reading in English consists of fixations of about 250ms on average, whereas a saccade typically lasts 15-40ms. Typically, about 10–15% of saccades move to earlier parts of the text and are called regressions, and about two thirds of the saccades skip words [49].

The time spent reading a word is associated with the ease of recognizing the word and integrating it with the previously read parts of the text [49]. The fixation durations and saccade origins and destinations are commonly used to compute word-based measures that reflect how long and how often each word was fixated. Measures that reflect early stages of word processing, such as lexical retrieval, include (i) skipping rate, (ii) first fixation duration (the duration of the first fixation on a word), and (iii) gaze duration (the sum of fixations on a word before the eyes move forward). Late measures include (iv) total reading time and (v) regression rate, and are said to reflect full syntactic and semantic integration [50]. Facilitatory effects of word predictability are generally evidenced in both early and late measures: that is, predictable words are skipped more often and read more quickly [27].
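These word-level measures can be derived from a participant's fixation sequence roughly as follows; this is a simplified sketch that ignores the corpus-specific cleaning steps.

```python
def word_measures(fixations, n_words):
    """Derive word-level eye-movement measures from one participant's fixation
    sequence, given as a list of (word_index, duration_ms) in chronological order.
    A simplified sketch; corpus pipelines apply more elaborate cleaning."""
    FFD = [0] * n_words                    # first-pass first fixation duration
    GD = [0] * n_words                     # gaze duration (first pass)
    TRT = [0] * n_words                    # total reading time
    first_pass_fixated = [False] * n_words
    regressed_in = [False] * n_words

    furthest = -1                          # right-most word fixated so far
    i = 0
    while i < len(fixations):
        word, dur = fixations[i]
        TRT[word] += dur
        if word < furthest:
            regressed_in[word] = True      # eyes returned to an earlier word
        elif word > furthest:
            # First-pass fixation on a new word; words jumped over stay skipped.
            first_pass_fixated[word] = True
            FFD[word] = dur
            gd = dur
            # Add immediate refixations on the same word to gaze duration.
            while i + 1 < len(fixations) and fixations[i + 1][0] == word:
                i += 1
                gd += fixations[i][1]
                TRT[word] += fixations[i][1]
            GD[word] = gd
        furthest = max(furthest, word)
        i += 1

    return {
        "skip": [not f for f in first_pass_fixated],
        "FFD": FFD, "GD": GD, "TRT": TRT,
        "regression": regressed_in,
    }


# Example: fixate word 0, then word 2 (skipping word 1), refixate word 2,
# regress to word 1, then move on to word 3.
fix = [(0, 210), (2, 180), (2, 160), (1, 250), (3, 230)]
print(word_measures(fix, n_words=4))
```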

The measures of interest readily provided in the eye-tracking portion of the corpus were first fixation duration, gaze duration, total reading time, skipping likelihood and regression likelihood. Those measures were reported by the authors to be predictable from the cloze probabilities, attesting to the validity of the data collected. We refer the reader to [29] for more details on the corpus used.

Language models

Language model probabilities were obtained from two transformer-based language models: the smallest version available of the pre-trained LLaMA [31] (7 billion parameters, 32 hidden layers, 32 attention heads, 4096 hidden dimensions and a 32k-token vocabulary) and the smallest version of the pre-trained GPT-2 [30] (124 million parameters, 12 hidden layers, 12 attention heads, 768 hidden dimensions and a 50k-token vocabulary). Both models were freely accessible through the Hugging Face Transformers library at the time of the study. The models are autoregressive and thus trained to predict a word based solely on its previous context. Given a sequence as input, the language model computes the likelihood of each word in the model’s vocabulary to follow the input sequence. The likelihood values are expressed in the form of logits in the model’s output vector, where each dimension contains the logit of a corresponding token in the model’s vocabulary. The logits are normalized using a softmax operation to values between 0 and 1.
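A minimal sketch of this step for GPT-2 with the Hugging Face Transformers library; the "gpt2" checkpoint name and the example context are illustrative, and the paper's exact inference code may differ.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # 124M-parameter GPT-2
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Context preceding the position for which we want next-word predictions.
context = "There are now rumblings that Apple might soon invade the smart"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, sequence_length, vocab_size)

# The logits at the last position score every vocabulary token as the next token;
# softmax turns them into probabilities between 0 and 1.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r:>12}  {prob.item():.3f}")
```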

Since the language model outputs a likelihood for each token in the model’s vocabulary, we limited the sample to only the tokens with a likelihood above a threshold. The threshold was defined according to two criteria: the number of words predicted by the language model should be close to the average number of cloze responses over text positions, and the threshold value should be sufficiently low to capture the usually lower probabilities of language models. We tested several threshold values (low = 0.001, medium-low = 0.005, medium = 0.01, medium-high = 0.05, and high = 0.1). The medium threshold (0.01) provided the closest number of continuations and average top-1 predictability estimate to those of cloze. For example, using the medium threshold on GPT-2 predictions provided an average of approximately 10 continuations (range 0–36) and an average predictability of 0.29 for the top-1 prediction, which was the closest to the values from cloze (average of 15 continuations ranging from 0 to 38, and a top-1 average predictability of 0.32). Low and medium-low provided a much higher number of continuations (averages of 75 and 19, ranging up to 201 and 61, respectively), whereas medium-high and high provided too few continuations compared to cloze (averages of 2 and 1 continuations, ranging up to 12 and 5, respectively). The medium threshold was also optimal for LLaMA. Note that we did not apply softmax to the resulting sequence of predictability estimates. The very long tail of excluded predictability estimates (approximately the size of the LLM vocabulary, i.e. ~50k estimates for GPT-2 and ~32k for LLaMA) meant that re-normalizing the top 10 to 15 estimates (the average number of continuations after threshold filtering) would remove most of the variation among the top estimates. For instance, in the first passage the second word position led to likelihoods for the top 12 predictions varying between .056 and .011. Applying softmax resulted in all estimates being transformed to .08 when rounded to two decimals. Thus, even though it is common practice to re-normalize after filtering to ensure the values sum to one across different sources, we opted to use predictability estimates without post-filtering re-normalization.
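Continuing the sketch above, the fixed threshold can then be applied to that probability distribution, deliberately without re-normalizing the surviving estimates:

```python
THRESHOLD = 0.01   # the "medium" threshold, chosen to roughly match cloze response counts


def predicted_continuations(probs, tokenizer, threshold=THRESHOLD):
    """Return {token_string: probability} for all tokens above the threshold.
    The surviving probabilities are deliberately not re-normalized."""
    keep = (probs >= threshold).nonzero(as_tuple=True)[0]
    return {tokenizer.decode([int(i)]).strip(): probs[i].item() for i in keep}


continuations = predicted_continuations(next_token_probs, tokenizer)
print(len(continuations), "continuations above threshold")
print(sorted(continuations.items(), key=lambda kv: -kv[1])[:5])
```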

Each sequence was tokenized with the corresponding model’s tokenizer before being given as input, since the language models have their own tokenization (Byte-Pair Encoding [51]) and expect the input to be tokenized accordingly. Pre-processing the tokens and applying the threshold on the predictability values resulted in an average of 10 continuations per word position (range 1 to 26) with LLaMA and an average of 10 continuations per word position (range 1 to 36) with GPT-2. After tokenization, no trial (i.e. Provo passage) was longer than the maximum length allowed by LLaMA (2048 tokens) or by GPT-2 (1024 tokens). Note that the tokens forming the language model’s vocabulary do not always match a word in OB1-reader’s vocabulary. This is because words can be represented as multiple tokens in the language model vocabulary. Additionally, OB1-reader’s vocabulary is pre-processed and limited to the text words plus the most frequent words in a frequency corpus [52]. 31% of the tokens predicted by LLaMA were not in OB1-reader’s vocabulary and 17% of words in the stimuli are split into multiple tokens in LLaMA’s vocabulary. With GPT-2, these percentages were 26% and 16%, respectively.

To minimize the impact of vocabulary misalignment, we considered a match between a word in the OB1-reader’s lexicon and a predicted token when the predicted token corresponded to the first token of the word as tokenized by the language model tokenizer. For instance, the word “customary” is split into the tokens “custom” and “ary” by LLaMA. If “custom” is among the top predictions from LLaMA, we used the predictability of “custom” as an estimate for the predictability of “customary”. We are aware that this design choice may overestimate the predictability of long words, as well as create discrepancies between different sources of predictability (as different language models have different tokenizers). However, in the proposed predictability mechanism, not only the text words are considered, but also all the words predicted at a given text position (above a pre-defined threshold). Other approaches that aggregate the predictability estimates over all tokens belonging to the word, instead of only the first token, would require computing next-token predictions repeatedly for each different predicted token for each text position, until we assume a word has been formed. To avoid these issues and since the proportion of text words which are split into more than one token by the language models is moderate (17% and 16% from LLaMA and GPT-2, respectively), we adopted the simpler first-token-only strategy.
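The first-token matching strategy can be sketched as follows; the continuation dictionary is a toy placeholder, and the tokenizer is the one loaded in the sketches above.

```python
def predictability_for_word(word, continuations, tokenizer):
    """Look up a lexicon word's predictability via its first LLM token.

    continuations: {token_string: probability} above threshold (see above).
    If the word's first token, as produced by the LLM tokenizer, is among the
    predicted tokens, its probability is used for the whole word; otherwise 0."""
    # Tokenize with a leading space, since GPT-2-style tokenizers mark word starts;
    # special tokens (e.g. BOS) are excluded.
    first_token_id = tokenizer.encode(" " + word, add_special_tokens=False)[0]
    first_token = tokenizer.decode([first_token_id]).strip()
    return continuations.get(first_token, 0.0)


# Toy example: whether "customary" is matched through "custom" depends on how the
# tokenizer segments it (LLaMA splits it into "custom" + "ary").
toy_continuations = {"custom": 0.12, "watch": 0.31}
print(predictability_for_word("customary", toy_continuations, tokenizer))
```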

Model description

In each fixation by OB1-reader (illustrated in Fig 5), the input consists of the fixated word n and the words n-1, n+1, n+2 and n+3, processed in parallel. With each processing cycle, word activation is determined by excitation from constituent letters, inhibition from competing words and a passive decay over time.

Fig 5. Schematic Diagram from OB1-reader.


This diagram was taken from [12].

The model assumes that attention is allocated to multiple words in parallel, such that recognizing those words can be achieved concurrently. Open bigrams [53] from three to five words are activated simultaneously, modulated by the eccentricity of each letter, its distance from the focus of attention, and crowding exerted by its neighbouring letters. Each fixation sets off several word activation cycles of 25ms each. Within each cycle, the bigrams encoded from the visual input propagate activation to each lexicon word they are in. The activated words with bigram overlap inhibit each other. Lexical retrieval only occurs when a word of similar length to the word in the visual input reaches the recognition threshold, which depends on its frequency. Initiating a saccade is a stochastic decision, with successful word recognition increasing the likelihood of moving the “eyes”, that is, the letter position where the fixation point is simulated in the text. Word recognition also influences the size of the attention window, by increasing the attention window when successful. Once the saccade is initiated, the most visually salient word within the attention window becomes the saccade’s destination. With the change in the location of fixation, the activation of words no longer in the visual input decays, while words encoded in the new visual input receive activation.

The strength of the visual excitation v_i generated by a letter i depends on its eccentricity e_i and crowding m_i, is weighted by an attentional weight a_i, and is normalized by the number of bigrams in the lexicon, len(bigrams), plus a constant c_d, as in the following equation (adapted from Snell et al. [6]):

v_i = \frac{a_i \times m_i \times \left[ \frac{1}{c_e \, (0.018 \, e_i + 1/0.64)} \right]}{\mathrm{len}(\mathrm{bigrams}) + c_d}

Combinations of letters activate open bigrams, which are pairs of letters within the same word that are, in OB1-reader, up to three letters apart. The activation of an open bigram O_ij thus equals the square root of the product of the visual inputs v_i and v_j of its constituent letters i and j, as implemented in the original OB1-reader.
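As a sanity check, the short sketch below transcribes these two equations directly into Python, using the parameter values from Table 1. The bigram count and the letter inputs are invented numbers, and the sketch mirrors the equations as reconstructed here rather than the released OB1-reader code.

    import math

    C_E, C_D = 35.55556, 5          # cortical-magnification scaling, ngram discount (Table 1)
    N_BIGRAMS = 100_000             # illustrative size of the bigram lexicon

    def visual_input(a_i, m_i, e_i):
        """Excitation v_i contributed by one letter (eccentricity e_i in degrees)."""
        acuity = 1.0 / (C_E * (0.018 * e_i + 1 / 0.64))
        return (a_i * m_i * acuity) / (N_BIGRAMS + C_D)

    def open_bigram(v_i, v_j):
        """O_ij = sqrt(v_i * v_j) for two letters of the same word."""
        return math.sqrt(v_i * v_j)

    v1 = visual_input(a_i=1.0, m_i=1.0, e_i=2 * 0.3)   # outer letter, two letters from fixation
    v2 = visual_input(a_i=1.0, m_i=0.5, e_i=3 * 0.3)   # inner letter, three letters away
    print(open_bigram(v1, v2))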

The strength of the inhibition from other word nodes depends on the extent of their pairwise orthographic overlap. With each time step in which a word is not in the visual input, its activation decays. Words are recognized when their activation reaches their corresponding recognition threshold. Note that in the original OB1-reader this threshold is determined by the word’s length, frequency, and predictability, with predictability being constant across all words predicted at the same text position. In the present version of OB1-reader, predictability is instead modelled as activation. This allows predictability to be specific to each predicted word, rather than only to the word actually occupying the text position. The recognition threshold is thus determined only by frequency, weighted by a constant c_f, as follows:

T = freq_{max} \times \frac{(freq_{max}/c_f) - freq_w}{freq_{max}/c_f}
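A direct transcription of this threshold makes its behaviour easy to inspect. In the sketch below, c_f comes from Table 1, while the maximum frequency and the example frequencies are illustrative assumptions rather than values from the actual simulations.

    C_F = 0.08   # weight of frequency in threshold setting (Table 1)

    def recognition_threshold(freq_w, freq_max=7.0):
        """T decreases (slightly) as word frequency increases."""
        return freq_max * ((freq_max / C_F) - freq_w) / (freq_max / C_F)

    print(recognition_threshold(freq_w=6.0))   # frequent word: lower threshold
    print(recognition_threshold(freq_w=1.0))   # rare word: threshold close to freq_max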

Predictability implementation

In addition to visual activation, words predicted to follow the word currently being read receive predictive activation (A_p) in the model’s lexicon. Given a word w_i at position i in a passage T, activation is added to each predicted word w_p with predictability pred_w above threshold t in the model’s lexicon, prior to word w_i being fixated by the model. The activation A_p of w_p, resulting from w_p being predicted from the previous context, is defined as follows:

A_p = pred_w \times pred_{w-1} \times c_p,

where pred_w is the language-model or cloze completion probability; pred_{w-1} is the predictability of the previous word, which varies as a function of whether that word has been recognized; and c_p is a free scaling parameter. pred_{w-1} equals the cloze- or language-model-derived predictability of the previous word if it has not yet been recognized, and 1 once its lexical access has been completed. The predictive activation A_p enters the word-activation update as follows:

\Delta S_w = (S_{max} - S_w) \times \left[ c_1 \left( \sum_{ij \in w} O_{ij} \right) + c_2 \left( \sum_k d_{w,k} S_k \right) + A_p \right] + (S_w - S_{min}) \times \tau

The input between square brackets has three parts. The first term, c_1(\sum_{ij \in w} O_{ij}), is the excitation from bigram nodes; the second term, c_2(\sum_k d_{w,k} S_k), is the inhibition exerted by competing words in the lexicon, with d_{w,k} reflecting the orthographic overlap between word w and competitor k and c_2 being negative (Table 1); and the third part, A_p, is the predictive activation. Word activity is bounded by the interval [S_{min}, S_{max}] and decays (τ) with every cycle in which the word is not in the visual input. Each parafoveal word receives predictive activation in parallel with the other parafoveal words until it is recognized. See Table 1 for an overview of the model parameters.
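To make the update rule concrete, the toy sketch below applies a few cycles of the activation equation to a single lexicon word, using the parameter values in Table 1. The bigram excitation, inhibition input, and predictability values are invented for illustration; this is a schematic sketch, not the released implementation.

    S_MAX, S_MIN = 1.0, 0.0
    C1, C2, TAU = 1.0, -2.5, -0.1
    C_P = 0.1                      # weight of predictability (medium setting)

    def predictive_activation(pred_w, pred_prev, prev_recognized):
        """A_p = pred_w * pred_{w-1} * c_p, with pred_{w-1} = 1 once the previous word is recognized."""
        return pred_w * (1.0 if prev_recognized else pred_prev) * C_P

    def update_activation(s_w, bigram_input, inhibition_input, a_p, in_visual_input=True):
        """One 25-ms cycle of the word-activation update written above."""
        drive = C1 * bigram_input + C2 * inhibition_input + a_p   # term in square brackets
        decay = 0.0 if in_visual_input else (s_w - S_MIN) * TAU   # decay when off-screen
        s_w = s_w + (S_MAX - s_w) * drive + decay
        return min(max(s_w, S_MIN), S_MAX)                        # bounded by [S_min, S_max]

    a_p = predictive_activation(pred_w=0.4, pred_prev=0.8, prev_recognized=True)
    s = 0.2
    for cycle in range(4):
        s = update_activation(s, bigram_input=0.3, inhibition_input=0.05, a_p=a_p)
        print(cycle, round(s, 3))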

Table 1. Simulation Parameters.

Parameter | Description | Value
c_e | Scaling of cortical magnification | 35.55556
e_i | Letter eccentricity | Distance in letters between the letter and the centre of attention, times 0.3 (letter size per degree of visual angle)
m_i | Masking factor describing crowding | 1 for outer letters, 0.5 for inner letters, 3 for one-letter words
c_d | Discounted ngram factor to normalize strong activation of short words | 5
c_f | Weight of frequency in threshold setting | 0.08
c_p | Weight of predictability in predictive activation setting | [0.05, 0.1, 0.2]
τ | Decay | -0.1
S_max | Maximum activity of a word node | 1
S_min | Minimum activity of a word node | 0
c_1 | Bigram-to-word excitation | 1
c_2 | Word-to-word inhibition | -2.5

With the described implementation, predictive activation exerts a facilitatory effect on word recognition, manifested as shorter reading times and a higher likelihood of skipping. Predicted words in the text receive additional activation prior to fixation, which allows the recognition threshold for lexical access to be reached more easily. Consequently, a predicted word may be recognized more readily, and even before it is fixated. In addition, higher predictability may indirectly increase the likelihood of skipping, because more successful recognition in the model leads to a larger attention window. Conversely, the activation of predicted words may inhibit the recognition of words with low predictability, because activated predicted words inhibit the words they resemble orthographically, which may include the word actually in the text.

Reading simulations

The simulation input consisted of each of the 55 passages from the Provo corpus, with punctuation and trailing spaces removed and words lower-cased. The output of each model simulation consisted mainly of reading times, skipping, and regressions, computed per simulated fixation. For evaluation, we transformed this fixation-centered output into word-centered data, where each data point aggregates the fixation information of one word in one passage. From the word-centered data, we computed the eye-movement measures for each word in the corpus. Since the eye-movement measures differ in scale (milliseconds for durations, likelihoods for skipping and regression), we standardized the raw differences between simulated and human values by the respective human average. The Root Mean Square Error (RMSE) between the simulated word-based eye-movement measures and the equivalent human data was then calculated per simulation, and a Wilcoxon test was run to compare the RMSE across conditions.
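A sketch of this evaluation step is given below. The measure names and the toy values are illustrative assumptions rather than the project’s actual output format; the point is that differences are standardized by the human mean of each measure before RMSE is computed, so durations and likelihoods become comparable.

    import numpy as np
    import pandas as pd

    def standardized_rmse(sim, hum, measures=("first_fix", "gaze", "skip", "regress")):
        """sim and hum are word-level DataFrames aligned row by row."""
        return {m: float(np.sqrt(np.mean(((sim[m] - hum[m]) / hum[m].mean()) ** 2)))
                for m in measures}

    # Toy word-level data for two words
    sim = pd.DataFrame({"first_fix": [220, 250], "gaze": [260, 300],
                        "skip": [0.10, 0.30], "regress": [0.05, 0.10]})
    hum = pd.DataFrame({"first_fix": [210, 240], "gaze": [255, 280],
                        "skip": [0.15, 0.25], "regress": [0.07, 0.12]})
    print(standardized_rmse(sim, hum))

Per-simulation RMSE values for the different predictability conditions could then be compared with, for example, scipy.stats.wilcoxon.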

Finally, we compared the error in each condition across different word-based linguistic variables (length, frequency, predictability, part-of-speech category, and position) to better understand the differences in performance. This analysis consisted of binning the continuous linguistic variables (length, frequency, predictability, and position) into 20 bins of equal width, computing the RMSE in each simulation for each bin, and averaging the RMSE for each bin over simulations. Part-of-speech tags were grouped into three categories: content, consisting of the spaCy part-of-speech tags noun (NOUN), verb (VERB), adjective (ADJ), adverb (ADV), and proper noun (PROPN); function, consisting of the spaCy part-of-speech tags auxiliary (AUX), adposition (ADP), conjunction (CONJ, SCONJ, CCONJ), determiner (DET), particle (PART), and pronoun (PRON); and other, consisting of numeral (NUM), interjection (INTJ), and miscellaneous (X). See Fig 6 for an overview of the methodology. The code to run and evaluate the model simulations is fully available on the GitHub repository of this project.
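The binning analysis can be sketched as follows; the column names and the toy data are illustrative, not the project’s exact code.

    import numpy as np
    import pandas as pd

    # Bin a continuous word variable into 20 equal-width bins and average the
    # per-word squared error within each bin (here with made-up values).
    rng = np.random.default_rng(0)
    words = pd.DataFrame({
        "frequency": rng.uniform(1, 7, 500),      # stand-in word frequencies
        "sq_error": rng.uniform(0, 1, 500),       # stand-in standardized squared errors
    })
    words["freq_bin"] = pd.cut(words["frequency"], bins=20)
    rmse_per_bin = words.groupby("freq_bin", observed=True)["sq_error"].mean().pow(0.5)
    print(rmse_per_bin.head())

    # Part-of-speech tags are grouped into coarse categories before the same comparison:
    POS_GROUPS = {"NOUN": "content", "VERB": "content", "ADJ": "content", "ADV": "content",
                  "PROPN": "content", "AUX": "function", "ADP": "function", "CONJ": "function",
                  "SCONJ": "function", "CCONJ": "function", "DET": "function", "PART": "function",
                  "PRON": "function", "NUM": "other", "INTJ": "other", "X": "other"}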

Fig 6. Methodology used in experiment with OB1-reader, LLMs and Cloze Predictability.

Fig 6

(a) In OB1-reader, a model of eye movements in reading, word predictability is computed for every word predicted in a given condition (GPT-2, LLaMA, cloze) at each position in the current stimulus. The predictability of each predicted word (predw) is weighted by the predictability of the previous word (predw-1) and a free parameter (c4 in the figure, corresponding to c_p in Table 1), and added to the word’s current activation (Sw) at each cycle until recognition. The number of processing cycles OB1-reader needs to achieve recognition determines the word’s reading time. (b) Eye movements simulated by the model are compared to the eye movements from the Provo Corpus by computing RMSE scores.

Supporting information

S1 Fig. Relation between predictability and first fixation duration.

Predictability is computed from cloze norms, gpt2 or llama, and is displayed as a function of the weight attached to predictability in word activation (low, medium or high). First fixation duration is displayed in milliseconds.

(TIFF)

pcbi.1012117.s001.tiff (897.9KB, tiff)
S2 Fig. Root Mean Square Error (RMSE) for first fixation durations in relation to word variables.

The word variables are frequency, length, predictability, and the position of the word in the passage (word_id).

(TIFF)

pcbi.1012117.s002.tiff (641.3KB, tiff)
S3 Fig. Relation between word variables and first fixation duration.

The word variables are frequency, length, predictability, the position of the word in the passage (word_id), and type (function, content, or other). First fixation duration is displayed in milliseconds.

(TIFF)

pcbi.1012117.s003.tiff (1.1MB, tiff)
S4 Fig. Relation between predictability and gaze duration.

Predictability is computed from cloze norms, gpt2 or llama, and is displayed as a function of the weight attached to predictability in word activation (low, medium or high). Gaze duration is displayed in milliseconds.

(TIFF)

pcbi.1012117.s004.tiff (941.6KB, tiff)
S5 Fig. Root Mean Square Error (RMSE) for gaze durations in relation to word variables.

The word variables are frequency, length, predictability, and the position of the word in the passage (word_id).

(TIFF)

pcbi.1012117.s005.tiff (672.9KB, tiff)
S6 Fig. Relation between word variables and gaze duration.

The word variables are frequency, length, predictability, the position of the word in the passage (word_id), and type (function, content, or other). Gaze duration is displayed in milliseconds.

(TIFF)

pcbi.1012117.s006.tiff (1.1MB, tiff)
S7 Fig. Relation between predictability values and skipping likelihood.

Predictability is computed from cloze norms, gpt2 or llama, and is displayed as a function of the weight attached to predictability in word activation (low, medium or high).

(TIFF)

S8 Fig. Root Mean Square Error (RMSE) for skipping likelihood in relation to word variables.

The word variables are frequency, length, predictability, and the position of the word in the passage (word_id).

(TIFF)

pcbi.1012117.s008.tiff (681.5KB, tiff)
S9 Fig. Root Mean Square Error (RMSE) for each eye movement measure in relation to word type (content, function or other).

(TIFF)

pcbi.1012117.s009.tiff (762KB, tiff)
S10 Fig. Relation between word variables and skipping likelihood.

The word variables are frequency, length, predictability, the position of the word in the passage (word_id), and type (function, content, or other).

(TIFF)

pcbi.1012117.s010.tiff (1.2MB, tiff)
S11 Fig. Relation between predictability and total reading time.

Predictability is computed from cloze norms, gpt2 or llama, and is displayed as a function of the weight attached to predictability in word activation (low, medium or high). Total reading time is displayed in milliseconds.

(TIFF)

pcbi.1012117.s011.tiff (967.4KB, tiff)
S12 Fig. Root Mean Square Error (RMSE) for total reading time in relation to word variables.

The word variables are frequency, length, predictability, and the position of the word in the passage (word_id).

(TIFF)

pcbi.1012117.s012.tiff (641.6KB, tiff)
S13 Fig. Relation between word variables and total reading time.

The word variables are frequency, length, predictability, the position of the word in the passage (word_id), and type (function, content, or other). Total reading time is displayed in milliseconds.

(TIFF)

pcbi.1012117.s013.tiff (1.1MB, tiff)
S14 Fig. Relation between predictability and regression likelihood.

Predictability is computed from cloze norms, gpt2 or llama, and is displayed as a function of the weight attached to predictability in word activation (low, medium or high).

(TIFF)

pcbi.1012117.s014.tiff (1.1MB, tiff)
S15 Fig. Root Mean Square Error (RMSE) for regression likelihood in relation to word variables.

The word variables are frequency, length, predictability, and the position of the word in the passage (word_id).

(TIFF)

pcbi.1012117.s015.tiff (650.8KB, tiff)
S16 Fig. Relation between word variables and regression likelihood.

The word variables are frequency, length, predictability, the position of the word in the passage (word_id), and type (function, content, or other).

(TIFF)

pcbi.1012117.s016.tiff (1.1MB, tiff)
S17 Fig. Distribution of predictability values by each predictor.

(TIFF)

pcbi.1012117.s017.tiff (371.3KB, tiff)
S1 Appendix. Accuracy and Correlation Analyses.

(DOCX)

pcbi.1012117.s018.docx (20.7KB, docx)

Data Availability

All the relevant data and source code used to produce the results and analyses presented in this manuscript are available on a Github repository at https://github.com/dritlopes/OB1-reader-model.

Funding Statement

This study was supported by the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO) Open Competition-SSH (Social Sciences and Humanities) (https://www.nwo.nl), 406.21.GO.019 to MM. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Van Berkum JJA, Brown CM, Zwitserlood P, Kooijman V, Hagoort P. Anticipating upcoming words in discourse: evidence from ERPs and reading times. J Exp Psychol Learn Mem Cogn. 2005;31(3):443. doi: 10.1037/0278-7393.31.3.443
2. Rayner K, Slattery TJ, Drieghe D, Liversedge SP. Eye movements and word skipping during reading: effects of word length and predictability. J Exp Psychol Hum Percept Perform. 2011;37(2):514. doi: 10.1037/a0020990
3. Ryskin R, Nieuwland MS. Prediction during language comprehension: what is next? Trends Cogn Sci. 2023. doi: 10.1016/j.tics.2023.08.003
4. Smith N, Levy R. Cloze but no cigar: The complex relationship between cloze, corpus, and subjective probabilities in language processing. In: Proceedings of the Annual Meeting of the Cognitive Science Society. 2011.
5. Reichle ED, Warren T, McConnell K. Using EZ Reader to model the effects of higher level language processing on eye movements during reading. Psychon Bull Rev. 2009;16:1–21. doi: 10.3758/PBR.16.1.1
6. Snell J, van Leipsig S, Grainger J, Meeter M. OB1-reader: A model of word recognition and eye movements in text reading. Psychol Rev. 2018;125(6):969. doi: 10.1037/rev0000119
7. Engbert R, Nuthmann A, Richter EM, Kliegl R. SWIFT: a dynamical model of saccade generation during reading. Psychol Rev. 2005;112(4):777. doi: 10.1037/0033-295X.112.4.777
8. Li X, Pollatsek A. An integrated model of word processing and eye-movement control during Chinese reading. Psychol Rev. 2020;127(6):1139. doi: 10.1037/rev0000248
9. Reilly RG, Radach R. Some empirical tests of an interactive activation model of eye movement control in reading. Cogn Syst Res. 2006;7(1):34–55.
10. Smith N, Levy R. The effect of word predictability on reading time is logarithmic. Cognition. 2013;128(3):302–19. doi: 10.1016/j.cognition.2013.02.013
11. Hofmann MJ, Remus S, Biemann C, Radach R, Kuchinke L. Language models explain word reading times better than empirical predictability. Front Artif Intell. 2022;4:730570. doi: 10.3389/frai.2021.730570
12. Reichle ED. Computational models of reading: A handbook. Oxford University Press; 2021.
13. Reichle ED, Pollatsek A, Rayner K. Using EZ Reader to simulate eye movements in nonreading tasks: A unified framework for understanding the eye–mind link. Psychol Rev. 2012;119(1):155. doi: 10.1037/a0026473
14. Taylor WL. “Cloze procedure”: A new tool for measuring readability. Journalism Quarterly. 1953;30(4):415–33.
15. Rayner K. Eye movements in reading and information processing: 20 years of research. Psychol Bull. 1998;124(3):372. doi: 10.1037/0033-2909.124.3.372
16. Levy R, Fedorenko E, Breen M, Gibson E. The processing of extraposed structures in English. Cognition. 2012;122(1):12–36. doi: 10.1016/j.cognition.2011.07.012
17. Jurafsky D, Martin JH. Speech and Language Processing. 2023.
18. Cevoli B, Watkins C, Rastle K. Prediction as a basis for skilled reading: insights from modern language models. R Soc Open Sci. 2022;9(6):211837. doi: 10.1098/rsos.211837
19. Harris ZS. Distributional structure. Word. 1954;10(2–3):146–62.
20. Shain C, Meister C, Pimentel T, Cotterell R, Levy RP. Large-scale evidence for logarithmic effects of word predictability on reading time. 2022.
21. Staub A, Grant M, Astheimer L, Cohen A. The influence of cloze probability and item constraint on cloze task response time. J Mem Lang. 2015;82:1–17.
22. Michaelov JA, Coulson S, Bergen BK. So cloze yet so far: N400 amplitude is better predicted by distributional information than human predictability judgements. IEEE Trans Cogn Dev Syst. 2022.
23. Goldstein A, Zada Z, Buchnik E, Schain M, Price A, Aubrey B, et al. Shared computational principles for language processing in humans and deep language models. Nat Neurosci. 2022;25(3):369–80. doi: 10.1038/s41593-022-01026-4
24. Doerig A, Sommers RP, Seeliger K, Richards B, Ismael J, Lindsay GW, et al. The neuroconnectionist research programme. Nat Rev Neurosci. 2023;1–20.
25. Brothers T, Kuperberg GR. Word predictability effects are linear, not logarithmic: Implications for probabilistic models of sentence comprehension. J Mem Lang. 2021;116:104174. doi: 10.1016/j.jml.2020.104174
26. Kuperberg GR, Jaeger TF. What do we mean by prediction in language comprehension? Lang Cogn Neurosci. 2016;31(1):32–59. doi: 10.1080/23273798.2015.1102299
27. Staub A. The effect of lexical predictability on eye movements in reading: Critical review and theoretical interpretation. Lang Linguist Compass. 2015;9(8):311–27.
28. Parker AJ, Kirkby JA, Slattery TJ. Predictability effects during reading in the absence of parafoveal preview. Journal of Cognitive Psychology. 2017;29(8):902–11.
29. Luke SG, Christianson K. The Provo Corpus: A large eye-tracking corpus with predictability norms. Behav Res Methods. 2018;50:826–33. doi: 10.3758/s13428-017-0908-4
30. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I, et al. Language models are unsupervised multitask learners. OpenAI blog. 2019;1(8):9.
31. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023.
32. Wilcox EG, Meister CI, Cotterell R, Pimentel T. Language Model Quality Correlates with Psychometric Predictive Power in Multiple Languages. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. p. 7503–11.
33. Luke SG, Christianson K. Limits on lexical prediction during reading. Cogn Psychol. 2016;88:22–60. doi: 10.1016/j.cogpsych.2016.06.002
34. Rmus M, Pan TF, Xia L, Collins AGE. Artificial neural networks for model identification and parameter estimation in computational cognitive models. PLoS Comput Biol. 2024;20(5):e1012119. doi: 10.1371/journal.pcbi.1012119
35. Goodkind A, Bicknell K. Predictive power of word surprisal for reading times is a linear function of language model quality. In: Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018). 2018. p. 10–8.
36. Oh BD, Schuler W. Why does surprisal from larger transformer-based language models provide a poorer fit to human reading times? Trans Assoc Comput Linguist. 2023;11:336–50.
37. De Varda A, Marelli M. Scaling in cognitive modelling: A multilingual approach to human reading times. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023. p. 139–49.
38. Guest O, Martin AE. On logical inference over brains, behaviour, and artificial neural networks. Comput Brain Behav. 2023;1–15.
39. Antonello R, Huth A. Predictive coding or just feature discovery? An alternative account of why language models fit brain data. Neurobiology of Language. 2023;1–16.
40. Rao RPN, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat Neurosci. 1999;2(1):79–87. doi: 10.1038/4580
41. Friston K. The free-energy principle: a rough guide to the brain? Trends Cogn Sci. 2009;13(7):293–301. doi: 10.1016/j.tics.2009.04.005
42. Contreras Kallens P, Kristensen-McLachlan RD, Christiansen MH. Large language models demonstrate the potential of statistical learning in language. Cogn Sci. 2023;47(3):e13256. doi: 10.1111/cogs.13256
43. Chomsky N. Knowledge of language: Its nature, origin, and use. New York; 1986.
44. Yang C, Crain S, Berwick RC, Chomsky N, Bolhuis JJ. The growth of language: Universal Grammar, experience, and principles of computation. Neurosci Biobehav Rev. 2017;81:103–19. doi: 10.1016/j.neubiorev.2016.12.023
45. Saffran JR, Aslin RN, Newport EL. Statistical learning by 8-month-old infants. Science. 1996;274(5294):1926–8. doi: 10.1126/science.274.5294.1926
46. Goldberg AE. Explain me this: Creativity, competition, and the partial productivity of constructions. Princeton University Press; 2019.
47. Kliegl R, Grabner E, Rolfs M, Engbert R. Length, frequency, and predictability effects of words on eye movements in reading. European Journal of Cognitive Psychology. 2004;16(1–2):262–84.
48. Hamberger MJ, Friedman D, Rosen J. Completion norms collected from younger and older adults for 198 sentence contexts. Behavior Research Methods, Instruments, & Computers. 1996;28:102–8.
49. Rayner K, Chace KH, Slattery TJ, Ashby J. Eye movements as reflections of comprehension processes in reading. Scientific Studies of Reading. 2006;10(3):241–55.
50. Reichle ED, Pollatsek A, Fisher DL, Rayner K. Toward a model of eye movement control in reading. Psychol Rev. 1998;105(1):125. doi: 10.1037/0033-295x.105.1.125
51. Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. 2015.
52. Van Heuven WJB, Mandera P, Keuleers E, Brysbaert M. SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology. 2014;67(6):1176–90. doi: 10.1080/17470218.2013.850521
53. Grainger J, Van Heuven WJB. Modeling letter position coding in printed word perception. 2004.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1012117.r001

Decision Letter 0

Daniele Marinazzo, Ronald van den Berg

10 Jun 2024

Dear Ms. Lopes Rego,

Thank you very much for submitting your manuscript "Language models outperform cloze predictability in a cognitive model of reading" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers (three in this case). In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

One point that was brought up by two of the reviewers – and which I also scribbled down in my own evaluation of your work – is that the methodology should be described in more detail. In addition to the examples of missing details raised by the reviewer, another one is that it was unclear to me what the input to each simulation was; apparently 55 passages from a corpus were used, but how long were those passages, what did they look like, and how were they chosen?

A second point that was brought up by two reviewers and also appeared in my own notes was that the manuscript would benefit from a discussion of systematic differences between model fit and empirical data. Would it be possible to say something about why the LLM-based measure leads to lower RMSE? On what kinds of inputs does it perform particularly well and are there cases where it does substantially worse? Answers to those kinds of questions (and those raised by the reviewers) could give important insights that go beyond just comparing performances.

Related to this, a third important point that was brought up is the question of whether the success of the LLM is purely due to its scale (trained on texts from millions of people vs. the 40 participants from which the empirical Cloze norms were derived)?

Besides these three points, the reviewers have numerous other small comments and suggestions. You can find their detailed reviews below.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript will be sent to the three reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Ronald van den Berg

Academic Editor

PLOS Computational Biology

Daniele Marinazzo

Section Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Review of manuscript entitled “Language models outperform cloze predictability in a cognitive model of reading” (PCOMPBIOL-D-24-00700) by Rego, Snell, and Meeter.

Summary. The manuscript describes a simulation “experiment” in which OB1-Reader was used to predict the eye movements observed during the reading of the Provo corpus. Across the simulated “conditions,” the predictability of each word in the corpus was set equal to zero or values derived from cloze norms or one of two large language models, LLMs (GPT2 or LLaMA), with the weights being used by OB1-Reader to modulate the effect of predictability also being manipulated. The key result from this exercise was that predictability values derived from LLMs provided better fits to the data than the cloze norms. The authors discuss their work in terms of possible theoretical interpretations of LLMs and their integration with more plausible models of reading.

General comments. I enjoyed reading this manuscript and thought that the basic idea of using LLMs to generate word-predictability values for reading models was both novel and interesting. For those reasons, I think the manuscript is potentially worthy of publication. Before that can happen, however, the manuscript will require a significant amount of revision to address the concerns outlined below.

Major Concerns/Questions:

1. Many of the essential details of OB1-Reader and how the simulations were completed were not provided, making it difficult to evaluate the quality of the simulation results. For example, the authors first mention “pre-cycle activation” on p. 7 but only provide some of the details to understand what this is later, on pp. 24-25. Also, although the authors provide a figure of their model, they do not provide any of the implementational details, but instead assume that readers will consult Snell et al. (2018). The action editor may differ with me here, but my view is that this type of paper should be self-contained, and that whatever model details are required to understand the simulations should be provided.

In relation to this, one important detail that needs to be expanded upon is related to the issue of serial versus parallel processing of words. For example, on p. 25, the authors indicate that predictability is “parallel, i.e., predictions can be made about multiple text positions simultaneously.” This assumption obviously aligns with the architecture of OB1-Reader but is contrary to models that, like E-Z Reader, assume that words are processed one at a time. In fact, in E-Z Reader, a word’s predictability can only be used to facilitate its processing if the preceding words have been both identified and integrated into the sentence representation being constructed (see Reichle et al., 2012, p. 159). This point of contrast should be mentioned because the two models, OB1-Reader and E-Z Reader, represent the end points on an important theoretical continuum.

2. On pp. 12-14 and in the Supporting Information, the authors discuss the model fits and how they compare to the observed data from the Provo corpus. However, the authors do not discuss the systematic differences between the two. For example, in Figures 1 and 2 of the Supporting Information, the simulated first fixation and gaze durations are consistently longer than the corresponding observed values. Some explanation of these differences seems necessary.

3. I’m not an expert on LLMs, but my understanding is that part of their utility reflects the fact that they’ve been trained on huge volumes of text, perhaps making their “experience” (using the authors’ term; see p. 6) equivalent to that of many readers (as opposed to, e.g., what would be expected of a single reader). In contrast, the empirical cloze norms taken from the Provo corpus were based on an average of only 40 participants (p. 19). If that’s the case, would the model’s performance using cloze norms be more equivalent to the model’s performance using LMMs if the norms had been based on a much larger sample? Some explanation seems necessary. And if my intuition here is incorrect, then a discussion of precisely why might strengthen the authors’ argument for the value of using LMMs.

Minor Concerns: These suggestions are admittedly nitpicky but intended to be helpful [page/paragraph/line(s)].

1. General comment: I had the impression that the manuscript was written in haste, with minimal effort to proof it for typos.

2. 2 (abstract)/1/2: Remove “yet”.

3. 3/1/2: “provide” --> “provides”

4. 3/1/penultimate sentence: The sentence beginning “In the short term,” sound empty and should be expanded or removed.

5. 4/1/6: Add two references: (1) Reilly & Radach (2006); and (2) Li & Pollatsek (2020).

6. 4/1/7 (and elsewhere throughout manuscript): “Cloze” is typically not capitalized.

7. 4/3/4 (and elsewhere): Use the Oxford comma with lists of three or more items to avoid ambiguity.

8. 5/1/2: “EZ-reader” --> “E-Z Reader”

9. 5/1/final sentence: Although what is said here is technically true, in E-Z Reader, the cloze values are “fixed” only because the model doesn’t provide a deep account of language processing. As such, I don’t conceptualize cloze values as being static (as suggested by the authors) but instead view this assumption as a proxy. Some acknowledgment of this seems necessary. (And not just for my model, but for the others, too.)

10. 5/2/6-7: Move the parenthetical example to the end, as a new sentence.

11. 6/1/7: “is determined” --> “are determined”

12. 6/1/10: “the phrase “as well as (17), or outperform (18)” should be re-written and expanded, with more details provided (i.e., not just the citations).

13. 6/2/2: Insert “our” after “advancing”.

14. 8/2: I would remove the abbreviations for the eye-movement measures because they are not used elsewhere in the manuscript (except Figure 2, where they are provided). Also, I would list the measures with numbers; e.g., “… of interest were: (1) skipping; i.e., …” Do the same on p. 20.

15. p. 12, Figure 3: The axes labels are too small to read. Also, the p-value associated with ** is not indicated in the note.

16. 14/3/4: “as-” --> “as”

17. 14/3/7: Shouldn’t this read “semantic and syntactic processes” given the discussion on p. 6?

18. 15/2/5: Remove “specifically”.

19. 15/2/6: “higher” --> “larger”

20. 15/2/7-11: The argument beginning with “Given the model’s assumption…” doesn’t make sense to me. I would suggest clarifying this argument or removing it.

21. 16/1/6: Shouldn’t “model quality” be “model size”?

22. 16/2/11: I think I understood what the authors meant by “failed falsification attempt” but I suspect that many readers won’t. So please expand this point.

23. 17/3: In the preceding paragraphs, the authors briefly describe how LLMs have been interpreted and then provide an argument for why each of these interpretations is limited. The authors don’t do this for the interpretation mentioned in this paragraph, however. That said, is there any evidence supporting or refuting this interpretation?

24. 18/1/2: Remove “and more” and “still”.

25. 18/1/7: Again, I think I understand what the authors mean by “probing” the internal versus external behavior of neural networks, but I suspect that most readers won’t. So briefly expand this point.

26. 19/1/6: The claims that “cognitive models of reading often lack [accounts of semantics and discourse]” isn’t quite accurate. I would suggest weakening this claim and citing a few of the discourse-processing models (e.g., Fletcher et al., 1996; Frank et al., 2007; Kintsch, 1998; Myers & O’Brien, 1998; van den Broek et al., 1996).

27. 21/2/2: “7B” --> “7 billion”

28. 21/2/4: “124M” --> “124 million”

29. 23/1/1: Figure 1 is actually Figure 4.

30. 25/2: In discussing the model’s assumption about predictability, use numbers to list the assumptions; e.g., “…predictability being: (1) graded, …”

31. 26/3/4: Figure 2 is actually Figure 5.

Signed: Erik D. Reichle

Reviewer #2: The paper investigates the suitability of large language models (LLMs) within the framework of a cognitive model of reading, specifically comparing the predictability scores from LLMs (GPT-2 and LLaMA-7B) to those obtained via traditional cloze tasks. The authors demonstrate that LLM-derived predictability scores offer a better fit to human eye-movement data during reading, outperforming cloze-based predictions (as indicated by the Root Mean Square Error (RMSE) between simulated and human eye movements).

The methodology proposed by the authors consists of two "novelties": using LLMs to generate predictability scores and implementing a novel "parallel-graded mechanism" that involves the pre-activation of all predicted words at a given position based on their contextual certainty, which changes dynamically as text processing unfolds. The authors claim that their contribution is the first of its kind in combining a language model with a computational cognitive model of reading, offering a new approach to understanding the interplay between word predictability and eye movements.

Pros:

- The paper introduces a novel integration of LLMs with a cognitive model, contributing to the field of reading research by providing a way of obtaining more dynamic and contextually aware word predictability scores.

- By outperforming traditional cloze methods, the study highlights the potential of LLMs to enhance predictive accuracy in reading simulations.

- The authors are very careful about interpreting the results, ensuring they do not fall into the trap of proposing LLMs as cognitively-plausible models.

Cons/questions:

- The authors only consider the first token of multi-token words, rather than considering the joint probability of all tokens that constitute the word. For example, "customary" is represented as two tokens in LLaMA's vocabulary, and the authors use only p(custom) as the predictability score for the word "customary" instead of p(custom)×p(ary∣custom). I am not sure this is the best approach as it may overestimate the predictability of longer words. Furthermore, this creates discrepancy across different sources of predictability (e.g. with each new tokenizer, the representative part of a multi-token words will differ). The implication of this design choice must be discussed in the paper.

- Although I do not have a problem with it, the choice of GPT-2 and LLaMA should be justified. GPT-2 is not traditionally regarded as an LLM (it is perhaps the last 'small' language model), so it could be presented as a baseline language model to demonstrate the potential benefits of more complex LLMs. It would also be interesting to see whether models with increasing sizes yield better performance or if there is a potential limit to the benefits of LLMs at a certain parameter count.

- The methodology section could benefit from more details. The threshold of 0.01 feels rather arbitrary (how many continuations did you get for 0.005 or 0.001?). It is generally advisable to re-normalize the probabilities after any kind of filtering to ensure they still sum to one. Did you apply softmax again to the probabilities of the remaining words? Having normalized true probability distributions is important for accuracy, especially when comparing different sources.

Overall, this paper makes a contribution to the field by showing how the LLMs can be integrated into the existing cognitive modeling techniques. The use of LLMs to predict word predictability in reading contexts perhaps will lead to more research in reading comprehension.

Reviewer #3: 1. I'm curious whether it's possible to vary the size of the context window of the LLMs as an experimental parameter to see how larger/smaller contexts influence the gaze metrics, specifically in relation to early vs late measures.

2. Page 6. The sentence starting with: "This evidence has led to believe..." seems to be missing a subject (but I'm not a native English speaker so I might be wrong, but it sounds a bit strange to me).

3. I understand it is out of scope for this article but it would be interesting to compare results also across different types of models of eye movements during reading (e.g., SWIFT and EZ Reader). This way one might get a clearer understanding whether LLMs are generally better than Cloze estimates independently of the particular mechanisms of the proposed eye movement model, or alternatively, if the results might depend more on the mechanistic details of the particular model.

4. Page 9. "These results suggest that language models, especially with higher prediction accuracy, are good word predictability..." So there is a clear relationship between an LLMs general prediction accuracy and the fit to eye movement data? Maybe some simple comparison metrics between GPT2 and LLaMa would help the reader along the way here. For the reader it is not clear at this point in the text how these two LLMs differ, for example in terms of vocabulary size or overall prediction accuracy.

5. Page 12, fig 3. Please specify which statistical hypothesis test was used to derive the p-values.

6. Page 14, "the current paper is the first to evidence..." Sounds grammatically strange to my ears.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Erik D. Reichle

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1012117.r003

Decision Letter 1

Daniele Marinazzo, Ronald van den Berg

9 Sep 2024

Dear Ms. Lopes Rego,

We have now heard back from the three reviewers and are pleased to inform you that your manuscript has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Ronald van den Berg

Academic Editor

PLOS Computational Biology

Daniele Marinazzo

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Review of manuscript entitled “Language models outperform cloze predictability in a cognitive model of reading” (PCOMPBIOL-D-24-00700R1) by Rego, Snell, and Meeter.

Summary. Because this is a revised version of a manuscript that I previously reviewed, I won’t provide another summary here.

General comments. I was generally positive about the first version of the manuscript, although I did have a handful of substantial concerns and a long list of minor suggestions. The authors did a really nice job of addressing my concerns, as well as tightening up the manuscript. For that reason, I’m going to recommend that the manuscript now be accepted for publication.

Signed: Erik D. Reichle

Reviewer #2: I have reviewed the authors' responses and revisions. While some of the issues I previously raised (e.g., using only the first token to represent words) remain in the pipeline, these concerns are now explicitly discussed in the paper. So any future readers will be aware of these limitations. At this stage, it may be unreasonable to expect the authors to re-run or re-implement the entire pipeline, and I believe the findings and the overall work are interesting for the general audience. I have no further comments.

Reviewer #3: I have re-read the manuscript and I feel that the authors have adequately addressed the points I raised in my review. I recommend that the manuscript be accepted for publication.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Erik D. Reichle

Reviewer #2: No

Reviewer #3: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1012117.r004

Acceptance letter

Daniele Marinazzo, Ronald van den Berg

20 Sep 2024

PCOMPBIOL-D-24-00700R1

Language models outperform cloze predictability in a cognitive model of reading

Dear Dr Lopes Rego,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
