Abstract
Models of gaze allocation in complex scenes are derived mainly from studies of static picture viewing. The dominant framework to emerge has been image salience, where properties of the stimulus play a crucial role in guiding the eyes. However, salience-based schemes are poor at accounting for many aspects of picture viewing and can fail dramatically in the context of natural task performance. These failures have led to the development of new models of gaze allocation in scene viewing that address a number of these issues. However, models based on the picture-viewing paradigm are unlikely to generalize to a broader range of experimental contexts, because the stimulus context is limited, and the dynamic, task-driven nature of vision is not represented. We argue that there is a need to move away from this class of model and find the principles that govern gaze allocation in a broader range of settings. We outline the major limitations of salience-based selection schemes and highlight what we have learned from studies of gaze allocation in natural vision. Clear principles of selection are found across many instances of natural vision and these are not the principles that might be expected from picture-viewing studies. We discuss the emerging theoretical framework for gaze allocation on the basis of reward maximization and uncertainty reduction.
Keywords: salience, natural tasks, eye movements, reward, learning, prediction
Introduction
Visually guided behaviors require the appropriate allocation of gaze in both space and time. High-acuity foveal vision must be directed to locations that provide information for completing behavioral goals. Behaviorally informative locations change with progress through a task, so this allocation of gaze must not only be to the right places but must also be at the right times to serve behavior. Understanding the principles that underlie the deployment of gaze in space and time is, therefore, important for understanding any visually guided behavior.
In this article, we review the current state of models of eye guidance for complex scene viewing and whether they can generalize to natural behavior. In particular, we review the dominant class of models that has emerged to explain gaze allocation in picture viewing: those that are based on low-level image properties, often operationalized as image salience. While this approach has provided insights into oculomotor selection and has given rise to a considerable volume of research, we argue that most current models offer only a limited description of human gaze behavior. Moreover, we argue that the dominant paradigm—that of picture viewing—is an inappropriate domain of explanation if we wish to understand eye movement behavior more generally. While most models have been built around a core of low-level feature conspicuity, some emerging models attempt to base selection on higher level aspects of scenes. We consider the direction that these models are taking and whether this will allow insights into vision in natural settings. We approach this by considering what a model of natural eye guidance should be able to explain. That is, we highlight the principles of fixation selection in natural tasks that can be found to generalize across many real-world situations; these are the components of eye movement behavior that need to be explained by any theoretical model. The common underlying principles for eye guidance suggest that behavioral relevance and learning are central to how we allocate gaze. These principles necessarily change the emphasis of what should be modeled and we suggest that a framework incorporating behavioral rewards will provide a useful approach for understanding the manner in which models of eye guidance may be implemented in the future.
Image salience and eye movement behavior
The extensive psychophysical literature on visual search has demonstrated that basic visual features can capture and guide attention (see Wolfe, 1998). If a target differs from a set of distractors in just a single feature dimension, such as color or orientation, it can be detected very rapidly, and detection time remains fast irrespective of the number of distractors present (Treisman & Gelade, 1980). This “pre-attentive” capture (“popout”) suggests that features can drive the allocation of attention. Similarly, more complex search, where targets are defined by the unique conjunction of two features, can successfully be explained using serial selection driven by image features. Models such as Treisman's feature integration theory (Treisman & Gelade, 1980) or Wolfe's (2007) guided search model produce human-like search behavior using only low-level featural information. A natural extension of this work was to ask whether this principle could be applied to understanding how attention is allocated in more complex scenes. These models of visual search underlie the most influential class of models of gaze patterns in picture viewing based on low-level image features. One computational implementation of this class of model is the notion of the “salience map,” a spatial ranking of conspicuous visual features that could be candidates for covert or overt attention (Itti & Koch, 2000; Itti, Koch, & Niebur, 1998; Koch & Ullman, 1985). The salience map concept has had a profound influence on the research field and has become an integral component of many subsequent models of gaze allocation. In the original implementation of the salience model, when presented with a scene, low-level features are extracted in parallel across the extent of the viewed scene (Figure 1). Local competition across image space and feature scales results in feature maps for luminance contrast, color contrast, and orientation contrast. These individual feature maps are combined by weighted sum to create an overall distribution of local feature contrast, known as the “salience map.” Attention is then allocated to locations in the scene according to the salience in the computed map using a winner-takes-all principle. To avoid attention becoming “stuck” at the most salient location, a local, transient inhibition is applied to each attended location. Each iteration of the model—a winner-takes-all selection of the most salient location followed by inhibition at the attended location—effectively represents a relocation of attention.
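This pipeline is compact enough to sketch in code. The following is a deliberately stripped-down, single-scale illustration of this class of model rather than Itti and Koch's implementation: their model uses multiscale image pyramids, Gabor filters for orientation, and an iterative normalization scheme, for which simple difference-of-Gaussians contrast, gradient magnitudes, and uniform weights stand in here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround(channel, center_sigma=1.0, surround_sigma=8.0):
    """Local feature contrast as a difference of Gaussians: how much a
    location differs from its coarser-scale surround."""
    center = gaussian_filter(channel, center_sigma)
    surround = gaussian_filter(channel, surround_sigma)
    return np.abs(center - surround)

def salience_map(rgb):
    """Combine intensity, color, and orientation contrast maps by a
    weighted sum (uniform weights here) into a single salience map."""
    rgb = rgb.astype(float)
    intensity = rgb.mean(axis=2)
    # Crude opponency channels standing in for red-green and blue-yellow.
    rg = rgb[..., 0] - rgb[..., 1]
    by = rgb[..., 2] - 0.5 * (rgb[..., 0] + rgb[..., 1])
    # Gradient magnitudes stand in for oriented-filter responses.
    gy, gx = np.gradient(intensity)
    feature_maps = [center_surround(m)
                    for m in (intensity, rg, by, np.abs(gx), np.abs(gy))]
    # Normalize each feature map so that no single feature dominates.
    feature_maps = [m / (m.max() + 1e-9) for m in feature_maps]
    return sum(feature_maps) / len(feature_maps)
```

The winner-takes-all selection and inhibition-of-return loop that operates on this map is sketched separately under "Inhibition of return" below.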
Explanatory power of the salience map
Visual conspicuity models such as Itti and Koch's salience map can explain aspects of human attention allocation. The salience model described in Figure 1 can localize popout targets in a single iteration of the model. However, conjunction targets can take several iterations of the model before they are selected, and the number of iterations depends upon the number of distractors present (Itti & Koch, 2000). This serial search behavior, with search times dependent upon distractor set size, mirrors human performance. When presented with a complex photographic scene, the model predicts serial search behavior in which visually conspicuous locations are selected for “fixation,” in a manner that appears superficially similar to human eye movement patterns.
Empirical evaluation of salience-based selection in complex scenes
It is clear that under certain circumstances image salience (or other feature-based models) can provide a good explanation of how humans orient their attention. However, the evidence so far discussed is derived almost exclusively from situations in which the stimulus is a simple visual search array or in which the target is uniquely defined by simple visual features. These studies provide a proof of principle for the notion that the visual system can select fixation targets on the basis of conspicuity. Indeed, the original goal of such models was to explain attentional capture rather than to provide a model of eye movements in complex scenes. It seems reasonable to ask whether the principles derived from simple paradigms might generalize to viewing more complex scenes. However, a real-world scene provides a much greater range of information types than these simple search displays. It is, therefore, important to empirically evaluate whether visual conspicuity contributes significantly to fixation selection when a greater range of information is available. Empirical evaluations of the salience model using complex, natural scenes show that more fixations lie within regions predicted by the salience model than would be expected by chance (e.g., Foulsham & Underwood, 2008) and salience at fixated locations is significantly higher than at control locations (e.g., Parkhurst, Law, & Niebur, 2002). Findings such as these are widespread, suggesting a correlation between low-level features in scenes and fixation selection by humans. However, as argued by previous researchers, these correlations alone should not be taken to imply any causal link between features and fixation placement (Henderson, 2003; Henderson, Brockmole, Castelhano, & Mack, 2007; Tatler, 2007).
Despite the widespread interest in this model and the considerable successes that it has had in predicting fixation selection at above-chance levels, it is important to consider just how much fixation behavior can be explained by a feature-based model of selection. Empirical evaluations of the extent of the difference in salience at fixated and control locations are informative in this respect. Using signal detection approaches, it is possible to consider not only whether statistically significant differences in the salience at fixated and control locations can be found but also the magnitude of these differences (see Tatler, Baddeley, & Gilchrist, 2005). Essentially, the technique can be used to determine the extent to which fixated and control locations can be discriminated on the basis of low-level feature information. The magnitude of this difference indicates how well fixation selection can be explained by low-level features. Such evaluations have found areas under the receiver operating characteristic (ROC) curve in the region of 0.55 to 0.65 (where 0.5 is chance), which suggests that the proportion of fixation behavior that can be accounted for by image salience is modest (e.g., Einhäuser, Spain, & Perona, 2008; Nyström & Holmqvist, 2008; Tatler et al., 2005). When the viewer's task is manipulated, this modest predictive power can disappear (e.g., Foulsham & Underwood, 2008; Henderson et al., 2007). The weak statistical support for low-level factors in fixation selection can be contrasted with the support offered for other factors in fixation selection. Using the same logic of attempting to discriminate between fixated and control locations on the basis of a particular source of information, areas under the ROC curve for other factors can be compared to those found for image salience. Einhäuser, Spain et al. (2008) found that fixated and control locations can be better distinguished by object-level information than by image salience. Tatler and Vincent (2009) found that fixated and control locations can be better distinguished by biases in how we move our eyes than by image salience.
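The signal detection logic is straightforward to state computationally. The sketch below is an illustration rather than the procedure of any one cited study: it assumes a salience map stored as a 2D array and two lists of (x, y) pixel coordinates, whereas published evaluations typically draw control locations from fixations on other images or from shuffled fixation distributions.

```python
import numpy as np

def salience_auc(salience, fixated_xy, control_xy):
    """Area under the ROC curve for discriminating fixated from control
    locations by their salience values: 0.5 is chance, 1.0 is perfect."""
    fix = np.array([salience[y, x] for x, y in fixated_xy], dtype=float)
    ctl = np.array([salience[y, x] for x, y in control_xy], dtype=float)
    # The AUC equals the probability that a randomly chosen fixated
    # location out-scores a randomly chosen control location
    # (the Mann-Whitney statistic, with ties counted as half).
    greater = (fix[:, None] > ctl[None, :]).mean()
    ties = (fix[:, None] == ctl[None, :]).mean()
    return greater + 0.5 * ties
```

On this scale, the reported values of 0.55 to 0.65 correspond to only a weak separation between the salience distributions at fixated and control locations.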
Despite these empirical shortcomings of the original implementation of the salience model (and of similar models), conspicuity-based accounts continue to feature prominently in much of the recent work on eye guidance (e.g., Xu, Yang, & Tsien, 2010; Yanulevskaya, Marsman, Cornelissen, & Geusebroek, 2010; Zehetleitner, Hegenloh, & Mueller, 2011; Zhao & Koch, 2011, and many others). Recent special issues of Cognitive Computation (Taylor & Cutsuridis, 2011) and of Visual Cognition (Tatler, 2009) reflect the continuing prominence of image salience and similar conspicuity-based factors in current research. Indeed, even recent emerging models often continue to retain a key role for visual conspicuity (e.g., Ehinger, Hidalgo-Sotelo, Torralba, & Oliva, 2009; Kanan, Tong, Zhang, & Cottrell, 2009), a point we will return to later in this article. However, this conspicuity-based class of computational model of eye guidance requires a set of assumptions that are conceptually and empirically problematic. In the section that follows, we highlight these assumptions and evaluate empirical evidence about their validity. Following this, we will consider the emerging models that overcome some of these limitations, including models that place less emphasis on visual conspicuity. We then consider situations in which conspicuity models may provide useful descriptions of human behavior.
Assumptions in models of scene viewing
Assumption 1: Pre-attentive features drive fixation selection
One of the essential assumptions behind salience models is that simple features are extracted pre-attentively at early levels of visual processing and that the spatial deviations of features from the local surround can, therefore, provide a basis for directing attention to regions of potential interest. While there exist statistically robust differences in the low-level content of fixated locations, compared with control locations (e.g., Mannan, Ruddock, & Wooding, 1997; Parkhurst et al., 2002; Reinagel & Zador, 1999), the magnitude of these differences tends to be small (see above), suggesting that the correlation between features and fixation is relatively weak. Furthermore, correlations are only found for small-amplitude saccades (Tatler, Baddeley, & Vincent, 2006) and, crucially, disappear once the cognitive task of the viewer is manipulated (e.g., Foulsham & Underwood, 2008; Henderson et al., 2007). This does not mean that stimulus properties are unimportant. A high signal-to-noise ratio will make a variety of visual tasks such as search faster and more reliable. The question is whether simple stimulus features are analyzed pre-attentively and can, thus, form the basis for a bottom-up mechanism that can direct attention to particular locations. When walking around a real or virtual environment, feature-based salience offers little or no explanatory power over where humans fixate (Jovancevic, Sullivan, & Hayhoe, 2006; Jovancevic-Misic & Hayhoe, 2009; Sprague, Ballard, & Robinson, 2007; Turano, Geruschat, & Baker, 2003). In a virtual walking environment in which participants had to avoid some obstacles while colliding with others, image salience was not only unable to explain human fixation distributions but predicted that participants should be looking at very different scene elements (Rothkopf, Ballard, & Hayhoe, 2007). Humans looked mainly at the objects, with only 15% of fixations directed to the background. In contrast, the salience model predicted that more than 70% of fixations should have been directed to the background. Thus, statistical evaluations of image salience in the context of active tasks confirm its lack of explanatory power. Hence, the correlations found in certain situations when viewing static scenes do not generalize to natural behavior. In ball sports, the shortcomings of feature-based schemes become even more obvious. Saccades are launched to regions where the ball will arrive in the near future (Ballard & Hayhoe, 2009; Land & McLeod, 2000). Crucially, at the time that the target location is fixated, there is nothing that visually distinguishes this location from the surrounding background of the scene. Even without quantitative evaluation, it is clear that no image-based model could predict this behavior. Similar targeting of currently empty locations is seen in everyday tasks such as tea making (Land, Mennie, & Rusted, 1999) and sandwich making (Hayhoe, Shrivastava, Mruczek, & Pelz, 2003). When placing an object on the counter, people will look to the empty space where the object will be placed. As has been pointed out before, it is important to avoid causal inferences from correlations between features and fixations (Einhäuser & König, 2003; Henderson et al., 2007; Tatler, 2007), and indeed, higher level correlated structures such as objects offer better predictive power for human fixations (Einhäuser, Spain et al., 2008).
Assumption 2: There is a default bottom-up mode of looking
An implicit assumption in salience-based models is that there is a “default” task-free, stimulus-driven, mode of viewing and that vision for tasks is special in some way. The possibility of such a default viewing mode that can be overridden by other factors is discussed by several recent authors (e.g., Einhäuser, Rutishauser, & Koch, 2008; Underwood, Foulsham, van Loon, Humphreys, & Bloyce, 2006). Higher level factors are conceptualized as modulators of this basic mode of looking (see below). This assumption can be found at the heart of a wide range of studies and has motivated the use of “free-viewing” as a condition in studies of picture viewing, in an attempt to isolate task-free visual processing (e.g., Parkhurst et al., 2002). Here, the viewer is given no specific instructions during the experiment other than to look at the images. The assumption that “free-viewing” is a task-free condition for the viewer is questionable. It seems more likely that free-viewing tasks simply give the subject free license to select his or her own internal agendas (Tatler et al., 2005). A reasonable assumption about what people may be doing when asked to simply look at images is to recognize and remember the contents, but we cannot be sure of their internal priorities. Consequently, we are not studying viewing behavior while free of task, but rather we are studying viewing behavior when we have no real knowledge of what the viewer has chosen as the purpose of looking. Of course, the fixation behavior we engage in when “freely viewing” an image will be very different from that when engaged in a specific task such as search, but this does not imply that the former reflects any “default” task-free mode of looking. Not only is free-viewing a conceptually problematic task, but even when participants are freely viewing images, correlations between features and fixations are weak (Einhäuser, Spain et al., 2008; Nyström & Holmqvist, 2008; Tatler, 2007).
Assumption 3: Target selection from the map
Within the salience map framework, the decision about where to fixate arises from the computation of salience across the entire visual field, followed by a winner-takes-all process to select the most salient location. In order for this to allow more than one saccade, there is transient inhibition at each attended location. While this scheme seems like a reasonable computational solution to the problem of creating an iterative model of target selection, there are at least two problems with this aspect of such models.
Retinal sampling and eccentricity
In most accounts of salience-based schemes, the retinal position of image information is not accounted for; thus, decreasing retinal acuity in the periphery is overlooked (see Wischnewski, Belardinelli, & Schneider, 2010, for further information about the failure to consider peripheral sampling limits in most recent accounts of fixation selection). Some recent models do account for retinal sampling and we will consider these later. However, we first consider the problems associated with failing to account for this aspect of visual sampling. Vincent, Troscianko, and Gilchrist (2007) showed that feature coding becomes unreliable in the periphery once the variable resolution retina is taken into account. The feature maps and resultant salience maps generated when accounting for the variable spatial resolution outside the human fovea are very unlike those generated if uniform resolution sampling is assumed. This means that salience maps computed without taking into account the resolution of peripheral vision are biologically implausible. Moreover, salience computations that do account for spatial sampling heterogeneity fail to discriminate natural object targets in photographic scenes (Vincent et al., 2007). Retinal anisotropies in sampling result in tendencies to move the eyes in particular ways (Najemnik & Geisler, 2008). Humans tend to select nearby locations more frequently than distant locations as targets for their saccades (e.g., Bahill, Adler, & Stark, 1975; Gajewski, Pearson, Mack, Bartlett, & Henderson, 2005; Pelz & Canosa, 2001; Tatler et al., 2006). Similarly, when viewing pictures, horizontal saccades dominate (e.g., Bair & O'Keefe, 1998; Lappe, Pekel, & Hoffmann, 1998; Lee, Badler, & Badler, 2002; Moeller, Kayser, Knecht, & König, 2004). Incorporating these tendencies into models of fixation selection dramatically improves the predictive power of the model (Tatler & Vincent, 2009); indeed, these motor biases alone predicted fixation selection better than a model based on homogeneous salience computation or homogeneous edge feature extraction. We must, therefore, account for where information is in the retinal image rather than simply where peaks in any arbitrary whole-scene feature map might occur. Failing to account properly for where the winner is in the salience map results in distributions of saccade amplitudes that do not match human eye behavior (Figure 2).
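One way to make the issue concrete is to impose an eccentricity-dependent sensitivity on a conventional salience map. The sketch below is a much cruder device than the spatially variant filtering of Vincent et al. (2007) or Zelinsky (2008): rather than recomputing features on a variable-resolution input, it simply down-weights a precomputed map, and the half-sensitivity eccentricity `e2` and pixel scaling are illustrative assumptions.

```python
import numpy as np

def foveated_salience(salience, fixation_xy, deg_per_px=0.05, e2=2.5):
    """Attenuate a uniform-resolution salience map with eccentricity
    from the current fixation, mimicking the fall-off of peripheral
    resolution. Distant peaks are weakened, so nearby targets win."""
    h, w = salience.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ecc = np.hypot(xs - fixation_xy[0], ys - fixation_xy[1]) * deg_per_px
    # e2 is the assumed eccentricity (in degrees) at which sensitivity
    # halves; it is a free parameter here, not a fitted value.
    return salience * (e2 / (e2 + ecc))
```

Even this simple weighting reproduces a bias toward nearby saccade targets, illustrating why ignoring retinal position distorts predicted saccade amplitude distributions (Figure 2).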
It is interesting to compare the logic behind a winner-takes-all selection process and how we typically view the need to move the eyes. The general conception of the need to move the eyes is to bring the fovea to bear on information that is not fully available in the limited acuity peripheral vision. Thus, eye movements serve to provide new information about the surroundings, maximizing information gathering or reducing uncertainty about the visual stimulus (e.g., Najemnik & Geisler, 2005; Renninger, Verghese, & Coughlan, 2007). This contrasts with the winner-takes-all approach of selecting the region with the biggest signal as the next saccade target.
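The contrast between the two selection principles can be made explicit. The sketch below is a schematic illustration, not the ideal observer of Najemnik and Geisler (2005): it scores candidate fixations by expected uncertainty reduction under the simplifying assumption that a fixation perfectly resolves everything inside its window and nothing outside it.

```python
import numpy as np

def expected_info_gain(prior, windows):
    """Score each candidate fixation by the expected reduction in
    entropy about the target's location. prior: probability map over
    locations; windows: one boolean mask per candidate fixation."""
    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    h0 = entropy(prior)
    gains = []
    for inside in windows:
        q = prior[inside].sum()  # probability the target is in the window
        # With probability q the target is found (posterior entropy 0);
        # otherwise the posterior is the renormalized prior outside.
        h_out = entropy(prior[~inside] / (1 - q)) if q < 1 else 0.0
        gains.append(h0 - (1 - q) * h_out)
    return np.array(gains)
```

Where a winner-takes-all rule fixates the largest current signal, this rule fixates wherever looking is expected to be most informative; the two can disagree sharply when the most uncertain regions are not the most conspicuous ones.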
Inhibition of return
To allow attention to move on from the most salient peak in the salience map, transient inhibition of each attended location is included in the model. The inclusion of transient inhibition at attended locations is based on psychophysical experiments suggesting that there is an increase in latency when returning to recently attended locations (Klein, 1980, 2000; Klein & MacInnes, 1999; Posner & Cohen, 1984). However, empirical evidence suggests that there is no reduction in tendency to return to recently fixated locations when viewing photographic images (Smith & Henderson, 2009; Tatler & Vincent, 2008). Hooge, Over, van Wezel, and Frens (2005) found that while saccades back to the previously fixated location were preceded by longer fixation times (showing temporal IOR), there was no evidence of any decrease in the frequency of saccades back to previously fixated locations. Whether we observe something resembling inhibition of return or not depends upon the statistics of the dynamic environment being observed (Farrell, Ludwig, Ellis, & Gilchrist, 2010) and tasks that require refixations between objects show no evidence of IOR (Carpenter & Just, 1978). When specifically engaged in foraging behavior, refixations are rare (Gilchrist, North, & Hood, 2001), but it is not clear whether this is due to a low-level inhibitory mechanism, particular oculomotor strategies specific to foraging, or simply memory for previously visited locations (Gilchrist & Harvey, 2006). Indeed, Droll, Gigone, and Hayhoe (2007) demonstrated that locations are fixated more frequently if they are more likely to contain the target.
The implementation of IOR in computational models of salience presents an obvious problem when attempting to simulate extended viewing. If the inhibition is long lasting, then refixations are impossible; if the inhibition is transient, then the model predicts cyclic eye movement behavior. Neither of these is compatible with human behavior. When viewing a picture of a face, participants will cycle around the triangle of central facial features (Yarbus, 1967). However, this cyclic behavior is not commonly found in more complex scenes and is certainly not an unavoidable consequence of looking at the same scene for more than a few seconds. Figure 3 compares fixation patterns for a human observer viewing a scene for an extended period to Itti and Koch's (2000) salience model inspecting the scene for the same number of fixations. Thus, it seems likely that a different mechanism is required to explain the transition from one fixation to the next. This is likely to be a more active mechanism, driven by a particular goal such as search or information acquisition.
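The dilemma is easy to demonstrate in a minimal implementation of the selection loop (an illustration, not Itti and Koch's code). The persistence of the inhibition is a free parameter, and neither extreme behaves like a human observer: values near zero make the model cycle among the few highest peaks, while values near one all but forbid refixations.

```python
import numpy as np

def scanpath(salience, n_fixations=20, ior_sigma=25.0,
             ior_strength=1.0, ior_persistence=0.5):
    """Winner-takes-all selection with transient inhibition of return.
    Low ior_persistence yields cyclic revisiting of the top peaks;
    high ior_persistence makes refixations nearly impossible."""
    h, w = salience.shape
    ys, xs = np.mgrid[0:h, 0:w]
    inhibition = np.zeros_like(salience, dtype=float)
    path = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(salience - inhibition), (h, w))
        path.append((x, y))
        # Deposit a Gaussian blob of inhibition at the attended location.
        inhibition += ior_strength * np.exp(
            -((xs - x) ** 2 + (ys - y) ** 2) / (2 * ior_sigma ** 2))
        # Let accumulated inhibition decay before the next selection.
        inhibition *= ior_persistence
    return path
```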
Assumption 4: Time and target selection
A reasonable starting point when developing a model of eye movement behavior is to make the simplifying assumption that a first goal should be to explain spatial rather than temporal aspects of viewing behavior. It is becoming increasingly clear, however, that important information about the underlying mechanisms for saccade target selection also lies in the temporal domain. Fixation durations vary from a few tens of milliseconds to several hundred milliseconds and, in certain situations in real-world behaviors, can last for several seconds (Hayhoe et al., 2003; Land et al., 1999). Work on the importance of fixation duration in picture viewing is beginning to emerge (Henderson & Pierce, 2008; Henderson & Smith, 2009; Nuthmann, Smith, Engbert, & Henderson, 2010). Evidence from natural tasks emphasizes the need to consider fixation durations: fixation durations depend critically on the time required to acquire the necessary information for the current act (Droll, Hayhoe, Triesch, & Sullivan, 2005; Hayhoe, Bensinger, & Ballard, 1998; Hayhoe et al., 2003; Land et al., 1999). If fixation durations vary according to the information extraction requirements, then ignoring this source of information when evaluating and constructing models of eye guidance misses a fundamental aspect of the control of attentional allocation.
Within the context of both simple laboratory paradigms and complex natural behavior, there is evidence that sequences of fixations may be planned in parallel (Zingale & Kowler, 1987). Unusually short fixations are often interpreted as implying that they are part of a preprogrammed sequence, and the evidence for this in simple tasks is considerable. Express saccades are found predominantly when they are part of an ordered sequence of fixations in the same direction as each other (Carpenter, 2001). In the antisaccade task, erroneous prosaccades are frequently followed by short duration fixations before a corrective saccade to the opposite hemifield, a result that is interpreted as reflecting parallel programming of both the erroneous and corrective saccade before the error is initiated (Massen, 2004). The prevalence of very short fixation durations in some natural tasks (e.g., Hayhoe et al., 2003) could be interpreted in the same manner as short duration fixations in simple viewing paradigms: as part of a preprogrammed sequence of eye movements.
From picture-viewing experiments, we know that the consistency between observers changes over time, such that different people will pick more similar locations for their first few fixations than for later fixations (Buswell, 1935). One possible explanation for this has been that the first few fixations in a viewing period are primarily driven by image properties (e.g., Parkhurst et al., 2002). However, subsequent studies have not supported this notion, suggesting that the correlations between features and fixations do not change over time (e.g., Nyström & Holmqvist, 2008; Tatler et al., 2005). Consequently, the changes in viewing behavior that are found across viewing time must come from higher level factors. Thus, time within a viewing epoch may prove to be an informative component for modeling the underlying target selection processes.
Assumption 5: Saccades precisely target locations for processing
Eye guidance in picture viewing is often assessed by comparing image statistics at fixated and control locations, extracting image properties over a small window (1–2 degrees of visual angle) centered at fixation (e.g., Parkhurst et al., 2002; Tatler et al., 2006; and many others). This approach assumes that the information at the center of gaze contains the intended target of each saccade. This seems plausible from the perspective of eye movement behavior in simple laboratory-based viewing paradigms. When required to fixate a small peripheral target, saccades that land short will almost always be corrected so that the fovea is brought to bear precisely upon the target (e.g., Becker, 1972, 1991; Carpenter, 1988; Deubel, Wolf, & Hauske, 1984; Kapoula & Robinson, 1986; Prablanc & Jeannerod, 1975). However, it is unclear whether such precision, evidenced by the presence of small corrective saccades, is a feature of natural image viewing (Tatler & Vincent, 2008).
In the context of more natural tasks, such precision may be unnecessary. When moving an object past an obstacle, getting the center of vision within about 3 degrees was sufficient: saccades that brought the foveae within 3 degrees of the obstacle were not corrected (Johansson, Westling, Backstrom, & Flanagan, 2001). Similarly, in tea making, saccades of amplitudes less than 2.5 degrees are very rare (Land et al., 1999). These findings suggest that getting the eye close to but not necessarily precisely on to a target is sufficient to serve many aspects of natural behavior, particularly when the objects being dealt with are large in the field of view. When making large relocations from one side of the room to another, gaze will sometimes be shifted in one large combined movement of eyes, head, and body. However, on other occasions, the relocation may involve one or more short duration fixations en route to the intended target (Land et al., 1999). In this case, the fixations made en route do not appear to land on any particular locations in the scene. It seems unlikely that these were intentionally targeted fixations; rather, they represent incidental stops during a planned relocation to the final, intended object. As such, the contents of these en passant fixations are unlikely to have played a key role in saccade targeting and modeling the visual characteristics of these fixations is likely to be misleading.
One question that arises when considering eye movements during natural behavior is whether all of the fixations we make are strictly necessary for serving the current behavioral goal or whether there is a certain amount of redundancy. Figure 4 shows an example of eye movements made while waiting for the kettle to boil. Many of these seem unlikely to be strictly necessary for the primary task and may reflect a variety of other purposes. It is entirely possible that these non-essential fixations are not targeted with the same precision or using the same selection criteria as other fixations. In general, the tight linking of fixations to the primary task will vary, depending on such factors as time pressure or behavioral cost. For example, fixations during driving may be more critical than when walking, where time is less critical. It is probably a mistake to think that every fixation must have an identifiable purpose and should be targeted with the same precision or selection criteria. It may be under conditions of reduced cognitive load that conspicuity-based fixations are most likely to be manifest.
Emerging alternative accounts
Not all of the issues identified above are fatal for existing approaches to the computational modeling of fixation selection. For example, incorporating peripheral acuity limits (Assumption 3) into models is tractable and several authors have included aspects of this in computational models (e.g., Peters, Iyer, Itti, & Koch, 2005). Recent models emphasize the importance of inhomogeneous retinal sampling (e.g., Wischnewski et al., 2010; Zelinsky, 2008). Similarly, models can incorporate information about when in a viewing epoch a fixation occurs or the duration of the fixation (Assumption 4). Models of fixation durations in scene viewing are beginning to emerge (Nuthmann et al., 2010).
Several recent models attempt to incorporate higher level factors into accounts of fixation selection, addressing a limitation of the original salience model that was recognized from the outset (Itti & Koch, 2000). One possibility is to suggest that top-down control is used to selectively weight the feature channels in the salience model to emphasize features that define the target of a search (Navalpakkam & Itti, 2005). A successful approach has been to incorporate prior knowledge of where particular objects are likely to be found in a scene in order to guide eye movements (Torralba, Oliva, Castelhano, & Henderson, 2006). In this model, a salience map of low-level conspicuity is modified by a contextual map of where particular targets are likely to occur. Contextual guidance and low-level features combine to provide good predictive power for human fixation distributions (Ehinger et al., 2009). In addition to using spatial expectations to refine the search space in a scene, prior knowledge of the appearance of objects of a particular class can be used (Kanan et al., 2009). The combination of a probabilistic appearance map, spatial contextual guidance, and low-level feature salience again predicts a sizeable fraction of human fixations (Kanan et al., 2009).
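The way these models combine sources of information can be summarized compactly. A minimal sketch of the multiplicative combination used in contextual guidance models is given below; the exponent that flattens the salience term follows the general form described by Torralba et al. (2006), but the value used here is illustrative, and the appearance term of Kanan et al. (2009) would enter as a further multiplicative map.

```python
import numpy as np

def contextual_priority(salience, context_prior, gamma=0.3):
    """Combine a bottom-up salience map with a scene-context prior over
    likely target locations into a single normalized priority map."""
    # gamma < 1 compresses salience so that context can dominate where
    # the scene strongly constrains target placement.
    priority = (salience ** gamma) * context_prior
    return priority / priority.sum()
```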
While the majority of recent computational models have retained a central place for low-level visual conspicuity, some models depart from this and build around alternative cores. The two most developed of these alternatives come from Wischnewski, Steil, Kehrer, and Schneider (2009; Wischnewski et al., 2010) and Zelinsky (2008). In Zelinsky's Target Acquisition Model, retinal inhomogeneity in sampling of the visual image is computationally implemented. Visual information is represented not as simple feature maps but as higher order derivatives, and knowledge of the target is incorporated. This model is successful at replicating human-like search of photographic scenes and the direction of the first saccade in a viewing epoch. The model can generalize to simpler stimuli and reproduce phenomena such as the center of gravity effect, where saccades land between potential targets (Zelinsky, Rao, Hayhoe, & Ballard, 1997).
Wischnewski et al.'s (2010) model builds upon Bundesen's (1990) Theory of Visual Attention. Wischnewski et al. attempt not only to move away from simple static visual features but also to overcome some of the problematic assumptions described above. In this model, retinal inhomogeneity of feature processing is included and the model centers around the integration of static features, dynamic features, proto-objects, and task. The emphasis in this model is not how static features are modified by other factors, but rather that the representation underlying saccade targeting is an integration across these levels of processing. These authors suggest that the different levels of information are integrated into an overall, retinotopic attention priority map. The notion of an attention priority map that integrates low-level and high-level cues has been suggested several times in the neurophysiological literature (Fecteau & Munoz, 2006). The neural implementations of such a priority map include the superior colliculus (McPeek & Keller, 2002), pulvinar (Robinson & Petersen, 1992), V1 (Li, 2002), V4 (Mazer & Gallant, 2003), LIP (Gottlieb, Kusunoki, & Goldberg, 1998), and the frontal eye field (Thompson & Bichot, 2005). Indeed, the emergence of a priority map to reflect the choice of either a target or an action in the posterior parietal cortex and subsequent areas is well supported and is clearly necessary to mediate targeted movements. It is also commonly accepted that both bottom-up and top-down signals contribute to such priority maps (Bichot & Schall, 1999; Gottlieb et al., 1998). The way that such activity emerges from the combination of stimulus and task context is unresolved, however, and beyond the scope of this review.
Wischnewski et al.'s notion of proto-objects as a key component in fixation selection is similar to recent suggestions by Henderson, Malcolm, and Schandl (2009). These authors suggested that selection proceeds from a representation of proto-objects ranked by cognitive relevance.
One other notable feature of Wischnewski et al.'s model is the incorporation of dynamic features. The need to account for dynamic stimuli and the inclusion of motion as a feature in models have been recognized for some time (see Dorr, Martinetz, Gegenfurtner, & Barth, 2010), and several versions of conspicuity-based models have incorporated dynamic features (e.g., Itti, 2005). However, it remains the case that the vast majority of studies of eye movements when viewing complex scenes use photographic images of real scenes, which necessarily fail to capture both the dynamics of real scenes and the complex, time-dependent nature of task influences. Given this paradigmatic dominance of picture viewing, we will first consider whether this paradigm is a suitable domain in which to study eye guidance, before considering what can be learned from studying eye movements in dynamic and immersive contexts.
The picture-viewing paradigm
Can we learn about how we allocate gaze in natural environments and during natural behavior from how people look at pictures? While it is clear that models of picture viewing have utility for understanding tasks that involve looking at images on a computer monitor, it is important to consider whether we can use them to infer principles for fixation selection when behaving in natural environments. We are not the first to ask questions about the suitability of pictures as surrogates for real environments. Henderson (2003, 2006, 2007) has discussed this issue on several occasions. We wish to draw attention to two particular issues: biases introduced by the framing of the scene and effects of sudden scene onset.
The physical difference between photographs and real environments is obvious: the dynamic range of a photograph is much less than that of a real scene; many depth cues (stereo and motion parallax) are absent in static images; motion cues (both egomotion and external motion) are absent when viewing photographs; the observer's viewpoint in a still image is fixed and defined by the viewpoint of the photographer, which typically reflects compositional biases (Tatler et al., 2005). Not only is the field of view limited to the angle subtended by the display monitor, but also the scale of the image is typically undefined and depends on an inference by the observer. For example, a plate in a real setting might subtend 10 degrees, depending on the location of the observer, but in a picture of a scene it may subtend only a degree or two, and the subject must infer the viewpoint. This seems like a fairly sophisticated computation and is at odds with the essential idea of salience that low-level pre-attentive image features control gaze, with only limited perceptual analysis. Not only are the contents of photographs far removed from real images, but also placing the images within the bounds of the computer monitor's frame introduces strong biases in how the scenes are viewed. There is a strong tendency to fixate the center of images on a monitor irrespective of the scene's content (Tatler, 2007; Vincent, Baddeley, Correani, Troscianko, & Leonards, 2009). If, as Vincent et al. suggest, up to 34–56% of eye movements are best accounted for by a bias to fixate the screen center, then modeling the visual contents of these fixations will be very misleading.
Picture-viewing paradigms typically take the form of a series of trials characterized by the sudden onset of an image, followed by a few seconds of viewing, followed by the sudden offset of the image. Sudden onset may, in itself, influence inspection behavior. As discussed earlier, viewing patterns appear to change over time: there is inter-observer consistency in the locations fixated early in viewing, but this decreases with increasing viewing time (e.g., Parkhurst et al., 2002; Tatler et al., 2005). While this observation has given rise to a continuing debate about whether this arises because of an early dominance of salience followed by a switch toward more top-down control later in viewing, a more fundamental issue is what the implications of early differences are for the generalizability of findings. If viewing is different for the first few seconds after sudden scene onset (an observation that authors are in agreement about), then the selection criteria for these first few fixations are different from those later in viewing. The problem arises because there is no real-world analogue of the sudden onset of an entire scene, and it is known that the activity of neurons involved in target selection is very different for sudden onsets (Gottlieb et al., 1998). Even opening a door to a room is not like a sudden onset: here, the scene still emerges as the door opens. If we accept that sudden whole-scene onsets are peculiar to static scene paradigms, then the targeting decisions that underlie saccades made early in viewing periods may be specific to the sudden onset paradigms. Because such experiments typically only show scenes for a few seconds (in the region of 1–10 s in most studies), this could influence a sizeable fraction of the eye movements that are modeled.
It could be argued that the “purpose” of vision is very different when looking at a static scene than when engaging in real-world behavior. In natural tasks, a key goal of vision can be seen as extracting the visual information and coordinating the motor actions required to complete the task. However, when viewing photographic scenes, there is rarely a task that involves the active manipulation of objects in the environment. Rather, in static scene viewing, the task may be to search for a target, to remember the scene, or to make some judgment about the content of the scene. These classes of task are only a subset of the repertoire of behaviors we execute in the real world. Thus, the principles governing saccade targeting decisions in the tasks used in picture-viewing paradigms are most likely different from those used when engaged in active, real-world tasks.
Videos as surrogates for real-world settings
The shortcomings of static pictures as surrogates for real environments have been recognized by numerous investigators (e.g., Henderson, 2007; Shinoda, Hayhoe, & Shrivastava, 2001). As a result of this recognition, a growing number of studies are starting to use videos because these stimuli include dynamic information (e.g., Carmi & Itti, 2006; Itti, 2005; 't Hart et al., 2009). Dynamic features can be strong predictors of eye movement behavior (Itti, 2005). However, this may not generalize to natural behavior because the frequent editorial cuts that are found in many movie sequences present an unusual and artificial situation for the visual system. Editorial cuts result in memorial and oculomotor disruptions to normal scene perception (Hirose, Kennedy, & Tatler, 2010). Moreover, such cuts result in behavior that is unlike how we view continuous movies with no cuts (Dorr et al., 2010; 't Hart et al., 2009). When viewing continuous movies of a dynamic real-world environment, the predictive power of both static and dynamic feature cues was vanishingly small (Cristino & Baddeley, 2009). Thus, movie-style edited video clips may be problematic stimuli. It is also possible that the framing effects of the monitor continue to induce central biases to scene viewing that are ecologically invalid: while the central bias is weaker for continuous movies, it still remains and explains a considerable fraction of eye movement behavior (Cristino & Baddeley, 2009; Dorr et al., 2010; 't Hart et al., 2009).
A role for visual conspicuity?
It should be reiterated at this point that the original goal of conspicuity models was not really to explain eye movements but rather to explain attentional capture, evaluating this by using eye movements. In this respect, such models were not really designed to explain eye movements in general and should not be expected to generalize to natural behavior. There is a large literature on attentional and oculomotor capture that we will not review here. In general, the findings of this literature are mixed. There is good evidence that specific stimuli such as sudden onsets, new objects, or motion transients have substantial power to attract attention (Franconeri & Simons, 2003; Gibson, Folk, Theeuwes, & Kingstone, 2008; Irwin, Colcombe, Kramer, & Hahn, 2000; Lin, Franconeri, & Enns, 2008; Theeuwes & Godijn, 2001). It is less clear whether certain classes of stimuli attract attention in an obligatory fashion, independently of the subject's task set or ongoing cognitive goals (Jovancevic et al., 2006; Yantis, 1998). While much of natural behavior might be under task-driven control, there is clearly a need for a mechanism to capture attention and change the ongoing cognitive agenda. Many aspects of natural environments are unpredictable and there must be some mechanism to alert the observer to unexpected hazards. Our subjective impression that attention and gaze are reliably drawn to unusual stimuli or events in the environment argues for some mechanism like salience. It is a valid question whether salience models can work in these cases. The essential difficulty is that free viewing of static images is probably not a good paradigm either for attentional capture or for natural vision, as we have discussed. The problem in natural vision is that a stimulus that is salient in one context, such as peripheral motion with a stationary observer, may not be salient in another context, such as when the observer is moving and generating complex motion on the retina. To address this problem, Itti and Baldi (2006) suggested that “salient” events or locations are those that are unexpected or surprising, where surprise is defined as a statistical deviation from the recent history of visual stimuli. Surprising stimuli, therefore, correspond to statistical outliers in time, whereas salient stimuli are statistical outliers in space. A recent paper by Bruce and Tsotsos (2009) reflects this idea in the space domain by defining salience as a “surprisal” value or the extent to which a region differs from its neighborhood. Some kind of surprise mechanism is essential for attracting attention to stimuli that are important but not encompassed by the current task set. There is only a little work on the statistical basis for the formation of a surprise signal. Itti and Baldi conjecture that the visual system learns the statistics of images by estimating the distribution of parameters of probability distributions that can explain recent image feature data. In the context of video sequences, as subsequent image frames are processed, Bayesian inference updates the priors with the posterior from the previous frame. They measure surprise as the shift between the posterior and prior probabilities of model parameters. Itti and Baldi's model is a complex multi-parameter simulation of early visual processing and works on very short time scales (hundreds of milliseconds). Thus, it is unlikely to reflect the long-term memory factors involved in natural behavior.
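The surprise computation itself is simple to illustrate for a single channel. The sketch below is a simplification under stated assumptions: Itti and Baldi maintain Poisson–Gamma models over many spatiotemporal feature channels, whereas here one feature value is modeled as a Gaussian with known observation noise, which makes both the Bayesian update and the divergence closed-form.

```python
import numpy as np

def gaussian_surprise(prior_mu, prior_var, obs, obs_var):
    """Bayesian surprise for one feature channel: the KL divergence
    from the prior to the posterior after a single observation.
    Temporal outliers relative to recent history score highly."""
    # Conjugate update of a Gaussian mean with known observation noise.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mu = post_var * (prior_mu / prior_var + obs / obs_var)
    # KL(posterior || prior), converted to bits in the spirit of
    # Itti and Baldi's "wow" units.
    kl_nats = 0.5 * (post_var / prior_var
                     + (post_mu - prior_mu) ** 2 / prior_var
                     - 1.0 + np.log(prior_var / post_var))
    return kl_nats / np.log(2.0)
```

Because the prior is continually overwritten by recent posteriors, a model of this kind adapts within hundreds of milliseconds, which is precisely why it cannot capture the long-term, memory-based expectations discussed next.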
Most scenes are highly familiar and observers have the opportunity to build extensive long-term memory representations over thousands of fixations. Brockmole and Henderson (2005, 2008) and Matsukura, Brockmole, and Henderson (2009) showed that subjects are more likely to fixate changes in scenes when they have previously viewed the scene for a few seconds. Uke-Karacan and Hayhoe (2008) showed that several minutes' experience in a virtual environment led to increased fixations on changed objects in the scene. Thus, stimuli that are surprising with respect to a prior expectation might constitute a robust means of attracting attention.
It is, therefore, clear that there are circumstances in which conspicuity-based models of eye guidance and attention can provide explanations of human behavior. When the visual signal in the environment is large (as is the case in simple feature-based search arrays and sudden onset paradigms or when an unexpected event occurs), then this signal will drive eye movement behavior. It is an empirical question whether attentional capture by large signals, that is, the mechanisms of surprise, constitutes a significant portion of ordinary oculomotor behavior. Learned strategies such as searching for Stop signs at intersections can certainly deal with many of the vicissitudes of the natural world, but clearly some attention-getting mechanism is essential. Understanding how the visual world is coded in memory to form the basis of prior expectations and allow reliable detection of surprising stimuli is an important question that needs to be resolved. A related question is the extent to which mechanisms of surprise might be modulated by behavioral goals. For example, one can imagine that the visual system might have the task of looking for surprising stimuli as a priority in many circumstances, or alternatively, vision might only prioritize surprising stimuli when there is no other pressing demand. The answer to these questions would help determine the extent to which results from picture viewing might generalize to natural behavior.
Eye guidance in natural behavior
We have argued that the conspicuity-based theoretical models are unable to explain many aspects of human fixation behavior and that picture viewing (and perhaps movie viewing) is a problematic paradigm for understanding eye movement behavior. Given that a fundamental function of vision is to provide information necessary for survival, if we are to understand the principles that underlie fixation selection, we must consider eye movements in the context of behavioral goals, where the requirement is to seek out relevant information at the time when it is needed. Most contemporary models of fixation selection acknowledge the importance of accounting for cognitive control of eye movements. However, few engage with the need to consider visual selection as being fundamentally and intricately linked to action. One exception to this is Schneider's (1995) Visual Attention Model, which distinguishes “what” and “where” components of target selection, with the latter considering selection for action. Despite its conceptual and empirical strengths (Deubel & Schneider, 1996), the importance of selection for action in models of eye guidance has not featured prominently in more recent models.
Empirical evaluations show that conspicuity-based theoretical models lack explanatory power in the context of natural behavior (e.g., Rothkopf et al., 2007). Thus, we argue that conspicuity-based approaches are not a suitable theoretical framework for understanding eye movements in the context of natural behavior. The challenge, therefore, for this field is to develop a suitable theoretical alternative. Moreover, models that make empirically testable predictions of fixation selection are required. In the sections that follow, we first consider the key findings from studies of natural tasks that are common across multiple instances of behavior. Our aim in this section of the article is to bring together common findings from a range of different natural task settings in order to identify common principles for fixation selection rather than to provide extensive details on any one natural task. Understanding the common observations allows us to identify general principles that underlie eye movements in natural tasks. From these principles, it is clear that the issues that must be explained by any theory of natural eye guidance are rather different from those typically considered in current models. The principles identified here offer the essential elements from which theoretical models might be built.
Spatial coupling between gaze and behavioral goal
Cognitive control of eye movements was well established before the development of salience models (e.g., Buswell, 1935; Kowler, 1990; Yarbus, 1967). This case has been strengthened by more recent work in natural tasks. All studies of eye movements during natural behavior show that there is an intimate link between where we look and the information needed for the immediate task goals (Epelboim et al., 1995; Hayhoe et al., 2003; Land & Furneaux, 1997; Land et al., 1999; Patla & Vickers, 1997; Pelz & Canosa, 2001). The link between our behavioral goals and the allocation of overt visual attention is highlighted by the fact that when engaged in a natural task essentially all the fixations fall on task-relevant objects (Hayhoe et al., 2003; Land et al., 1999), whereas before beginning a task (such as sandwich making) the distribution of fixations between task-relevant and -irrelevant objects is about equal (Hayhoe et al., 2003; Rothkopf et al., 2007). The extent to which fixation placement is driven by the information-gathering requirements for an interaction with an object was demonstrated by Rothkopf et al. (2007) in an immersive virtual reality environment. Here, fixations on identical objects varied considerably depending upon whether the participant was attempting to approach or avoid the object (Figure 5). This result highlights the importance of understanding the function of each fixation for understanding fixation placement.
Similarity between different individuals
The intimate link between vision and action is reflected in the consistency that is observed between individuals who complete natural tasks. Different individuals show a high degree of consistency in where and when they look at informative locations while engaged in natural behaviors. Drivers look consistently at or close to the tangent point of a bend or the lane ahead, with around 50% of fixations made by three drivers falling within an area subtending only about 3 degrees in diameter (Land & Lee, 1994). Fixations on other pedestrians when walking are very consistent across individuals: despite a lack of any explicit instructions, there was a high degree of consistency in when and for how long oncoming pedestrians were fixated (Jovancevic-Misic & Hayhoe, 2009). When cutting a sandwich, subjects always fixate the initial point of contact with the knife and move their gaze along the locus of the cut, just ahead of the knife (Hayhoe et al., 2003). The similarity in fixation sequences of different individuals when taking the kettle to the sink to fill it is illustrated in Figure 6 (Land et al., 1999).
A similarly impressive degree of inter-observer consistency can be found when recording gaze behavior of observers watching a magician perform a trick. When making a cigarette and lighter “disappear,” observers consistently fixate certain locations at the crucial moments in the performance (Kuhn & Tatler, 2005). This trick is based on the simple principle of distracting the observer while first the lighter and then the cigarette is dropped. At these crucial points, the observers consistently look to the opposite hand to that being used to drop the object (Kuhn & Tatler, 2005; Tatler & Kuhn, 2007). This misdirection to the inappropriate hand is tightly controlled in space and time, so that about 300 ms before the object is dropped, most participants will be looking at the same (inappropriate) location (Tatler & Kuhn, 2007). Of course, the question must be asked as to how the magician ensures the consistent misdirection of the audience at these crucial moments. These authors have shown that it is the magician's own direction of gaze that is the key to successfully misdirecting the observer (Kuhn, Tatler, & Cole, 2009).
A clear implication of the spatial and temporal consistency that is found between participants in natural tasks is that the decisions about where and when to allocate gaze must be governed by the same underlying principles in different individuals. Given the role of eye movements in gathering information to accomplish tasks, it makes sense that fixation patterns between individuals should be similar, as they will reflect the physical and dynamic properties of the environment as well as common behavioral goals. This stability in fixation patterns makes natural behavior unexpectedly accessible to experimental investigation. The high degree of consistency found in gaze allocation in natural settings is in contrast to the often quite low consistencies found between individuals when viewing static scenes. Especially after a few seconds from the onset of a static scene, there is often very little consistency in where different observers fixate (Tatler et al., 2005). Consequently, one could argue that the principles underlying fixation selection during natural tasks may be more robust than those that researchers have tried to capture in models of eye guidance when viewing static scenes.
Timing of gaze shifts
A striking feature of natural behavior is that there is not only a tight spatial coupling between the eye and the target of the current motor act but also a tight temporal coupling between vision and action. This was elegantly demonstrated by Johansson et al. (2001), who measured the departure time of the eye relative to the hand as the subject maneuvered an object past an obstacle: gaze moved on to the next target just as the object cleared the obstacle. Similar time locking of fixations and actions has been observed in driving (Land & Lee, 1994; Land & Tatler, 2001), making tea or sandwiches (Hayhoe et al., 2003; Land et al., 1999), music sight reading (Furneaux & Land, 1999), walking (Patla & Vickers, 2003), and reading aloud (Buswell, 1920). The ubiquity of this eye–action temporal coupling underlines the necessity of considering the placement of the eyes in time as well as in space. Indeed, the correct temporal placement of the eyes may be more crucial to the successful completion of behaviors than precise spatial placement, and skilled performance may be as dependent upon the correct allocation of gaze in time as in space (Land & McLeod, 2000).
The roles of learning
Implicit in much of the research on natural tasks is the finding that people must learn what to look at and when (Chapman & Underwood, 1998; Land, 2004; Land & Furneaux, 1997; Land & Tatler, 2009). For example, in a virtual driving environment, Shinoda et al. (2001) asked participants to look for Stop signs while driving an urban route. Approximately 45% of fixations fell in the neighborhood of intersections during this task and, as might be expected from this, participants were more likely to detect Stop signs placed near intersections than those placed in the middle of a block. This result suggests that drivers have learned that traffic signs are more likely to appear around intersections and so preferentially allocate their gaze to these regions. At a more detailed level, people must learn the optimal locations for the specific information they need. For example, where on a kettle a subject will look depends on what they need to do with it. When waiting for it to boil, they look mainly at the fill level indicator and switch (Figure 4). When placing it on its base, fixations alternate between the bottom of the kettle and the fixings protruding from the base (on the work surface). When pouring water from the kettle, fixations are made to the water stream in the receiving vessel. People must learn not only the locations at which relevant information is to be found but also the order in which fixations must be made in order to accomplish the task. Thus, when making a sandwich, an individual must locate the peanut butter and the bread before picking them up, pick up the knife before spreading, and so on. This means that a complete understanding of eye movements in natural behavior will require an understanding of the way that tasks are learned and represented in the brain, much of which presumably occurs over long periods during development. In adult life, skills can be learned more rapidly because they build on related skills already acquired.
In a study that explored the development of eye–hand coordination in a novel task, Sailer, Flanagan, and Johansson (2005) used a mouse-like control task to show that initially the eyes lagged behind action, apparently providing feedback information about the success of the last maneuver. However, once skilled at this task (after about 20 min), the eyes led the movement of the mouse cursor systematically by about 0.4 s, anticipating the next goal of the cursor on the screen. Similarly, learner drivers fixate just ahead of the car when cornering, whereas more experienced drivers look into the bend and fixate points on the road that will be reached as much as 3 s later, thus anticipating any need for future action (Land, 2006; Land & Tatler, 2009).
In stable environments, the observer needs only to update the locations of items that are moved or to monitor items that are changing state. In dynamic environments, such as driving, walking, or sport, more complex properties must be learned. In walking, for example, humans need to know how pedestrians typically behave and how often to look at them. That humans do indeed learn such statistics was demonstrated by Jovancevic-Misic and Hayhoe (2009). In a real walking setting, they manipulated the probability that a “risky” pedestrian (i.e., one with a past record of attempting collisions) would veer onto a collision course, and this was accompanied by a rapid change in gaze allocation: subjects learned new priorities within a few encounters, looking both sooner and longer at potentially dangerous pedestrians. This finding generalizes earlier work, for example, by He and Kowler (1989), showing the sensitivity of saccades to stimulus probability.
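One simple way to picture such encounter-by-encounter statistical learning is a delta-rule update of the expected risk posed by each pedestrian, with gaze priority tracking the running estimate. The following sketch is illustrative only; the learning rate and outcome coding are our assumptions, not parameters estimated by Jovancevic-Misic and Hayhoe (2009).

```python
# Illustrative delta-rule sketch of how encounter-by-encounter experience
# could update the expected "risk" of a pedestrian, and hence gaze priority
# (cf. Jovancevic-Misic & Hayhoe, 2009). The learning rate and prior below
# are assumptions for illustration, not fitted values from that study.

LEARNING_RATE = 0.3

def update_risk(expected_risk, collided):
    """Move the risk estimate toward the outcome of the latest encounter."""
    outcome = 1.0 if collided else 0.0
    return expected_risk + LEARNING_RATE * (outcome - expected_risk)

risk = 0.1                            # prior belief: pedestrians rarely veer at you
for outcome in [True, True, True]:    # a newly "risky" pedestrian keeps veering
    risk = update_risk(risk, outcome)
    print(f"expected risk: {risk:.2f}")   # rises rapidly within a few encounters
```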
Further evidence for learning the dynamic properties of the environment comes from the fact that saccades are often proactive, that is, they are made to a location in a scene in advance of an expected event. In walking, subjects looked at risky pedestrians before they veered onto a collision course. In cricket, squash, and catching balls, players anticipate the bounce point of the ball by 100 ms or more (Land & McLeod, 2000). This ability to predict where the ball will bounce depends on previous experience of the ball's trajectory in combination with current sensory data. This suggests that observers have learned models of the dynamic properties of the world that can be used to position gaze in anticipation of a predicted event. Indeed, given neural delays between the eye and cortex, in time-critical behaviors such as driving and ball sports, action control must proceed on the basis of predictions rather than perceptions.
It is clear that the types and time scales of learning in these examples vary considerably. Any theoretical model must therefore be able to account for learning across this broad range.
Reward-based models of gaze allocation
If we are to place learning at the center of theoretical accounts of eye guidance, it is important to consider how it might be implemented in the brain. The reward system, which has been implicated in a variety of aspects of learning, offers a suitable system for implementing the learning that is required for deploying gaze in natural behavior.
Neural substrates for learning gaze allocation in task execution
It has become increasingly clear that the brain's internal reward mechanisms are intimately linked to the neural machinery controlling eye movements. Schultz and colleagues have shown that dopaminergic neurons in the basal ganglia signal the reward expected from an action: dopamine is released in anticipation of the outcome of a behavior, signaling expected reward (e.g., Schultz, Tremblay, & Hollerman, 2000). Sensitivity to reward is manifest throughout the saccadic eye movement circuitry. Caudate cell responses reflect both the target of an upcoming saccade and the reward expected after making the movement (Hikosaka, Nakamura, & Nakahara, 2006). Saccade-related areas in the cortex (LIP, FEF, SEF, and DLPFC) all exhibit sensitivity to reward (Dorris & Glimcher, 2004; Glimcher, 2003; Glimcher, Camerer, Fehr, & Poldrack, 2009; Platt & Glimcher, 1999; Stuphorn & Schall, 2006; Stuphorn, Taylor, & Schall, 2000; Sugrue, Corrado, & Newsome, 2004). The neurons involved in saccadic targeting respond in a graded manner to both the amount of expected reward and the probability of a reward in the period prior to execution of the response. Sensitivity to both these variables is critical for learning and, consequently, for linking fixation patterns to task demands. The cortical saccade-related areas converge on the caudate nucleus in the basal ganglia, and the cortical–basal ganglia–superior colliculus circuit appears to regulate the control of fixation and the timing of planned movements. Such regulation is a critical requirement for task control of fixations.
The relevance of the neurophysiological work on reward may not be immediately obvious for ordinary human behavior. In neurophysiological paradigms, usually a primary reward such as juice or a raisin is delivered after the animal performs an action. This, of course, does not happen in real life when one makes an eye movement. However, eye movements are for the purpose of obtaining information, and this information is used to achieve behavioral goals, such as making a sandwich, that are ultimately important for survival. Thus, visual information acquired during a fixation can be thought of as a secondary reward and can mediate learning of gaze patterns by virtue of its ultimate significance for adaptation and survival. Indeed, several researchers have quantified the intrinsic reward associated with looking at particular visual stimuli. Deaner, Khera, and Platt (2005) and Shepherd, Deaner, and Platt (2006) measured how much liquid reward monkeys were willing to give up in order to obtain visual information about members of their social group. In this case, liquid is the measurable, external equivalent of an internal reward resulting from gaze. Thus, the dopaminergic machinery appears to be intimately related to the sensitivity of eye movement target selection to behavioral outcomes.
Modeling eye movements using reward
The reward sensitivity of the eye movement circuitry provides the neural underpinnings for reinforcement learning models of behavior (Montague, Hyman, & Cohen, 2004; Schultz, 2000). The mathematics of reinforcement learning is potentially useful for understanding how complex gaze patterns might be generated (Sutton & Barto, 1998). Dopaminergic cells signal the reward expected from an action, and reinforcement learning models are pertinent because they allow an agent to learn what actions or action sequences will lead to reward in the future. Given a set of possible states, and actions that might be associated with those states, reinforcement learning algorithms allow an agent to learn a policy for selecting actions that will ultimately maximize reward.
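To make this concrete, a minimal tabular Q-learning sketch is given below (cf. Sutton & Barto, 1998). The states, gaze actions, and reward values are invented for illustration and are not taken from any model discussed here; the point is only to show how a gaze policy could be learned from experienced reward.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch (Sutton & Barto, 1998). States, actions,
# and rewards are hypothetical placeholders, not values from any study above.

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration
ACTIONS = ["fixate_path", "fixate_obstacle", "fixate_litter"]

Q = defaultdict(float)  # Q[(state, action)] -> learned estimate of future reward

def choose_action(state):
    """Epsilon-greedy policy: mostly exploit the best-valued action, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One temporal-difference backup toward reward plus discounted future value."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Example: one update after fixating near an obstacle and receiving reward.
s, s_next = "obstacle_near", "obstacle_cleared"
a = choose_action(s)
update(s, a, reward=1.0, next_state=s_next)
```

Over many such updates, actions that reliably precede reward acquire high Q values, so the learned policy comes to select the fixation that best serves the current state.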
There have been few attempts to model the eye movements observed in complex behavior. However, one such model, by Sprague et al. (2007; Figure 7), shows how a simulated agent in a virtual environment can learn to allocate gaze to avoid obstacles and control direction in walking (see also Ballard & Hayhoe, 2009; Rothkopf & Ballard, 2009; Rothkopf et al., 2007). The model assumes that the visual computations required in the real world can be broken down into a set of subtasks, or modules, such as controlling direction, avoiding obstacles, and so on, each associated with some reward value. For example, obtaining visual information that allows avoidance of an obstacle presumably provides secondary reward. These authors provide a computational account of how we can successfully distribute attention and gaze between such visual subtasks in a dynamic environment. Their chosen paradigm involves walking along a virtual path with three simultaneous tasks: stay on the path, avoid obstacles, and pick up “litter.” The model assumes that we can attend to only one location at any moment and that uncertainty about unattended tasks grows over time, carrying an attendant cost, because stale information can lead to suboptimal actions. Every 300 ms, the decision about which task to attend is made by evaluating the expected reward of switching attention to each task; eye movements are thus selected to maximize reward by reducing the uncertainty that is most costly to the agent. Framing the decision about where to look in terms of uncertainty reduction has been effective in explaining aspects of static scene viewing (Najemnik & Geisler, 2005, 2008; Renninger, Coughlan, & Verghese, 2005) as well as dynamic scene viewing.
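The scheduling principle can be sketched as follows. This is a schematic reading of the Sprague et al. (2007) account, not their implementation: the module names, linear uncertainty growth, growth rates, and reward values below are invented for illustration (their model tracks uncertainty with a Kalman filter over learned state values).

```python
# Schematic sketch of the scheduling principle in Sprague et al. (2007):
# uncertainty about each unattended subtask grows over time, and every 300 ms
# gaze goes to the module whose expected cost of continued neglect
# (uncertainty x reward at stake) is largest. All numbers are invented.

modules = {
    "follow_path":    {"uncertainty": 0.0, "growth": 0.20, "reward_at_stake": 1.0},
    "avoid_obstacle": {"uncertainty": 0.0, "growth": 0.15, "reward_at_stake": 3.0},
    "pick_up_litter": {"uncertainty": 0.0, "growth": 0.10, "reward_at_stake": 0.5},
}

def step(dt=0.3):
    """Advance one 300-ms decision cycle and return the module given gaze."""
    for m in modules.values():
        m["uncertainty"] += m["growth"] * dt   # uncertainty grows while unattended
    # Expected loss from neglecting a module scales with uncertainty and reward.
    target = max(modules,
                 key=lambda k: modules[k]["uncertainty"] * modules[k]["reward_at_stake"])
    modules[target]["uncertainty"] = 0.0       # a fixation resolves that module's state
    return target

for t in range(6):
    print(f"{t * 0.3:.1f} s -> {step()}")
```

Run with these toy values, gaze is devoted mostly to the high-stakes obstacle module, with periodic checks of the path, which captures the qualitative behavior the model was built to explain.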
Reward is a central component of recent applications of statistical decision theory to understanding the control of body movements. In this approach, the concepts of reward (the costs and benefits of an action's outcome), uncertainty (of both sensory state and outcome), and prior knowledge (probability distributions associated with world states) are central to understanding sensory-motor behavior (e.g., Tassinari, Hudson, & Landy, 2006; Trommershäuser, Maloney, & Landy, 2008). When reward is externally defined (e.g., by monetary payoff), subjects making rapid hand movements learn a complicated, spatially distributed reward landscape and behave in a nearly optimal manner to maximize reward (e.g., Seydell, McCann, Trommershäuser, & Knill, 2008; Trommershäuser, Maloney, & Landy, 2003). Similar targeting experiments using saccadic eye movements with monetary rewards and losses showed that reward affected saccadic targeting, although stimulus strength also affected the movements, particularly at short latency (Stritzke, Trommershäuser, & Gegenfurtner, 2009). Further evidence for the role of reward in saccade targeting comes from Navalpakkam, Koch, Rangel, and Perona (2010), who showed that subjects' saccade behavior in visual search is consistent with an ideal Bayesian observer that takes into account both rewards and stimulus detectability. Thus, it is plausible that the patterns of eye movements observed in the natural world take into account both the reward structure of the environment and stimulus uncertainty (Trommershäuser, Glimcher, & Gegenfurtner, 2009).
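The logic of these decision-theoretic experiments can be illustrated with a toy expected-gain computation in the spirit of Trommershäuser et al. (2003). The geometry, noise level, and payoffs below are invented for illustration; the point is that, given motor noise, the gain-maximizing aim point shifts away from a penalty region rather than sitting at the target center.

```python
import numpy as np

# Toy expected-gain computation from statistical decision theory
# (cf. Trommershäuser et al., 2003): choose an aim point that maximizes
# expected reward given Gaussian motor noise, a rewarded target region, and
# an overlapping penalty region. All geometry and payoffs are invented.

rng = np.random.default_rng(0)
SIGMA = 0.5                                   # motor noise (endpoint scatter, SD)

def expected_gain(aim, n=100_000):
    x = rng.normal(aim, SIGMA, n)             # simulated movement endpoints
    gain = np.where(np.abs(x) < 1.0, 1.0, 0.0)           # +1 inside target [-1, 1]
    gain += np.where(np.abs(x + 1.5) < 1.0, -5.0, 0.0)   # -5 inside penalty [-2.5, -0.5]
    return gain.mean()

aims = np.linspace(-1.0, 2.0, 61)
best = max(aims, key=expected_gain)
print(f"optimal aim point shifts away from the penalty region: {best:.2f}")
```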
Models that use reward and uncertainty as central components are in their relative infancy and are not yet at the stage of providing a computational account of eye movements across multiple instances of natural behavior. However, such models offer the potential to capture ubiquitous aspects of fixation selection that cannot be explained by conspicuity-based models. For example, the common tendency to look into empty space in anticipation of an event is deeply problematic for conspicuity models but can be explained if gaze allocation is based on expected (secondary) reward. Developing eye guidance models based on reward is a difficult endeavor because it essentially requires a model of task execution; moreover, as we have seen, the types and time scales of learning that must be modeled vary considerably. At present, models based on reward focus on the immediate time scale of the current behavioral situation but reflect the outcome of learning over longer time scales. Reinforcement learning, for example, presumably operates on a developmental time scale, so adults' gaze patterns would reflect the end product of such learning. Many fundamental questions still require empirical support. For example, is it appropriate to model behavior as a set of semi-independent subtasks? This assumption of behavioral modules is critical for making the problem computationally tractable (Rothkopf & Ballard, 2010), but it is not known whether it is a good model of sensory-motor behavior. Nevertheless, it is clear that reward is intrinsic to many aspects of cortical function (Glimcher et al., 2009), so the reward-based approach seems likely to provide a key building block from which to develop future theories and models of gaze behavior.
Conclusions
Investigation of eye guidance in scenes has been driven largely by studies of static scene viewing. The latest models of this behavior can be thought of as extensions of the image salience framework, in which a core bottom-up mode of looking is modulated by various high-level constraints. We argue that the basic assumptions at the heart of such studies are problematic if we wish to generalize these models to how gaze is allocated in natural behavior. That is, models developed from static scene-viewing paradigms may be adequate models of how we look at pictures but are unlikely to generalize to gaze behavior in other situations. Developing computational models of gaze allocation that can generalize across many instances of natural behavior is a difficult goal. However, studies of gaze selection in natural behavior already reveal a consistent set of principles underlying eye guidance, involving behavioral relevance (reward), uncertainty about the state of the environment, and learned models of the environment (priors). These factors control the decision mechanisms that govern what we should attend to on the basis of where we will gain information for fulfilling the current behavioral goals.
Acknowledgments
The authors would like to thank Werner Schneider and an anonymous reviewer for their helpful comments and suggestions. We thank Brian Sullivan for his comments on an earlier draft. Work in this article was supported by two NIH Grants (NIH R01 EY05729 and NIH R01 EY019174) and two grants from the Leverhulme Trust (F/00 143/O and F/00 143/J). We thank Sharon Scrafton for supplying the tea making footage used in the icon for this article and Sarah Walls for contributing the eye movement data used in Figure 3.
Footnotes
Commercial relationships: none.
Contributor Information
Benjamin W. Tatler, School of Psychology, University of Dundee, Dundee, UK
Mary M. Hayhoe, Center for Perceptual Systems, University of Texas at Austin, Austin, TX, USA
Michael F. Land, School of Life Sciences, University of Sussex, Sussex, UK
Dana H. Ballard, Computer Science Department, University of Texas at Austin, Austin, TX, USA
References
- Bahill AT, Adler D, Stark L. Most naturally occurring human saccades have magnitudes of 15 degrees or less. Investigative Ophthalmology. 1975;14:468–469.
- Bair W, O'Keefe LP. The influence of fixational eye movements on the response of neurons in area MT of the macaque. Visual Neuroscience. 1998;15:779–786. doi:10.1017/s0952523898154160.
- Ballard DH, Hayhoe MM. Modeling the role of task in the control of gaze. Visual Cognition. 2009;17:1185–1204. doi:10.1080/13506280902978477.
- Becker W. The control of eye movements in the saccadic system. Bibliotheca Ophthalmologica. 1972;82:233–243.
- Becker W. Saccades. In: Carpenter RHS, editor. Vision & visual dysfunction: Eye movements. Vol. 8. Macmillan; Basingstoke, UK: 1991. pp. 95–137.
- Bichot NP, Schall JD. Effects of similarity and history on neural mechanisms of visual selection. Nature Neuroscience. 1999;2:549–554. doi:10.1038/9205.
- Brockmole JR, Henderson JM. Prioritization of new objects in real-world scenes: Evidence from eye movements. Journal of Experimental Psychology: Human Perception and Performance. 2005;31:857–868. doi:10.1037/0096-1523.31.5.857.
- Brockmole JR, Henderson JM. Prioritizing new objects for eye fixation in real-world scenes: Effects of object-scene consistency. Visual Cognition. 2008;16:375–390.
- Bruce NDB, Tsotsos JK. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision. 2009;9(3):5, 1–24. doi:10.1167/9.3.5. http://www.journalofvision.org/content/9/3/5
- Bundesen C. A theory of visual attention. Psychological Review. 1990;97:523–547. doi:10.1037/0033-295x.97.4.523.
- Buswell GT. An experimental study of the eye–voice span in reading. Chicago University Press; Chicago: 1920.
- Buswell GT. How people look at pictures: A study of the psychology of perception in art. University of Chicago Press; Chicago: 1935.
- Carmi R, Itti L. Visual causes versus correlates of attentional selection in dynamic scenes. Vision Research. 2006;46:4333–4345. doi:10.1016/j.visres.2006.08.019.
- Carpenter PA, Just MA. Eye fixations during mental rotation. In: Senders J, Monty R, Fisher D, editors. Eye movements and psychological functions II. Erlbaum; Hillsdale, NJ: 1978. pp. 115–133.
- Carpenter RHS. Movements of the eyes. 2nd ed. Pion; London: 1988.
- Carpenter RHS. Express saccades: Is bimodality a result of the order of stimulus presentation? Vision Research. 2001;41:1145–1151. doi:10.1016/s0042-6989(01)00007-4.
- Chapman PR, Underwood G. Visual search of driving situations: Danger and experience. Perception. 1998;27:951–964. doi:10.1068/p270951.
- Cristino F, Baddeley RJ. The nature of the visual representations involved in eye movements when walking down the street. Visual Cognition. 2009;17:880–903.
- Deaner RO, Khera AV, Platt ML. Monkeys pay per view: Adaptive valuation of social images by rhesus macaques. Current Biology. 2005;15:543–548. doi:10.1016/j.cub.2005.01.044.
- Deubel H, Schneider WX. Saccade target selection and object recognition: Evidence for a common attentional mechanism. Vision Research. 1996;36:1827–1837. doi:10.1016/0042-6989(95)00294-4.
- Deubel H, Wolf W, Hauske G. The evaluation of the oculomotor error signal. In: Gale A, Johnson F, editors. Theoretical and applied aspects of eye movement research. North Holland; Amsterdam, The Netherlands: 1984. pp. 55–62.
- Dorr M, Martinetz T, Gegenfurtner KR, Barth E. Variability of eye movements when viewing dynamic natural scenes. Journal of Vision. 2010;10(10):28, 1–17. doi:10.1167/10.10.28. http://www.journalofvision.org/content/10/10/28
- Dorris MC, Glimcher PW. Activity in posterior parietal cortex is correlated with the subjective desirability of an action. Neuron. 2004;44:365–378. doi:10.1016/j.neuron.2004.09.009.
- Droll JA, Gigone K, Hayhoe MM. Learning where to direct gaze during change detection. Journal of Vision. 2007;7(14):6, 1–12. doi:10.1167/7.14.6. http://www.journalofvision.org/content/7/14/6
- Droll JA, Hayhoe MM, Triesch J, Sullivan BT. Task demands control acquisition and storage of visual information. Journal of Experimental Psychology: Human Perception and Performance. 2005;31:1416–1438. doi:10.1037/0096-1523.31.6.1416.
- Ehinger KA, Hidalgo-Sotelo B, Torralba A, Oliva A. Modeling search for people in 900 scenes: A combined source model of eye guidance. Visual Cognition. 2009;17:945–978. doi:10.1080/13506280902834720.
- Einhäuser W, König P. Does luminance-contrast contribute to a saliency map for overt visual attention? European Journal of Neuroscience. 2003;17:1089–1097. doi:10.1046/j.1460-9568.2003.02508.x.
- Einhäuser W, Rutishauser U, Koch C. Task-demands can immediately reverse the effects of sensory-driven saliency in complex visual stimuli. Journal of Vision. 2008;8(2):2, 1–19. doi:10.1167/8.2.2. http://www.journalofvision.org/content/8/2/2
- Einhäuser W, Spain M, Perona P. Objects predict fixations better than early saliency. Journal of Vision. 2008;8(14):18, 1–26. doi:10.1167/8.14.18. http://www.journalofvision.org/content/8/14/18
- Epelboim JL, Steinman RM, Kowler E, Edwards M, Pizlo Z, Erkelens CJ, et al. The function of visual search and memory in sequential looking tasks. Vision Research. 1995;35:3401–3422. doi:10.1016/0042-6989(95)00080-x.
- Farrell S, Ludwig CJH, Ellis LA, Gilchrist ID. The influence of environmental statistics on inhibition of saccadic return. Proceedings of the National Academy of Sciences. 2010;107:929–934. doi:10.1073/pnas.0906845107.
- Fecteau JH, Munoz DP. Salience, relevance, and firing: A priority map for target selection. Trends in Cognitive Sciences. 2006;10:382–390. doi:10.1016/j.tics.2006.06.011.
- Foulsham T, Underwood G. What can saliency models predict about eye movements? Spatial and sequential aspects of fixations during encoding and recognition. Journal of Vision. 2008;8(2):6, 1–17. doi:10.1167/8.2.6. http://www.journalofvision.org/content/8/2/6
- Franconeri SL, Simons DJ. Moving and looming stimuli capture attention. Perception & Psychophysics. 2003;65:999–1010. doi:10.3758/bf03194829.
- Furneaux S, Land MF. The effects of skill on the eye–hand span during musical sight-reading. Proceedings of the Royal Society of London B: Biological Sciences. 1999;266:2435–2440. doi:10.1098/rspb.1999.0943.
- Gajewski DA, Pearson AM, Mack ML, Bartlett FN, Henderson JM. Human gaze control in real world search. In: Paletta L, Tsotsos JK, Rome E, Humphreys GW, editors. Attention and performance in computational vision. Springer-Verlag; Heidelberg, Germany: 2005. pp. 83–99.
- Gibson B, Folk C, Theeuwes J, Kingstone A. Introduction to special issue on attentional capture. Visual Cognition. 2008;16:145–154.
- Gilchrist ID, Harvey M. Evidence for a systematic component within scan paths in visual search. Visual Cognition. 2006;14:704–715.
- Gilchrist ID, North A, Hood B. Is visual search really like foraging? Perception. 2001;30:1459–1464. doi:10.1068/p3249.
- Glimcher P. The neurobiology of visual-saccadic decision making. Annual Review of Neuroscience. 2003;26:133–179. doi:10.1146/annurev.neuro.26.010302.081134.
- Glimcher P, Camerer C, Fehr E, Poldrack R. Neuroeconomics: Decision making and the brain. Academic Press; London: 2009.
- Gottlieb JP, Kusunoki M, Goldberg ME. The representation of visual salience in monkey parietal cortex. Nature. 1998;391:481–484. doi:10.1038/35135.
- Hayhoe MM, Bensinger DG, Ballard DH. Task constraints in visual working memory. Vision Research. 1998;38:125–137. doi:10.1016/s0042-6989(97)00116-8.
- Hayhoe MM, Shrivastava A, Mruczek R, Pelz JB. Visual memory and motor planning in a natural task. Journal of Vision. 2003;3(1):6, 49–63. doi:10.1167/3.1.6. http://www.journalofvision.org/content/3/1/6
- He PY, Kowler E. The role of location probability in the programming of saccades: Implications for center-of-gravity tendencies. Vision Research. 1989;29:1165–1181. doi:10.1016/0042-6989(89)90063-1.
- Henderson JM. Human gaze control in real-world scene perception. Trends in Cognitive Sciences. 2003;7:498–504. doi:10.1016/j.tics.2003.09.006.
- Henderson JM. Eye movements. In: Senior C, Russell T, Gazzaniga M, editors. Methods in mind. MIT Press; Cambridge, MA: 2006. pp. 171–191.
- Henderson JM. Regarding scenes. Current Directions in Psychological Science. 2007;16:219–222.
- Henderson JM, Brockmole JR, Castelhano MS, Mack ML. Visual saliency does not account for eye movements during search in real-world scenes. In: van Gompel RPG, Fischer MH, Murray WS, Hill RL, editors. Eye movements: A window on mind and brain. Elsevier; Oxford, UK: 2007. pp. 537–562.
- Henderson JM, Malcolm GL, Schandl C. Searching in the dark: Cognitive relevance drives attention in real-world scenes. Psychonomic Bulletin & Review. 2009;16:850–856. doi:10.3758/PBR.16.5.850.
- Henderson JM, Pierce GL. Eye movements during scene viewing: Evidence for mixed control of fixation durations. Psychonomic Bulletin & Review. 2008;15:566–573. doi:10.3758/pbr.15.3.566.
- Henderson JM, Smith TJ. How are eye fixation durations controlled during scene viewing? Further evidence from a scene onset delay paradigm. Visual Cognition. 2009;17:1055–1082.
- Hikosaka O, Nakamura K, Nakahara H. Basal ganglia orient eyes to reward. Journal of Neurophysiology. 2006;95:567–584. doi:10.1152/jn.00458.2005.
- Hirose Y, Kennedy A, Tatler BW. Perception and memory across viewpoint changes in moving images. Journal of Vision. 2010;10(4):2, 1–19. doi:10.1167/10.4.2. http://www.journalofvision.org/content/10/4/2
- Hooge ITC, Over EAB, van Wezel RJA, Frens MA. Inhibition of return is not a foraging facilitator in saccadic search and free viewing. Vision Research. 2005;45:1901–1908. doi:10.1016/j.visres.2005.01.030.
- Irwin DE, Colcombe AM, Kramer AF, Hahn S. Attentional and oculomotor capture by onset, luminance and color singletons. Vision Research. 2000;40:1443–1458. doi:10.1016/s0042-6989(00)00030-4.
- Itti L. Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes. Visual Cognition. 2005;12:1093–1123.
- Itti L, Baldi P. Bayesian surprise attracts human attention. In: Weiss Y, Schölkopf B, Platt J, editors. Advances in neural information processing systems (NIPS 2005). Vol. 18. MIT Press; Cambridge, MA: 2006. pp. 547–554.
- Itti L, Koch C. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research. 2000;40:1489–1506. doi:10.1016/s0042-6989(99)00163-7.
- Itti L, Koch C, Niebur E. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998;20:1254–1259.
- Johansson RS, Westling GR, Bäckström A, Flanagan JR. Eye–hand coordination in object manipulation. Journal of Neuroscience. 2001;21:6917–6932. doi:10.1523/JNEUROSCI.21-17-06917.2001.
- Jovancevic J, Sullivan B, Hayhoe M. Control of attention and gaze in complex environments. Journal of Vision. 2006;6(12):9, 1431–1450. doi:10.1167/6.12.9. http://www.journalofvision.org/content/6/12/9
- Jovancevic-Misic J, Hayhoe M. Adaptive gaze control in natural environments. Journal of Neuroscience. 2009;29:6234–6238. doi:10.1523/JNEUROSCI.5570-08.2009.
- Kanan C, Tong MH, Zhang LY, Cottrell GW. SUN: Top-down saliency using natural statistics. Visual Cognition. 2009;17:979–1003. doi:10.1080/13506280902771138.
- Kapoula Z, Robinson DA. Saccadic undershoot is not inevitable: Saccades can be accurate. Vision Research. 1986;26:735–743. doi:10.1016/0042-6989(86)90087-8.
- Klein RM. Does oculomotor readiness mediate cognitive control of visual attention? In: Nickerson RS, editor. Attention and performance VIII. Lawrence Erlbaum; Hillsdale, NJ: 1980. pp. 259–276.
- Klein RM. Inhibition of return. Trends in Cognitive Sciences. 2000;4:138–147. doi:10.1016/s1364-6613(00)01452-2.
- Klein RM, MacInnes J. Inhibition of return is a foraging facilitator in visual search. Psychological Science. 1999;10:346–352.
- Koch C, Ullman S. Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology. 1985;4:219–227.
- Kowler E. The role of visual and cognitive processes in the control of eye movement. In: Kowler E, editor. Eye movements and their role in visual and cognitive processes. Elsevier; Amsterdam, The Netherlands: 1990. pp. 1–70.
- Kuhn G, Tatler BW. Magic and fixation: Now you don't see it, now you do. Perception. 2005;34:1155–1161. doi:10.1068/p3409bn1.
- Kuhn G, Tatler BW, Cole GG. You look where I look! Effect of gaze cues on overt and covert attention in misdirection. Visual Cognition. 2009;17:925–944.
- Land MF. The coordination of rotations of the eyes, head and trunk in saccadic turns produced in natural situations. Experimental Brain Research. 2004;159:151–160. doi:10.1007/s00221-004-1951-9.
- Land MF. Eye movements and the control of actions in everyday life. Progress in Retinal and Eye Research. 2006;25:296–324. doi:10.1016/j.preteyeres.2006.01.002.
- Land MF, Furneaux S. The knowledge base of the oculomotor system. Philosophical Transactions of the Royal Society of London B: Biological Sciences. 1997;352:1231–1239. doi:10.1098/rstb.1997.0105.
- Land MF, Lee DN. Where we look when we steer. Nature. 1994;369:742–744. doi:10.1038/369742a0.
- Land MF, McLeod P. From eye movements to actions: How batsmen hit the ball. Nature Neuroscience. 2000;3:1340–1345. doi:10.1038/81887.
- Land MF, Mennie N, Rusted J. The roles of vision and eye movements in the control of activities of daily living. Perception. 1999;28:1311–1328. doi:10.1068/p2935.
- Land MF, Tatler BW. Steering with the head: The visual strategy of a racing driver. Current Biology. 2001;11:1215–1220. doi:10.1016/s0960-9822(01)00351-7.
- Land MF, Tatler BW. Looking and acting: Vision and eye movements in natural behaviour. Oxford University Press; Oxford, UK: 2009.
- Lappe M, Pekel M, Hoffmann KP. Optokinetic eye movements elicited by radial optic flow in the macaque monkey. Journal of Neurophysiology. 1998;79:1461–1480. doi:10.1152/jn.1998.79.3.1461.
- Lee SP, Badler JB, Badler NI. Eyes alive. ACM Transactions on Graphics. 2002;21:637–644.
- Li ZP. A saliency map in primary visual cortex. Trends in Cognitive Sciences. 2002;6:9–16. doi:10.1016/s1364-6613(00)01817-9.
- Lin JY, Franconeri S, Enns JT. Objects on a collision path with the observer demand attention. Psychological Science. 2008;19:686–692. doi:10.1111/j.1467-9280.2008.02143.x.
- Mannan SK, Ruddock KH, Wooding DS. Fixation sequences made during visual examination of briefly presented 2D images. Spatial Vision. 1997;11:157–178. doi:10.1163/156856897x00177.
- Massen C. Parallel programming of exogenous and endogenous components in the antisaccade task. Quarterly Journal of Experimental Psychology A: Human Experimental Psychology. 2004;57:475–498. doi:10.1080/02724980343000341.
- Matsukura M, Brockmole JR, Henderson JM. Overt attentional prioritization of new objects and feature changes during real-world scene viewing. Visual Cognition. 2009;17:835–855.
- Mazer JA, Gallant JL. Goal-related activity in V4 during free viewing visual search: Evidence for a ventral stream visual salience map. Neuron. 2003;40:1241–1250. doi:10.1016/s0896-6273(03)00764-5.
- McPeek RM, Keller EL. Superior colliculus activity related to concurrent processing of saccade goals in a visual search task. Journal of Neurophysiology. 2002;87:1805–1815. doi:10.1152/jn.00501.2001.
- Moeller GU, Kayser C, Knecht F, König P. Interactions between eye movement systems in cats and humans. Experimental Brain Research. 2004;157:215–224. doi:10.1007/s00221-004-1835-z.
- Montague PR, Hyman SE, Cohen JD. Computational roles for dopamine in behavioral control. Nature. 2004;431:760–767. doi:10.1038/nature03015.
- Najemnik J, Geisler WS. Optimal eye movement strategies in visual search. Nature. 2005;434:387–391. doi:10.1038/nature03390.
- Najemnik J, Geisler WS. Eye movement statistics in humans are consistent with an optimal search strategy. Journal of Vision. 2008;8(3):4, 1–14. doi:10.1167/8.3.4. http://www.journalofvision.org/content/8/3/4
- Navalpakkam V, Itti L. Modeling the influence of task on attention. Vision Research. 2005;45:205–231. doi:10.1016/j.visres.2004.07.042.
- Navalpakkam V, Koch C, Rangel A, Perona P. Optimal reward harvesting in complex perceptual environments. Proceedings of the National Academy of Sciences of the United States of America. 2010;107:5232–5237. doi:10.1073/pnas.0911972107.
- Nuthmann A, Smith TJ, Engbert R, Henderson JM. CRISP: A computational model of fixation durations in scene viewing. Psychological Review. 2010;117:382–405. doi:10.1037/a0018924.
- Nyström M, Holmqvist K. Semantic override of low-level features in image viewing: Both initially and overall. Journal of Eye Movement Research. 2008;2:1–11.
- Parkhurst DJ, Law K, Niebur E. Modeling the role of salience in the allocation of overt visual attention. Vision Research. 2002;42:107–123. doi:10.1016/s0042-6989(01)00250-4.
- Patla AE, Vickers JN. Where and when do we look as we approach and step over an obstacle in the travel path? Neuroreport. 1997;8:3661–3665. doi:10.1097/00001756-199712010-00002.
- Patla AE, Vickers JN. How far ahead do we look when required to step on specific locations in the travel path during locomotion? Experimental Brain Research. 2003;148:133–138. doi:10.1007/s00221-002-1246-y.
- Pelz JB, Canosa R. Oculomotor behavior and perceptual strategies in complex tasks. Vision Research. 2001;41:3587–3596. doi:10.1016/s0042-6989(01)00245-0.
- Peters RJ, Iyer A, Itti L, Koch C. Components of bottom-up gaze allocation in natural images. Vision Research. 2005;45:2397–2416. doi:10.1016/j.visres.2005.03.019.
- Platt ML, Glimcher PW. Neural correlates of decision variables in parietal cortex. Nature. 1999;400:233–238. doi:10.1038/22268.
- Posner MI, Cohen Y. Components of visual orienting. In: Bouma H, Bouwhuis DG, editors. Attention and performance X: Control of language processes. Erlbaum; Hillsdale, NJ: 1984. pp. 531–554.
- Prablanc C, Jeannerod M. Corrective saccades: Dependence on retinal reafferent signals. Vision Research. 1975;15:465–469. doi:10.1016/0042-6989(75)90022-x.
- Reinagel P, Zador AM. Natural scene statistics at the centre of gaze. Network: Computation in Neural Systems. 1999;10:341–350.
- Renninger LW, Coughlan J, Verghese P. An information maximization model of eye movements. In: Saul LK, Weiss Y, Bottou L, editors. Advances in neural information processing systems. Vol. 17. MIT Press; Cambridge, MA: 2005. pp. 1121–1128.
- Renninger LW, Verghese P, Coughlan J. Where to look next? Eye movements reduce local uncertainty. Journal of Vision. 2007;7(3):6, 1–17. doi:10.1167/7.3.6. http://www.journalofvision.org/content/7/3/6
- Robinson DL, Petersen SE. The pulvinar and visual salience. Trends in Neurosciences. 1992;15:127–132. doi:10.1016/0166-2236(92)90354-b.
- Rothkopf CA, Ballard DH. Image statistics at the point of gaze during human navigation. Visual Neuroscience. 2009;26:81–92. doi:10.1017/S0952523808080978.
- Rothkopf CA, Ballard DH. Credit assignment in multiple goal embodied visuomotor behavior. Frontiers in Psychology. 2010;1:173. doi:10.3389/fpsyg.2010.00173.
- Rothkopf CA, Ballard DH, Hayhoe MM. Task and context determine where you look. Journal of Vision. 2007;7(14):16, 1–20. doi:10.1167/7.14.16. http://www.journalofvision.org/content/7/14/16
- Sailer U, Flanagan JR, Johansson RS. Eye–hand coordination during learning of a novel visuomotor task. Journal of Neuroscience. 2005;25:8833–8842. doi:10.1523/JNEUROSCI.2658-05.2005.
- Schneider WX. VAM: A neuro-cognitive model for visual attention control of segmentation, object recognition, and space-based motor action. Visual Cognition. 1995;2:331–375.
- Schultz W. Multiple reward signals in the brain. Nature Reviews Neuroscience. 2000;1:199–207. doi:10.1038/35044563.
- Schultz W, Tremblay L, Hollerman JR. Reward processing in primate orbitofrontal cortex and basal ganglia. Cerebral Cortex. 2000;10:272–283. doi:10.1093/cercor/10.3.272.
- Seydell A, McCann BC, Trommershäuser J, Knill DC. Learning stochastic reward distributions in a speeded pointing task. Journal of Neuroscience. 2008;28:4356–4367. doi:10.1523/JNEUROSCI.0647-08.2008.
- Shepherd SV, Deaner RO, Platt ML. Social status gates social attention in monkeys. Current Biology. 2006;16:119–120. doi:10.1016/j.cub.2006.02.013.
- Shinoda H, Hayhoe MM, Shrivastava A. What controls attention in natural environments? Vision Research. 2001;41:3535–3545. doi:10.1016/s0042-6989(01)00199-7.
- Smith TJ, Henderson JM. Facilitation of return during scene viewing. Visual Cognition. 2009;17:1083–1108.
- Sprague N, Ballard DH, Robinson A. Modeling embodied visual behaviors. ACM Transactions on Applied Perception. 2007;4:11.
- Stritzke M, Trommershäuser J, Gegenfurtner KR. Effects of salience and reward information during saccadic decisions under risk. Journal of the Optical Society of America A. 2009;26:B1–B13. doi:10.1364/JOSAA.26.0000B1.
- Stuphorn V, Schall JD. Executive control of countermanding saccades by the supplementary eye field. Nature Neuroscience. 2006;9:925–931. doi:10.1038/nn1714.
- Stuphorn V, Taylor TL, Schall JD. Performance monitoring by the supplementary eye field. Nature. 2000;408:857–860. doi:10.1038/35048576.
- Sugrue LP, Corrado GS, Newsome WT. Matching behavior and the representation of value in the parietal cortex. Science. 2004;304:1782–1787. doi:10.1126/science.1094765.
- Sutton R, Barto A. Reinforcement learning: An introduction. MIT Press; Cambridge, MA: 1998.
- 't Hart BM, Vockeroth J, Schumann F, Bartl K, Schneider E, König P, et al. Gaze allocation in natural stimuli: Comparing free exploration to head-fixed viewing conditions. Visual Cognition. 2009;17:1132–1158.
- Tassinari H, Hudson TE, Landy MS. Combining priors and noisy visual cues in a rapid pointing task. Journal of Neuroscience. 2006;26:10154–10163. doi:10.1523/JNEUROSCI.2779-06.2006.
- Tatler BW. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision. 2007;7(14):4, 1–17. doi:10.1167/7.14.4. http://www.journalofvision.org/content/7/14/4
- Tatler BW. Eye guidance and natural scenes. Psychology Press; Hove, UK: 2009.
- Tatler BW, Baddeley RJ, Gilchrist ID. Visual correlates of fixation selection: Effects of scale and time. Vision Research. 2005;45:643–659. doi:10.1016/j.visres.2004.09.017.
- Tatler BW, Baddeley RJ, Vincent BT. The long and the short of it: Spatial statistics at fixation vary with saccade amplitude and task. Vision Research. 2006;46:1857–1862. doi:10.1016/j.visres.2005.12.005.
- Tatler BW, Kuhn G. Don't look now: The magic of misdirection. In: van Gompel RPG, Fischer MH, Murray WS, Hill RL, editors. Eye movements: A window on mind and brain. Elsevier; Oxford, UK: 2007. pp. 697–714.
- Tatler BW, Vincent BT. Systematic tendencies in scene viewing. Journal of Eye Movement Research. 2008;2:1–18.
- Tatler BW, Vincent BT. The prominence of behavioural biases in eye guidance. Visual Cognition. 2009;17:1029–1054.
- Taylor JG, Cutsuridis V. Saliency, attention, active visual search, and picture scanning. Cognitive Computation. 2011;3:1–3.
- Theeuwes J, Godijn R. Attentional and oculomotor capture. In: Folk C, Gibson B, editors. Attraction, distraction, and action: Multiple perspectives on attentional capture. Elsevier; Amsterdam, The Netherlands: 2001. pp. 121–150.
- Thompson KG, Bichot NP. A visual salience map in the primate frontal eye field. Progress in Brain Research. 2005;147:249–262. doi:10.1016/S0079-6123(04)47019-8.
- Torralba A, Oliva A, Castelhano MS, Henderson JM. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review. 2006;113:766–786. doi:10.1037/0033-295X.113.4.766.
- Treisman AM, Gelade G. A feature-integration theory of attention. Cognitive Psychology. 1980;12:97–136. doi:10.1016/0010-0285(80)90005-5.
- Trommershäuser J, Glimcher PW, Gegenfurtner KR. Visual processing, learning and feedback in the primate eye movement system. Trends in Neurosciences. 2009;32:583–590. doi:10.1016/j.tins.2009.07.004.
- Trommershäuser J, Maloney LT, Landy MS. Statistical decision theory and the selection of rapid, goal-directed movements. Journal of the Optical Society of America A. 2003;20:1419–1433. doi:10.1364/josaa.20.001419.
- Trommershäuser J, Maloney LT, Landy MS. Decision making, movement planning, and statistical decision theory. Trends in Cognitive Sciences. 2008;12:291–297. doi:10.1016/j.tics.2008.04.010.
- Turano KA, Geruschat DR, Baker FH. Oculomotor strategies for the direction of gaze tested with a real-world activity. Vision Research. 2003;43:333–346. doi:10.1016/s0042-6989(02)00498-4.
- Uke-Karacan H, Hayhoe M. Is attention drawn to changes in familiar scenes? Visual Cognition. 2008;16:346–374.
- Underwood G, Foulsham T, van Loon E, Humphreys L, Bloyce J. Eye movements during scene inspection: A test of the saliency map hypothesis. European Journal of Cognitive Psychology. 2006;18:321–342.
- Vincent BT, Baddeley RJ, Correani A, Troscianko T, Leonards U. Do we look at lights? Using mixture modelling to distinguish between low- and high-level factors in natural image viewing. Visual Cognition. 2009;17:856–879.
- Vincent BT, Troscianko T, Gilchrist ID. Investigating a space-variant weighted salience account of visual selection. Vision Research. 2007;47:1809–1820. doi:10.1016/j.visres.2007.02.014.
- Walther D, Koch C. Modeling attention to salient proto-objects. Neural Networks. 2006;19:1395–1407. doi:10.1016/j.neunet.2006.10.001.
- Wischnewski M, Belardinelli A, Schneider W. Where to look next? Combining static and dynamic proto-objects in a TVA-based model of visual attention. Cognitive Computation. 2010;2:326–343.
- Wischnewski M, Steil J, Kehrer L, Schneider W. Integrating inhomogeneous processing and proto-object formation in a computational model of visual attention. In: Ritter H, Sagerer G, Dillmann R, Buss M, editors. Cognitive Systems Monographs. Vol. 6. 2009. pp. 93–102.
- Wolfe JM. What can 1 million trials tell us about visual search? Psychological Science. 1998;9:33–39.
- Wolfe JM. Guided search 4.0: Current progress with a model of visual search. In: Gray W, editor. Integrated models of cognitive systems. Oxford University Press; New York: 2007. pp. 99–119.
- Xu J, Yang Z, Tsien JZ. Emergence of visual saliency from natural scenes via context-mediated probability distributions coding. PLoS ONE. 2010;5:e15796. doi:10.1371/journal.pone.0015796.
- Yantis S. Control of visual attention. In: Pashler H, editor. Attention. Psychology Press; Hove, UK: 1998. pp. 233–256.
- Yanulevskaya V, Marsman JB, Cornelissen F, Geusebroek J. An image statistics-based model for fixation prediction. Cognitive Computation. 2010;3:94–104. doi:10.1007/s12559-010-9087-7.
- Yarbus AL. Eye movements and vision. Plenum Press; New York: 1967.
- Zehetleitner M, Hegenloh M, Müller HJ. Visually guided pointing movements are driven by the salience map. Journal of Vision. 2011;11(1):24, 1–18. doi:10.1167/11.1.24. http://www.journalofvision.org/content/11/1/24
- Zelinsky GJ. A theory of eye movements during target acquisition. Psychological Review. 2008;115:787–835. doi:10.1037/a0013118.
- Zelinsky G, Rao R, Hayhoe M, Ballard D. Eye movements reveal the spatio-temporal dynamics of visual search. Psychological Science. 1997;8:448–453.
- Zhao Q, Koch C. Learning a saliency map using fixated locations in natural scenes. Journal of Vision. 2011;11(3):9, 1–15. doi:10.1167/11.3.9. http://www.journalofvision.org/content/11/3/9
- Zingale CM, Kowler E. Planning sequences of saccades. Vision Research. 1987;27:1327–1341. doi:10.1016/0042-6989(87)90210-0.