Author manuscript; available in PMC: 2021 Jun 22.
Published in final edited form as: Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2020 Aug 5;2020:190–199. doi: 10.1109/cvpr42600.2020.00027

Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning

Zhibo Yang 1, Lihan Huang 1, Yupei Chen 1, Zijun Wei 2, Seoyoung Ahn 1, Gregory Zelinsky 1, Dimitris Samaras 1, Minh Hoai 1
PMCID: PMC8218821  NIHMSID: NIHMS1715372  PMID: 34163124

Abstract

Human gaze behavior prediction is important for behavioral vision and for computer vision applications. Most models mainly focus on predicting free-viewing behavior using saliency maps, but do not generalize to goal-directed behavior, such as when a person searches for a visual target object. We propose the first inverse reinforcement learning (IRL) model to learn the internal reward function and policy used by humans during visual search. We modeled the viewer’s internal belief states as dynamic contextual belief maps of object locations. These maps were learned and then used to predict behavioral scanpaths for multiple target categories. To train and evaluate our IRL model we created COCO-Search18, which is now the largest dataset of high-quality search fixations in existence. COCO-Search18 has 10 participants searching for each of 18 target-object categories in 6202 images, making about 300,000 goal-directed fixations. When trained and evaluated on COCO-Search18, the IRL model outperformed baseline models in predicting search fixation scanpaths, both in terms of similarity to human search behavior and search efficiency. Finally, reward maps recovered by the IRL model reveal distinctive target-dependent patterns of object prioritization, which we interpret as a learned object context.

1. Introduction

Human visual attention comes in two forms. One is bottom-up, where prioritization is based solely on processing of the visual input. The other is top-down, where prioritization is based on many top-down information sources (object context of a scene, semantic relationships between objects, etc. [14, 38, 54]). When your food arrives at a restaurant, among your very first attention movements will likely be to the fork and knife (Fig. 1), because they are important to your goal of having dinner.

Figure 1: Predicting fixations in a visual search task.


Left: behavioral scanpath shown in yellow. Right: predicted scanpath in green. The search target is the fork, shown in the yellow bounding box.

Goal-directed attention control underlies all the tasks that we try to do, thus making its prediction a more challenging and important problem than predicting the bottom-up control of attention by a visual input. One of the strongest forms of top-down attention control is in the definition of a target goal. Arguably the simplest goal-directed task is visual search—there is a target object and the task is to find it. Humans are very efficient and flexible in the image locations that they choose to fixate while searching for a target-object goal, making the prediction of human search behavior important for both behavioral and computer vision, e.g. robotic visual systems [16, 39]. In this paper, we introduce Inverse Reinforcement Learning as a computational model of human attention in visual search.

Gaze prediction in visual search.

We aim to predict the fixation patterns made during the visual search of an image. These patterns can be either spatial (fixation density maps) or spatial+temporal (scanpaths). Most fixation-prediction models are of free-viewing behavior. A critical difference between search and free-viewing tasks is that search fixations are guided towards a target-object goal, whereas in free viewing there are no explicit goals. Prioritization of fixation locations during free viewing is thought to be controlled by bottom-up saliency, and since Itti’s [25] seminal work the prediction of free-viewing fixations using saliency maps has grown into a large literature [6, 8, 9, 12, 23, 26, 27, 32–34, 36]. However, saliency model predictions do not generalize to fixations made in goal-directed attention tasks, such as target object search [20, 31].

Early models of target guidance during search used simple targets having features that were known to the searcher [53]. This work expanded to include computational models using images of objects and scenes as inputs [15, 57, 58], and the inclusion of target spatial relationships [4] and global scene context [49] to help guide attention to targets and improve search efficiency. There have only been a few attempts to use deep network models to predict human search fixations [2, 52, 59]. Critically, all of these models use algorithms and knowledge about a particular source of information (target features, meaning, context, etc), to prioritize image locations for fixation selection.

Inverse Reinforcement Learning.

Our approach to search-fixation prediction is the opposite. Instead of an algorithm to prioritize locations in an image, we use Inverse Reinforcement Learning (IRL) [1, 17, 21, 41, 60] to learn sequences of search fixations by treating each as a potential source of reward. IRL, a form of imitation learning, aims to recover an expert’s underlying reward function through repeated observation. Most IRL algorithms [17, 55, 60] simultaneously learn an optimal policy and the reward function on which the policy is optimized. Although early IRL algorithms [41, 60] were often restricted to problems with low-dimensional state spaces, deep maximum entropy IRL [55] can handle raw image inputs. Recent work [17, 21] applies adversarial training [19] to learn the underlying reward function and the policy, treating each as (part of) the discriminator and the generator in adversarial training, respectively. The discriminator assigns high reward to an expert’s behavior and low reward to a non-expert’s behavior, where behavior is represented as state-action pairs. The generator/policy is optimized using a reinforcement learning algorithm to get higher reward by behaving more like the expert. Here, we use the GAIL (generative adversarial imitation learning) algorithm [21], because it can imitate behaviors in complex, high-dimensional environments [21]. We define a unified information-maximization framework to combine diverse information sources, in order to select maximally-rewarding locations to fixate, thus increasing accuracy and applicability of human search fixation prediction.

Search Fixation Datasets.

Our other significant contribution is the COCO-Search18 dataset, currently the world’s largest dataset of images annotated with human gaze fixations collected during search. COCO-Search18 is needed because the best models of goal-directed attention will likely be trained on goal-directed behavior data. For free-viewing fixations, the currently best model is DeepGaze II [34], trained on SALICON [27]. SALICON is a crowd-sourced dataset of images annotated with mouse clicks indicating attentionally salient image locations.

There is nothing comparable to SALICON to train goal-directed attention models. Moreover, the existing suitably large datasets each suffer from some weakness that limits their usefulness (Tab. 1), the most common being that the fixation behavior was not collected during a visual search task (as in [37, 42]). Datasets using a search task either had people search for multiple targets simultaneously [18] or used only one target category (people [15]) or two (microwaves and clocks [56]). These inadequacies demand a new, larger, and higher-quality dataset of search fixations for model training. We use multiple fixation-based behavioral search metrics to interrogate COCO-Search18, which we predict using IRL and other state-of-the-art methods.

Table 1: Comparison of fixation datasets.

Previous datasets either did not use a search task, or had very few target-object classes, subjects, or fixations.

| Dataset        | Search | Images | Classes | Subj/img | Fixations |
|----------------|--------|--------|---------|----------|-----------|
| SALICON [27]   | ✗      | 10000  | –       | 60       | 4600K*    |
| POET [42]      | ✗      | 6270   | 10      | 5        | 178K      |
| People900 [15] | ✓      | 912    | 1       | 14       | 55K       |
| MCS [56]       | ✓      | 2183   | 2       | 1–4      | 16K       |
| PET [18]       | ✓      | 4135   | 6       | 4        | 30K       |
| COCO-Search18  | ✓      | 6202   | 18      | 10       | 300K      |
*: Fixations are approximated by mouse clicks.

Contributions.

(1) We apply Inverse Reinforcement Learning (GAIL) to the problem of predicting fixation scanpaths during visual search, the first time this has been done for goal-directed attention. (2) In order to apply IRL to scanpath prediction we needed to integrate changes in fixation location with changes in the state representation, a problem that we solved using Dynamic Contextual Beliefs. DCB is a novel state encoder that updates beliefs about peripherally-viewed objects (an object context) based on the movements of a simulated fovea. (3) We introduce COCO-Search18, a large-scale, high-quality dataset of COCO images annotated with the fixations of 10 people searching for 18 target-object categories. COCO-Search18 makes possible the deep network modeling of goal-directed attention. (4) We show through model comparison and with multiple metrics that our IRL model outperforms other state-of-the-art methods in predicting search scanpaths. We also show that the IRL model (i) learns an object’s scene context, (ii) generalizes to predict the behavior of new subjects, and (iii) needs less data to achieve good performance compared to other models. (5) Finally, we learn how to quantify a reward function for the fixations in a search task. This will make possible a new wave of experimental investigation that will ultimately result in a better understanding of goal-directed attention.

2. Scanpath Prediction Framework

We propose an IRL framework (Fig. 2) to model human visual search behavior. A person performing a visual search task can be considered a goal-directed agent, with their fixations being a sequential decision process of the agent. At each time step, the agent attends to (fixates) a specific location within the image and receives a version of the image that is blurred to approximate the human viewer’s visual state, what we call a retina-transformed image. This is an image that has high-resolution (non-blurred) information surrounding the attended location, and lower-resolution information outside of this central simulated fovea [45]. The state of the agent is determined by the sequence of visual information that accumulates over fixations toward the search target (Sec. 2.1), with each action of the agent depending on the state at that time during the evolving state representation. The goal of the agent is to maximize internal rewards through changes in gaze fixation. While it is difficult to behaviorally measure how much reward is received from these fixations, with IRL this reward can be assumed to be a function of the state and the action, and this function can be jointly learned using the imitation policy (Sec. 2.2).

Figure 2: Overview of the IRL framework.


The generator (policy) generates fake state-action pairs {B_t^f, a_t^f} by sampling eye movements for given images and tasks. The discriminator (reward function) is trained to differentiate real human state-action pairs {B_t^r, a_t^r} from the generated ones and provides the reward used to train the generator. The states B_t^f and B_t^r use DCB representations.

2.1. State Representation Modeling

To model the state of a viewer we propose a novel state representation for accumulating information through fixations that we term a Dynamic-Contextual-Belief (DCB). As shown in Fig. 3, DCB is composed of three components: 1) Fovea, which receives a high-resolution visual input only from the image region around the fixation location; 2) Contextual beliefs, which represent a person’s gross “what” and “where” understanding of a scene in terms of level of class confidence; and 3) Dynamics, which actively collects information with each fixation made during search. We discuss each component in greater detail below.

Figure 3: Overview of DCB.


First, an input image and its low-resolution counterpart are converted into high-res beliefs and low-res beliefs. The initial state B_0 is set to the low-res beliefs. Each fixation is discretized into a binary fixation mask M_t, with 1’s at the fixation location and 0’s elsewhere, and a new state is generated by applying Eq. (1).

Fovea:

The primate visual system has a fovea, which means that high-resolution visual information is available only at a central fixated location. To accumulate information from the visual world, it is therefore necessary to selectively fixate new image locations. Visual inputs outside of the fovea have lower resolution, with the degree of blur depending on the distance between the peripherally-viewed input and the fovea. Rather than implementing a full progressive blurring of an image input (i.e., a complete retina-transformed image, as in [56]), for computational efficiency here we use a local patch from the original image as the high-resolution foveal input and a blurred version of the entire image to approximate low-resolution input from peripheral vision.
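This two-image approximation can be sketched in a few lines; the patch size, the cheap down/up-sampling stand-in for the blur, and the array layout below are illustrative assumptions, not the paper’s implementation:

```python
import numpy as np

def foveated_input(image, fixation, radius=32):
    """Approximate a retina transform: a blurred image everywhere,
    with original-resolution pixels inside a square 'foveal' patch."""
    # Cheap stand-in for a Gaussian low-pass filter: 2x down/up-sampling.
    low_res = image[::2, ::2].repeat(2, axis=0).repeat(2, axis=1)
    low_res = low_res[:image.shape[0], :image.shape[1]]

    out = low_res.copy()
    y, x = fixation
    y0, y1 = max(0, y - radius), min(image.shape[0], y + radius)
    x0, x1 = max(0, x - radius), min(image.shape[1], x + radius)
    out[y0:y1, x0:x1] = image[y0:y1, x0:x1]  # paste high-res foveal patch
    return out
```

The point of the simplification is computational: one blurred copy plus one patch paste per fixation, instead of a full distance-dependent progressive blur.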

Contextual Belief:

Attention is known to be guided to target (and target-like) objects during search, but more recently it has been suggested that attention is also guided to “anchor objects” [7, 50], defined as those objects having a learned spatial relationship to a target that can help in the efficient localization of that target. For example, people often look at the wall when searching for a TV because TVs are often found hanging on the wall. Inspired by this, we propose to model, not only the target features (as in [59]), but also other objects and background information in the state.

We hypothesize that people have an internal scene parser that takes an image input and generates belief maps for that image based on all the objects and background classes in that person’s knowledge structure. We believe these belief maps also guide movements of the fovea to capture high-resolution information and form better beliefs. We approximate these belief maps using a Panoptic-FPN [29] for panoptic segmentation [30]. Given an image, Panoptic-FPN generates a pixel-level mask for each “thing” class (object) and each “stuff” class (background) in the image. There are 80 “thing” categories (including a single “other” class for the 80 “thing” classes) and 54 “stuff” categories [10, 30, 35]. We create a mask for each category by grouping all mask instances belonging to the same category and use the belief maps of the 134 categories as the primary component of the state representation. We term these belief maps contextual beliefs because the collective non-target beliefs constitute a context of spatial cues that might affect the selection of fixations during the search for a target.
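A minimal sketch of assembling per-category belief maps from instance masks; the segmentation model itself (Panoptic-FPN in the paper) is treated as a black box here, and the `(category_id, mask)` input format is an assumption:

```python
import numpy as np

NUM_CATEGORIES = 134  # 80 "thing" + 54 "stuff" classes in COCO panoptic

def belief_maps(instance_masks, h=20, w=32):
    """Group per-instance segmentation masks into one belief map per
    category, producing a C x H x W state tensor.

    instance_masks: list of (category_id, mask) pairs, each mask an
    H x W array of confidences in [0, 1].
    """
    beliefs = np.zeros((NUM_CATEGORIES, h, w))
    for cat_id, mask in instance_masks:
        # Union all instances of the same category into a single map.
        beliefs[cat_id] = np.maximum(beliefs[cat_id], mask)
    return beliefs
```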

Dynamics refers to the change in the state representation that occurs following each fixation. We propose a simple yet effective heuristic to model state dynamics (see Fig. 3). Initially, the state is based on the contextual beliefs on the low-resolution image corresponding to a peripheral visual input. For each fixation by the searcher, we update the state by replacing the portion of the low-resolution belief maps with the corresponding high-resolution portion obtained at the new fixation location. The state is updated as follows:

B_0 = L   and   B_{t+1} = M_t ⊙ H + (1 − M_t) ⊙ B_t,   (1)

where B_t is the belief state after t fixations, M_t is the circular mask generated from the t-th fixation, and L and H are the belief maps of “thing” and “stuff” locations for the low-resolution and high-resolution images, respectively. Humans have different search behaviors on the same image given different search targets. To capture this, we augment the state by concatenating it with a one-hot task vector. Please refer to the supplementary material for more detail.
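The update in Eq. (1) amounts to a masked blend of two tensors; a sketch, with the mask radius and grid size as illustrative assumptions:

```python
import numpy as np

def dcb_update(belief, high_res_belief, fixation, radius=3):
    """One DCB step (Eq. 1): reveal high-resolution beliefs inside a
    circular mask around the new fixation; keep prior beliefs elsewhere."""
    c, h, w = belief.shape
    yy, xx = np.mgrid[0:h, 0:w]
    fy, fx = fixation
    mask = ((yy - fy) ** 2 + (xx - fx) ** 2 <= radius ** 2).astype(float)
    # Broadcasts the H x W mask across all C belief channels.
    return mask * high_res_belief + (1.0 - mask) * belief

# B_0 = L; each fixation replaces a disc of L with the matching disc of H.
```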

2.2. Reward and Policy Learning

We learn the reward function and the policy for visual search behavior using Generative Adversarial Imitation Learning (GAIL) [21]. As shown in Fig. 2, GAIL is an adversarial framework with a discriminator and a generator. The policy is the generator that aims to generate state-action pairs that are similar to human behavior. The reward function (the logarithm of the discriminator output) maps a state-action pair to a numeric value. We train the generator and discriminator with an adversarial optimization framework to obtain the policy and reward functions.

Let D and G denote the discriminator and the generator, respectively. The discriminator aims to differentiate human state-action pairs from fake state-action pairs generated by the policy. This corresponds to maximizing the following objective function:

L_D = E_r[log(D(S,a))] + E_f[log(1 − D(S,a))] − λ E_r[‖∇D(S,a)‖²].   (2)

In the above objective function, Er denotes the expectation over the distribution of real state-action pairs, while Ef denotes the expectation over the fake samples generated by the generator (i.e., the policy). The last term of the above objective is the expected squared norm of the gradients, which is added for faster convergence [46]. The reward function is defined based on the discriminator:

r(S,a)=log(D(S,a)). (3)

The generator aims to fool the discriminator; its objective is to maximize the log-likelihood of the generated state-action pairs, i.e., to maximize L_G = E_f[log(D(S,a))] = E_f[r(S,a)].

The generator is an RL policy, hence its objective can be equivalently reformulated as an RL objective and optimized by Proximal Policy Optimization [48]:

L_π = E_π[log(π(a|S)) A(S,a)] + H(π).   (4)

We use GAE [47] to estimate the advantage function A, which measures the gain of taking action a over the policy’s default behavior. H(π) = −E_π[log(π(a|S))] is the entropy term used in max-entropy IRL [60].
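The discriminator objective (Eq. 2) and the induced reward (Eq. 3) can be sketched directly; since this sketch has no autograd, the gradient-penalty term is passed in precomputed:

```python
import numpy as np

def discriminator_loss(d_real, d_fake, grad_sq_norm, lam=10.0):
    """Eq. (2), to be maximized: log D on real state-action pairs,
    log(1 - D) on generated ones, minus a gradient penalty added for
    faster convergence. lam is an illustrative coefficient."""
    return (np.mean(np.log(d_real))
            + np.mean(np.log(1.0 - d_fake))
            - lam * np.mean(grad_sq_norm))

def reward(d_out):
    """Eq. (3): the reward handed to the policy is log D(S, a), so
    behavior the discriminator judges 'human' earns higher reward."""
    return np.log(d_out)
```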

3. COCO-Search18 Dataset

COCO-Search18 is a large-scale and high-quality dataset of search fixations obtained by having 10 people view 6202 images in the search for each of 18 target-object categories. Half of these images depicted an instance of the designated target object and the other half did not, meaning that we adopted a standard target-present (TP) or target-absent (TA) search task. All images in COCO-Search18 were selected from the COCO trainval set [35]. Five criteria were imposed when selecting the TP images: (1) No images depicting a person or an animal (to avoid known strong biases to these categories that might skew our measures of attention control [11, 28]). (2) The image should include one and only one instance of the target. (3) The size of the target, measured by the area of its bounding box, must be >1% and <10% of the area of the search image. (4) The target should not be at the center of the image, enforced by excluding an image if the target bounding box overlapped with the center cell of a 5×5 grid. (5) The original image ratio (width/height) must be between 1.2 and 2.0 to accommodate the display screen ratio of 1.6. After applying these exclusion criteria, and excluding object categories that had fewer than 100 images of exemplars, we were left with 32 object categories (out of COCO’s 80) to use as search targets. To exclude images in which the target was highly occluded or otherwise difficult to recognize, we trained a patch-based classifier for target recognition (described in supplemental material) and only selected images in which the cropped target-object patch had a classification confidence in the top 1%. Finally, we manually excluded images depicting digital clocks from the clock target category (because the features of analog and digital clocks are very different and this would be expected to reduce data quality by creating variability in the search behavior), as well as images depicting objectionable content.
This left us with 3101 TP images over 18 target categories. To select the same number of TA images for each of these 18 categories, we randomly sampled COCO trainval images with the following constraints:(1) The image should not depict an instance of the target, and (2) The image must include at least two instances of the target’s siblings, as defined in COCO. For example, a microwave sibling can be an oven, a toaster, a refrigerator, or a sink, which are under the parent category of appliance. We did this to discourage TA responses from being based on scene type (e.g., a city street scene would be unlikely to contain a microwave).
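The TP selection rules above can be expressed as a single predicate; the bounding-box convention (x, y, w, h) and the rectangle-overlap test below are our assumptions about how the criteria would be checked:

```python
def keep_target_present(img_w, img_h, bbox, n_targets):
    """Sketch of the TP image filters from the text: exactly one target,
    target area 1%-10% of the image, aspect ratio in [1.2, 2.0], and no
    overlap with the center cell of a 5x5 grid. bbox = (x, y, w, h)."""
    if n_targets != 1:
        return False
    x, y, w, h = bbox
    area_ratio = (w * h) / float(img_w * img_h)
    if not (0.01 < area_ratio < 0.10):            # target size constraint
        return False
    if not (1.2 <= img_w / float(img_h) <= 2.0):  # aspect ratio constraint
        return False
    # Center exclusion: middle cell of a 5x5 grid spans [2/5, 3/5] per axis.
    cx0, cx1 = 2 * img_w / 5.0, 3 * img_w / 5.0
    cy0, cy1 = 2 * img_h / 5.0, 3 * img_h / 5.0
    overlaps_center = not (x + w < cx0 or x > cx1 or y + h < cy0 or y > cy1)
    return not overlaps_center
```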

Each of the 10 student participants (6 males, age range 18–30, normal or corrected-to-normal vision) viewed all 6202 images, and their eye position throughout was sampled at 1000Hz using an EyeLink 1000 eyetracker (SR Research) in tower-mount configuration under controlled laboratory conditions. For each subject, data collection was distributed over six sessions on six days, with each session having an equal number of TP and TA trials (~500 each), randomly interleaved. Each session required ~2 hours. For each image, subjects made a TP or TA judgment by pressing a ‘yes’ or ‘no’ button on a game pad. They searched all the images for one target category before proceeding to the next category. A total of 299,037 fixations were extracted from the eye position data over the 10 subjects, although only data from the TP fixations will be reported here (Fig. 4). TP fixations occurring on error trials, or after fixation on the target, were discarded. This left 100,232 TP search fixations to use for training and testing. All model evaluations are based on random 70% training, 10% validation, and 20% test splits of COCO-Search18, within each target category.

Figure 4:


Normalized gaze data [0,1] on response error, reaction time, number of fixations (NumFix), time to target (T2T), and number of fixations to target (NumFix2T) averaged over 10 subjects searching for 18 categories in TP images. Redder color indicates harder search targets, bluer color indicates easier search.

4. Experiments

We evaluate the proposed framework and its constituent components in multiple experiments. We first compare the scanpath predictions by the IRL-based algorithm to predictions from various heuristic methods and behavior cloning methods using ConvNets and convolutional LSTM. We then study the algorithm’s ability to generalize to new human subjects. Finally, we analyze context effects, the value of having more training data, and report on an ablation study. We used only the target-present trials from COCO-Search18, leaving target-absent data for future analyses.

4.1. Comparing Scanpath Prediction Models

Comparison methods.

We compare the IRL algorithm for predicting scanpaths to several baselines, heuristics, and behavior cloning methods: (1) Random scanpath: we predict the scanpath for an input image by randomly selecting a human scanpath for the same search target but in a different input image. (2) Detector: we train a simple ConvNet to predict the location of the target and sample a sequence of fixation locations based on the detection confidence map over the image. Regions with higher confidence scores are more likely to be sampled. (3) Fixation heuristics: rather than sampling from a detector’s confidence map, here we generate fixations by sampling from a fixation density map produced by a ConvNet (with a network architecture similar to the Detector’s) trained on human fixation density maps. (4) BC-CNN is a behavior cloning method, where we train a ConvNet to predict the next fixation location from the DCB state representation. Note that this state representation and network structure are identical to those used by the IRL policy described in Sec. 2.1. (5) BC-LSTM is a behavior cloning method similar to BC-CNN, but the state representation and update are done with a convolutional LSTM. Instead of the simple predefined update rule shared by IRL and BC-CNN, shown in Eq. (1), BC-LSTM aims to learn a recurrent update rule automatically with an LSTM: B_{t+1} = ConvLSTM(B_t, I_t), where ConvLSTM denotes a convolutional LSTM cell [5] and B_t is the hidden state of the LSTM cell and also the searcher’s belief state after t fixations. I_t is the input to the LSTM at time t, defined as I_t = M_t ⊙ H + (1 − M_t) ⊙ L. Recall that M_t is the circular mask generated from the t-th fixation, and L and H are the predicted maps from the Panoptic-FPN [30] for the 80 COCO objects and 54 “stuff” classes for low- and high-resolution input images, respectively.
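The Detector and Fixation-heuristic baselines both reduce to sampling fixation cells in proportion to a confidence map; a sketch (sampling without replacement to avoid revisits is an illustrative choice, not a stated detail of the baselines):

```python
import numpy as np

def sample_scanpath(confidence, n_fixations=6, rng=None):
    """Sample a sequence of fixation cells from a 2D confidence map,
    with cells chosen proportionally to their confidence scores."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h, w = confidence.shape
    p = confidence.ravel() / confidence.sum()
    idx = rng.choice(h * w, size=n_fixations, replace=False, p=p)
    return [(i // w, i % w) for i in idx]  # (row, col) fixation cells
```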

Results.

Fig. 5 shows the cumulative probability of gaze landing on the target after each of the first 6 fixations made by humans and the algorithms in our model comparison. First, note that even the most predictive models have a performance ceiling lower than that of humans, whose ceiling over this range is nearly 1. These lower ceilings likely reflect a proportion of trials in which the models’ search was largely unguided. Second, note the steep increase in target fixation probability after the first and second fixations. The slopes of these functions indicate strong target guidance. The target was fixated in the very first movement on about half of the images, with the IRL model replicating human search guidance slightly better than its nearest competitors: the Detector and BC-CNN models.

Figure 5: Cumulative probability of fixating the target.


Shown for human searchers and all predictive methods. The x-axis is the number of fixations until the fovea moves to the target object; the y-axis is the percentage of scanpaths that succeed in locating the target. Means and standard errors are first computed over target categories, and then over searchers.

We quantify the patterns from Fig. 5 using several metrics. Two of these metrics follow directly from Fig. 5 and capture aggregate measures combining search guidance and accuracy. The first computes the area under the cumulative target-fixation-probability curve, a metric we refer to as Target Fixation Probability AUC, or TFP-AUC. The second computes the sum of the absolute differences between the human and model cumulative probabilities of target fixation, a metric we refer to as Probability Mismatch. We also report the Scanpath Ratio, a widely used metric for search efficiency, computed as the ratio of the Euclidean distance between the initial fixation location and the center of the target to the summed Euclidean distances between the fixations leading to the target [22]. Finally, we compute two metrics for scanpath prediction success, both of which capture the similarity between fixation sequences generated by humans and sequences generated by the model. The first computes a Sequence Score by converting each scanpath into a string of fixation cluster IDs and then using a string-matching algorithm [40] to measure the similarity between two strings. The second, MultiMatch [3, 13], measures scanpath similarity at the pixel level along five dimensions: shape, direction, length, position, and duration. We exclude the duration metric because the studied models do not predict fixation duration. Unless otherwise specified, each model generates 10 scanpaths of maximum length 6 (excluding the first fixation) for each testing image by sampling from the predicted action map at each fixation, with the results averaged over scanpaths.
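Two of these metrics are simple enough to sketch directly; the unit spacing between fixation indices and the trapezoidal integration for TFP-AUC are our assumptions about the computation:

```python
import numpy as np

def scanpath_ratio(fixations, target_center):
    """Search efficiency: straight-line distance from the first fixation
    to the target center, divided by the total path length traveled."""
    fixations = np.asarray(fixations, dtype=float)
    target_center = np.asarray(target_center, dtype=float)
    direct = np.linalg.norm(fixations[0] - target_center)
    traveled = sum(np.linalg.norm(fixations[i + 1] - fixations[i])
                   for i in range(len(fixations) - 1))
    return direct / traveled  # 1.0 means a perfectly direct scanpath

def tfp_auc(cum_prob):
    """Area under the cumulative target-fixation-probability curve,
    via the trapezoidal rule with unit spacing between fixations."""
    c = np.asarray(cum_prob, dtype=float)
    return float(np.sum((c[1:] + c[:-1]) / 2.0))
```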

As seen from Tab. 2, the IRL algorithm outperforms the other methods on all metrics. The performance of IRL is closest to Human, an oracle method where the scanpath of a subject is used to predict the scanpath of another subject for the same input image. Fig. 6 also shows that reward maps recovered by the IRL model depend greatly on the category of the search target. In the top row, higher reward was assigned to the laptop when searching for a mouse, while for the same image greater reward was expected from fixating on the monitor when searching for a tv. Similarly, the search for a car target in the bottom image resulted in the expectation of reward from the other cars on the road but almost not at all from the highly-salient stop sign, which becomes intensely prioritized when the stop sign is the target.

Table 2: Comparing scanpath prediction algorithms.

(rows) using multiple scanpath metrics (columns) on the COCO-Search18 test dataset. In the case of Sequence Score and Multimatch, “Human” refers to an oracle method where one searcher’s scanpath is used to predict another searcher’s scanpath; “Human” for all other metrics refers to observed behavior.

| Method             | TFP-AUC ↑ | Probability Mismatch ↓ | Scanpath Ratio ↑ | Sequence Score ↑ | MultiMatch ↑ (shape / direction / length / position) |
|--------------------|-----------|------------------------|------------------|------------------|------------------------------------------------------|
| Human              | 5.200     | –                      | 0.862            | 0.490            | 0.903 / 0.736 / 0.880 / 0.910                        |
| Random scanpath    | 0.795     | 4.407                  | –                | 0.295            | 0.869 / 0.558 / 0.849 / 0.849                        |
| Detector           | 4.046     | 1.166                  | 0.687            | 0.414            | 0.877 / 0.676 / 0.853 / 0.863                        |
| Fixation heuristic | 2.154     | 3.046                  | 0.545            | 0.342            | 0.873 / 0.614 / 0.870 / 0.850                        |
| BC-CNN             | 3.893     | 1.328                  | 0.706            | 0.409            | 0.880 / 0.669 / 0.865 / 0.874                        |
| BC-LSTM            | 1.702     | 3.497                  | 0.406            | 0.324            | 0.834 / 0.567 / 0.818 / 0.770                        |
| IRL (Ours)         | 4.509     | 0.987                  | 0.826            | 0.422            | 0.886 / 0.695 / 0.866 / 0.885                        |
Figure 6:


Initial reward maps learned by the IRL model for two different search targets in two test images. Top row: original image (left), mouse target (middle), and tv target (right). Bottom row: original image (left), car target (middle), stop sign target (right). Redder color indicates the expectation of higher reward for fixating a location.

Implementation details.

We resize each input image to 320×512 and obtain a low-resolution image by applying a Gaussian filter with standard deviation σ = 2. To compute the contextual beliefs, we use a Panoptic-FPN with backbone network ResNet-50-FPN pretrained on COCO2017 [30]. Panoptic-FPN outputs a feature map of 134 channels, corresponding to 80 object categories and 54 background classes in COCO, and it is resized to 20×32 spatially.

For IRL and BC-CNN, we use the same policy network architecture: a network composed of four convolutional (conv) layers and a softmax layer. The IRL model has two additional components: a critic network and a discriminator network. The critic network has two conv layers and two fully-connected (fc) layers. The discriminator network shares the same structure as the IRL policy network, except that its last layer is a sigmoid layer. Each conv layer and fc layer in BC-CNN and IRL is followed by a ReLU layer and a batch-norm layer [24]. BC-LSTM has the same policy network as BC-CNN, with the difference being the use of a convolutional LSTM [5] to update the states. BC-CNN and BC-LSTM use the KL divergence between the predicted spatial distribution and the ground truth as their loss. The prediction of both behavior cloning models and IRL is conditioned on the search target. We implement the target conditioning by introducing an additional bias term based on the search task to the input features at each layer [44]. The human visual system employs Inhibition-of-Return (IOR) to spatially tag previously attended locations with inhibition to discourage attention from returning to a region where information has already been depleted [51]. To capture this mechanism, we enforce IOR on the policy by setting the predicted probability map to 0 at each attended location using a 3×3 grid. See the supplementary for more detail.
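The IOR step described above can be sketched as a mask-and-renormalize operation on the policy’s probability map; the grid indexing below is an assumption about how the 3×3 neighborhood is applied:

```python
import numpy as np

def apply_ior(prob_map, fixation_history):
    """Inhibition-of-Return: zero the predicted probability in a 3x3
    neighborhood around every previously attended cell, then
    renormalize so the map remains a valid distribution."""
    out = prob_map.copy()
    h, w = out.shape
    for fy, fx in fixation_history:
        out[max(0, fy - 1):min(h, fy + 2),
            max(0, fx - 1):min(w, fx + 2)] = 0.0
    return out / out.sum()
```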

4.2. Group Model vs Individual Model

The previous subsection described the IRL model’s ability to predict a searcher’s scanpath on unseen test images, but how well can this model predict the scanpaths of a new unseen searcher without training on that person’s scanpaths? To answer this question, we perform ten leave-one-subject-out experiments, with each experiment corresponding to a test subject. For every subject we train two models: (1) a group model using the scanpaths of the 9 other subjects; and (2) an individual model using the scanpaths of the test subject on the training images. We evaluate the performance of these models on the scanpaths of each test subject on the unseen test images. Fig. 7 shows that both models perform well, with an insignificant performance gap between them. This suggests that there is good agreement between group and individual behaviors, and that a group model can generalize well to new searchers.

Figure 7:

No significant differences were found between a group model (solid), trained with 9 subjects, and an individual model (striped), trained with one subject.

4.3. Context Effects

Search efficiency.

With DCB we can ask how an object from category A affects the search for a target from category B. Such an object can either increase (guidance) or decrease (distraction) search efficiency. To study this, we zero out the belief map of category A in the DCB state representation and measure the TFP-AUC (see Sec. 4.1) on test images for category B, then compute the difference between the TFP-AUC obtained with and without switching off category A's belief map. A positive value indicates that an object in category A helps guide search for a target in category B; a negative value indicates the opposite (the object is distracting). We did this for all 134 COCO object and stuff non-target categories A and all 18 target categories B. Fig. 8 shows the six most guiding and most distracting objects for the knife and car searches. Note that the fork was highly distracting when searching for a knife, likely because the two look similar in peripheral vision, whereas the cake facilitated the knife search. Similarly for the car search, pavement provided the strongest guidance whereas trucks were the most distracting.
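The guidance/distraction measurement can be sketched as follows. Here `evaluate_tfp_auc` is a hypothetical scorer standing in for rolling out the trained policy on test images and computing TFP-AUC; the toy scorer in the demo only serves to exercise the function.

```python
import numpy as np

def context_effect(state, ctx_channel, evaluate_tfp_auc):
    """TFP-AUC with category A's belief map intact, minus TFP-AUC with
    that map zeroed.  Positive -> category A guides the search;
    negative -> it distracts."""
    ablated = state.copy()
    ablated[ctx_channel] = 0.0  # switch off category A's belief map
    return evaluate_tfp_auc(state) - evaluate_tfp_auc(ablated)

# demo with a toy scorer that just reads channel 2 (a stand-in for the
# real TFP-AUC evaluation over policy rollouts)
fake_scorer = lambda s: float(s[2].sum())
effect = context_effect(np.ones((5, 4, 4)), 2, fake_scorer)
```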

Figure 8: Context effect.

The six most influential context objects (grey bars) for the knife and car search tasks. The y-axis lists the context object category and the x-axis measures how much the belief map for a context object contributed to search efficiency, as measured by TFP-AUC. Larger positive values mean that the context object improved search guidance to the target; more negative values mean that the object distracted attention from the search.

Directional Prior.

Can an object from category A serve as a directional spatial cue in the search for a target from category B? Suppose M is the probability map produced by the policy network of our IRL model, and let M′ be the corresponding probability map obtained after switching off the belief map of category A in the DCB state representation. The difference between M and M′, which we call a context map (depicted in the top row of Fig. 9), reveals the spatial relationship between the context object A and the target object B (see Fig. 9 for examples).
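A minimal sketch of the context-map computation, assuming `policy` is a callable mapping a (C, H, W) DCB state to an (H, W) probability map; the toy policy below exists only to make the sketch runnable.

```python
import numpy as np

def context_map(policy, state, ctx_channel):
    """M - M': the policy's probability map with category A's belief map
    present, minus the map produced after zeroing that channel.
    Positive regions show where category A pulls gaze."""
    ablated = state.copy()
    ablated[ctx_channel] = 0.0  # switch off category A in the DCB state
    return policy(state) - policy(ablated)

# demo with a toy "policy" that sums the state channels
toy_policy = lambda s: s.sum(axis=0)
cm = context_map(toy_policy, np.ones((3, 2, 2)), 0)
```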

Figure 9: Spatial relations between context and target objects learned by the model.

The top row shows individual context maps for a dining table (left) and a refrigerator (middle) in a microwave search, and a keyboard (right) in a mouse search. The bottom row shows the belief maps of the corresponding context objects. Gaze is guided to the top of the dining table and refrigerator when searching for a microwave, and to the right of the keyboard when searching for a mouse.

4.4. Ablation Study on State Representation

DCB is a rich representation that combines top-down, bottom-up, and history information. Specifically, it consists of 136 belief maps, divided into five factor groups: the target object (1 map), context objects (79 maps), “stuff” (54 maps), saliency (1 map, extracted using DeepGaze2 [34]), and history (1 binary map marking the locations of previous fixations). To understand the contribution of each factor group, we remove it from the full state representation and measure the effect on performance. The first block of Tab. 3 shows that the most important factor groups were the target and context objects, followed by stuff, whereas saliency and history had only a weak impact on model performance. An alternative state representation to DCB is the Cumulative Foveated Image (CFI) [56]; replacing DCB with CFI degrades the performance of IRL (second block of Tab. 3).
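The factor-group ablation can be sketched as zeroing contiguous channel slices of the 136-channel state. The group sizes below come from the description above, but this particular channel ordering is an illustrative assumption, not the actual layout.

```python
import numpy as np

# assumed channel layout of the 136-map DCB state (ordering illustrative)
GROUPS = {
    "target":   slice(0, 1),
    "context":  slice(1, 80),     # 79 "thing" belief maps
    "stuff":    slice(80, 134),   # 54 "stuff" belief maps
    "saliency": slice(134, 135),
    "history":  slice(135, 136),
}

def ablate_group(state, group):
    """Return a copy of a (136, H, W) DCB state with one factor group
    zeroed out, as in the Table 3 ablations."""
    out = state.copy()
    out[GROUPS[group]] = 0.0
    return out

ablated = ablate_group(np.ones((136, 10, 16)), "stuff")
```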

Table 3: Ablation study of the proposed state representation—dynamic contextual belief.

The full state (DCB-full) consists of 1 history map, 1 saliency map, 54 stuff maps, 79 context maps, and 1 target map. We ablate one factor group at a time by setting its map(s) to zero. See the supplementary for full results.

                    Sequence Score ↑   Scanpath Ratio ↑   Probability Mismatch ↓
DCB-full                 0.422              0.803                1.029
w/o history map          0.419              0.800                1.042
w/o saliency map         0.419              0.795                1.029
w/o stuff maps           0.407              0.777                1.248
w/o thing maps           0.331              0.487                3.152
w/o target map           0.338              0.519                2.926
DCB                      0.422              0.826                0.987
CFI                      0.402              0.619                1.797

4.5. Data Efficiency

Fig. 10 shows IRL and BC-CNN performance as we vary the number of training images per object category; both methods use DCB as the state representation. IRL is more data efficient than BC-CNN, achieving comparable or better results with less training data. A likely reason is that the GAIL-based [21] IRL method includes an adversarial component that generates augmented training data, making the policy network less prone to overfitting. Data efficiency is crucial when extending the model to new target categories, given the time and cost of collecting human fixations.

Figure 10:

Performance of IRL (solid line) and BC-CNN (dashed line) as the number of training images per category increases from 5 to 20. IRL is more data efficient than BC-CNN, likely due to an adversarial data generator.

5. Conclusions

We proposed a new model for predicting search fixation scanpaths that uses IRL to jointly recover the reward function and policy used by people during visual search. The IRL model uses a novel and highly explainable state representation, dynamic contextual beliefs (DCB), which updates beliefs about objects to obtain an object context that changes dynamically with each new fixation. To train and test this model we also introduced COCO-Search18, a large-scale dataset of images annotated with the fixations of people searching for target-object goals. Using COCO-Search18, we showed that the IRL model outperformed comparable models in predicting search scanpaths.

Better predicting human search behavior means better robotic search applications and human-computer systems that can interact with users at the level of their attention movements [43]. It may also be possible to use reward maps from the IRL model to annotate and index visual content based on what is likely to attract a person’s attention. Finally, our work impacts the behavioral vision literature, where the visual features guiding human goal-directed attention are still poorly understood for real images [58].

Acknowledgements.

This project is supported by US National Science Foundation Award IIS-1763981, the Partner University Fund, the SUNY2020 Infrastructure Transportation Security Center, and a gift from Adobe.

References

  • [1] Abbeel Pieter and Ng Andrew Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2004.
  • [2] Adeli Hossein and Zelinsky Gregory. Deep-BCN: Deep networks meet biased competition to create a brain-inspired model of attention control. In CVPR Workshops, 2018.
  • [3] Anderson Nicola C, Anderson Fraser, Kingstone Alan, and Bischof Walter F. A comparison of scanpath comparison methods. Behavior Research Methods, 47(4):1377–1392, 2015.
  • [4] Aydemir Alper, Sjöö Kristoffer, Folkesson John, Pronobis Andrzej, and Jensfelt Patric. Search in the real world: Active visual object search based on spatial relations. In Proceedings of the IEEE Conference on Robotics and Automation, pages 2818–2824. IEEE, 2011.
  • [5] Ballas Nicolas, Yao Li, Pal Chris, and Courville Aaron. Delving deeper into convolutional networks for learning video representations. arXiv:1511.06432, 2015.
  • [6] Berg David J, Boehnke Susan E, Marino Robert A, Munoz Douglas P, and Itti Laurent. Free viewing of dynamic stimuli by humans and monkeys. Journal of Vision, 9(5):19–19, 2009.
  • [7] Boettcher Sage EP, Draschkow Dejan, Dienhart Eric, and Võ Melissa L-H. Anchoring visual search in scenes: Assessing the role of anchor objects on eye movements during visual search. Journal of Vision, 18(13):11–11, 2018.
  • [8] Borji Ali and Itti Laurent. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207, 2013.
  • [9] Borji Ali, Cheng Ming-Ming, Jiang Huaizu, and Li Jia. Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24(12):5706–5722, 2015.
  • [10] Caesar Holger, Uijlings Jasper, and Ferrari Vittorio. COCO-Stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1209–1218, 2018.
  • [11] Cerf Moran, Harel Jonathan, Einhäuser Wolfgang, and Koch Christof. Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems, pages 241–248, 2008.
  • [12] Cornia Marcella, Baraldi Lorenzo, Serra Giuseppe, and Cucchiara Rita. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing, 27(10):5142–5154, 2018.
  • [13] Dewhurst Richard, Nyström Marcus, Jarodzka Halszka, Foulsham Tom, Johansson Roger, and Holmqvist Kenneth. It depends on how you look at it: Scanpath comparison in multiple dimensions with MultiMatch, a vector-based approach. Behavior Research Methods, 44(4):1079–1100, 2012.
  • [14] Eckstein Miguel P. Visual search: A retrospective. Journal of Vision, 11(5):14–14, 2011.
  • [15] Ehinger Krista A, Hidalgo-Sotelo Barbara, Torralba Antonio, and Oliva Aude. Modelling search for people in 900 scenes: A combined source model of eye guidance. Visual Cognition, 17(6–7):945–978, 2009.
  • [16] Elder James H, Hou Yuqian, Goldstein Ronen, and Dornaika Fadi. Attentive panoramic visual sensor, October 31 2006. US Patent 7,130,490.
  • [17] Fu Justin, Luo Katie, and Levine Sergey. Learning robust rewards with adversarial inverse reinforcement learning. arXiv:1710.11248, 2017.
  • [18] Gilani Syed Omer, Subramanian Ramanathan, Yan Yan, Melcher David, Sebe Nicu, and Winkler Stefan. PET: An eye-tracking dataset for animal-centric Pascal object classes. In 2015 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2015.
  • [19] Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
  • [20] Henderson John M, Brockmole James R, Castelhano Monica S, and Mack Michael. Visual saliency does not account for eye movements during visual search in real-world scenes. In Eye Movements, pages 537–III. Elsevier, 2007.
  • [21] Ho Jonathan and Ermon Stefano. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
  • [22] Hout Michael C and Goldinger Stephen D. Target templates: The precision of mental representations affects attentional guidance and decision-making in visual search. Attention, Perception, & Psychophysics, 77(1):128–149, 2015.
  • [23] Huang Xun, Shen Chengyao, Boix Xavier, and Zhao Qi. SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 262–270, 2015.
  • [24] Ioffe Sergey and Szegedy Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
  • [25] Itti L, Koch C, and Niebur E. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
  • [26] Jetley Saumya, Murray Naila, and Vig Eleonora. End-to-end saliency mapping via probability distribution prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5753–5761, 2016.
  • [27] Jiang Ming, Huang Shengsheng, Duan Juanyong, and Zhao Qi. SALICON: Saliency in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2015.
  • [28] Judd Tilke, Ehinger Krista, Durand Frédo, and Torralba Antonio. Learning to predict where humans look. In Proceedings of the IEEE International Conference on Computer Vision, pages 2106–2113, 2009.
  • [29] Kirillov Alexander, Girshick Ross, He Kaiming, and Dollár Piotr. Panoptic feature pyramid networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6399–6408, 2019.
  • [30] Kirillov Alexander, He Kaiming, Girshick Ross, Rother Carsten, and Dollár Piotr. Panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019.
  • [31] Koehler Kathryn, Guo Fei, Zhang Sheng, and Eckstein Miguel P. What do saliency models predict? Journal of Vision, 14(3):14–14, 2014.
  • [32] Kruthiventi Srinivas SS, Kumar Ayush, and Babu R Venkatesh. DeepFix: A fully convolutional neural network for predicting human eye fixations. IEEE Transactions on Image Processing, 26(9):4446–4456, 2017.
  • [33] Kümmerer Matthias, Theis Lucas, and Bethge Matthias. Deep Gaze I: Boosting saliency prediction with feature maps trained on ImageNet. arXiv:1411.1045, 2014.
  • [34] Kümmerer Matthias, Wallis Thomas SA, Gatys Leon A, and Bethge Matthias. Understanding low- and high-level contributions to fixation prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 4789–4798, 2017.
  • [35] Lin Tsung-Yi, Maire Michael, Belongie Serge, Hays James, Perona Pietro, Ramanan Deva, Dollár Piotr, and Zitnick C Lawrence. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  • [36] Masciocchi Christopher Michael, Mihalas Stefan, Parkhurst Derrick, and Niebur Ernst. Everyone knows what is interesting: Salient locations which should be fixated. Journal of Vision, 9(11):25–25, 2009.
  • [37] Mathe Stefan and Sminchisescu Cristian. Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(7):1408–1424, 2014.
  • [38] Nakayama Ken and Martini Paolo. Situating visual search. Vision Research, 51(13):1526–1537, 2011.
  • [39] Narayanan Venkatraman and Likhachev Maxim. PERCH: Perception via search for multi-object recognition and localization. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 5052–5059. IEEE, 2016.
  • [40] Needleman Saul B and Wunsch Christian D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
  • [41] Ng Andrew Y. and Russell Stuart. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2000.
  • [42] Papadopoulos Dim P, Clarke Alasdair DF, Keller Frank, and Ferrari Vittorio. Training object class detectors from eye tracking data. In European Conference on Computer Vision, pages 361–376. Springer, 2014.
  • [43] Park Sohee, Bhattacharya Arani, Yang Zhibo, Dasari Mallesham, Das Samir R, and Samaras Dimitris. Advancing user quality of experience in 360-degree video streaming. In 2019 IFIP Networking Conference (IFIP Networking), pages 1–9. IEEE, 2019.
  • [44] Perez Ethan, Strub Florian, De Vries Harm, Dumoulin Vincent, and Courville Aaron. FiLM: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [45] Perry Jeffrey S and Geisler Wilson S. Gaze-contingent real-time simulation of arbitrary visual fields. In Human Vision and Electronic Imaging VII, volume 4662, pages 57–70. International Society for Optics and Photonics, 2002.
  • [46] Roth Kevin, Lucchi Aurelien, Nowozin Sebastian, and Hofmann Thomas. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2018–2028, 2017.
  • [47] Schulman John, Moritz Philipp, Levine Sergey, Jordan Michael, and Abbeel Pieter. High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438, 2015.
  • [48] Schulman John, Wolski Filip, Dhariwal Prafulla, Radford Alec, and Klimov Oleg. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
  • [49] Torralba Antonio, Oliva Aude, Castelhano Monica S, and Henderson John M. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113(4):766, 2006.
  • [50] Võ Melissa Le-Hoa, Boettcher Sage EP, and Draschkow Dejan. Reading scenes: How scene grammar guides attention and aids perception in real-world environments. Current Opinion in Psychology, 2019.
  • [51] Wang Zhiguo and Klein Raymond M. Searching for inhibition of return in visual search: A review. Vision Research, 50(2):220–228, 2010.
  • [52] Wei Zijun, Adeli Hossein, Hoai Minh, Zelinsky Gregory, and Samaras Dimitris. Learned region sparsity and diversity also predict visual attention. In Advances in Neural Information Processing Systems, 2016.
  • [53] Wolfe Jeremy M. Guided Search 2.0: A revised model of visual search. Psychonomic Bulletin & Review, 1(2):202–238, 1994.
  • [54] Wolfe Jeremy M and Horowitz Todd S. Five factors that guide attention in visual search. Nature Human Behaviour, 1(3):0058, 2017.
  • [55] Wulfmeier Markus, Ondruska Peter, and Posner Ingmar. Maximum entropy deep inverse reinforcement learning. arXiv:1507.04888, 2015.
  • [56] Zelinsky Gregory, Yang Zhibo, Huang Lihan, Chen Yupei, Ahn Seoyoung, Wei Zijun, Adeli Hossein, Samaras Dimitris, and Hoai Minh. Benchmarking gaze prediction for categorical visual search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • [57] Zelinsky Gregory J. A theory of eye movements during target acquisition. Psychological Review, 115(4):787, 2008.
  • [58] Zelinsky Gregory J, Peng Yifan, Berg Alexander C, and Samaras Dimitris. Modeling guidance and recognition in categorical search: Bridging human and computer object detection. Journal of Vision, 13(3):30–30, 2013.
  • [59] Zhang Mengmi, Feng Jiashi, Ma Keng Teck, Lim Joo Hwee, Zhao Qi, and Kreiman Gabriel. Finding any Waldo with zero-shot invariant and efficient visual search. Nature Communications, 9(1):3730, 2018.
  • [60] Ziebart Brian D., Maas Andrew, Bagnell J. Andrew, and Dey Anind K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2008.
