Abstract
When people look for things in the environment, they use target templates—mental representations of the objects they are attempting to locate—to guide attention and to assess incoming visual input as potential targets. However, unlike laboratory participants, searchers in the real world rarely have perfect knowledge regarding the potential appearance of targets. In seven experiments, we examined how the precision of target templates affects the ability to conduct visual search. Specifically, we degraded template precision in two ways: 1) by contaminating searchers’ templates with inaccurate features, and 2) by adding extraneous, unhelpful features to the template. We recorded eye movements to allow inferences regarding the relative extents to which attentional guidance and decision-making are hindered by template imprecision. Our findings support a dual-function theory of the target template and highlight the importance of examining template precision in visual search.
Keywords: Visual search, Eye movements, Target templates, Attentional guidance, Decision making
The world is replete with visual information of such richness and complexity that our cognitive systems are subject to information overload. It is impossible to simultaneously process all available information on a moment-to-moment basis.
Imagine yourself sitting in a coffee shop; as you look around, your eyes are flooded with different colors, shapes, textures, contours, and motion. You spot brightly colored boxes of tea on shelves, round reclining chairs in your periphery, the smooth surface of a linoleum floor, and the sharp edges of a table to avoid as you exit. All this raw information is assembled in your mind such that you perceive coherent objects, rather than features. Rather than “squat, opaque, and round,” you see your coffee mug. “Shiny, silver, and cylindrical” is immediately appreciated as a container of milk on the condiment bar. Despite the fluency of converting visual features into meaningful objects, people are limited regarding how much information they can process in any given moment. As such, people process visual scenes through a series of alternating fixations and saccadic eye movements. Over time, regions of space are briefly foveated and objects are stored in visual working memory (VWM), giving rise to stable representations of the world. Chaotic as saccades may seem, eye movements are not random; they are strongly governed by low-level visual characteristics of the environment, by prior experience viewing similar scenes, and (most germane to the current investigation) by the mental representation of whatever the observer is seeking.
Guidance of attention by bottom-up and top-down information
When people look for things in the environment, they use three primary sources of information: low-level salience, scene context, and target template information (Malcolm & Henderson, 2010). Low-level salience is a bottom-up information source that helps select regions of contrast, such as changes in color or intensity (Itti & Koch, 2000, 2001; Koch & Ullman, 1985). Higher-level (top-down) knowledge helps identify informative regions of a scene, those most likely to contain target objects (Castelhano & Henderson, 2007; Neider & Zelinsky, 2006). Finally, target “templates” in memory are used to assess visual information, comparing input to representations stored in VWM. Regions that share features with the template are selected for fixation (Rao et al., 2002; Zelinsky, 2008) and, generally, the more similar an item is to the target, the more likely it is to be fixated (Becker, 2011; Eckstein et al., 2007; Findlay, 1997; Mannan et al., 2010; Olivers, Meijer & Theeuwes, 2006). Returning to the example of searching for milk at a coffee shop, low-level salience information decomposes the store into regions of coherence, contextual knowledge directs your gaze to the condiment bar rather than the merchandise display shelves, and a target template helps direct your eyes toward objects that potentially match the target.
Although bottom-up information is necessary for basic visual processing (see Wolfe & Horowitz, 2004), it seems that, relative to the influence of high-level knowledge, the guidance of attention by low-level features has rather limited utility (Einhäuser et al., 2008; Henderson et al., 2007, 2009; Tatler & Vincent, 2008, 2009). For instance, top-down attentional control is engaged even in simple feature-search tasks, which have traditionally been viewed as requiring little overt attention (Wolfe et al., 2003). People can perform remarkable feats of top-down attentional control; when only a subset of items is ever pertinent in a visual search task, people learn to restrict attention to relevant stimuli (Kunar, Flusberg & Wolfe, 2008; Frings, Wentura, & Wühr, 2012) and can learn the features of distractors in order to guide attention away from them (Yang, Chen & Zelinsky, 2009; see also the “preview benefit” in Watson & Humphreys, 1997, 2000; Watson et al., 2003). Top-down guidance is even strong enough to override attentional capture by low-level salience (Chen & Zelinsky, 2006). In essence, top-down guidance works in two ways: it biases attention toward important features or regions, and it biases attention away from undesirable features (or objects that have already been inspected; Al-Aidroos et al., 2012; Arita, Carlisle, & Woodman, 2012). It is unsurprising that successful models of visual search, such as Guided Search (Wolfe et al., 1989; Wolfe, 1994; Wolfe & Gancarz, 1996; Wolfe, 2007; Palmer et al., 2011), incorporate top-down guidance as a key mechanism controlling attention.
Target template
The idea of a target template was first proposed by ethologists studying the feeding behavior of birds. When birds feed on insects, they tend to sample the common bugs disproportionately often, suggesting that their behavior is biased in favor of target features that previously resulted in rewards (Tinbergen, 1960; Pietrewicz & Kamil, 1979; Bond, 1983). Regarding humans, the target template (also called the “attentional template” or “search template”) refers to the VWM representation of the target item and how it facilitates detection (Wolfe et al., 2004; Malcolm & Henderson, 2009; Bravo & Farid, 2009, 2012; Vickery et al., 2005). Research on primates (Evans et al., 2011; Chelazzi et al., 1993, 1998) has shown that activating a search template involves neurons in the prefrontal cortex (PFC) that select and maintain behavioral goals (such as finding a target among distractors). These neurons project to inferotemporal (IT) cortex, where visual objects are believed to be represented (Mruczek & Sheinberg, 2007; Peelen, Fei-Fei, & Kastner, 2009). Importantly, top-down input from PFC enhances the gain in IT neurons that are selective for the target object; in essence, the PFC “tells” IT cortex which representations to favor and which to inhibit (Stokes et al., 2009; Zhang et al., 2011). Moreover, this bias may be relayed to V4 or other early visual areas that encode basic stimulus features (Hon et al., 2009). Neurons in V4 are sensitive to stimulation in one specific region of the retina, the cell’s “receptive field.” Beyond spatial selectivity, these cells also have preferred colors and/or shapes that they selectively favor (Wolfe, 2005).
In a study using macaques, Bichot, Rossi, and Desimone (2005) recorded from V4 neurons. On trials where the target was (for instance) red, they found that red-selective neurons increased their firing rates even before the target was seen (i.e., before the monkey prepared an eye movement to the target location). Moreover, other red-selective neurons began to synchronize their activity, as if preparing to respond to feature presence. More recently, Eimer et al. (2011) used event-related potentials (ERPs) to study the benefits of advance preparation in visual search (in human participants). Their results suggested that holding a target template in VWM accelerates target selection and resolves attentional competition by inhibiting neurons that code for irrelevant features. Taken together, these findings suggest a mechanism by which target templates may guide visual search; holding a template in mind enhances firing in cells that respond to relevant features and may inhibit cells that respond to irrelevant features (Desimone & Duncan, 1995; Usher & Niebur, 1996).
Problem of template imprecision
In typical search experiments, participants look for a single, unambiguous target in every trial. For instance, participants may see a picture of a bicycle and try to find that same bicycle in a cluttered array of other objects. In other cases, precise target descriptions are given, such as “find a vertical red bar” in a display with lines of various orientations and colors. In such cases, search templates are highly precise. In natural examples of visual search, people rarely enjoy such template precision. We may look for things defined very loosely (e.g., “something to use for a doorstop”). Other times, we might have a good target template but cannot anticipate the exact target appearance, such as looking for a particular friend at a high-school reunion. Perhaps you remember your friend as being thinner than he is now. In such cases, activated template features will draw attention to a relevant subset of individuals but likely will not allow the correct person to “pop out” from the crowd. Additionally, perhaps you are hoping to see several old friends and therefore have multiple templates in mind. If so, larger subsets of people will likely draw attention as potential targets, making search more difficult. The present investigation addressed the effects of such template imprecision on visual search. Specifically, we tested how degrees of dissimilarity between expected and discovered targets affected: 1) the efficiency of guiding attention to targets, and 2) the fluency of appreciating targets once attention has fallen upon them.
Manipulating template precision using template-to-target similarity
The concept of similarity is critical to virtually all theories of perception, attention, and memory (Goldstone & Medin, 1994; Hout, Goldinger & Ferguson, 2013; Medin, Goldstone, & Gentner, 1993). With respect to visual search, theories have long emphasized how efficiency is affected by the similarity of targets to distractors, and by the similarity of distractors to each other (Duncan & Humphreys, 1989, 1992; Hwang, Higgins, & Pomplun, 2009; Wolfe, 1994; Zelinsky, 2008). Although a vast literature exists on target-to-distractor similarity effects, there is relatively little work on template-to-target similarity effects. Prior experiments have compared template-guided search (using picture cues) to categorical search (using word cues). In such experiments, we assume that searchers’ VWM representations differ across conditions. In template-guided search, observers have (nearly) veridical target representations in mind, whereas categorical search affords less precision.
Yang and Zelinsky (2009), for example, had people search for teddy bears using picture or word cues and found that, in the latter case, people searched longer and required more fixations to complete search. In a similar investigation, Schmidt and Zelinsky (2009) argued that the amount of guidance that a cue elicits is directly related to the cue’s categorical specificity. Participants were given five different types of cues: an exact picture of the target (e.g., a picture of a pair of boots), an abstract textual description (e.g., “footwear”), a precise textual description (e.g., “boots”), an abstract plus color textual description (e.g., “brown footwear”), or a precise plus color textual description (e.g., “brown boots”). They found that guidance (indexed by fixation and saccade metrics) increased as more information was added to the cue. Their findings suggest that guidance improves to the extent that visual information can be extracted from the cue and loaded into VWM (these assumptions were incorporated in two recent search models; Zhang et al., 2006; Navalpakkam & Itti, 2005).
Studies of priming also show the utility of target templates. In a recent study, Wilschut, Theeuwes, and Olivers (2014; see also Wilschut, Theeuwes, & Olivers, 2013) had people search for a target line segment defined by color (a red horizontal bar, or a green vertical bar). Participants were first shown both potential target colors as filled circles (pre-cues) in different spatial locations, followed by a fixation dot, and then an “exact” or “neutral” post-cue. The post-cue indicated which color to search for; it was another circle, presented in the same spatial location as one of the pre-cues. On exact-cue trials, the circle was shown in color (green or red), and on neutral-cue trials, the circle was black. Search arrays were shown very briefly, followed by a backwards mask. Wilschut and colleagues found better search accuracy for exact cues, relative to neutral cues. Importantly, however, they found that the effects were equivalent across cue types when priming effects were accounted for. In a second experiment, target pre-cues were shown either verbally (the words “RED” and “GREEN”) or visually (colored circles), and the spatial post-cue was always a neutral black dot, indicating which color to search for. In this way, visual priming was controlled for, because both target cues were presented on each trial and because the search array was temporally aligned to the neutral post-cue. No differences were found between the verbal and visual cue conditions, suggesting that visual target cues engender superior attentional selectivity due largely to the priming of visual features.
In the present investigation, we employed two new techniques for directly manipulating the precision of searchers’ target templates, controlling their similarity to targets that eventually appeared in search displays. Our first set of experiments (1a through 1d) involved a paradigm wherein search targets were validly cued in most trials: observers were shown targets that appeared, unaltered, in the display. Instructions indicated that targets would occasionally differ from the cues: “Please search for this item or something very much like it.” In the remaining trials, the eventual targets were slight variations of the provided cues. Of key interest, there were different levels of variation. For example, Experiment 1a included two levels of variation from the provided cues, denoted as imprecise versus inaccurate. In imprecise trials, the observer might be shown a coffee mug, and the target would be the same mug, oriented differently. In inaccurate trials, the target would be a different color mug. (Importantly, despite these differing degrees of variation, targets were always unmistakable and performance was highly accurate.) Performance in valid trials was then contrasted with invalid trials, with two levels of dissimilarity. In later variations of Experiment 1, cue-target similarity was manipulated more directly, using multidimensional scaling (MDS).
Our second set of experiments (2a through 2c) involved manipulating the “width” of searchers’ template “feature spaces.” In these experiments, observers searched for multiple targets (although single-target, control trials were also included). Two target cues were shown before search, and participants tried to locate either in the search array (only one ever appeared). Width of the template feature space was manipulated by varying the similarity of the two potential targets to one another. Feature spaces ranged from narrow (e.g., a coffee mug oriented in two ways) to broad (e.g., two starfish with different colors and shapes). Across experiments, we therefore examined situations wherein observers’ templates were made imprecise, either by the inclusion of inaccurate features, or the addition of unhelpful ones.
Experiment 1
Experiment 1 tested the degrees to which inaccurate template features would affect visual search. Theoretically, there are two likely functions of a search template. First, it may contribute to attentional guidance, changing the visual system’s “activation map” that sets priorities for objects or regions to fixate, based on similarity to the features in VWM (Malcolm & Henderson, 2009). Second, templates are likely used in target verification, as the criterion to which visual input is compared (Malcolm & Henderson, 2010). To guide this study, we considered three hypotheses regarding how search behavior might change when target templates are imprecise. First, the attentional guidance hypothesis suggests that imprecision will hinder the ability to quickly allocate attention to the correct object, as misleading features in VWM will allow competing objects to draw attention. Second, the decision-making hypothesis posits that imprecise templates will hinder the process of comparing visual input to VWM, slowing target verification (and distractor rejection). Third, the dual-function hypothesis suggests simply that an imprecise target template will hinder both attentional guidance and decision-making during search.
To preview our findings, Experiment 1a established that template imprecision slows search RTs; Experiment 1b then added eye-tracking to decompose search behavior into separate phases. These were scanning (eye movement behavior from search initiation until the target was located) and decision making (the time from first target fixation until the overt response). We used two dependent measures to characterize search behavior during these phases. Scan-path ratios (SPRs) were obtained by summing the amplitude of all saccades (in degrees of visual angle) prior to target fixation and dividing that value by the shortest possible distance between central fixation and the target. Thus, perfect guidance (e.g., pop-out) would yield a ratio equal to one; ratios > 1 would indicate imperfect guidance, as other locations were visited prior to the target.1 Decision times (DTs) were measured from target fixation to the spacebar press terminating search. In Experiments 1a and 1b, degrees of imprecision (between items stored in VWM and eventual targets) were operationalized by comparing “state” and “exemplar” pairs (from Brady et al., 2008; Konkle et al., 2010). Examples are shown in Fig. 1. In Experiments 1c and 1d, we manipulated template precision using multidimensional scaling measures of similarity among objects (from Hout, Goldinger, & Brady, under review) and again tested both search RTs and eye movements (Alexander & Zelinsky, 2011; Godwin, Hout, & Menneer, 2014).
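To make these two measures concrete, the sketch below shows how SPR and DT could be computed from a single trial's eye-movement record. This is a minimal illustration, not the authors' analysis code; the trial structure assumed here (a time-ordered list of fixations, each with a position in degrees of visual angle, an on-target flag, and an onset time) is hypothetical.

```python
import math

def scan_path_ratio(fixations, start, target_pos):
    """Sum the amplitudes (deg) of all saccades made before the first target
    fixation; divide by the straight-line distance from central fixation to
    the target. A ratio of 1.0 indicates perfect guidance; > 1 means other
    locations were visited first."""
    path, prev = 0.0, start
    for fix in fixations:                      # fixations in temporal order
        path += math.dist(prev, fix["pos"])    # amplitude of each saccade
        if fix["on_target"]:
            return path / math.dist(start, target_pos)
        prev = fix["pos"]
    return None                                # target never fixated: trial
                                               # excluded from SPR/DT analyses

def decision_time(fixations, rt_ms):
    """Time from the first target fixation to the spacebar press (ms)."""
    first = next((f for f in fixations if f["on_target"]), None)
    return None if first is None else rt_ms - first["onset_ms"]
```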
Experiment 1a: method
Participants
Twenty students from Arizona State University participated in Experiment 1a as partial fulfillment of a course requirement. All participants had normal, or corrected-to-normal, vision and all reported normal color vision. (These aspects of the participant pool were true for every experiment reported in this article, and are not reiterated each time.)
Apparatus
Data were collected on up to 12 identical computers simultaneously, all separated by dividers. The PCs were Dell Optiplex 380 systems (3.06 GHz, 3.21 GB RAM) operating at 1366 × 768 resolution on Dell E1912H 18.5” monitors (60 Hz refresh rate). The operating system was Windows XP, and E-Prime v2.0 software (Schneider, Eschman, & Zuccolotto, 2002) was used to control all procedures.
Design
Three levels of Template Precision (precise, imprecise, inaccurate) were manipulated within-subjects. In every condition, three levels of Set Size (12, 16, 20) were manipulated in equal proportions.
Stimuli
All stimuli came from the “Massive Memory” database (Brady et al., 2008; Konkle et al., 2010; cvcl.mit.edu/MM/stimuli.html). They were photographs of real-world objects, resized (maintaining original proportions) to a range of 2.0° to 2.5° visual angle (horizontal or vertical), from a viewing distance of 55 cm. The pictures contained no background; a single object or entity was present in each image (e.g., an ice cream cone, a pair of shoes).
Procedure
Visual search
At the beginning of each trial, participants were shown a target cue and were asked to “search for this item or something very similar to it.” When the participants were ready, they pressed the spacebar to start the trial. This initiated a 500-ms fixation cross, followed by the visual search display, which remained until a response was recorded or 10 s elapsed (timeouts were coded as errors). Participants rested their fingers on the spacebar during search, quickly pressing it upon locating the target (RTs were measured from display onset to the spacebar press). Responding cleared the images from view, and each image was replaced with a random number (between one and the set size) for 2 seconds (see Navalpakkam & Itti, 2007, for a similar approach). The numbers then disappeared and participants indicated which number appeared at the target location, using 2AFC (Fig. 2). Feedback was provided as either a centrally presented green checkmark or a large red X. Feedback for correct trials lasted 1 second; feedback for incorrect trials lasted 2 seconds. Instructions asked participants to respond as quickly as possible while remaining accurate. After four practice trials, there were 360 experimental trials, presented in 4 blocks of 90. There were 240 trials of the precise condition and 60 trials apiece of the imprecise and inaccurate conditions.
Search array organization
A search array algorithm was used to create spatial configurations with pseudo-random organization (Fig. 3; Hout & Goldinger, 2012). An equal number of objects appeared in each quadrant of the display (three, four, or five, depending upon set size). Each quadrant was broken down into nine equal “cells,” making the entire display a 6 × 6 grid. On each trial, images were placed in random cells (per quadrant); specific locations were selected to ensure a minimum of 1.5° of visual angle between adjacent images, and between any image and the edges of the screen. No images appeared in the four centermost locations of the screen, to ensure the participant’s gaze would never immediately fall on a target at onset. Targets appeared equally often in each quadrant of the display.
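This cell-based placement lends itself to a compact implementation. The sketch below captures the scheme described above under stated assumptions; the grid bookkeeping is illustrative (the original algorithm is specified in Hout & Goldinger, 2012), and the pixel-level jittering that enforces the 1.5° spacing rule is omitted.

```python
import random

GRID_CELLS = 6                                # the display is a 6 x 6 grid
CENTER = {(2, 2), (2, 3), (3, 2), (3, 3)}     # four centermost cells: unused

def quadrant_cells(qr, qc):
    """The nine cells of one quadrant (qr, qc in {0, 1}), minus center cells."""
    cells = {(r, c)
             for r in range(qr * 3, qr * 3 + 3)
             for c in range(qc * 3, qc * 3 + 3)}
    return sorted(cells - CENTER)

def place_items(set_size):
    """Assign set_size items (12, 16, or 20) to cells, equally per quadrant."""
    per_quadrant = set_size // 4              # 3, 4, or 5 items per quadrant
    cells = []
    for qr in (0, 1):
        for qc in (0, 1):
            cells += random.sample(quadrant_cells(qr, qc), per_quadrant)
    random.shuffle(cells)
    return cells    # cell indices; actual pixel placement would then enforce
                    # the minimum spacing from neighboring images and edges
```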
Stimulus selection
In the 240 “precise” trials, targets appeared exactly as cued. In the remaining 120 trials (60 “imprecise” and 60 “inaccurate”), targets were slight deviations from the initial cues; these were either the state or exemplar partners, respectively, of the cue pictures. Participants were told that targets would appear exactly as shown in most trials but would occasionally be slightly different. This procedure encouraged participants to adopt the cue as a search template. In each trial, the target was selected quasi-randomly from among the 100 exemplar- and 100 state-pairs; each cue-target pair was used once or twice in the experiment. In each trial, distractors were selected quasi-randomly from among 240 object categories, chosen such that only one exemplar per semantic category was represented; across trials, no category was repeated until each had been used at least once. The entire set contained 4,040 images; no picture was used more than twice in the entire experiment.
Experiment 1a: results and discussion
Overall, accuracy was very high (in Experiments 1a through 1d, average accuracy was always > 97 %). Although some reliable effects (e.g., Set Size) were observed in accuracy, potential ceiling effects make them uninteresting. Therefore, all reported results for Experiment 1 focus on RT measures and eye movements. In the RT analyses, although set size and block were included, we focus on the specific findings of interest, involving template precision, and do not dwell on extraneous interactions. Accuracy and RT data were analyzed using 3 (Precision: precise, imprecise, inaccurate) × 3 (Set Size: 12, 16, 20) × 4 (Block: 1-4) within-subjects, repeated measures ANOVAs. Only RTs from correct trials were analyzed.
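For readers who wish to reproduce this style of analysis, a fully within-subjects ANOVA of this design can be run as sketched below. This is not the authors' code (the paper does not report its analysis software), and the file and column names are hypothetical.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# One row per correct trial; file and column names are assumptions.
trials = pd.read_csv("exp1a_correct_trials.csv")

# 3 (Precision) x 3 (Set Size) x 4 (Block) within-subjects ANOVA on RTs,
# averaging trials within each design cell for each participant.
anova = AnovaRM(trials, depvar="rt", subject="participant",
                within=["precision", "set_size", "block"],
                aggregate_func="mean").fit()
print(anova)
```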
An initial validation analysis was conducted to ensure that all stimuli were comparable. When used in precise trials, the stimuli from exemplar and state pairs produced equivalent search RTs (987 and 1025 ms, respectively, F < 1). This suggests that there was nothing inherently more difficult about the exemplar-pair pictures, relative to the state-pair pictures. RTs as a function of Precision and Block (collapsed across Set Size) are shown in Fig. 4. There was a main effect of Precision, with fastest RTs in precise trials (1006 ms), followed by imprecise (1321 ms) and inaccurate (1941 ms) trials; F(2, 18) = 77.75, p < 0.01, ηp² = 0.90. There also were main effects of Set Size (1242, 1424, and 1601 ms for sizes 12, 16, and 20, respectively), F(2, 18) = 24.05, p < 0.01, ηp² = 0.73, and Block (1579, 1473, 1360, and 1277 ms for Blocks 1-4), F(3, 17) = 13.53, p < 0.01, ηp² = 0.70: RTs increased with increasing set size and decreased across blocks. There was a Precision × Block interaction (F(6, 14) = 5.89, p < 0.01, ηp² = 0.72); search RTs improved the most when templates were less precise (improvements of 128, 245, and 533 ms across blocks for the precise, imprecise, and inaccurate conditions, respectively).
Experiment 1a showed that template imprecision impaired search RTs, but that difference diminished over blocks. Search RTs remained relatively flat in the precise condition, improved across blocks in the imprecise condition, and improved most in the inaccurate condition. There are (at least) two possible explanations for this interaction. First, over time, participants may have become more resilient to inaccurate target cues, implicitly adopting broad (perhaps categorical) templates based on the cue, rather than anticipating the specific features shown before search. This would appear mainly as an improvement in decision making, once targets had been located. (It seems unlikely that adopting a broader template would lead to more efficient attentional guidance.) Second, and less interesting, is a potential ceiling effect: RTs were fast overall, hovering around 1000 ms in all blocks of the precise condition. The interaction may have arisen because only the harder conditions afforded more “room” to improve. In Experiment 1b, we followed up on this question in two ways. First, eye-tracking allowed us to decompose RTs into scanning and decision-making phases. We expected to see more fluent decision-making, without a concurrent increase in scanning efficiency.
Second, we introduced a new manipulation of cue validity in Experiment 1b: Across participants, different proportions of mismatch trials were used, based on the expectation that people will adhere to the cues in proportion to their information value. If the cues rarely predict the appearance of the actual targets, an ideal searcher may choose to only loosely base search templates on them. Indeed, a recent study by Machizawa et al. (2012) suggests that the precision of VWM representations can be controlled at will (this study is discussed later in further detail). Experiment 1b introduced a Match Frequency manipulation, varying the proportions of perfect matches. For the high match frequency group, we expected faster search RTs in the “precise” condition, relative to the low match frequency group. However, in the “inaccurate” condition, we expected a reversal, wherein the high match frequency group would be slower, due to their strict adherence to the initial cues.
Experiment 1b: method
Participants
Twenty-nine new Arizona State University students participated in Experiment 1b. There were 10, 10, and 9 participants (respectively) in the low, medium, and high match frequency groups.
Apparatus
The stimuli were unchanged from Experiment 1a, but data were now collected using a Dell Optiplex 755 PC (2.66 GHz, 3.25 GB RAM) with a 21-inch NEC FE21111 CRT monitor (resolution 1280 × 1024, refresh rate 60 Hz). Eye movements were recorded using an EyeLink 1000 eye-tracker (SR Research Ltd., Mississauga, Ontario, Canada), mounted on the desktop. Temporal resolution was 1000 Hz, and spatial resolution was 0.01°. An eye movement was classified as a saccade when its distance exceeded 0.5° and its velocity reached 30°/s (or its acceleration reached 8000°/s²). Viewing was binocular, but only the left eye was recorded.
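Restated as a predicate, the saccade-classification rule above looks as follows. This is only a schematic paraphrase of the stated thresholds; in practice the EyeLink host software applies its own sample-by-sample event parser.

```python
def is_saccade(amplitude_deg, peak_velocity_dps, peak_accel_dps2):
    """An eye movement counts as a saccade when it spans more than 0.5 deg
    and its velocity reaches 30 deg/s (or its acceleration reaches
    8000 deg/s^2), per the thresholds reported in the text."""
    return amplitude_deg > 0.5 and (peak_velocity_dps >= 30.0
                                    or peak_accel_dps2 >= 8000.0)
```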
Design
The design was identical to Experiment 1a, with two exceptions. First, the frequency of precise trials was manipulated between-subjects, with three levels of match frequency: the low, medium, and high conditions corresponded to 20 %, 53 %, and 80 % precise trials, respectively. Second, there were now only three blocks, with 90 trials each.
Procedure
The procedure was identical to Experiment 1a, with the exception of details pertaining to eye-tracking. Participants used a chin-rest during all search trials, adjusted so the eyes were fixated centrally on the computer screen when the participant looked straight ahead. The procedure began with a calibration routine to establish a map of the participant’s known gaze position, relative to the tracker’s coordinate estimate of that position. The routine had participants fixate a small black circle as it moved, in random order, to nine different positions on the screen. Calibration was accepted if the mean error was less than 0.5° of visual angle, with no error exceeding 1° of visual angle. Periodic drift correction and recalibrations ensured accurate recording of gaze position throughout the experiment. Interest areas (IAs) were defined as the smallest rectangular area that encompassed a given image.
The trial procedure was modified to include a gaze-contingent fixation cross. When the fixation cross appeared, participants had to direct their gaze to it for 500 ms to initiate the search display. If they did not do this within 10 seconds, due to human error or calibration problems, the trial was marked as incorrect, and a recalibration was performed before the next trial.
Experiment 1b: results and discussion
Data were analyzed using 3 (Precision: precise, imprecise, inaccurate) × 3 (Match Frequency: low, medium, high) × 3 (Set Size: 12, 16, 20) × 3 (Block: 1-3) mixed-model, repeated measures ANOVAs. Match Frequency was the only between-subjects factor. We included two new dependent measures, obtained via eye-tracking: 1) scan-path ratios (SPRs), and 2) decision times (DTs). Only correct trials were analyzed, and SPRs and DTs were not analyzed for any trials in which the target was not directly fixated.
Response times
The RT results, SPRs, and DTs are shown (in separate panels) in Fig. 5. There was a main effect of Precision, with fastest RTs in precise trials (1146 ms), followed by imprecise (1390 ms) and inaccurate trials (1889 ms); F(2, 25) = 79.55, p < 0.01, ηp² = 0.86. There was no main effect of Match Frequency (F < 1). There were reliable effects of Set Size (1339, 1449, and 1637 ms, for 12, 16, and 20, respectively), F(2, 25) = 23.84, p < 0.01, ηp² = 0.66, and Block (1581, 1531, and 1313 ms for Blocks 1-3), F(2, 25) = 28.49, p < 0.01, ηp² = 0.70. There was a Precision × Block interaction, F(4, 23) = 7.90, p < 0.01, ηp² = 0.58, indicating that performance improved to the greatest degree when templates were less precise (improvements across blocks of 78, 294, and 433 ms for the precise, imprecise, and inaccurate conditions, respectively). No other interactions were significant (Fs < 2).
Scan-path ratios
There was a main effect of Precision, with the best (lowest) SPRs in precise trials (1.63), followed by imprecise (1.79) and inaccurate trials (2.50); F(2, 25) = 66.98, p < 0.01, ηp² = 0.84. There was no main effect of Match Frequency (F < 1). There also were main effects of Set Size (1.70, 1.96, and 2.26, for 12, 16, and 20, respectively), F(2, 25) = 25.73, p < 0.01, ηp² = 0.67, and Block (2.01, 2.03, and 1.88 for Blocks 1-3), F(2, 25) = 3.42, p < 0.05, ηp² = 0.22. There was a Precision × Block interaction, F(4, 23) = 3.59, p < 0.05, ηp² = 0.38, indicating that performance improved most across blocks when templates were imprecise or inaccurate (improvements of 0.23 and 0.20, respectively), but not when templates were precise (decrement of 0.05). There was a Set Size × Block interaction, F(4, 23) = 4.45, p < 0.01, ηp² = 0.44, and a Precision × Match Frequency × Block interaction, F(8, 46) = 2.21, p < 0.05, ηp² = 0.28. No other interactions were significant (Fs < 2).
Decision times
There was a main effect of Precision, with fastest DTs in precise trials (443 ms), followed by imprecise (621 ms) and inaccurate trials (869 ms); F(2, 25) = 47.04, p < 0.01, ηp² = 0.79. There were no main effects of Match Frequency or Set Size (Fs < 1). A main effect of Block was observed, F(2, 25) = 22.50, p < 0.01, ηp² = 0.64, with DTs decreasing over blocks (734, 665, and 534 ms, for Blocks 1-3). There was a Precision × Block interaction, F(4, 23) = 5.07, p < 0.01, ηp² = 0.47, indicating that performance improved across blocks the most when templates were less precise (improvements of 98, 182, and 321 ms, for precise, imprecise, and inaccurate, respectively). No other interactions were significant (Fs < 2).
Experiment 1b replicated and extended Experiment 1a. As before, search RTs were slower when targets differed from initial cues, an effect that diminished across blocks. We hypothesized that this interaction might reflect an increased ability to accommodate imperfect target cues over time. However, the eye-tracking data did not support this idea. If participants adopted broader, more tolerant templates, we would expect their decision times to improve across trials, but their scanning to either remain constant or grow worse. In fact, both aspects of search behavior significantly improved with practice. Moreover, we found no interactions of Block with Match Frequency, suggesting that participants did not alter their search templates as a function of cue validity. Indeed, Match Frequency had essentially no effects in any measure (aside from one unsystematic three-way interaction). This finding is surprising, and suggests that, even with explicit instructions regarding the trustworthiness of the cues, participants did not alter their approach to the task. Rather, they steadfastly adopted the given cues as their templates and adapted to the challenges imposed by inaccuracy as necessary. (Experiment 1d follows up on this finding by introducing a condition wherein the cues were never reliable indicators of target appearance.)
With respect to the hypotheses outlined previously, the eye-tracking results clearly argue in favor of the dual-function hypothesis. We found clear effects of template precision on both scan-path ratios and decision times, with patterns that mirrored the overall search RTs. The presence of inaccurate template features hinders both the ability to put attention in the correct place, and to identify the target once it is viewed. It is interesting to note that decision time accounted for nearly half of overall RT. In Experiments 1c and 1d, we provide converging evidence for these findings with a different approach to estimating template-to-target similarity. Rather than define similarity based on exemplar and state pairs, we used numerous exemplars of each object category and varied similarity by assessing the distance between items in MDS space. Because these experiments were nearly identical, we present them together.
Experiments 1c and 1d: method
The stimuli and apparatus in Experiments 1c and 1d were identical to Experiments 1a and 1b.
Participants
Data collection for Experiments 1c and 1d was (in both cases) conducted in two phases. During the first phase, participants completed the experiments without eye-tracking. In the second phase, a new group of participants completed the experiments and had their eye movements recorded. There were 30 and 18 participants in phases 1 and 2 of Experiment 1c, respectively, and 60 and 18 participants in phases 1 and 2 of Experiment 1d. None had participated in the prior experiments, nor did any participant complete more than one experiment.
Design
The design of Experiment 1c was identical to Experiment 1a, except for the manner in which template precision was manipulated. Here, there were four levels of precision (precise, similar, moderate, dissimilar). There were four blocks of 60 experimental trials, with 15 trials per precision condition, presented in random order, for a total of 240 trials. Experiment 1d was identical to Experiment 1c, except that the precise condition was removed. There were four blocks of 60 experimental trials, with 20 trials per precision condition.
Procedure
The procedure of Experiment 1c was identical to Experiment 1a, except for the stimulus selection process and the inclusion of eye-tracking. As before, precise trials involved presenting search targets that were unaltered, relative to their appearance as cues. In other trials, targets deviated slightly from their cues. In contrast to Experiment 1a, which relied on the distinction between state- and exemplar-pairs, similarity was now manipulated by selecting items from the MDS spaces obtained in Hout et al. (under review). In each trial, a pair of images was selected from among the 240 image categories for which we acquired MDS data. For every object in every category, its neighboring objects were designated as being close, moderately distant, or far neighbors, based on rank-ordering within the category. When the experimental control program selected a trial type (i.e., similar, moderate, or dissimilar), it randomly selected a cue image and its appropriate target counterpart. Distractors were chosen randomly from the other 239 categories, with no more than one exemplar per semantic category in any trial. Cue-target pairs were used only once per experiment. In Experiment 1d, the precise condition was removed; participants were told that the exact cue objects would never appear in the search display but were instead guides for directing attention to the appropriate target object during search.
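The neighbor designation can be illustrated with a short sketch: given a category's MDS coordinates, rank all other exemplars by their distance to a chosen item, then split the ranking into thirds. The binning-by-thirds and the data structures here are assumptions made for illustration; the actual designations came from the rank-orderings in Hout et al. (under review).

```python
import numpy as np

def neighbor_bins(coords, item):
    """Rank all other exemplars in a category by Euclidean distance to `item`
    in the category's MDS space, then split the ranking into close, moderately
    distant, and far neighbors (labels follow the text)."""
    dists = np.linalg.norm(coords - coords[item], axis=1)
    ranked = np.argsort(dists)[1:]        # drop the item itself (distance 0)
    close, moderate, far = np.array_split(ranked, 3)
    return {"similar": close, "moderate": moderate, "dissimilar": far}

# Picking a 'dissimilar' cue-target pair for one (hypothetical) category:
# coords = mds_spaces["mugs"]             # shape: (n_exemplars, n_dims)
# cue = np.random.randint(len(coords))
# target = np.random.choice(neighbor_bins(coords, cue)["dissimilar"])
```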
Experiments 1c and 1d: results and discussion
For Experiment 1c, all data were analyzed using 4 (Precision: precise, similar, moderate, dissimilar) × 3 (Set Size: 12, 16, 20) × 4 (Block: 1-4) within-subjects, repeated measures ANOVAs. For Experiment 1d, data were analyzed in similar fashion but the design was 3×3×4, due to removal of the precise condition. As before, only correct trial RTs, SPRs and DTs were analyzed, and SPRs and DTs were only analyzed for trials in which targets were directly fixated.
In both experiments, we first analyzed the RT data using Phase (1, 2) as a between-subjects factor. In Experiment 1c, there was no main effect of Phase (F < 3) and only one interaction including that factor. There was a significant Phase × Precision interaction (p < .05), showing the same qualitative pattern of findings in both phases, but a slightly steeper change in RTs for phase 2, relative to phase 1. For Experiment 1d, there was a main effect of Phase (p < .05), wherein RTs were 246 ms faster in phase 1, relative to phase 2, but no interactions with this factor. Therefore, for the sake of brevity, we collapsed all data across Phase in the following analyses.
Response times
Experiment 1c
Separate panels of Fig. 6 show mean search RTs, SPRs, and DTs from all conditions of Experiments 1c and 1d. In Experiment 1c, we observed a main effect of Precision, F(3, 45) = 117.70, p < 0.001, ηp² = 0.89, with slower RTs as precision decreased (1479, 2023, 2222, and 2306 ms for precise, similar, moderate, and dissimilar conditions, respectively). To assess whether this effect was driven exclusively by the fast RTs in precise trials, we performed planned comparisons among the other conditions. We found reliable differences between all pairs of conditions (all p < 0.001), except the moderate and dissimilar conditions (p = 0.44), indicating that the effect was not carried by the precise condition alone. There was also a main effect of Set Size, F(2, 46) = 58.15, p < 0.001, ηp² = 0.72, with slower RTs as set size increased (1802, 2005, and 2215 ms for 12, 16, and 20, respectively). The main effect of Block was not reliable (F < 3), nor were any interactions (Fs < 3).
Experiment 1d
Again, we found a main effect of Precision, F(2, 76) = 7.36, p < 0.01, ηp² = 0.16, with slower RTs as precision decreased (2084, 2143, and 2208 ms for similar, moderate, and dissimilar conditions, respectively). There was a main effect of Set Size, F(2, 76) = 103.17, p < 0.001, ηp² = 0.73, with slower RTs at higher set sizes (1945, 2119, and 2370 ms for 12, 16, and 20, respectively). There was no main effect of Block (F < 3), nor any interactions (Fs < 4).
Scan-path ratios
Experiment 1c
There was a main effect of Precision, F(3, 15) = 68.63, p < 0.01, ηp² = 0.93, with higher SPRs as precision decreased (2.18, 2.97, 3.35, and 3.50 for precise, similar, moderate, and dissimilar, respectively). Beyond the precise condition, planned comparisons revealed a difference between the similar and dissimilar conditions (p < 0.05). There was a main effect of Set Size, F(2, 16) = 14.46, p < 0.01, ηp² = 0.64, with higher SPRs at larger set sizes (2.64, 3.03, and 3.23 for 12, 16, and 20, respectively). The main effect of Block was not reliable (F < 1), nor were any interactions (Fs < 1).
Experiment 1d
There was a marginal main effect of Template Precision, F(2, 15) = 3.57, p = 0.05, ηp² = 0.32. The numerical trend, however, was not consistent with prior hypotheses, as the moderate condition had the largest SPR (2.97, 3.23, and 3.09 for similar, moderate, and dissimilar, respectively). We found a main effect of Set Size, F(2, 15) = 23.94, p < 0.01, ηp² = 0.76, with higher SPRs at larger set sizes (2.76, 3.14, and 3.39 for 12, 16, and 20, respectively). There was no main effect of Block (F < 2), nor any interactions (Fs < 1).
Decision times
Experiment 1c
There was a main effect of Template Precision, F(3, 15) = 22.49, p < 0.01, ηp² = 0.82, with longer DTs as precision decreased (345, 644, 736, and 798 ms for precise, similar, moderate, and dissimilar, respectively). Planned comparisons revealed that this was not due to the precise condition exclusively (there was a significant difference between the similar and dissimilar conditions; p < 0.05). There was no main effect of Set Size or Block (Fs < 1), nor were there any interactions (Fs < 2).
Experiment 1d
The main effect of Template Precision was not significant (F < 2), although the trend was consistent with prior hypotheses (730, 746, and 819 ms for similar, moderate, and dissimilar, respectively). There was no main effect of Set Size (F < 1). The main effect of Block was significant, F(3, 14) = 5.59, p < 0.05, ηp² = 0.55, with the longest DTs in Block 3 (713, 711, 836, and 802 ms for Blocks 1-4). There was a Set Size × Block interaction, F(6, 11) = 6.74, p < 0.01, ηp² = 0.79, as the Set Size effect was inconsistent across blocks. No other interactions were reliable (Fs < 2).
In Experiments 1c and 1d, we again found that imprecise templates hinder search times, now finding that psychologically larger changes (manipulated via distances in MDS space) have more detrimental effects. In Experiment 1c, search RTs increased as a function of template imprecision, and both scan-path ratios and decision times followed suit. These effects were not driven entirely by fast performance in the precise conditions, relative to the others. Rather, the effect was graded, increasing as dissimilarity between the cues and targets increased. In Experiment 1d, which included no precise templates, we found consistent trends in the RT and DT data (i.e., slower with greater imprecision), although the DT effect was not reliable, and scan-path ratios did not change systematically as a function of template precision.
Together, the results suggest that people hold their search template with great fidelity in VWM and are affected in a continuous fashion by the psychological distance between expected and encountered targets. The findings from Experiment 1d, in particular, suggest that participants hold the exact cues provided, even when knowing they would never find an exact match. Their RTs still increased monotonically with dissimilarity between cues and targets.
Experiment 2
The key findings from Experiment 1 can be summarized in three points: 1) Template imprecision causes decrements in search RTs that grow as the similarity between encoded cues and eventual targets decreases. 2) This finding holds both when targets are the same exemplars as the cues but in different perceptual states, and when cues and targets are different exemplars altogether. 3) The increases in search RTs arise from deficiencies in both attentional guidance and decision-making, as revealed by eye-tracking.
In Experiment 1, participants (theoretically) formed search templates that mismatched eventual targets. Returning to our earlier example of a high-school reunion, you may have a clear picture of a person in mind but must accommodate mismatching features to recognize the older version of your friend. This is a straightforward situation wherein search templates do not comport very well with actual objects in the environment. In other situations, search templates may not be “wrong,” per se, but may contain too many disparate features. As a thought exercise, imagine again that you are attending the reunion and are hoping to see two different old friends. If both friends were men with brown hair, you would be able to visually scan the room and limit consideration to people fitting a fairly narrow candidate set. Alternatively, if one friend was a man with brown hair, and the other was a woman with blonde hair, simultaneously searching for both will allow a far larger candidate set of people to draw your attention. Experiment 2 focused on this basic contrast, the feature width of the search template, asking: how does the inclusion of extraneous template features affect search behavior?
In Experiment 2, we manipulated template precision by having people search for two targets at once (only one target ever appeared), with varying degrees of mutual similarity. In Experiment 2a, we first tested whether this manipulation (enacted using state versus exemplar pairs) would affect search RTs, then added eye-tracking in Experiment 2b. In Experiment 2c, we again collected search RTs and added eye-tracking measures, with mutual cue similarity estimated using MDS. Data were analyzed in the same manner as Experiment 1, with special focus on main effects and interactions involving feature width. In theoretical terms, we again tested whether having broader search templates would affect attentional guidance, decision making, or both. It seems likely that extraneous features will hinder attentional guidance, drawing attention to objects that resonate with the “wrong” features in VWM. Additionally, if one assumes that a Sternberg-like comparison process (Sternberg, 1966; 1969; 1975) is completed upon viewing each item, then the similarity of the two potential targets in VWM should have no effect on decision-making times.
Experiment 2a: method
For all the sub-experiments under Experiment 2, the apparatus and stimuli were identical to their counterparts in Experiment 1.
Participants
Twenty-six new students from Arizona State University participated in Experiment 2a.
Design
Three levels of Feature Width (precise, narrow, wider) were manipulated within-subjects. Within conditions, three levels of Set Size (12, 16, 20) were used in equal proportions. There were four practice trials, followed by four blocks of 96 experimental trials (384 total). Half the experimental trials presented search cues with only one target object (precise trials) and half presented two potential targets. Among the two-target trials, half had narrow feature width and half had wider feature width.
Procedure
Visual search
The procedure for single-target trials was identical to prior experiments. In two-target trials, participants were shown a pair of images at the beginning of each trial and were asked to find one of them. They were informed that only one target would appear per display, so they should respond as soon as they found one. During the cue phase, actual targets were shown equally often on the left and right. Search arrays were configured as before.
Stimulus selection
In each trial, a pair of images was selected quasi-randomly from among the 100 exemplar- and 100 state-pair stimuli. In precise (i.e., single-target) trials, one item of the pair was randomly selected as the cue and later appeared as the target. In precise trials, stimuli were selected from the exemplar- and state-pair “pools” equally often. The narrow and wider conditions had two potential targets per trial. For narrow trials, paired images were selected from the state-pair stimuli; in wider trials, paired images were selected from the exemplar-pair stimuli. Figure 7 shows example target cues (note that the widest condition was not included in Experiment 2a but was added to later experiments). As in prior experiments, distractors were chosen quasi-randomly from 240 picture categories; only one exemplar per category was presented in any given trial. Neither target nor distractor stimuli were used more than twice throughout the entire experiment.
Experiment 2a: results and discussion
The data were analyzed using a 3 (Feature Width: precise, narrow, wider) × 3 (Set Size: 12, 16, 20) × 4 (Block: 1-4) within-subjects, repeated measures ANOVA. Only correct trial RTs were analyzed and are shown in Fig. 8. One participant was excluded from data analysis for having mean accuracy and RTs that were more than 2.5 standard deviations below the group mean. There was a main effect of Feature Width, F(2, 23) = 18.89, p < 0.01, ηp² = 0.62, with slower RTs as the feature space widened (909, 958, and 1053 ms for precise, narrow, and wider, respectively). We also observed main effects of Set Size, F(2, 23) = 82.63, p < 0.01, ηp² = 0.88, and Block, F(3, 22) = 4.00, p < 0.05, ηp² = 0.35. RTs lengthened as set size increased (855, 975, and 1090 ms for 12, 16, and 20, respectively) and shortened over blocks (1037, 982, 958, and 917 ms for Blocks 1-4). There were no reliable interactions (Fs < 2).
Experiment 2a showed that increasing template feature width slowed visual search. In Experiment 2b, we examined attentional guidance and decision making by adding eye-tracking. We also added a fourth condition to the feature width manipulation (shown in Fig. 7). In the wider condition of Experiment 2a, people searched for two different exemplars of the same core object. In the widest condition of Experiment 2b, people searched for two different underlying objects simultaneously.
Experiment 2b: method
Participants
Sixteen new students from Arizona State University participated in Experiment 2b.
Design
The design was identical to Experiment 2a, with two exceptions. First, the widest condition was added, wherein two images were shown as cues prior to search, showing objects from two semantic categories. Second, there were now only three blocks of 72 trials (216 total).
Procedure
The procedure was identical to Experiment 2a, except that eye-tracking procedures were now added to the experiment. In widest-condition trials, two images were selected quasi-randomly and used as search cues, with one item randomly selected to be the actual target. Stimuli for this condition were drawn equally from the state- and exemplar-pairs. Distractors were quasi-randomly selected from 240 item categories, and no image was used more than twice in the entire experiment.
Experiment 2b: results and discussion
Data were analyzed using a 4 (Feature Width: precise, narrow, wider, widest) × 3 (Set Size: 12, 16, 20) × 3 (Block: 1-3) within-subjects, repeated measures ANOVA. Only correct trial RTs, SPRs, and DTs were analyzed; SPRs and DTs were only analyzed from trials in which targets were directly fixated.
Response times
Figure 9 shows all search RTs, SPRs, and DTs from Experiment 2b. In RTs, we found a main effect of Feature Width, F(3, 13) = 51.76, p < 0.01, ηp² = 0.92, with longer RTs as the feature space widened (958, 1016, 1119, and 1563 ms for precise, narrow, wider, and widest, respectively). As before, we conducted planned comparisons among conditions, finding a difference between the precise and wider conditions (p < 0.05). There were also main effects of Set Size, F(2, 14) = 46.32, p < 0.01, ηp² = 0.87, and Block, F(2, 14) = 3.90, p < 0.05, ηp² = 0.36: RTs increased with set size (1009, 1154, and 1329 ms for 12, 16, and 20, respectively) and were unequal across blocks (1215, 1134, and 1143 ms for Blocks 1-3). There also was a Feature Width × Set Size interaction, F(6, 10) = 4.64, p < 0.05, ηp² = 0.74, indicating that increasing set size had a larger effect when people’s feature space was wider. We calculated the slopes of the best-fitting lines relating RT to Set Size; each slope indicates the “cost” associated with adding one item to the display. Slopes were 15, 29, 44, and 72 ms/item for the precise, narrow, wider, and widest conditions, respectively.
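Slopes of this kind come from a simple least-squares fit of mean RT against set size, computed separately per condition. A minimal sketch, using the overall Set Size means reported above purely for illustration:

```python
import numpy as np

set_sizes = np.array([12, 16, 20])
mean_rts = np.array([1009, 1154, 1329])     # overall RT means from the text (ms)

# Slope of the best-fitting line = the "cost" in ms of each added display item.
slope, intercept = np.polyfit(set_sizes, mean_rts, deg=1)
print(f"search cost: {slope:.0f} ms/item")  # 40 ms/item for these overall means
```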
Scan-path ratios
There was a main effect of Feature Width, F(3, 13) = 68.67, p < 0.01, ηp² = 0.94, with larger SPRs as the feature space widened (1.39, 1.46, 1.62, and 2.24 for precise, narrow, wider, and widest, respectively). Planned comparisons showed only a marginal difference between the precise and wider conditions (p = 0.07). We also found main effects of Set Size, F(2, 14) = 30.63, p < 0.01, ηp² = 0.81, and Block, F(2, 14) = 8.28, p < 0.01, ηp² = 0.54. SPRs grew as set size increased (1.46, 1.66, and 1.91 for 12, 16, and 20, respectively) and were inconsistent across blocks (1.76, 1.60, and 1.67 for Blocks 1-3). There were no interactions (Fs < 2).
Decision times
We found a main effect of Feature Width, F(3, 13) = 11.78, p < 0.01, ηp² = 0.73, with longer DTs as the feature space increased (362, 415, 413, and 521 ms for precise, narrow, wider, and widest, respectively). Planned comparisons revealed that this effect was driven exclusively by the difference between the widest condition and all other conditions; no other pairwise comparisons were reliable. There was a main effect of Block, F(2, 14) = 10.82, p < 0.01, ηp² = 0.61, indicating that decisions became faster over the course of the experiment (478, 414, and 391 ms for Blocks 1-3). There was no effect of Set Size (F < 2), but there was a Feature Width × Set Size × Block interaction, F(12, 4) = 27.72, p < 0.01, ηp² = 0.99. No other interactions were reliable (Fs < 2).
Experiment 2b showed that search is hindered by target templates that are unnecessarily broad. The eye-tracking measures revealed that wider templates affect both attentional guidance and decision-making. In Experiment 2c, we provide converging evidence by manipulating template width using MDS-derived item pairs. In this experiment, participants always searched for two potential targets from the same semantic category, but the psychological distance between those exemplars was manipulated. We again compared these conditions to a single-item (precise template) control condition.
Experiment 2c: method
Participants
As in Experiments 1c and 1d, data collection in Experiment 2c was conducted in two phases (without and with the inclusion of eye-tracking, respectively). Sixty-five and 17 new students from Arizona State University participated in phase 1 and 2, respectively.
Design and procedure
The design of Experiment 2c was identical to Experiment 2a, with two exceptions (plus the inclusion of eye-tracking in phase two). First, feature width was manipulated using the MDS spaces from Hout et al. (under review), with four levels: precise, similar, moderate, and dissimilar. Second, there were 4 blocks of 80 trials (320 total), with 20 trials per feature width condition. As before, precise trials involved only one image, used as both the cue and the target. In other trials, two target cues were shown, selected from the MDS spaces, with inter-item distances that had been classified as similar, moderate, and dissimilar.
Experiment 2c: results and discussion
The data were analyzed using a 4 (Feature Width: precise, similar, moderate, dissimilar) × 3 (Set Size: 12, 16, 20) × 4 (Block: 1-4) within-subjects, repeated measures ANOVA. Only correct trial RTs, SPRs, and DTs were analyzed, and SPRs and DTs were only analyzed for trials in which targets were directly fixated. Two participants were excluded from analysis for error rates more than 2.5 standard deviations above the group mean.
As before, we first analyzed the RT data using Phase as a between-subjects factor. The main effect of Phase was not significant (p = 0.11). There was a Phase × Set Size interaction (p < 0.05), showing a steeper increase in RTs as a function of set size for phase 2, relative to phase 1. There also was an unsystematic 5-way interaction of all factors (p < 0.05) but no other interactions with Phase. We therefore collapsed all data across Phase in the following analyses.
Response times
We found a main effect of Feature Width (see Fig. 10), F(3, 79) = 99.39, p < 0.001, ηp² = 0.79, with slower RTs as feature spaces widened (1136, 1400, 1417, and 1469 ms for precise, similar, moderate, and dissimilar, respectively). Planned comparisons showed a reliable difference between the similar and dissimilar conditions (p < 0.01). There also were main effects of Set Size, F(2, 80) = 177.99, p < 0.001, ηp² = 0.82, and Block, F(3, 79) = 4.28, p < 0.01, ηp² = 0.14. RTs increased with set size (1202, 1361, and 1503 ms for 12, 16, and 20, respectively) and decreased over blocks (1444, 1347, 1315, and 1317 ms for Blocks 1-4). No interactions were reliable (Fs < 2).
Scan-path ratios
We found a main effect of Feature Width, F(3, 14) = 8.31, p < 0.01, ηp² = 0.64, with larger SPRs for dual-target trials, relative to single-target trials (1.87, 2.41, 2.26, and 2.40 for the precise, similar, moderate, and dissimilar conditions, respectively). Planned comparisons showed that the main effect was driven by the difference between single- and dual-target trials (none of the latter differed from one another). There was a main effect of Set Size, F(2, 15) = 12.18, p < 0.01, ηp² = 0.62, with larger SPRs at higher set sizes (1.99, 2.20, and 2.51 for 12, 16, and 20, respectively). The main effect of Block was not reliable (F < 1), nor were any interactions (all Fs < 3).
Decision times
There was a main effect of Feature Width, F(3, 14) = 9.81, p < 0.01, ηp² = 0.68, again indicating a difference between single- and dual-target trials (378, 499, 489, and 467 ms for the precise, similar, moderate, and dissimilar conditions, respectively). Planned comparisons showed that none of the dual-target conditions differed from one another. There was a main effect of Set Size, F(2, 15) = 3.84, p < 0.05, ηp² = 0.34, with longer DTs at higher set sizes (393, 452, and 530 ms for 12, 16, and 20, respectively). The main effect of Block was not reliable (F < 3), nor were any interactions (all Fs < 2).
Experiment 2c showed converging evidence that broader template features inhibit the ability to quickly perform visual search. Participants always searched for a single semantic category, but template width was increased monotonically with distances in psychological space. Importantly, for RTs, the feature width effect was not driven exclusively by the contrast of single- and dual-target trials. Rather, there was a graded effect within the dual-target conditions alone. With respect to eye-movement measures, we found that the feature width effect was driven by the contrast of the precise condition versus every other condition.
Experiment 2 can be summarized in three main points: 1) Template imprecision, imposed by widening the “feature space” of search templates, slows search RTs monotonically with the dissimilarity of the potential target images. 2) Clear performance changes arise between single-target and dual-target search conditions, and (within dual-target trials) between conditions in which the targets are semantically matched, relative to when they are not. And 3) the eye-movement data showed that slower search RTs are caused by decrements in both attentional guidance and decision-making.
General discussion
The present results are consistent with a large literature, which shows that the contents of VWM bias attention toward target-defining features (Olivers et al., 2011; Woodman et al., 2007; Dowd & Mitroff, 2013; Soto et al., 2008). For instance, it has been shown that search templates work at a functionally “low” level, acting in the service of gaze correction. Hollingworth et al. (2008) and Hollingworth and Luck (2009) found that small, corrective saccades tend to be directed toward features that match the search template, typically outside of the observer’s conscious awareness. People make thousands of saccades each day, but these eye movements are ballistic in nature and therefore highly prone to error. When the eyes miss their intended locations, people make rapid, corrective saccades toward the intended locations or objects (Becker, 1972; Deubel, Wolf, & Hauske, 1982). These saccades are initiated quickly, almost reflexively (Kapoula & Robinson, 1986).
Using gaze-contingent displays, Hollingworth and colleagues induced saccadic errors: Participants fixated a central cross and were shown a circular array of different colored patches. One of the color patches changed in size, signaling the observer to fixate that item. On some trials, after the saccade was initiated, the circular array was rotated. Participants did not notice the rotation, due to saccadic suppression (Matin, 1974; Thiele et al., 2002), but this procedure artificially created saccadic error because the targets moved while the eyes were “in flight.” The arrays were rotated only slightly, causing participants to land between the intended target and a distractor. At that point, the task became a small-scale visual search, wherein the searcher had to make a corrective eye movement to the target while avoiding the distractor. The results demonstrated that gaze correction was fast and accurate: Participants correctly moved their eyes to the target more than 90% of the time, typically in under 200 ms, with no awareness of making two distinct saccades (Hollingworth, Richard, & Luck, 2008). This suggests that the search template (in this instance, a target color) biased attention to allow the visual system to quickly inspect regions of interest.
At the other end of the complexity spectrum, search templates also are used to direct attention in visually rich environments. Malcolm and Henderson (2010) showed people pictures of real-world scenes and asked them to locate targets that were cued either with pictures or with words. Specific, picture-cued templates allowed observers to more efficiently place their attention at target locations (as indexed by scan-path ratios) and to verify target identity more quickly once attention was appropriately situated (as indexed by postfixation RTs). These findings are entirely consistent with the present results.
Our data also are consistent with a smaller literature showing that imperfect target cues slow visual search. For example, Vickery, King, and Jiang (2005) conducted experiments wherein people searched for either polygon shapes or real-world objects. In some trials, the target cues were imperfect representations of the to-be-located items. Specifically, they manipulated the size and orientation of targets, relative to cues, and found that deviations along either dimension increased search time. Nevertheless, these imperfect pictorial cues still produced faster search RTs than verbal cues, suggesting that detailed visual information improves attentional guidance. Similarly, Bravo and Farid (2009) had people search for tropical fish in scenes of coral reefs. Before the search task, participants were trained: They were shown single exemplars from five different species and learned to associate the species names with those particular fish. The search task involved determining whether a fish of any species was present in the picture. Across experiments, pictorial and verbal cues indicated which fish was likely to appear. Importantly, there were three conditions: no variation (targets were identical to studied images), 2D viewpoint variation (targets were rotated, flipped, and scaled, relative to studied images), and subordinate-level variation (targets were different images of the same species). Given picture cues, participants found targets most quickly with no variation, more slowly with 2D viewpoint variations, and slowest with new exemplars. By contrast, when verbal cues were used, search times were equivalent whether the target was unvaried or 2D transformed; search was significantly slower, however, when a novel exemplar was shown. This pattern strongly suggests that participants create search templates that are detailed and specific, yet tolerant of deviation.
The present investigation built upon these prior findings by controlling the degree of template imprecision and by tracking eye movements during search. We asked whether imprecise templates affect attentional guidance to objects on-screen, target detection after fixation, or both. Our results strongly supported the dual-function hypothesis: By decomposing search behavior into two functionally distinct phases (scanning time and decision time), we consistently found that the inclusion of inaccurate template features impaired both attentional guidance and decision-making. The present results extend the prior literature because of the way template-to-target similarity was manipulated: Vickery and colleagues (2005) used simple visual variations of targets, relative to cues, and Bravo and Farid (2009) used new exemplars of the target categories. We aimed for greater precision, classifying cross-exemplar similarity using MDS and finding graded decrements in performance that were proportional to the psychological dissimilarity of the items.
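To illustrate how MDS coordinates can be turned into similarity bins of the kind used here, consider the following minimal sketch (our own construction, assuming a 2-D MDS solution with Euclidean distances; the coordinates and percentile cut points are hypothetical, not the values from Hout et al., under review):

```python
import numpy as np
from itertools import combinations

# Hypothetical 2-D MDS coordinates for four exemplars of one category.
coords = {
    "mug_a": (0.10, 0.20),
    "mug_b": (0.20, 0.10),
    "mug_c": (0.90, 0.80),
    "mug_d": (0.50, 0.50),
}

# Pairwise Euclidean distances in the MDS space.
pairs = {
    (a, b): float(np.linalg.norm(np.subtract(coords[a], coords[b])))
    for a, b in combinations(coords, 2)
}

# Split the distance distribution into thirds: similar / moderate / dissimilar.
lo, hi = np.percentile(list(pairs.values()), [33.3, 66.7])
for (a, b), d in sorted(pairs.items(), key=lambda kv: kv[1]):
    label = "similar" if d <= lo else ("moderate" if d <= hi else "dissimilar")
    print(f"{a}-{b}: {d:.2f} ({label})")
```

Pairs drawn from the lowest, middle, and highest thirds of this distribution would then serve as the similar, moderate, and dissimilar cue pairs, respectively.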
Can templates be flexibly controlled?
The utility of precise target templates is well-known. Specific information about the likely appearance of a target can help to guide attention and can help a person fluently verify whether visual input matches whatever is sought. As noted by Bravo and Farid (2012), an effective template must do two things: it must differentiate the target from potential distractors, and it must tolerate variability in target appearance. After all, exemplars from real-world categories vary widely in appearance, are viewed from odd vantage points, etc. An interesting theoretical question, however, is whether people can control the precision (or flexibility) of their search templates.
Indeed, it seems that people can willfully control the precision with which information is held in VWM. Machizawa et al. (2012) used an orientation discrimination paradigm, wherein people were prompted to anticipate a fine- or coarse-level discrimination following a delay interval. Participants were shown a sample display showing several lines, oriented at different angles. The display was then removed for more than 1 s and was replaced by a test probe that looked exactly like the sample display, but one bar changed color and was rotated 15 or 45 degrees (fine and coarse discriminations, respectively). The task was to indicate whether the bar had rotated clockwise or counterclockwise. Unbeknownst to participants, there was an intermediate condition, wherein the bar was rotated 30 degrees. In this condition, discrimination performance was improved when participants were first prompted to anticipate fine-level discrimination, relative to trials in which they were prompted to anticipate coarse-level discrimination. This suggests that participants used the prompt to adjust the precision with which information was stored in VWM.
In a related study, Bravo and Farid (2012) examined the extent to which people could hone their target templates to meet variable task demands. Participants again looked for tropical fish in underwater coral reef scenes. There were two groups: half of the participants searched for the same target image over and over again, and the other half searched for multiple exemplars of the same species. Everyone came back 1-2 days after the initial session and completed a second session that included new exemplars from the target category. Search times to find these new images were faster for participants who were trained on multiple exemplars, relative to those who were trained on a single image, suggesting more varied training allowed people to adopt more general templates that tolerate variation.
In Experiment 1b of the present investigation, participants were told how often targets would appear exactly as cued. Our hypothesis was that participants who received frequently accurate cues would adopt templates that closely matched the cues, and that those who received frequently inaccurate cues would adopt templates that were more tolerant of variation from the cues. Our findings, however, contradicted this prediction: Instructions had no substantive effect on RTs or eye movements. In Experiment 1c, we included a condition in which the cues never accurately depicted the targets. We anticipated that participants would adopt general templates that represented categories but were less tethered to the cues. Nevertheless, people still appeared to rely unduly on the presented cues: Performance generally decreased as cue-target similarity decreased.
Given our manipulations, we anticipated that participants would construct templates that resembled category prototypes, rather than the cues themselves. This did not occur, which may suggest that people cannot behave optimally. Upon closer consideration, however, it seems that our participants simply adopted the presented cues as their templates because that was the easiest thing to do. Stated plainly, calling to mind a prototypical template may be easy when given a verbal cue, but, given a visual cue, the easiest strategy is simply to encode the cue without transformation. This may be a case of availability, such that discounting a visual cue is nearly impossible because it was just seen. We expect that, with less salient visual cues, participants may be better able to adjust the fidelity of their search templates in accordance with optimal strategies.
Cost of searching for multiple items
People are adept visual searchers. Under most circumstances, we can search for more than one item at a time, seemingly without effort. Before departing for work, you may search for several things simultaneously (e.g., your wallet, keys, and phone) and will happily collect them in any order that they appear. Despite its subjective ease, multiple-target search incurs clear costs, relative to single-target search. In prior research (Hout & Goldinger, 2010), we had people search for one, two, or three potential targets, either in cluttered search displays or RSVP “streams” (wherein single items were centrally presented in rapid succession). Only one target appeared in any given trial and participants were informed of this regularity. We found that multiple-target search affected both accuracy and speed; participants made more misses and false-alarms during multiple-target search and were slower in accurate trials, relative to single-target search. These costs also are revealed in eye movements: when people look for multiple targets, they require more fixations to complete each trial (Hout & Goldinger, 2012).
Arguably, in most situations, it likely feels more natural to conduct one multiple-target search than to conduct several consecutive searches for individual items. For example, if search involves walking around your house, it would be inefficient to sweep the entire home once for your wallet, then again for your keys. In situations where search is confined to a single display, however, intuition and performance do not align so well. Menneer et al. (2007; also Menneer et al., 2009, 2010) compared performance in conditions wherein people either looked for two targets simultaneously or looked for two single targets back-to-back. Contrary to intuition, dual-target search took longer (and was less accurate) than two consecutive single-target searches. Menneer and colleagues’ work suggests that, when people look for multiple items at once, the fidelity of the target representations cannot be faithfully maintained. Thus, high-stakes searchers (such as airport baggage screeners) may be better served by a divided search strategy. The findings from Experiment 2 clearly showed dual-target costs in visual search, with worse performance when templates held two potential targets, relative to singular (definite) targets.
An open question remains, however, regarding the nature of VWM templates during multiple-target search: do people use separate, discrete representations, or do they merge cues into a single, broad target template? The answer seemingly depends on the task at hand. VWM has limited capacity (Cowan, 2001; Vogel et al., 2005), and different theoretical accounts have addressed its potential organization. Some theories propose that VWM contains limited “slots” in which information can be stored (Anderson et al., 2011; Awh et al., 2007). Others envision VWM as a dynamic resource that is limited by overall precision, rather than number (Bays & Husain, 2008; Gorgoraptis et al., 2011). Still other theories suggest an answer somewhere in between, as some kind of hybrid discrete-slot/dynamic-resource organization (Alvarez & Cavanagh, 2004; Buschman et al., 2011).
A recent study by Stroud et al. (2011; also Godwin et al., 2010) investigated this question using single- and dual-target search for letter stimuli. People looked for Ts of a certain color; distractors were Ls of various colors. Stroud et al. entertained two hypotheses regarding the nature of dual-target templates. First, searchers might maintain two target templates simultaneously (or rapidly alternate between them; Moore & Osman, 1993). Alternatively, they might construct a single template that includes features of both targets, and possibly values in between (i.e., colors that occupy the “color space” between the targets). Stroud et al. systematically manipulated the similarity of the two potential targets by varying how far apart the items were in “color steps” (defined using CIE xyY color space). When people searched for a single target, they exhibited impressive color selectivity, rarely fixating items that did not match the target color. When looking for two targets, however, color selectivity was reduced, and people often fixated colors that matched neither target. The data were more nuanced, however: When the target colors were two steps apart, people fixated the intervening color more often than when the targets were four steps apart. This suggests that, when the targets were similar, they were encoded as a unitary representation that meshed together both target colors and those in between. When the targets were dissimilar, however, they were encoded as separate, discrete representations that did not “absorb” the feature space between them (although see Houtkamp & Roelfsema, 2009).
Following the findings of Stroud et al. (2011), we may be inclined to suggest that two similar targets (such as state pairs) are represented by meshed, unitary representations, whereas dissimilar targets (such as exemplar pairs) involve detached representations. In Experiment 2b, the scan-path ratios showed that attentional guidance was worse when people switched from single-target (precise) templates to dual-target (but single-category) templates, and grew worse still for two-category templates. In Experiment 2c, however, we found an effect of feature width that arose only between single- and dual-target search; none of the dual-target conditions (which varied in similarity defined by MDS space) differed from one another. The decision-time data showed similar results.
In short, we did not observe fine-grained effects of feature width on either guidance or decision-making. However, adding extraneous features to the search template caused problems for both aspects of search performance. It seems likely that, when two potential targets are very similar (as in our narrow feature width conditions), they may be fused into a single template, whereas less similar targets may be represented by discrete templates.
Conclusions and future directions
The current experiments investigated several important aspects of search templates in VWM, but many theoretical questions remain. In Experiment 1, we provided evidence that, when inaccurate features are included in search templates, both attentional guidance and decision-making are impaired. A question that remains, however, regards the extent to which people have flexible control over the nature of their templates. We attempted to address this issue by manipulating the trustworthiness of the cues and by providing cues that were never fully accurate. Our results, however, suggested that participants still adopted the cues as their templates, perhaps because ignoring such available visual information is impossible. In future experiments, it would be interesting to investigate the extent to which people can create broad, categorical templates, less anchored to the provided search cues. For instance, if participants know that cues are categorically accurate, but dissimilar to the actual targets, can they learn to emphasize category knowledge and deemphasize specific cue features? If so, it would suggest that memory-derived templates are under the flexible control of the observer and can be honed with a high degree of contextual sensitivity.
In Experiment 2, we found further evidence that imprecise templates hinder both search guidance and decision-making. Taken together, these experiments provide strong support for a dual-function theory of target templates, wherein templates both help select appropriate objects to sample in the environment and assist the memory-comparison process that determines when search has been successfully completed. An unresolved issue, however, concerns the VWM representations in dual-target search: When people look for two things at once, do they operate using a single, meshed template, or two discrete representations? We attempted to answer this question by comparing eye movements between single- and dual-target searches, and (within dual-target search) between single- and dual-category searches. Our findings suggest that it may be possible, under certain circumstances, to mesh two similar targets into a singular representation, and that, with increasing dissimilarity, the targets must be maintained separately. At present, however, the inconsistencies in our eye-tracking measures are not enough to make this claim forcefully; further evidence is required to support this contention.
Future experiments should be designed to further delineate this issue, perhaps by modeling the decision-making process in a fashion analogous to Sternberg memory scanning (Sternberg, 1966). If we stretch search out to three, four, or five potential targets, and if we vary item similarity within those sets, there may be sufficient leverage to model decision times (and fixation durations), helping to determine whether separate templates are maintained in tandem. Imagine, for instance, searching for three targets simultaneously. These could be three categorically different targets (e.g., a mug, a lamp, and a dog), or a combination wherein two targets are alike (e.g., a mug and two similar lamps, or a mug and two dissimilar lamps). By contrasting search performance across such conditions, we may better understand when target templates can be collapsed and when they must be maintained separately. For example, searching for one mug and two similar lamps might resemble searching for one mug and one lamp, suggesting that the searcher collapsed both lamps into a single representation. This again would suggest that people have flexible control over search templates and can constrain template specifics to match changing task demands. One way to formalize this prediction is sketched below.
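As a minimal formalization of the proposed Sternberg-style analysis (a sketch under the assumption of strictly serial comparison of each fixated item against the active templates; the parameters are illustrative, not estimates from our data):

```latex
% Per-fixation decision time under serial template comparison (assumed):
%   n      = number of active target templates
%   \alpha = residual (non-comparison) time
%   \beta  = comparison cost per template
DT(n) = \alpha + \beta\, n
% If two similar cues merge into one template, a nominal three-target
% search behaves as n = 2 rather than n = 3, predicting
%   DT_{\mathrm{merged}} \approx \alpha + 2\beta \;<\; \alpha + 3\beta \approx DT_{\mathrm{separate}}.
```

Under this sketch, estimating the slope β separately for similar-pair and dissimilar-pair target sets would indicate whether templates were collapsed (shallower slope) or maintained separately (steeper slope).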
Clearly, it is beneficial to have target templates that faithfully represent the actual appearance of sought-after items in the world. But everyday search rarely affords the precision provided in laboratory experiments, and people typically search using templates that contain inaccurate features, too many features, or too few. To the extent that search templates deviate from the exact forms of targets, people suffer decrements both in guiding attention to viable candidate objects in space and in ascertaining whether visual input matches the desired item once attention has found its way to the right place. The present approach showed that search templates are surprisingly detailed, such that people are highly sensitive to even small mismatches between expected and discovered targets. Future research should seek to better characterize exactly how objects are represented in VWM when people perform the common task of looking for more than one thing at a time, a topic of critical theoretical and practical importance.
Acknowledgments
This work was supported by NIH grant 1 R01 HD075800-01 to Stephen D. Goldinger. We thank Kyle J. Brady for assistance in multidimensional scaling data analysis, and Alexi Rentzis, Lindsey Edgerton, Shelby Doyle, Taylor Thorn, Christina Molidor, Sarah Fialko, Mandana Minai, Deanna Masci, and Taylor Coopman for assistance in data collection. We also thank Carrick Williams and an anonymous reviewer for helpful comments on an earlier version of this manuscript.
Footnotes
We also examined “time to fixate” the target as a measure of scanning behavior in each of the eye-tracking studies (e.g., Castelhano, Pollatsek, & Cave, 2008). The results were entirely consistent with the findings from scan-path ratios. We therefore chose to report SPRs, owing to their straightforward interpretation; specifically, that an SPR of 1.0 indicates perfect attentional guidance.
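Because the footnote defines perfect guidance as an SPR of 1.0, a brief sketch of the computation may be helpful (a minimal illustration assuming the standard definition of the scan-path ratio as the summed length of the scan path divided by the straight-line distance from the starting fixation to the target; the coordinates are hypothetical):

```python
import math

def scan_path_ratio(fixations, target):
    """Summed length of the eye's path, divided by the straight-line
    distance from the first fixation to the target (assumed definition).
    A value of 1.0 indicates a perfectly direct path to the target."""
    path = sum(math.dist(a, b) for a, b in zip(fixations, fixations[1:]))
    return path / math.dist(fixations[0], target)

# Hypothetical trial: central start, one wayward fixation, then the target.
fixations = [(512, 384), (300, 150), (100, 250)]
print(round(scan_path_ratio(fixations, target=(100, 250)), 2))  # ≈ 1.24 (indirect path)
```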
Contributor Information
Michael C. Hout, Department of Psychology, New Mexico State University, P.O. Box 30001 / MSC 3452, Las Cruces, NM 88003, USA
Stephen D. Goldinger, Department of Psychology, Arizona State University, P.O. Box 871104, Tempe, AZ 85287, USA
References
- Alexander RG, Zelinsky GJ. Visual similarity effects in categorical search. Journal of Vision. 2011;11:1–15. doi:10.1167/11.8.9.
- Al-Aidroos N, Emrich SM, Ferber S, Pratt J. Visual working memory supports the inhibition of previously processed information: Evidence from preview search. Journal of Experimental Psychology: Human Perception and Performance. 2012;38:643–663. doi:10.1037/a0025707.
- Alvarez GA, Cavanagh P. The capacity of visual short-term memory is set both by visual information load and by number of objects. Psychological Science. 2004;15:106–111. doi:10.1111/j.0963-7214.2004.01502006.x.
- Anderson DE, Vogel EK, Awh E. Precision in visual working memory reaches a stable plateau when individual item limits are exceeded. Journal of Neuroscience. 2011;31:1128–1138. doi:10.1523/JNEUROSCI.4125-10.2011.
- Arita JT, Carlisle NB, Woodman GF. Templates for rejection: Configuring attention to ignore task-relevant features. Journal of Experimental Psychology: Human Perception and Performance. 2012;38:580–584. doi:10.1037/a0027885.
- Awh E, Barton B, Vogel EK. Visual working memory represents a fixed number of items regardless of complexity. Psychological Science. 2007;18:622–628. doi:10.1111/j.1467-9280.2007.01949.x.
- Bays PM, Husain M. Dynamic shifts of limited working memory resources in human vision. Science. 2008;321:851–854. doi:10.1126/science.1158023.
- Becker SI. Determinants of dwell time in visual search: Similarity or perceptual difficulty? PLoS One. 2011;6:1–5. doi:10.1371/journal.pone.0017740.
- Becker W. The control of eye movements in the saccadic system. Bibliotheca Ophthalmologica. 1972;82:233–243.
- Bichot NP, Rossi AF, Desimone R. Parallel and serial neural mechanisms for visual search in macaque area V4. Science. 2005;308:529–534. doi:10.1126/science.1109676.
- Bond AB. Visual search and selection of natural stimuli in the pigeon: The attention threshold hypothesis. Journal of Experimental Psychology: Animal Behavior Processes. 1983;9:292–306. doi:10.1037/0097-7403.9.3.292.
- Brady TF, Konkle T, Alvarez GA, Oliva A. Visual long-term memory has a massive storage capacity for object details. Proceedings of the National Academy of Sciences. 2008;105:14325–14329. doi:10.1073/pnas.0803390105.
- Bravo MJ, Farid H. The specificity of the search template. Journal of Vision. 2009;9:1–9. doi:10.1167/9.1.34.
- Bravo MJ, Farid H. Task demands determine the specificity of the search template. Attention, Perception & Psychophysics. 2012;74:124–131. doi:10.3758/s13414-011-0224-5.
- Buschman TJ, Siegel M, Roy JE, Miller EK. Neural substrates of cognitive capacity limitations. Proceedings of the National Academy of Sciences. 2011;108:11252–11255. doi:10.1073/pnas.1104666108.
- Castelhano MS, Henderson JM. Initial scene representations facilitate eye movement guidance in visual search. Journal of Experimental Psychology: Human Perception and Performance. 2007;33:753–763. doi:10.1037/0096-1523.33.4.753.
- Castelhano MS, Pollatsek A, Cave K. Typicality aids search for an unspecified target, but only in identification, and not in attentional guidance. Psychonomic Bulletin & Review. 2008;15:795–801. doi:10.3758/PBR.15.4.795.
- Chelazzi L, Duncan J, Miller EK, Desimone R. Responses of neurons in inferior temporal cortex during memory-guided visual search. Journal of Neurophysiology. 1998;80:2918–2940. doi:10.1152/jn.1998.80.6.2918.
- Chelazzi L, Miller EK, Duncan J, Desimone R. A neural basis for visual search in inferior temporal cortex. Nature. 1993;363:345–347. doi:10.1038/363345a0.
- Chen X, Zelinsky GJ. Real-world search is dominated by top-down guidance. Vision Research. 2006;46:4118–4133. doi:10.1016/j.visres.2006.08.008.
- Cowan N. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences. 2001;24:87–185. doi:10.1017/S0140525X01003922.
- Desimone R, Duncan J. Neural mechanisms of selective visual attention. Annual Review of Neuroscience. 1995;18:193–222. doi:10.1146/annurev.ne.18.030195.001205.
- Deubel H, Wolf W, Hauske G. Corrective saccades: Effect of shifting the saccade goal. Vision Research. 1982;22:353–364. doi:10.1016/0042-6989(82)90151-1.
- Dowd EW, Mitroff SR. Attentional guidance by working memory overrides saliency cues in visual search. Journal of Experimental Psychology: Human Perception and Performance. 2013;39:1786–1796. doi:10.1037/a0032548.
- Duncan J, Humphreys GW. Visual search and stimulus similarity. Psychological Review. 1989;96:433–458. doi:10.1037/0033-295X.96.3.433.
- Duncan J, Humphreys GW. Beyond the search surface: Visual search and attentional engagement. Journal of Experimental Psychology: Human Perception and Performance. 1992;18:578–588. doi:10.1037/0096-1523.18.2.578.
- Eckstein MP, Beutter BR, Pham BT, Shimozaki SS, Stone LS. Similar neural representations of the target for saccades and perception during search. Journal of Neuroscience. 2007;27:1266–1270. doi:10.1523/JNEUROSCI.3975-06.2007.
- Eimer M, Kiss M, Nicholas S. What top-down task sets do for us: An ERP study on the benefits of advance preparation in visual search. Journal of Experimental Psychology: Human Perception and Performance. 2011;37:1758–1766. doi:10.1037/a0024326.
- Einhäuser W, Rutishauser U, Koch C. Task-demands can immediately reverse the effects of sensory-driven saliency in complex visual stimuli. Journal of Vision. 2008;8:1–19. doi:10.1167/8.2.2.
- Matin E. Saccadic suppression: A review and an analysis. Psychological Bulletin. 1974;81:899–917. doi:10.1037/h0037368.
- Evans KK, Horowitz TS, Howe P, Pedersini R, Reijnen E, Pinto Y, Wolfe JM. Visual attention. Wiley Interdisciplinary Reviews: Cognitive Science. 2011;2:503–514. doi:10.1002/wcs.127.
- Findlay JM. Saccade target selection during visual search. Vision Research. 1997;37:617–631. doi:10.1016/S0042-6989(96)00218-0.
- Frings C, Wentura D, Wühr P. On the fate of distractor representations. Journal of Experimental Psychology: Human Perception and Performance. 2012;38:570–575. doi:10.1037/a0027781.
- Godwin HJ, Hout MC, Menneer T. Visual similarity is stronger than semantic similarity in guiding visual search for numbers. Psychonomic Bulletin & Review. 2014;21:689–695. doi:10.3758/s13423-013-0547-4.
- Godwin HJ, Menneer T, Cave KR, Donnelly N. Dual-target search for high and low prevalence X-ray threat targets. Visual Cognition. 2010;18:1439–1463. doi:10.1080/13506285.2010.500605.
- Goldstone RL, Medin DL. The time course of comparison. Journal of Experimental Psychology: Learning, Memory, & Cognition. 1994;20:29–50. doi:10.1037/0278-7393.20.1.29.
- Gorgoraptis N, Catalao RF, Bays PM, Husain M. Dynamic updating of working memory resources for visual objects. Journal of Neuroscience. 2011;31:8502–8511. doi:10.1523/JNEUROSCI.0208-11.2011.
- Henderson JM, Brockmole JR, Castelhano MS, Mack M. Visual saliency does not account for eye movements during visual search in real world scenes. In: van Gompel RPG, Fischer MH, Murray WS, Hill RL, editors. Eye movements: A window on mind and brain. Oxford, UK: Elsevier; 2007. pp. 537–562.
- Henderson JM, Malcolm GL, Schandl C. Searching in the dark: Cognitive relevance versus visual salience during search for non-salient objects in real-world scenes. Psychonomic Bulletin & Review. 2009;16:850–856. doi:10.3758/PBR.16.5.850.
- Hollingworth A, Luck SJ. The role of visual working memory (VWM) in the control of gaze during visual search. Attention, Perception & Psychophysics. 2009;71:936–949. doi:10.3758/APP.71.4.936.
- Hollingworth A, Richard AM, Luck SJ. Understanding the function of visual short-term memory: Transsaccadic memory, object correspondence, and gaze correction. Journal of Experimental Psychology: General. 2008;137:163–181. doi:10.1037/0096-3445.137.1.163.
- Hon N, Thompson R, Sigala N, Duncan J. Evidence for long-range feedback in target detection: Detection of semantic targets modulates activity in early visual areas. Neuropsychologia. 2009;47:1721–1727. doi:10.1016/j.neuropsychologia.2009.02.011.
- Hout MC, Goldinger SD. Learning in repeated visual search. Attention, Perception & Psychophysics. 2010;72:1267–1282. doi:10.3758/APP.72.5.1267.
- Hout MC, Goldinger SD. Incidental learning speeds visual search by lowering response thresholds, not by improving efficiency. Journal of Experimental Psychology: Human Perception and Performance. 2012;38:90–112. doi:10.1037/a0023894.
- Hout MC, Goldinger SD, Brady KJ. MM-MDS: A multidimensional scaling database with similarity ratings for 240 object categories from the Massive Memory picture database. (under review)
- Hout MC, Goldinger SD, Ferguson RW. The versatility of SpAM: A fast, efficient spatial method of data collection for multidimensional scaling. Journal of Experimental Psychology: General. 2013;142:256–281. doi:10.1037/a0028860.
- Houtkamp R, Roelfsema PR. Matching of visual input to only one item at any one time. Psychological Research. 2009;73:317–326. doi:10.1007/s00426-008-0157-3.
- Hwang AD, Higgins EC, Pomplun M. A model of top-down attentional control during visual search in complex scenes. Journal of Vision. 2009;9:1–18. doi:10.1167/9.5.25.
- Itti L, Koch C. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research. 2000;40:1489–1506. doi:10.1016/S0042-6989(99)00163-7.
- Itti L, Koch C. Computational modeling of visual attention. Nature Reviews Neuroscience. 2001;2:194–203. doi:10.1038/35058500.
- Kapoula Z, Robinson DA. Saccadic undershoot is not inevitable: Saccades can be accurate. Vision Research. 1986;26:735–743. doi:10.1016/0042-6989(86)90087-8.
- Koch C, Ullman S. Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology. 1985;4:219–227.
- Konkle T, Brady TF, Alvarez GA, Oliva A. Conceptual distinctiveness supports detailed visual long-term memory for real-world objects. Journal of Experimental Psychology: General. 2010;139:558–578. doi:10.1037/a0019165.
- Kunar MA, Flusberg S, Wolfe JM. The role of memory and restricted context in repeated visual search. Perception & Psychophysics. 2008;70:314–328. doi:10.3758/PP.70.2.314.
- Machizawa MG, Goh CCW, Driver J. Human visual short-term memory precision can be varied at will when the number of retained items is low. Psychological Science. 2012;23:554–559. doi:10.1177/0956797611431988.
- Malcolm GL, Henderson JM. The effects of target template specificity on visual search in real-world scenes: Evidence from eye movements. Journal of Vision. 2009;9:1–13. doi:10.1167/9.11.8.
- Malcolm GL, Henderson JM. Combining top-down processes to guide eye movements during real-world scene search. Journal of Vision. 2010;10:1–11. doi:10.1167/10.2.4.
- Mannan SK, Kennard C, Potter D, Pan Y, Soto D. Early oculomotor capture by new onsets driven by the contents of working memory. Vision Research. 2010;50:1590–1597. doi:10.1016/j.visres.2010.05.015.
- Medin DL, Goldstone RL, Gentner D. Respects for similarity. Psychological Review. 1993;100:254–278. doi:10.1037/0033-295X.100.2.254.
- Menneer T, Barrett DJK, Phillips L, Donnelly N, Cave KR. Costs in searching for two targets: Dividing search across target types could improve airport security screening. Applied Cognitive Psychology. 2007;21:915–932. doi:10.1002/acp.1305.
- Menneer T, Cave KR, Donnelly N. The cost of search for multiple targets: Effects of practice and target similarity. Journal of Experimental Psychology: Applied. 2009;15:125–139. doi:10.1037/a0015331.
- Menneer T, Donnelly N, Godwin HJ, Cave KR. High or low target prevalence increases the dual-target cost in visual search. Journal of Experimental Psychology: Applied. 2010;16:133–144. doi:10.1037/a0019569.
- Moore CM, Osman AM. Looking for two targets at the same time: One search or two? Perception & Psychophysics. 1993;53:381–390. doi:10.3758/BF03206781.
- Mruczek REB, Sheinberg DL. Activity of inferior temporal cortical neurons predicts recognition choice behavior and recognition time during visual search. Journal of Neuroscience. 2007;27:2825–2836. doi:10.1523/JNEUROSCI.4102-06.2007.
- Navalpakkam V, Itti L. Modeling the influence of task on attention. Vision Research. 2005;45:205–231. doi:10.1016/j.visres.2004.07.042.
- Navalpakkam V, Itti L. Search goal tunes visual features optimally. Neuron. 2007;53:605–617. doi:10.1016/j.neuron.2007.01.018.
- Neider MB, Zelinsky GJ. Searching for camouflaged targets: Effects of target-background similarity on visual search. Vision Research. 2006;46:2217–2235. doi:10.1016/j.visres.2006.01.006.
- Olivers CN, Meijer F, Theeuwes J. Feature-based memory-driven attentional capture: Visual working memory content affects visual attention. Journal of Experimental Psychology: Human Perception and Performance. 2006;32:1243–1265. doi:10.1037/0096-1523.32.5.1243.
- Olivers CN, Peters J, Houtkamp R, Roelfsema PR. Different states in visual working memory: When it guides attention and when it does not. Trends in Cognitive Sciences. 2011;15:327–334. doi:10.1016/j.tics.2011.05.004.
- Palmer EM, Fencsik DE, Flusberg SJ, Horowitz TS, Wolfe JM. Signal detection evidence for limited capacity in visual search. Attention, Perception & Psychophysics. 2011;73:2413–2424. doi:10.3758/s13414-011-0199-2.
- Peelen MV, Li F-F, Kastner S. Neural mechanisms of rapid natural scene categorization in human visual cortex. Nature. 2009;460:94–97. doi:10.1038/nature08103.
- Pietrewicz AT, Kamil AC. Search image formation in the blue jay (Cyanocitta cristata). Science. 1979;204:1332–1333. doi:10.1126/science.204.4399.1332.
- Rao RP, Zelinsky GJ, Hayhoe MM, Ballard DH. Eye movements in iconic visual search. Vision Research. 2002;42:1447–1463. doi:10.1016/S0042-6989(02)00040-8.
- Schneider W, Eschman A, Zuccolotto A. E-Prime User’s Guide. Pittsburgh, PA: Psychology Software Tools Inc; 2002.
- Schmidt J, Zelinsky GJ. Search guidance is proportional to the categorical specificity of a target cue. The Quarterly Journal of Experimental Psychology. 2009;62:1904–1914. doi:10.1080/17470210902853530.
- Soto D, Hodsoll J, Rotshtein P, Humphreys GW. Automatic guidance of attention from working memory. Trends in Cognitive Sciences. 2008;12:342–348. doi:10.1016/j.tics.2008.05.007.
- Sternberg S. High-speed scanning in human memory. Science. 1966;153:652–654. doi:10.1126/science.153.3736.652.
- Sternberg S. Memory-scanning: Mental processes revealed by reaction-time experiments. American Scientist. 1969;57:421–457.
- Sternberg S. Memory scanning: New findings and current controversies. The Quarterly Journal of Experimental Psychology. 1975;27:1–32. doi:10.1080/14640747508400459.
- Stokes M, Thompson R, Nobre AC, Duncan J. Shape-specific preparatory activity mediates attention to targets in human visual cortex. Proceedings of the National Academy of Sciences. 2009;106:19569–19574. doi:10.1073/pnas.0905306106.
- Stroud MJ, Menneer T, Cave KR, Donnelly N, Rayner K. Search for multiple targets of different colours: Misguided eye movements reveal a reduction of colour selectivity. Applied Cognitive Psychology. 2011;25:971–982. doi:10.1002/acp.1790.
- Tatler BW, Vincent BT. Systematic tendencies in scene viewing. Journal of Eye Movement Research. 2008;2:1–18.
- Tatler BW, Vincent BT. The prominence of behavioral biases in eye guidance. Visual Cognition. 2009;17:1029–1054. doi:10.1080/13506280902764539.
- Thiele A, Henning P, Kubischik M, Hoffman KP. Neural mechanisms of saccadic suppression. Science. 2002;295:2460–2462. doi:10.1126/science.1068788.
- Tinbergen N. The natural control of insects in pine woods: Vol. I. Factors influencing the intensity of predation by songbirds. Archives Néerlandaises de Zoologie. 1960;13:265–343.
- Usher M, Niebur E. Modeling the temporal dynamics of IT neurons in visual search: A mechanism for top-down selective attention. Journal of Cognitive Neuroscience. 1996;8:311–327. doi:10.1162/jocn.1996.8.4.311.
- Vickery TJ, King L, Jiang Y. Setting up the target template in visual search. Journal of Vision. 2005;5:81–92. doi:10.1167/5.1.8.
- Vogel EK, McCollough AW, Machizawa MG. Neural measures reveal individual differences in controlling access to working memory. Nature. 2005;438:500–503. doi:10.1038/nature04171.
- Watson DG, Humphreys GW. Visual marking: Prioritizing selection for new objects by top-down attentional inhibition of old objects. Psychological Review. 1997;104:90–122. doi:10.1037/0033-295X.104.1.90.
- Watson DG, Humphreys GW. Visual marking: Evidence for inhibition using a probe-dot paradigm. Perception & Psychophysics. 2000;62:471–481. doi:10.3758/BF03212099.
- Watson DG, Humphreys GW, Olivers CNL. Visual marking: Using time in visual selection. Trends in Cognitive Sciences. 2003;7:180–186. doi:10.1016/S1364-6613(03)00033-0.
- Wilschut A, Theeuwes J, Olivers CNL. The time it takes to turn a memory into a template. Journal of Vision. 2013;13:1–11. doi:10.1167/13.3.8.
- Wilschut A, Theeuwes J, Olivers CNL. Priming and the guidance by visual and categorical templates in visual search. Frontiers in Psychology. 2014;5:1–12. doi:10.3389/fpsyg.2014.00148.
- Woodman GF, Luck SJ, Schall JD. The role of working memory representations in the control of attention. Cerebral Cortex. 2007;17:118–124. doi:10.1093/cercor/bhm065.
- Wolfe JM. Guided Search 2.0: A revised model of visual search. Psychonomic Bulletin & Review. 1994;1:202–238. doi:10.3758/BF03200774.
- Wolfe JM. Watching single cells pay attention. Science. 2005;308:503–504. doi:10.1126/science.1112616.
- Wolfe JM. Guided Search 4.0: Current progress with a model of visual search. In: Gray WD, editor. Integrated models of cognitive systems. New York, NY: Oxford University Press; 2007. pp. 99–119.
- Wolfe JM, Butcher SJ, Lee C, Hyle M. Changing your mind: On the contribution of top-down and bottom-up guidance in visual search for feature singletons. Journal of Experimental Psychology: Human Perception and Performance. 2003;29:483–502. doi:10.1037/0096-1523.29.2.483.
- Wolfe JM, Cave KR, Franzel SL. Guided Search: An alternative to the feature integration model for visual search. Journal of Experimental Psychology: Human Perception and Performance. 1989;15:419–433. doi:10.1037/0096-1523.15.3.419.
- Wolfe JM, Gancarz G. Guided Search 3.0: A model of visual search catches up with Jay Enoch 40 years later. In: Lakshminarayanan V, editor. Basic and clinical applications of vision science. Dordrecht, Netherlands: Kluwer Academic; 1996. pp. 189–192.
- Wolfe JM, Horowitz TS. What attributes guide the deployment of visual attention and how do they do it? Nature Reviews Neuroscience. 2004;5:1–7. doi:10.1038/nrn1411.
- Wolfe JM, Horowitz TS, Kenner N, Hyle M, Vasan N. How fast can you change your mind? The speed of top-down guidance in visual search. Vision Research. 2004;44:1411–1426. doi:10.1016/j.visres.2003.11.024.
- Yang H, Zelinsky GJ. Visual search is guided to categorically-defined targets. Vision Research. 2009;49:2095–2103. doi:10.1016/j.visres.2009.05.017.
- Yang H, Chen X, Zelinsky GJ. A new look at novelty effects: Guiding search away from old distractors. Attention, Perception & Psychophysics. 2009;71:554–564. doi:10.3758/APP.71.3.554.
- Zelinsky GJ. A theory of eye movements during target acquisition. Psychological Review. 2008;115:787–835. doi:10.1037/a0013118.
- Zhang Y, Meyers EM, Bichot NP, Serre T, Poggio TA, Desimone R. Object decoding with attention in inferior temporal cortex. Proceedings of the National Academy of Sciences. 2011;108:8850–8855. doi:10.1073/pnas.1100999108.
- Zhang Y, Yang H, Samaras D, Zelinsky GJ. A computational model of eye movements during object class detection. In: Weiss Y, Schölkopf B, Platt J, editors. Advances in neural information processing systems. Vol. 18. Cambridge, MA: MIT Press; 2006. pp. 1609–1616.