Investigating Visual Crowding of Objects in Complex Real-World Scenes

Ryan V Ringer; Allison M Coy; Adam M Larson; Lester C Loschky

doi:10.1177/2041669521994150

. 2021 Apr 28;12(2):2041669521994150. doi: 10.1177/2041669521994150

Investigating Visual Crowding of Objects in Complex Real-World Scenes

Ryan V Ringer ^1,^✉, Allison M Coy ², Adam M Larson ³, Lester C Loschky ⁴

PMCID: PMC8822316 PMID: 35145614

Abstract

Visual crowding, the impairment of object recognition in peripheral vision due to flanking objects, has generally been studied using simple stimuli on blank backgrounds. While crowding is widely assumed to occur in natural scenes, it has not been shown rigorously yet. Given that scene contexts can facilitate object recognition, crowding effects may be dampened in real-world scenes. Therefore, this study investigated crowding using objects in computer-generated real-world scenes. In two experiments, target objects were presented with four flanker objects placed uniformly around the target. Previous research indicates that crowding occurs when the distance between the target and flanker is approximately less than half the retinal eccentricity of the target. In each image, the spacing between the target and flanker objects was varied considerably above or below the standard (0.5) threshold to either suppress or facilitate the crowding effect. Experiment 1 cued the target location and then briefly flashed the scene image before participants could move their eyes. Participants then selected the target object’s category from a 15-alternative forced choice response set (including all objects shown in the scene). Experiment 2 used eye tracking to ensure participants were centrally fixating at the beginning of each trial and showed the image for the duration of the participant’s fixation. Both experiments found object recognition accuracy decreased with smaller spacing between targets and flanker objects. Thus, this study rigorously shows crowding of objects in semantically consistent real-world scenes.

Keywords: crowding, object recognition, scene perception, spatial selection/modulation, peripheral vision

Visual Crowding in Real-World Scene Perception

We see the scenes surrounding us in our visual world with both central and peripheral vision. Our entire horizontal visual field extends to roughly 210° in diameter, however, much of our conscious experience occurs within a 5° to 8.5° radius central window (i.e., central vision), which contains the macula and the highest density of photoreceptors (for review, see Strasburger, 2020). Therefore, most of our visual environment is processed using peripheral vision. This raises a key question that vision scientists have been investigating for more than the last 100 years—what is the nature of vision in our visual periphery (for review, see Loschky et al., 2017; Strasburger et al., 2011; Wilson et al., 1990)? Here we are concerned with a particular visual phenomenon known as crowding, which limits both our conscious perception of things in peripheral vision and our ability to recognize them. Crowding has been studied relatively intensively over the last two decades, including special issues of the Journal of Vision in both 2007 (Pelli, Cavanagh, et al., 2007) and in 2014 to 2017 (Crowding: New Vistas,2014–2017), and several in-depth review articles (Levi, 2008; Pelli & Tillman, 2008; Strasburger, 2020; Strasburger et al., 2011; Whitney & Levi, 2011), but an important and unanswered question is, does it operate for objects in real-world scenes? If crowding can affect object recognition in real-world scenes, then it may place very important constraints on our experience of the visual world.

Visual crowding has been characterized as impaired object or form recognition due to the presence of surrounding objects (Bouma, 1970; Ehlers, 1936; Stuart & Burian, 1962; Woodrow, 1938). The most fundamental aspect of crowding is the existence of a critical spacing, which is defined by the separation between the centers of the target and the flanking object that is necessary to achieve recognition performance equivalent to that of a target presented in isolation. It was discovered that the critical spacing between target and distractor objects for a given eccentricity was equal to approximately ½ the visual angle of the retinal eccentricity of the target object (Bouma, 1970), which has come to be known as Bouma’s constant (Whitney & Levi, 2011) or Bouma factor (Pelli et al., 2004; Pelli, Tillman, et al., 2007; Strasburger, 2020). It has been replicated across a number of studies and a range of stimuli (Latham & Whitaker, 1996; Pelli et al., 2004; Pelli & Tillman, 2008; Whitney & Levi, 2011). Whether nearby objects are separated by a distance either greater than or equal to the Bouma factor, or less than the Bouma factor, determines, to a first approximation, whether they do or do not suffer impaired object discrimination, respectively.¹

Although crowding and its properties have been extensively studied, previous studies of crowding may have only limited generalizability to seeing real-world objects in scenes in peripheral vision. The majority of crowding studies have used simple stimuli such as alphanumeric characters (Strasburger et al., 1991), Gabor patches (Parkes et al., 2001; Poder & Wagemans, 2007), or Vernier acuity stimuli (Levi et al., 1985). Nevertheless, studies utilizing complex displays of these simple stimuli have found important contextual effects of flanking objects on targets, where targets and flankers are more likely to be pooled together when they conform with Gestalt grouping principles (for reviews, see Herzog & Manassi, 2015; Herzog et al., 2015). However, the familiar structure of our environment has facilitated more efficient extraction of information in the environment, and thus high-level scene representations may provide more functional separation between targets and flankers, especially if the Gestalt of the flanking objects extends away from the target object (Van der Burg et al., 2017).

Only recently has crowding of real-world stimuli begun to be investigated. One study investigated crowding of realistic objects on a blank gray background (e.g., a rubber ducky crowded on the vertical meridian by a lamp and watering can) and produced results similar to those from experiments using simpler stimuli (Wallace & Tjan, 2011). While that study was an important advance in the direction of investigating crowding of ecologically valid stimuli, it did not situate the objects in realistic scene contexts, as they would occur in the real world. A step in this direction was recently taken by creating an image to demonstrate crowding of one real-world object, a boy, in one real-world scene, a street (Whitney & Levi, 2011). While the image is a compelling demonstration, it also highlights the lack of empirical studies to both quantify and test the generality of the phenomenal experience engendered by the demonstration.² Another study by Wallis and Bex (2012) investigated crowding in natural scenes using so-called dead leaves, which were synthetic texture patches embedded in various locations and sizes within the image. The results from this study show that, while participants’ ability to detect the leaves varied as a function of patch size and eccentricity, detection was impaired when the dead leaves appeared in regions of the scene where structural complexity was high. While this does not demonstrate crowding of real-world objects in natural scenes, it does demonstrate how real-world environments can contribute to crowding.

A more recent study investigated the detection of vulnerable road users (VRUs) such as pedestrians, motorcyclists, and bicyclists in traffic scenes (Sanocki et al., 2015). The authors found that VRU detection was significantly impaired when vehicles flanked the VRUs, with the presence or absence of flanking vehicles defining the crowding manipulation (Sanocki et al., 2015). Interestingly, the crowding manipulation showed participants’ bias becoming more conservative in crowded scenes, relative to uncrowded scenes. The fact that flanking objects decreased overall object detection could be argued to be inconsistent with previous results showing that crowding does not affect detection of objects, only their identification. Specifically, one could argue that the results of Sanocki et al. either suggest that crowding operates differently in real-world scenes than with simpler stimuli, or that the phenomenon was something other than crowding. The latter possibility cannot be dismissed easily because Sanocki et al. did not explicitly account for the retinal eccentricity of targets and flanking objects, nor did they calculate the target–flanker spacing to interpret their resulting data. However, an alternative argument is that the “VRU detection task” did, in fact, involve identification (or superordinate level categorization), rather than detection, and that their results are therefore consistent with prior crowding results. Specifically, the target VRUs came from a variety of different basic level categories (i.e., pedestrians, bicyclists, and motorcyclists) and had to be identified as members of the superordinate VRU category. Finally, Sanocki et al. leave open the question of whether their effects on detection of VRUs actually involved crowding, per se, rather than some related phenomenon (e.g., masking, attentional capture).

Given that crowding has been argued to be ubiquitous in spatial vision (Levi, 2008), an alternative question could be, is there any good reason not to expect crowding of objects in real-world scenes? In fact, there are. Specifically, if crowding of objects occurred ubiquitously in natural scenes, one might assume that peripheral object recognition in scenes should be difficult, if not impossible. Yet, there is evidence that people are able to recognize objects (animals) in natural scenes presented in their visual periphery, even as far as 70° eccentricity (Thorpe et al., 2001).

One study which has rigorously investigated whether crowding can occur in natural scenes did so by comparing peripheral object recognition for objects completely isolated from their scenes, versus the same objects viewed through apertures (“windows”) that showed increasingly more surrounding context (Wijntjes & Rosenholtz, 2018). If crowding were occurring, the isolated objects would be easiest to recognize. Furthermore, as more surrounding context was revealed, performance would become increasingly degraded. However, the results showed exactly the opposite pattern, with the poorest recognition performance occurring when objects were isolated, and increasingly better recognition performance occurring as more surrounding context was revealed (Wijntjes & Rosenholtz, 2018). The authors argued that any detrimental effects of crowding produced by flanking objects in scenes was more than compensated for by the benefits of context (e.g., by priming scene-consistent object representations). This is consistent with the idea that the gist of a scene facilitates recognition of objects in it (Biederman et al., 1982; Boyce et al., 1989; Davenport & Potter, 2004). Thus, there is good reason to expect little, if any, crowding effect for objects viewed peripherally in scenes. However, there is an equally compelling alternative explanation of their results. It is important to note that the authors produced no positive evidence of crowding in any condition—they only showed negative evidence. Thus, Occam’s razor suggests that the simplest explanation of their results is an issue of construct validity, and that there was no crowding to begin with. If so, how can they explain the lack of crowding in their natural scenes, when we expect it to be there? Here, it is important to note that, as with the study by Sanocki et al. (2015), Wijntjes and Rosenholtz (2018) did not manipulate target–flanker spacing when attempting to evoke the conditions for crowding. Instead, in their first experiment, they varied the size of an aperture which allowed for the possibility of nearby objects to flank the target as they were revealed in the scene image (Wijntjes & Rosenholtz, 2018). Thus, it is plausible that the surrounding scene contexts were simply not cluttered enough to produce clear evidence of crowding. If so, sufficient clutter might have produced clear evidence of crowding in natural scenes. In Experiment 2 of Wijntjes and Rosenholtz (2018), participants categorized objects which were either isolated or occluded from the scene. Surprisingly, both conditions yielded approximately 40% accuracy (with chance being 1.2% correct), suggesting that the scene background itself produced object predictions that were roughly equal to the ability to recognize objects in peripheral vision in the absence of scene context. This suggests that, during the object recognition process, the presence a scene context may reduce the potential number of candidate objects that could be in peripheral vision (Wijntjes & Rosenholtz, 2018). Nevertheless, the question remains whether crowding can occur in natural scenes at all. Without varying the target/flanker relationship while holding scene/object context constant, whether and when we should expect to find crowding of objects in natural scenes remains to be determined.

Thus, this study addresses a key question for crowding research, namely, whether there is compelling evidence of crowding of real-world objects in real-world scenes as has been shown for simpler stimuli (e.g., letters, numbers, or Gabor patches) on blank screens, or is the crowding effect overcome by the benefit of having objects in the context of semantically consistent scenes? As noted earlier, there is scant empirical evidence of crowding of realistic objects in realistic real-world scenes along with evidence against the existence of crowding in scenes, which may be due to facilitation of object recognition by scene context. Therefore, the chief goal of this study was to rigorously experimentally test for a crowding effect on recognition of real-world objects in real-world scenes. If facilitation of object recognition by consistent scene context is sufficient to overcome any crowding effects, then we should find little if any evidence of crowding in scenes containing objects consistent with their scene backgrounds. We therefore produced scenes containing only scene-consistent objects. However, to rigorously test for the existence of crowding in scenes, we have manipulated both the retinal eccentricities of target objects and the spacing between the centers of the target and flanking distractor objects. We have then compared spacing conditions that would be predicted to produce more crowding versus less. We did this by using architectural rendering software to produce highly realistic scenes in which we could precisely control the locations and spacing of target and distractor objects. Furthermore, because the most compelling demonstration to date of crowding in a realistic scene included only a single target object and scene, in order to produce more generalizable results, we produced several scenes with multiple objects in multiple locations.

Experiment 1: Tachistoscopic Presentation

Method

Participants

Forty-four undergraduates from Kansas State University (21 females) gave informed consent to participate in the experiment for course credit. Participants had a mean age of 20.0 years and all had visual acuity of 20/30 or better.

Stimuli

Computer generated living room scenes were created using Autodesk 3ds Max® design software.³ Eight different base images were created, with six versions of each base image: three crowded and three uncrowded (see Figure 1). The objects and backgrounds used were available within the software library. All scene images were 1,024 × 768 pixels and 37 × 27.5 cm. At the viewing distance of 53.3 cm, images subtended 38° × 29° of visual angle. Target objects were presented approximately 11.3° to 13.8° (M = 12.0°, standard deviation [SD] = 1.0°) from the center of fixation at eight potential locations. Four flanker objects were placed around the target object, with the distance between the target and flankers varying in a way that would make the crowding effect more or less likely to occur. The spacing between flankers and target objects ranged between 2.5° and 9.8° (M = 5.75°, SD = 2.39°), and the Bouma factor,⁴ the ratio between the mean visual angle between the target and flankers and target retinal eccentricity, ranged between 0.18 and 0.87 (M = .45, SD = 0.20; see Figure 2). Note that crowding typically diminishes when the Bouma factor is above a critical value of 0.4 (Pelli et al., 2004; Pelli, Tillman, et al., 2007; Strasburger, 2020; Strasburger et al., 1991). The stimuli were constructed so that both the overall scene configuration and the arrangement of target and distractor objects appeared natural and realistic for both the crowded and uncrowded conditions. Objects were placed in scene-consistent locations and all items that appeared within the scene could reasonably be expected to appear in a living room.

Figure 1. — Sample scenes showing examples of crowded (left) and uncrowded (right) targets within a single base-image. The yellow circle in the center of each scene represents the central fixation point. The object located within the red ring is the target object to be identified. Note that the red circles shown in this figure only indicate the target for demonstrative purposes and were not included in the actual experiment. Within each base-scene, six versions were generated: three versions with different target objects and locations, and a crowded and uncrowded version for each target object. Participants saw each unique target once.

Figure 2. — The distribution of flanker/target spacing ratio (the Bouma factor) values or the ratio of the flanker-to-target spacing and the retinal eccentricity of the target. Each point represents a single image version of a given base image, and its color represents its base image. X-values on the plot represent the Bouma factor, and Y-values represent their frequency, which has been smoothed along the x-axis using the kernel density function of the *lattice* library in R.

Procedure

Participants’ visual acuity was tested using a Snellen eye chart, and all had 20/30 vision or better. Participants were then given instructions for the experimental procedures. Participants viewed images using a chin rest to maintain a constant viewing distance throughout the experiment and between participants. To familiarize participants with the names associated with each of the objects, participants went through a learning phase, in which each object was presented at the center of the screen on a neutral gray background for 750 milliseconds followed by its label, which remained on the screen until the participant clicked a mouse to proceed to the next object. A total of 50 objects were presented in the learning phase. Following the learning phase, participants completed six practice trials. Half of the trials included crowded objects and half uncrowded, and participants were given feedback after each trial, to allow them to become familiar with the task and its difficulty level.

Trial Procedures

At the beginning of the trial, the participant was presented with a fixation dot (see Figure 3A). To begin the trial, participants clicked the mouse button. A small white dot flickered 4 times at fixation (one flicker cycle: 36 milliseconds on, 24 milliseconds off) to capture attention at the center of the screen (Carmi & Itti, 2006; Ludwig et al., 2008; Mital et al., 2010). Next, a white dot, 0.5° in diameter, flickered at the target object’s location twice (one flicker cycle: 36 milliseconds on, 24 milliseconds off) as a sudden-onset cue to capture attention at the target location and help store information from the upcoming target in visual working memory (Belopolsky et al., 2008; Theeuwes et al., 1999). Then, a scene image containing the target object was presented for 80 milliseconds. Because the minimum normal saccadic latency is 150 milliseconds, and average saccadic durations are 50 milliseconds, if a participant quickly made a saccade to the cued target location, their eyes should have arrived there no earlier than 200 milliseconds after the cue onset, by which time the target would be extinguished. Then, the scene image was followed by a blank gray screen for 750 milliseconds. Participants responded by clicking on the target object’s category name from among a 15-alternative forced choice matrix containing the names of the target, its flankers, and all other objects used in that scene, but only on the other two trials for that scene. The arrangement of the object names on the response screen was randomized on each trial.

Design

Six different scenes were included in the experiment, with each scene containing three potential target objects. Each scene was shown only 3 times, once for each of three targets in that scene. However, to avoid having the participants learn the targets and distractors, only the specific target and its four specific distractors were shown on any given trial, and on only that single trial. The two other targets for that scene, and each of those targets’ specific distractors, were shown on the two other trials for that scene (see Figure 1). A total of 18 experimental trials were completed, with half of the trials displaying crowded targets and the other half uncrowded targets. No participant viewed both the crowded and uncrowded version of the same target. Scenes were presented in a random order, and the image-condition blocks (i.e., crowded vs. uncrowded) were counterbalanced across participants.

Results

Precursors

With 44 participants each completing 18 trials, there were a total of 792 observations.⁵ Due to an error in stimulus creation and peripheral cue location in three base image pairs, 3 trials out of 18 were filtered out for each participant (i.e., 132 total observations were filtered). After the above filtering, analyses were carried out on a total of 660 observations.

Mean Object Identification Accuracy

In the study of crowding (and psychophysics in general) a two-stage analytical approach often used is to (a) identify each subject’s parameter estimates of their best-fitting psychometric function and (b) then subject these individual parameter estimates to a standard analysis of variance to determine whether they vary across predictors of interest. This analysis is typically performed on data collected over many trials using homogeneous stimuli, for instance, Gabor patch identification against a neutral-gray background. In contrast, this study estimated object identification performance in natural scenes, thus contributing an additional source of within-subject variance in the parameter estimates. Furthermore, not all participants saw the same version of each scene/object pairing, making it impossible to disentangle the influence of each image on each participant’s performance using the standard two-stage approach.

To address this issue, generalized linear mixed-effect modeling was used to calculate the (logistic) psychometric function (i.e., object recognition as a function of Bouma factor) while accounting for the independent sources of variability from subjects and stimuli. Mixed-effect models include fixed-effect predictors (e.g., the Bouma factor) while also modeling the random variation in each subject’s average accuracy, each subject’s sensitivity to within-subject variables (e.g., Bouma factor), and differences in the average difficulty of each scene. This random variation is not treated as the result of fixed effects but rather as the result of random sampling from a population of subjects and stimuli (Baayen et al., 2008). Because the analysis was performed at the individual trial level, the outcome variable (correct vs. incorrect response) was binomial, necessitating the use of a generalized linear model with a binomial distribution for the outcome variable and a logit link function (Jaeger, 2008) with the lme4 package (Bates et al., 2014) in the R statistical software (version 3.1.3).

Our criterion variable was object recognition accuracy (coded as 0 and 1 for incorrect and correct responses, respectively), therefore we specified a binomial distribution with a logit link function and used Bouma factor as our sole fixed effect (i.e., predictor; see Table 1, Column 1). With respect to random-effect structures, two classes of models were constructed to predict object recognition accuracy. The first class of models included by-subject random effects only, accounting only for individual participant differences (see Table 1, Column 2). The second class of models included both by-subject and by-item (base-image) random effects, which accounted for individual differences among participants as well as individual effects of each scene (see Table 1, Table 4). Within each of these two classes of models, three versions were generated. The first version included a random intercept with no fixed effects (i.e., a null model). The second version had a fixed effect of Bouma factor and a by-subject random intercept, which accounted for individual variability among participants’ mean accuracy, but did not account for individual differences when calculating the fixed effect (i.e., the slope) for Bouma factor. The third version had a fixed effect of Bouma factor, a by-subject random intercept and random slope for Bouma factor. This model version calculated the fixed effect (i.e., the slope) of Bouma factor on object recognition accuracy on the basis of individual subjects’ slopes. By-item random effects were included as random intercepts only, as not all participants viewed the same versions (i.e., crowded vs. uncrowded) of each image, and thus, these models accounted for the impact of the image when calculating the fixed effect of Bouma factor. Model fitness was evaluated through likelihood ratio tests between the models with the two highest log-likelihood ratio values (Bates et al., 2015).

The likelihood ratio tests demonstrated that the model which included Bouma factor as a fixed effect, and further included by-subject and by-item random intercepts, provided a significantly better fit than the null fixed-effect model with the same random-effect structure (−327.47 vs. −338.83, respectively; χ² = 22.73, df = 1, p < .001).⁶ Although the model with by-subject random slopes and by-item random intercepts showed a slightly greater likelihood ratio value (−326.89), it was less parsimonious as it also had more free parameters and did not demonstrate significantly better fit (χ² = 1.16, df = 2, p =.56). Therefore, the by-subject and by-item random-intercept model was selected. This model structure is given by the generalized linear equation (Equation 1):

l n \frac{p}{1 - p} = α + β X + Z_{i} + Z_{j}

(1)

In this equation, the log-odds for target recognition accuracy (i.e., ln $\frac{p}{1 - p}$ ) vary as a function of the intercept (α) and the fixed effect of Bouma factor, where X represents the centered.⁷ Bouma factor value for a given image and β represents the slope for Bouma factor. Additional model flexibility is provided by the random intercepts of Subject (Z_i) and Base image (Z_j). This model was selected to evaluate the fixed effect of Bouma factor on object recognition accuracy. Mean accuracy (i.e., the intercept) was low, approximately 17.9%, but well above the chance-level (6.7%) performance (α = −1.52, z = −4.73, p = .001). As shown in Figure 4, the slope for Bouma factor was significantly positive (β = 2.46, z = 4.76, p < .001), meaning that accuracy was lowest when the space between the target and flanker objects was small, but improved with increasing space. Thus, the crowding effect was found for objects in real-world scenes. However, the psychometric function presented in Figure 4 is clearly only a small portion of the overall function, with accuracy continuing to steepen at the upper limit of Bouma factors used in this study.

Figure 4. — Fitted mean accuracy for object identification as a function of the fixed effect of Bouma factor, based on the optimally fit mixed-logit model from Experiment 1. The log-odds for object recognition have been transformed with an inverse-logit function to provide proportion correct for the logistic function. The range of Bouma factor extends from the minimum to the maximum values used in this study (as illustrated in Figure 2). Error bars represent a 95% confidence interval above and below the mean (using model parameter estimates obtained from lme4). The dotted line indicates the chance level of accuracy (6.7%).

In addition, the optimal model’s random effects included only a random intercept, indicating that there was substantial individual variability in terms of mean accuracy (i.e., the intercept) but not as a function of the effect of Bouma factor. This suggests that the effect of Bouma factor on recognition accuracy was quite uniform across participants (see Figure 5A). Furthermore, there was no systematic relationship between the fixed effects of intercept (i.e., individual subjects’ mean accuracy) and slope (i.e., individual subjects’ change in accuracy across Bouma factor), r(43) = −.10, p = .51.

Figure 5. — Fitted log-odds accuracy as a function of Bouma factor for individual participants in Experiment 1. A: Proportion correct is indicated on the right y-axis. Each line represents an individual participant’s fitted accuracy for the by-subject random slopes model (third row, far-right column of Table 1), and the panels are divided among participants with low (left), medium (center), or high (right) levels of mean accuracy for object recognition. Group divisions were based on random-effect residual scores in Figure 5B. The horizontal dotted line indicates chance-level performance. This is not the model used to generate Figure 4, because it has additional flexibility for individual slopes, whereas the optimally fit model for Experiment 1 only used random intercepts. Notice that only the subjects’ intercepts for object recognition accuracy vary between the three accuracy groups, even though slope is also left free to vary. B: Mean accuracy (intercept) residual scores for individual subjects. Individual subjects are ordered by mean accuracy rank, and the vertical lines indicate divisions used to generate the binned accuracy groups in Figure 5A.

Discussion

This experiment provides a rigorous test of object crowding in real-world scenes. This was shown experimentally by comparing conditions predicted to produce crowding, with the Bouma factors ranging from 0.2 to over 0.8. It also used numerous (15) different realistic objects in several (6) different realistic scenes. The objects were also placed in scene-consistent locations within the scenes, which should enhance object identification (Boyce & Pollatsek, 1992; Davenport & Potter, 2004; Wijntjes & Rosenholtz, 2018, but see Hollingworth & Henderson, 1998). Likewise, the flankers at their respective spacings were also put in scene-consistent locations for the same reasons (see Figure 1). However, Experiment 1 had an important limitation. Specifically, all target presentations were presented at roughly equal distances from the center of the image. Thus, if participants noticed this, they could have adopted a strategy of quickly moving their eyes from the center of the screen in hope of fixating the target when it appeared, in which case they would not need to use their peripheral vision to accomplish the task. Wijntjes and Rosenholtz (2018) shared a similar concern in their study. To deal with that concern, in their study, the experimenter watched a live video image of participants in order to visually detect eye movements during each trial. We adopted a different approach to dealing with this concern, by carrying out a second experiment using an eyetracker to more carefully control for eye movements by participants.

Experiment 2: Gaze-Contingent Presentation

To rule out the possibility of participants adopting a strategy of foveating the target object, which would invalidate the results of Experiment 1, we conducted a gaze-contingent version of the experiment using high spatial and temporal resolution eye tracking.