Author manuscript; available in PMC: 2024 Oct 1.
Published in final edited form as: J Exp Psychol Hum Percept Perform. 2024 Feb 1;50(4):358–369. doi: 10.1037/xhp0001189

What drives the automatic retrieval of real-world object size knowledge?

Simen Hagen 1, Yuanfang Zhao 1, Lydia Moonen 1, Neele Ulken 1, Marius V Peelen 1

Abstract

Real-world object size is a behaviorally relevant object property that is automatically retrieved when viewing object images: participants are faster to indicate the bigger of two object images when this object is also bigger in the real world. What drives this size Stroop effect? One possibility is that it reflects the automatic retrieval of real-world size after objects are recognized at the basic level (e.g., recognizing an object as a plane activates large real-world size). An alternative possibility is that the size Stroop effect is driven by automatic associations between low-/mid-level visual features (e.g., rectilinearity) and real-world size, bypassing object recognition. Here, we tested both accounts. In Experiment 1, objects were displayed upright and inverted, slowing down recognition while equating visual features. Inversion strongly reduced the Stroop effect, indicating that object recognition contributed to the Stroop effect. Independently of inversion, however, trial-wise differences in rectilinearity also contributed to the Stroop effect. In Experiment 2, the Stroop effect was compared between manmade objects (for which rectilinearity was associated with size) and animals (no association between rectilinearity and size). The Stroop effect was larger for animals than for manmade objects, indicating that rectilinear feature differences were not necessary for the Stroop effect. Finally, in Experiment 3, unrecognizable “texform” objects that maintained size-related visual feature differences were displayed upright and inverted. Results revealed a small Stroop effect for both upright and inverted conditions. Altogether, these results indicate that the size Stroop effect partly follows object recognition with an additional contribution from visual feature associations.

Public significance statement

When viewing an object, we quickly know its real-world size. Here, we show that this automatic retrieval of real-world object size knowledge partly depends on our ability to recognize the object, with stronger size retrieval effects for recognizable compared to non-recognizable objects. However, we also find evidence for an alternative route to extracting size knowledge that does not depend on recognition, based on visual feature-size associations.

Keywords: object size, familiar-size Stroop effect, conceptual knowledge, object categorization


Real-world object size, or familiar size, is a salient and behaviorally relevant object property. For example, small manmade objects (e.g., ball, paperclip) are often portable and manipulable, while large manmade objects (e.g., sofa, cabinet) are often stable and can serve as reliable landmarks (Bainbridge & Oliva, 2015). The relevance of familiar object size is reflected in visual cortex organization, with a familiar size dissociation in ventral temporal cortex (VTC; Konkle & Oliva, 2012a) that emerges quickly after stimulus onset (Khaligh-Razavi et al., 2018).

Interestingly, recent behavioral studies used a size Stroop paradigm to show that familiar object size may be automatically retrieved when viewing visual object images (Konkle & Oliva, 2012b; Long & Konkle, 2017; Long et al., 2019). This work built on the seminal study of Stroop (1935), which showed that an incongruent color name interfered with naming the word’s ink color (e.g., “red” in green font). These results have been interpreted as evidence that word reading occurs in parallel to color naming; because of our extensive experience with reading, processing in this pathway occurs quickly and without the need to allocate attention, thereby interfering with the ongoing color recognition task when the outputs of the two processes are in conflict (e.g., Cohen et al., 1990). Modified versions of the Stroop task have subsequently been used to assess the presumed automaticity of retrieving various forms of knowledge, including object color (Naor-Raz et al., 2003), numerical size (Henik & Tzelgov, 1982; Kaufmann et al., 2005), and familiar object size (Konkle & Oliva, 2012b). For familiar object size, Konkle and Oliva (2012b) presented large and small manmade objects to participants and asked them to select the object that was larger (or smaller) on the screen, ignoring familiar size (i.e., familiar size knowledge was in competition with the screen size task). Crucially, in congruent trials, the objects’ relative screen size was consistent with their relative familiar size, while in incongruent trials, the objects’ relative screen size was inconsistent with their relative familiar size. The main finding was a size Stroop effect, whereby participants were faster at selecting the large-on-screen (or small-on-screen) object in congruent than incongruent trials. From this, it was concluded that the familiar size of objects is automatically retrieved when viewing object images (Konkle & Oliva, 2012b).

These findings raise the question of how the familiar size of objects is retrieved automatically, such that it interferes with the screen size task. One possibility is that familiar size knowledge is automatically retrieved after recognizing the object at the “basic level” – for example, recognizing the object as a house or an apple triggers knowledge about the object’s size (e.g., Rosch et al., 1976; Konkle & Oliva, 2012b). A similar parallel recognition process may account for the word color Stroop effect. For example, simply turning the color words upside down – thereby slowing down automatic word recognition – reduces the classical word color Stroop effect (Liu, 1973). Recognition is also likely to explain the numerical size Stroop effect – the finding that participants are slower to indicate which of two digits is larger on the screen when the digits’ relative physical size is incongruent with the digits’ relative numerical size (e.g., 3 vs 5; Henik & Tzelgov, 1982). For this effect to emerge, participants must have recognized the digits’ numerical size in parallel to performing the screen size task. Physical and semantic information may thus be processed in parallel in multiple domains: color words, numbers, and objects. Conflicting output from these parallel pathways (e.g., about relative size) then leads to slower responses on the primary task.

Interestingly, a recent study proposed an alternative, or additional, route to real-world object size knowledge that “bypasses” high-level visual processing and basic-level recognition (Long & Konkle, 2017). Specifically, low-level “non-meaningful” perceptual features may directly activate familiar size information due to their natural covariance with real-world size. Large and small manmade objects differ systematically in terms of simple shape features; for example, large manmade objects typically have more rectilinear features than small objects (Long et al., 2016; Nasr et al., 2014). Long and Konkle (2017) rendered large and small manmade objects unrecognizable while maintaining some texture and form features (including differences in rectilinearity between large and small objects) to create “texforms” of large and small objects. Using the size Stroop paradigm, they found that participants were faster at selecting the texform that was larger (or smaller) on the screen when it was also the larger (or smaller) object in the real world, despite the fact that the texforms could not be recognized (e.g., people could not name the texforms in terms of meaningful categories). Thus, the object size Stroop effect could primarily be a feature-driven effect, with the automatic processing of real-world size relying on the low- or mid-level visual feature differences between large and small objects.

In the current study, we tested whether and to what extent visual feature differences and object recognition contribute to the size Stroop effect. In Experiment 1, we used object inversion to impair higher-level recognition processes by presenting the objects in a less familiar orientation while preserving low- and mid-level features (i.e., a rectilinear edge is equally rectilinear upright and inverted). In Experiment 2, we tested the role of rectilinearity in the size Stroop effect by comparing the effect for manmade objects and animals: large and small objects, but not large and small animals, differ in rectilinearity. Finally, in Experiment 3, we tested if recognition could play a role in the size Stroop effect for putatively unrecognizable texform objects, by presenting the texforms upright and inverted.

Experiment 1

Experiment 1 examined the role of object recognition in the size Stroop effect. We used image inversion to impair recognition while preserving visual feature differences between objects (e.g., Yin, 1969). Previous research has shown that inversion slows down basic-level object recognition (Mack et al., 2008). Therefore, when objects are inverted, participants are more likely to complete the orthogonal screen-size task before recognizing the objects, and thus also before recognizing the objects’ real-world sizes. Crucially, the low- and mid-level features are not changed by image inversion (i.e., an inverted house is equally rectilinear as an upright house, and so this manipulation would not impair low- and mid-level visual processes). Inversion has previously been used to reduce the classic word color Stroop effect, reducing semantic interference by impairing word reading (Liu, 1973). If the size Stroop effect depends on object recognition, then it should be reduced in the inverted condition. In contrast, if the size Stroop effect is fully driven by visual feature differences between large and small objects, then the Stroop effect should be equal for upright and inverted trials.

Methods

Transparency and openness

For this and subsequent experiments, we report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study, and we follow JARS (Kazak, 2018). Data and research materials are available at the Open Science Framework (https://osf.io/ab3dy/). The experiment was programmed in PsychoPy (Peirce et al., 2019) and hosted online at Pavlovia (https://pavlovia.org/). Data were analyzed using Python (Python Software Foundation, Python Language Reference, version 3.9.13; available at http://www.python.org). R Statistical Software (v4.2.3; R Core Team, 2021) was used for power analyses, using the pwr R package (v1.3.0; Champely et al., 2017), and for Bayes Factor analyses, using the BayesFactor R package (v0.9.12-4.4; Morey & Rouder, 2022). The study design and its analysis were not pre-registered.

Participants

Thirty-four participants (M = 29.88 years; SD = 6.08 years; 18 female, 13 male, 1 other, 2 undeclared) took part in the study. This sample size was selected based on an a priori power analysis ensuring 80% power to detect an effect size of d = .5 (corresponding to np2 = .21) at p = .05 for the critical main effect of congruency. Previous studies found larger effect sizes for the main effect of congruency (e.g., Konkle & Oliva, 2012b; Long & Konkle, 2017). Participants were recruited online through Prolific (www.prolific.com) in return for monetary compensation. One participant was replaced due to low overall accuracy (72.27%; > 3 SDs below the group mean). Informed consent was obtained electronically prior to the experiment. The study was conducted in compliance with the ethical principles and guidelines of the Research Ethics Committee of the Faculty of Science of Radboud University and APA ethical standards. All data were collected in 2022.
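
For transparency, this sample-size computation can be sketched in Python as a stand-in for the pwr R package the authors used, assuming a paired/one-sample t-test power model for the within-subject congruency contrast:

```python
# Sketch: a priori sample size for a within-subject (paired) contrast,
# d = .5, alpha = .05, power = .80. The authors used the pwr R package;
# statsmodels provides an equivalent computation.
import math
from statsmodels.stats.power import TTestPower

n = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(math.ceil(n))  # -> 34 participants
```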

Constraints of generality

Participants were recruited from the US, UK, and the Netherlands, with the requirement that they were fluent in English and between 18 and 37 years old. While this is a more heterogeneous sample than in typical lab-based psychophysics experiments, the results may still be specific to this population and age group.

Stimuli

The stimuli comprised 60 unique images of 30 small (e.g., hole puncher) and 30 large (e.g., bed) manmade objects (available at https://osf.io/ab3dy/). The stimuli were found online and in publicly available stimulus sets (Konkle & Oliva, 2012a), downloaded from https://konklab.fas.harvard.edu/ImageSets/BigSmallObjects.zip. They were cropped from their background and pasted on a white uniform background of 500 x 500 pixels and displayed at 12 and 6 degrees of visual angle, as measured along the longest dimension of the object.

We used a computational model to quantify the rectilinearity in each manmade object image (for model description, see Li & Bonner, 2020). This analysis showed that large objects were more rectilinear than small objects, t(58) = 2.86, p = .006, np2 = .12, BF = 7.22. Moreover, we quantified the area-to-perimeter ratio for large and small manmade objects (Gayet et al., 2019). Area-to-perimeter ratio is a dimension relevant to object perception and visual cortex organization (Bao et al., 2020; Gayet et al., 2019; Coggan & Tong, 2023; but see Yargholi & Op de Beeck, 2023). Large and small objects did not differ on area-to-perimeter ratio, t(58) = 1.33, p = .190, np2 = .03, BF = .55.
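
As an illustration of the simpler of these two measures, the sketch below computes an area-to-perimeter ratio for an object on a white background. The thresholding step and function name are assumptions for illustration; the exact procedures of Li and Bonner (2020) and Gayet et al. (2019) are not reproduced here.

```python
# Sketch: area/perimeter of the largest non-white region in an RGB(A) image.
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.measure import label, regionprops

def area_to_perimeter(path, white_thresh=0.95):
    img = imread(path)
    gray = rgb2gray(img[..., :3])      # drop alpha channel if present
    mask = gray < white_thresh         # object = non-white pixels (assumption)
    regions = regionprops(label(mask))
    obj = max(regions, key=lambda r: r.area)  # largest connected region
    return obj.area / obj.perimeter
```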

Design

Participants completed a size Stroop task (Konkle & Oliva, 2012b), where each trial presented one large object and one small object (Figure 1). Thirty unique trials were created by pairing a large and small object based on similarity in height/width ratio, with each object image consistently paired with one other object image. In a congruent trial, the large object was presented at a larger size (12 deg. visual angle) than the small object (6 deg. visual angle). In an incongruent trial, the large object was presented at a smaller size (6 deg. visual angle) than the small object (12 deg. visual angle). All trial pairs were presented in a congruent trial and an incongruent trial. Upon image presentation, participants responded by pressing the ‘f’ or ‘j’ key on the keyboard to indicate the left or right image on the screen, respectively. Images were displayed until a response was given. Participants were instructed to respond as quickly and accurately as possible. All participants completed a block in which they selected the large image on the screen and another block in which they selected the small image on the screen; the block order was counterbalanced across participants. The position of the large and small image was randomly determined in each trial. In total, there were 480 trials (30 object pairs x 2 congruency x 2 orientation x 2 task instruction blocks x 2 repetitions per image in each instruction block). Self-paced breaks were interspersed at regular intervals. Prior to each instruction block (e.g., attend smaller), participants completed a practice session of eight trials in which feedback (“correct”, “incorrect”) was provided. No feedback was given during the main part of the experiment.
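
For concreteness, the trial structure described above can be sketched as follows; the variable names are illustrative and not the authors’ actual PsychoPy code.

```python
# Sketch: enumerate the 480 trials implied by the design
# (30 pairs x congruency x orientation x 2 repetitions, in each of
# two instruction blocks whose order is counterbalanced).
import itertools
import random

pairs = range(30)                                   # object-pair indices
conditions = itertools.product(pairs,
                               ["congruent", "incongruent"],
                               ["upright", "inverted"],
                               [1, 2])               # two repetitions
trials_per_block = [dict(pair=p, congruency=c, orientation=o, rep=r)
                    for p, c, o, r in conditions]

experiment = []
for task in ["attend_larger", "attend_smaller"]:     # order counterbalanced
    block = [dict(trial, task=task) for trial in trials_per_block]
    random.shuffle(block)
    experiment.extend(block)

assert len(experiment) == 480                        # 30 x 2 x 2 x 2 x 2
```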

Figure 1. Paradigm: for both upright and inverted displays, participants selected the bigger (or smaller) image on the screen while disregarding their familiar size.


Representative image examples of the objects used in the study (actual images not shown for copyright reasons).

Analysis

Trials with RTs less than 200 ms or greater than 1500 ms were excluded from the analysis (.79% of trials), following previous reports (Konkle & Oliva, 2012b; Long & Konkle, 2017). Given the main question of examining whether the congruency effect was influenced by inversion, the data for each instruction block (attend larger; attend smaller) were averaged prior to statistical analysis. The data were first analyzed using a 2x2 repeated-measures analysis of variance (ANOVA) with congruency (congruent, incongruent) and orientation (upright, inverted) as within-subjects factors.
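
A minimal sketch of this preprocessing and the 2x2 repeated-measures ANOVA, assuming a hypothetical long-format trial table (file and column names are assumptions, not the authors’ code):

```python
# Sketch: RT trimming, block averaging, and 2x2 repeated-measures ANOVA.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

df = pd.read_csv("stroop_data.csv")              # hypothetical trial-level file
df = df[(df.rt >= 0.200) & (df.rt <= 1.500)]     # RT cutoffs (in seconds)
df = df[df.correct == 1]                         # RT analysis on correct trials

# Average over the two instruction blocks first, then over trials.
block_means = (df.groupby(["subject", "block", "congruency", "orientation"])
                 .rt.mean().reset_index())
cell_means = (block_means.groupby(["subject", "congruency", "orientation"])
                 .rt.mean().reset_index())

res = AnovaRM(cell_means, depvar="rt", subject="subject",
              within=["congruency", "orientation"]).fit()
print(res)
```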

Next, to further characterize the size Stroop effect (incongruent – congruent), an exploratory analysis examined the effect as a function of how quickly participants responded. This analysis tests whether the conflicting familiar size knowledge influenced both fast and slow responses during the orthogonal screen-size task; it was inspired by previous work testing conflict effects on both fast and slow RTs (e.g., Ridderinkhof, 2002) and by work examining how visual object knowledge (e.g., object color and shape) contributes to expert object recognition (Hagen et al., 2014, 2016, 2023). The faster the conflicting familiar size knowledge is processed in parallel, the more likely it is to influence fast reaction times. To examine the distribution of correct RTs, the trials of each participant were ranked from fastest to slowest, separately for each condition (congruency [congruent, incongruent] x orientation [upright, inverted] x task [attend big, attend small]), and then grouped into four bins: the fastest 25% of responses (quartile bin 1), the next fastest 25% (bin 2), the next 25% (bin 3), and the slowest 25% (bin 4). Within each bin, the average correct RT for each condition was calculated (with each participant contributing one data point). Finally, a congruency effect (incongruent – congruent) was calculated for each bin. The congruency effect data were analyzed using a 2x4 repeated-measures ANOVA with orientation (upright, inverted) and bin (1, 2, 3, 4) as within-subjects factors.
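
The binning step might look as follows, under the same assumed column names:

```python
# Sketch: within-condition RT quartiles and the congruency effect per bin.
import numpy as np
import pandas as pd

def add_quartile_bins(df):
    """Assign each correct trial to a within-condition RT quartile (1 = fastest)."""
    pct = (df.groupby(["subject", "congruency", "orientation", "task"])
             .rt.rank(pct=True))
    return df.assign(bin=np.ceil(pct * 4).clip(upper=4).astype(int))

def stroop_per_bin(df):
    """Congruency effect (incongruent - congruent) per subject, orientation, bin."""
    means = (df.groupby(["subject", "orientation", "bin", "congruency"])
               .rt.mean().unstack("congruency"))
    return (means["incongruent"] - means["congruent"]).rename("stroop").reset_index()
```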

In two additional exploratory analyses, we examined the relationship between the size Stroop effect and the rectilinearity difference of the two simultaneously presented objects. First, we ran a 2x2 repeated-measures ANOVA on the size Stroop effect for the RT data (incongruent – congruent) with inversion (upright, inverted) and rectilinearity difference (high, low) as within-subjects factors. The rectilinearity factor was created by ranking the trials based on the rectilinearity difference between the large and small objects on that trial, with image-level rectilinearity computed using a computational model (see Methods), followed by a median split. In this ranking, trials where large objects were more rectilinear than small objects fall on one end of the spectrum, while trials where small objects were more rectilinear than large objects fall on the other end.
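
A sketch of the median split, with random placeholder values standing in for the model-derived rectilinearity scores:

```python
# Sketch: median split of trials by rectilinearity difference (large - small).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rect_large = rng.normal(0.5, 0.1, 30)    # placeholder values; the real scores
rect_small = rng.normal(0.4, 0.1, 30)    # come from the Li & Bonner model

pairs = pd.DataFrame({"pair": np.arange(30),
                      "rect_diff": rect_large - rect_small})
median = pairs.rect_diff.median()
pairs["rect_group"] = np.where(pairs.rect_diff > median, "high", "low")
```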

Second, we examined the correlation between the rectilinearity difference and the size Stroop effect. For each participant, we (1) averaged the RTs for each unique trial type (i.e., unique object pairs) separately for congruency (congruent, incongruent) and orientation (upright, inverted), (2) computed a size Stroop effect for each unique trial type (incongruent – congruent), (3) computed a rectilinearity index for each trial type by taking the difference between the rectilinearity measures of the large and small object of that trial type, such that positive values indicate that the large object is more rectilinear than the small object, (4) correlated the size Stroop effect with the rectilinearity index, and (5) Fisher transformed the correlations to create a more normal distribution; finally, (6) across participants, we tested the average correlation against 0 using a one-sample t-test.
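
Steps (1) to (6) map onto a short routine like the following sketch (array names and shapes are assumptions):

```python
# Sketch: per-participant Stroop-rectilinearity correlation, Fisher transform,
# and group-level one-sample t-test.
import numpy as np
from scipy.stats import pearsonr, ttest_1samp

def stroop_rectilinearity_correlation(stroop_by_pair, rect_index):
    """stroop_by_pair: (n_subjects, n_pairs) per-pair Stroop effects;
    rect_index: (n_pairs,) rectilinearity of large minus small object."""
    rs = np.array([pearsonr(subj, rect_index)[0] for subj in stroop_by_pair])
    z = np.arctanh(rs)              # Fisher transform toward normality
    t, p = ttest_1samp(z, 0.0)      # test mean correlation against zero
    return rs.mean(), t, p
```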

We report Bayes factors for all analyses. For isolated main and 2x2 interaction effects, we report Bayes factors using non-directional one-sample t-tests, with the standard settings of the BayesFactor R package. For 2x4 interactions in the response time distribution analysis, we report the Bayes factor for comparison of the full model (interaction and the two main effects) with the model containing only the two main effects. For all analyses, we report the Bayes factor in favor of the alternative hypothesis; for example, for a BF = 3, the alternative hypothesis predicts the data 3 times better than the null hypothesis.
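
As a Python stand-in for the BayesFactor R package, pingouin computes a JZS Bayes factor from a t statistic using the same default Cauchy prior scale (r = 0.707); exact values may differ slightly from the R implementation.

```python
# Sketch: JZS Bayes factor (BF10) for a one-sample t-test.
from pingouin import bayesfactor_ttest

# Example: the inverted-condition Stroop effect, t(33) = 2.11, n = 34.
bf10 = bayesfactor_ttest(t=2.11, nx=34)
print(f"BF10 = {bf10:.2f}")   # values > 1 favor the alternative hypothesis
```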

Results

Accuracy

The overall accuracy was close to ceiling (M = 97.97%; SD = 2.52%). There was a size Stroop effect, as indicated by a main effect of congruency, with significantly higher accuracy in the congruent (M = 98.2%; SD = 2.6%) than incongruent (M = 97.7%; SD = 2.8%) condition, F(1, 33) = 6.71, p = .014, np2 = .17, BF = 3.21. This size Stroop effect did not differ for upright and inverted objects, as indicated by a non-significant interaction between congruency and orientation, F(1, 33) = 2.67, p = .112, np2 = .08, BF = .61. Accuracy for the upright and inverted objects did not differ, as indicated by a non-significant main effect of orientation, F(1, 33) = 3.6, p = .066, np2 = .1, BF = .91.

Response time

Figure 2A presents the mean RTs for correct trials as a function of congruency and orientation (see Figure 2B for size Stroop effect: incongruent - congruent). There was a size Stroop effect, as indicated by a main effect of congruency, with significantly faster RTs for the congruent (M = 514 ms; SD = 97 ms) than incongruent condition (M = 531 ms; SD = 102 ms), F(1, 33) = 17.39, p < .001, np2 = .35, BF = 130.83. Crucially, the size Stroop effect was larger for upright (M = 25 ms; SD = 29 ms) than inverted (M = 8 ms; SD = 23 ms) images, as indicated by a significant interaction between congruency and orientation, F(1, 33) = 16.92, p < .001, np2 = .34, BF = 112.96. For congruent trials, participants were faster in the upright (M = 508 ms; SD = 93 ms) than the inverted condition (M = 520 ms; SD = 102 ms; t(33) = 3.66, p < .001, np2 = .29, BF = 35.70), while for incongruent trials, participants did not differ in the upright (M = 534 ms; SD = 104 ms) and inverted conditions (M = 528 ms; SD = 100 ms; t(33) = 1.45, p = .158, np2 = .06, BF = .47). However, inversion did not entirely abolish the size Stroop effect (M = 8 ms; SD = 23 ms; t(33) = 2.11, p = .043, np2 = .12, BF = 1.2). Finally, the RTs for upright and inverted objects did not differ, as indicated by a non-significant main effect of orientation, F(1, 33) = 1.24, p = .274, np2 = .04, BF = .32.

Figure 2. Results of Experiment 1.


(A) Reaction times for correct responses as a function of congruency (congruent, incongruent) and orientation (upright, inverted). (B) Stroop effect in response times for correct responses, computed as incongruent trials minus congruent trials. The small colored markers represent individual participants, while the black markers represent the group averages. (C) Stroop effect in response time for correct responses as a function of rectilinearity split (high, low) and orientation (upright, inverted). All error bars represent 95% CIs. *, **, *** represent p < .05, .01, .001, respectively.

Influence of rectilinearity

The large and small objects included in the study differed in terms of rectilinearity, with large objects being on average more rectilinear than small objects (see Methods). In an additional exploratory analysis, we tested to what extent rectilinearity accounted for the size Stroop effect and whether this influence differed for upright and inverted objects. To this end, we ran a 2x2 repeated-measures ANOVA on the size Stroop effect for the RT data (incongruent – congruent) with inversion (upright, inverted) and rectilinearity (high, low) as within-subjects factors. We only report the statistics related to the rectilinearity factor. For the high-rectilinearity trials, the large objects were on average more rectilinear than the small objects (t(28) = 4.70, p < .001, np2 = .44, BF = 307.58), while there was no difference for the low-rectilinearity trials (t(28) = .93, p = .362, np2 = .03, BF = .48). The ANOVA showed that rectilinearity contributed to the size Stroop effect, as demonstrated by a main effect of rectilinearity, F(1, 33) = 19.53, p < .001, np2 = .37, BF = 18591.44. This effect was independent of inversion, as demonstrated by the lack of a 2-way interaction between inversion and rectilinearity, F(1, 33) = 2.53, p = .121, np2 = .07, BF = .54. As can be observed in Figure 2C, there was no size Stroop effect for inverted trials when rectilinearity was equated.

Correlation between Stroop effect and rectilinearity

In addition to the ANOVA involving rectilinearity, we analyzed whether the size Stroop effect correlated with rectilinearity differences between the large and small objects across trial types (i.e., object pairs). In line with the ANOVA results, for both upright and inverted objects, there was a small but reliable positive correlation (upright: r = .12, t(33) = 4.34, p < .001, np2 = .36, BF = 201.14; inverted: r = .19, t(33) = 5.23, p < .001, np2 = .45, BF = 2193.50), whereby increasing Stroop effects were associated with increasing rectilinearity differences in the direction that large objects are more rectilinear than small objects.

Response time distribution analysis

In a final exploratory analysis, we examined whether the size Stroop effect depended on participants’ reaction times, with faster processing of the competing real-world size manifesting as Stroop effects at faster reaction times. Figure 3A shows the mean correct RTs as a function of congruency, orientation, and response time bin (see Figure 3B for the size Stroop effect: incongruent – congruent). The significant interaction between orientation and bin, F(3, 99) = 12.32, p < .001, np2 = .27, BF = 18.92, indicated that orientation influenced the size Stroop effect differently across the RT distribution. Thus, we compared the size Stroop effect for upright and inverted conditions in each bin separately. The size Stroop effect for upright objects was larger than that for inverted objects in all bins (t(33) > 2.59, p < .015, np2 > .17, BF > 3.18 for all tests) except the first bin (t(33) = -1.53, p = .136, np2 = .07, BF = .53). Finally, there was a congruency effect for upright objects in all bins (t(33) > 2.6, p < .014, np2 > .17, BF > 3.27 for all tests; Figure 3B). By contrast, only bins 2 and 3 showed a congruency effect for inverted displays (significant tests: t(33) > 2.79, p < .009, np2 > .19, BF > 4.90; non-significant tests for bins 1 and 4: t(33) < -1.02, p > .318, np2 < .03, BF < .30; Figure 3B).

Figure 3.


(A) Response time distribution for correct responses as a function of congruency, orientation, and response time bin. The y-axis scale is shared across panels. (B) Response time distribution for the size Stroop effect for correct responses as a function of orientation and response time bin. Bin 1 contains the 25% fastest responses of each participant. Bin 2 contains the next 25% fastest responses, and so on. All error bars represent 95% CIs. *, **, *** represent p < .05, .01, .001, respectively.

Summary

Overall, results of Experiment 1 show that the size Stroop effect was present for upright objects, replicating previous reports (Konkle & Oliva, 2012b). Importantly, the size Stroop effect was strongly reduced by object inversion. This reduction was most prominent on slower trials. Finally, differences in rectilinearity between large and small objects contributed to the size Stroop effect but did so equally for upright and inverted conditions, indicating a separate influence.

Experiment 2

Experiment 1 showed that the size Stroop effect was almost abolished when recognition was made more difficult by object inversion and was fully abolished when objects did not differ in terms of rectilinearity (Figure 2C). Experiment 2 was designed to further test the role of differences in visual properties between large (e.g., rectilinear) and small (e.g., curvilinear) objects. In this experiment, we contrasted the size Stroop effect for large and small manmade objects, where there are systematic differences in rectilinearity (Long et al., 2016), to the Stroop effect for large and small animate objects, where there are no systematic differences in rectilinearity (see Methods). If differences in rectilinearity are necessary for the size Stroop effect, we would expect no size Stroop effect for animate objects. In contrast, if rectilinearity differences are not necessary for the size Stroop effect, we would expect a similar size Stroop effect for animate objects. Moreover, since animate objects may be recognized faster and more automatically than manmade objects (New et al., 2007), the size Stroop effect for animates could even be stronger than that for manmade objects. Finally, animate objects offer an interesting test since they do not (or only weakly, relative to manmade objects) show a real-world size organization in the human ventral temporal cortex as measured with functional magnetic resonance imaging (fMRI; Konkle & Caramazza, 2013; Luo et al., 2023).

Methods

Participants

Thirty-four participants (M = 33.0 years; SD = 10.49 years; 19 female, 15 male) took part in the study. As in Experiment 1, this sample size was selected based on an a priori power analysis ensuring 80% power to detect an effect size of d = .5 (corresponding to np2 = .21) at p = .05 for the critical main effect of congruency. Previous studies found larger effect sizes for the main effect of congruency (e.g., Konkle & Oliva, 2012b; Long & Konkle, 2017). Participants were recruited online through Prolific (www.prolific.com) in return for monetary compensation. One participant was replaced due to low overall accuracy (77.93%; > 3 SDs below the group mean). Informed consent was obtained electronically prior to the experiment. The study was conducted in compliance with the ethical principles and guidelines of the Research Ethics Committee of the Faculty of Science of Radboud University and APA ethical standards. All data were collected in 2022.

Stimuli

The stimuli consisted of 120 unique images of 60 animate and 60 manmade objects, each set divided into 30 large and 30 small objects (available at https://osf.io/ab3dy/). The manmade objects were the same as those used in Experiment 1. The animate objects were found online and were cropped and pasted on a white uniform background of 500 x 500 pixels. All stimuli were displayed at 12 and 6 degrees of visual angle, as measured along the longest dimension of the object.

We used a computational model to quantify the rectilinearity in each animate object image (for model description, see Li & Bonner, 2020). Large and small animals did not differ in rectilinearity, t(58) = .61, p = .54, np2 = .006, BF = .31, unlike the manmade objects (see Experiment 1 for rectilinearity comparisons involving the manmade objects). The difference in rectilinearity between large and small manmade objects was larger than this difference in animate objects, t(58) = 2.31, p = .025, np2 = .08, BF = 2.35.

Moreover, we quantified the area-to-perimeter ratio for large and small animals (see Experiment 1 for area-to-perimeter comparisons involving manmade objects). Large and small animals did not differ on area-to-perimeter ratio, t(58) = 1.14, p = .261, np2 = .02, BF = .45.

Design

The design and paradigm were the same as used in Experiment 1, with the exception that the trial composition consisted of: 30 trial pairs x 2 animacy x 2 congruency x 2 instruction block x 2 repetitions (Figure 4). This yielded a total of 480 trials.

Figure 4. Paradigm: for both manmade and animate displays, participants selected the bigger (or smaller) image on the screen while disregarding their familiar size.


Representative image examples of the objects used in the study (actual images not shown for copyright reasons).

Analysis

Trials with RTs less than 200 ms or greater than 1500 ms were excluded from the analysis (1.08% of total trials), following previous reports (Konkle & Oliva, 2012b; Long & Konkle, 2017). Given the main question of examining whether the congruency effect was modulated by animacy, the data for each instruction block (attend bigger; attend smaller) were averaged prior to statistical analysis. The same exploratory analyses as in Experiment 1 were conducted, with the exception that the orientation factor was replaced by the animacy factor (see Experiment 1 for a detailed description of these analyses).

Results

Accuracy

The overall accuracy was close to ceiling (M = 98.1%; SD = 1.78%). There was a size Stroop effect, as indicated by a main effect of congruency, with significantly higher accuracy in the congruent (M = 99.0%; SD = 1.4%) than incongruent (M = 97.2%; SD = 2.7%) condition, F(1, 33) = 31.58, p < .001, np2 = .49, BF = 6361.75. Moreover, the size Stroop effect was larger for animate (M = 2.5%; SD = 2.6%) than manmade objects (M = 1.3%; SD = 1.8%), as indicated by a significant interaction between congruency and animacy, F(1, 33) = 10.51, p = .003, np2 = .24, BF = 13.29. The accuracy for animate and manmade objects did not differ, as indicated by a non-significant main effect of animacy, F(1, 33) = 1.41, p = .244, np2 = .04, BF = .35.

Response time

Figure 5A presents the mean RTs for correct trials as a function of congruency and animacy (see Figure 5B for the size Stroop effect: incongruent – congruent). There was a size Stroop effect, as indicated by a main effect of congruency, with faster RTs in the congruent (M = 505 ms; SD = 105 ms) than incongruent (M = 538 ms; SD = 108 ms) condition, F(1, 33) = 55.63, p < .001, np2 = .63, BF = 928874.80. Interestingly, there was a larger size Stroop effect for animate (M = 39 ms; SD = 33 ms) than manmade (M = 26 ms; SD = 25 ms) objects, as indicated by a significant interaction between congruency and animacy, F(1, 33) = 7.26, p = .011, np2 = .18, BF = 3.98. The RTs for animate and manmade objects did not differ, as indicated by a non-significant main effect of animacy, F(1, 33) = 2.00, p = .167, np2 = .06, BF = .46.

Figure 5. Results of Experiment 2.


(A) Reaction times for correct responses as a function of congruency (congruent, incongruent) and animacy (manmade, animate). (B) Stroop effect in response times for correct responses, computed as incongruent trials minus congruent trials. The small colored markers represent individual participants, while the black markers represent the group averages. (C) Stroop effect in response time for correct responses as a function of rectilinearity split (high, low) and animacy (manmade, animate). All error bars represent 95% CIs. *, **, *** represent p < .05, .01, .001, respectively.

Influence of rectilinearity

Similar to Experiment 1, in an additional exploratory analysis we tested whether rectilinearity accounted for the size Stroop effect. For animals, in the high-rectilinearity trials, the large animals were on average more rectilinear than the small animals (see Experiment 1 for the same analysis involving the manmade objects), t(28) = 2.7, p = .012, np2 = .21, BF = 4.53. In contrast, in the low-rectilinearity trials, the large animals were on average less rectilinear than the small animals, t(28) = -3.15, p = .004, np2 = .26, BF = 10.53. Thus, if rectilinearity evokes an association of “being large”, independent of recognition, then in the low-rectilinearity trials the small objects should be associated with being large, leading to a reversed or attenuated Stroop effect.

We ran a 2x2 repeated-measures ANOVA on the size Stroop effect for the RT data (incongruent – congruent) with animacy (manmade, animate) and rectilinearity (high, low) as within-subjects factors. We only report the statistics related to the rectilinearity factor. Rectilinearity contributed to the size Stroop effect, as demonstrated by a main effect of rectilinearity, F(1, 33) = 4.96, p = .033, np2 = .13, BF = 1.47 (Figure 5C). This effect was independent of animacy, as demonstrated by the lack of a 2-way interaction between animacy and rectilinearity, F(1, 33) = .52, p = .478, np2 = .02, BF = .29.

Correlation between Stroop effect and rectilinearity

Similar to Experiment 1, we analyzed whether the trial-type (i.e., object pairs) Stroop effect correlated with trial-type rectilinearity differences between the large and small animals, or the large and small manmade objects. For manmade objects, there was a small but reliable positive correlation, r = .08, t(33) = 3.52, p = .001, np2 = .27, BF = 25.57, replicating Experiment 1. In contrast, for animate objects, there was a weak negative correlation, r = -.06, t(33) = -2.69, p = .011, np2 = .18, BF = 3.93, whereby larger size Stroop effects were associated with a rectilinearity difference in the direction that small objects were relatively more rectilinear.

Response time distribution

Figure 6A shows the mean correct RTs as a function of congruency, animacy, and response time bin (see Figure 6B for the size Stroop effect: incongruent – congruent). There was an interaction between animacy and response time bin (F(3, 99) = 13.63, p < .001, np2 = .29, BF = 203.50), indicating that animacy influenced the size Stroop effect differently across the RT distribution. Thus, we compared the size Stroop effect for manmade and animate conditions in each bin separately. In the first bin, the size Stroop effect was larger for manmade than animate objects (t(33) = 2.45, p = .020, np2 = .15, BF = 2.46), while in the slowest bins (3 and 4), the reverse was observed, with a larger size Stroop effect for animate than manmade objects (t(33) > 2.43, p < .021, np2 > .15, BF > 2.32 for both tests). Finally, there was a size Stroop effect for manmade displays in all bins (t(33) > 4.74, p < .001, np2 > .41, BF > 583.01 for all tests; Figure 6B). In contrast, the size Stroop effect for animate objects was significant in all bins (t(33) > 4.26, p < .001, np2 > .36, BF > 164.33 for all tests) except the first bin (t(33) = .9, p = .377, np2 = .02, BF = .27; Figure 6B).

Figure 6.


(A) Response time distribution for correct responses as a function of congruency, animacy, and response time bin. The y-axis scale is shared across panels. (B) Response time distribution for the size Stroop effect for correct responses as a function of animacy and response time bin. Bin 1 contains the 25% fastest responses of each participant. Bin 2 contains the next 25% fastest responses, and so on. All error bars represent 95% CIs. *, **, *** represent p < .05, .01, .001, respectively.

Summary

Overall, results of Experiment 2 show that animate objects evoked a larger size Stroop effect than manmade objects. Moreover, the larger size Stroop effect for animate objects was specific to the slower trials. In contrast, manmade objects evoked a larger size Stroop effect in the fastest trials. Similar to Experiment 1, the size Stroop effect for manmade objects was associated with rectilinearity differences between large and small objects, whereas this was not the case for animate objects.

Experiment 3

Experiment 1 provided evidence that the size Stroop effect depends on the ease of recognition (as manipulated by inversion), with a stronger Stroop effect for upright than inverted objects. The results of Experiment 2 provided further support for a role of recognition in driving the size Stroop effect, showing a stronger Stroop effect for animals than manmade objects. Because large and small animals were matched in terms of rectilinearity, Experiment 2 also provided evidence that rectilinearity is not necessary for evoking the size Stroop effect. At the same time, however, results from the correlation analyses in both experiments pointed to a potential contribution of rectilinearity to the size Stroop effect for manmade objects. To further investigate the role of low- and mid-level features in driving the size Stroop effect, in Experiment 3 we aimed to replicate the previously observed size Stroop effect for texform objects (Long & Konkle, 2017) – objects that are made unrecognizable while maintaining texture and form features. In addition, to investigate whether the size Stroop effect could be driven by residual recognition of the texform objects, we examined whether the size Stroop effect for texforms was influenced by image inversion. If the size Stroop effect for texforms is driven by visual features, the effect should not differ between upright and inverted trials. In contrast, if occasional recognition of texforms drives the size Stroop effect, then it should be reduced for inverted trials.

Methods

Participants

Thirty-four participants (M = 27.41 years; SD = 5.41 years; 20 female, 13 male, 1 other) took part in the study. As in Experiments 1 and 2, this sample size was selected based on an a priori power analysis ensuring 80% power to detect an effect size of d = .5 (corresponding to np2 = .21) at p = .05 for the critical main effect of congruency. Previous studies found larger effect sizes for the main effect of congruency (e.g., Konkle & Oliva, 2012b; Long & Konkle, 2017). Participants were recruited online through Prolific (www.prolific.com) in return for monetary compensation. Two participants were replaced due to low overall accuracy (46.96% and 86.57%; > 3 SDs below the group mean). Informed consent was obtained electronically prior to the experiment. The study was conducted in compliance with the ethical principles and guidelines of the Research Ethics Committee of the Faculty of Science of Radboud University and APA ethical standards. All data were collected in 2022.

Stimuli

The stimuli consisted of 60 unique images that were divided into 30 small (e.g., dustpan) and 30 large (e.g., shopping cart) manmade objects, and taken from a previous study (Long et al., 2018). The stimuli were processed with a “texform” algorithm to render them unrecognizable while maintaining low-level shape and texture properties related to early and mid-level visual processing (Freeman & Simoncelli, 2011; Long et al., 2018). Specifically, within local pooling windows of the object image, low-level image measurements were computed (e.g., Gabor responses of different orientations, spatial frequencies, and spatial scales). Next, a white noise image was transformed to match the measured image statistics (for further details, see Long et al., 2018). The stimuli were pasted on a uniform gray background of 500 x 500 pixels and displayed at 12 and 6 degrees of visual angle, as measured along the longest dimension of the object. The stimuli were downloaded from the authors’ website (https://konklab.fas.harvard.edu/ImageSets/AnimSizeTexform.zip).
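
The synthesis procedure itself (Freeman & Simoncelli, 2011) matches a rich set of V1/V2-like texture statistics within overlapping pooling windows and is not reproduced here. The toy sketch below only illustrates the general logic (transforming noise to match pooled local statistics), using window mean and contrast as stand-in statistics.

```python
# Toy illustration only: adjust a noise image so that simple pooled
# statistics (window mean and standard deviation) match those of the
# original. The real texform algorithm matches far richer statistics
# (Freeman & Simoncelli, 2011; Long et al., 2018).
import numpy as np

def match_pooled_stats(target, n_windows=8, seed=0):
    rng = np.random.default_rng(seed)
    synth = rng.standard_normal(target.shape)
    h, w = target.shape
    hs, ws = h // n_windows, w // n_windows
    for i in range(n_windows):
        for j in range(n_windows):
            win = (slice(i * hs, (i + 1) * hs), slice(j * ws, (j + 1) * ws))
            t, s = target[win], synth[win]
            s = (s - s.mean()) / (s.std() + 1e-8)   # normalize noise window
            synth[win] = s * t.std() + t.mean()     # impose target mean/contrast
    return synth
```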

Similar to the manmade and animate objects, we used a computational model to quantify the rectilinearity in large and small texform objects (for model description, see Li & Bonner, 2020). This analysis showed that large texforms were more rectilinear than small texforms, t(58) = 2.88, p = .006, np2 = .13, BF = 7.58.

Design

The design and paradigm were the same as used in Experiment 1 (Figure 7). Specifically, the texforms were presented as a function of congruency (congruent, incongruent) and orientation (upright, inverted), using the same number of trials and breaks.

Figure 7. Paradigm: for both upright and inverted displays, participants selected the bigger (or smaller) image on the screen while disregarding their familiar size.


Analysis

Trials with response times shorter than 200 ms or longer than 1500 ms were excluded from the analysis (0.29% of total trials), following previous reports (Konkle & Oliva, 2012b; Long & Konkle, 2017). Given the main question of examining the congruency effect and how it was influenced by inversion, the data for each instruction block (attend bigger; attend smaller) were averaged prior to statistical analysis. The data were analyzed as described for Experiment 1.

Results

Accuracy

The overall accuracy was close to ceiling (M = 98.48%; SD = 1.43%). There was no overall size Stroop effect, as indicated by a non-significant effect of congruency, F(1, 33) = 1.6, p = .215, np2 = .05, BF = .38. Moreover, the size Stroop effect for upright and inverted trials did not differ, as indicated by a non-significant interaction between congruency and orientation, F(1, 33) = .17, p = .685, np2 = .005, BF = .20. The accuracy for upright and inverted objects also did not differ, as indicated by a non-significant effect of orientation, F(1, 33) = .32, p = .575, np2 = .01, BF = .21.

Response time

Figure 8A presents the mean RTs for correct trials as a function of congruency and orientation (see Figure 8B for the size Stroop effect: incongruent – congruent). There was a small size Stroop effect, with faster RTs for congruent (M = 421 ms; SD = 67 ms) than incongruent trials (M = 425 ms; SD = 65 ms), as indicated by a main effect of congruency, F(1, 33) = 6.42, p = .016, np2 = .16, BF = 2.87. The size Stroop effect for upright and inverted trials did not differ, as indicated by a non-significant interaction between congruency and orientation, F(1, 33) = .11, p = .742, np2 = .003, BF = .19, with small but significant size Stroop effects in both upright trials (M = 3.6 ms; SD = 10 ms; t(33) = 2.08, p = .046, np2 = .12, BF = 1.22) and inverted trials (M = 4.3 ms; SD = 12 ms; t(33) = 2.11, p = .043, np2 = .12, BF = 1.29). The RTs for upright and inverted objects did not differ, as indicated by a non-significant main effect of orientation, F(1, 33) = .7, p = .408, np2 = .02, BF = .25.

Figure 8. Results of Experiment 3.


(A) Reaction times for correct responses as a function of congruency (congruent, incongruent) and orientation (upright, inverted). (B) Stroop effect in response times for correct responses, computed as incongruent trials minus congruent trials. The small colored markers represent individual participants, while the black markers represent group averages. (C) Stroop effect in response time for correct responses as a function of rectilinearity split (high, low) and orientation (upright, inverted). All error bars represent 95% CIs. *, **, *** represent p < .05, .01, .001, respectively.

Influence of rectilinearity

Similar to Experiments 1 and 2, in an additional exploratory analysis we tested whether rectilinearity accounted for the size Stroop effect. In the high-rectilinearity trials, the large texforms were on average more rectilinear than the small texforms, t(28) = 6.62, p < .001, np2 = .61, BF = 30089.17. In contrast, in the low-rectilinearity trials, the large and small texforms did not differ, t(28) = -1.31, p = .201, np2 = .06, BF = .66.

We ran a 2x2 repeated-measures ANOVA on the size Stroop effect for the RT data (incongruent – congruent) with orientation (upright, inverted) and rectilinearity (high, low) as within-subjects factors. We only report the statistics related to the rectilinearity factor. Rectilinearity did not contribute to the size Stroop effect, as demonstrated by a non-significant main effect of rectilinearity, F(1, 33) = .34, p = .566, np2 = .01, BF = .24, and a non-significant 2-way interaction between orientation and rectilinearity, F(1, 33) = .2, p = .659, np2 = .01, BF = .25 (Figure 8C).

Correlation between Stroop effect and rectilinearity

We analyzed whether the trial-type (i.e., object pairs) Stroop effect correlated with trial-type rectilinearity differences between the large and small texform objects. There was no correlation for upright, r = -.01, t(33) = .31, p = .758, np2 = .003, BF = .19, or inverted texforms, r = -.01, t(33) = .26, p = .799, np2 = .002, BF = .19. Thus, while large and small texform objects differed in rectilinearity (see Methods), the variability in rectilinearity did not correlate with the variability in the size Stroop effect across trials. Note, however, that the size Stroop effect was numerically small (4 ms), such that this analysis may not have had enough sensitivity to detect such a correlation.

Response time distribution

Figure 9A shows the mean correct RTs as a function of orientation and response time bin (see Figure 9B for size Stroop effect: incongruent - congruent). The significant main effect of bin, F(3, 99) = 3.82, p = .012, np2 = .10, BF = 4.50, reflected a Stroop effect in bins 3 (t(33) = 2.21, p = .031, np2 = .13, BF = 1.02) and 4 (t(33) = 2.48, p = .016, np2 = .16, BF = 1.62) but not in bins 1 (t(33) = 1.52, p = .134, np2 = .07, BF = .61) and 2 (t(33) = 1.44, p = .156, np2 = .06, BF = .38). However, there was no interaction between orientation and bin, F(3, 99) = .33, p = .803, np2 = .01, BF = .06, suggesting that orientation did not impact the Stroop effect across bins.

Figure 9.


(A) Response time distribution for correct responses as a function of congruency, orientation, and response time bin. The y-axis scale is shared across panels. (B) Response time distribution for the size Stroop effect for correct responses as a function of orientation and response time bin. Bin 1 contains the 25% fastest responses of each participant. Bin 2 contains the next 25% fastest responses, and so on. All error bars represent 95% CIs. *, **, *** represent p < .05, .01, .001, respectively.

Summary

Overall, the results of Experiment 3 showed a small size Stroop effect for texforms. This effect occurred for both upright and inverted objects. Analysis of the RT distribution revealed that the orientation-independent effect emerged in slower trials.

Discussion

This study investigated what drives the automatic retrieval of real-world object size knowledge, as reflected in the size Stroop effect (Konkle & Oliva, 2012b). While traditional perceptual and memory models propose that object recognition is the primary access point to object knowledge, such as an object’s familiar size (e.g., Collins & Quillian, 1969; Rosch et al., 1976), recent research showed that size knowledge, as reflected in the size Stroop effect, can be accessed directly from an object’s low- and mid-level visual features, without high-level object processing and recognition taking place (Long & Konkle, 2017). Here we shed novel light on this debate by examining the role of recognition versus visual features in driving the size Stroop effect.

Experiments 1 and 2 replicated the original finding of a familiar-size Stroop effect for upright manmade objects (Konkle & Oliva, 2012b). Experiment 1 used image inversion to disrupt rapid object recognition while preserving the visual features in the objects (Mack et al., 2008). We reasoned that inversion should reduce or remove the size Stroop effect if the effect follows object recognition. The results showed that the large size Stroop effect for upright objects was nearly abolished for inverted objects, supporting our hypothesis. Moreover, to test whether rectilinearity is necessary for the size Stroop effect, Experiment 2 used large and small animate objects since they, unlike manmade objects, do not differ systematically in rectilinearity (see Methods). We reasoned that animate objects should show a smaller size Stroop effect than manmade objects if the effect depends on rectilinearity. However, the larger Stroop effect for animate objects suggests that rectilinearity is not necessary for the size Stroop effect. Since animate objects may be recognized faster and more automatically than manmade objects (New et al., 2007), the larger Stroop effect for animate than manmade objects may reflect stronger competition from recognition-evoked size knowledge. Overall, these findings indicate that the size Stroop effect partly depends on object recognition (Konkle & Oliva, 2012b).

In addition to showing that recognition plays a role in the size Stroop effect for manmade objects and animals, Experiment 3 replicated the size Stroop effect for unrecognizable texform objects (Long & Konkle, 2017). Texforms are manmade objects rendered unrecognizable while preserving some of the rectilinearity and texture properties that differ between large and small manmade objects. To test whether the size Stroop effect for texforms is modulated by inversion – for example, because some exemplars were recognizable – Experiment 3 used the inversion manipulation of Experiment 1. The size Stroop effect was independent of inversion, suggesting that the Stroop effect did not follow residual recognition but was driven by orientation-independent visual features associated with large vs. small objects (e.g., rectilinearity). Consistent with this finding, the exploratory rectilinearity analysis of Experiments 1 and 2 showed an association between rectilinearity differences and the size Stroop effect, with larger Stroop effects for trials in which the large and small objects differed more strongly in rectilinearity. Importantly, this effect was independent of inversion, suggesting independent contributions of visual feature associations and object recognition to the size Stroop effect.

Analysis of the response time distributions further characterized the time course of the size Stroop effect evoked by manmade objects, animals, and texforms. In Experiment 1, we found that the size Stroop effect for upright and inverted manmade objects differed for relatively slower responses. In Experiment 2, in slower trials, animate objects evoked a larger size Stroop effect than manmade objects. Both of these Stroop effects are arguably driven by recognition, and their emergence in the slower trials indicates that object recognition takes time to complete before it can interfere with the screen size task. Interestingly, for fast trials, in Experiment 1 there was no difference between upright and inverted trials, and in Experiment 2, manmade objects evoked a larger size Stroop effect than animals in the fastest responses. This suggests that the Stroop effect on these fast trials may have been driven by mid-level feature differences, which are equally present in upright and inverted trials, and more strongly present in manmade objects than animals.

Our findings are consistent with conventional theories of how perceptual stimuli evoke object knowledge. Rosch et al. (1976) proposed that a sensory stimulus’s “entry point” to memory is at a so-called “basic” category level. They had participants verify whether category labels at different levels of abstraction – e.g., “sparrow”, “bird”, or “animal” – matched a subsequent object image. They found that participants were faster at verifying the object-label match at the level of “bird” than at the level of “animal” or “sparrow”. It was suggested that visual sensory stimuli are first categorized at the basic level before evoking associated object knowledge. This notion is consistent with other classic models of memory organization arguing that knowledge is stored relative to basic-level categories (e.g., all birds can fly; a canary is a bird; therefore, it can fly) (e.g., Collins & Quillian, 1969; for reviews, see Yee et al., 2014). In the context of our task, and in line with the interpretation of the original study (Konkle & Oliva, 2012b), this implies that the automatic categorization of the objects (e.g., “car”, “ball”) in turn activates higher-order size knowledge (all cars are larger than balls). This would be a more accurate and generalizable mode of accessing real-world knowledge than using mid-level features such as rectilinearity, since large and small manmade objects often share such features (e.g., many small objects are rectilinear: small packages, cellphones, tablets, laptops, keys, etc.). The observation that a manipulation that disrupts our ability to categorize the object at the basic level greatly reduces the Stroop effect supports these classic views on how sensory stimuli activate conceptual knowledge.

What is the neural basis of the size Stroop effect? fMRI studies have shown that large and small objects evoke selective fMRI responses in medial and lateral ventral temporal cortex (VTC), respectively (Coutanche et al., 2019; Gabay et al., 2016; Julian et al., 2017; Konkle & Caramazza, 2013; Konkle & Oliva, 2012a; Khaligh-Razavi et al., 2018; Troiani et al., 2014; He et al., 2013). Part of this size organization has been causally linked to the size Stroop effect. That is, transcranial magnetic stimulation (TMS) to the lateral small-VTC area, but not a control area (vertex), reduced the size Stroop effect for manmade objects (Chiou & Lambon Ralph, 2016). However, it is unclear whether this disrupted the recognition of the objects (and subsequent extraction of size information) or size knowledge independent of object recognition. Interestingly, responses to large vs. small texforms overlap with responses to their real manmade object counterparts in VTC (Long et al., 2018), suggesting that the VTC size organization reflects visual feature differences between large and small objects. However, unlike for real objects, there is currently no evidence linking the size Stroop effect for texforms to the ventral stream. Moreover, animate objects are processed in distinct regions of visual cortex compared to manmade objects, and visual cortex either does not show a size organization for animate objects (Konkle & Caramazza, 2013) or shows a relatively weak size organization for animate objects relative to manmade objects (Luo et al., 2023); and yet, we find stronger size Stroop effects for animate than manmade objects in the current study. These considerations suggest that the size Stroop effect does not directly map onto the object size organization of the ventral temporal cortex.

Our findings indicate that real-world size is a salient object property whose value is automatically retrieved after object recognition. What is the functional relevance of this automatic retrieval? While it has been suggested that size is relevant for determining how we interact with an object (e.g., whether it can be grasped; Bainbridge & Oliva, 2015), our observation that animate objects also evoke a size Stroop effect suggests that interaction mode is not the sole reason. Another possibility is that real-world size is rapidly extracted to support judging an object’s distance from the observer: an object known to be small that nonetheless projects a large retinal image must be relatively close. On this view, the size knowledge extracted after object recognition contributes to distance judgments alongside other distance cues.
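
The geometry underlying this inference is standard, and a brief worked example (with illustrative numbers not taken from our experiments) makes the point: an object of known physical size S at distance d subtends a visual angle theta, so retrieved real-world size together with retinal size constrains distance. In LaTeX notation:

    % Size-distance relation (illustrative worked example).
    \theta = 2\arctan\!\left(\frac{S}{2d}\right)
    \quad\Longrightarrow\quad
    d \approx \frac{S}{\theta} \quad \text{for small } \theta \text{ (in radians)}
    % Example: a tennis ball (S \approx 0.067 m) subtending
    % 4^\circ \approx 0.07 rad implies d \approx 0.067 / 0.07 \approx 0.96 m.

Thus, once familiar size is retrieved, a single retinal measurement yields an estimate of egocentric distance, which would explain the value of retrieving size automatically rather than on demand.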

In summary, our results indicate that object recognition is a strong contributor to the automatic retrieval of real-world object size knowledge, as reflected in the size Stroop effect. However, our findings also provide evidence for an alternative route, whereby visual feature associations can interfere with perceptual size judgments in the absence of recognition. Future work should examine how the size Stroop effect relates to mechanisms of cognitive control, for example, how recognition processes interact with the processes that resolve task conflict (e.g., Kalanthroff et al., 2018).

Acknowledgements

This project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 725970).

Footnotes

The authors declare no conflict of interest.

Author contributions: SH: conceptualization and design, data collection, analysis, writing. YZ: conceptualization and proof-reading. LM & NU: conceptualization, design, and proof-reading. MP: conceptualization and design, writing.

References

1. Bainbridge WA, Oliva A. Interaction envelope: Local spatial representations of objects at all scales in scene-selective regions. NeuroImage. 2015;122:408–416. doi: 10.1016/j.neuroimage.2015.07.066.
2. Bao P, She L, McGill M, Tsao DY. A map of object space in primate inferotemporal cortex. Nature. 2020;583(7814):103–108. doi: 10.1038/s41586-020-2350-5.
3. Champely S, Ekstrom C, Dalgaard P, Gill J, Weibelzahl S, Anandkumar A, De Rosario H. pwr: Basic functions for power analysis. 2017. https://CRAN.R-project.org/package=pwr
4. Chiou R, Lambon Ralph MA. Task-related dynamic division of labor between anterior temporal and lateral occipital cortices in representing object size. Journal of Neuroscience. 2016;36(17):4662–4668. doi: 10.1523/JNEUROSCI.2829-15.2016.
5. Coggan DD, Tong F. Spikiness and animacy as potential organizing principles of human ventral visual cortex. Cerebral Cortex. 2023:bhad108. doi: 10.1093/cercor/bhad108.
6. Cohen JD, Dunbar K, McClelland JL. On the control of automatic processes: A parallel distributed processing account of the Stroop effect. Psychological Review. 1990;97(3):332. doi: 10.1037/0033-295x.97.3.332.
7. Collins AM, Quillian MR. Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior. 1969;8(2):240–247.
8. Coutanche MN, Thompson-Schill SL. Neural activity in human visual cortex is transformed by learning real world size. NeuroImage. 2019;186:570–576. doi: 10.1016/j.neuroimage.2018.11.039.
9. Freeman J, Simoncelli EP. Metamers of the ventral stream. Nature Neuroscience. 2011;14(9):1195–1201. doi: 10.1038/nn.2889.
10. Gabay S, Kalanthroff E, Henik A, Gronau N. Conceptual size representation in ventral visual cortex. Neuropsychologia. 2016;81:198–206. doi: 10.1016/j.neuropsychologia.2015.12.029.
11. Gayet S, Stein T, Peelen MV. The danger of interpreting detection differences between image categories: A brief comment on “Mind the snake: Fear detection relies on low spatial frequencies” (Gomes, Soares, Silva, & Silva, 2018). 2019.
12. Hagen S, Vuong QC, Jung L, Chin MD, Scott LS, Tanaka JW. A perceptual field test in object experts using gaze-contingent eye tracking. Scientific Reports. 2023;13(1):11437. doi: 10.1038/s41598-023-37695-9.
13. Hagen S, Vuong QC, Scott LS, Curran T, Tanaka JW. The role of spatial frequency in expert object recognition. Journal of Experimental Psychology: Human Perception and Performance. 2016;42(3):413. doi: 10.1037/xhp0000139.
14. Hagen S, Vuong QC, Scott LS, Curran T, Tanaka JW. The role of color in expert object recognition. Journal of Vision. 2014;14(9):9. doi: 10.1167/14.9.9.
15. He C, Peelen MV, Han Z, Lin N, Caramazza A, Bi Y. Selectivity for large nonmanipulable objects in scene-selective visual cortex does not require visual experience. NeuroImage. 2013;79:1–9. doi: 10.1016/j.neuroimage.2013.04.051.
16. Henik A, Tzelgov J. Is three greater than five: The relation between physical and semantic size in comparison tasks. Memory & Cognition. 1982;10:389–395. doi: 10.3758/bf03202431.
17. Julian JB, Ryan J, Epstein RA. Coding of object size and object category in human visual cortex. Cerebral Cortex. 2017;27(6):3095–3109. doi: 10.1093/cercor/bhw150.
18. Kalanthroff E, Davelaar EJ, Henik A, Goldfarb L, Usher M. Task conflict and proactive control: A computational theory of the Stroop task. Psychological Review. 2018;125(1):59. doi: 10.1037/rev0000083.
19. Kaufmann L, Koppelstaetter F, Delazer M, Siedentopf C, Rhomberg P, Golaszewski S, Ischebeck A. Neural correlates of distance and congruity effects in a numerical Stroop task: An event-related fMRI study. NeuroImage. 2005;25(3):888–898. doi: 10.1016/j.neuroimage.2004.12.041.
20. Kazak AE. Journal article reporting standards. American Psychologist. 2018. doi: 10.1037/amp0000263.
21. Khaligh-Razavi SM, Cichy RM, Pantazis D, Oliva A. Tracking the spatiotemporal neural dynamics of real-world object size and animacy in the human brain. Journal of Cognitive Neuroscience. 2018;30(11):1559–1576. doi: 10.1162/jocn_a_01290.
22. Konkle T, Caramazza A. Tripartite organization of the ventral stream by animacy and object size. Journal of Neuroscience. 2013;33(25):10235–10242. doi: 10.1523/JNEUROSCI.0983-13.2013.
23. Konkle T, Oliva A. A real-world size organization of object responses in occipitotemporal cortex. Neuron. 2012a;74(6):1114–1124. doi: 10.1016/j.neuron.2012.04.036.
24. Konkle T, Oliva A. A familiar-size Stroop effect: Real-world size is an automatic property of object representation. Journal of Experimental Psychology: Human Perception and Performance. 2012b;38(3):561. doi: 10.1037/a0028294.
25. Li SPD, Bonner M. Curvature as an organizing principle of mid-level visual representation: A semantic-preference mapping approach. NeurIPS 2020 Workshop SVRHM; 2020.
26. Liu AY. Decrease in Stroop effect by reducing semantic interference. Perceptual and Motor Skills. 1973;37(1):263–265. doi: 10.2466/pms.1973.37.1.263.
27. Long B, Konkle T. A familiar-size Stroop effect in the absence of basic-level recognition. Cognition. 2017;168:234–242. doi: 10.1016/j.cognition.2017.06.025.
28. Long B, Konkle T, Cohen MA, Alvarez GA. Mid-level perceptual features distinguish objects of different real-world sizes. Journal of Experimental Psychology: General. 2016;145(1):95. doi: 10.1037/xge0000130.
29. Long B, Moher M, Carey S, Konkle T. Real-world size is automatically encoded in preschoolers’ object representations. Journal of Experimental Psychology: Human Perception and Performance. 2019;45(7):863. doi: 10.1037/xhp0000619.
30. Long B, Yu CP, Konkle T. Mid-level visual features underlie the high-level categorical organization of the ventral stream. Proceedings of the National Academy of Sciences. 2018;115(38):E9015–E9024. doi: 10.1073/pnas.1719616115.
31. Luo A, Wehbe L, Tarr M, Henderson MM. Neural selectivity for real-world object size in natural images. bioRxiv. 2023. doi: 10.1101/2023.03.17.533179.
32. Mack ML, Gauthier I, Sadr J, Palmeri TJ. Object detection and basic-level categorization: Sometimes you know it is there before you know what it is. Psychonomic Bulletin & Review. 2008;15(1):28–35. doi: 10.3758/pbr.15.1.28.
33. Morey R, Rouder J. BayesFactor: Computation of Bayes factors for common designs. R package version 0.9.12-4. 2022.
34. Naor-Raz G, Tarr MJ, Kersten D. Is color an intrinsic property of object representation? Perception. 2003;32(6):667–680. doi: 10.1068/p5050.
35. Nasr S, Echavarria CE, Tootell RB. Thinking outside the box: Rectilinear shapes selectively activate scene-selective cortex. Journal of Neuroscience. 2014;34(20):6721–6735. doi: 10.1523/JNEUROSCI.4802-13.2014.
36. New J, Cosmides L, Tooby J. Category-specific attention for animals reflects ancestral priorities, not expertise. Proceedings of the National Academy of Sciences. 2007;104(42):16598–16603. doi: 10.1073/pnas.0703913104.
37. Peirce JW, Gray JR, Simpson S, MacAskill MR, Höchenberger R, Sogo H, Kastman E, Lindeløv J. PsychoPy2: Experiments in behavior made easy. Behavior Research Methods. 2019. doi: 10.3758/s13428-018-01193-y.
38. Ridderinkhof KR. Activation and suppression in conflict tasks: Empirical clarification through distributional analyses. Mechanisms in Perception and Action. 2002:494–519.
39. Rosch E, Mervis CB, Gray WD, Johnson DM, Boyes-Braem P. Basic objects in natural categories. Cognitive Psychology. 1976;8(3):382–439.
40. Stroop JR. Studies of interference in serial verbal reactions. Journal of Experimental Psychology. 1935;18(6):643.
41. Troiani V, Stigliani A, Smith ME, Epstein RA. Multiple object properties drive scene-selective regions. Cerebral Cortex. 2014;24(4):883–897. doi: 10.1093/cercor/bhs364.
42. Yargholi E, Op de Beeck H. Category trumps shape as an organizational principle of object space in the human occipitotemporal cortex. Journal of Neuroscience. 2023;43(16):2960–2972. doi: 10.1523/JNEUROSCI.2179-22.2023.
43. Yee E, Chrysikou EG, Thompson-Schill SL. Semantic memory. In: Ochsner KN, Kosslyn SM, editors. The Oxford handbook of cognitive neuroscience: Vol. 1. Core topics. Oxford University Press; 2014. pp. 353–374.
44. Yin RK. Looking at upside-down faces. Journal of Experimental Psychology. 1969;81(1):141.
