Proceedings of the National Academy of Sciences of the United States of America. 2025 Jun 12;122(24):e2414005122. doi: 10.1073/pnas.2414005122

Representation of locomotive action affordances in human behavior, brains, and deep neural networks

Clemens G Bartnik a, Christina Sartzetaki a, Abel Puigseslloses Sanchez a, Elijah Molenkamp a, Steven Bommer a, Nikolina Vukšić a, Iris I A Groen a,b,1
PMCID: PMC12184334  PMID: 40504155

Significance

To navigate the world around us, we can use different actions, such as walking, swimming, or climbing. How does our brain compute and represent such locomotive action affordances? Here, we show that activation patterns in visual regions in the human brain represent information about affordances independent of other visual elements such as surface materials and objects, and do so in an automatic manner. We also demonstrate that commonly used models of visual processing in human brains, namely object- and scene-classification trained deep neural networks, do not strongly represent this information. Our results suggest that locomotive action affordance perception in scenes relies on specialized neural representations different from those used for other visual understanding tasks.

Keywords: scene-selective visual cortex, affordances, locomotion, deep neural networks

Abstract

To decide how to move around the world, we must determine which locomotive actions (e.g., walking, swimming, or climbing) are afforded by the immediate visual environment. The neural basis of our ability to recognize locomotive affordances is unknown. Here, we compare human behavioral annotations, functional MRI (fMRI) measurements, and deep neural network (DNN) activations to both indoor and outdoor real-world images to demonstrate that the human visual cortex represents locomotive action affordances in complex visual scenes. Hierarchical clustering of behavioral annotations of six possible locomotive actions shows that humans group environments into distinct affordance clusters using at least three separate dimensions. Representational similarity analysis of multivoxel fMRI responses in the scene-selective visual cortex shows that perceived locomotive affordances are represented independently from other scene properties such as objects, surface materials, scene category, or global properties, and independent of the task performed in the scanner. Visual feature activations from DNNs trained on object or scene classification, as well as a range of other visual understanding tasks, correlate less strongly with behavioral and neural representations of locomotive affordances than with object representations. Training DNNs directly on affordance labels or using affordance-centered language embeddings increases alignment with human behavior, but none of the tested models fully captures locomotive action affordance perception. These results uncover a type of representation in the human brain that reflects locomotive action affordances.


Humans navigate their local environments with remarkable ease, whether cycling through busy city streets or hiking along rugged trails. This highlights the ability of the human brain to not only effortlessly process visual information, but also identify appropriate locomotive actions from a wide range of potential actions. The recognition of ecologically relevant behaviors in the immediate environment is commonly referred to as affordance perception (1). Theoretical frameworks of affordance perception are long-standing in fields such as ecological psychology (1–3), but the neural mechanisms mediating this ability during scene perception are only just beginning to be explored.

While initial research on scene perception emphasized scene-defining objects and their relationships (e.g., refs. 4 and 5), later work included global features such as spatial layout (6), as well as affordance-related concepts such as navigability (7) (see ref. 8 for a review). fMRI has identified three brain regions involved in processing complex visual scenes (9)—the Parahippocampal Place Area (PPA), Occipital Place Area (OPA), and Medial Place Area (MPA, also known as retrosplenial cortex or RSC; 10). Of these, PPA has been consistently associated with the representation of global spatial layout of scenes (11–14), while a pioneering study by ref. 15 showed that OPA represents navigable space in particular, in the form of pathways that afford walking. Subsequent analyses showed that these pathway representations in OPA aligned with representations of extended surfaces and floor elements in convolutional neural networks (CNNs) trained on scene recognition or scene segmentation (16, 17). These and other findings (18–23) gave rise to the hypothesis that OPA is involved in the computation of navigational scene affordances.

Further empirical evidence of an important role of affordances in scene perception was found by ref. 24, who showed that diverse actions, such as playing sports and transportation, strongly affect human scene categorization of a wide range of environments in a large-scale scene dataset. Ref. 25 replicated these behavioral findings, but found no evidence for representation of broadly defined affordances in the scene-selective cortex, nor in object- or scene-classification trained CNNs. These results thus argue against a role of scene-selective brain regions, including OPA, in representing affordances. The discrepancy in findings with ref. 15 may be due to differences in the types of stimuli used (diverse environments vs. indoor scenes) and to how affordances were operationalized (potential action labels vs. navigable pathways). One intriguing hypothesis that could potentially unify these findings is that scene-selective regions represent only a distinct subset of action affordances, namely those that pertain specifically to navigation. Here, we test whether scene-selective regions differentiate visual environments based on whether they afford locomotive actions other than walking, such as cycling, driving, or swimming.

Given OPA’s hypothesized role in navigational affordance perception, we might expect such locomotive action affordances to be primarily encoded in OPA; however, other work linking spatial scene and object properties with human interaction possibilities (26, 27) also implicates other visual brain regions. A priori, it seems likely that disambiguating different locomotive action affordances requires perceiving not only spatial properties of scenes (e.g., the presence of navigable surfaces such as floors or roads) but also surface materials (asphalt for driving or cycling) and contained objects (rocks for climbing, a body of water for swimming), i.e., a variety of diagnostic features (28). Such properties may be encoded not only in OPA but also in PPA, which represents not only the spatial layout of scenes, but also textures (29, 30) and objects (31–33). However, it is unclear which, if any, combination of these properties is sufficient to predict the perceived affordances of a scene, and their corresponding representation in the human brain, or whether additional (possibly nonvisual) information is involved. Finally, neural representation of action-related information has been shown to be sensitive to task context (e.g., ref. 34), but prior studies did not probe participants to actively report perceived scene affordances during fMRI measurements, which may impact neural representations of diagnostic features.

Here, we investigated whether and how humans represent locomotive action affordances by testing how well various scene properties predict action affordance representation as assessed with both behavioral and fMRI measurements. Using a novel set of natural scenes spanning six common locomotive affordances, we test the hypothesis that the human scene-selective cortex represents action possibilities that pertain to locomotion. We furthermore assess whether locomotive affordance representations are task-dependent by comparing neural activation patterns during tasks that require participants to explicitly report locomotive affordances, vs. tasks that do not. To probe the neural computations underlying affordance representations, we examine how well these representations can be predicted by features extracted from a wide variety of DNNs, including image, video, and multimodal vision-language models, and we explore how well linguistic descriptions from large language models (LLMs) can capture these representations. Collectively, our results provide evidence for locomotive action affordance representation in the scene-selective cortex that is not fully captured by other scene properties, DNNs, or LLMs.

Results

To investigate whether the human brain represents locomotive action possibilities in real-world scenes, we collected and curated a novel set of scene images, spanning indoor, outdoor man-made, and outdoor natural environments (Fig. 1A). Human participants then annotated these images on six different actions that people can use to move in the immediate environment (Fig. 1 B, Left). The images were also annotated on four other types of visual scene properties thought to be important for scene perception (materials, scene category, objects, and global properties; Fig. 1 B, Right). We then applied representational similarity analysis (RSA, 35) by computing, for each pair of images, the dissimilarity in the annotations of locomotive action affordances, as well as annotations of the four different visual scene properties, resulting in five representational dissimilarity matrices (RDMs), that we then compared with one another. In RSA, high dissimilarity between a pair of images in the RDM indicates these images differ strongly on the annotated property (e.g., one affording swimming while the other affords walking), while low dissimilarity indicates the images are perceived as having similar properties (e.g., both afford swimming). Furthermore, a high correlation between RDMs of two different scene properties indicates that these properties overlap, while a low correlation indicates independent representations. We also used RSA on a subset of these images and tasks in an fMRI experiment, to test for locomotive action affordance representations in neural responses in the scene-selective cortex, and on visual feature activations extracted from various DNNs (Fig. 1C). Here, we interpret a significant RSA correlation between fMRI response patterns with the human behavioral annotations for a given scene property as evidence for neural representation of that property, while a high RSA correlation between DNN feature activations and human behavior or brain responses indicates that the DNN adequately captures the human representations, i.e., is “representationally aligned” (36).
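As a concrete illustration of the RSA logic described above, the sketch below computes RDMs from hypothetical annotation matrices and correlates them. It uses generic NumPy/SciPy calls and randomly generated placeholder data, not the authors' actual annotation pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical annotation matrices: one row per image, one column per label
# (e.g., the fraction of participants selecting each of six locomotive actions).
n_images = 20
affordance_annotations = rng.random((n_images, 6))
object_annotations = rng.random((n_images, 6))

def rdm(features, metric="euclidean"):
    """Representational dissimilarity matrix: pairwise distances between images."""
    return squareform(pdist(features, metric=metric))

def rdm_correlation(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices_from(rdm_a, k=1)
    return spearmanr(rdm_a[iu], rdm_b[iu]).correlation

affordance_rdm = rdm(affordance_annotations)
object_rdm = rdm(object_annotations)
print(f"RDM correlation (affordances vs. objects): "
      f"{rdm_correlation(affordance_rdm, object_rdm):.2f}")
```

The same two steps (build an RDM per measurement, correlate RDM upper triangles) apply whether the features come from behavioral annotations, voxel patterns, or DNN activations.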

Fig. 1.

Stimuli, behavioral tasks, and experimental design. (A) Examples from our newly curated stimulus set to measure locomotive action affordance representation, consisting of 231 high-resolution photographs (collected from Flickr under CC0 1.0, CC BY 2.0, and CC BY-SA 4.0 licenses), evenly distributed across indoor, outdoor man-made, and outdoor natural environments. (B) Overview of the five labeling tasks used to collect behavioral annotations. In addition to annotating six locomotive action affordances (Left), participants also annotated (Right) perceived materials (labels: Water, Wood, Vegetation, Dirt/Soil, Sand, Stone/Concrete, Snow/Ice, Pavement, Carpet, Metal), scene category (labels: Room, Inside city, Hallway, Forest, Desert, Open country, Mountain, Coast/River), objects (labels: Building/Wall, Tree/Plant, Road/Street, Furniture, Body of water, Rocks/Stones), and global properties (Man-made vs. Natural, Open vs. Closed, Near vs. Far, Navigable vs. Nonnavigable). (C) In addition to behavioral annotations, multivoxel activity patterns were measured in three scene-selective brain ROIs: PPA, OPA, and MPA, under three different task instructions (locomotive action affordance labeling; object labeling; orthogonal task at fixation). We also extracted layer activations to this new stimulus set from a variety of DNNs trained with different task objectives. RSA was then used to compare the resulting representational spaces, by computing RDMs based on pairwise differences in responses between images for each type of measurement, and correlating those to one another.

Locomotive Action Affordances Form a Distinct Representational Space.

To understand how humans represent locomotive action affordances, we sorted the RDM obtained from the affordance annotation task in two ways: by the type of environment (indoor, outdoor-manmade, outdoor-natural; Fig. 2 A, Left) and after applying hierarchical clustering (SI Appendix, SI Methods and Fig. 2 A, Right). The RDM sorted on environment type shows that indoor scenes are most similar to one another in terms of the locomotive actions they afford, while outdoor scenes are more dissimilar. This is likely because most indoor environments primarily afford walking, while outdoor environments typically afford multiple actions (e.g., walking and cycling). Indeed, hierarchical clustering of the affordance RDM yields at least four major clusters, the biggest of which refers to walking, mostly containing indoor scenes (Fig. 2 A, Right). However, this cluster also contains scene images from other environments, and the remaining clusters also span multiple environment types, indicating there is not a one-to-one mapping of environment type to locomotive affordance. Indeed, conceptual RDMs of environment type that explicitly separate the image set into distinct superordinate classes yield only modest correlations with the affordance RDM (indoor/outdoor-natural/outdoor-manmade: ρ = 0.24; indoor/outdoor: ρ = 0.12; man-made/natural: ρ = 0.23).
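The clustering and the conceptual-RDM comparison can be sketched as follows; this is a minimal illustration with random placeholder data and an arbitrary linkage choice, not the exact procedure described in SI Appendix, SI Methods.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 30

# Placeholder affordance RDM and per-image environment labels
affordance_rdm = squareform(rng.random(n * (n - 1) // 2))
env_type = rng.choice(["indoor", "outdoor_manmade", "outdoor_natural"], size=n)

# Hierarchical clustering on the condensed RDM (average linkage as an example)
Z = linkage(squareform(affordance_rdm, checks=False), method="average")
clusters = fcluster(Z, t=4, criterion="maxclust")
print("cluster sizes:", np.bincount(clusters)[1:])

# Conceptual RDM for environment type: 0 if same class, 1 if different
concept_rdm = (env_type[:, None] != env_type[None, :]).astype(float)

iu = np.triu_indices(n, k=1)
rho = spearmanr(affordance_rdm[iu], concept_rdm[iu]).correlation
print(f"Affordance vs. environment-type RDM correlation: {rho:.2f}")
```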

Fig. 2.

Locomotive action affordances form a distinct representational space. (A) RDM derived from locomotive action affordance annotations, sorted on environment type (Left) or after hierarchical clustering on affordances (Right). Horizontal legend colors (green/blue/black) indicate different environment types. Vertical legend colors (gray/yellow/red/blue) indicate different locomotive actions, determined based on the predominant affordance label (SI Appendix, SI Methods). (B) MDS of the action affordance RDM shows that clusters extend in three directions. A high-resolution version with image thumbnails is provided in SI Appendix, Fig. S1. (C) PCA of the locomotive affordance annotations reveals a three-dimensional structure with images separating along three PCs that represent a swimming/boating dimension, a biking/driving dimension, and a climbing dimension. (D) Comparisons between the locomotive action affordances RDM and RDMs derived from other scene properties, indicating high correlations between objects, materials, global properties, and scene categories, and comparatively lower correlations with affordances. See SI Appendix, Fig. S2 for correlations between RDMs computed with alternative distance metrics. RDMs for each behavioral annotation task, as well as a comparison with automated labeling methods, are in SI Appendix, Fig. S3. (E) Variance partitioning of locomotive action affordance RDM by the three scene properties yielding the highest combined correlation. The Euler diagram shows the unique and shared variance in locomotive affordances explained by objects, materials, and global properties. In total, 80.1% of the variance in the affordance representational space remains unexplained (residual circle; not drawn to scale).

To better understand the clustering of locomotive affordances, we visualized the dissimilarity in affordance annotations for all scenes in a 2D space by applying multidimensional scaling (MDS) to the affordance RDM (Fig. 2B). This shows that walking, as the most common category, forms a central hub surrounded by distinct clusters for the other actions: one extending toward biking and driving, another toward swimming and boating, and a third toward climbing. This three-dimensional structure is also evident when applying principal component analysis (PCA), a dimensionality reduction technique that computes directions in representational space that best describe the variance in affordance annotations. Visualizing the scene images along the first three principal components (PCs) shows clear separation of the locomotive affordances along these dimensions (Fig. 2C). These results suggest that the locomotive action affordance labels provided by our human participants form a structured representational space spanning at least three distinct dimensions.
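Both visualizations can be reproduced in outline with scikit-learn; the snippet below uses hypothetical annotation vectors and default settings rather than the study's exact parameters.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Hypothetical per-image affordance annotation vectors (n_images x 6 actions)
annotations = rng.random((40, 6))
affordance_rdm = squareform(pdist(annotations))

# 2D MDS embedding computed from the precomputed dissimilarity matrix
mds_coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(affordance_rdm)

# PCA on the raw annotation vectors; inspect the variance carried by the first 3 PCs
pca = PCA(n_components=3).fit(annotations)
print("MDS embedding shape:", mds_coords.shape)
print("Variance explained by first 3 PCs:", np.round(pca.explained_variance_ratio_, 2))
```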

We quantified the relation between locomotive action affordances and other scene properties by correlating the affordance RDM with RDMs derived from annotations of objects, materials, scene categories, and global properties obtained in the same behavioral experiment (Fig. 1 B, Right). Affordances show lower correlations with other scene properties (average ρ = 0.34, SD = 0.04) compared to how correlated these are with one another (average ρ = 0.69, SD = 0.07) (Fig. 2D). This pattern was robust across multiple distance metrics for quantifying representational dissimilarity (SI Appendix, Fig. S2), suggesting affordances show minimal overlap with other scene properties considered in isolation. To test whether a combination of these properties better predicts affordance perception, we conducted a variance partitioning analysis, which computes a series of linear regression models to estimate the degree of unique and shared variance between the affordance RDM and combinations of the three best correlating RDMs of other scene properties (SI Appendix, SI Methods). This analysis shows that even when combined, other scene properties only account for a small amount (19.9%) of the total variance in locomotive action affordances annotations, with the only unique contribution coming from objects (unique r2 = 4.7%). Materials and objects share 4.0% of the variance in affordance annotations, global properties and objects share 2.4%, and most of the explained variance (8.5%) is shared by all three scene properties (Fig. 2E). This high proportion of shared variance in objects, materials, and global properties highlights a well-known property of natural scenes, namely that they exhibit inherent covariance of visual properties (37). Our observation that locomotive affordance annotations are not trivially predicted by (a linear combination of) other scene properties suggests they form a distinct representational space.
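A minimal version of such a variance partitioning analysis, operating on vectorized (upper-triangle) RDMs and using ordinary least squares, might look like this; the data are random placeholders, and the full inclusion-exclusion bookkeeping behind the Euler diagram is reduced here to the unique-variance terms for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def r_squared(X, y):
    """R-squared of an OLS fit predicting the target RDM from predictor RDMs."""
    return LinearRegression().fit(X, y).score(X, y)

def variance_partition(target, predictors):
    """Unique variance per predictor RDM: R2(full model) - R2(model without it)."""
    names = list(predictors)
    X_all = np.column_stack([predictors[n] for n in names])
    full = r_squared(X_all, target)
    unique = {}
    for n in names:
        others = np.column_stack([predictors[m] for m in names if m != n])
        unique[n] = full - r_squared(others, target)
    return full, unique

# Hypothetical vectorized RDMs (upper-triangle entries) for illustration
rng = np.random.default_rng(3)
n_pairs = 500
target = rng.random(n_pairs)
predictors = {"objects": rng.random(n_pairs),
              "materials": rng.random(n_pairs),
              "global_properties": rng.random(n_pairs)}

full_r2, unique = variance_partition(target, predictors)
print(f"total explained: {full_r2:.3f}",
      {name: round(v, 3) for name, v in unique.items()})
```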

The limited correspondence between locomotive affordance representations and other scene properties is surprising given our a priori intuition that we use (a combination of) readily available scene features to compute affordances, as well as the clear separation of the affordance annotations into distinct clusters separating, for example, water-based actions (swimming, boating) from other actions (Fig. 2 B and C). One possibility is that our annotations of other scene properties, consisting of a handful of labels in each task, did not exhaustively capture the representation of locomotive action affordances. To test this, we also computed RDMs based on outputs of machine learning classifiers trained on the SUN Attribute Database (38), ADE20K (39), and Places365 (40), which have larger sets of labels of materials, global properties, categories, and objects (e.g., 150 objects in ADE20k; 37 materials in SUN Attribute Database; 365 scene categories in Places365). Despite this more exhaustive sampling, these automated labels yielded similarly low predictions of our locomotive action affordance annotations (average ρ = 0.21, SD = 0.09; SI Appendix, Fig. S3). Moreover, classifier outputs of function labels from the SUN Attribute Database, which include several locomotive actions such as cycling and driving, also correlated comparatively low with the same classifier’s outputs of other scene properties (average ρ = 0.28, SD = 0.14). These results suggest that the low predictability of human locomotive action affordance perception by other scene properties is not driven by our precise choice of annotation labels, but rather reflects a persistent difference in representational spaces.

Together, these behavioral results suggest that our selection of six locomotive action affordances of visual scenes forms a three-dimensional representational space that is readily and robustly accessed by human participants, yet weakly correlates with environment types and other scene properties.

Scene-Selective PPA and OPA Uniquely Represent Locomotive Action Affordances.

Above, we used behavioral annotations to establish the distinct representational space of affordances for six ways people can move in the world. Next, we tested our hypothesis that these locomotive action affordances are represented in the visual system. Brain activity was measured using a 3 T MRI scanner while human participants viewed a strategically sampled subset of 90 images that maintained the three-dimensional structure of the behavioral affordance representational space (SI Appendix, Fig. S4 and SI Methods). To examine a potential influence of task instructions on affordance representation in the brain, all participants performed three distinct tasks on each of the 90 images while in the scanner: affordance annotation, object annotation, and an orthogonal task at fixation (SI Appendix, SI Methods).

We first established that our behavioral tasks transferred well to the scanner environment by analyzing the behavior of the fMRI participants on the affordance and object annotation tasks. Average RDMs derived from in-scanner behavioral responses (Fig. 3A) correlated strongly with those derived from annotations obtained in the original behavioral experiment (affordances: ρ = 0.85; objects: ρ = 0.82), with a similar modest correlation between affordance and object annotations as in the original experiment (ρ = 0.44; SI Appendix, Fig. S5A). In-scanner affordance annotations also exhibited reasonable agreement among subjects (mean ρ = 0.52; SI Appendix, Fig. S5B), showing that action affordance representations are stable across both testing environment and participants, even more so than object annotations (mean across-subject ρ = 0.33). Overall, the in-scanner behavior replicates the original behavioral task for a smaller set of images, ensuring that participants performed the same behavior while we measured activity in their visual cortex.

Fig. 3.

Scene-selective regions represent locomotive action affordances. (A) RDMs derived from in-scanner locomotive action affordance (Left) and object (Right) annotations, sorted by hierarchical clustering of the action affordance RDM. (B) Correlation of response pattern dissimilarity in scene-selective brain regions with locomotive action affordance (red) and object (blue) RDMs. Bars represent averages for individual participants (gray dots). Error bars indicate SEM across participants. Shaded areas delineate upper and lower noise ceilings, reflecting similarity between individual participant’s RDMs and the group mean RDM. Asterisks indicate one-sample t tests against zero (P < 0.05); horizontal brackets indicate significant differences (P < 0.05) between the average locomotive affordance and object RSA correlations in paired-sample t tests. Significance was corrected for multiple comparisons by applying Bonferroni correction across the two behavioral spaces, the three scene-selective ROIs, and both one-sample and pairwise tests. (C) Correlation between action affordance and fMRI RDMs, when partialling out the object space, and vice versa. Plot elements as in (B). (D) Task-specific correlations with the locomotive action affordances RDM (dark shades: locomotive action affordance task; medium shades: object task; light shades: fixation task). (E) Task-specific correlations with object annotations RDM. Shades as in (D). SI Appendix, Tables S7–S10 report the full set of statistical results obtained in each of these figure panels.

To investigate what scene properties are represented in scene-selective brain areas, we constructed RDMs based on pairwise correlation distances of multivoxel activity patterns in each individual subject’s PPA, OPA, and MPA (SI Appendix, Fig. S6), averaged across all tasks, and correlated these RDMs with the average RDMs derived from the in-scanner affordance and object behavioral annotations, respectively (Fig. 3A). We found significant evidence for locomotive affordance representations in PPA and OPA [Fig. 3B; both ρ > 0.07, t(19) > 4.81, P < 0.001] but not MPA [ρ = 0.03, t(19) = 2.59, P = 0.018; all tests were Bonferroni-corrected for multiple comparisons across behavioral RDMs and fMRI regions of interest (ROIs)]. Both PPA and OPA, but not MPA, also represented object information; furthermore, using paired-sample t tests, we established that responses in PPA correlated significantly higher with locomotive affordances than with objects (see SI Appendix, Table S7 for full statistical summary). In contrast, OPA and MPA did not show significant differences between the two behavioral spaces. Similar results were obtained when comparing fMRI responses to individual participant behavior (SI Appendix, Fig. S7A).

The significant and stronger correlation with action affordance representations compared to objects suggests that locomotive action affordances are indeed represented in the scene-selective cortex. However, given the nonzero correlation between the affordance and object annotation-derived RDMs, some of the variance in fMRI response patterns may be equally well described by both spaces. To determine the degree of unique representation of affordances compared to objects, we computed partial correlations (Fig. 3C; see SI Appendix, Table S8 for statistical summary) and found significant unique correlations with the affordance space in PPA and OPA, but not MPA, and no significant partial correlation with the object space in any ROI. Paired t tests furthermore showed that the partial correlation with the affordance space is significantly higher than the object space for PPA, but not for OPA and MPA. These results suggest that PPA and OPA, but not MPA, represent locomotive action affordances of real-world scenes independent of their contained objects.
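Conceptually, such a partial correlation rank-transforms the vectorized RDMs, regresses out the control RDM, and correlates the residuals; a sketch with placeholder data (not the authors' implementation) is shown below.

```python
import numpy as np
from scipy.stats import rankdata, pearsonr

def partial_spearman(x, y, control):
    """Spearman partial correlation of x and y controlling for `control`
    (all inputs are vectorized RDM upper triangles)."""
    def residualize(a, b):
        a_rank, b_rank = rankdata(a), rankdata(b)
        beta = np.polyfit(b_rank, a_rank, 1)
        return a_rank - np.polyval(beta, b_rank)
    return pearsonr(residualize(x, control), residualize(y, control))[0]

# Hypothetical vectorized RDMs; 90 images give 90*89/2 = 4,005 pairs
rng = np.random.default_rng(4)
n_pairs = 4005
objects = rng.random(n_pairs)
affordances = 0.4 * objects + 0.6 * rng.random(n_pairs)   # partly overlapping spaces
fmri = 0.3 * affordances + 0.7 * rng.random(n_pairs)       # simulated brain RDM

print(f"partial rho (fMRI ~ affordances | objects): "
      f"{partial_spearman(fmri, affordances, objects):.2f}")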

Comparisons of the fMRI response patterns with all five representational spaces derived from the original behavioral experiment (Fig. 1B) furthermore show that affordance annotations exhibit the highest correlation in all three ROIs, followed by global properties (SI Appendix, Fig. S7B). Interestingly, unlike objects, affordances and global properties both exhibit a unique correlation with PPA (SI Appendix, Fig. S7C). We furthermore find that the locomotive action affordance representation in PPA and OPA is independent from broad superordinate environment type distinctions (indoor/outdoor, man-made/natural) as well as low-level visual representations captured by the GIST model, a feature descriptor used in computer vision to represent the global structural characteristics of an image by summarizing the spectral energy across different scales, orientations, and image locations (6) (SI Appendix, Fig. S8). Overall, these results show that scene-selective PPA and OPA represent different locomotive action affordances in scenes, and that these representations are at least partly independent from other scene properties.
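For reference, a highly simplified GIST-style descriptor can be computed by pooling spectral energy over scale and orientation bins within a spatial grid; the sketch below illustrates the general idea only and is not the original Oliva and Torralba implementation used in the paper.

```python
import numpy as np

def gist_like_descriptor(image, grid=4, n_scales=4, n_orientations=8):
    """Simplified GIST-style descriptor: per grid cell, pool Fourier magnitude
    into scale (spatial frequency) and orientation bins."""
    h, w = image.shape
    fy, fx = np.meshgrid(np.fft.fftfreq(h), np.fft.fftfreq(w), indexing="ij")
    radius = np.sqrt(fx ** 2 + fy ** 2)
    theta = np.mod(np.arctan2(fy, fx), np.pi)
    scale_edges = np.linspace(0, radius.max() + 1e-9, n_scales + 1)
    orient_edges = np.linspace(0, np.pi, n_orientations + 1)

    feats = []
    bh, bw = h // grid, w // grid
    for i in range(grid):
        for j in range(grid):
            block = image[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            spec = np.abs(np.fft.fft2(block, s=(h, w)))   # zero-pad so masks align
            for s in range(n_scales):
                for o in range(n_orientations):
                    mask = ((radius >= scale_edges[s]) & (radius < scale_edges[s + 1]) &
                            (theta >= orient_edges[o]) & (theta < orient_edges[o + 1]))
                    feats.append(spec[mask].mean() if mask.any() else 0.0)
    return np.asarray(feats)

# Example: descriptor for a random grayscale "image"
img = np.random.default_rng(5).random((128, 128))
print(gist_like_descriptor(img).shape)   # grid*grid*n_scales*n_orientations features
```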

Task-Independent Representation of Locomotive Affordances in PPA and OPA.

Having established unique locomotive action affordance representations in PPA and OPA, we next asked how task instructions affect the strength of these representations in the brain. We generated RDMs based on fMRI responses separately for the three tasks performed in the scanner (affordance annotation, object annotation, orthogonal fixation task) and again correlated each of these with the RDMs derived from in-scanner behavior (Fig. 3A). If locomotive affordances are only represented in the brain when task-relevant, we expect to find significant correlations with the affordance RDM only during the affordance task, but not the object task or the fixation task. Alternatively, if locomotive affordances are computed automatically, significant correlations with the affordance RDM are expected in all three tasks. A third possibility is that locomotive action affordances are represented when participants actively respond to the scene images (i.e., during either the affordance or object task), but not when they respond to the fixation cross (fixation task).

The results (Fig. 3D; see SI Appendix, Table S9 for statistical summary) are most consistent with task-independent computation of locomotive action affordances in PPA and OPA. PPA exhibited significant correlations with the affordance space across both the affordance and the object annotation tasks, while OPA only showed a significant correlation in the object task. However, the average correlations are overall of similar magnitude across tasks, and pairwise comparisons between tasks were nonsignificant, indicating a lack of a clear, robust task effect. Consistent with the initial task-averaged analysis, we find no significant correlation with the affordance space for any task in MPA. These results suggest that the representations of locomotive action affordances within PPA and OPA, but not MPA, remain largely consistent regardless of the task at hand. For objects, only PPA showed a significant correlation during the object task (Fig. 3E; see SI Appendix, Table S10 for statistical summary), but correlations were overall weak and pairwise comparisons provided no statistical evidence for task-dependent object representations. Similar results were obtained when comparing partial correlations across tasks (SI Appendix, Fig. S9). Together, these results suggest that representations of locomotive action affordances are automatically extracted by OPA and PPA.

Representation of Locomotive Affordances Extends into Mid-Level Visual Regions.

Thus far, our findings demonstrate that locomotive action affordances are represented within scene-selective ROIs. To explore whether affordance representations extend beyond these predefined regions, we performed a series of whole-brain searchlight analyses. Using spherical ROIs with a 5-mm radius centered around each voxel, we extracted multivoxel patterns spanning the entire brain volume of each participant (SI Appendix, SI Methods). Analogous to the ROI analysis, we correlated RDMs computed from those multivoxel patterns to the behavioral RDMs derived from in-scanner affordance or object annotations (Fig. 3A).
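In outline, the searchlight procedure loops over voxels, gathers the local multivoxel pattern, builds an RDM, and correlates it with the behavioral RDM. The sketch below uses a voxel-radius neighborhood on synthetic data (a 5-mm radius corresponds to roughly two to three voxels at typical functional resolutions, which is an assumption on our part) and omits the permutation-based thresholding.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def searchlight_rsa(volume_patterns, mask, behavioral_rdm_vec, radius_vox=2):
    """Minimal searchlight RSA: for every voxel inside `mask`, correlate the
    local pattern RDM with a behavioral RDM (condensed/vectorized form).
    volume_patterns: (n_images, x, y, z) response estimates per image."""
    _, X, Y, Z = volume_patterns.shape
    coords = np.argwhere(mask)
    result = np.full((X, Y, Z), np.nan)
    for cx, cy, cz in coords:
        dist = np.linalg.norm(coords - np.array([cx, cy, cz]), axis=1)
        neigh = coords[dist <= radius_vox]                 # spherical neighborhood
        patterns = volume_patterns[:, neigh[:, 0], neigh[:, 1], neigh[:, 2]]
        local_rdm = pdist(patterns, metric="correlation")  # 1 - Pearson r
        result[cx, cy, cz] = spearmanr(local_rdm, behavioral_rdm_vec).correlation
    return result

# Tiny synthetic example: 90 images, 10x10x10 volume, all voxels in the mask
rng = np.random.default_rng(6)
volumes = rng.standard_normal((90, 10, 10, 10))
mask = np.ones((10, 10, 10), dtype=bool)
behavioral_rdm_vec = rng.random(90 * 89 // 2)
corr_map = searchlight_rsa(volumes, mask, behavioral_rdm_vec)
print("mean searchlight correlation:", np.nanmean(corr_map))
```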

The searchlight results were consistent with the ROI analysis. Significant correlations with locomotive affordances were found in OPA and PPA (Fig. 4A), but not in MPA. These correlations also extended beyond the scene-selective regions into other parts of the visual cortex. For visualization, we overlaid the ROIs from the Julian atlas (41) that were used in our main analyses, as well as an additional ROI of early visual area V1 from the Juelich Brain atlas (42). To account for the small size of the scene-selective regions in the Julian atlas (which includes only voxels that systematically overlap across many participants), we also overlaid averaged scene-selective ROIs from a publicly available, high-powered fMRI study (the Natural Scenes Dataset, 43). Overlaying these ROIs on the searchlight correlation maps shows that the representation of locomotive affordances extends mainly into mid-level and early visual cortex, including V1, while not extending into more anterior regions (e.g., parietal or premotor areas).

Fig. 4.

Affordance representations extend into mid-level visual regions. (A) Medial (Left) and lateral (Right) views of the average searchlight RSA correlation (Spearman’s ρ) with locomotive action affordances (one-sample test against zero, thresholded using a whole-brain permutation-based multiple comparison correction at P < 0.05). Unthresholded correlation maps are shown as smaller Insets. Group-level ROIs—PPA, OPA, and MPA—derived from the Julian atlas (41) are highlighted in lighter colors and extended with the V1 ROI from the Juelich atlas (42). Corresponding group-level ROIs from NSD (43) are overlaid in darker shades of the same colors. (B) Average partial searchlight RSA correlation with locomotive action affordances, controlling for GIST representations. (C) Average partial searchlight RSA correlation with locomotive action affordances, controlling for object representations. (D) Average searchlight correlations in predefined ROIs. Error bars indicate SEM across participants. Asterisks indicate one-sample t tests against zero (P < 0.05) corrected for multiple comparisons using Bonferroni correction.

Analogous to the ROI analysis, we conducted partial correlation analyses to investigate to what extent these correlations can be explained by overlap with other scene properties, specifically object and GIST features. Fig. 4 B and C highlight that accounting for this overlap reduces the extent of significant searchlight correlations, especially in the early visual cortex, but does not completely eradicate correlations in mid-level visual regions. Applying the Julian atlas ROIs to the searchlight correlation maps directly (Fig. 4D) shows that the searchlight analysis replicates our ROI findings, demonstrating significant locomotive action affordance representation in the OPA and PPA, but not MPA. Notably, while significant correlations are also found in V1, this region shows overall lower correlations compared to PPA and OPA, with a comparable correlation for affordances as for GIST. Furthermore, controlling for overlapping features by partialling out the GIST or object features substantially reduces the correlations with locomotive affordances in this region.

In sum, searchlight analyses reproduce our hypothesis-driven ROI analysis, providing robust evidence for locomotive affordance representations in scene-selective regions, while also demonstrating extensions into lower and mid-level visual regions.

Pretrained DNNs Show Weak Alignment with Locomotive Action Affordances.

To further investigate the representations underlying locomotive action affordance perception, we again used RSA to examine to what extent deep neural networks (DNNs) capture human behavioral and brain responses to our image set. We sampled models with different architectures (both CNNs and Transformers), task objectives (object/scene classification, scene segmentation, video classification), and training procedures (supervised learning using labels, contrastive learning with image–text pairs, self-supervised learning) and focus on the comparative ability of these models to capture the dissimilarity between human annotations of perceived locomotive affordances vs. objects. We computed RDMs from the DNN activations to the scene stimuli in a subset of layers in each model (SI Appendix, SI Methods), and report for each model the layer that correlated highest with human behavioral or brain response (correlations for all individual layers of a few example models are provided in SI Appendix, Fig. S12).
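The layer-wise comparison can be sketched with a torchvision backbone and forward hooks; the snippet below uses an ImageNet-pretrained ResNet50 and random placeholder images purely for illustration (the specific models, layer selections, and preprocessing used in the study are described in SI Appendix, SI Methods).

```python
import numpy as np
import torch
from torchvision.models import resnet50, ResNet50_Weights
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Placeholder inputs: in practice these would be the preprocessed scene images
# and the vectorized behavioral RDM for the same image set.
n_images = 16
images = torch.rand(n_images, 3, 224, 224)
behavioral_rdm_vec = np.random.default_rng(7).random(n_images * (n_images - 1) // 2)

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()

# Capture activations from a few intermediate blocks via forward hooks
activations = {}
def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.flatten(start_dim=1).detach().numpy()
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(make_hook(name))

with torch.no_grad():
    model(images)

# One RDM per layer; report the layer correlating best with the behavioral RDM
scores = {name: spearmanr(pdist(feats, metric="correlation"),
                          behavioral_rdm_vec).correlation
          for name, feats in activations.items()}
best = max(scores, key=scores.get)
print("best layer:", best, {k: round(v, 3) for k, v in scores.items()})
```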

All DNNs demonstrate significant correlations with both locomotive action affordance and object representations [all t(19) > 17.87, all P < 0.001; Fig. 5 A and B]. However, across all models, average correlations were substantially lower with locomotive action affordances (average ρ = 0.19, SD = 0.06) than with objects [average ρ = 0.31, SD = 0.07; pairwise t(12) = 10.23, P < 0.001]. Classic CNNs optimized for object classification on ImageNet correlated least with human behavioral action affordances, but correlations increased for CNNs trained on scene datasets, with the highest correlation for a CNN trained on semantic segmentation (ResNet50 trained on ADE20k: ρ = 0.21). Video-CNNs trained on action recognition (X3D and SlowFast) and CNNs trained with alternative task objectives (DINO or CLIP) show comparable correlations as these scene-dataset trained models, while Vision Transformers (ViTs) exhibit the highest correlations with locomotive affordances (maximum for ViT Base Patch 16 with ρ = 0.29). Human object annotations (Fig. 5B) also showed the highest correlation with the ViT Base Patch 16 (ρ = 0.38).

Fig. 5.

Comparison of DNNs with human behavior and fMRI responses in the scene-selective cortex. (A) Correlation between in-scanner locomotive action affordance annotation RDMs and RDMs derived from feature activations inside DNNs with different architectures and training objectives. Bars represent average across participants for the best correlating layer (SI Appendix, Table S1); gray dots indicate individual participants. Significant correlations are marked by asterisks indicating P < 0.05 for a one-sample t test corrected for multiple comparisons using Bonferroni correction. The shaded area delineates upper and lower noise ceiling; error bars indicate SEM across participants. (B) Correlation of DNN RDMs with in-scanner object annotation RDMs. Plot elements as in (A). (C) Correlation of ViT Base Patch 16 with annotations for the full set of images from the original behavioral experiment and automatic labeling outputs. Black horizontal lines indicate significant (P < 0.05) pairwise comparisons with a two-tailed sign test (FDR-corrected). Error bars indicate 95% CI derived from a bootstrap distribution (n = 1,000). (D) Correlation of each DNN’s best correlating layer with PPA (averaged across tasks); OPA and MPA are reported in SI Appendix, Fig. S10. Plot elements as in (A). (E) Correlations between scene-selective cortex and locomotive action affordances, when partialling out CLIP ViT-B 32. (F) Same as (E) but for objects. (G) Correlation between in-scanner human behavior and four different ResNet50 models: original (pretrained on Places365, same as in A and B), finetuned (last fully connected layer retrained on function labels of the SUN Attribute Database), scratch (trained end-to-end using function labels), untrained (randomly initialized). (H) Correlations between in-scanner human behavior and RDMs derived from LLM embeddings of three types of image captions (general caption, action affordance-centered caption, object-centered caption; see SI Appendix, Table S2). (I) Correlation between in-scanner human behavior and RDMs derived from GPT-4 outputs, using different annotation prompts: choose one (Single) or multiple (Multi) affordances/objects from the same options human participants had; or list a maximum of 50 possible actions/objects in the scene (BoW; see SI Appendix, Table S3). (G–I) Plot elements as in (A).

These results suggest that, unlike human PPA and OPA, the DNNs we tested generally exhibit higher correspondence with object representations than with locomotive action affordances. Indeed, even the best correlating DNN with in-scanner affordance annotations (ViT Base Patch 16) correlated significantly higher with objects [pairwise t(19) = 3.66, P = 0.0017]. Moreover, the relatively large gap between the best correlating model and the noise ceiling for locomotive affordance behavior shows that a substantial amount of variance in the stable affordance representational space in humans is unexplained by DNNs. We find the same pattern when comparing this ViT with the behavioral annotations obtained in the original behavioral experiment conducted on the full image set, as well as automated labels obtained with the scene segmentation model trained on ADE20k, scene categories from Places365 and attribute classifiers (Fig. 5C): affordance annotations, whether obtained from human behavioral annotations or automatic labeling, show significantly lower correlations with the ViT activations than with all other representational spaces (with the exception of the sparse scene categories from Places365), again demonstrating their distinct representation.

Together, these results demonstrate significantly lower representational alignment of DNNs with locomotive action affordances compared to objects. While this is unsurprising for DNNs explicitly trained to classify objects, we note that this result also holds for models trained on other tasks such as scene classification and scene segmentation, and models trained with self-supervised (DINO) and contrastive learning objectives (CLIP).

Affordance Representation in PPA and OPA Is Independent of DNN Features.

To determine how well the different types of DNNs align with human brain responses to our novel stimulus set, we also compared their feature activations with fMRI response patterns in PPA (Fig. 5D), and OPA and MPA (SI Appendix, Fig. S10). All models show significant correlations with the scene ROIs [PPA: all t(19) > 4.08, P < 0.001, OPA: all t(19) > 4.51, P < 0.001 and MPA: all t(19) > 4.71, P < 0.001]. Mirroring behavior, ViTs show the numerically highest correlations in all three ROIs (PPA vs. CLIP ViT-B 32, ρ = 0.070; OPA vs. CLIP ViT-B 32, ρ = 0.061; MPA vs. CLIP ViT-B 16, ρ = 0.043), although the ResNet50 trained on scene segmentation performs on par (PPA, ρ = 0.068). However, model correlations are far below the noise ceiling, and models show less differentiation here than in the behavioral comparisons (Fig. 5 A and B).

Partial correlation analyses furthermore show that the correlations of DNN features with fMRI responses in the scene-selective cortex primarily reflect object-related information. Both PPA and OPA exhibit significant unique correlations with locomotive affordances when partialling out the best correlating DNN (CLIP ViT-B 32), while the DNN itself also explains unique variance in all three ROIs (Fig. 5E; see SI Appendix, Table S11 for statistical summary). In contrast, none of PPA, OPA, or MPA shows significant positive correlations with object annotations when partialling out DNN features (Fig. 5F), while all still show significant unique correlations with the DNN features. These results suggest that DNN features do not fully account for locomotive action affordance representation in PPA and OPA, instead capturing object representations in these regions.

Enhancing Alignment with Human Affordance Perception Using Supervision and LLMs.

Our results point to a consistent gap in DNNs’ ability to explain locomotive affordance representations, compared to object representations, reflected in human behavioral and brain measurements. In this final section, we explore three ways in which DNN alignment with human affordance perception could potentially be improved.

First, while we tested a variety of DNNs, none of them was explicitly trained to perform locomotive affordance perception. One approach toward closing the gap between DNN object vs. affordance perception could be to simply train the model on affordance labels, instead of object labels. We therefore assessed whether DNN alignment with human behavior increases with explicit supervision with affordance labels, using the SUN Attribute Database subset of function labels (SI Appendix, Table S4 and SI Methods). We find that both end-to-end training and fine-tuning on these labels improves correlations with human-perceived affordances, relative to (pre)training on scene classification (Fig. 5G) [fine-tuned vs. original ∆ρ = 0.07, pair-wise t test t(19) = 6.89, P < 0.001; end-to-end training vs. original ∆ρ = 0.06, t(19) = 7.5, P < 0.001]. This shows that with direct mapping to affordance labels, visual features extracted from (pretrained) DNNs can better approximate human-perceived affordances. However, even with such direct supervision, the gap in alignment relative to objects remains (average difference in ρ = 0.15), and the increase obtained by training on affordance labels is relatively modest, remaining lower than the best performing ViT model trained on object recognition (Fig. 5A).
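The fine-tuning variant can be sketched as freezing a pretrained backbone and retraining only the classification head on multi-label affordance targets. The code below uses an ImageNet-pretrained ResNet50 and an invented label count as placeholders; the study's "original" model was Places365-pretrained and used the SUN Attribute Database function labels listed in SI Appendix, Table S4.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

n_affordance_labels = 7                       # hypothetical number of function labels
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False               # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, n_affordance_labels)  # new trainable head

criterion = nn.BCEWithLogitsLoss()            # multi-label targets in {0, 1}
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# One illustrative training step on a random batch (real data would be image
# tensors paired with binary affordance-label vectors)
images = torch.rand(8, 3, 224, 224)
targets = torch.randint(0, 2, (8, n_affordance_labels)).float()

optimizer.zero_grad()
logits = model(images)
loss = criterion(logits, targets)
loss.backward()                               # gradients flow only into model.fc
optimizer.step()
print(f"loss: {loss.item():.3f}")
```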

Second, prior work has proposed that human vision is not only shaped by visual features but also by language (e.g., ref. 44). For example, similarity judgments of visual stimuli are not only captured by visual features, but also by linguistic description such as captions (45). Moreover, recent fMRI studies suggest that DNNs trained with richer linguistic descriptions yield better predictions of high-level visual cortex responses to natural scenes (46, 47). Consistently, our results indicate that visual features extracted from CLIP models trained to pair images with captions are among the best ranking models in terms of alignment with human behavior and fMRI responses (Fig. 5 A, B, and D), although notably they do not show substantially higher correlations than the ViT Base Patch 16 or the scene segmentation model, which were not trained with rich linguistic descriptors. To examine to what extent purely linguistic representations capture locomotive action affordance perception, we used a multimodal LLM (GPT-4) to generate an affordance-centered caption, an object-centered caption, and a general caption for each image in our stimulus set (see SI Appendix, Table S2 for examples). These captions were then fed to another LLM (SI Appendix, SI Methods) to extract language embeddings, from which we computed RDMs to compare to our human behavioral annotation RDMs.
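To illustrate the caption-embedding analysis, the sketch below embeds a few invented captions with an off-the-shelf sentence-embedding model and turns the embeddings into an RDM; both the embedding model named here and the captions are stand-ins, since the actual captioning (GPT-4) and embedding models are specified in SI Appendix, SI Methods.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sentence_transformers import SentenceTransformer

# Invented captions standing in for the GPT-4-generated affordance-centered captions
captions = [
    "A paved road winding through a forest, suitable for cycling or driving.",
    "A calm lake surrounded by mountains where one could swim or take a boat.",
    "A narrow indoor hallway with a carpeted floor that affords walking.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # open-source stand-in model
embeddings = embedder.encode(captions)

# Caption RDM (cosine distances); in the study, RDMs like this are correlated
# against the behavioral annotation RDMs for the same images using Spearman's rho.
caption_rdm = squareform(pdist(embeddings, metric="cosine"))
print(np.round(caption_rdm, 2))
```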

As shown in Fig. 5H, LLM embeddings of all three types of captions correlate significantly with the human-perceived locomotive affordances [general caption ρ = 0.22; affordance caption ρ = 0.26; object caption ρ = 0.22; all t(19) > 19.65, P < 0.001]. Moreover, LLM embeddings of affordance-centered captions correlate significantly higher with human affordance behavior than the other two types of captions [both pairwise t(19) > 8.02, P < 0.001]. However, all three LLM embeddings still show higher correlations with object annotations [general caption ρ = 0.38, affordance caption ρ = 0.39, object caption ρ = 0.38; all t(19) > 20.54, P < 0.001]. This indicates that while LLMs capture scene semantics to some extent, they still exhibit an alignment bias toward object representations vs. locomotive action affordances, similar to DNNs. Moreover, we find that the LLM-based captions do not capture fMRI responses in PPA and OPA well, showing substantially lower correlations compared to affordance annotations, and no significant partial correlations (SI Appendix, Fig. S13).

Finally, we explored whether it is possible to use a multimodal LLM to predict locomotive affordances directly, by conducting an analogous behavioral experiment with GPT-4 as with our human participants. We presented our images to GPT-4 with different prompts, asking it to either select one (Single) or multiple (Multi) actions/objects from the same list of response options as given to our human participants, or to name all possible actions and objects in the image (BoW; see SI Appendix, Table S3). We then created RDMs by converting the model outputs to one-hot vectors (SI Appendix, SI Methods) and correlated them to the RDMs derived from human behavior. As shown in Fig. 5I, GPT-4 generates behavioral output that now aligns to a similar extent with locomotive affordances as with object annotations, with overall highest correlations when given the exact same task as the participants (affordances: ρ = 0.515; objects: ρ = 0.440). Notably, however, even with the identical annotation task, a gap relative to the noise ceiling still remains, suggesting GPT-4 does not fully emulate human behavioral annotations.
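Converting the model's selections into RDMs amounts to building binary indicator vectors over the fixed response options and computing pairwise distances; a sketch with made-up responses and an arbitrary distance metric is shown below (the exact conversion is described in SI Appendix, SI Methods).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

options = ["walking", "biking", "driving", "swimming", "boating", "climbing"]

# Hypothetical GPT-4 responses for three images (e.g., from the "Multi" prompt)
responses = [
    ["walking", "biking"],
    ["swimming", "boating"],
    ["walking", "climbing"],
]

def to_indicator(selected, options):
    """Binary vector marking which response options the model selected."""
    return np.array([1.0 if option in selected else 0.0 for option in options])

vectors = np.stack([to_indicator(r, options) for r in responses])
model_rdm = squareform(pdist(vectors, metric="jaccard"))   # one of several possible metrics
print(model_rdm)
```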

These findings suggest that DNNs trained on object or scene classification, as well as several other commonly used visual understanding tasks, capture only some aspects of human scene representations. Specifically, we show that good alignment of visual DNN features with human object representations does not automatically transfer to alignment with affordance representations. Training DNNs directly on affordance labels increases alignment but does not fully close the gap with object recognition, and while a multimodal LLM can partially recapitulate human affordance annotations, LLM-generated linguistic descriptions alone show comparatively poor alignment with human affordance representations in the brain. Collectively, our comparisons suggest that the networks we tested do not fully capture locomotive action affordance perception in humans.

Discussion

The aim of our study was to assess whether the human visual system represents different locomotive actions afforded by complex natural scenes, and if these representations are distinct from other scene properties. The notion of action affordances shaping visual perception is long-standing in psychology (1), but the presence of distinct action affordance representations for scene environments in the human brain had so far not been demonstrated. Utilizing behavioral and fMRI measurements alongside computational DNN models, we found that human perceived locomotive affordances—reflecting different ways we can move through our local environment—form a unique representational space that is not trivially predicted from other perceived properties such as objects, materials, or global layout. Moreover, scenes associated with specific locomotive actions, such as driving or swimming, evoked similar multivoxel activation patterns in PPA and OPA (but not MPA), with representations extending into early- and mid-level visual regions. By accounting for variation in other visual scene properties, we find that these locomotive action affordances are indeed represented in the brain independent from objects and other (low-level) scene properties. Importantly, these representations proved to be task-independent, suggesting that OPA and PPA encode distinct locomotive action affordances automatically even when participants passively view the images or perform a nonnavigational task. While features extracted from pretrained DNNs partially explained activity in scene-selective regions, their alignment with locomotive affordances was lower than with object representations, suggesting that such features do not fully account for human affordance perception.

Prior fMRI research on affordance perception has primarily focused on the representation of navigational affordances, with evidence accumulating that OPA in particular encodes egocentric information relevant to navigating the immediate environment (15–23). In these studies, affordances are typically operationalized as the presence of “navigable space” in a scene, for example in the form of walkable pathways within images of indoor rooms (15), or distance to navigational boundaries in hallways (18). In contrast, our study operationalizes affordances as multiple possible locomotive action labels in both indoor and outdoor scenes. By considering multiple types of environments and navigational actions other than walking, our study therefore extends these prior findings by suggesting that OPA may also represent different types of locomotive actions that one could perform in navigable space. A recent study by ref. 48 also investigated whether OPA differentiates locomotive actions other than walking, specifically crawling and flying (neither of which was included in our behavioral annotations), and found selective responses only to videos taken during walking. This could mean that neural representation of distinct action affordances may be restricted to a relatively narrow range of locomotive actions, although methodological differences (the use of videos and analysis of univariate rather than multivariate responses) could also explain the discrepancy with our results.

In addition to OPA, we here also find representation of locomotive action affordances in PPA, which has been previously shown to represent a wide range of scene properties, including scene category (49) and spatial layout (11–13), but also textures (29, 30) and objects (31–33). In fact, the wide range of proposed roles of PPA in visual scene representation (see refs. 8 and 9 for recent reviews) and the inherent correlations between different scene properties (37, 50, 51) motivated us to collect multiple types of annotations for our scenes, allowing direct comparison against our new operationalization of action affordance representation in the scene-selective cortex. As shown in SI Appendix, Fig. S7B, we find that both locomotive action affordances and global properties are consistently represented in PPA, while scene categories and objects are less consistently represented. Importantly, representations of global properties do not ‘explain away’ affordances in PPA (or OPA) (SI Appendix, Fig. S7C), and results were robust across multiple ways of assessing affordance representation (in-scanner behavior, separate online experiment, and automatic labeling using classifiers). In contrast, the lack of representation of locomotive action affordances in MPA is consistent with a substantial body of work showing that MPA is involved in navigating the broader, rather than the immediate, environment (52, 53), linking visual inputs to cognitive maps in hippocampal systems (54, 55).

In addition to expanding the operationalization of affordances to different types of locomotive actions, we also assessed the task dependency of these representations by measuring brain responses to our scenes under three different task instructions. Prior evidence on task effects on occipitotemporal cortex representations is mixed, with some studies finding modulations of representational patterns depending on in-scanner tasks in object-selective (56, 57) and scene-selective cortex (25, 34, 58), while others observed representations of spatial layout in PPA (12) or pathways in OPA (15) even when not task-relevant. A direct comparison between a categorization and navigation task on neural representations of indoor walkways found task-specific effects in PPA, but not OPA (59). Here, we found that both OPA and PPA show task-independent representations of action affordances, consistent with classic psychophysical findings by ref. 7 highlighting that visual properties relevant to navigation are processed within very brief glimpses, as well as M/EEG studies showing early modulation of neural responses by scene properties (60, 61), including affordance-related information (62, 63). Our searchlight analysis also highlights that the representation of locomotive affordances may involve relatively early visual responses. Possibly, more task-dependent representations are present in other brain regions, including dorsal stream parietal or frontal cortex (34, 64).

DNNs have recently emerged as a popular tool for studying computations underlying human visual perception (65, 66), by mapping representations of DNN features to different brain regions (e.g., refs. 67 and 68). Studies focusing specifically on scene-selective regions found that they tend to correlate highest with mid-level layers in object- and scene-classification trained CNNs (16, 25, 69). Here, we find the same pattern for those CNNs, but models with other architectures or training objectives show a more complex pattern (SI Appendix, Fig. S12). Furthermore, ref. 17 pointed out that DNNs trained on scene segmentation, i.e., with the objective to spatially locate different elements of a scene, show a higher alignment with scene-selective regions; we observe a similar advantage in our data (Fig. 5D). More in-depth inspection of the model features driving DNN alignment with scene-selective regions in ref. 15 showed that it is driven by the presence of extended surfaces and floor elements in the lower visual field (16, 17), consistent with the general notion that these regions contain visuospatial representations (70).

While we here also find significant correlations of DNN features with scene regions, they did not correlate strongly with behavioral locomotive action affordance annotations, and did not explain away the variance in the brain response related to affordances (Fig. 5E). Instead, we find that all DNNs are more closely aligned with object representations—this distinction was consistently observed across multiple DNN architectures trained on a variety of tasks and datasets, arguing against the idea that affordance representations emerge “automatically” from DNN features learned for the purpose of image classification (16). The relatively poor alignment of DNN features with our behavioral and brain responses may reflect our operationalization of affordances through action labels, which are potentially less spatial and more semantic in nature than navigable pathways. However, our findings do not necessarily imply that locomotive affordance representations are not linked to visual features in scenes at all: Our DNN training and fine-tuning results (Fig. 5G) show that it is possible to achieve enhanced prediction of behaviorally perceived affordances using direct supervision with affordance labels, suggesting that relevant image features can be learned by DNNs to some extent, but that they might be different from those useful for object recognition.

We also explored whether nonvisual, linguistic features could explain the gap in DNN alignment with affordance perception. Our results show that purely linguistic descriptions of scenes, even when containing affordance-related information, only partially close this gap: They still show lower alignment with behavioral annotations of affordances than with objects (Fig. 5H), and low correlations with scene-selective cortex responses (SI Appendix, Fig. S13). In addition, visual features extracted from DNNs trained with richer linguistic supervision did not show substantially higher correlations compared to models trained with single labels (Fig. 5 A, B, and D). Interestingly, we do find that the gap in alignment with affordance behavior relative to objects can be closed when using GPT-4-vision to emulate human behavior directly (Fig. 5I), suggesting that this particular multimodal model has achieved a similarly efficient mapping of relevant image features for affordance labels as for object labels. This serves as a proof of principle that such a mapping can be learned, at least for the particular labels used in our study. However, one notable difference between DNNs and humans that could be an important factor in explaining the misalignment is that these models lack embodiment. Indeed, the classic ecological psychology view on affordance perception states that it is fundamentally shaped by the (bodily) abilities possessed by the observer (1, 2). Therefore, one promising direction toward closing the alignment gap could be to compare human perception with embodied AI systems that perform the kinds of visually guided locomotive behaviors examined here.
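
For concreteness, the sketch below shows one way to derive a purely linguistic representational space from scene descriptions, by embedding captions with a sentence encoder and computing pairwise distances; the encoder name and example captions are illustrative placeholders, not the descriptions or models used in this study.

```python
# Minimal sketch of building a language-based RDM from scene descriptions
# (illustrative: encoder name and example captions are placeholders, not the study's setup).
import numpy as np
from scipy.spatial.distance import pdist
from sentence_transformers import SentenceTransformer

captions = [
    "A narrow forest trail with exposed roots leading uphill.",
    "A calm lake bordered by reeds under an overcast sky.",
    "A paved city street with parked bicycles and storefronts.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do
embeddings = encoder.encode(captions)              # (n_scenes, embedding_dim)

# Language RDM: pairwise cosine distances between description embeddings.
language_rdm = pdist(embeddings, metric="cosine")

# This RDM can then be compared to behavioral or neural RDMs via Spearman correlation,
# as in the RSA sketch above.
print(language_rdm.shape)  # condensed vector of length n*(n-1)/2
```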

Several key questions remain for future work. First, we only sampled a subset of pretrained DNNs, and it is possible that a DNN we did not consider here, or a different alignment method than RSA (36), will provide better predictions of human affordance perception. Our goal here was to explore differences in alignment of affordances relative to other scene properties, not to provide a systematic comparison of how different DNN aspects (architectures, training datasets, task objectives) affect alignment with human perception (71–73). To facilitate such a comprehensive assessment, as well as further exploration of the action affordance representational space, we have made our stimuli, behavioral, and neural data openly available (74), allowing testing of more models as well as comparison with other public fMRI and behavioral benchmarks. Second, our analyses involving fine-tuning and end-to-end training of DNNs showed only modest improvements in human alignment; this gain could potentially be further improved by increasing the quality of the supervision. Here, we used SUN Attribute Database labels from refs. 38 and 75, which were not collected for the purpose of studying affordances per se, and notably do not include the label "walking," which, as shown in Fig. 1A, strongly drives human representational similarity. A dedicated large-scale dataset of locomotive action affordance annotations is needed to precisely assess how well human affordance perception can be mimicked using DNNs trained via direct supervision. Third, our comparison of task effects may be weakened by our experimental design, in which participants were only probed on a subset of trials in each fMRI run; although response probing was randomized across trials and therefore unpredictable for the participants, incentivizing them to perform the task throughout the run, stronger task effects may be observed when participants are consistently probed on each trial.

Finally, we here tested locomotive action affordance perception on static images, for which participants indicated how they would move through a depicted scene using simple button presses. This is arguably substantially different from real-world vision, in which humans 1) experience a continuous flow of input and 2) actively navigate the world. Future work should ideally combine more continuous inputs with more immersive and realistic navigational tasks (76, 77). In terms of computational models, a promising direction to close the gap in human alignment may be to consider DNN models trained on visual inputs that better approximate our daily navigational experiences (78, 79). Moreover, it would be very interesting to explore the alignment of human affordance perception with vision-language-action models from robotics (e.g., ref. 80), as these models have been trained in what could be considered more embodied contexts. Overall, our results suggest that to understand the visuo-semantic transformations underlying human perception of scene affordances, we need to move beyond object recognition and consider a wider spectrum of ecologically relevant behaviors.

To summarize, our key finding is that different locomotive action affordances are distinctly represented in behavioral and neural measurements of human visual perception. We performed extensive comparisons to demonstrate that this representational space can be reliably measured across different experimental paradigms, both behaviorally and with fMRI. We also explored how these representations relate to different types of visual information: human behavioral annotations, labels derived automatically via machine classification, and feature activations in a large set of DNN models, as well as models trained or prompted explicitly on affordances. Our results suggest that human representations of scene affordances are not fully captured by any of these models, thus offering a new human-alignment challenge for computational models of visual processing.

Materials and Methods

We collected a novel stimulus set by sourcing 231 high-resolution color photographs freely available on the internet. For the full set of scenes, we collected behavioral annotations from human observers (n = 152) during an online experiment. Participants provided informed consent by selecting a checkbox on a digital consent form, and the Ethical Committee of the Computer Science Department at the University of Amsterdam approved the experiment. Brain activation patterns were measured in human participants (n = 20) for a subset of 90 stimuli in an fMRI experiment consisting of four separate scanning sessions. Before participating in each fMRI session, participants filled out prescreening forms for MR safety and gave written informed consent. The Ethical Committee of the Psychology Department at the University of Amsterdam approved the experiment. To quantify scene image properties, we made use of a variety of computer vision and machine learning methods, including low-level feature models, linear classifiers, (multimodal) deep neural networks, large language models, and a generative AI chatbot. Full details on materials and procedures are provided in SI Appendix, SI Methods.

Supplementary Material

Appendix 01 (PDF)

pnas.2414005122.sapp.pdf (18.3MB, pdf)

Acknowledgments

This work was supported by a Veni grant (VI.Veni.194030) from the Netherlands Organisation for Scientific Research to I.I.A.G. We thank Lukas Muttenthaler, Steven Scholte, and Chris Baker for helpful feedback.

Author contributions

C.G.B. and I.I.A.G. designed research; C.G.B., C.S., A.P.S., E.M., S.B., and N.V. performed research; C.G.B. contributed new reagents/analytic tools; C.G.B. analyzed data; and C.G.B. and I.I.A.G. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission. W.H.W. is a guest editor invited by the Editorial Board.

Data, Materials, and Software Availability

Anonymized behavioral data, fMRI data, and DNN feature activations, as well as a code repository containing Python scripts to perform the preprocessing and analyses needed to reproduce the results, have been deposited in Locomotive action affordances in brains, behavior and DNNs (https://osf.io/v3rcq/) (74).


References

1. Gibson J. J., "The theory of affordances" in Perceiving, Acting and Knowing: Toward an Ecological Psychology, Shaw R., Bransford J., Eds. (Erlbaum, 1977), pp. 67–82.
2. Rietveld E., Kiverstein J., A rich landscape of affordances. Ecol. Psychol. 26, 325–352 (2014).
3. Osiurak F., Rossetti Y., Badets A., What is an affordance? 40 years later. Neurosci. Biobehav. Rev. 77, 403–417 (2017).
4. Biederman I., Mezzanotte R. J., Rabinowitz J. C., Scene perception: Detecting and judging objects undergoing relational violations. Cognit. Psychol. 14, 143–177 (1982).
5. Aminoff E. M., Kveraga K., Bar M., The role of the parahippocampal cortex in cognition. Trends Cogn. Sci. 17, 379–390 (2013).
6. Oliva A., Torralba A., Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 145–175 (2001).
7. Greene M. R., Oliva A., The briefest of glances: The time course of natural scene understanding. Psychol. Sci. 20, 464–472 (2009).
8. Bartnik C. G., Groen I. I. A., Visual Perception in the Human Brain: How the Brain Perceives and Understands Real-World Scenes (Oxford Research Encyclopedia of Neuroscience, Oxford University Press, 2023).
9. Epstein R. A., Baker C. I., Scene perception in the human brain. Ann. Rev. Vis. Sci. 5, 373–397 (2019).
10. Silson E. H., Steel A. D., Baker C. I., Scene-selectivity and retinotopy in medial parietal cortex. Front. Hum. Neurosci. 10, 412 (2016).
11. Epstein R. A., Kanwisher N., A cortical representation of the local visual environment. Nature 392, 598–601 (1998).
12. Kravitz D. J., Peng C. S., Baker C. I., Real-world scene representations in high-level visual cortex: It's the spaces more than the places. J. Neurosci. 31, 7322–7333 (2011).
13. Park S., Brady T. F., Greene M. R., Oliva A., Disentangling scene content from spatial boundary: Complementary roles for the parahippocampal place area and lateral occipital complex in representing real-world scenes. J. Neurosci. 31, 1333–1340 (2011).
14. Park S., Konkle T., Oliva A., Parametric coding of the size and clutter of natural scenes in the human brain. Cereb. Cortex 25, 1792–1805 (2015).
15. Bonner M. F., Epstein R. A., Coding of navigational affordances in the human visual system. Proc. Natl. Acad. Sci. U.S.A. 114, 4793–4798 (2017).
16. Bonner M. F., Epstein R. A., Computational mechanisms underlying cortical responses to the affordance properties of visual scenes. PLoS Comput. Biol. 14, e1006111 (2018).
17. Dwivedi K., Cichy R. M., Roig G., Unraveling representations in scene-selective brain regions using scene-parsing deep neural networks. J. Cogn. Neurosci. 33, 2032–2043 (2021).
18. Park J., Park S., Coding of navigational distance and functional constraint of boundaries in the human scene-selective cortex. J. Neurosci. 40, 3621–3630 (2020).
19. Persichetti A. S., Dilks D. D., Perceived egocentric distance sensitivity and invariance across scene-selective cortex. Cortex 77, 155–163 (2016).
20. Dilks D. D., Julian J. B., Kubilius J., Spelke E. S., Kanwisher N., Mirror-image sensitivity and invariance in object and scene processing pathways. J. Neurosci. 31, 11305–11312 (2011).
21. Julian J., Ryan J., Hamilton R., Epstein R., The occipital place area is causally involved in representing environmental boundaries during navigation. Curr. Biol. 26, 1104–1109 (2016).
22. Kamps F. S., Lall V., Dilks D. D., The occipital place area represents first-person perspective motion information through scenes. Cortex 83, 17–26 (2016).
23. Kamps F. S., Julian J. B., Kubilius J., Kanwisher N., Dilks D. D., The occipital place area represents the local elements of scenes. Neuroimage 132, 417–424 (2016).
24. Greene M. R., Baldassano C., Esteva A., Beck D. M., Fei-Fei L., Visual scenes are categorized by function. J. Exp. Psychol. Gen. 145, 82–94 (2016).
25. Groen I. I. A., et al., Distinct contributions of functional and deep neural network features to representational similarity of scenes in human brain and behavior. Elife 7, e32962 (2018).
26. Bainbridge W. A., Oliva A., Interaction envelope: Local spatial representations of objects at all scales in scene-selective regions. Neuroimage 122, 408–416 (2015).
27. Josephs E. L., Konkle T., Large-scale dissociations between views of objects, scenes, and reachable-scale environments in visual cortex. Proc. Natl. Acad. Sci. U.S.A. 117, 29354–29362 (2020).
28. Groen I. I. A., Silson E. H., Baker C. I., Contributions of low- and high-level properties to neural processing of visual scenes in the human brain. Philos. Trans. R. Soc. Lond. B Biol. Sci. 372, 20160102 (2017).
29. Henriksson L., Mur M., Kriegeskorte N., Rapid invariant encoding of scene layout in human OPA. Neuron 103, 161–171.e3 (2019).
30. Cant J. S., Xu Y., Object ensemble processing in human anterior-medial ventral visual cortex. J. Neurosci. 32, 7685–7700 (2012).
31. Janzen G., van Turennout M., Selective neural representation of objects relevant for navigation. Nat. Neurosci. 7, 673–677 (2004).
32. Marchette S. A., Vass L. K., Ryan J., Epstein R. A., Outside looking in: Landmark generalization in the human navigational system. J. Neurosci. 35, 14896–14908 (2015).
33. Harel A., Kravitz D. J., Baker C. I., Deconstructing visual scenes in cortex: Gradients of object and spatial layout information. Cereb. Cortex 23, 947–957 (2013).
34. Bracci S., Daniels N., Op de Beeck H., Task context overrules object- and category-related representational content in the human parietal cortex. Cereb. Cortex 27, 310–321 (2017).
35. Kriegeskorte N., Representational similarity analysis—Connecting the branches of systems neuroscience. Front. Syst. Neurosci. 2, 249 (2008).
36. Sucholutsky I., et al., Getting aligned on representational alignment. arXiv [Preprint] (2023). https://doi.org/10.48550/arXiv.2310.13018 (Accessed 24 February 2025).
37. Malcolm G. L., Groen I. I. A., Baker C. I., Making sense of real-world scenes. Trends Cogn. Sci. 20, 843–856 (2016).
38. Patterson G., Xu C., Su H., Hays J., The SUN Attribute Database: Beyond categories for deeper scene understanding. Int. J. Comput. Vis. 108, 59–81 (2014).
39. Zhou B., et al., "Scene parsing through ADE20K dataset" in IEEE Conference on Computer Vision and Pattern Recognition (CVPR, 2017).
40. Zhou B., Lapedriza A., Khosla A., Oliva A., Torralba A., "Places: A 10 Million image database for scene recognition" in IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), vol. 40, pp. 1452–1464.
41. Julian J., Fedorenko E., Webster J., Kanwisher N., An algorithmic method for functionally defining regions of interest in the ventral visual pathway. Neuroimage 60, 2357–2364 (2012).
42. Amunts K., Mohlberg H., Bludau S., Zilles K., Julich-Brain: A 3D probabilistic atlas of the human brain's cytoarchitecture. Science 369, 988–992 (2020).
43. Allen E. J., et al., A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nat. Neurosci. 25, 116–126 (2022).
44. Simanova I., Francken J. C., de Lange F. P., Bekkering H., Linguistic priors shape categorical perception. Lang. Cogn. Neurosci. 31, 159–165 (2016).
45. Marjieh R., et al., "Words are all you need? Language as an approximation for human similarity judgments" in The Eleventh International Conference on Learning Representations (ICLR) (2023).
46. Wang A. Y., Kay K., Naselaris T., Tarr M. J., Wehbe L., Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nat. Mach. Intell. 5, 1415–1426 (2023).
47. Doerig A., et al., Semantic scene descriptions as an objective of human vision. arXiv [Preprint] (2022). https://doi.org/10.48550/arXiv.2209.11737 (Accessed 24 February 2025).
48. Jones C. M., Byland J., Dilks D. D., The occipital place area represents visual information about walking, not crawling. Cereb. Cortex 33, 7500–7505 (2023).
49. Walther D. B., Caddigan E., Fei-Fei L., Beck D. M., Natural scene categories revealed in distributed patterns of activity in the human brain. J. Neurosci. 29, 10573–10581 (2009).
50. Greene M. R., Hansen B. C., Disentangling the independent contributions of visual and conceptual features to the spatiotemporal dynamics of scene categorization. J. Neurosci. 40, 5283–5299 (2020).
51. Lescroart M. D., Stansbury D. E., Gallant J. L., Fourier power, subjective distance, and object categories all provide plausible models of BOLD responses in scene-selective visual areas. Front. Comput. Neurosci. 9, 135 (2015).
52. Epstein R. A., Vass L. K., Neural systems for landmark-based wayfinding in humans. Philos. Trans. R. Soc. Lond. B Biol. Sci. 369, 20120533 (2014).
53. Park S., Chun M. M., Different roles of the parahippocampal place area (PPA) and retrosplenial cortex (RSC) in panoramic scene perception. Neuroimage 47, 1747–1756 (2009).
54. Vann S. D., Aggleton J. P., Maguire E. A., What does the retrosplenial cortex do? Nat. Rev. Neurosci. 10, 792–802 (2009).
55. Dilks D. D., Kamps F. S., Persichetti A. S., Three cortical scene systems and their development. Trends Cogn. Sci. 26, 117–127 (2022).
56. Harel A., Kravitz D., Baker C., Task-dependent representations of visual objects. J. Vis. 12, 273 (2012).
57. Hebart M. N., Bankson B. B., Harel A., Baker C. I., Cichy R. M., The representational dynamics of task and object processing in humans. Elife 7, e32816 (2018).
58. Lowe M. X., Gallivan J. P., Ferber S., Cant J. S., Feature diagnosticity and task context shape activity in human scene-selective cortex. Neuroimage 125, 681–692 (2016).
59. Persichetti A. S., Dilks D. D., Dissociable neural systems for recognizing places and navigating through them. J. Neurosci. 38, 10295–10304 (2018).
60. Groen I. I. A., Ghebreab S., Prins H., Lamme V. A. F., Scholte H. S., From image statistics to scene gist: Evoked neural activity reveals transition from low-level natural image structure to scene category. J. Neurosci. 33, 18814–18824 (2013).
61. Harel A., Groen I. I. A., Kravitz D. J., Deouell L. Y., Baker C. I., The temporal dynamics of scene processing: A multifaceted EEG investigation. eNeuro 3, ENEURO.0139-16.2016 (2016).
62. Harel A., Nador J. D., Bonner M. F., Epstein R. A., Early electrophysiological markers of navigational affordances in scenes. J. Cogn. Neurosci. 34, 397–410 (2022).
63. Djebbara Z., Fich L. B., Gramann K., The brain dynamics of architectural affordances during transition. Sci. Rep. 11, 2796 (2021).
64. McKee J. L., Riesenhuber M., Miller E. K., Freedman D. J., Task dependence of visual and category representations in prefrontal and inferior temporal cortices. J. Neurosci. 34, 16065–16075 (2014).
65. Kietzmann T. C., McClure P., Kriegeskorte N., "Deep neural networks in computational neuroscience" in Oxford Research Encyclopedia of Neuroscience (Oxford University Press, 2019).
66. Storrs K. R., Kriegeskorte N., Deep learning for cognitive neuroscience. arXiv [Preprint] (2019). https://arxiv.org/abs/1903.01458 (Accessed 24 February 2025).
67. Khaligh-Razavi S.-M., Kriegeskorte N., Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Comput. Biol. 10, e1003915 (2014).
68. Dwivedi K., Bonner M. F., Cichy R. M., Roig G., Unveiling functions of the visual cortex using task-specific deep neural networks. PLoS Comput. Biol. 17, e1009267 (2021).
69. King M. L., Groen I. I. A., Steel A., Kravitz D. J., Baker C. I., Similarity judgments and cortical visual responses reflect different properties of object and scene categories in naturalistic images. Neuroimage 197, 368–382 (2019).
70. Groen I. I. A., Dekker T. M., Knapen T., Silson E. H., Visuospatial coding as ubiquitous scaffolding for human cognition. Trends Cogn. Sci. 26, 81–96 (2022).
71. Conwell C., Prince J. S., Kay K. N., Alvarez G. A., Konkle T., A large-scale examination of inductive biases shaping high-level visual representation in brains and machines. Nat. Commun. 15, 9383 (2024).
72. Muttenthaler L., Dippel J., Linhardt L., Vandermeulen R. A., Kornblith S., "Human alignment of neural network representations" in The Eleventh International Conference on Learning Representations (ICLR) (2023).
73. Sartzetaki C., Roig G., Snoek C. G., Groen I. I. A., "One hundred neural networks and brains watching videos: Lessons from alignment" in The Twelfth International Conference on Learning Representations (ICLR) (2024).
74. Groen I. I. A., Bartnik C. G., Locomotive action affordances in brains, behavior and DNNs. Open Science Framework. https://osf.io/v3rcq/. Deposited 3 March 2025.
75. Patterson G., Hays J., "SUN attribute database: Discovering, annotating, and recognizing scene attributes" in 2012 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, Providence, RI, 2012), pp. 2751–2758.
76. Gregorians L., Spiers H. J., "Affordances for spatial navigation" in Affordances in Everyday Life, Djebbara Z., Ed. (Springer International Publishing, Cham, 2022), pp. 99–112.
77. Zhang T., Gallant J. L., A naturalistic navigation task reveals rich distributed representations of information across the human cerebral cortex. J. Vis. 20, 462 (2020).
78. Greene M. R., et al., The visual experience dataset: Over 200 recorded hours of integrated eye movement, odometry, and egocentric video. J. Vis. 24, 6 (2024).
79. Venkataramanan S., Rizve M. N., Carreira J., Asano Y. M., Avrithis Y., Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video. arXiv [Preprint] (2024). https://doi.org/10.48550/arXiv.2310.08584.
80. Kim M. J., et al., OpenVLA: An open-source vision-language-action model. arXiv [Preprint] (2024). https://doi.org/10.48550/arXiv.2406.09246 (Accessed 24 February 2025).
