Abstract
Visual scene perception enables rapid interpretation of the surrounding environment by integrating multiple visual features related to task demands and context, which is essential for goal-directed behavior. In the present work, we investigated the temporal neural dynamics underlying the interaction between the processing of bottom-up visual features and top-down contextual knowledge during scene perception. We asked whether newly acquired spatial knowledge would immediately modulate the early neural responses involved in the extraction of available navigational affordances (i.e., the number of open doors). For this purpose, we analyzed electroencephalographic data from 30 participants performing interleaved blocks of a scene memory task and a visuospatial memory task in which we manipulated the number of navigational affordances available. We used convolutional neural networks coupled with gradient-weighted class activation mapping to assess the main electroencephalographic channels and time points contributing to the classification performance. The results indicated an early temporal window of integration in occipitoparietal activity (50–250 ms post stimulus) for several aspects of visual perception, including scene color and number of affordances, as well as for spatial memory content. Moreover, a convolutional neural network trained to detect affordances in the scene memory task failed to generalize to the detection of the same affordances after participants learned spatial information about the goal position within the scene. Taken together, these results reveal an early common window of integration for scene and visuospatial memory information, with a specific and immediate top-down influence of newly acquired spatial knowledge on early neural correlates of scene perception.
Keywords: visual perception, scene, spatial memory, temporal dynamics, EEG, Grad-CAM
Introduction
Visual scene perception is a critical cognitive function that enables individuals to rapidly encode, interpret, and interact with their environment. This process requires the integration of low-, mid-, and high-level visual information (e.g., object features, spatial configurations, and landmarks) (Bartnik & Groen, 2023) to support efficient and adaptive behavior in dynamic settings such as natural environments (Epstein & Baker, 2019). The complexity of visual perception becomes even more pronounced in everyday life when we consider the variety of goals, tasks, contexts, or prior knowledge that modulate scene processing and its neural correlates (Bar, 2009; Bar et al., 2006; Kay, Bonnen, Denison, Arcaro, & Barack, 2023; Malcolm, Groen, & Baker, 2016; Nau, Schmid, Kaplan, Baker, & Kravitz, 2024; Ritchie, Wardle, Vaziri-Pashkam, Kravitz, & Baker, 2024). Therefore, a deeper understanding of the neural dynamics underlying scene perception, especially the interaction between bottom-up (e.g., visual features) and top-down (e.g., prior knowledge) information, is essential for elucidating the mechanisms of visual cognition.
In this context, the processing of navigational affordances available in visual scenes represents a promising theoretical and methodological framework (Bartnik, Vukšić, Bommer, & Groen, 2024; Djebbara, Fich, Petrini, & Gramann, 2019; Gregorians & Spiers, 2022; Naveilhan, Saulay-Carret, Zory, & Ramanoël, 2024). Initially proposed to be extracted automatically through purely bottom-up mechanisms during scene perception (Bonner & Epstein, 2017; Harel, Nador, Bonner, & Epstein, 2022), affordance processing has more recently been shown to be influenced by top-down processes such as contextual information (Aminoff & Tarr, 2021; Choi, McCloskey, & Park, 2020; Naveilhan et al., 2024). For example, Naveilhan et al. (2024) reported that learning the position of a goal situated in an adjacent room interfered with the processing of the navigational affordances available (i.e., the open doors in a room). Specifically, participants exhibited a linear decrease in task accuracy as the number of doors increased, a pattern that was not present before participants learned this spatial information. However, the neural mechanisms underlying this effect remain poorly understood. Indeed, whereas Naveilhan et al. (2024) proposed that learning contextual information might modulate early neural markers of visual scene processing, their results did not clarify how scene perception and visuospatial memory interact at the neural level to produce the behavioral effects observed. A possible explanation resides in the fact that these analyses were restricted to only a few occipitoparietal electrodes reported to be involved in the extraction of affordances (Bonner & Epstein, 2017; Harel, Groen, Kravitz, Deouell, & Baker, 2016; Harel et al., 2022; Kaiser, Häberle, & Cichy, 2020; Kamps, Chen, Kanwisher, & Saxe, 2024). Although informative, this restricted analysis may also have constrained the detection of broader, integrative brain dynamics that likely underpin complex visuospatial behaviors (Groen, Silson, & Baker, 2017; Kravitz, Saleem, Baker, & Mishkin, 2011). Thus, the neural correlates of this interaction between scene and visuospatial memory information remain elusive.
Recent advances in signal processing and supervised learning with neural networks now make it possible to consider multiple brain regions that support complex dynamics for visual perception. These computational approaches enable the extraction of spatiotemporally distributed neural patterns beyond what is accessible with traditional event-related potential (ERP) averaging. They offer a data-driven means to disentangle the contributions of distinct brain regions over time without prior assumptions about electrode relevance, and to assess how the information supporting classification generalizes to different contexts. Notably, Orima and Motoyoshi (2023) trained a convolutional neural network (EEGNet; Lawhern et al., 2018) to classify both scene categories and their global properties (e.g., naturalness, openness, roughness) based on electroencephalographic (EEG) data. By applying gradient-weighted class activation mapping (Grad-CAM; Selvaraju et al., 2017), the authors reported that early EEG signals from occipital electrodes were crucial for classifying openness, whereas later signals from frontal electrodes were crucial for determining naturalness and scene categories. Dwivedi, Sadiya, Balode, Roig, and Cichy (2024) further explored this complex sequencing of scene perception using representational similarity analysis to compare neural responses across occipital electrodes with computational models of two-dimensional, three-dimensional, and semantic features, as well as navigational affordances. Their results showed that visual features are processed earlier in the temporal sequence (130–170 ms post stimulus) than navigational affordances (approximately 300 ms), suggesting hierarchical processing in line with the findings of Orima and Motoyoshi (2023). However, these results also seem to be partially inconsistent with those of Harel et al. (2022), who showed that navigational affordances were processed earlier (approximately 230 ms). A possible explanation is that, in previous protocols, participants were passively presented with the scene (Bonner & Epstein, 2017; Harel et al., 2022), whereas in Dwivedi et al. (2024) participants explicitly engaged in navigational affordance processing.
These differences in task engagement, contextual information, and the features of the visual scenes used could influence the temporal dynamics of navigational affordance extraction, as previously reported only at the behavioral level (Naveilhan et al., 2024). Extending this notion of top-down influence to the neural level, Enge, Süß, and Rahman (2023) proposed that acquiring semantic knowledge about an object's function can immediately alter its visual processing. Using EEG, they demonstrated that, when participants viewed unfamiliar objects paired with a functional keyword rather than a filler, semantic insight immediately enhanced the high-level N170 component (150–200 ms), reduced the N400 (400–700 ms), and, during subsequent uncued viewing, increased the early P1 response (100–150 ms). These findings indicate that semantic knowledge can rapidly modulate both low- and high-level perceptual representations during goal-directed behavior, aligning with models of visual perception (Bar, 2009; Bar et al., 2006) proposing that frontal context-driven predictions preactivate spatial templates in high-level visual areas. However, this mechanism, embedding goal position information into the initial feedforward sweep, was entirely overlooked by the univariate ERP analysis in Naveilhan et al. (2024), which focused exclusively on P2 amplitude at occipitoparietal electrodes. Thus, within the domain of scene perception, the temporal dynamics of the neural correlates underlying the interaction between scene information processing and prior spatial information remain elusive and represent a major challenge for current research in visual cognition (Nau et al., 2024; Ritchie et al., 2024).
To address this issue, we conducted novel analyses using convolutional neural networks (CNNs) applied to the previously acquired dataset described above (Naveilhan et al., 2024). Participants completed scene perception and visuospatial memory tasks during which they first viewed the scenes, then learned the position of a goal, and were subsequently asked to retrieve it. Throughout the experiment, we systematically manipulated the number of available navigational affordances while controlling for low-level visual features across conditions. We included all 64 available electrodes to investigate the global dynamics of the expected interaction between task-relevant top-down information (i.e., prior spatial knowledge) and bottom-up information processing (i.e., visual features such as the number of open doors). The aim of the present study was to provide new insights into the temporal neural dynamics underlying task-specific visuospatial representations by leveraging the capacity of CNNs to disentangle the rich spatiotemporal structure of EEG data. Specifically, we hypothesize that top-down information will modulate the early neural signature of scene perception, including neural correlates of visual scene features, extending the results of Enge et al. (2023) to scene processing. In this sense, we expect that processing of prior spatial knowledge will modulate early occipitoparietal activity (approximately 200 ms), allowing the CNN to classify whether contextual information has been acquired. We also propose that this time window carries specific representations of the previously learned goal position, suggesting an early common integration of both scene and visuospatial memory information. Finally, we hypothesize that learned spatial information will modulate the early neural signature of affordance processing, even when visual features remain identical. Specifically, we expect that a CNN trained to classify navigational affordances without prior spatial knowledge will fail to generalize to conditions in which participants have learned goal-related information, even though the visual scene features are the same across conditions. This would suggest that visuospatial memory information interacts with the processing of visual scene features, thereby modifying how affordances are represented at the neural level.
Methods
Participants
In the present study, we conducted a reanalysis of a dataset from a previously published study (Naveilhan et al., 2024). EEG data were collected from 30 young adults (mean age, 24.31 ± 0.65 years; range, 19–31 years; 16 females). Sample size was determined a priori using G*Power (version 3.1.9.7; Faul, Erdfelder, Lang, & Buchner, 2007) based on a previous EEG study of scene perception (Naveilhan et al., 2025), which indicated that a minimum of 23 participants was required to achieve statistical power of 0.95 at an alpha level of 0.05. The study was approved by the local ethics committee (CERNI-UCA opinion no. 2021-050), and all participants gave informed consent prior to participation. All participants were right-handed, had normal or corrected-to-normal vision, and had no history of neurological or cognitive disorders.
Stimuli and procedure
The experiment used visual stimuli developed with Unity Engine (v2019.2.0.0f1) and presented on an iiyama ProLite B2791HSU monitor (1,920 × 1,080 resolution, 30–83 Hz) positioned 60 cm from the participants. Stimulus presentation was controlled by PsychoPy (v2022.13) running on a Dell Precision 7560 workstation with an Intel Xeon W-11955 processor. The stimuli consisted of images of simple rectangular rooms in which each of the three visible walls contained either a door (i.e., a navigational affordance) or a gray rectangle. These designs were adapted from previous studies (Bonner & Epstein, 2017; Harel et al., 2022). To avoid potential associations between door locations and wall colors, seven different door configurations and seven wall color variations were used, resulting in 49 unique stimuli. All stimuli are publicly available in the OSF repository, and a subset is presented in Figure 1.
Figure 1.
Presentation of a subset of the stimuli used, showing the seven possible affordance configurations and three of the eight wall colors included in the protocol. The complete set of stimuli is available in the OSF repository.
Participants completed two tasks: a scene memory task and a spatial memory task. They completed six similarly structured blocks, and within each block, stimulus presentation was pseudorandomized to maintain engagement. Each block began with the scene memory task, a one-back paradigm in which participants indicated whether the current scene matched the previous one based on wall color alone, responding with a designated key when a match was detected. This task consisted of 64 image presentations depicting four door configurations in two environments (i.e., two wall colors), with each image presented eight times. Participants then performed the spatial memory task. First, they passively viewed a guided navigation sequence through a series of rooms to learn the location of a hidden target that could appear on the left, center, or right. Importantly, each wall color was consistently associated with a specific target direction. Participants were then presented with the same pictures as those used in the scene memory task and asked to indicate the remembered goal direction as quickly and accurately as possible using a USB keyboard (e.g., in the red room the goal was on the right, whereas in the blue room it was on the left). Although the visual stimuli were the same in both tasks, the spatial memory task required participants to integrate spatial contextual information gained during the navigation phase. In both tasks, accurate performance depended on attention to the color of the wall, thus aligning visual input and attentional demands across tasks. The spatial memory task always followed the scene memory task within each block to prevent contamination of the scene memory task by prior spatial learning.
For each task, images were displayed for 1 second, followed by a fixation cross lasting between 0.5 and 1.0 second, and auditory feedback for responses. Over the course of the experiment, participants viewed a total of 1,540 images (770 per task) in a single session lasting approximately 55 minutes. Each task involved seven possible affordance configurations, resulting in 110 repetitions of each configuration per participant, with variations in wall color. With data from 30 participants, this resulted in an average of approximately 3,300 trials per class (e.g., left door open in the scene memory task). This number is comparable with the 3,337 trials per class reported in previous work (Orima & Motoyoshi, 2023) and with the order of magnitude suggested for such analyses in deep learning-based EEG studies (Roy et al., 2019).
EEG preprocessing
EEG data were acquired at a sampling rate of 500 Hz using a 64-channel waveguard cap with active wet electrodes, connected to an eego mylab amplifier (ANT Neuro) and digitized at 24-bit resolution. The reference electrode was positioned at CPz, and the ground electrode at AFz. Electrode impedances were maintained below 10 kΩ, with the majority under 5 kΩ. EEG recordings were synchronized with stimulus presentations using LabRecorder in the Lab Streaming Layer (Kothe et al., 2024).
Because preprocessing steps have been proposed to play an important role in the decoding capabilities of EEGNet (Kessler, Enge, & Skeide, 2024), we opted for an open source and fully reproducible pipeline implemented in MATLAB (R2024a), namely, the BeMoBIL pipeline (Klug et al., 2022) using EEGLAB v2024.2 (Delorme & Makeig, 2004). Non-experimental segments and highly artifacted portions of the data were manually excluded, and the signals were downsampled to 250 Hz. Line noise was automatically detected and removed using Zapline Plus (Klug & Kloosterman, 2022). Bad electrodes were detected based on a correlation threshold of 0.8 with adjacent electrodes and a maximum downtime of 0.6, and then interpolated using spherical interpolation, with an average of 3.10 ± 2.58 electrodes removed per subject. The data were referenced to the common average. Additional artifact cleaning was performed in the time domain using artifact subspace reconstruction with a cutoff threshold of 20 (Chang, Hsu, Pion-Tonachini, & Jung, 2020), resulting in the removal of an average of 7.51 ± 3.52% of the data points. We included artifact subspace reconstruction in the BeMoBIL pipeline to remove high-amplitude, non-stereotypical noise bursts early on (Delaux et al., 2021), stabilizing the subsequent independent component analysis decomposition, which then isolates structured artifacts (Chang et al., 2020).
Subsequently, the data were temporally high-pass filtered at 1.5 Hz, and adaptive mixture independent component analysis decomposition was applied with 10 rejection iterations and a sigma threshold of 3 (Klug, Berg, & Gramann, 2024; Klug & Gramann, 2021). Dipole fitting was performed using DipFit, and independent components were classified using ICLabel with default parameters (Pion-Tonachini, Kreutz-Delgado, & Makeig, 2019). Components identified as muscle, line noise, eye, or heart artifacts were excluded, and only those classified as brain or other with residual variance less than 15% were retained (Delorme, Palmer, Onton, Oostenveld, & Makeig, 2012). All the computed adaptive mixture independent component analysis information and dipole fitting results were then copied to the initial preprocessed, unfiltered dataset. The cleaned data were then band-pass filtered between 0.3 and 50.0 Hz, and epochs from 1 second before to 2 seconds after stimulus onset were extracted. Epochs containing artifacts larger than 100 µV were excluded, resulting in an average of 1,461 ± 90.02 epochs per subject, balanced across conditions. This final check ensured high signal quality, which has been proposed to improve model robustness (e.g., see Kessler et al., 2024; Roy et al., 2019).
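For illustration, the final band-pass filtering, epoching, and amplitude-based rejection steps could be approximated in Python with MNE-Python as sketched below. This is a minimal sketch only; the actual analyses used the MATLAB BeMoBIL/EEGLAB pipeline described above, and the file name and event annotations are hypothetical.

```python
import mne

# Hypothetical cleaned continuous recording (the study used the BeMoBIL/EEGLAB pipeline in MATLAB).
raw = mne.io.read_raw_fif("sub-01_cleaned_raw.fif", preload=True)

# Band-pass filter the cleaned data between 0.3 and 50 Hz.
raw.filter(l_freq=0.3, h_freq=50.0)

# Extract epochs from 1 s before to 2 s after stimulus onset and reject epochs
# containing deflections larger than 100 microvolts.
events, event_id = mne.events_from_annotations(raw)
epochs = mne.Epochs(raw, events, event_id=event_id, tmin=-1.0, tmax=2.0,
                    baseline=None, reject=dict(eeg=100e-6), preload=True)
print(f"{len(epochs)} epochs retained")
```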
EEGNet and Grad-CAM procedure
Data preparation
Data were exported from MATLAB to Python for subsequent analyses using an EEG-based classification framework implemented with PyTorch (Paszke et al., 2019). Analyses were performed on an NVIDIA RTX 4090 GPU with 24 GB of VRAM using CUDA 12.6. Each dataset was normalized at the subject level and balanced to ensure equal class representation (e.g., we ensured an equal number of spatial and scene memory trials for the first model). We used full EEG epochs, from image onset to 1 second post stimulus (250 time points), with baseline correction applied between −200 ms and 0 ms, following standard ERP preprocessing practices and recommendations for neural networks (Kessler et al., 2024). The right and left mastoid electrodes were removed from the dataset because they were highly artifacted for most participants.
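As a minimal sketch of these preparation steps (baseline correction over the prestimulus window, cropping to the poststimulus input window, subject-level normalization, and class balancing), one possible NumPy implementation is shown below. The array shapes, the baseline index, and the balancing helper are illustrative assumptions rather than the exact project code.

```python
import numpy as np

def prepare_subject(epochs, srate=250, onset_idx=50):
    """Baseline-correct, crop, and z-score one subject's epochs.

    epochs: array of shape (n_trials, n_channels, n_times), with stimulus onset
    at sample onset_idx (here, 200 ms of baseline at 250 Hz).
    """
    # Subtract the mean of the -200 to 0 ms window for every trial and channel.
    baseline = epochs[:, :, :onset_idx].mean(axis=2, keepdims=True)
    epochs = epochs - baseline
    # Keep the 1-s poststimulus window used as network input (251 samples including onset).
    epochs = epochs[:, :, onset_idx:onset_idx + srate + 1]
    # Normalize at the subject level (zero mean, unit variance over all values).
    return (epochs - epochs.mean()) / epochs.std()

def balance_classes(X, y, seed=0):
    """Randomly subsample every class to the size of the smallest class."""
    rng = np.random.default_rng(seed)
    n_min = min(np.sum(y == c) for c in np.unique(y))
    keep = np.concatenate([rng.choice(np.where(y == c)[0], n_min, replace=False)
                           for c in np.unique(y)])
    return X[keep], y[keep]
```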
EEGNet procedure
EEG decoding was performed using the EEGNet convolutional architecture (Lawhern et al., 2018), which efficiently combines depthwise spatial and separable temporal convolutions. Each trial (61 electrodes × 251 time points) was input to a first temporal convolution layer that extracted frequency-specific features, followed by a depthwise spatial convolution and a pointwise separable convolution. This sequence constituted ConvBlock 1, whose output was then batch-normalized (Ioffe & Szegedy, 2015), activated with an exponential linear unit (Clevert, Unterthiner, & Hochreiter, 2016), average-pooled, and regularized by two-dimensional dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014). A second, identical ConvBlock 2 further refined spatiotemporal patterns. The resulting feature maps were flattened and passed to a fully connected layer, generating one logit per class (e.g., two tasks or seven affordance categories). By alternating temporal and spatial convolutions, batch normalization, exponential linear unit activations, average pooling, and dropout, EEGNet captures both when (temporal patterns) and where (spatial patterns across electrodes) relevant EEG features occur, while keeping the parameter count low for rapid, single-trial classification (Lawhern et al., 2018). Batch normalization accelerates convergence and stabilizes training (Ioffe & Szegedy, 2015), exponential linear units mitigate vanishing gradients (Clevert et al., 2016), and channel-wise dropout prevents overfitting (Srivastava et al., 2014). To compensate for class imbalances and promote generalization, we optimized the network using a class-weighted cross-entropy loss (Cui, Jia, Lin, Song, & Belongie, 2019).
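To make this architecture concrete, the sketch below shows a minimal EEGNet-style model in PyTorch, with a temporal convolution, a depthwise spatial convolution, a separable convolution, and a final linear classifier. Filter counts, kernel sizes, and the dropout rate are illustrative placeholders, not the optimized hyperparameters used in the study.

```python
import torch
import torch.nn as nn

class EEGNetLike(nn.Module):
    """Minimal EEGNet-style classifier for inputs of shape (batch, 1, n_channels, n_times)."""

    def __init__(self, n_channels=61, n_times=251, n_classes=2,
                 f1=8, depth=2, f2=16, dropout=0.25):
        super().__init__()
        self.block1 = nn.Sequential(
            # Temporal convolution: learns frequency-specific filters.
            nn.Conv2d(1, f1, (1, 64), padding=(0, 32), bias=False),
            nn.BatchNorm2d(f1),
            # Depthwise spatial convolution across all electrodes.
            nn.Conv2d(f1, f1 * depth, (n_channels, 1), groups=f1, bias=False),
            nn.BatchNorm2d(f1 * depth),
            nn.ELU(),
            nn.AvgPool2d((1, 4)),
            nn.Dropout2d(dropout),
        )
        self.block2 = nn.Sequential(
            # Separable convolution: depthwise temporal filtering, then pointwise mixing.
            nn.Conv2d(f1 * depth, f1 * depth, (1, 16), padding=(0, 8),
                      groups=f1 * depth, bias=False),
            nn.Conv2d(f1 * depth, f2, (1, 1), bias=False),
            nn.BatchNorm2d(f2),
            nn.ELU(),
            nn.AvgPool2d((1, 8)),
            nn.Dropout2d(dropout),
        )
        # Infer the flattened feature size with a dummy forward pass.
        with torch.no_grad():
            n_features = self.block2(self.block1(
                torch.zeros(1, 1, n_channels, n_times))).numel()
        self.classifier = nn.Linear(n_features, n_classes)

    def forward(self, x):
        x = self.block2(self.block1(x))
        return self.classifier(x.flatten(start_dim=1))
```

In practice, the numbers of filters, dropout rate, and related settings were selected through the hyperparameter search described below (see Hyperparameter selection).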
Evaluation of the trained model
After training the model on 80% of the full pooled dataset (i.e., containing data from all subjects) with early stopping, performance was evaluated on the remaining 20% of the dataset using five-fold cross-validation for each individual subject. Classification accuracy was then compared with chance level using a binomial cumulative distribution approach. For each analysis, a binomial distribution was computed using MATLAB's binoinv function, taking into account the predefined significance level (α = 0.05), the number of predictions (the average number of classifications made per participant), and the number of classes (2 for task predictions, 7 for affordances, and 8 for wall color). The binomial method was used to determine statistical significance because it is comparable in reliability to permutation testing for datasets with more than 100 trials but does not have the extensive computational requirements of permutation testing (Combrisson & Jerbi, 2015; Vahid, Mückschel, Stober, Stock, & Beste, 2020). We then used 100,000 iterations of bootstrap resampling to calculate the mean difference between model accuracy and a random sample drawn from the generated binomial distribution. From the bootstrap distribution of the mean differences between model accuracy and chance, we derived a 95% confidence interval and a p value, which are presented in the Results.
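The sketch below reproduces the logic of this statistical assessment in Python, using SciPy's binom.ppf as an analog of MATLAB's binoinv and a simple bootstrap over draws from the binomial chance distribution. It is a simplified illustration; the trial count and accuracy in the example are hypothetical.

```python
import numpy as np
from scipy.stats import binom

def chance_threshold(n_trials, n_classes, alpha=0.05):
    """Accuracy that must be exceeded to beat chance at the given alpha level
    (Python analog of MATLAB's binoinv)."""
    return binom.ppf(1 - alpha, n_trials, 1.0 / n_classes) / n_trials

def bootstrap_vs_chance(accuracy, n_trials, n_classes, n_boot=100_000, seed=0):
    """Bootstrap the difference between observed accuracy and random draws from the
    binomial chance distribution; returns the 95% CI and a one-sided p value."""
    rng = np.random.default_rng(seed)
    chance_draws = rng.binomial(n_trials, 1.0 / n_classes, size=n_boot) / n_trials
    diffs = accuracy - chance_draws
    ci = np.percentile(diffs, [2.5, 97.5])
    p = (diffs <= 0).mean()
    return ci, p

# Hypothetical example: 150 test trials, 2 classes, 75% observed accuracy.
print(chance_threshold(150, 2))
print(bootstrap_vs_chance(0.75, 150, 2))
```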
Grad-CAM procedure
Finally, Grad-CAM (Selvaraju et al., 2017) was used to visualize the contributions to classification by focusing on activations from the second convolutional block. This approach was informed by previous work demonstrating that applying Grad-CAM to intermediate convolutional layers in EEGNet effectively highlights spatiotemporal patterns relevant to cognitive tasks, such as scene perception (Orima & Motoyoshi, 2023). By targeting the second convolutional block, the visualization captures mid-level features that balance spatial and temporal information, providing more interpretable insights into the neural dynamics underlying scene processing. Specifically, gradients of the predicted values were computed with respect to the feature map activations, resulting in two-dimensional localization maps (electrodes and time points). These maps were passed through rectified linear unit layers, normalized, and averaged across participants to identify the key EEG channels and time points that contributed most to the classifications, as illustrated in the Results with the relative importance score (a value normalized between 0 and 1). To ensure the robustness and generalizability of our findings, all analyses were repeated five times, and the median results were reported. Detailed information about the different model architectures and example code are available in the study's OSF repository.
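A minimal Grad-CAM computation of this kind can be written with PyTorch hooks, as sketched below for the illustrative EEGNet-style model above. The choice of hooked layer, the interpolation to the electrode-by-time grid, and the normalization are generic choices; how much of the electrode dimension is preserved at the hooked layer depends on the exact architecture.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_class, layer):
    """Compute a Grad-CAM map for one trial.

    model: trained network; x: tensor of shape (1, 1, n_channels, n_times);
    layer: the convolutional block whose activations are visualized
    (e.g., the second block of the EEGNet-style sketch above).
    """
    store = {}

    def save_activation(module, inputs, output):
        store["activation"] = output

    def save_gradient(module, grad_input, grad_output):
        store["gradient"] = grad_output[0]

    h1 = layer.register_forward_hook(save_activation)
    h2 = layer.register_full_backward_hook(save_gradient)

    model.zero_grad()
    logits = model(x)
    logits[0, target_class].backward()
    h1.remove()
    h2.remove()

    act, grad = store["activation"], store["gradient"]
    weights = grad.mean(dim=(2, 3), keepdim=True)           # pooled gradients per feature map
    cam = F.relu((weights * act).sum(dim=1, keepdim=True))   # weighted combination + ReLU
    # Upsample to the input electrode-by-time grid and normalize to [0, 1].
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.squeeze(0).squeeze(0).detach()
```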
Hyperparameter selection
In previous work using a similar architecture (Orima & Motoyoshi, 2023), the primary limitation identified was the arbitrary selection of hyperparameters. To overcome this issue, the present study adopted a more systematic approach, selecting hyperparameters using Bayesian optimization and Hyperband (Falkner, Klein, & Hutter, 2018). Using Optuna's Tree-structured Parzen Estimator sampler and Hyperband pruner (Akiba, Sano, Yanase, Ohta, & Koyama, 2019; Li, Jamieson, DeSalvo, Rostamizadeh, & Talwalkar, 2018), we explored a search space encompassing training epochs, batch sizes, learning rates, weight decays, dropout rates, the number of temporal filters in ConvBlock 1, the number of spatial filters in ConvBlocks 2 and 3, and early stopping patience. To ensure robustness, this entire search–train–test procedure was repeated five times, yielding consistent accuracy across all repetitions.
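Such a search could be set up along the following lines with Optuna. The parameter ranges and the train_and_validate helper are placeholders for illustration, not the exact search space or training code used in the study.

```python
import optuna

def train_and_validate(params, trial):
    # Placeholder: in the real pipeline this would train EEGNet with `params`,
    # report intermediate validation accuracy via trial.report(...) so the Hyperband
    # pruner can stop unpromising configurations early, and return the final accuracy.
    return 0.5

def objective(trial):
    # Hypothetical search space; parameter ranges are illustrative only.
    params = {
        "n_epochs": trial.suggest_int("n_epochs", 20, 200),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128]),
        "lr": trial.suggest_float("lr", 1e-4, 1e-2, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True),
        "dropout": trial.suggest_float("dropout", 0.1, 0.5),
        "temporal_filters": trial.suggest_int("temporal_filters", 4, 16),
        "spatial_filters": trial.suggest_int("spatial_filters", 8, 32),
        "patience": trial.suggest_int("patience", 5, 30),   # early-stopping patience
    }
    return train_and_validate(params, trial)

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=0),
    pruner=optuna.pruners.HyperbandPruner(),
)
study.optimize(objective, n_trials=100)
print(study.best_params)
```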
Results
Early EEG signal is modulated by visuospatial memory information
In the first analysis, we aimed to test whether learning contextual information (i.e., knowing that one of the opened doors hides a goal) modulates early neural markers of scene processing in a top-down manner. We also sought to determine whether this modulation is restricted to occipitoparietal regions, as suggested by Naveilhan et al. (2024), or whether it engages a broader network, including frontal regions. To investigate this, we trained a CNN on balanced data to classify the task participants were performing: either the scene memory task (before learning the goal's position) or the spatial memory task (after learning it). For each participant, we performed five-fold cross-validation and applied Grad-CAM to the second layer of the trained CNN (Figure 2) to identify the time points and electrodes most relevant for task classification. Here, we argue that, if the CNN can reliably distinguish whether participants have previously learned the goal's position, this would suggest that early neural markers of scene processing are indeed modulated by learning spatial contextual information, given that the visual stimuli remained identical across both conditions.
Figure 2.
Schematic representation of the CNN architecture and interpretability pipeline. The input consists of a 61-channel EEG epoch with a duration of 1 second. It is then processed through three convolutional blocks and a fully connected layer to predict one of three door-count categories. In the example shown here, the one-door trial in red has been misclassified, which would correspond to a model accuracy of 75%. To enhance interpretability, Grad-CAM heatmaps were used to visualize the contributions to classification by focusing on activations from the second convolutional block.
Looking at the accuracy of the model (Figure 3A), we found an average classification accuracy of 75.39 ± 7.28%. Bootstrapping analysis using the binomial distribution showed that the accuracy of the model was better than the 50% chance level (95% confidence interval [CI] of the mean difference from chance, 19.40–31.45; p < 0.0001). This result suggests that learning contextual information modified the neural correlates of scene perception, allowing the network to classify the task. To identify which time points and EEG channels contributed most, we then examined the Grad-CAM results (Figure 3B). These results highlight that the activity of a cluster of occipitoparietal electrodes (P8, POz, O1, O2, P6, PO4, PO6, PO7, PO8, and Oz) between 50 and 180 ms after stimulus presentation contributed the most to task classification, a result that may reflect differences in visual stimulus processing. The same electrodes also appeared to be involved later, between 200 and 250 ms. Finally, later activity emerged after 600 ms, which also enabled classification of the task, reflecting the onset of motor activity to produce the response.
Figure 3.
Results of the EEGNet classification and Grad-CAM analysis for the CNN trained on the entire merged dataset to classify the task being performed. (A) Violin plot showing the accuracy of the model. (B) Grad-CAM results showing the time-channel points in the second CNN layer that contributed most to the classification. The color bar represents the relative importance score (arbitrary units, normalized between 0 and 1), with the most contributive points in red. The channels are arranged sequentially from frontal to occipital regions.
In a second step, we extracted the EEG activity of participants involved in the spatial memory task (i.e., when they had to retrieve the target). We selected only trials in which participants were presented with two open doors and in which they successfully retrieved the position of the goal (93.47 ± 0.58% of trials per participant). This ensured that the network had sufficient trials and used only the EEG activity associated with goal direction information as support for the classification.
The accuracy of the model (Figure 4A) indicated that, on average, the network classified 45.41 ± 5.37% of the tested trials correctly, performing better than the 33% expected by chance (95% CI, 3.74–13.63; p = 0.0004). The Grad-CAM results (Figure 4B) highlighted activity in frontal electrodes between 180 and 200 ms, as well as activity in occipitoparietal electrodes between 200 and 220 ms, as the most contributive to goal position identification. These results suggest that early EEG activity also contains information regarding the position of the goal, irrespective of the visual features of the scene. This interpretation was further strengthened when we performed a similar analysis to detect the position of the goal, this time in the three-affordance condition. Here, we again found that the model accurately classified the position of the goal (mean classification accuracy = 48.67 ± 6.54%; 95% CI, 6.95–16.89; p < 0.0001), despite the fact that there was no visual information in the presented scene that would allow the conditions to be distinguished.
Figure 4.
Outcomes of the EEGNet classification and Grad-CAM analysis for the CNN trained on the two-affordance condition data to classify the goal position. (A) Violin plot displaying the model's accuracy. (B) Grad-CAM results indicating the time-channel points in the second CNN layer, with those making the highest contribution to the classification presented in red.
Prior spatial knowledge modulates early neural activity related to affordance processing
In this second analysis, we delve further into the previous results suggesting a modulation of early neural markers of scene processing following the learning of contextual information. First, we used EEG activity from participants performing the scene memory task and trained a model to detect affordances within the scene (i.e., the number and position of the open doors). Based on previous findings (Harel et al., 2022), we hypothesized that the model would perform above the chance level, confirming that navigational affordances are extracted automatically from visual scenes (Bonner & Epstein, 2017). Once the model was trained, we then used it to test our main hypothesis: that the neural correlates of affordance extraction are modified after participants learn that one of the doors leads to a goal. To test this, we applied this model to EEG data from the spatial memory task, where participants had learned the location of the goal. We argue that, if the model fails to generalize to this task, it would suggest that the information supporting decoding was modulated by learned spatial contextual information, providing further evidence for top-down modulation.
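Schematically, this generalization test amounts to training on affordance labels from one task and evaluating the frozen model on both tasks, as in the toy sketch below. The random tensors and the stand-in linear model are placeholders for the real EEG epochs and the trained EEGNet.

```python
import torch
import torch.nn as nn

def evaluate(model, X, y):
    """Classification accuracy of a trained model on held-out epochs."""
    model.eval()
    with torch.no_grad():
        preds = model(X).argmax(dim=1)
    return (preds == y).float().mean().item()

# Toy stand-ins for the real data: epochs of shape (n_trials, 1, 61 channels, 251 samples)
# with 7 affordance-configuration labels; in the actual analysis these hold EEG from the
# scene memory task (training/testing) and the spatial memory task (generalization).
X_scene_test = torch.randn(70, 1, 61, 251)
y_scene_test = torch.randint(0, 7, (70,))
X_spatial = torch.randn(70, 1, 61, 251)
y_spatial = torch.randint(0, 7, (70,))

model = nn.Sequential(nn.Flatten(), nn.Linear(61 * 251, 7))  # stand-in for the trained EEGNet

within_task = evaluate(model, X_scene_test, y_scene_test)  # in the study, above the 1/7 chance level
cross_task = evaluate(model, X_spatial, y_spatial)         # in the study, at chance after spatial learning
print(f"scene->scene: {within_task:.3f}, scene->spatial: {cross_task:.3f}")
```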
The model tested on the scene memory task (Figure 5A) showed an accuracy of 35.80 ± 10.45%, and bootstrap testing indicated that it performed better than the 14.29% chance level (95% CI, 10.88–23.68; p < 0.0001). However, when we tested the model on EEG data from the spatial memory task to assess the generalization of the model trained on the scene memory task (Figure 5C), the classification accuracy dropped to 13.59 ± 1.78% and was no longer higher than chance level (95% CI, −7.41 to 3.11; p = 0.77). These results suggest that the neural activity enabling the classification of the position of affordances is modulated by spatial learning. The results of the Grad-CAM analysis (Figure 5B) indicated that the most contributive time points for the classification were once again located in the occipitoparietal region at approximately 200 ms, as previously suggested by modulation of the P2 component in ERP analyses (Harel et al., 2022). Interestingly, earlier occipitoparietal activity (50–100 ms) and frontal activity (150–180 ms) were also involved to a lesser extent in the classification, replicating previous findings on the neural correlates of low-level visual feature processing, such as the openness of visual scenes (Greene & Oliva, 2009; Hansen, Noesen, Nador, & Harel, 2018; Orima & Motoyoshi, 2023). In a supplementary analysis (see Supplementary Figure S1A), we tested the corollary of this approach: we trained the model to detect the number and position of open doors from the spatial memory task. When tested on the remaining spatial memory EEG data, it performed better than chance (mean accuracy, 29.18 ± 6.19%, significantly above the 12.5% chance level; 95% CI, 1.25–20.16; p = 0.014). However, when tested on the scene memory task, the model was no longer able to detect affordances (mean accuracy, 13.85 ± 2.40%; p = 0.942). Finally, we compared the classification accuracies of both models (i.e., the one trained and tested on spatial memory and the one trained and tested on scene memory) to determine whether affordances were represented more distinctly when participants had no prior knowledge of the position of the goal (Supplementary Figure S1B). The second model outperformed the first, t(58) = 3.16, p < 0.001, 95% CI, 2.54–11.29, suggesting that affordances are represented more distinctly when participants have no prior information about the goal location. This finding aligns with the idea that participants may process affordances differently depending on task demands, emphasizing the task-relevant affordance (i.e., the one hiding the goal) when required to recall goal positions.
Figure 5.
Results of the EEGNet classification and Grad-CAM analysis for the CNN trained on the data of participants performing the scene memory task to detect the position of the affordances. (A) Violin plot showing the accuracy of the model during the scene memory task. (B) Grad-CAM results highlighting the time-channel points in the second CNN layer that contribute most to the classification; the most contributive points are in red. (C) Violin plot of the accuracy for the model tested on the data of participants performing the spatial memory task.
Finally, we aimed to test the specificity of the modulation by contextual information. We hypothesized that contextual information about goal position may selectively influence the extraction of scene features related to this information (i.e., navigational affordances), but not other, unrelated features (e.g., the color of the wall). To this end, we conducted a supplementary analysis using a CNN trained to classify the color of the wall, a scene feature present in both tasks. Similar to the previous analyses, we trained a CNN on EEG data from participants performing the scene memory task and then tested it on unseen EEG data from the same task, as well as on data from the spatial memory task, to assess the generalization of the information supporting classification (Figure 6). For the scene memory task, the CNN achieved an average accuracy of 34.34 ± 11.18%, significantly above the 12.5% chance level (95% CI, 13.75–25.87; p < 0.0001). However, this time, when tested on EEG data from the spatial memory task, the accuracy averaged 18.75 ± 2.09%, which was still statistically above chance (95% CI, 0.31–9.65; p = 0.019). These results suggest that, unlike affordance detection, the EEG information supporting the classification of wall color generalizes to some extent across tasks. This finding thus supports the interpretation that contextual information specifically modulates the neural correlates of the extraction of scene features associated with that information. This control analysis is particularly important because our earlier interpretation attributed the absence of generalization to a modulation of neural activity, which could in principle have reflected other external factors. Although the ability of the wall color model to generalize reduces these concerns, a cautious interpretation of the results is still warranted.
Figure 6.
Accuracy of the CNN trained on data from the scene memory task to detect the color of the wall (chance level, 12.5%), when tested on the scene memory or the spatial memory task.
Discussion
In this study, we investigated the temporal neural dynamics underlying visual perception during scene processing and its interaction with visuospatial memory information. Our results demonstrate that a CNN trained on EEG data can accurately classify several aspects of visual perception, including the color of the wall within the scene, navigational affordances, and the location of a previously learned target, primarily based on early occipitoparietal brain activity (between 100 and 230 ms). To further support the potential influence of top-down information suggested by this common time window of integration, we showed that a model trained on EEG data to decode affordance-related information in a scene memory task failed to generalize to the task in which participants had learned the position of a goal. In contrast, another model trained to detect wall color performed significantly above chance in the same context. Taken together, these results highlight the strong influence of top-down information related to prior spatial knowledge on early occipitoparietal neural activity during scene perception.
Navigational affordances and visuospatial memory processing share a common early integration time window
The Grad-CAM analyses highlighted occipitoparietal activity at approximately 200 ms as the most important contributor to the classification of navigational affordances within the scene, consistent with previous findings by Harel et al. (2022). In their study, the authors also used a multivariate approach and suggested that the location of affordances may be represented even earlier, at approximately 130 ms. In the present study, we did not distinguish between the position and number of affordances, but found that activity at approximately both 100 ms and 200 ms contributed to the classification of affordances, with the latter time window playing a more important role. This early extraction of the position of available pathways for movement may reflect the brain's ability to rapidly construct high-level representations of visual scenes based on categorical distinctions and global spatial properties (Orima & Motoyoshi, 2021, 2023). In support of this point, studies combining magnetoencephalography measurements with deep neural network analyses have demonstrated a hierarchical progression of scene representations, integrating information from low-level features to higher-order scene features within 200 ms (Cichy, Khosla, Pantazis, & Oliva, 2017). These findings suggest that the processing of affordances from visual scenes occurs as early as other perceptual processes, bridging the gap between scene perception (e.g., encoding the number of affordances) and action (Djebbara et al., 2019; Harel et al., 2022).
Interestingly, our results suggest that higher-level information, such as spatial contextual information about the target location, is also represented in this early occipitoparietal activity. This common integration time window is consistent with several lines of evidence for top-down influences on early visual processing during scene and object perception (Enge et al., 2023; Klink, Kaiser, Stecher, Ambrus, & Kovács, 2023; Steel, Billings, Silson, & Robertson, 2021; Steel, Garcia, Goyal, Mynick, & Robertson, 2023; Steel, Silson, Garcia, & Robertson, 2024). For example, Enge et al. (2023) demonstrated that prior knowledge of an object's function shapes early cortical markers at approximately 200 ms and influences higher-order visual perception (Greene & Hansen, 2020; Groen et al., 2018). Similarly, Klink et al. (2023) reported that scene familiarity modulates early EEG activity, suggesting a role for learned scene context in shaping early visual processing. The current results support this interpretation by highlighting the critical role of visuospatial memory-derived information in shaping early visual markers of scene processing. Furthermore, these results offer novel insights into a common early time window for integrating both bottom-up and top-down information, emphasizing the dynamic interplay between scene perception and prior visuospatial knowledge. To extend these findings, future studies could examine whether similar modulation occurs when participants repeat the same task before and after learning contextual information, for example, in a scene memory task where the goal location is learned but not retrieved. However, such a design would make it impossible to assess whether the information was successfully encoded, because participants would not be tested on this knowledge, which is why we opted for the current protocol.
Visuospatial memory immediately influences the early neural activity of affordance extraction
To investigate the influence of top-down modulation on early neural correlates of scene perception, we examined how prior learning of the target location affects the ability of the CNN to decode affordance information. To test this, we trained a model that accurately decoded affordances in the scene memory task, confirming that the extraction of navigational affordances is represented in early occipitoparietal activity (Dwivedi et al., 2024; Harel et al., 2022). However, this model failed to detect affordances after participants learned the path to the goal in the spatial memory task, even though the presented images were exactly the same. This finding suggests that the neural correlates of navigational affordances, including their number and location, are modulated by the learning of spatial contextual information about these affordances. In a previous analysis of the same dataset, we reported at the behavioral level that increasing the number of affordances decreased participants' accuracy in the spatial memory task only (Naveilhan et al., 2024). We interpreted these results as the consequence of information related to prior spatial knowledge interacting with the automatic encoding of navigational affordances. The present neural results strengthen this interpretation, revealing a modulation of the early neural markers of affordance extraction once contextual information is learned. Critically, this effect cannot be attributed merely to task differences that would hinder the model's ability to generalize, as demonstrated by a control analysis showing significant generalization between scene memory and spatial memory tasks for a model similarly trained to decode wall colors. This lack of modulation observed for wall color features is consistent with previous findings by Hansen et al. (2018), which showed that processing of low-level visual information is rapid and minimally affected by observer-based goals. Thus, our results indicate a modulation of neural activity that is specific to scene features associated with goal location. Importantly, this effect emerges rapidly, suggesting that such modulation does not require extensive learning. This directly extends the findings of Enge et al. (2023), who proposed that semantic knowledge about objects (i.e., knowing how to use an object) influences the early neural markers of their visual processing. Our results extend this idea to scene perception, where knowing that a goal is located behind one of the open doors modulates early neural markers of scene processing. These results also extend the principle of top-down facilitation, originally described in object recognition. For example, Bar et al. (2006) showed that coarse visual information triggers early orbitofrontal cortex activity approximately 50 ms before recognition-related responses in high-level visual areas, thereby facilitating subsequent perceptual analysis. By analogy, knowing that a goal lies behind one of the doors likely initiates frontal feedback that could prime occipitoparietal circuits to extract the task-relevant navigational affordance (see Figure 4 for this temporal pattern), highlighting the need to consider a more integrated, action-oriented approach to visual perception (Nau et al., 2024).
Perspective and limitations
These findings on the interaction between scene feature integration and prior spatial information during navigational affordance processing illustrate the dynamic and complex nature of visual scene perception. From a broader perspective, this is consistent with the predictive coding framework for efficient perception (Peelen, Berlot, & de Lange, 2024; Rao & Ballard, 1999) and the ecological basis of affordances (Gibson, 1979). This notion emphasizes the intrinsic connection between action and perception, highlighting the role of perception in guiding action, and is consistent with the notion of active inference, which views action and perception as integrated processes that work to minimize prediction error (Friston, 2010; Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2017; Kaplan & Friston, 2018; Pezzulo & Cisek, 2016). Consistent with this theoretical framework, our results suggest that spatial memory content modulates the early neural correlates of navigational affordance processing to support efficient, goal-directed visual information processing.
Despite these interesting results, it is important to acknowledge some limitations related to the methodology used here. Although we addressed most of the limitations raised in previous similar work (i.e., using an unbiased automatic selection of hyperparameters and testing the accuracy of our models against chance level), certain methodological points merit further evaluation. In particular, the use of a fixed convolutional architecture could be enhanced by implementing nested cross-validation to optimize a deep ConvNet model, such as that proposed by Schirrmeister et al. (2017). This would allow for the adaptive tuning of temporal and spatial filter parameters, which can significantly influence the timing and spatial distribution of Grad-CAM activations. Additionally, gradient-based attribution methods often emphasize high-frequency components, potentially leading to misleading interpretations (Simonyan, Vedaldi, & Zisserman, 2014; Sundararajan, Taly, & Yan, 2017). To mitigate this, EEG analyses should maintain realistic electrode topographies rather than treating channels as independent features. Further research could also leverage integrated tools such as the newly developed MEEGNet (Arthur, Annalisa, Harel, Irina, & Karim, 2025), which merges CNNs with Grad-CAM for seamless magnetoencephalography/EEG decoding, enhancing workflow efficiency and providing transparent analysis from EEG data preprocessing to interpretability.
Finally, it is important to consider alternative explanations for the modulation of affordance processing. The first concerns potential task-related differences, such as variations in task difficulty or cognitive load between the spatial and scene memory tasks. However, control analyses of theta activity over frontomedial electrodes, a well-established marker of cognitive load (Cavanagh & Frank, 2014), help to mitigate most concerns related to this issue (see Supplementary Figure S2). This interpretation is further supported by the fact that the task-classification model can also determine whether participants successfully learned the spatial information (see Supplementary Figure S3), suggesting that the features supporting the classification are indeed related to the acquisition of this information. A second explanation arises from the findings of Castelhano and Witherspoon (2016), who demonstrated that knowledge of an object's function significantly increases search efficiency by directing attention to functionally relevant areas within a scene (Bouwkamp, de Lange, & Spaak, 2024). In our paradigm, participants may have explored the visual scene differently and possibly directed their attention toward the task-relevant affordance (i.e., the open door hiding the goal to retrieve). Even though participants in our design had to focus on the walls to identify their color to perform the task, future studies incorporating eye tracking could help to disentangle these possibilities and clarify this potential confound.
Conclusions
In conclusion, this work supports the notion that, during scene perception, neural processes associated with behaviorally relevant tasks, such as retrieving the position of a goal hidden behind a door, share a common time window of integration, approximately 100 to 250 ms over occipitoparietal regions, with processes representing visual information, such as navigational affordances. The present results also demonstrate an immediate influence of the newly learned spatial knowledge on the early neural activity associated with scene processing, modulating even the first stages of visual scene perception. Thus, knowing where to go may shape what you see.
Data availability
The complete set of data, stimuli generated and code used for preprocessing, and CNN are available on the OSF repository: https://osf.io/5cqxy/?view_only=2957dac1a1304c2db6e1fb3b056c5008.
Supplementary Material
Acknowledgments
This research was made possible by the generous participation of volunteer participants, to whom the authors are sincerely grateful. We also thank Catherine Buchanan for her careful reading of the manuscript and her feedback.
Supported by the French government through the France 2030 investment plan managed by the National Research Agency (ANR), as part of the Initiative of Excellence Université Côte d'Azur under reference number ANR-15-IDEX-01 and, in particular, by the interdisciplinary Institute for Modeling in Neuroscience and Cognition (NeuroMod) of Université Côte d'Azur.
Author contributions: C.N., Conceptualization; Formal analysis; Investigation; Methodology; Writing—Original draft; Writing—Review & editing. R.Z., Funding acquisition; Writing—Review & editing. S.R., Conceptualization; Funding Acquisition; Methodology; Project administration; Supervision, Writing—Review & editing.
Commercial relationships: none.
Corresponding author: Clément Naveilhan.
Email: clement.naveilhan@univ-cotedazur.fr.
Address: Université Côte d'Azur, LAMHESS, Nice 06100, France.
References
- Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–2631, 10.1145/3292500.3330701. [DOI]
- Aminoff, E. M., & Tarr, M. J. (2021). Functional context affects scene processing. Journal of Cognitive Neuroscience, 33(5), 933–945, 10.1162/jocn_a_01694. [DOI] [PubMed] [Google Scholar]
- Arthur, D., Annalisa, P., Harel, Y., Irina, R., & Karim, J. (2025). MEEGNet: An open source python library for the application of convolutional neural networks to MEG (p. 2025.03.20.644276). bioRxiv, 10.1101/2025.03.20.644276. [DOI]
- Bar, M. (2009). The proactive brain: Memory for predictions. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1521), 1235–1243, 10.1098/rstb.2008.0310. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bar, M., Kassam, K. S., Ghuman, A. S., Boshyan, J., Schmid, A. M., Dale, A. M., ... Halgren, E. (2006). Top-down facilitation of visual recognition. Proceedings of the National Academy of Sciences of the United States of America, 103(2), 449–454, 10.1073/pnas.0507062103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bartnik, C. G., & Groen, I. I. A. (2023). Visual perception in the human brain: How the brain perceives and understands real-world scenes. In: Oxford Research Encyclopedia of Neuroscience. Retrieved from https://oxfordre.com/neuroscience/view/10.1093/acrefore/9780190264086.001.0001/acrefore-9780190264086-e-437, 10.1093/acrefore/9780190264086.013.437. [DOI] [Google Scholar]
- Bartnik, C. G., Vukšić, N., Bommer, S., & Groen, I. I. A. (2024). Unveiling task-dependent action affordance representations: Insights from scene-selective cortex and deep neural networks. Journal of Vision, 24(10), 897, 10.1167/jov.24.10.897. [DOI] [Google Scholar]
- Bonner, M. F., & Epstein, R. A. (2017). Coding of navigational affordances in the human visual system. Proceedings of the National Academy of Sciences of the United States of America, 114, 4793–4798, 10.1073/pnas.1618228114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bouwkamp, F. G., de Lange, F. P., & Spaak, E. (2024). Spatial predictive context speeds up visual search by biasing local attentional competition. Journal of Cognitive Neuroscience, 37, 1–15, 10.1162/jocn_a_02254. [DOI] [PubMed] [Google Scholar]
- Castelhano, M., & Witherspoon, R. (2016). How you use it matters: Object function guides attention during visual search in scenes. Psychological Science, 27, 606–621, 10.1177/0956797616629130. [DOI] [PubMed] [Google Scholar]
- Cavanagh, J. F., & Frank, M. J. (2014). Frontal theta as a mechanism for cognitive control. Trends in Cognitive Sciences, 18(8), 414–421, 10.1016/j.tics.2014.04.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chang, C.-Y., Hsu, S.-H., Pion-Tonachini, L., & Jung, T.-P. (2020). Evaluation of artifact subspace reconstruction for automatic artifact components removal in multi-channel EEG recordings. IEEE Transactions on Biomedical Engineering, 67(4), 1114–1121. IEEE Transactions on Biomedical Engineering, 10.1109/TBME.2019.2930186. [DOI] [PubMed] [Google Scholar]
- Choi, B., McCloskey, M., & Park, S. (2020). Representing navigational affordance based on high-level knowledge of scenes. Journal of Vision, 20(11), 646, 10.1167/jov.20.11.646.
- Cichy, R. M., Khosla, A., Pantazis, D., & Oliva, A. (2017). Dynamics of scene representations in the human brain revealed by magnetoencephalography and deep neural networks. NeuroImage, 153, 346–358, 10.1016/j.neuroimage.2016.03.063.
- Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2016). Fast and accurate deep network learning by exponential linear units (ELUs) (arXiv:1511.07289). arXiv, 10.48550/arXiv.1511.07289.
- Combrisson, E., & Jerbi, K. (2015). Exceeding chance level by chance: The caveat of theoretical chance levels in brain signal classification and statistical assessment of decoding accuracy. Journal of Neuroscience Methods, 250, 126–136, 10.1016/j.jneumeth.2015.01.010.
- Cui, Y., Jia, M., Lin, T.-Y., Song, Y., & Belongie, S. (2019). Class-balanced loss based on effective number of samples. arXiv:1901.05555, 10.48550/arXiv.1901.05555.
- Delaux, A., de Saint Aubert, J.-B., Ramanoël, S., Bécu, M., Gehrke, L., Klug, M., ... Arleo, A. (2021). Mobile brain/body imaging of landmark-based navigation with high-density EEG. European Journal of Neuroscience, 54(12), 8256–8282, 10.1111/ejn.15190.
- Delorme, A., & Makeig, S. (2004). EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods, 134(1), 9–21, 10.1016/j.jneumeth.2003.10.009.
- Delorme, A., Palmer, J., Onton, J., Oostenveld, R., & Makeig, S. (2012). Independent EEG sources are dipolar. PLoS One, 7(2), e30135, 10.1371/journal.pone.0030135.
- Djebbara, Z., Fich, L. B., Petrini, L., & Gramann, K. (2019). Sensorimotor brain dynamics reflect architectural affordances. Proceedings of the National Academy of Sciences of the United States of America, 116(29), 14769–14778, 10.1073/pnas.1900648116.
- Dwivedi, K., Sadiya, S., Balode, M. P., Roig, G., & Cichy, R. M. (2024). Visual features are processed before navigational affordances in the human brain. Scientific Reports, 14, 5573, 10.1038/s41598-024-55652-y.
- Enge, A., Süß, F., & Rahman, R. A. (2023). Instant effects of semantic information on visual perception. Journal of Neuroscience, 43(26), 4896–4906, 10.1523/JNEUROSCI.2038-22.2023.
- Epstein, R. A., & Baker, C. I. (2019). Scene perception in the human brain. Annual Review of Vision Science, 5, 373–397, 10.1146/annurev-vision-091718-014809.
- Falkner, S., Klein, A., & Hutter, F. (2018). BOHB: Robust and efficient hyperparameter optimization at scale. Proceedings of the 35th International Conference on Machine Learning, 1437–1446. https://proceedings.mlr.press/v80/falkner18a.html.
- Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191, 10.3758/BF03193146.
- Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 2, 10.1038/nrn2787.
- Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., & Pezzulo, G. (2017). Active inference: A process theory. Neural Computation, 29(1), 1–49, 10.1162/NECO_a_00912.
- Gibson, J. J. (1979). The Ecological Approach to Visual Perception: Classic Edition. Hove, UK: Psychology Press, 10.4324/9781315740218.
- Greene, M. R., & Hansen, B. C. (2020). Disentangling the independent contributions of visual and conceptual features to the spatiotemporal dynamics of scene categorization. Journal of Neuroscience, 40(27), 5283–5299, 10.1523/JNEUROSCI.2088-19.2020.
- Greene, M. R., & Oliva, A. (2009). Recognition of natural scenes from global properties: Seeing the forest without representing the trees. Cognitive Psychology, 58(2), 137–176, 10.1016/j.cogpsych.2008.06.001.
- Gregorians, L., & Spiers, H. J. (2022). Affordances for spatial navigation. In: Djebbara Z. (ed.), Affordances in Everyday Life: A Multidisciplinary Collection of Essays (pp. 99–112). New York: Springer International Publishing, 10.1007/978-3-031-08629-8_10.
- Groen, I. I. A., Jahfari, S., Seijdel, N., Ghebreab, S., Lamme, V. A. F., & Scholte, H. S. (2018). Scene complexity modulates degree of feedback activity during object detection in natural scenes. PLoS Computational Biology, 14(12), e1006690, 10.1371/journal.pcbi.1006690.
- Groen, I. I. A., Silson, E. H., & Baker, C. I. (2017). Contributions of low- and high-level properties to neural processing of visual scenes in the human brain. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1714), 20160102, 10.1098/rstb.2016.0102.
- Hansen, N. E., Noesen, B. T., Nador, J. D., & Harel, A. (2018). The influence of behavioral relevance on the processing of global scene properties: An ERP study. Neuropsychologia, 114, 168–180, 10.1016/j.neuropsychologia.2018.04.040.
- Harel, A., Groen, I. I. A., Kravitz, D. J., Deouell, L. Y., & Baker, C. I. (2016). The temporal dynamics of scene processing: A multi-faceted EEG investigation. eNeuro, 3, ENEURO.0139-16.2016, 10.1523/ENEURO.0139-16.2016.
- Harel, A., Nador, J. D., Bonner, M. F., & Epstein, R. A. (2022). Early electrophysiological markers of navigational affordances in scenes. Journal of Cognitive Neuroscience, 34(3), 397–410, 10.1162/jocn_a_01810.
- Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift (arXiv:1502.03167). arXiv. http://arxiv.org/abs/1502.03167.
- Kaiser, D., Häberle, G., & Cichy, R. M. (2020). Cortical sensitivity to natural scene structure. Human Brain Mapping, 41(5), 1286–1295, 10.1002/hbm.24875.
- Kamps, F. S., Chen, E. M., Kanwisher, N., & Saxe, R. (2024). Representation of navigational affordances and ego-motion in the occipital place area. Imaging Neuroscience, 10.1162/imag_a_00424.
- Kaplan, R., & Friston, K. J. (2018). Planning and navigation as active inference. Biological Cybernetics, 112(4), 323–343, 10.1007/s00422-018-0753-2.
- Kay, K., Bonnen, K., Denison, R. N., Arcaro, M. J., & Barack, D. L. (2023). Tasks and their role in visual neuroscience. Neuron, 111(11), 1697–1713, 10.1016/j.neuron.2023.03.022.
- Kessler, R., Enge, A., & Skeide, M. A. (2024). How EEG preprocessing shapes decoding performance. arXiv:2410.14453, 10.48550/arXiv.2410.14453.
- Klink, H., Kaiser, D., Stecher, R., Ambrus, G. G., & Kovács, G. (2023). Your place or mine? The neural dynamics of personally familiar scene recognition suggests category independent familiarity encoding. Cerebral Cortex, 33(24), 11634–11645, 10.1093/cercor/bhad397.
- Klug, M., Berg, T., & Gramann, K. (2024). Optimizing EEG ICA decomposition with data cleaning in stationary and mobile experiments. Scientific Reports, 14(1), 14119, 10.1038/s41598-024-64919-3.
- Klug, M., & Gramann, K. (2021). Identifying key factors for improving ICA-based decomposition of EEG data in mobile and stationary experiments. European Journal of Neuroscience, 54(12), 8406–8420, 10.1111/ejn.14992.
- Klug, M., Jeung, S., Wunderlich, A., Gehrke, L., Protzak, J., Djebbara, Z., ... Gramann, K. (2022). The BeMoBIL Pipeline for automated analyses of multimodal mobile brain and body imaging data. bioRxiv, 2022.09.29.510051, 10.1101/2022.09.29.510051.
- Klug, M., & Kloosterman, N. A. (2022). Zapline-plus: A Zapline extension for automatic and adaptive removal of frequency-specific noise artifacts in M/EEG. Human Brain Mapping, 43(9), 2743–2758, 10.1002/hbm.25832.
- Kothe, C., Shirazi, S., Stenner, T., Medine, D., Boulay, C., Crivich, M., ... Makeig, S. (2024). The Lab Streaming Layer for synchronized multimodal recording. bioRxiv, 10.1101/2024.02.13.580071.
- Kravitz, D. J., Saleem, K. S., Baker, C. I., & Mishkin, M. (2011). A new neural framework for visuospatial processing. Nature Reviews Neuroscience, 12(4), 4, 10.1038/nrn3008.
- Lawhern, V. J., Solon, A. J., Waytowich, N. R., Gordon, S. M., Hung, C. P., & Lance, B. J. (2018). EEGNet: A compact convolutional network for EEG-based brain-computer interfaces. Journal of Neural Engineering, 15(5), 056013, 10.1088/1741-2552/aace8c.
- Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18, 1–52.
- Malcolm, G. L., Groen, I. I. A., & Baker, C. I. (2016). Making sense of real-world scenes. Trends in Cognitive Sciences, 20(11), 843–856, 10.1016/j.tics.2016.09.003.
- Nau, M., Schmid, A. C., Kaplan, S. M., Baker, C. I., & Kravitz, D. J. (2024). Centering cognitive neuroscience on task demands and generalization. Nature Neuroscience, 27(9), 1656–1667, 10.1038/s41593-024-01711-6.
- Naveilhan, C., Delaux, A., Durteste, M., Lebrun, J., Zory, R., Arleo, A., ... Ramanoël, S. (2025). EXPRESS: Age-related differences in electrophysiological correlates of visuospatial reorientation. Quarterly Journal of Experimental Psychology, 17470218251369786, 10.1177/17470218251369786.
- Naveilhan, C., Saulay-Carret, M., Zory, R., & Ramanoël, S. (2024). Spatial contextual information modulates affordance processing and early electrophysiological markers of scene perception. Journal of Cognitive Neuroscience, 36(10), 2084–2099, 10.1162/jocn_a_02223.
- Orima, T., & Motoyoshi, I. (2021). Analysis and synthesis of natural texture perception from visual evoked potentials. Frontiers in Neuroscience, 15, 698940, 10.3389/fnins.2021.698940.
- Orima, T., & Motoyoshi, I. (2023). Spatiotemporal cortical dynamics for visual scene processing as revealed by EEG decoding. Frontiers in Neuroscience, 17, 1167719, 10.3389/fnins.2023.1167719.
- Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32. Retrieved from https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
- Peelen, M. V., Berlot, E., & de Lange, F. P. (2024). Predictive processing of scenes and objects. Nature Reviews Psychology, 3(1), 1, 10.1038/s44159-023-00254-0.
- Pezzulo, G., & Cisek, P. (2016). Navigating the affordance landscape: Feedback control as a process model of behavior and cognition. Trends in Cognitive Sciences, 20(6), 414–424, 10.1016/j.tics.2016.03.013.
- Pion-Tonachini, L., Kreutz-Delgado, K., & Makeig, S. (2019). ICLabel: An automated electroencephalographic independent component classifier, dataset, and website. NeuroImage, 198, 181–197, 10.1016/j.neuroimage.2019.05.026.
- Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87, 10.1038/4580.
- Ritchie, J., Wardle, S., Vaziri-Pashkam, M., Kravitz, D., & Baker, C. (2024). Rethinking category-selectivity in human visual cortex. arXiv:2411.08251, 10.48550/arXiv.2411.08251.
- Roy, Y., Banville, H., Albuquerque, I., Gramfort, A., Falk, T. H., & Faubert, J. (2019). Deep learning-based electroencephalography analysis: A systematic review. Journal of Neural Engineering, 16(5), 051001, 10.1088/1741-2552/ab260c.
- Schirrmeister, R. T., Springenberg, J. T., Fiederer, L. D. J., Glasstetter, M., Eggensperger, K., Tangermann, M., ... Ball, T. (2017). Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11), 5391–5420, 10.1002/hbm.23730.
- Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. 2017 IEEE International Conference on Computer Vision (ICCV), 618–626, 10.1109/ICCV.2017.74.
- Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps (arXiv:1312.6034). arXiv, 10.48550/arXiv.1312.6034.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
- Steel, A., Billings, M. M., Silson, E. H., & Robertson, C. E. (2021). A network linking scene perception and spatial memory systems in posterior cerebral cortex. Nature Communications, 12(1), 1, 10.1038/s41467-021-22848-z.
- Steel, A., Garcia, B. D., Goyal, K., Mynick, A., & Robertson, C. E. (2023). Scene perception and visuospatial memory converge at the anterior edge of visually responsive cortex. Journal of Neuroscience, 43(31), 5723–5737, 10.1523/JNEUROSCI.2043-22.2023.
- Steel, A., Silson, E. H., Garcia, B. D., & Robertson, C. E. (2024). A retinotopic code structures the interaction between perception and memory systems. Nature Neuroscience, 27(2), 339–347, 10.1038/s41593-023-01512-3.
- Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic attribution for deep networks. Proceedings of the 34th International Conference on Machine Learning, 3319–3328. Retrieved from https://proceedings.mlr.press/v70/sundararajan17a.html.
- Vahid, A., Mückschel, M., Stober, S., Stock, A.-K., & Beste, C. (2020). Applying deep learning to single-trial EEG data provides evidence for complementary theories on action control. Communications Biology, 3(1), 1–11, 10.1038/s42003-020-0846-z.
Data Availability Statement
The complete dataset, the generated stimuli, and the code used for preprocessing and the CNN analyses are available in the OSF repository: https://osf.io/5cqxy/?view_only=2957dac1a1304c2db6e1fb3b056c5008.