eLife. 2019 Oct 9;8:e48182. doi: 10.7554/eLife.48182

A neural mechanism for contextualizing fragmented inputs during naturalistic vision

Daniel Kaiser 1,2, Jacopo Turini 2,3, Radoslaw M Cichy 2,4,5
Editors: Huan Luo6, Joshua I Gold7
PMCID: PMC6802952  PMID: 31596234

Abstract

With every glimpse of our eyes, we sample only a small and incomplete fragment of the visual world, which needs to be contextualized and integrated into a coherent scene representation. Here we show that the visual system achieves this contextualization by exploiting spatial schemata, that is, our knowledge about the composition of natural scenes. We measured fMRI and EEG responses to incomplete scene fragments and used representational similarity analysis to reconstruct their cortical representations in space and time. We observed a sorting of representations according to the fragments’ place within the scene schema, which occurred during perceptual analysis in the occipital place area and within the first 200 ms of vision. This schema-based coding operates flexibly across visual features (as measured by a deep neural network model) and different types of environments (indoor and outdoor scenes). This flexibility highlights the mechanism’s ability to efficiently organize incoming information under dynamic real-world conditions.

Research organism: Human

Introduction

During natural vision, the brain continuously receives incomplete fragments of information that need to be integrated into meaningful scene representations. Here, we propose that this integration is achieved through contextualization: the brain uses prior knowledge about where information typically appears in a scene to meaningfully sort incoming information.

A format in which such prior knowledge about the world is represented in the brain is provided by schemata. First introduced in philosophy to explain how prior knowledge enables perception of the world (Kant, 1781), schemata were later adapted by psychology (Bartlett, 1932; Piaget, 1926) and computer science (Minsky, 1975; Rumelhart, 1980) as a means to formalize mechanisms enabling natural and artificial intelligence, respectively.

In the narrower context of natural vision, scene schemata represent knowledge about the typical composition of real-world environments (Mandler, 1984). For example, scene schemata entail knowledge about the distribution of objects across scenes: objects appear in particular locations within the scene and in particular positions relative to other objects (Kaiser et al., 2019a; Torralba et al., 2006; Võ et al., 2019; Wolfe et al., 2011).

The beneficial role of such scene schemata was first investigated in empirical studies of human memory, where memory performance is boosted when scenes are configured in accordance with the schema (Brewer and Treyens, 1981; Mandler and Johnson, 1976; Mandler and Parker, 1976).

Recently however, it has become clear that scene schemata not only organize memory contents, but also the contents of perception. For example, knowledge about the structure of the world can be used to generate predictions about a scene’s content (Bar, 2009; Henderson, 2017), or to efficiently organize the concurrent representation of multiple scene elements (Kaiser et al., 2014; Kaiser et al., 2019b). This position is reinforced by behavioral studies demonstrating a beneficial role of schema-congruent naturalistic stimuli across a variety of perceptual tasks, such as visual detection (Biederman et al., 1982; Davenport and Potter, 2004; Stein et al., 2015) and visual search (Kaiser et al., 2014; Torralba et al., 2006; Võ et al., 2019).

Here, we put forward a novel function of scene schemata in visual processing: they support the contextualization of fragmented sensory inputs. If sensory inputs are indeed processed in relation to the schema context, scene fragments stemming from similar typical positions within the scene should be processed similarly and fragments stemming from different positions should be processed differently. Therefore, the neural representations of scene fragments should be sorted according to their typical place within the scene.

We tested two hypotheses about this sorting process. First, we hypothesized that this sorting occurs during perceptual scene analysis, which can be spatiotemporally pinpointed to scene-selective cortex (Baldassano et al., 2016; Epstein, 2014) and the first 250 ms of processing (Cichy et al., 2017; Harel et al., 2016). Second, given that schema-related effects in behavioral studies (Mandler and Parker, 1976) are more robustly observed along the vertical dimension, where the scene structure is more rigid (i.e., the sky is almost always above the ground), we hypothesized that the cortical sorting of information should primarily occur along the vertical dimension.

To test these hypotheses, we used a novel visual paradigm in which participants were exposed to fragmented visual inputs, and recorded fMRI and EEG data to resolve brain activity in space and time.

Results

In our study, we experimentally mimicked the fragmented nature of naturalistic visual inputs by dissecting scene images into position-specific fragments. Six natural scene images (Figure 1a) were each split into six equally-sized fragments (three vertical × two horizontal), resulting in 36 conditions (six scenes × six fragments). In separate fMRI (n = 30) and EEG (n = 20) experiments, participants viewed these fragments at central fixation while performing an indoor/outdoor categorization task to ensure engagement with the stimulus (Figure 1b). Critically, this design allowed us to investigate whether the brain sorts the fragments with respect to their place in the schema in the absence of explicit location differences (Figure 1c).

Figure 1. Experimental design and rationale of schema-based information sorting.

(a) The stimulus set consisted of six natural scenes (three indoor, three outdoor). Each scene was split into six rectangular fragments. (b) During the fMRI and EEG recordings, participants performed an indoor/outdoor categorization task on individual fragments. Notably, all fragments were presented at central fixation, removing explicit location information. (c) We hypothesized that the visual system sorts sensory input by spatial schemata, resulting in a cortical organization that is explained by the fragments’ within-scene location, predominantly in the vertical dimension: Fragments stemming from the same part of the scene should be represented similarly. Here we illustrate the hypothesized sorting in a two-dimensional space. A similar organization was observed in multi-dimensional scaling solutions for the fragments’ neural similarities (see Figure 1—figure supplement 1 and Video 1). In subsequent analyses, the spatiotemporal emergence of the schema-based cortical organization was precisely quantified using representational similarity analysis (Figure 2).

Figure 1—figure supplement 1. MDS visualization of neural RDMs.

(a/b) A multi-dimensional scaling (MDS) of the fragments’ neural similarity in OPA (a) and after 200 ms of processing (b) revealed a sorting according to vertical location, which was visible in a two-dimensional solution. This visualization suggests that schemata are a prominent organizing principle for representations in OPA and after 200 ms of vision. A time-resolved MDS for the EEG data can be found in Video 1.

Video 1. Time-resolved MDS visualization of the neural RDMs.

DOI: 10.7554/eLife.48182.004

To directly visualize the emergence of schematic coding from the neural data, we performed a multi-dimensional scaling (MDS) analysis, where the time-resolved neural RDMs (averaged across participants) were projected onto a two-dimensional space. The RDM time series was smoothed using a sliding averaging window (15 ms width). Computing MDS solutions across time yielded a movie (5 ms resolution), where fragments travel through an arbitrary space, eventually forming a meaningful organization. Notably, around 200 ms, a division into the three vertical locations can be observed.
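For illustration, the following Python sketch computes a two-dimensional MDS embedding for each time point of a smoothed RDM time series. It is a minimal re-implementation rather than the original analysis code (which was written in MATLAB); the data layout (an array rdms_time of shape n_times × 36 × 36 at 5 ms resolution) and the warm-start initialization across time points are assumptions made for this example.

```python
# Minimal sketch of the time-resolved MDS visualization.
# Assumed input: rdms_time, an (n_times, 36, 36) array of group-averaged RDMs.
import numpy as np
from sklearn.manifold import MDS

def smooth_rdms(rdms_time, win=3):
    """Sliding-average smoothing over time (3 samples ~ 15 ms at 5 ms resolution)."""
    kernel = np.ones(win) / win
    return np.apply_along_axis(lambda x: np.convolve(x, kernel, mode='same'), 0, rdms_time)

def mds_per_timepoint(rdms_time, seed=0):
    """Project each 36 x 36 RDM into 2D; returns an (n_times, 36, 2) array."""
    embeddings, init = [], None
    for rdm in rdms_time:
        rdm_sym = (rdm + rdm.T) / 2          # enforce symmetry
        np.fill_diagonal(rdm_sym, 0)         # self-dissimilarity is zero
        mds = MDS(n_components=2, dissimilarity='precomputed',
                  random_state=seed, n_init=1 if init is not None else 4)
        emb = mds.fit_transform(rdm_sym, init=init)
        init = emb                           # warm-start for smooth trajectories
        embeddings.append(emb)
    return np.stack(embeddings)

# Usage: coords = mds_per_timepoint(smooth_rdms(rdms_time)); plot one frame per time point.
```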

To quantify the sorting of fragments during cortical processing, we used spatiotemporally resolved representational similarity analysis (Cichy et al., 2014; Kriegeskorte et al., 2008). We first extracted representational dissimilarity matrices (RDMs) from the fMRI and EEG data, which indexed pairwise dissimilarities of the fragments’ neural representations (for details on RDM construction see Figure 2—figure supplement 1). In the fMRI (Figure 2a), we extracted spatially-resolved neural RDMs from scene-selective occipital place area (OPA) and parahippocampal place area (PPA), and from early visual cortex (V1) (for temporal response profiles in these regions see Figure 2—figure supplement 2). In the EEG (Figure 2b), we extracted time-resolved neural RDMs from −200 ms to 800 ms relative to stimulus onset from posterior EEG electrodes (for other electrode groups see Figure 2—figure supplements 3–5).

Figure 2. Spatial schemata determine cortical representations of fragmented scenes.

(a) To test where and when the visual system sorts incoming sensory information by spatial schemata, we first extracted spatially (fMRI) and temporally (EEG) resolved neural representational dissimilarity matrices (RDMs). In the fMRI, we extracted pairwise neural dissimilarities of the fragments from response patterns across voxels in the occipital place area (OPA), parahippocampal place area (PPA), and early visual cortex (V1). (b) In the EEG, we extracted pairwise dissimilarities from response patterns across electrodes at every time point from −200 ms to 800 ms with respect to stimulus onset. (c) We modelled the neural RDMs with three predictor matrices, which reflected the fragments’ vertical and horizontal positions within the full scene, and their category (i.e., their scene of origin). (d) The fMRI data revealed a vertical-location organization in OPA, but not in V1 and PPA. Additionally, the fragments’ category predicted responses in both scene-selective regions. (e) The EEG data showed that both vertical location and category predicted cortical responses rapidly, starting from around 100 ms. These results suggest that the fragments’ vertical position within the scene schema determines rapidly emerging representations in scene-selective occipital cortex. Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean. In further analysis, we probed the flexibility of this schematic coding mechanism (Figure 3).

Figure 2—figure supplement 1. Details on neural dissimilarity construction.

Pairwise neural dissimilarity values were assembled into representational dissimilarity matrices (RDMs), so that for every time point one 36 × 36 matrix containing estimates of neural dissimilarity was available. Here, an example RDM at 200 ms post-stimulus is shown, which exemplifies the ordering of fragment combinations for all RDMs.

Figure 2—figure supplement 2. fMRI response time courses.

(a) Functional MRI data were analyzed in three regions of interest (here shown on the right hemisphere): primary visual cortex (V1), occipital place area (OPA), and parahippocampal place area (PPA). Each of these ROIs showed reliable net responses to the fragments, peaking 3 TRs after stimulus onset. The activation time courses were baseline-corrected by subtracting the activation from the first two TRs. (b) GLM analysis across the response time course. Most prominently after 3 TRs, the neural organization in OPA was explained by the fragments’ vertical location, reflecting a neural coding in accordance with spatial schemata. Additionally, scene category predicted neural organization in OPA and PPA. Error margins reflect standard errors of the mean. Significance markers represent p<0.05 (corrected for multiple comparisons across ROIs).

Figure 2—figure supplement 3. Pairwise decoding across EEG electrode groups.

Based on previous studies on multivariate decoding of visual information, we restricted our main analysis to a group of posterior electrodes (where we expected the strongest effects). For comparison, we also analyzed data in central and anterior electrode groups. The central group consisted of 20 electrodes (C3, TP9, CP5, CP1, TP10, CP6, CP2, Cz, C4, C1, C5, TP7, CP3, CPz, CP4, TP8, C6, C2, T7, T8) and the anterior group consisted of 26 electrodes (F3, F7, FT9, FC5, FC1, FT10, FC6, FC2, F4, F8, Fp2, AF7, AF3, AFz, F1, F5, FT7, FC3, FCz, FC4, FT8, F6, F2, AF4, AF8, Fpz). RDMs were constructed in an identical fashion to the posterior group used for the main analyses (Figure 2—figure supplement 1). We computed general discriminability of the 36 scene fragments in the three groups by averaging all off-diagonal elements of the RDMs. As expected, the resulting time courses of pair-wise discriminability revealed the strongest overall decoding in the posterior group, followed by the central and anterior groups. RSA results for these electrodes are found in Figure 2—figure supplements 4 and 5. Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean.

Figure 2—figure supplement 4. RSA using central electrodes.

(a/b) Repeating the main RSAs for the central electrode group yielded a similar pattern as the posterior group, revealing both vertical location information (from 85 ms to 485 ms) and category information (from 100 ms to 705 ms). (c/d) Removing DNN features abolished category information, but not vertical location information, most prominently between 185 ms and 350 ms. This result is consistent with the schematic coding observed for posterior signals. Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean.

Figure 2—figure supplement 5. RSA using anterior electrodes.

(a/b) Responses recorded from the anterior group also yielded both vertical location information (from 85 ms to 350 ms) and category information (from 165 ms to 610 ms). (c/d) In contrast to the other electrode groups, removing DNN features rendered location and category information non-significant, suggesting that these effects are not primarily linked to sources in frontal brain areas. This observation also excludes explanations based on oculomotor confounds. Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean.

Figure 2—figure supplement 6. Vertical location effects across experiment halves.

We interpret the vertical location organization in the neural data as reflecting prior schematic knowledge about scene structure. Alternatively, however, the vertical location organization could in principle result from learning the composition of the scenes across the experiment. In the latter case, one would predict that vertical location effects should primarily occur late in the experiment (e.g., in the second half), and less so towards the beginning (e.g., in the first half). To test this, we split both the fMRI data (three runs each) and the EEG data (first versus second half of trials) into halves and for each half modeled the neural data as a function of the vertical and horizontal location and category predictors. (a) For the fMRI data, we found significant vertical location information in the OPA in the first half (t[29]=3.46, p<0.001, pcorr <0.05) and a trending effect for the second half (t[29] = 2.07, p = 0.024, pcorr >0.05). No differences between the splits were found in any region (all t<0.90, p>0.37). (b) For the EEG data, we also found very similar results for the two splits, with no significant differences emerging at any time point. Together, these results suggest that the vertical location organization cannot solely be explained by extensive learning over the course of the experiment. Significance markers represent p<0.05 (corrected for multiple comparisons). Empty markers represent p<0.05 (uncorrected). Error margins reflect standard errors of the mean.

Figure 2—figure supplement 7. Pairwise comparisons along the vertical axis.

To test whether vertical location information can be observed across all three vertical bins, we modelled the neural data as a function of the fragments’ vertical location, now separately for each pairwise comparison along the vertical axis (i.e., top versus bottom, top versus middle, and middle versus bottom). (a) For the fMRI data, we only found consistent evidence for vertical location information in the OPA: top versus bottom (t[29]=4.10, p<0.001, pcorr <0.05), top versus middle (t[29]=2.13, p=0.021, pcorr >0.05), middle versus bottom (t[29]=2.06, p=0.024, pcorr >0.05). Although the effect was numerically larger for top versus bottom, we did not find a significant difference between the three pairwise comparisons in OPA (F[2,58]=2.71, p=0.075). (b) For the EEG data, we found significant vertical location information for all three comparisons. Here, the middle-versus-bottom comparison yielded the weakest effect, which was significantly smaller than the effect for top versus bottom from 120 ms to 195 ms and significantly smaller than the effect for top versus middle from 110 ms to 285 ms. Together, these results suggest that schematic coding can be observed consistently across the different comparisons along the vertical axis, although comparisons including the top fragments yielded stronger effects. Significance markers represent p<0.05 (corrected for multiple comparisons). Empty markers represent p<0.05 (uncorrected). Error margins reflect standard errors of the mean.

Figure 2—figure supplement 8. Controlling for task difficulty.

(a) To control for task difficulty effects in the indoor/outdoor classification task, we computed paired t-tests between all pairs of fragments, separately for their associated accuracies and response times. We then constructed two predictor RDMs that contained the t-values of the pairwise tests between the fragments: For each pair of fragments, these t-values corresponded to dissimilarity in task difficulty (e.g., comparing two fragments associated with similarly short categorization response times would yield a low t-value, and thus low dissimilarity). This was done separately for the fMRI and EEG experiments (matrices from the EEG experiment are shown). The accuracy and response time RDMs were mildly correlated with the category RDM (fMRI: accuracy: r = 0.10, response time: r = 0.15; EEG: accuracy: r = 0.17, response time: r = 0.16), but not with the vertical location RDM (fMRI: both r < 0.01, EEG: both r < 0.01). After regressing out the task difficulty RDMs, we found highly similar vertical location and category information as in the previous analyses (Figure 3b/c). (b) In the fMRI, only category information in OPA was significantly reduced when task difficulty was accounted for. (c) In the EEG, towards the end of the epoch – when participants responded – location and category information were decreased. This shows that the effects of schematic coding – emerging around 200 ms after onset – cannot be explained by differences in task difficulty. The dashed significance markers represent significantly reduced information (compared to the main analyses, Figure 3b/c) at p<0.05 (corrected for multiple comparisons).

Figure 2—figure supplement 9. Categorical versus Euclidean vertical location predictors.

We defined our vertical location predictor as categorical, assuming that top, middle, and bottom fragments are coded distinctly in the human brain. An alternative way of constructing the vertical location predictor is in terms of the fragments’ Euclidean distances, where fragments closer together along the vertical axis (e.g., top and middle) are represented more similarly than fragments further apart (e.g., top and bottom). (a) For the fMRI data, we found that the categorical and Euclidean predictors similarly explained the neural data, with no statistical differences between them (all t[29] <1.15, p>0.26). (b) For the EEG data, we found that both predictors explained the neural data well. However, the categorical predictor revealed significantly stronger vertical location information from 75 ms to 340 ms, suggesting that, at least in the EEG data, the differentiation along the vertical axis is more categorical in nature. Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean.

We then quantified schema effects using separate model RDMs for horizontal and vertical locations (Figure 2c). These location RDMs reflected whether pairs of fragments shared the same location or not. We additionally constructed a category model RDM, which reflected whether pairs of fragments stemmed from the same scene or not.

Critically, if cortical information is indeed sorted with respect to scene schemata, we should observe a neural clustering of fragments that stem from the same within-scene location – in this case, the location RDM should predict a significant proportion of the representational organization in visual cortex.

To test this, we modeled neural RDMs as a function of the model RDMs using general linear models, separately for the fMRI and EEG data. The resulting beta weights indicated to which degree location and category information accounted for cortical responses in the three ROIs and across time.

The key observation was that the fragments’ vertical location predicted neural representations in OPA (t[29] = 4.12, p<0.001, pcorr <0.05), but not in V1 and PPA (Figure 2d; test statistics for all analyses and ROIs are reported in Supplementary file 1), and, in the EEG, between 55 ms and 685 ms after stimulus onset (peak: t[19] = 9.03, p<0.001, pcorr <0.05) (Figure 2e). This vertical-location organization was consistent across the first and second half of the experiments (see Figure 2—figure supplement 6) and across all pairwise comparisons along the vertical axis (see Figure 2—figure supplement 7). No effects were observed for horizontal location, consistent with more rigid spatial scene structure in the vertical dimension (Mandler and Parker, 1976). This result provides a first characterization of where and when incoming information is organized in accordance with scene schemata: in OPA and rapidly after stimulus onset, scene fragments are sorted according to their origin within the environment.

The schema-based organization co-exists with a prominent scene-category organization: in line with previous findings (Lowe et al., 2018; Walther et al., 2009), scene category predicted neural representations in OPA (t[29] = 3.12, p=0.002, pcorr <0.05) and PPA (t[29] = 4.26, p<0.001, pcorr <0.05) (Figure 2d), and, in the EEG, from 60 ms to 775 ms (peak: t[19] = 6.39, p<0.001, pcorr <0.05) (Figure 2e).

To efficiently support vision in dynamic natural environments, schematic coding needs to be flexible with respect to visual properties of specific scenes. The absence of vertical location effects in V1 indeed highlights that schematic coding is not tied to the analysis of simple visual features. To more thoroughly probe this flexibility, we additionally conducted three complementary analyses (Figure 3).

Figure 3. Schematic coding operates flexibly across visual and conceptual scene properties.

(a) To determine the role of categorization-related visual features in this schematic organization, we regressed out RDMs obtained from 18 layers along the ResNet50 DNN before repeating the three-predictor general linear model (GLM) analysis (Figure 2c). (b/c) Removing DNN features abolished category information in fMRI and EEG signals, but not vertical location information. (d) To test for generalization across different scene types, we restricted location predictor RDMs to comparisons across indoor and outdoor scenes. Due to this restriction, category could not be modelled. (e/f) In this analysis, vertical location still predicted neural organization in OPA and from 70 ms. (g) Finally, we combined the two analyses: we first regressed out DNN features and then modelled the neural RDMs using the restricted predictor RDMs (d). (h) In this analysis, we still found significant vertical location information in OPA. (i) Notably, vertical location information in the EEG signals was delayed to after 180 ms, suggesting that at this stage schematic coding becomes flexible to visual and conceptual attributes. Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean.

Figure 3—figure supplement 1. AlexNet as a model of visual categorization.

(a) In addition to the ResNet50 DNN, we also used the more widely used AlexNet DNN architecture (pretrained on the ImageNet dataset, implemented in the MatConvNet toolbox) as a model for visual categorization. AlexNet consists of five convolutional and three fully-connected layers. We created eight RDMs, one for each layer of the DNN. (b/c) Removing the AlexNet DNN features rendered category information non-significant in fMRI and EEG signals. However, we still found vertical location information in OPA and from 65 ms to 375 ms. (c–e) When additionally restricting the analysis to comparisons between indoor and outdoor scenes, the fragments’ vertical location still predicted neural activations in OPA and from 95 ms to 375 ms. In sum, these results are highly similar to the results obtained with the ResNet50 model (Figure 3b/c/h/i). Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean.

Figure 3—figure supplement 2. DNN model fit.

(a/b) Goodness of fit (R2) across ROIs (a) and time (b) of the GLMs used to regress out DNN features, obtained from ResNet50 (left) or AlexNet (right). For the EEG time series, the mean R2 across the baseline period was subtracted. Note that GLMs based on the ResNet50 RDMs had more predictor variables, which may contribute to their better fit. Error bars represent standard errors of the mean.

Figure 3—figure supplement 3. Low-level control models.

We used three control models that explicitly account for low-level visual features: a pixel-dissimilarity model, GIST descriptors, and the fragments’ neural dissimilarity in V1. Critically, none of the three models accounted for the fragments’ vertical location organization. Moreover, unlike the DNN models, the low-level models were also unable to account for the fragments’ categorical organization. (a/b) Results after regressing out the pixel dissimilarity model, which captured the fragments’ pairwise dissimilarity in pixel space (i.e., 1 − the correlation of their pixel values). (c/d) Results after regressing out the GIST model, which captured the fragments’ pairwise dissimilarity in GIST descriptors (i.e., in their global spatial envelope). (e/f) Results after regressing out the V1 model, which captured the fragments’ pairwise neural dissimilarity in V1 (i.e., the averaged RDM across participants) and thereby provides a brain-derived measure of low-level feature similarity. Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean.

First, we tested whether schematic coding is tolerant to stimulus features relevant for visual categorization. Categorization-related features were quantified using a deep neural network (DNN; ResNet50), which extracts such features similarly to the brain (Wen et al., 2018). We removed DNN features by regressing out layer-specific RDMs constructed from DNN activations (see Materials and Methods for details) (Figure 3a); subsequently, we re-estimated location and category information.

After removing DNN features, category information was rendered non-significant in both fMRI and EEG signals. When directly comparing category information before and after removing the DNN features, we found reduced category information in PPA (t[29] = 2.48, p = 0.010, pcorr <0.05) and OPA (t[29] = 1.86, p = 0.036, pcorr >0.05), and a strong reduction of category information across time, from 75 ms to 775 ms (peak t[19] = 13.0, p<0.001, pcorr <0.05). Together, this demonstrates that categorization-related brain activations are successfully explained by DNN features (Cichy et al., 2016; Cichy et al., 2017; Groen et al., 2018; Güçlü and van Gerven, 2015; Wen et al., 2018), indicating the appropriateness of our DNN for modelling visual brain activations. Despite the suitability of our DNN model for modelling categorical brain responses, vertical location still accounted for the neural organization in OPA (t[29] = 2.37, p = 0.012, pcorr <0.05) (Figure 3b) and between 75 ms and 335 ms (peak: t[19] = 5.06, p<0.001, pcorr <0.05) (Figure 3c). Similar results were obtained using a shallower feed-forward DNN (see Figure 3—figure supplement 1). This result suggests that schematic coding cannot be explained by categorization-related features extracted by DNN models.

DNN features are a useful control for flexibility regarding visual features, because they cover both low-level and high-level features, explaining variance across fMRI regions and across EEG processing time (see Figure 3—figure supplement 2; see also Cichy et al., 2016; Güçlü and van Gerven, 2015). However, to more specifically control for low-level features, we used two commonly employed low-level control models: pixel dissimilarity and GIST descriptors (Oliva and Torralba, 2001). These models explained neither the vertical location organization nor the category organization in the neural data (see Figure 3—figure supplement 3). Finally, as an even stronger control of the low-level features encoded in V1, we used the neural dissimilarity structure in V1 (i.e., the neural RDMs) as a control model, establishing an empirical neural measure of low-level features. With V1 housing precise low-level feature representations, this measure should capture the features extracted during the early stages of visual processing particularly well. However, removing the V1 dissimilarity structure did not abolish the schematic coding effects in either OPA or the EEG data (see Figure 3—figure supplement 3). This shows that even control models that approximated V1 representations extremely well – as well as the V1 representations approximate themselves – could not explain vertical location effects in downstream processing. Together, these results provide converging evidence that low-level feature processing cannot explain the schematic coding effects reported here.
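To make the pixel-dissimilarity control concrete, a minimal Python sketch is given below; it assumes the 36 fragments are available as equally sized image arrays (the variable name is illustrative) and covers only the pixel model, since GIST requires a dedicated descriptor implementation.

```python
# Minimal sketch of the pixel-dissimilarity control RDM (1 - correlation of
# pixel values). 'images' is an assumed list of 36 equally sized image arrays.
import numpy as np

def pixel_rdm(images):
    vectors = np.stack([np.asarray(img, dtype=float).ravel() for img in images])
    corr = np.corrcoef(vectors)      # pairwise correlations between pixel vectors
    return 1.0 - corr                # dissimilarity: 0 = identical, ~1 = unrelated
```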

Second, we asked whether schematic coding operates flexibly across visually diverse situations. To test this explicitly we restricted RDMs to comparisons between indoor and outdoor scenes, which vary substantially in visual characteristics (Torralba and Oliva, 2003) (Figure 3d).

Vertical location still predicted cortical organization in OPA (t[29] = 3.05, p = 0.002, pcorr <0.05) (Figure 3e) and from 70 ms to 385 ms (peak: t[19] = 7.47, p<0.001, pcorr <0.05) (Figure 3f). The generalization across indoor and outdoor scenes indicates that schematic coding operates similarly across radically different scenes, suggesting that the mechanism can similarly contextualize information across different real-life situations.

Finally, for a particularly strong test of flexibility, we tested for schematic coding after removing both DNN features and within-category comparisons (Figure 3g). In this analysis, OPA representations were still explained by the fragments’ vertical location (t[29] = 2.38, p = 0.012, pcorr <0.05) (Figure 3h). Notably, early schema effects were rendered non-significant, while vertical location still predicted representations after 180 ms (peak: t[19] = 4.41, p<0.001, pcorr <0.05) (Figure 3i), suggesting a high degree of flexibility emerging at that time. Interestingly, across all analyses, vertical location information was exclusively found in OPA and always peaked shortly after 200 ms (see Supplementary file 2), suggesting that schematic coding occurs during early perceptual analysis of scenes.

Discussion

Together, our findings characterize a novel neural mechanism for contextualizing fragmented inputs during naturalistic vision. The mechanism exploits schemata to sort sensory inputs into meaningful representations of the environment. This sorting occurs during perceptual scene analysis in scene-selective OPA and within the first 200 ms of vision, and operates flexibly across changes in visual properties.

That schema-based coding can be localized to OPA is consistent with the region’s important role in visual scene processing. Transcranial magnetic stimulation studies suggest that OPA activation is crucial for various scene perception tasks, such as scene discrimination (Dilks et al., 2013; Ganaden et al., 2013), navigating through scenes (Julian et al., 2016) and anticipating upcoming scene information (Gandolfo and Downing, 2019). Functional MRI work suggests that computations in the OPA include the analysis of spatial scene layout (Dillon et al., 2018; Henriksson et al., 2019; Lowe et al., 2017) and the parsing of local scene elements like objects and local surfaces (Kamps et al., 2016). Future studies are needed to clarify which of these computations mediate the schema-based coding described here.

As the current study is limited to a small set of scenes, more research is needed to explore whether schema-based coding generalizes to more diverse contents. It is conceivable that schema-based coding constitutes a more general coding strategy that may generalize to other visual contents (such as faces; Henriksson et al., 2015) and non-visual processing domains: when sensory information is fragmented and spatial information is unreliable, the brain may use schematic information to contextualize sensory inputs. This view is in line with Bayesian theories of perception where the importance of prior information for perceptual inference grows with the noisiness and ambiguity of the sensory information at hand (Ernst and Banks, 2002; Kersten et al., 2004).

The schema-based sorting of scene representations provides a mechanism for efficient communication between perceptual and cognitive systems: when scene information is formatted with respect to its role in the environment, it can be efficiently read out by downstream processes. This idea is consistent with the emerging view that cortical representations depend on functional interactions with the environment (Bonner and Epstein, 2017; Groen et al., 2018; Malcolm et al., 2016; Peelen and Downing, 2017). Under this view, formatting perceptual information according to real-world structure may allow cognitive and motor systems to efficiently read out visual information that is needed for different real-world tasks (e.g., immediate action versus future navigation). As the schema-based sorting of scene information happens already during early scene analysis, many high-level processes have access to this information.

Lastly, our results have implications for computational modelling of vision. While DNNs trained on categorization accurately capture the representational divide into different scene categories, they cannot explain the schema-based organization observed in the human visual system. Although this does not mean that visual features extracted by DNN models are in principle incapable of explaining schema-based brain representations, our results highlight that current DNN models of categorization do not use real-world structure in similar ways to the human brain. In the future, augmenting DNN training procedures with schematic information (Katti et al., 2019) may improve their performance on real-world tasks and narrow the gap between artificial and biological neural networks.

To conclude, our findings provide the first spatiotemporal characterization of a neural mechanism for contextualizing fragmented visual inputs. By rapidly organizing visual information according to its typical role in the world, this mechanism may contribute to the optimal use of perceptual information for guiding efficient real-world behaviors, even when sensory inputs are incomplete or dynamically changing.

Materials and methods

Key resources table.

Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information
Software, algorithm | CoSMoMVPA | Oosterhof et al., 2016 | RRID:SCR_014519 | For data analysis
Software, algorithm | fieldtrip | Oostenveld et al., 2011 | RRID:SCR_004849 | For EEG data preprocessing
Software, algorithm | MATLAB | Mathworks Inc. | RRID:SCR_001622 | For stimulus delivery and data analysis
Software, algorithm | Psychtoolbox 3 | Brainard, 1997 | RRID:SCR_002881 | For stimulus delivery
Software, algorithm | SPM12 | www.fil.ion.ucl.ac.uk/spm/software/spm12/ | RRID:SCR_007037 | For fMRI data preprocessing

Participants

Thirty adults (mean age 23.9 years, SD = 4.4; 26 females) completed the fMRI experiment and twenty (mean age 24.0 years, SD = 4.3; 15 females) completed the EEG experiment. All participants had normal or corrected-to-normal vision. They all provided informed consent and received monetary reimbursement or course credits for their participation. All procedures were approved by the ethical committee of the Department of Education and Psychology at Freie Universität Berlin (reference 140/2017) and were in accordance with the Declaration of Helsinki.

Stimuli

The stimulus set (Figure 1a) consisted of fragments taken from three images of indoor scenes (bakery, classroom, kitchen) and three images of outdoor scenes (alley, house, farm). Each image was split horizontally into two halves, and each of the halves was further split vertically in three parts, so that for each scene six fragments were obtained. Participants were not shown the full scene images prior to the experiment.

Experimental design

The fMRI and EEG designs were identical, unless otherwise noted. Stimulus presentation was controlled using the Psychtoolbox (Brainard, 1997; RRID:SCR_002881). In each trial, one of the 36 fragments was presented at central fixation (7° horizontal visual angle) for 200 ms (Figure 1b). Participants were instructed to maintain central fixation and categorize each stimulus as an indoor or outdoor scene image by pressing one of two buttons.

In the fMRI experiment, the inter-trial interval was kept constant at 2,300 ms, irrespective of the participant’s response time. In the EEG experiment, after each response a green or red fixation dot was presented for 300 ms to indicate response correctness; participants were instructed to only blink after the feedback had occurred. Trials were separated by a fixation interval randomly varying between 1500 ms and 2000 ms.

In the fMRI, participants performed six identical runs. Within each run, each of the 36 scene fragments was shown four times, resulting in 144 trials. Additionally, each run contained 29 fixation trials, where only the central fixation dot was shown. Runs started and ended with brief fixation periods; the total run duration was 7:30 min. In the EEG, each of the 36 fragments was presented 40 times during the experiment, for a total of 1440 trials, divided into 10 runs. Three participants performed a shorter version of the experiment, with only 20 repetitions of each image (720 trials in total).

In both experiments, participants performed very well in the indoor/outdoor categorization task (fMRI: 94% correct, 658 ms mean response time, EEG: 96%, 606 ms). Differences in task difficulty across fragments were not related to the neural effects of interest (Figure 2—figure supplement 8).

fMRI recording and preprocessing

MRI data was acquired using a 3T Siemens Tim Trio Scanner equipped with a 12-channel head coil. T2*-weighted gradient-echo echo-planar images were collected as functional volumes (TR = 2 s, TE = 30 ms, 70° flip angle, 3 mm³ voxel size, 37 slices, 20% gap, 192 mm FOV, 64 × 64 matrix size, interleaved acquisition). Additionally, a T1-weighted image (MPRAGE; 1 mm³ voxel size) was obtained as a high-resolution anatomical reference. During preprocessing, the functional volumes were realigned and coregistered to the T1 image, using MATLAB (RRID:SCR_001622) and SPM12 (www.fil.ion.ucl.ac.uk/spm/; RRID:SCR_007037).

fMRI region of interest definition

We restricted our analyses to three regions of interest (ROIs). We defined scene-selective occipital place area (OPA; Dilks et al., 2013) and parahippocampal place area (PPA; Epstein and Kanwisher, 1998) using a functional group atlas (Julian et al., 2012). As a control region, we defined early visual cortex (V1) using a probabilistic atlas (Wang et al., 2015). All ROIs were defined in standard space and then inverse-normalized into individual-participant space. For each ROI, we concatenated the left- and right-hemispheric masks and performed analyses on the joint ROI.

EEG recording and preprocessing

The EEG was recorded using an EASYCAP 64-channel system and a Brainvision actiCHamp amplifier. The electrodes were arranged in accordance with the standard 10–10 system. The data was recorded at a sampling rate of 1000 Hz and filtered online between 0.03 Hz and 100 Hz. All electrodes were referenced online to the Fz electrode. Offline preprocessing was performed in MATLAB, using the FieldTrip toolbox (Oostenveld et al., 2011; RRID:SCR_004849). The continuous EEG data were epoched into trials ranging from 200 ms before stimulus onset to 800 ms after stimulus onset, and baseline corrected by subtracting the mean of the pre-stimulus interval for each trial and channel separately. Trials containing movement-related artefacts were automatically identified and removed using the default automatic rejection procedure implemented in Fieldtrip. Channels containing excessive noise were removed based on visual inspection. Blinks and eye movement artifacts were identified and removed using independent components analysis and visual inspection of the resulting components. The epoched data were down-sampled to 200 Hz.
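The preprocessing itself was implemented in FieldTrip; for orientation, the sketch below shows an analogous pipeline in MNE-Python. The file name, event extraction, and ICA settings are placeholders rather than details of the original pipeline.

```python
# Rough MNE-Python analog of the FieldTrip preprocessing used in the study;
# file name, event codes, and ICA settings are placeholders.
import mne

raw = mne.io.read_raw_brainvision('sub-01.vhdr', preload=True)   # hypothetical file
events, event_id = mne.events_from_annotations(raw)

# Epoch from -200 ms to 800 ms around stimulus onset and baseline-correct
# with the mean of the pre-stimulus interval.
epochs = mne.Epochs(raw, events, event_id=event_id,
                    tmin=-0.2, tmax=0.8, baseline=(None, 0), preload=True)

# Remove blink / eye-movement components identified with ICA (components are
# selected by visual inspection, as in the paper).
ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(epochs)
ica.exclude = []                     # indices of artifact components after inspection
epochs_clean = ica.apply(epochs.copy())

# Downsample to 200 Hz (5 ms resolution) for the decoding analyses.
epochs_clean.resample(200)
```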

Representational similarity analysis

To model the representational structure of the neural activity related to our stimulus set, we used representational similarity analysis (RSA; Kriegeskorte et al., 2008). We first extracted neural RDMs separately for the fMRI and EEG experiments, and then used the same analyses to model their organization. To retrieve the fragments’ position within the original scene, as well as their scene category, we used a regression approach, where we modeled neural dissimilarity as a linear combination of multiple predictors (Proklova et al., 2016; Proklova et al., 2019).

Constructing neural dissimilarity – fMRI

For the fMRI data, we used cross-validated correlations as a measure of pairwise neural dissimilarity. First, patterns for each ROI were extracted from the functional images corresponding to the trials of interest. After shifting the activation time course by 3 TRs (i.e., 6 s, accounting for the hemodynamic delay), we extracted voxel-wise activation values for each trial, from the TR that was closest to the stimulus onset on this trial (for results across 6 TRs with respect to trial onset, see Figure 2—figure supplement 2). To account for activation differences between runs, the mean activation across conditions was subtracted from each voxel’s values, separately for each run. For each ROI, response patterns across voxels were used to perform multivariate analyses using the CoSMoMVPA toolbox (Oosterhof et al., 2016; RRID:SCR_014519). For each TR separately, we performed correlation-based (Haxby et al., 2001) multi-voxel pattern analyses (MVPA) for each pair of fragments. These analyses were cross-validated by repeatedly splitting the data into two equally-sized sets (i.e., half of the runs per set). For this analysis, we correlated the patterns across the two sets, both within-condition (i.e., patterns stemming from the same fragment but from different sets) and between-condition (i.e., patterns stemming from two different fragments and from different sets). These correlations were Fisher-transformed. Then, we subtracted the between-correlations from the within-correlations to obtain a cross-validated correlation measure, where above-zero values reflect successful discrimination. This procedure was repeated for all possible splits of the six runs. Performing this MVPA for all pairs of fragments yielded a 36 × 36 representational dissimilarity matrix (RDM) for each ROI. The RDMs’ entries reflected the neural dissimilarity between pairs of fragments (the diagonal remained empty).
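As a concrete sketch of this cross-validated correlation measure, the Python function below computes one RDM entry for a single pair of fragments. It assumes run-wise voxel patterns (run means already removed) as input; the variable names and the run-averaging are illustrative, since the original analysis used CoSMoMVPA in MATLAB.

```python
# Sketch of the cross-validated correlation dissimilarity for one pair of
# fragments. Assumed inputs: patterns_a and patterns_b are (n_runs, n_voxels)
# arrays of run-wise voxel patterns (run means already subtracted).
import numpy as np
from itertools import combinations

def crossval_corr_dissimilarity(patterns_a, patterns_b):
    n_runs = patterns_a.shape[0]
    scores = []
    for half in combinations(range(n_runs), n_runs // 2):    # all splits into two sets
        other = [r for r in range(n_runs) if r not in half]
        a1, a2 = patterns_a[list(half)].mean(0), patterns_a[other].mean(0)
        b1, b2 = patterns_b[list(half)].mean(0), patterns_b[other].mean(0)
        corr = lambda x, y: np.arctanh(np.corrcoef(x, y)[0, 1])   # Fisher-transformed r
        within = (corr(a1, a2) + corr(b1, b2)) / 2
        between = (corr(a1, b2) + corr(b1, a2)) / 2
        scores.append(within - between)      # > 0 means the pair is discriminable
    return np.mean(scores)                   # one off-diagonal RDM entry
```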

Constructing neural dissimilarity – EEG

For the EEG data, we used cross-validated classification accuracies as a measure of pairwise neural dissimilarity. We thus constructed RDMs across time by performing time-resolved multivariate decoding analyses (Contini et al., 2017). RDMs were built by computing pair-wise decoding accuracy for all possible combinations of the 36 stimuli, using the CoSMoMVPA toolbox (Oosterhof et al., 2016). As we expected the highest classification accuracies in sensors over visual cortex (Battistoni et al., 2018; Kaiser et al., 2016), only 17 occipital and posterior sensors (O1, O2, Oz, PO3, PO4, PO7, PO8, POz, P1, P2, P3, P4, P5, P6, P7, P8, Pz) were used in this analysis. We report results for other electrode groups in Figure 2—figure supplements 3–5. For each participant, classification was performed separately for each time point across the epoch (i.e., with 5 ms resolution). The analysis was performed in a pair-wise fashion: Linear discriminant analysis classifiers were always trained and tested on data from two conditions (e.g., the middle left part of the alley versus the top right part of the farm), using a leave-one-trial-out partitioning scheme. The training set consisted of all but one trial for each of the two conditions, while one trial for each of the two conditions was held back and used for classifier testing. This procedure was repeated until every trial was left out once. Classifier performance was averaged across these repetitions. The pairwise decoding analysis resulted in a 36 × 36 neural RDM for each time point. A schematic description of the RDM construction can be found in Figure 2—figure supplement 1.
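A minimal sketch of this pairwise decoding scheme in Python (scikit-learn) is shown below; the original analysis used CoSMoMVPA, so implementation details such as the exact LDA settings are assumptions. Inputs are the single-trial channel patterns of two conditions at one time point.

```python
# Sketch of pairwise leave-one-trial-out LDA decoding at a single time point.
# Assumed inputs: trials_a, trials_b are (n_trials, n_channels) arrays for two conditions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pairwise_decoding_accuracy(trials_a, trials_b):
    n_folds = min(len(trials_a), len(trials_b))
    accuracies = []
    for k in range(n_folds):
        # hold back the k-th trial of each condition for testing
        test_X = np.vstack([trials_a[k], trials_b[k]])
        test_y = np.array([0, 1])
        train_X = np.vstack([np.delete(trials_a, k, axis=0),
                             np.delete(trials_b, k, axis=0)])
        train_y = np.concatenate([np.zeros(len(trials_a) - 1),
                                  np.ones(len(trials_b) - 1)])
        clf = LinearDiscriminantAnalysis().fit(train_X, train_y)
        accuracies.append(clf.score(test_X, test_y))
    return np.mean(accuracies)    # one RDM entry; 0.5 corresponds to chance
```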

Location and category predictors

We predicted the neural RDMs in a general linear model (GLM; see below) with three different predictor RDMs (36 × 36 entries each) (Figure 2c): In the vertical location RDM, each pair of conditions is assigned either a value of 0, if the fragments stem from the same vertical location, or a value of 1, if they stem from different vertical locations (for results with an alternative predictor RDM using Euclidean distances see Figure 2—figure supplement 9). In the horizontal location RDM, each pair of conditions is assigned either a value of 0, if the fragments stem from the same horizontal location, or a value of 1, if they stem from different horizontal locations. In the category RDM, each pair of conditions is assigned either a value of 0, if the fragments stem from the same scene, or a value of 1, if they stem from different scenes.
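These binary predictor RDMs can be written down compactly; the Python sketch below builds them, assuming an illustrative ordering of the 36 conditions by scene, vertical position, and horizontal position.

```python
# Sketch of the three binary predictor RDMs (36 x 36); the condition ordering
# (scene x vertical x horizontal) is illustrative, not the original one.
import numpy as np
from itertools import product

conditions = [{'scene': s, 'vertical': v, 'horizontal': h}
              for s, v, h in product(range(6), range(3), range(2))]
n = len(conditions)   # 36

def predictor_rdm(key):
    """0 if two fragments share the given attribute, 1 if they differ on it."""
    rdm = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            rdm[i, j] = float(conditions[i][key] != conditions[j][key])
    return rdm

vertical_rdm = predictor_rdm('vertical')
horizontal_rdm = predictor_rdm('horizontal')
category_rdm = predictor_rdm('scene')
```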

In an additional analysis, we sought to eliminate properties specific to the indoor or the outdoor scenes. We therefore constructed RDMs for horizontal and vertical location information which only contained comparisons between the indoor and outdoor scenes. These RDMs were constructed in the same way as explained above, but all comparisons within the same scene type were removed (Figure 3d).

Modelling neural dissimilarity

To reveal correspondences between the neural data and the predictor matrices, we used GLM analyses. Separately for each ROI (fMRI) or time point (EEG), we modelled the neural RDM as a linear function of the vertical location RDM, the horizontal location RDM, and the category RDM. Prior to each regression, the neural RDMs and predictor RDMs were vectorized by selecting all lower off-diagonal elements – the rest of the entries, including the diagonal, were discarded. Values for the neural RDMs were z-scored. Separately for each subject and each ROI or time point, three beta coefficients (i.e., regression weights) were estimated. By averaging across participants, we obtained beta estimates for each predictor, showing how well each predictor explains the neural data in each ROI and over time.
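A minimal Python sketch of this RDM regression is given below; it assumes the neural and predictor RDMs are 36 × 36 numpy arrays and mirrors the steps described above (lower-triangle vectorization, z-scoring, least-squares betas), although the original implementation was in MATLAB.

```python
# Sketch of the RDM regression: vectorize lower triangles, z-score the neural
# dissimilarities, and estimate one beta weight per predictor RDM.
import numpy as np
from scipy import stats

def lower_tri(rdm):
    return rdm[np.tril_indices(rdm.shape[0], k=-1)]    # off-diagonal elements only

def rdm_glm(neural_rdm, predictor_rdms):
    y = stats.zscore(lower_tri(neural_rdm))
    X = np.column_stack([np.ones(len(y))] + [lower_tri(p) for p in predictor_rdms])
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)
    return betas[1:]                                   # drop the intercept

# e.g., per subject and ROI/time point:
# b_vertical, b_horizontal, b_category = rdm_glm(
#     neural_rdm, [vertical_rdm, horizontal_rdm, category_rdm])
```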

Furthermore, we performed an additional GLM analysis with a vertical location predictor and a horizontal location predictor, where comparisons within indoor- and outdoor-scenes were removed (Figure 3d–f); these comparisons were also removed from the regression criterion. Using the same procedure as in the previous GLM analysis, we then estimated the beta coefficients for each predictor at each time point, separately for each subject. For this analysis, a category RDM could not be constructed, as all comparisons of fragments from the same scene were eliminated.

Controlling for deep neural network features

To control for similarity in categorization-related visual features, we used a deep neural network (DNN) model. DNNs have recently become the state-of-the-art model of visual categorization, as they tightly mirror the neural organization of object and scene representations (Cichy et al., 2016; Cichy et al., 2017; Cichy and Kaiser, 2019; Groen et al., 2018; Güçlü and van Gerven, 2015; Wen et al., 2018). DNNs are similar to the brain as they are trained using extensive training material while dynamically adjusting the ‘tuning’ of their connections. Here, we used a DNN (see below) that has been trained to categorize objects across a large number of images and categories, therefore providing us with a high-quality model of how visual features are extracted for efficient categorization. By comparing DNN activations and brain responses to the scene fragments, we could quantify to what extent features routinely extracted for categorization purposes account for schema-based coding in the human visual system.

In a two-step approach, we re-performed our regression analysis after removing the representational organization emerging from the DNN. First, we used a regression model to remove the contribution of the dissimilarity structure in the DNN model. This model included one predictor for each layer extracted from the DNN (i.e., one RDM for each processing step along the DNN). Estimating this model allowed us to remove the neural organization explained by the DNN while retaining what remains unexplained (in the regression residuals). Second, we re-ran the previous regression analyses (see above), but now the residuals of the DNN regression were used as the regression criterion, so that only the organization that remained unexplained by the DNN was modeled.
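The two-step logic can be expressed in a few lines; the sketch below (Python, names assumed) regresses the layer RDMs out of a neural RDM and returns the residual dissimilarities, which would then be modelled with the location and category predictors as above.

```python
# Sketch of step one of the DNN control analysis: regress the DNN layer RDMs
# out of a neural RDM and keep the residuals as the new regression criterion.
import numpy as np

def regress_out_dnn(neural_rdm, dnn_rdms):
    tril = np.tril_indices(neural_rdm.shape[0], k=-1)
    y = neural_rdm[tril]
    X = np.column_stack([np.ones(len(y))] + [rdm[tril] for rdm in dnn_rdms])
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ betas     # dissimilarity structure unexplained by the DNN
```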

As a DNN model, we used a pre-trained version (trained on image categorization for the ImageNet challenge) of the ResNet50 model (He et al., 2016), as implemented in MatConvNet (Vedaldi and Lenc, 2015). This model’s deeper, residual architecture outperforms shallower models in approximating visual cortex organization (Wen et al., 2018). ResNet50 consists of 16 blocks of residual layer modules, where information both passes through an aggregate of layers within the block, and bypasses the block; then the residual between the processed and the bypassing information is computed. Additionally, ResNet50 has one convolutional input layer, and one fully-connected output layer. Here, to avoid inflating the number of intercorrelated predictor variables, we only used the final layer of each residual block, and thus 18 layers in total (16 from the residual blocks, and the input and output layers). For each layer, an RDM was built using 1 − correlation between the activations of all nodes in the layer, separately for each pair of conditions. For regressing out the DNN RDMs, we added one predictor for each available RDM. In Figure 3—figure supplement 1, we show that an analysis using the AlexNet architecture (Krizhevsky et al., 2012) yields comparable results; in Figure 3—figure supplement 2, we additionally provide information about the DNN model fit across regions and time points.
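For illustration, the sketch below extracts layer activations from a pretrained ResNet50 and turns them into RDMs using PyTorch/torchvision rather than MatConvNet; the hooked layers (the four residual stages) are a coarser selection than the 18 layers used in the paper, and the weight identifier is an assumption.

```python
# Sketch of layer-wise DNN RDMs from a pretrained ResNet50 (torchvision, not
# MatConvNet as in the study); layer selection and weight name are assumptions.
import numpy as np
import torch
from torchvision import models, transforms

model = models.resnet50(weights='IMAGENET1K_V1').eval()
preprocess = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

activations = {}
def save_output(name):
    def hook(module, inputs, output):
        activations[name] = output.flatten(start_dim=1).detach()
    return hook

for name in ['layer1', 'layer2', 'layer3', 'layer4']:   # residual stage outputs
    getattr(model, name).register_forward_hook(save_output(name))

def dnn_layer_rdms(pil_images):
    batch = torch.stack([preprocess(img) for img in pil_images])
    with torch.no_grad():
        model(batch)
    # RDM per layer: 1 - correlation between the activation vectors of all images
    return {name: 1 - np.corrcoef(act.numpy()) for name, act in activations.items()}
```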

Statistical testing

For the fMRI data, we tested the regression coefficients against zero, using one-tailed, one-sample t-tests (i.e., testing the hypothesis that coefficients were greater than zero). Multiple-comparison correction was based on Bonferroni-corrections across ROIs. A complete report of all tests performed on the fMRI data can be found in Supplementary file 1. For the EEG data, we used a threshold-free cluster enhancement procedure (Smith and Nichols, 2009) to identify significant effects across time. Multiple-comparison correction was based on a sign-permutation test (with null distributions created from 10,000 bootstrapping iterations) as implemented in CoSMoMVPA (Oosterhof et al., 2016). The resulting statistical maps were thresholded at Z > 1.64 (i.e., p<0.05, one-tailed against zero). Additionally, we report the results of one-sided t-tests for all peak effects. To estimate the reliability of onset and peak latencies, we performed bootstrapping analyses, which are reported in Supplementary files 2 and 3.
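For the fMRI part of this procedure, a small Python sketch is shown below: one-tailed one-sample t-tests of the beta weights against zero with Bonferroni correction across ROIs (the array layout and ROI order are assumed). The EEG TFCE/sign-permutation procedure is not re-implemented here, as it relies on CoSMoMVPA routines.

```python
# Sketch of the fMRI statistics: one-tailed one-sample t-tests against zero,
# Bonferroni-corrected across ROIs. 'betas' is an assumed (n_subjects, n_rois) array.
import numpy as np
from scipy import stats

def roi_ttests(betas, roi_names=('V1', 'OPA', 'PPA'), alpha=0.05):
    n_rois = len(roi_names)
    results = {}
    for r, name in enumerate(roi_names):
        t, p_two = stats.ttest_1samp(betas[:, r], 0.0)
        p_one = p_two / 2 if t > 0 else 1 - p_two / 2    # one-tailed: beta > 0
        results[name] = {'t': t, 'p': p_one,
                         'significant': p_one < alpha / n_rois}   # Bonferroni
    return results
```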

Data availability

Data are publicly available on OSF (https://doi.org/10.17605/OSF.IO/H3G6V).

Acknowledgements

DK and RMC are supported by Deutsche Forschungsgemeinschaft (DFG) grants (KA4683/2-1, CI241/1-1, CI241/3-1). RMC is supported by a European Research Council Starting Grant (ERC-2018-StG).

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Daniel Kaiser, Email: danielkaiser.net@gmail.com.

Huan Luo, Peking University, China.

Joshua I Gold, University of Pennsylvania, United States.

Funding Information

This paper was supported by the following grants:

  • Deutsche Forschungsgemeinschaft KA4683/2-1 to Daniel Kaiser.

  • Deutsche Forschungsgemeinschaft CI241/1-1 to Radoslaw M Cichy.

  • Deutsche Forschungsgemeinschaft CI241/3-1 to Radoslaw M Cichy.

  • H2020 European Research Council ERC-2018-StG 803370 to Radoslaw M Cichy.

Additional information

Competing interests

No competing interests declared.

Author contributions

Daniel Kaiser, Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing.

Jacopo Turini, Formal analysis, Investigation, Methodology, Writing—review and editing.

Radoslaw M Cichy, Conceptualization, Supervision, Funding acquisition, Project administration, Writing—review and editing.

Ethics

Human subjects: All participants provided informed written consent. All procedures were approved by the ethical committee of the Department of Education and Psychology at Freie Universität Berlin (reference 140/2017) and were in accordance with the Declaration of Helsinki.

Additional files

Supplementary file 1. Complete statistical report for fMRI results.

The table shows test statistics and p-values for all tests performed in the fMRI experiment (Figures 2 and 3). Values reflect one-sided t-tests against zero. All p-values are uncorrected; in the main manuscript, only tests surviving Bonferroni-correction across the three ROIs (marked in color) are considered significant.

elife-48182-supp1.docx (94.1KB, docx)
DOI: 10.7554/eLife.48182.019
Supplementary file 2. Estimating peak latencies.

The table shows means and standard deviations (in brackets) of peak latencies in ms for vertical location and category information in the main analyses (Figures 2 and 3). To estimate the reliability of peaks and onsets (Supplementary file 3) of location and category information in the key analyses, we conducted a bootstrapping analysis. For this analysis, we chose 100 samples of 20 randomly chosen datasets (with possible repetitions). For each random sample, we computed peak and onset latencies; we then averaged the peak and onset latencies across the 100 samples. Peak latencies were defined as the time of the highest beta estimate in the time course. Notably, the peak latency of vertical location information remained highly stable across analyses.

elife-48182-supp2.docx (31.4KB, docx)
DOI: 10.7554/eLife.48182.020
Supplementary file 3. Estimating onset latencies.

The table shows means and standard deviations (in brackets) of onset latencies in ms for vertical location and category information in the main analyses (Figures 2 and 3). Onset latencies were quantified using the bootstrapping logic explained above (Supplementary file 2). Onsets were defined by first computing TFCE statistics for each random sample, with multiple-comparison correction based on 1000 null distributions. The onset latency for each sample was then defined as the first occurrence of three consecutive time points reaching significance (p<0.05, corrected for multiple comparisons).

elife-48182-supp3.docx (31.2KB, docx)
DOI: 10.7554/eLife.48182.021
Transparent reporting form
DOI: 10.7554/eLife.48182.022
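The peak and onset bootstrapping described in Supplementary files 2 and 3 can be sketched as follows (a simplified Python illustration, not the study's code; the uncorrected one-sided t-test used for onsets here stands in for the TFCE-corrected statistics used in the paper):

    import numpy as np
    from scipy import stats

    def bootstrap_latencies(betas, times, n_samples=100, n_pick=20, seed=0):
        # betas: (n_datasets, n_timepoints) beta time courses; times: time axis in ms.
        # Draws bootstrap samples of n_pick datasets (with repetition), defines the
        # peak as the time of the highest mean beta and the onset as the first of
        # three consecutive significant time points.
        rng = np.random.default_rng(seed)
        peaks, onsets = [], []
        for _ in range(n_samples):
            sample = betas[rng.integers(0, betas.shape[0], size=n_pick)]
            mean_beta = sample.mean(axis=0)
            peaks.append(times[np.argmax(mean_beta)])
            t, p = stats.ttest_1samp(sample, 0, axis=0)
            sig = (t > 0) & (p / 2 < 0.05)  # one-sided, uncorrected (simplification)
            runs = np.convolve(sig.astype(int), np.ones(3, dtype=int), mode="valid")
            hits = np.flatnonzero(runs == 3)
            onsets.append(times[hits[0]] if hits.size else np.nan)
        return np.mean(peaks), np.std(peaks), np.nanmean(onsets), np.nanstd(onsets)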

Data availability

Data are publicly available on OSF (http://doi.org/10.17605/OSF.IO/H3G6V), as indicated in the Materials and Methods section of the manuscript.

The following dataset was generated:

Kaiser D, Turini J, Cichy RM. 2019. A neural mechanism for contextualizing fragmented information during naturalistic vision. Open Science Framework. 10.17605/OSF.IO/H3G6V

References

  1. Baldassano C, Esteva A, Fei-Fei L, Beck DM. Two distinct scene-processing networks connecting vision and memory. eNeuro. 2016;3:ENEURO.0178-16.2016. doi: 10.1523/ENEURO.0178-16.2016.
  2. Bar M. The proactive brain: memory for predictions. Philosophical Transactions of the Royal Society B: Biological Sciences. 2009;364:1235–1243. doi: 10.1098/rstb.2008.0310.
  3. Barlett FC. Remembering: A Study in Experimental and Social Psychology. Cambridge University Press; 1932.
  4. Battistoni E, Kaiser D, Hickey C, Peelen MV. The time course of spatial attention during naturalistic visual search. Cortex. 2018. doi: 10.1016/j.cortex.2018.11.018.
  5. Biederman I, Mezzanotte RJ, Rabinowitz JC. Scene perception: detecting and judging objects undergoing relational violations. Cognitive Psychology. 1982;14:143–177. doi: 10.1016/0010-0285(82)90007-X.
  6. Bonner MF, Epstein RA. Coding of navigational affordances in the human visual system. PNAS. 2017;114:4793–4798. doi: 10.1073/pnas.1618228114.
  7. Brainard DH. The psychophysics toolbox. Spatial Vision. 1997;10:433–436. doi: 10.1163/156856897X00357.
  8. Brewer WF, Treyens JC. Role of schemata in memory for places. Cognitive Psychology. 1981;13:207–230. doi: 10.1016/0010-0285(81)90008-6.
  9. Cichy RM, Pantazis D, Oliva A. Resolving human object recognition in space and time. Nature Neuroscience. 2014;17:455–462. doi: 10.1038/nn.3635.
  10. Cichy RM, Khosla A, Pantazis D, Torralba A, Oliva A. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports. 2016;6:27755. doi: 10.1038/srep27755.
  11. Cichy RM, Khosla A, Pantazis D, Oliva A. Dynamics of scene representations in the human brain revealed by magnetoencephalography and deep neural networks. NeuroImage. 2017;153:346–358. doi: 10.1016/j.neuroimage.2016.03.063.
  12. Cichy RM, Kaiser D. Deep neural networks as scientific models. Trends in Cognitive Sciences. 2019;23:305–317. doi: 10.1016/j.tics.2019.01.009.
  13. Contini EW, Wardle SG, Carlson TA. Decoding the time-course of object recognition in the human brain: from visual features to categorical decisions. Neuropsychologia. 2017;105:165–176. doi: 10.1016/j.neuropsychologia.2017.02.013.
  14. Davenport JL, Potter MC. Scene consistency in object and background perception. Psychological Science. 2004;15:559–564. doi: 10.1111/j.0956-7976.2004.00719.x.
  15. Dilks DD, Julian JB, Paunov AM, Kanwisher N. The occipital place area is causally and selectively involved in scene perception. Journal of Neuroscience. 2013;33:1331–1336. doi: 10.1523/JNEUROSCI.4081-12.2013.
  16. Dillon MR, Persichetti AS, Spelke ES, Dilks DD. Places in the brain: bridging layout and object geometry in scene-selective cortex. Cerebral Cortex. 2018;28:2365–2374. doi: 10.1093/cercor/bhx139.
  17. Epstein RA. Neural systems for visual scene recognition. In: Bar M, Kveraga K, editors. Scene Vision. MIT Press; 2014.
  18. Epstein R, Kanwisher N. A cortical representation of the local visual environment. Nature. 1998;392:598–601. doi: 10.1038/33402.
  19. Ernst MO, Banks MS. Humans integrate visual and haptic information in a statistically optimal fashion. Nature. 2002;415:429–433. doi: 10.1038/415429a.
  20. Ganaden RE, Mullin CR, Steeves JK. Transcranial magnetic stimulation to the transverse occipital sulcus affects scene but not object processing. Journal of Cognitive Neuroscience. 2013;25:961–968. doi: 10.1162/jocn_a_00372.
  21. Gandolfo M, Downing PE. Causal evidence for expression of perceptual expectations in category-selective extrastriate regions. Current Biology. 2019;29:2496–2500. doi: 10.1016/j.cub.2019.06.024.
  22. Groen IIA, Greene MR, Baldassano C, Fei-Fei L, Beck DM, Baker CI. Distinct contributions of functional and deep neural network features to representational similarity of scenes in human brain and behavior. eLife. 2018;7:e32962. doi: 10.7554/eLife.32962.
  23. Güçlü U, van Gerven MA. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience. 2015;35:10005–10014. doi: 10.1523/JNEUROSCI.5023-14.2015.
  24. Harel A, Groen II, Kravitz DJ, Deouell LY, Baker CI. The temporal dynamics of scene processing: a multifaceted EEG investigation. eNeuro. 2016;3:ENEURO.0139-16.2016. doi: 10.1523/ENEURO.0139-16.2016.
  25. Haxby JV, Gobbini MI, Furey ML, Ishai A, Schouten JL, Pietrini P. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science. 2001;293:2425–2430. doi: 10.1126/science.1063736.
  26. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 770–778.
  27. Henderson JM. Gaze control as prediction. Trends in Cognitive Sciences. 2017;21:15–23. doi: 10.1016/j.tics.2016.11.003.
  28. Henriksson L, Mur M, Kriegeskorte N. Faciotopy-A face-feature map with face-like topology in the human occipital face area. Cortex. 2015;72:156–167. doi: 10.1016/j.cortex.2015.06.030.
  29. Henriksson L, Mur M, Kriegeskorte N. Rapid invariant encoding of scene layout in human OPA. Neuron. 2019;103:161–171. doi: 10.1016/j.neuron.2019.04.014.
  30. Julian JB, Fedorenko E, Webster J, Kanwisher N. An algorithmic method for functionally defining regions of interest in the ventral visual pathway. NeuroImage. 2012;60:2357–2364. doi: 10.1016/j.neuroimage.2012.02.055.
  31. Julian JB, Ryan J, Hamilton RH, Epstein RA. The occipital place area is causally involved in representing environmental boundaries during navigation. Current Biology. 2016;26:1104–1109. doi: 10.1016/j.cub.2016.02.066.
  32. Kaiser D, Stein T, Peelen MV. Object grouping based on real-world regularities facilitates perception by reducing competitive interactions in visual cortex. PNAS. 2014;111:11217–11222. doi: 10.1073/pnas.1400559111.
  33. Kaiser D, Oosterhof NN, Peelen MV. The neural dynamics of attentional selection in natural scenes. Journal of Neuroscience. 2016;36:10522–10528. doi: 10.1523/JNEUROSCI.1385-16.2016.
  34. Kaiser D, Quek GL, Cichy RM, Peelen MV. Object vision in a structured world. Trends in Cognitive Sciences. 2019a;23:672–685. doi: 10.1016/j.tics.2019.04.013.
  35. Kaiser D, Häberle G, Cichy RM. Cortical sensitivity to natural scene structure. bioRxiv. 2019b. doi: 10.1101/613885.
  36. Kamps FS, Julian JB, Kubilius J, Kanwisher N, Dilks DD. The occipital place area represents the local elements of scenes. NeuroImage. 2016;132:417–424. doi: 10.1016/j.neuroimage.2016.02.062.
  37. Kant I. Kritik der reinen Vernunft. Johann Friedrich Hartknoch; 1781.
  38. Katti H, Peelen MV, Arun SP. Machine vision benefits from human contextual expectations. Scientific Reports. 2019;9:2112. doi: 10.1038/s41598-018-38427-0.
  39. Kersten D, Mamassian P, Yuille A. Object perception as Bayesian inference. Annual Review of Psychology. 2004;55:271–304. doi: 10.1146/annurev.psych.55.090902.142005.
  40. Kriegeskorte N, Mur M, Bandettini P. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience. 2008;2:4. doi: 10.3389/neuro.06.004.2008.
  41. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems; 2012. pp. 1097–1105.
  42. Lowe MX, Rajsic J, Gallivan JP, Ferber S, Cant JS. Neural representation of geometry and surface properties in object and scene perception. NeuroImage. 2017;157:586–597. doi: 10.1016/j.neuroimage.2017.06.043.
  43. Lowe MX, Rajsic J, Ferber S, Walther DB. Discriminating scene categories from brain activity within 100 milliseconds. Cortex. 2018;106:275–287. doi: 10.1016/j.cortex.2018.06.006.
  44. Malcolm GL, Groen IIA, Baker CI. Making sense of real-world scenes. Trends in Cognitive Sciences. 2016;20:843–856. doi: 10.1016/j.tics.2016.09.003.
  45. Mandler JM. Stories, Scripts and Scenes: Aspects of Schema Theory. Taylor & Francis; 1984.
  46. Mandler JM, Johnson NS. Some of the thousand words a picture is worth. Journal of Experimental Psychology: Human Learning and Memory. 1976;2:529–540. doi: 10.1037/0278-7393.2.5.529.
  47. Mandler JM, Parker RE. Memory for descriptive and spatial information in complex pictures. Journal of Experimental Psychology: Human Learning and Memory. 1976;2:38–48. doi: 10.1037/0278-7393.2.1.38.
  48. Minsky M. A framework for representing knowledge. In: Winston P, editor. The Psychology of Computer Vision. McGraw-Hill; 1975.
  49. Oliva A, Torralba A. Modelling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision. 2001;42:145–175. doi: 10.1023/A:1011139631724.
  50. Oostenveld R, Fries P, Maris E, Schoffelen JM. FieldTrip: open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Computational Intelligence and Neuroscience. 2011;2011:1–9. doi: 10.1155/2011/156869.
  51. Oosterhof NN, Connolly AC, Haxby JV. CoSMoMVPA: multi-modal multivariate pattern analysis of neuroimaging data in Matlab/GNU Octave. Frontiers in Neuroinformatics. 2016;10:27. doi: 10.3389/fninf.2016.00027.
  52. Peelen MV, Downing PE. Category selectivity in human visual cortex: beyond visual object recognition. Neuropsychologia. 2017;105:177–183. doi: 10.1016/j.neuropsychologia.2017.03.033.
  53. Piaget J. The Language and Thought of the Child. Kegan Paul, Trench, Trubner & Co; 1926.
  54. Proklova D, Kaiser D, Peelen MV. Disentangling representations of object shape and object category in human visual cortex: the animate-inanimate distinction. Journal of Cognitive Neuroscience. 2016;28:680–692. doi: 10.1162/jocn_a_00924.
  55. Proklova D, Kaiser D, Peelen MV. MEG sensor patterns reflect perceptual but not categorical similarity of animate and inanimate objects. NeuroImage. 2019;193:167–177. doi: 10.1016/j.neuroimage.2019.03.028.
  56. Rumelhart DE. Schemata: the building blocks of cognition. In: Spiro RJ, editor. Theoretical Issues in Reading Comprehension. CRC Press; 1980.
  57. Smith SM, Nichols TE. Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. NeuroImage. 2009;44:83–98. doi: 10.1016/j.neuroimage.2008.03.061.
  58. Stein T, Kaiser D, Peelen MV. Interobject grouping facilitates visual awareness. Journal of Vision. 2015;15:10. doi: 10.1167/15.8.10.
  59. Torralba A, Oliva A, Castelhano MS, Henderson JM. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review. 2006;113:766–786. doi: 10.1037/0033-295X.113.4.766.
  60. Torralba A, Oliva A. Statistics of natural image categories. Network: Computation in Neural Systems. 2003;14:391–412. doi: 10.1088/0954-898X_14_3_302.
  61. Vedaldi A, Lenc K. MatConvNet – convolutional neural networks for Matlab. Proceedings of the ACM International Conference on Multimedia; 2015.
  62. Võ ML, Boettcher SE, Draschkow D. Reading scenes: how scene grammar guides attention and aids perception in real-world environments. Current Opinion in Psychology. 2019;29:205–210. doi: 10.1016/j.copsyc.2019.03.009.
  63. Walther DB, Caddigan E, Fei-Fei L, Beck DM. Natural scene categories revealed in distributed patterns of activity in the human brain. Journal of Neuroscience. 2009;29:10573–10581. doi: 10.1523/JNEUROSCI.0559-09.2009.
  64. Wang L, Mruczek RE, Arcaro MJ, Kastner S. Probabilistic maps of visual topography in human cortex. Cerebral Cortex. 2015;25:3911–3931. doi: 10.1093/cercor/bhu277.
  65. Wen H, Shi J, Chen W, Liu Z. Deep residual network predicts cortical representation and organization of visual features for rapid categorization. Scientific Reports. 2018;8:3752. doi: 10.1038/s41598-018-22160-9.
  66. Wolfe JM, Võ ML, Evans KK, Greene MR. Visual search in scenes involves selective and nonselective pathways. Trends in Cognitive Sciences. 2011;15:77–84. doi: 10.1016/j.tics.2010.12.001.

Decision letter

Editor: Huan Luo

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "A neural mechanism for contextualizing fragmented inputs during naturalistic vision" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Joshua Gold as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

This work combined fMRI, EEG, and deep neural network modeling to investigate the neural basis of the encoding of fragment location information (e.g., vertical and horizontal position) across different scene categories. It addresses an important question about how 'abstract' spatial layout that is not explicitly present in the stimulus itself is represented in neural activity. The paper is clearly written, and the approach and results are interesting and novel. However, the reviewers have raised some major issues that the authors need to address with additional analyses.

For a revision to be successful, you must address the following major issues:

Essential revisions:

1) The thirty-six figure segments come from only six natural images (3 indoor and 3 outdoor), which means the figure fragments would be repeatedly presented and be learned or memorized gradually. It is therefore hard to distinguish two interpretations – do the results reflect a true representation of spatial layout knowledge that would be automatically formed and could generalize to any natural images or do they derive from a learning and familiarization process after repeated exposure? A possible way to assess this is to divide the data into various stages and compare the early- and late-stage results. If the spatial layout knowledge is automatically represented regardless of learning, we would expect to see the same results in the early part. On the other hand, if it is indeed learning or memory process that induces the results, we would expect to see the pattern only in the late part but not in the early part.

2) Each scene picture was split into two halves horizontally and three parts vertically. Thus, there are confounding factors with regards to why the effect only occurred for vertical locations but not for horizontal locations. The authors should either collect new data or perform new analysis to address the issue.

3) The authors used DNN regression to confirm that the vertical position effect is not due to category-related information. However, the involvement of low-level features in discriminating vertical locations is still quite possible and could not be completely ruled out from the current analysis. For example, image segments at different vertical locations of natural scenes (upper, middle, lower) seem to be also associated with different low-level features (e.g., low spatial frequency for upper part, such as sky or ceiling, etc.). The authors could add additional analysis to clarify the confounds, for example, by creating a low-level dissimilarity design matrix, which would then predict involvement of V1 for the low-level features but not for vertical location, while the reverse for OPA.

4) It is hard to understand what exactly DNN features were removed. ResNet50 and AlexNet were used widely but these DNN models were trained by a very large set of images, whereas the present study only compares 6 specific images. The specific features to differentiate the 6 specific images may not be the same as the removed DNN features. The authors could show reconstructed images with DNN features removed and it is quite possible that human observers would still differentiate the 6 reconstructed images even when the DNN features are removed.

5) It is difficult to figure out what the authors were arguing was the mechanism or the consequence: Does knowledge/schema help sort incomplete information, or does the brain sort incomplete information so that we can extract knowledge? The paper is motivated by the former ("…the brain uses prior knowledge about where information typically appears in a scene to meaningfully sort incoming information") but then ends by stating the latter ("This mechanism empowers the visual brain to efficiently extract meaning from dynamic real-world environments, where it is confronted with sequences of incomplete visual snapshots"). Please add clarifications.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "A neural mechanism for contextualizing fragmented inputs during naturalistic vision" for further consideration at eLife.

The manuscript has been improved but there was still one issue that was misunderstood or not clearly addressed.

In the original third comment, the reviewer pointed out that the DNN regression analysis cannot convincingly rule out the involvement of low-level properties in the vertical position effect and suggested directly testing the DNN features in V1 ("The authors could add additional analysis to clarify the confounds, for example, by creating a low-level dissimilarity design matrix, which would then predict involvement of V1 for the low-level features but not for vertical location, while the reverse for OPA."). The authors have performed additional analyses using three low-level control models, but again only examined whether the vertical position effect persisted after regressing out these models. The results are not entirely convincing, and it seems that the regression could not even disrupt category representations, as the previous DNN model did. To explicitly address the concern, as the reviewer originally suggested, the authors should perform a new analysis showing that the DNN model does account for low-level property representation in V1, which would then support the claim that DNN regression can remove low-level features.


Author response


Essential revisions:

1) The thirty-six figure segments come from only six natural images (3 indoor and 3 outdoor), which means the figure fragments would be repeatedly presented and be learned or memorized gradually. It is therefore hard to distinguish two interpretations – do the results reflect a true representation of spatial layout knowledge that would be automatically formed and could generalize to any natural images or do they derive from a learning and familiarization process after repeated exposure? A possible way to assess this is to divide the data into various stages and compare the early- and late-stage results. If the spatial layout knowledge is automatically represented regardless of learning, we would expect to see the same results in the early part. On the other hand, if it is indeed learning or memory process that induces the results, we would expect to see the pattern only in the late part but not in the early part.

This is an important point. We deliberately chose a small number of scene fragments for the experiment to be able to compute reliable neural RDMs from multiple image presentations. We are aware that this choice poses a limitation to the current study, and we now explicitly acknowledge this limitation in the Discussion section:

“As the current study is limited to a small set of scenes, more research is needed to explore whether schema-based coding generalizes to more diverse contents.”

Given the limited number of stimuli, it is in principle possible that the fragments’ location-specific representations only emerged after many stimulus repetitions across the experiment. To exclude this possibility, we performed the suggested analysis: we re-ran the main analysis separately on data from the first and second halves of both experiments (fMRI: first three versus last three runs; EEG: first versus second half of trials). Critically, we found a very similar pattern of results, with no statistical differences between the first and second halves of either experiment, suggesting that the effect cannot be explained by learning over the course of the experiment. The results of this analysis are reported in Figure 2—figure supplement 6.
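A minimal sketch of such a split-half control analysis is given below (illustrative Python, not the study's code; function and variable names are hypothetical, and the per-subject neural RDMs are assumed to be precomputed lower-triangle vectors):

    import numpy as np
    from scipy import stats

    def location_beta(neural_rdm, location_rdm):
        # Regress the vertical-location predictor RDM onto a vectorized neural RDM
        # and return its coefficient.
        X = np.column_stack([np.ones(len(neural_rdm)), location_rdm])
        coefs, *_ = np.linalg.lstsq(X, neural_rdm, rcond=None)
        return coefs[1]

    def split_half_comparison(rdms_half1, rdms_half2, location_rdm):
        # rdms_half1 / rdms_half2: per-subject neural RDM vectors computed from the
        # first and second half of the experiment, respectively.
        b1 = np.array([location_beta(r, location_rdm) for r in rdms_half1])
        b2 = np.array([location_beta(r, location_rdm) for r in rdms_half2])
        return (stats.ttest_1samp(b1, 0),   # effect in the first half
                stats.ttest_1samp(b2, 0),   # effect in the second half
                stats.ttest_rel(b1, b2))    # difference between halves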

2) Each scene picture was split into two halves horizontally and three parts vertically. Thus, there are confounding factors with regards to why the effect only occurred for vertical locations but not for horizontal locations. The authors should either collect new data or perform new analysis to address the issue.

We have now performed analyses that examined all pairwise comparisons along the vertical axis (i.e., top versus bottom, top versus middle, and middle versus bottom), so that each analysis included fragments from only two vertical locations. These analyses reveal a consistent vertical location organization, replicating the overall effect for each pairwise comparison, while suggesting that some comparisons (those including the top fragments) may contribute more strongly to the effect. Results from these analyses are reported in Figure 2—figure supplement 7.

3) The authors used DNN regression to confirm that the vertical position effect is not due to category-related information. However, the involvement of low-level features in discriminating vertical locations is still quite possible and could not be completely ruled out from the current analysis. For example, image segments at different vertical locations of natural scenes (upper, middle, lower) seem to be also associated with different low-level features (e.g., low spatial frequency for upper part, such as sky or ceiling, etc.). The authors could add additional analysis to clarify the confounds, for example, by creating a low-level dissimilarity design matrix, which would then predict involvement of V1 for the low-level features but not for vertical location, while the reverse for OPA.

Thank you for this suggestion. Although regressing out deep neural network features should already control for low-level features to a substantial degree (previous research has shown that early DNN layers correspond well with early visual processing and activations in V1; see Cichy et al., 2016; Güçlü and van Gerven, 2015), we have now added new analyses using three additional models that explicitly control for low-level features: a pixel dissimilarity model, GIST descriptors (Oliva and Torralba, 2001), and V1 dissimilarity (as a neural approximation of low-level features). We re-ran our regression analyses after regressing out the RDMs obtained from each of these low-level models. These analyses show that the low-level models also could not explain the fragments’ vertical location organization; neither could they explain their categorical organization. This indicates that schematic coding cannot be accounted for by low-level features. The new control analyses are summarized in Figure 3—figure supplement 2.
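The logic of these control analyses can be illustrated with a short Python sketch (a simplification, not the study's code; the control RDMs stand for pixel, GIST, or V1 dissimilarity matrices, all assumed to be vectorized lower triangles):

    import numpy as np

    def residualize(neural_rdm, control_rdms):
        # Remove variance explained by control model RDMs (pixel dissimilarity,
        # GIST descriptors, or V1 dissimilarity) from a vectorized neural RDM.
        X = np.column_stack([np.ones(len(neural_rdm))] + list(control_rdms))
        coefs, *_ = np.linalg.lstsq(X, neural_rdm, rcond=None)
        return neural_rdm - X @ coefs

    def location_effect_after_control(neural_rdm, location_rdm, control_rdms):
        # Test whether the vertical-location predictor still explains the neural
        # RDM once the low-level control models have been regressed out.
        resid = residualize(neural_rdm, control_rdms)
        X = np.column_stack([np.ones(len(resid)), location_rdm])
        coefs, *_ = np.linalg.lstsq(X, resid, rcond=None)
        return coefs[1]  # per-subject coefficient, later tested against zero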

4) It is hard to understand what exactly DNN features were removed. ResNet50 and AlexNet were used widely but these DNN models were trained by a very large set of images, whereas the present study only compares 6 specific images. The specific features to differentiate the 6 specific images may not be the same as the removed DNN features. The authors could show reconstructed images with DNN features removed and it is quite possible that human observers would still differentiate the 6 reconstructed images even when the DNN features are removed.

This is an interesting point. First, we would like to point out that the reason we used categorization DNNs here was primarily to have a good measure of features that are routinely extracted for successful categorization, which includes low-level image properties as well as high-level category-defining features. It is true that the DNNs used here were trained to extract these features from large image databases, of whose properties the few images used in the study only capture a small fraction. However, it is worth noting that this qualitatively resembles processing in the visual system: The types of features the visual system routinely extracts are defined by excessive experience with large varieties of inputs, but at any single moment, we only have a very limited amount of input from which we extract a subset of these features. We now stress the reasons for using pre-trained DNNs in the Materials and methods section:

“DNNs are similar to the brain as they are trained using excessive training material while dynamically adjusting the “tuning” of their connections. […] By comparing DNNs activations and brain responses to the scene fragments, we could quantify to which extent features routinely extracted for categorization purposes account for schema-based coding in the human visual system.”

Second, we would certainly predict that after removing DNN features from the images, participants would still be able to visually discriminate between the images. We conceive of this as a feature rather than as a problem – ultimately the location information observed in brain responses must stem from visual features of the images. The key finding here is that the feature organization in DNN models, despite their similarity to visual cortex representations, cannot account for the schematic coding observed in the brain. We now bring up this point in the Discussion:

“While DNNs trained on categorization accurately capture the representational divide into different scene categories, they cannot explain the schema-based organization observed in the human visual system. Although this does not mean that visual features extracted by DNN models in principle are incapable of explaining schema-based brain representations, our results highlight that current DNN models of categorization do not use real-world structure in similar ways as the human brain.”

Finally, we agree that the current analysis does not provide information on the exact features accounted for by the DNN. Reconstruction images after controlling for DNN features could indeed provide a useful avenue for delineating the features that are uniquely extracted by the DNN and scene-selective cortex, respectively. However, we think that this is beyond the scope of the current paper, because reliably defining these features would ultimately require a larger, more diverse set of images and the acquisition of new experimental data.
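For readers who want to see how layer-wise feature RDMs can be derived from a pretrained categorization DNN, a rough Python sketch follows (the study itself used MatConvNet in Matlab; the torchvision AlexNet weights and placeholder image file names below are illustrative assumptions, not the study's actual models or stimuli):

    import numpy as np
    import torch
    from torchvision import models, transforms
    from PIL import Image
    from scipy.spatial.distance import pdist, squareform

    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
    prep = transforms.Compose([
        transforms.Resize((224, 224)), transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

    activations = {}
    def make_hook(name):
        def hook(module, inputs, output):
            activations[name] = output.detach().flatten().numpy()
        return hook

    for i, layer in enumerate(net.features):  # hook each convolutional stage
        layer.register_forward_hook(make_hook(f"features.{i}"))

    fragment_files = ["fragment_01.png", "fragment_02.png"]  # placeholder paths
    layer_patterns = {}
    with torch.no_grad():
        for path in fragment_files:
            net(prep(Image.open(path).convert("RGB")).unsqueeze(0))
            for name, act in activations.items():
                layer_patterns.setdefault(name, []).append(act)

    # one correlation-distance RDM per layer, usable as a predictor or control model
    dnn_rdms = {name: squareform(pdist(np.stack(acts), metric="correlation"))
                for name, acts in layer_patterns.items()}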

5) It is difficult to figure out what the authors were arguing was the mechanism or the consequence: Does knowledge/schema help sort incomplete information, or does the brain sort incomplete information so that we can extract knowledge? The paper is motivated by the former ("…the brain uses prior knowledge about where information typically appears in a scene to meaningfully sort incoming information") but then ends by stating the latter ("This mechanism empowers the visual brain to efficiently extract meaning from dynamic real-world environments, where it is confronted with sequences of incomplete visual snapshots"). Please add clarifications.

The ending did indeed not reflect our interpretation very clearly. We re-worded the concluding paragraph to make it more consistent with the Introduction:

“To conclude, our findings provide the first spatiotemporal characterization of a neural mechanism for contextualizing fragmented visual inputs. By rapidly organizing visual information according to its typical role in the world, this mechanism may contribute to the optimal use of perceptual information for guiding efficient real-world behaviors, even when sensory inputs are incomplete or dynamically changing.”

[Editors' note: further revisions were requested prior to acceptance, as described below.]

The manuscript has been improved but there was still one issue that was misunderstood or not clearly addressed.

In the original third comment, the reviewer pointed out that the DNN regression analysis cannot convincingly rule out the involvement of low-level properties in the vertical position effect and suggested directly testing the DNN features in V1 ("The authors could add additional analysis to clarify the confounds, for example, by creating a low-level dissimilarity design matrix, which would then predict involvement of V1 for the low-level features but not for vertical location, while the reverse for OPA."). The authors have performed additional analyses using three low-level control models, but again only examined whether the vertical position effect persisted after regressing out these models. The results are not entirely convincing, and it seems that the regression could not even disrupt category representations, as the previous DNN model did. To explicitly address the concern, as the reviewer originally suggested, the authors should perform a new analysis showing that the DNN model does account for low-level property representation in V1, which would then support the claim that DNN regression can remove low-level features.

Thanks for giving us the opportunity to address the remaining comment in more detail. If we understand correctly, the main worry is that the DNN models used in our study are not good models of low-level feature coding in cortex (i.e., they do not match well with V1-level representations), and thus regressing out DNN features is not a convincing way of ruling out low-level features as the explanation behind the main effect of interest, that is, the vertical location effect.

We strongly agree that controlling for low-level features is an important issue in the current study. We see that in the previous revision we did not make sufficiently clear the comprehensive set of controls for low-level features we employed, of which DNN feature regression is only one. We therefore lay out our perspective in detail below and have made changes to the manuscript that make our argumentation more explicit.

Our evidence against low-level features driving the main effect of interest is as follows. Most directly, we show that cortical area V1 does not exhibit vertical location effects. The absence of the effect of interest in the core cortical region coding for low-level features is often taken as strong evidence against an involvement of low-level features, and this reasoning applies in our study as well. We make this point explicit in the current version of the manuscript:

“To efficiently support vision in dynamic natural environments, schematic coding needs to be flexible with respect to visual properties of specific scenes. The absence of vertical location effects in V1 indeed highlights that schematic coding is not tied to the analysis of simple visual features.”

Naturally, we agree that more comprehensive and corroborative evidence is desirable. Therefore, we go beyond this single piece of evidence with a series of control analyses.

First, we show that taking only the visually very different indoor and outdoor scenes into account does not remove the vertical location effect. Thus, it is unlikely that the vertical location effect simply reflects low-level features.

Second, we remove DNN features (or features extracted by low-level control models such as the GIST descriptor) and find that this does not abolish the vertical location effects. The reviewer questions in particular whether our DNN model accurately captures low-level feature representations. We believe that it is a fair model – we show that the DNN does explain variance in all regions examined, including V1 (Figure 3—figure supplement 2).

The reviewer further notes that the V1 data (category and vertical location effects) look somewhat comparable before and after regressing out the control models and takes this as suggesting that these are bad models of V1. However, given that we show that the DNN does explain variance in V1, this is also parsimoniously explained by there being no meaningful category or location organization in the V1 data to begin with.

That being said, we fully agree with the reviewer that although the DNN used might be a fair model of V1-level features, it is certainly not the best possible model. Thus, one could still assume that if DNNs approximated low-level feature representations in V1 more faithfully (with more variance explained), removing these DNNs' features might abolish the vertical location organization in higher-level cortex (i.e., the OPA), too.

To exclude this possibility (i.e., that it's simply a very good match to low-level feature representations in V1 that's needed to remove the OPA organization), we conducted a third control analysis. In this analysis, we used the best possible match to the neural V1 data as a low-level model – the V1 data itself! Naturally, the empirical neural V1 data explains the data in V1 (Figure 3—figure supplement 3). Crucially, after removing the neural organization in V1, the vertical location effect in OPA still persisted, showing that even if we had a model that predicted V1 exceptionally well (as well as V1 predicts itself), this model could not account for the vertical location effect.

We believe that this analysis already offers both properties that the reviewer asks for: a very good model of V1-level representation (arguably the best possible match to the neural organization in V1), and a strong vertical location effect in higher-level regions when the model is regressed out.

Together, we thus believe that the remaining open question boils down to how well DNNs can, in principle, model V1 organization. This is an interesting topic, which surely needs investigation – but from our point of view without immediate implications for the current paper. After all, we empirically show that even if the DNN perfectly captured the V1 organization, our conclusions would hold.

We laid out this argument in the revised manuscript:

“DNN features are a useful control for flexibility towards visual features, because they cover both low-level and high-level visual features, explaining variance across fMRI regions and across EEG processing time (see Figure 3—figure supplement 2; see also Cichy et al., 2016; Güçlü and van Gerven, 2015). […] Together, these results provide converging evidence that low-level feature processing cannot explain the schematic coding effects reported here.”
