PLOS Computational Biology. 2023 Oct 2;19(10):e1011506. doi: 10.1371/journal.pcbi.1011506

Mouse visual cortex as a limited resource system that self-learns an ecologically-general representation

Aran Nayebi 1,2,3,*,#, Nathan C L Kong 1,4,5,*,#, Chengxu Zhuang 1,3,4, Justin L Gardner 1,4, Anthony M Norcia 1,4, Daniel L K Yamins 1,4,6
Editor: Tim Christian Kietzmann
PMCID: PMC10569538  PMID: 37782673

Abstract

Studies of the mouse visual system have revealed a variety of visual brain areas that are thought to support a multitude of behavioral capacities, ranging from stimulus-reward associations, to goal-directed navigation, and object-centric discriminations. However, an overall understanding of mouse visual cortex, and of how it supports this range of behaviors, is still lacking. Here, we take a computational approach to help address these questions, providing a high-fidelity quantitative model of mouse visual cortex and identifying key structural and functional principles underlying that model’s success. Structurally, we find that a comparatively shallow network structure with a low-resolution input is optimal for modeling mouse visual cortex. Our main finding is functional: models trained with task-agnostic, self-supervised objective functions based on the concept of contrastive embeddings are much better matches to mouse cortex than models trained on supervised objectives or alternative self-supervised methods. This result is quite unlike in primates, where prior work showed the two to be roughly equivalent, and it naturally raises the question of why these self-supervised objectives are better matches than supervised ones in mouse. To this end, we show that the self-supervised, contrastive objective builds a general-purpose visual representation that enables the system to achieve better transfer on out-of-distribution visual scene understanding and reward-based navigation tasks. Our results suggest that mouse visual cortex is a low-resolution, shallow network that makes best use of the mouse’s limited resources to create a light-weight, general-purpose visual system, in contrast to the deep, high-resolution, and more categorization-dominated visual system of primates.

Author summary

Studies of mouse visual behavior have revealed a multitude of visual abilities, ranging from stimulus-reward associations, to goal-directed navigation, and object-centric discriminations. A principled system-wide model of mouse visual cortex would be useful both in providing an organizing theory for this wide spectrum of behaviors, and enabling practical technology for many model-driven studies of mouse neuroscience more broadly. However, the standard approach to creating quantitatively accurate models of primate visual cortex has been less successful with the mouse system. Here we identify critical computational features needed to capture mouse-specific neural stimulus-response patterns, and illustrate how these features can be interpreted as giving the highly resource-limited mouse brain a comparative advantage in self-learning a task-general visual representation.

Introduction

The mouse has become an indispensable model organism in systems neuroscience, allowing unprecedented genetic and experimental control at the level of cell-type specificity in individual circuits [1]. Moreover, studies of mouse visual behavior have revealed a multitude of abilities, ranging from stimulus-reward associations, to goal-directed navigation, and object-centric discriminations [2]. Thus, the mouse animal model bridges fine-grained, low-level experimental control with high-level behavior. Understanding the principles underlying the structure and function of the mouse visual system, and how it relates to more commonly studied visual species such as the macaque, is therefore important.

Computational models are an effective tool for understanding how mouse visual cortex is capable of supporting such behaviors and for providing a normative account of its structure and function. They allow us to identify the key ingredients leading to the model with the best quantitative agreement with the neural data. We can also assess the functional similarities and differences between rodent and primate visual cortex, which would otherwise be hard to capture, since prior literature offers little beyond the observation that the higher visual areas past V1 fail to be homologous across the two species. Furthermore, these models provide a natural starting point for understanding higher-level processing downstream of the visual system, such as memory and its role during the navigation of rich visual environments [3–7]. Without an explicit model of a visual system, it is difficult to disentangle the contributions to neural response variance of the visual system from those of higher-level cognitive phenomena. Understanding the computations underlying higher cognition and motor control in rodents will therefore critically depend on an understanding of the upstream sensory areas on which they rely.

Deep convolutional neural networks (CNNs) are a class of models that have had success as predictive models of the human and non-human primate ventral visual stream (e.g., [8–13]). In contrast, such models have been poor predictors of neural responses in mouse visual cortex [14, 15]. Our hypothesis is that this failure can be understood through, and remedied by, the goal-driven modeling approach [16]. This approach posits that normative models in neuroscience should pay careful attention to the objective functions (i.e., behavior), architectures (i.e., neural circuit), and data stream (i.e., visual input). These structural and functional ingredients should be finely tuned to the biology and the ecology of the organism under study.

In this work, we build a substantially improved model of mouse visual cortex by better aligning the objective function, architecture, and visual input with those of the mouse visual system. Firstly, from an objective function point of view, the primate ventral stream models are trained in a supervised manner on ImageNet [17, 18], an image set containing over one million images belonging to one thousand, mostly human-relevant, semantic categories [19]. While such a dataset is an important technical tool for machine learning, it is highly implausible as a biological model, particularly for rodents, who do not receive such category labels over development. Instead, we find that models trained using self-supervised, contrastive algorithms provide the best correspondence to mouse visual responses. Interestingly, this situation is different than in primates, where prior work showed that the two were roughly equivalent [20]. Secondly, in terms of the architecture, these primate ventral visual stream models are too deep to be plausible models of the mouse visual system, since mouse visual cortex is known to be more parallel and much shallower than primate visual cortex [21–24]. By varying the number of linear-nonlinear layers in the models, we find that models with fewer linear-nonlinear layers can achieve neural predictivity performance that is better than or on par with very deep models. Finally, mice are known to have lower visual acuity than primates [25, 26], suggesting that the resolution of the inputs to mouse models should be lower than that of the inputs to primate models. Indeed, we find that model fidelity can be improved by training models on lower-resolution images. Ultimately, the confluence of these ingredients leads to a model, the first four layers of a contrastively trained AlexNet ("Contrastive AlexNet"), that provides the best match to mouse visual cortex to date.

We then address, from an ecological point of view, why Contrastive AlexNet achieves better neural predictivity, focusing in particular on the role of contrastive, self-supervised learning, which is not predicted by known physiology and behavioral experiments in mouse. To address this question, we use Contrastive AlexNet to assess out-of-distribution generalization from its original training environment, including using the visual encoder as the front-end of a bio-mechanically realistic virtual rodent operating in an environment that supports spatially-extended, reward-based navigation. We show that the visual representations of this model lead to improved transfer performance over its supervised counterpart across environments, illustrating the congruence between task-transfer performance and improved neural fidelity of the computational model.

Taken together, our best models of the mouse visual system suggest that it is a shallower, general-purpose system operating on comparatively low-resolution inputs. These identified factors therefore provide interpretable insight into the confluence of constraints that may have given rise to the system in the first place, suggesting that these factors were crucially important given the ecological niche in which the mouse is situated, and the resource limitations to which it is subject.

Results

Determining the animal-to-animal mapping transform

Prior models of mouse visual cortex can be improved by varying three ingredients to better match the biology and the ecology of mouse visual cortex. Before model development, however, we must determine the appropriate procedure by which to evaluate models. As in prior work on modeling primate visual cortex, we “map” model responses to biological responses and the ability of the model responses to recapitulate biological responses determines the model’s neural fidelity [8, 9, 17, 20].

How should artificial neural network responses be mapped to biological neural responses? What firing patterns of mouse visual areas are common across multiple animals, and thus worthy of computational explanation? A natural approach would be to map artificial neural network features to mouse neural responses in the same manner that different animals can be mapped to each other. Specifically, we aimed to identify the best performing class of similarity transforms needed to map the firing patterns of one animal’s neural population to that of another, which we denote as the “inter-animal consistency”. We took inspiration from methods that have proven useful in modeling human and non-human primate visual, auditory, and motor cortices [16, 27–29]. As with other cortical areas, this transform class likely cannot be so strict as to require fixed neuron-to-neuron mappings between cells. However, the transform class for each visual area also cannot be so loose as to allow an unconstrained nonlinear mapping, since the model already yields an image-computable nonlinear response.

We explored a variety of linear mapping transform classes (fit with different constraints) between the population responses for each mouse visual area (Fig 1A). The mouse visual responses to natural scenes were collected previously using both two-photon calcium imaging and Neuropixels by the Allen Institute [15, 22] from areas V1 (VISp), LM (VISl), AL (VISal), RL (VISrl), AM (VISam), and PM (VISpm) in mouse visual cortex (see number of units and specimens for each dataset in Table 1 and further details in the “Neural Response Datasets” section).

Fig 1. Inter-animal neural response consistency across mouse visual areas.


A. Inter-animal consistency was computed using different linear maps, showing that PLS regression provides the highest consistency. Horizontal bars at the top are the median and s.e.m. of the internal consistencies of the neurons in each visual area. Refer to Table 1 for N units per visual area. B. The fraction of maximum split-half reliability is plotted as a function of time (in 10-ms time bins) for each visual area.

Table 1. Descriptive statistics of the neural datasets.

Total number of units and specimens for each visual area for the calcium imaging and Neuropixels datasets.

Dataset Type      Visual Area   Total Units   Total Specimens
Calcium Imaging   V1            7080          29
Calcium Imaging   LM            4393          24
Calcium Imaging   AL            2064          9
Calcium Imaging   RL            1116          8
Calcium Imaging   AM            847           9
Calcium Imaging   PM            1844          19
Neuropixels       V1            442           8
Neuropixels       LM            162           6
Neuropixels       AL            396           6
Neuropixels       RL            299           7
Neuropixels       AM            257           7
Neuropixels       PM            175           5

We focused on the natural scene stimuli, consisting of 118 images, each presented 50 times (i.e., 50 trials per image). For all methods, the corresponding mapping was trained on 50% of all the natural scene images, and evaluated on the remaining held-out set of images. We also included representational similarity analyses (RSA; [30]) as a baseline measure of population-wide similarity across animals, corresponding to no selection of individual units, unlike the other mapping transforms. For the strictest mapping transform (One-to-One), each target unit was mapped to the single most correlated unit in the source animal. Overall, the One-to-One mapping tended to yield the lowest inter-animal consistency among the maps considered. However, Ridge regression (L2-regularized) and Partial Least Squares (PLS) regression were more effective at the inter-animal mapping, yielding the most consistent fits across visual areas, with PLS regression providing the highest inter-animal consistency. We therefore used PLS regression in the evaluation of a candidate model’s ability to predict neural responses. This mapping transform confers the additional benefit of enabling direct comparison to prior primate ventral stream results (which also used this mapping [8, 17]) in order to better understand ecological differences between the two visual systems across species.
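To make the mapping procedure concrete, the following is a minimal sketch of fitting a PLS map from one animal’s population to another’s on a 50/50 train/test split of the 118 images, assuming trial-averaged responses are stored as (images × units) NumPy arrays; the random arrays, shapes, and helper names are illustrative placeholders, and the 25-component setting is taken from the Methods.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
# Illustrative trial-averaged responses to the 118 natural scenes: (n_images, n_units).
source = rng.standard_normal((118, 300))   # source animal's population
target = rng.standard_normal((118, 250))   # target animal's population

# 50/50 train/test split over images, as described in the text.
idx = rng.permutation(118)
train, test = idx[:59], idx[59:]

# Fit a PLS map from source units to target units on the training images.
pls = PLSRegression(n_components=25, scale=False)
pls.fit(source[train], target[train])
pred = pls.predict(source[test])

def per_unit_corr(a, b):
    """Pearson correlation between matching columns (units) of two (images x units) arrays."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))

# Median across target units of the held-out correlation (before noise correction).
print(np.median(per_unit_corr(pred, target[test])))
```

The noise-corrected version of this quantity is derived in the Methods (“Inter-Animal Consistency Derivation”).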

Under all mapping transforms, a log-linear extrapolation analysis (S5 Fig) reveals that as the number of units increases, the inter-animal consistency approaches 1.0 more rapidly for the Neuropixels dataset than for the calcium imaging dataset (whose responses were obtained from the average of the Δf/F trace), indicating the higher reliability of the Neuropixels data across all visual areas. We further noticed a large difference between the inter-animal consistency obtained via RSA and the consistencies achieved by any of the other mapping transforms for the responses in RL of the calcium imaging dataset (green in Fig 1A). This difference, however, was not observed for responses in RL in the Neuropixels dataset. This discrepancy suggested that there was a high degree of population-level heterogeneity in the responses collected from the calcium imaging dataset, which may be attributed to the fact that the two-photon field-of-view for RL spanned the boundary between the visual and somatosensory cortex, as originally noted by de Vries et al. [15]. We therefore excluded RL in the calcium imaging dataset from further analyses, following Siegle et al. [31], who systematically compared these two datasets. Thus, this analysis provided insight into the experiments from which the data were collected, and allowed us to ascertain the level of neural response variance that is common across animals and that therefore should be “explained” by candidate models. For these reasons, we present our main results on the newer and more reliable Neuropixels dataset, as it is better able to separate models; similar results on the calcium imaging dataset are presented in the Supporting Information.

Modeling mouse visual cortex

Building quantitatively accurate models

With this mapping and evaluation procedure, we can then develop models to better match mouse visual responses. The overall conclusion is that the mouse visual system is most consistent with an artificial neural network model that is self-supervised, low-resolution, and comparatively shallow. This conclusion also holds on the earlier calcium imaging dataset, and under non-regression-based comparisons such as RSA (cf. S2, S3 and S4 Figs). Our best models attained neural predictivity of 90% of the inter-animal consistency, much better than the prior high-resolution, deep, and task-specific model (VGG16), which attained 56.27% of this ceiling (Fig 2A; cf. rightmost column of Table 2 for the unnormalized quantities).

Fig 2. Substantially improving neural response predictivity of models of mouse visual cortex.


A. The median and s.e.m. (across units) of the neural predictivity difference relative to the previously used primate model, a supervised VGG16 trained on 224 px inputs (“Primate Model Baseline”, used in [14, 15, 36]), under PLS regression, across units in all mouse visual areas (N = 1731 units in total). Absolute neural predictivity (always computed from the model layer that best predicts a given visual area) for each model can be found in Table 2. Our best model is denoted as “AlexNet (IR)” on the far left. “Single Stream”, “Dual Stream”, and “Six Stream” are novel architectures we developed based on the first four layers of AlexNet, but which additionally incorporate dense skip connections, known from the feedforward connectivity of the mouse connectome [34, 35], as well as multiple parallel streams (schematized in S1 Fig). CPC denotes contrastive predictive coding [32, 37]. All models, except for the “Primate Model Baseline”, are trained on 64 px inputs. We also note that all the models are trained using ImageNet, except for CPC (purple), Depth Prediction (orange), and the CIFAR-10-labeled black bars. B. Training a model on a contrastive objective improves neural predictivity across all visual areas. For each visual area, the neural predictivity values are plotted across all model layers for an untrained AlexNet, a Supervised (ImageNet) AlexNet, and Contrastive AlexNet (ImageNet, instance recognition); the latter’s first four layers form the best model of mouse visual cortex. Shaded regions denote mean and s.e.m. across units.

Table 2. ImageNet top-1 validation set accuracy via linear transfer or via supervised training and neural predictivity for each model.

We summarize here the top-1 accuracy for each self-supervised and supervised model on ImageNet, as well as their noise-corrected neural predictivity obtained via the PLS map (aggregated across all visual areas). These values are plotted in Fig 2C and S2 Fig. Unless otherwise stated, each model is trained and validated on 64 × 64 pixel images.

Architecture        Objective Function              ImageNet Top-1 Accuracy   Neural Predictivity (Neuropixels; Calcium Imaging)
Single Stream       Untrained                       9.28%                     32.76%; 28.65%
Single Stream       Supervised                      34.87%                    36.21%; 29.73%
Single Stream       Autoencoder                     10.37%                    35.99%; 28.69%
Single Stream       Depth Prediction                18.04%                    33.79%; 27.54%
Single Stream       RotNet                          19.72%                    35.63%; 29.27%
Single Stream       Instance Recognition            21.22%                    38.01%; 30.88%
Single Stream       SimSiam                         26.48%                    39.19%; 30.48%
Single Stream       MoCov2                          27.63%                    39.17%; 30.30%
Single Stream       SimCLR                          22.84%                    39.45%; 29.50%
Dual Stream         Untrained                       10.85%                    33.58%; 29.24%
Dual Stream         Supervised                      38.68%                    36.07%; 29.43%
Dual Stream         Autoencoder                     10.26%                    34.97%; 28.74%
Dual Stream         Depth Prediction                19.81%                    32.81%; 27.20%
Dual Stream         RotNet                          23.29%                    35.37%; 29.15%
Dual Stream         Instance Recognition            22.55%                    40.07%; 30.64%
Dual Stream         SimSiam                         29.21%                    40.20%; 30.60%
Dual Stream         MoCov2                          31.00%                    38.64%; 30.33%
Dual Stream         SimCLR                          26.25%                    38.03%; 29.08%
Six Stream          Untrained                       11.12%                    33.74%; 29.26%
Six Stream          Supervised                      34.15%                    36.64%; 29.79%
Six Stream          Autoencoder                     9.27%                     37.34%; 31.12%
Six Stream          Depth Prediction                18.27%                    33.12%; 27.63%
Six Stream          RotNet                          22.78%                    35.49%; 28.97%
Six Stream          Instance Recognition            26.49%                    37.67%; 31.18%
Six Stream          SimSiam                         30.52%                    38.17%; 30.46%
Six Stream          MoCov2                          32.70%                    37.96%; 30.44%
Six Stream          SimCLR                          28.42%                    38.92%; 29.19%
AlexNet             Supervised                      36.22%                    37.28%; 30.34%
AlexNet             Instance Recognition            16.09%                    41.33%; 31.60%
AlexNet             VICReg                          32.31%                    39.20%; 30.56%
AlexNet             Barlow Twins                    12.38%                    40.04%; 31.42%
AlexNet             SimSiam                         6.84%                     39.68%; 30.95%
AlexNet             MoCov2                          15.67%                    39.83%; 31.20%
AlexNet             SimCLR                          26.85%                    39.39%; 30.27%
ResNet-18           Supervised                      53.31%                    35.82%; 28.93%
ResNet-18           Instance Recognition            30.75%                    38.99%; 30.11%
ResNet-34           Supervised                      58.24%                    36.38%; 28.42%
ResNet-34           Instance Recognition            32.91%                    39.26%; 30.93%
ResNet-50           Supervised                      63.57%                    37.72%; 30.70%
ResNet-50           Instance Recognition            38.49%                    39.45%; 31.12%
ResNet-101          Supervised                      64.53%                    38.15%; 31.05%
ResNet-101          Instance Recognition            39.43%                    39.94%; 31.60%
ResNet-152          Supervised                      65.21%                    37.74%; 30.67%
ResNet-152          Instance Recognition            40.08%                    39.07%; 31.30%
VGG16               Supervised                      58.32%                    33.19%; 27.09%
VGG16 (224 px)      Supervised                      71.59%                    26.03%; 20.40%
VGG16               Instance Recognition            25.33%                    37.78%; 28.66%
MouseNet of [33]    Supervised                      37.14%                    31.05%; 25.89%
MouseNet of [33]    Instance Recognition            21.21%                    35.96%; 30.06%
MouseNet Variant    Supervised                      39.37%                    33.02%; 26.53%
MouseNet Variant    Instance Recognition            26.85%                    36.89%; 28.80%
CPC of [32]         Contrastive Predictive Coding   N/A                       36.28%; 29.67%

We also attain neural predictivity improvements over prior work [32, 33] (cf. the purple and green bars in Fig 2A), especially the latter “MouseNet” of Shi et al. [33], which attempts to map details of the mouse connectome [34, 35] onto a CNN architecture. There are two green bars in Fig 2A because we also built our own variant of MouseNet, identical except that image categories are read off of the penultimate layer of the model (rather than off the concatenation of the earlier layers, as originally proposed). We thought this might aid the original MouseNet’s task performance and neural predictivity, since it can be difficult to train linear layers when the input dimensionality is very large. Our best models also outperform the neural predictivity of MouseNet even when it is trained with a self-supervised objective (leftmost red vs. “MouseNet” teal bars in Fig 2A). This is another conceptual motivation for our structural-and-functional goal-driven approach: it is easier to interrogate higher-level constraints than to incorporate and commit to individual biological details, which can be a very underconstrained procedure.

In the subsequent subsections, we distill the three factors that contributed to models with improved correspondence to the mouse visual areas: objective function, input resolution, and architecture.

Objective function: Training models on self-supervised, contrastive objectives, instead of supervised objectives, improves correspondence to mouse visual areas

The success in modeling the human and the non-human primate visual system has largely been driven by convolutional neural networks trained in a supervised manner on ImageNet [19] to perform object categorization [9, 13]. This suggests that models trained with category-label supervision learn useful visual representations that are well-matched to those of the primate ventral visual stream [17, 20]. Thus, although biologically implausible for the rodent, category-label supervision is a useful starting point for building baseline models and indeed, models trained in this manner are much improved over the prior primate model of VGG16 (black bars in Fig 2A). We refer to this model as the “Primate Model Baseline”, since although many different models have been used to predict neural activity in the primate ventral stream, this model in particular was the de facto CNN used in the initial goal-driven modeling studies of mouse visual cortex [14, 15, 36]. This choice also helps explicitly illustrate how the three factors that we study, which deviate in visual acuity, model depth, and functional objective from the primate ventral stream, quantitatively improve these models greatly.

These improved, category-label-supervised models, however, cannot be an explanation for how the mouse visual system developed in the first place. In particular, it is unclear whether rodents, even when trained, can perform well on large-scale object recognition tasks with hundreds of labels. For example, they obtain approximately 70% accuracy on a two-alternative forced-choice object classification task [38]. Furthermore, the categories of the ImageNet dataset are human-centric and therefore not relevant for rodents. In fact, the affordances of primates, unlike those of rodents, involve being able to manipulate objects flexibly with their hands. Therefore, the ImageNet categories may be more relevant to non-human primates as an ethological proxy than they are to rodents.

We therefore turned to more ecological supervision signals and self-supervised objective functions, where high-level category labeling is not necessary. These objectives could possibly lead to models with improved biological plausibility and may provide more general goals for the models, based on natural image statistics, beyond the semantic specifics of (human-centric) object categorization. Among the more ecological supervised signals, we consider categorization of comparatively lower-variation and lower-resolution images with fewer labels (CIFAR-10-labeled black bars in Fig 2A; [39]), and depth prediction (orange bars in Fig 2A; [40]), as a visual proxy for whisking [41]. Here, we note that in Fig 2A, all models except for those indicated by CIFAR-10 (black), CPC (purple), or Depth Prediction (orange) are trained on ImageNet images. Therefore, even if we remove the category labels from training, many of the images themselves (from both ImageNet and CIFAR-10) are human-centric, so future models could be trained using images that are more plausible for mice.

Turning to self-supervision, early self-supervised objectives include sparse autoencoding (pink bars in Fig 2A; [42]), instantiated as a sparsity penalty in the latent space of the image reconstruction loss, which has been shown to produce Gabor-like functions reminiscent of the experimental findings of Hubel and Wiesel [43]. These objectives, however, have not led to quantitatively accurate models of higher visual cortex [8, 20]. Further developments in computer vision have led to other self-supervised algorithms, such as predicting the rotation angle of an image (blue bars in Fig 2A; [44]), motivated by the idea that “non-semantic” features are highly related to higher-level, semantic features (i.e., category labels). Although these objectives are very simple, models optimized on them do not result in “powerful” visual representations for downstream tasks. We trained models on these objectives and showed that although they improve neural predictivity over the prior primate model (VGG16), they do not outperform category-label-supervised models (Fig 2A; compare pink, blue, and orange with black).

Further developments in self-supervised learning provided a new class of contrastive objectives. These objectives are much more powerful than the prior self-supervised objectives described above, as models trained with contrastive objectives have been shown to learn visual representations that support strong performance on downstream object categorization tasks. At a high level, the goal of these contrastive objectives is to learn a representational space where embeddings of augmentations of one image (i.e., embeddings for two transformations of the same image) are more “similar” to each other than to embeddings of other images. We trained models using a family of contrastive objective functions including: Instance Recognition (IR; [45]), a Simple Framework for Contrastive Learning (SimCLR; [46]), Momentum Contrast (MoCov2; [47]), Simple Siamese representation learning (SimSiam; [48]), Barlow Twins [49], and Variance-Invariance-Covariance Regularization (VICReg; [50]). Note that we are using the term “contrastive” broadly to encompass methods that learn embeddings which are robust to augmentations, even if they do not explicitly rely on negative batch examples, since they still have to contrast against something to avoid representational collapse. For example, SimSiam relies on asymmetric representations via a stop gradient; Barlow Twins regularizes the off-diagonal elements of the cross-correlation matrix; and VICReg uses the variance and covariance of each embedding to ensure samples in the batch are different. Models trained with these contrastive objectives (red bars in Fig 2A) resulted in higher neural predictivity across all the visual areas than models trained on supervised object categorization, depth prediction, and less-powerful self-supervised algorithms (black, orange, purple, pink, and blue bars in Fig 2A).
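To make the shared idea concrete, here is a minimal sketch of an InfoNCE/NT-Xent-style contrastive loss in PyTorch: embeddings of two augmented views of the same image are pulled together while the other images in the batch serve as negatives. The temperature value and the two-view setup are illustrative, and the specific methods above (IR, SimCLR, MoCo, SimSiam, Barlow Twins, VICReg) differ in important details such as memory banks, momentum encoders, stop-gradients, and redundancy-reduction terms.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """Simplified NT-Xent-style loss. z1, z2: (batch, dim) embeddings of two
    augmentations of the same batch of images, from the same encoder.
    Positive pairs sit on the diagonal of the similarity matrix."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                     # (batch, batch) similarities
    labels = torch.arange(z1.size(0), device=z1.device)    # index of each positive pair
    return F.cross_entropy(logits, labels)

# Usage sketch: two augmented views pass through the same backbone
# (e.g., the first four AlexNet layers plus a small projection head):
# loss = contrastive_loss(encoder(view1), encoder(view2))
```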

We home in on the contribution of the contrastive objective function (over a supervised objective, with a fixed ImageNet dataset) to neural predictivity by fixing the architecture to be AlexNet while varying the objective function (shown in S9 Fig left). We find that across all objective functions, training AlexNet using instance recognition leads to the highest neural predictivity. We additionally note that the improvement in neural predictivity extends beyond the augmentations used for each objective function, as shown in S7 Fig: when the image augmentations intended for contrastive losses are used with supervised losses, neural predictivity does not improve. Across all the visual areas, there is an improvement in neural predictivity simply by using a powerful contrastive algorithm (red vs. black in Fig 2B and S9 Fig left). Not only is there an improvement in neural predictivity, but also an improvement in hierarchical correspondence to the mouse visual hierarchy. Using the brain hierarchy score developed by Nonaka et al. [51], we observe that contrastive models outperform their supervised counterparts in matching the mouse visual hierarchy (S10 Fig). The first four layers of Contrastive AlexNet (red; Fig 2B), where neural predictivity is maximal across visual areas, form our best model of mouse visual cortex.

Data stream: Training models on images of lower resolution improves correspondence to mouse visual areas

The visual acuity of mice is known to be lower than the visual acuity of primates [25, 26]. Thus, more accurate models of mouse visual cortex must be trained and evaluated at image resolutions that are lower than those used in the training of models of the primate visual system. We investigated how neural predictivity of two strong contrastive models varied as a function of the image resolution at which they were trained.

Two models were used in the exploration of image resolution’s effects on neural predictivity. Contrastive AlexNet was trained with image resolutions that varied from 64 × 64 pixels to 224 × 224 pixels, as 64 × 64 pixels was the minimum image size for AlexNet due to its architecture. The image resolution upper bound of 224 × 224 pixels is the image resolution that is typically used to train neural network models of the primate ventral visual stream [17]. We also investigated image resolution’s effects using a novel model architecture we developed, known as “Contrastive StreamNet”, as its architecture enables us to explore a lower range of image resolutions than the original AlexNet. This model was based on the first four layers of AlexNet, but additionally incorporates dense skip connections, known from the feedforward connectivity of the mouse connectome [34, 35], as well as multiple parallel streams (schematized in S1 Fig). We trained it using a contrastive objective function (instance recognition) at image resolutions that varied from 32 × 32 pixels to 224 × 224 pixels.

Training models at resolutions lower than those used for models of primate visual cortex improves neural predictivity across all visual areas, up to a point beyond which neural predictivity decreases (Fig 3). Although the input resolution of 64 × 64 pixels may not be optimal for every architecture, it was the resolution at which we trained all the models. This was motivated by the observation that the upper bound on mouse visual acuity is 0.5 cycles / degree [25], corresponding to 2 pixels / cycle × 0.5 cycles / degree = 1 pixel / degree. Prior retinotopic map studies [52] estimate a visual coverage range in V1 of 60–90 degrees, and we found 64 × 64 pixels to be roughly optimal for the models (Fig 3); this resolution was also used by Shi et al. [36], by Bakhtiari et al. [32], and in the MouseNet of Shi et al. [33]. Although downsampling training images is a reasonable proxy for the mouse retina, as has been done in prior modeling work, more investigation into appropriate image transformations may be needed.
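The acuity-to-resolution conversion described above can be written out explicitly; a small sketch using the 0.5 cycles/degree acuity bound, the Nyquist factor of 2 pixels/cycle, and the 60–90 degree V1 coverage estimates quoted in the text.

```python
# Convert mouse visual acuity and V1 coverage into an input resolution in pixels.
acuity_cycles_per_deg = 0.5     # upper bound on mouse acuity [25]
nyquist_px_per_cycle = 2        # pixels needed to represent one cycle
px_per_degree = nyquist_px_per_cycle * acuity_cycles_per_deg   # = 1 pixel / degree

for coverage_deg in (60, 90):   # V1 visual coverage estimates [52]
    print(f"{coverage_deg} deg -> {int(coverage_deg * px_per_degree)} px")
# 60 deg -> 60 px, 90 deg -> 90 px; 64 x 64 px inputs fall within this range.
```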

Fig 3. Lower-resolution training leads to improved neural response predictivity (Neuropixels dataset).


Contrastive AlexNet and Contrastive StreamNet (Dual Stream) were trained using images of increasing resolutions. For AlexNet, the minimum resolution was 64 × 64 pixels. For Dual StreamNet, the minimum resolution was 32 × 32 pixels. The median and s.e.m. neural predictivity across all units of all visual areas is plotted against the image resolution at which the models were trained. Reducing the image resolution (but not beyond a certain point) during representation learning improves the match to all the visual areas. The conversion from image resolution in pixels to visual field coverage in degrees is under the assumption that the upper bound on mouse visual acuity is 0.5 cycles per degree [25] and that the Nyquist limit is 2 pixels per cycle, leading to a conversion ratio of 1 pixel per degree.

Overall, these data show that optimization using a simple change in the image statistics (i.e., data stream) is crucial to obtain improved models of mouse visual encoding. This suggests that mouse visual encoding is the result of “task-optimization” at a lower resolution than what is typically used for primate ventral stream models.

Architecture: Shallow models suffice for improving correspondence to mouse visual cortex

Anatomically, the mouse visual system has a shallower hierarchy relative to that of the primate visual system (see, e.g., [21–24]). Furthermore, the reliability of the neural responses for each visual area provides additional support for a relatively shallow functional hierarchy (Fig 1B). Thus, more biologically-plausible models of mouse visual cortex should have fewer linear-nonlinear layers than those of primate visual cortex. By plotting a model’s neural predictivity against its number of linear-nonlinear operations, we indeed found that very deep models do not outperform shallow models (i.e., models with fewer than 12 linear-nonlinear operations), despite changes across loss functions and input resolutions (Fig 4). In addition, if we fix the objective function to be instance recognition, we can clearly observe that AlexNet, which consists of eight linear-nonlinear layers, has the highest neural predictivity compared to ResNets and VGG16 (S9 Fig right). Furthermore, models with only four convolutional layers (StreamNets; Single, Dual, or Six Streams) perform as well as or better than models with many more convolutional layers (e.g., compare Dual Stream with ResNet101, ResNet152, VGG16, or the MouseNets in S9 Fig right). This observation is in direct contrast with prior observations in the primate visual system, whereby networks with fewer than 18 linear-nonlinear operations predict macaque visual responses less well than models with at least 50 linear-nonlinear operations [17].
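As a rough illustration of the depth comparison above, the following sketch counts convolutional and fully-connected layers (each typically followed by a nonlinearity) in standard torchvision models; whether to count pooling or normalization layers is a modeling choice, so this tally is only indicative rather than the paper’s exact counting rule.

```python
import torch.nn as nn
from torchvision.models import alexnet, resnet50

def count_linear_nonlinear(model):
    """Count layers applying a learned linear map (conv or fully connected)."""
    return sum(isinstance(m, (nn.Conv2d, nn.Linear)) for m in model.modules())

print("AlexNet:", count_linear_nonlinear(alexnet()))      # 8 (5 conv + 3 fc)
print("ResNet-50:", count_linear_nonlinear(resnet50()))   # ~54, incl. downsampling convs
```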

Fig 4. Neural predictivity vs. model depth.


A model’s median neural predictivity across all units from each visual area is plotted against its depth (number of linear-nonlinear layers; in log-scale). Models with fewer linear-nonlinear layers can achieve neural predictivity performances that outperform or are on par with those of models with many more linear-nonlinear layers. The “Primate Model Baseline” denotes a supervised VGG16 trained on 224 px inputs, used in prior work [14, 15, 36].

Mouse visual cortex as a general-purpose visual system

Our results show that a model optimized on a self-supervised, contrastive objective is the most quantitatively accurate model of the mouse visual system. However, unlike supervised object categorization or other more-classical forms of self-supervised learning such as sparse autoencoding, whose ecological function is directly encoded in the loss function (e.g., predator recognition or metabolically-efficient dimension reduction), the functional utility of self-supervision via a contrastive objective is not as apparent. This naturally raises the question of what behavioral function(s) optimizing a contrastive self-supervision objective might enable for the mouse from an ecological fitness viewpoint.

Analyzing the spectrum of models described in the above section, we first observed that performance on ImageNet categorization is uncorrelated with improved neural predictivity for mouse visual cortex (Fig 5; cf. middle column of Table 2 for the exact ImageNet performance values), unlike the well-known correlation in primates (Fig 5; inset). In seeking an interpretation of the biological function of the contrastive self-supervised objective, we were thus prompted to consider behaviors beyond object-centric categorization tasks. We hypothesized that since self-supervised loss functions are typically most effective in task-agnostic stimulus domains where goal-specific labels are unavailable, optimizing for such an objective might enable the rodent to transfer well to complex multi-faceted ecological tasks in novel domains affording few targeted opportunities for hyper-specialization.

Fig 5. Neural predictivity is not correlated with object categorization performance on ImageNet.


Each model’s (transfer or supervised) object categorization performance on ImageNet is plotted against its median neural predictivity across all units from all visual areas. All ImageNet performance values can be found in Table 2. Inset. Primate ventral visual stream neural predictivity from BrainScore is correlated with ImageNet categorization accuracy (adapted from Schrimpf et al. [17]). This relationship is in stark contrast to our finding in mouse visual cortex where higher ImageNet categorization accuracy is not associated with higher neural predictivity. Color scheme as in Fig 2A. See S2 Fig. for neural predictivity on the calcium imaging dataset.

To test this hypothesis, we used a recently-developed “virtual rodent” framework (adapted from Merel et al. [53] and also used by Lindsay et al. [54]), in which a biomechanically-validated mouse model is placed in a simulated 3D maze-like environment (Fig 6A). The primary purpose of the experiments with the virtual rodent is not necessarily to make a specific statement about rodent movement repertoires, but rather to assess how well our self-supervised visual encoder enables control of a high-dimensional body from high-dimensional continuous inputs, a problem that many (if not all) animals have to solve. Of course, given that we were trying to better understand why self-supervised methods are better predictors of specifically mouse visual cortical neurons, we wanted a reasonably ecological task for that species (e.g., navigation) and an agent whose affordances were somewhat similar to those of an actual rodent, via the biomechanical realism of its body. In particular, if we used our shallow, lower-acuity, self-supervised visual encoder to control a virtual monkey simulation on a task that a monkey is adapted to (e.g., object manipulation), we would not expect it to work as well, given that such a task likely requires high visual acuity and good object recognition abilities at a minimum.

Fig 6. Evaluating the generality of learned visual representations.


A. Each row shows an example of an episode used to train the reinforcement learning (RL) policy in an offline fashion [56]. The episodes used to train the virtual rodent (adapted from Merel et al. [53]) were previously generated and are part of a larger suite of RL tasks for benchmarking algorithms [57]. In this task (“DM Locomotion Rodent”), the goal of the agent is to navigate the maze to collect as many rewards as possible (blue orbs shown in the first row). B. A schematic of the RL agent. The egocentric visual input is fed into the visual backbone of the model, which is fixed to be the first four convolutional layers of either the contrastive or the supervised variants of AlexNet. The output of the visual encoder is then concatenated with the virtual rodent’s proprioceptive inputs before being fed into the (recurrent) critic and policy heads. The parameters of the visual encoder are not trained, while the parameters of the critic head and the policy head are trained. Virtual rodent schematic from Fig 1B of Merel et al. [53]. C. A schematic of the out-of-distribution generalization procedure. Visual encoders are either trained in a supervised or self-supervised manner on ImageNet [19] or the Maze environment [57], and then evaluated on reward-based navigation or datasets consisting of object properties (category, pose, position, and size) from Hong et al. [55], and different textures [58].

We used the model that best corresponds to mouse visual responses as the visual system of the simulated mouse, coupled to a simple actor-critic reinforcement-learning architecture (Fig 6B). We then trained the simulated mouse in several visual contexts with varying objective functions, and evaluated those models’ ability to transfer to a variety of tasks in novel environments, including both a reward-based navigation task, as well as several object-centric visual categorization and estimation tasks (Fig 6C).
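A minimal sketch of the agent structure in Fig 6B, assuming a frozen convolutional visual encoder whose flattened output is concatenated with proprioceptive inputs and passed to recurrent critic and policy heads; the module names, hidden size, and the assumption that `visual_dim` equals the flattened encoder output size are illustrative and not the exact architecture used.

```python
import torch
import torch.nn as nn

class RodentAgent(nn.Module):
    def __init__(self, visual_encoder, visual_dim, proprio_dim, action_dim, hidden=256):
        super().__init__()
        self.encoder = visual_encoder              # e.g., first four conv layers of AlexNet
        for p in self.encoder.parameters():        # visual encoder is kept frozen
            p.requires_grad = False
        self.core = nn.LSTM(visual_dim + proprio_dim, hidden, batch_first=True)
        self.policy = nn.Linear(hidden, action_dim)   # actor head (trained)
        self.value = nn.Linear(hidden, 1)             # critic head (trained)

    def forward(self, images, proprioception, state=None):
        # images: (batch, time, C, H, W); proprioception: (batch, time, proprio_dim)
        b, t = images.shape[:2]
        feats = self.encoder(images.flatten(0, 1)).flatten(1).view(b, t, -1)
        core_out, state = self.core(torch.cat([feats, proprioception], dim=-1), state)
        return self.policy(core_out), self.value(core_out), state
```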

We first evaluated a simulated mouse whose visual system was pretrained with ImageNet images in terms of its ability to transfer to reward-based navigation in the Maze environment, training just the reinforcement learning portion of the network on the navigation task with the pretrained visual system held constant. As a supervised control, we performed the same transfer training procedure using a visual front-end created using supervised (ImageNet) object categorization pretraining. We found that the simulated mouse with the contrastive self-supervised visual representation was able to reliably obtain substantially higher navigation rewards than its counterpart with category-supervised visual representation (Fig 7A; cf. Table 3 for the exact quantities of rewards attained).

Fig 7. Self-supervised, contrastive visual representations better support transfer performance on downstream, out-of-distribution tasks.


Models trained in a contrastive manner (using either ImageNet or the egocentric maze inputs; red and blue respectively) lead to better transfer on out-of-distribution downstream tasks than models trained in a supervised manner (i.e., models supervised on labels or on rewards; black and purple respectively). A. Models trained on ImageNet, tested on reward-based navigation. B. Models trained on egocentric maze inputs (“Contrastive Maze”, blue) or supervised on rewards (i.e., reward-based navigation; “Supervised Maze”, purple), tested on visual scene understanding tasks: pose, position, and size estimation, and object and texture classification. C. Median and s.e.m. neural predictivity across units in the Neuropixels dataset.

Table 3. Average reward and s.e.m. across 600 episodes obtained by the RL agent using each of the visual backbones.

Visual Backbone        Reward (mean ± s.e.m.)
Contrastive ImageNet   297 ± 6
Supervised ImageNet    262 ± 6
Contrastive Maze       412 ± 5
Supervised Maze        462 ± 6

Conversely, we trained the simulated mouse directly in the Maze environment. In one variant, we trained the visual system on the contrastive self-supervised objective (but on the images of the Maze environment). In a second variant, we trained the agent end-to-end on the reward navigation task itself—the equivalent of “supervision” in the Maze environment. We then tested both models on their ability to transfer to out-of-sample visual categorization and estimation tasks. Again, we found that the self-supervised variant transfers significantly better than its “supervised” counterpart (blue vs. purple bars in Fig 7B).

Returning to analyses of neural predictivity, we found that both self-supervised models (trained in either environment) were better matches to mouse visual cortex neurons than the supervised counterparts in their respective environments (red and blue vs. black and purple in Fig 7C). This result illustrates the congruence between general out-of-distribution task-transfer performance and improved neural fidelity of the computational model.

Furthermore, Lindsay et al. [54] found that less powerful self-supervised representation learners such as CPC and autoencoding did not match mouse visual responses as well as their RL-trained counterpart (in terms of representational similarity). This is consistent with our finding that CPC and autoencoding (Fig 2A) do not match neural responses as well as contrastive self-supervised methods (red vs. pink and purple bars).

It is also noteworthy that, comparing models by training environment rather than objective function, the ImageNet-trained models are more neurally consistent than the Maze-trained models, for both supervised and self-supervised objectives (red vs. blue and black vs. purple in Fig 7C). Using the contrastive self-supervised objective function is by itself enough to raise the Maze-trained model to the predictivity level of the ImageNet-trained supervised model, but the training environment makes a significant contribution. This suggests that while the task domain and biomechanical model of the Maze environment are realistic, future work will likely need to improve the realism of the simulated image distribution.

Overall, these results suggest that contrastive embedding methods achieve a generalized improvement in the quality of the visual representations they create, enabling a diverse range of visual behaviors and providing evidence for their potential as computational models of mouse visual cortex. While we expect the calculation of rewards to be performed outside of the visual system, the mouse visual system would plausibly support the visual aspects of the transfer tasks we consider, as we find that higher model areas best support these scene-understanding transfer tasks (S8 Fig). This is not unlike the primate, where downstream visual areas support a variety of visual recognition tasks [55].

Discussion

In this work, we showed that comparatively shallow architectures trained with contrastive objectives operating on lower-resolution images most accurately predict static-image-evoked neural responses across multiple mouse visual areas, surpassing the predictive power of supervised methods and approaching the inter-animal consistency. The fact that these goal-driven constraints lead to a better match to visual responses, even in “behaviorally-free” data where mice are passively viewing stimuli, suggests that these constraints may be good descriptions of the evolutionary and the developmental drivers of mouse visual cortical structure and function.

In the primate ventral visual stream, models trained on contrastive objectives led to neural predictivity performance that was on par with that of supervised models [20], suggesting they are a more ecologically-valid proxy for a categorization-specialized system. This is in stark contrast with our findings in mouse visual cortex: models trained on contrastive objectives substantially surpassed the neural predictivity of their supervised counterparts. We investigated the advantages that contrastive objectives might confer over supervised objectives for mouse visual representations and found that they provide representations that are generally improved over those obtained by supervised methods. The improved generality of contrastive models enabled better transfer to a diverse range of downstream behaviors in novel, out-of-distribution environments, including reward-based navigation in egocentric maze environments and visual scene understanding.

As mentioned previously, the goal-driven modeling approach allows us to understand the principles that govern the system under study and further allows for direct comparisons across systems. Our high-fidelity model of mouse visual cortex and the principles underlying its construction can be compared with models of primate visual cortex. While the primate ventral visual stream is well-modeled by a deep hierarchical system and object-category learning, the mouse visual system has not had such a coherent account until now. Our results demonstrate that both primate and rodent visual systems are highly constrained, albeit for different functional purposes. The results summarized above further suggest that mouse visual cortex is a light-weight, shallower, low-resolution, and general-purpose visual system, in contrast to the deep, high-resolution, and more categorization-dominated visual system of primates suggested by prior work [20].

Although we have made progress in modeling the mouse visual system in three core ways (the choice of objective function, data stream, and architecture class), some limitations remain both in the modeling and in the neural data.

On the architectural front, our focus in this work was on feedforward models, but there are many feedback connections from higher visual areas to lower visual areas [21]. Incorporating these architectural motifs into our models and training these models using dynamic inputs may be useful for modeling temporal dynamics in mouse visual cortex, as has been recently done in primates [29, 59, 60].

By incorporating recurrent connections in the architectures, we can probe the functionality of these feedback connections using self-supervised loss functions in scenarios with temporally-varying, dynamic inputs. For example, given that powerful self-supervised methods obtained good visual representations of static images, it would be interesting to explore a larger spectrum of self-supervised signals operating on dynamic inputs, such as in the context of forward prediction (e.g., [61–63]).

Constraining the input data so that they are closer to those received by the mouse visual system was important for improved neural fidelity. Our resizing (i.e., downsampling) of the images to be smaller during training acted as a proxy for low-pass filtering. We believe that future work could investigate other appropriate low-pass filters and ecologically-relevant pixel-level transformations to apply to the original image or video stream [64, 65].

Our inter-animal consistency analyses make recommendations for qualities of experiments that are likely to provide data that will be helpful in more sharply differentiating models. Under all mapping functions, a log-linear extrapolation analysis revealed that as the number of units increases, the inter-animal consistency approaches one more rapidly for the Neuropixels dataset than for the calcium imaging dataset, indicating the higher reliability of the Neuropixels data (S5 Fig). Moreover, when assessing inter-animal consistency, correlation values between animals were significantly higher for the training set than for the test set, indicating that the number of stimuli could be enlarged to close this generalization gap (S6A Fig). As a function of the number of stimuli, the test set inter-animal consistencies steadily increased, and would likely continue to increase substantially if the dataset had more stimuli (S6B Fig). Thus, while much focus in experimental methods has been on increasing the number of neurons in a dataset [66], our analyses indicate that increasing the number of stimuli may drastically improve model identification. Doing so would likely raise the inter-animal consistency, providing substantially more dynamic range for separating models in terms of their ability to match the data, potentially allowing us to obtain more specific conclusions about which circuit structure(s) [67, 68] and which (combinations of) objectives (e.g., [45, 46, 48]) best describe mouse visual cortex.

We hope that our work in modeling mouse visual cortex will meaningfully drive future experimental and computational studies in mice of other sensory systems and of visually-guided behaviors. The input-domain-agnostic nature of these contrastive objectives suggests the tantalizing possibility that they might be used in other sensory systems, such as barrel cortex or the olfactory system. We believe that high-fidelity computational models of sensory cortex can be integrated with models of higher-order systems (e.g., the medial temporal lobe), with the goal of providing greater insight into how sensory experience contributes to adaptive or maladaptive behaviors.

Methods

Neural response datasets

We used the Allen Brain Observatory Visual Coding dataset [15, 22] collected using both two-photon calcium imaging and Neuropixels from areas V1 (VISp), LM (VISl), AL (VISal), RL (VISrl), AM (VISam), and PM (VISpm) in mouse visual cortex. We focused on the natural scene stimuli, consisting of 118 images, each presented 50 times (i.e., 50 trials per image).

We list the number of units and specimens for each dataset in Table 1, after units are selected, according to the following procedure: For the calcium imaging data, we used a similar unit selection criterion as in Conwell et al. [18], where we sub-selected units that attain a Spearman-Brown corrected split-half consistency of at least 0.3 (averaged across 100 bootstrapped trials), and whose peak responses to their preferred images are not significantly modulated by the mouse’s running speed during stimulus presentation (p > 0.05).
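For concreteness, a minimal sketch of the Spearman-Brown-corrected split-half consistency underlying this selection criterion, assuming each unit’s responses are stored as an (images × trials) array; the bootstrap count and array layout are illustrative.

```python
import numpy as np

def split_half_consistency(unit_responses, n_boot=100, seed=0):
    """unit_responses: (n_images, n_trials) array for one unit.
    Returns the Spearman-Brown corrected split-half reliability,
    averaged over bootstrapped random splits of the trials."""
    rng = np.random.default_rng(seed)
    n_images, n_trials = unit_responses.shape
    vals = []
    for _ in range(n_boot):
        perm = rng.permutation(n_trials)
        half1 = unit_responses[:, perm[: n_trials // 2]].mean(axis=1)
        half2 = unit_responses[:, perm[n_trials // 2:]].mean(axis=1)
        r = np.corrcoef(half1, half2)[0, 1]
        vals.append(2 * r / (1 + r))          # Spearman-Brown correction
    return float(np.mean(vals))

# Units with consistency >= 0.3 would pass the criterion described above.
```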

For the Neuropixels dataset, we separately averaged, for each specimen and each visual area, the temporal response (at the level of 10-ms bins up to 250 ms) on the largest contiguous time interval when the median (across the population of units in that specimen) split-half consistency reached at least 0.3. This procedure helps to select the most internally-consistent units in their temporally-averaged response, and accounts for the fact that different specimens have different time courses along which their population response becomes reliable.

Finally, after subselecting units according to the above criteria for both datasets, we only keep specimens that have at least the 75th percentile number of units among all specimens for that given visual area. This final step helped to ensure we have enough internally-consistent units per specimen for the inter-animal consistency estimation (derived in the “Inter-Animal Consistency Derivation” section).

Noise-corrected neural predictivity

Linear regression

When we perform neural fits, we choose a random 50% set of natural scene images (59 images in total) to train the regression, and the remaining 50% to use as a test set (59 images in total), across ten train-test splits in total. For Ridge regression, we use α = 1, following the sklearn.linear_model convention. PLS regression was performed with 25 components, as in prior work (e.g., [8, 17]). When we perform regression with the One-to-One mapping, as in Fig 1A, we identify the top correlated (via Pearson correlation on the training images) unit in the source population for each target unit. Once that source unit has been identified, we then fix it for that particular train-test split, and evaluate on the remaining 50% of images.
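The Ridge and PLS fits use the standard sklearn estimators (Ridge(alpha=1) and PLSRegression(n_components=25)); the less standard One-to-One mapping can be sketched as follows, with array names and shapes purely illustrative.

```python
import numpy as np

def one_to_one_map(source_train, target_train, source_test):
    """For each target unit, pick the single most-correlated source unit on the
    training images, then use that source unit's responses on the test images.
    source_*: (n_images, n_source_units); target_train: (n_images, n_target_units)."""
    def zscore(x):
        return (x - x.mean(axis=0)) / x.std(axis=0)
    # (n_source_units, n_target_units) Pearson correlations on the training images.
    corr = zscore(source_train).T @ zscore(target_train) / source_train.shape[0]
    best_source = corr.argmax(axis=0)       # top source unit index per target unit
    return source_test[:, best_source]      # predicted test responses per target unit
```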

Motivated by the justification given in the “Inter-Animal Consistency Derivation” section for the noise correction in the inter-animal consistency, the noise correction of the model-to-neural-response regression is a special case of the quantity defined in the “Multiple Animals” section, where the source animal is replaced by model features, separately fit to each target animal (from the set of available animals A). Let L be the set of model layers, let r^ℓ be the model responses at model layer ℓ ∈ L, let M be the mapping, and let s^B be the trial-averaged pseudo-population response of target animal B:

$$
\max_{\ell \in L}\; \operatorname{median}\; \underset{B \in \mathcal{A}}{\oplus} \left\langle \frac{\operatorname{Corr}\left(M\left(r^{\ell}_{\mathrm{train}};\, s^{B}_{1,\mathrm{train}}\right)_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right)}{\sqrt{\widetilde{\operatorname{Corr}}\left(M\left(r^{\ell}_{\mathrm{train}};\, s^{B}_{1,\mathrm{train}}\right)_{\mathrm{test}},\; M\left(r^{\ell}_{\mathrm{train}};\, s^{B}_{2,\mathrm{train}}\right)_{\mathrm{test}}\right) \times \widetilde{\operatorname{Corr}}\left(s^{B}_{1,\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right)}} \right\rangle,
$$

where the average ⟨·⟩ is taken over 100 bootstrapped split-half trials, ⊕ denotes concatenation of units across animals B ∈ A (with the median then taken across the concatenated units), Corr(·, ·) denotes the Pearson correlation of the two quantities, and C̃orr(·, ·) denotes the Spearman-Brown corrected value of the original quantity (see the “Spearman-Brown Correction” section).

Prior to obtaining the model features of the stimuli for linear regression, we preprocessed each stimulus using the image transforms used on the validation set during model training, resizing the shortest edge of the stimulus in both cases to 64 pixels, preserving the aspect ratio of the input stimulus. Specifically, for models trained using the ImageNet dataset, we first resized the shortest edge of the stimulus to 256 pixels, center-cropped the image to 224 × 224 pixels, and finally resized the stimulus to 64 × 64 pixels. For models trained using the CIFAR-10 dataset, this resizing yielded a 64 × 81 pixels stimulus.
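For the ImageNet-trained models, the preprocessing described here corresponds to the following torchvision transform pipeline (a sketch; normalization and the CIFAR-10 case, which yields 64 × 81 pixel stimuli, are omitted).

```python
from torchvision import transforms

# Resize shortest edge to 256 px, center-crop to 224 x 224, then downsample to 64 px,
# matching the preprocessing described above for ImageNet-trained models.
imagenet_eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.Resize(64),
    transforms.ToTensor(),
])
```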

Representational Similarity Analysis (RSA)

In line with prior work [18, 36], we also used representational similarity analysis (RSA; [30]) to compare models to neural responses, as well as to compare animals to each other. Specifically, we compared (via Pearson correlation) only the upper-right triangles of the representational dissimilarity matrices (RDMs), excluding the diagonals to avoid illusory effects [69].
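A minimal sketch of this RSA comparison, assuming (images × units) response matrices and a correlation-distance RDM; the exact dissimilarity measure is an illustrative choice.

```python
import numpy as np

def rdm(responses):
    """Representational dissimilarity matrix: 1 - Pearson correlation between the
    population response vectors of each pair of images. responses: (n_images, n_units)."""
    return 1.0 - np.corrcoef(responses)

def rsa_score(responses_a, responses_b):
    """Pearson correlation of the upper triangles of the two RDMs, excluding the diagonal."""
    iu = np.triu_indices(responses_a.shape[0], k=1)
    return np.corrcoef(rdm(responses_a)[iu], rdm(responses_b)[iu])[0, 1]
```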

For each visual area and a given model, we defined the predictivity of the model for that area to be the maximum RSA score across model layers after the suitable noise correction is applied, which is defined as follows. Let $r_{\ell}$ be the model responses at model layer $\ell \in L$ and let $s$ be the trial-averaged pseudo-population response (i.e., responses aggregated across specimens). The metric used here is a specific instance of Eq (10), where the single source animal $A$ is the trial-wise, deterministic model features (which therefore have a mapping consistency of 1) and the single target animal $B$ is the pseudo-population response:

$$\max_{\ell \in L}\; \frac{\operatorname{RSA}\!\left(r_{\ell},\, s_{2}\right)}{\sqrt{\widetilde{\operatorname{RSA}}\!\left(s_{1},\, s_{2}\right)}}, \qquad \widetilde{\operatorname{RSA}}\!\left(s_{1}, s_{2}\right) \equiv \frac{2\,\operatorname{RSA}\!\left(s_{1}, s_{2}\right)}{1 + \operatorname{RSA}\!\left(s_{1}, s_{2}\right)}, \tag{1}$$

where $L$ is the set of model layers, $\{s_{i}\}_{i=1}^{2}$ are the animal’s responses for two halves of the trials (averaged across the trials dimension), the average is computed over 100 bootstrapped split-half trials, and $\widetilde{\operatorname{RSA}}(s_{1}, s_{2})$ denotes the Spearman-Brown correction applied to the internal consistency quantity, $\operatorname{RSA}(s_{1}, s_{2})$, defined in the “Spearman-Brown Correction” section.

If the fits are performed separately for each animal, then $B$ corresponds to each animal among those for a given visual area (defined by the set $\mathcal{A}$), and we compute the median across animals $B \in \mathcal{A}$:

$$\max_{\ell \in L}\; \underset{B \in \mathcal{A}}{\operatorname{median}}\; \frac{\operatorname{RSA}\!\left(r_{\ell},\, s^{B}_{2}\right)}{\sqrt{\widetilde{\operatorname{RSA}}\!\left(s^{B}_{1},\, s^{B}_{2}\right)}}. \tag{2}$$

Similar to the above, the Spearman-Brown correction is applied to the internal consistency quantity, $\operatorname{RSA}(s^{B}_{1}, s^{B}_{2})$.
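A minimal sketch of how the noise-corrected RSA score of Eqs (1)-(2) can be computed for a single model layer and a single target animal is given below; it assumes a correlation-distance RDM (the dissimilarity measure is not pinned down in the text above) and uses illustrative function names rather than the authors' code.

```python
# Minimal sketch of the noise-corrected RSA score (illustrative only).
# `model_feats`: (n_images, n_features); `s1`, `s2`: trial-averaged responses over
# two halves of the trials, each of shape (n_images, n_units).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def rdm_upper(x):
    # Correlation-distance RDM, upper triangle only (diagonal excluded).
    return pdist(x, metric="correlation")

def rsa(x, y):
    return pearsonr(rdm_upper(x), rdm_upper(y))[0]

def spearman_brown(r):
    return 2 * r / (1 + r)

def noise_corrected_rsa(model_feats, s1, s2):
    internal = spearman_brown(rsa(s1, s2))           # corrected split-half RSA
    return rsa(model_feats, s2) / np.sqrt(internal)  # Eq (1); max over layers is taken outside
```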

Inter-animal consistency derivation

Single animal pair

Suppose we have neural responses from two animals $A$ and $B$. Let $t^{p}_{i}$ be the vector of true responses (either at a given time bin or averaged across a set of time bins) of animal $p \in \{A, B\}$ on stimulus set $i \in \{\text{train}, \text{test}\}$. Of course, we only receive noisy observations of $t^{p}_{i}$, so let $s^{p}_{j,i}$ be the $j$th set of $n$ trials of $t^{p}_{i}$. Finally, let $M(x; y)_{i}$ be the predictions of a mapping $M$ (e.g., PLS) when trained on input $x$ to match output $y$ and tested on stimulus set $i$. For example, $M(t^{A}_{\text{train}}; t^{B}_{\text{train}})_{\text{test}}$ is the prediction of mapping $M$ on the test set stimuli, trained to match the true neural responses of animal $B$ given, as input, the true neural responses of animal $A$ on the train set stimuli. Similarly, $M(s^{A}_{1,\text{train}}; s^{B}_{1,\text{train}})_{\text{test}}$ is the prediction of mapping $M$ on the test set stimuli, trained to match the trial-average of noisy sample 1 of animal $B$ on the train set stimuli given, as input, the trial-average of noisy sample 1 of animal $A$ on the train set stimuli.

With these definitions in hand, the inter-animal mapping consistency from animal A to animal B corresponds to the following true quantity to be estimated:

$$\operatorname{Corr}\!\left(M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}},\; t^{B}_{\mathrm{test}}\right), \tag{3}$$

where $\operatorname{Corr}(\cdot,\cdot)$ is the Pearson correlation across a stimulus set. In what follows, we will argue that Eq (3) can be approximated with the following ratio of measurable quantities, where we split in half and average the noisy trial observations, indexed by 1 and by 2:

$$\operatorname{Corr}\!\left(M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}},\; t^{B}_{\mathrm{test}}\right) \approx \frac{\operatorname{Corr}\!\left(M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right)}{\sqrt{\operatorname{Corr}\!\left(M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}},\; M(s^{A}_{2,\mathrm{train}};\, s^{B}_{2,\mathrm{train}})_{\mathrm{test}}\right)\times \operatorname{Corr}\!\left(s^{B}_{1,\mathrm{test}},\, s^{B}_{2,\mathrm{test}}\right)}}. \tag{4}$$

In words, the inter-animal consistency (i.e., the quantity on the left side of Eq (4)) corresponds to the predictivity of the mapping on the test set stimuli from animal A to animal B on two different (averaged) halves of noisy trials (i.e., the numerator on the right side of Eq (4)), corrected by the square root of the mapping reliability on animal A’s responses to the test set stimuli on two different halves of noisy trials multiplied by the internal consistency of animal B.
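A sketch of how the measurable right-hand side of Eq (4) can be computed for one (source, target) pair and one train/test split is given below; `map_fit` stands in for any mapping-fitting routine (e.g., PLS), and all names are illustrative rather than the authors' code.

```python
# Sketch of the measurable ratio on the right side of Eq (4) (illustrative only).
import numpy as np

def colwise_corr(a, b):
    # Pearson correlation across stimuli, computed separately for each unit (column).
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

def noise_corrected_consistency(map_fit, sA1, sA2, sB1, sB2, train_idx, test_idx):
    # sA1/sA2 and sB1/sB2: trial-averages of two random halves of trials for
    # animals A and B, each of shape (n_images, n_units).
    pred_1 = map_fit(sA1[train_idx], sB1[train_idx], sA1[test_idx])  # M(s_1^A; s_1^B)_test
    pred_2 = map_fit(sA2[train_idx], sB2[train_idx], sA2[test_idx])  # M(s_2^A; s_2^B)_test

    numerator = colwise_corr(pred_1, sB2[test_idx])
    mapping_reliability = colwise_corr(pred_1, pred_2)
    internal_consistency = colwise_corr(sB1[test_idx], sB2[test_idx])
    # One value per target unit; medians across units, animals, bootstrapped trials,
    # and train/test splits are taken afterwards, as described in the text.
    return numerator / np.sqrt(mapping_reliability * internal_consistency)
```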

We justify the approximation in Eq (4) by gradually replacing the true quantities (t) by their measurable estimates (s), starting from the original quantity in Eq (3). First, we make the approximation that:

$$\operatorname{Corr}\!\left(M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right) \approx \operatorname{Corr}\!\left(M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}},\; t^{B}_{\mathrm{test}}\right) \times \operatorname{Corr}\!\left(t^{B}_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right), \tag{5}$$

by the transitivity of positive correlations (which is a reasonable assumption when the number of stimuli is large). Next, by transitivity and normality assumptions in the structure of the noisy estimates and since the number of trials (n) between the two sets is the same, we have that:

$$\operatorname{Corr}\!\left(s^{B}_{1,\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right) \approx \operatorname{Corr}\!\left(s^{B}_{1,\mathrm{test}},\; t^{B}_{\mathrm{test}}\right) \times \operatorname{Corr}\!\left(t^{B}_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right) \approx \operatorname{Corr}\!\left(t^{B}_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right)^{2}. \tag{6}$$

In words, Eq (6) states that the correlation between the average of two sets of noisy observations of n trials each is approximately the square of the correlation between the true value and average of one set of n noisy trials. Therefore, combining Eqs (5) and (6), it follows that:

$$\operatorname{Corr}\!\left(M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}},\; t^{B}_{\mathrm{test}}\right) \approx \frac{\operatorname{Corr}\!\left(M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right)}{\sqrt{\operatorname{Corr}\!\left(s^{B}_{1,\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right)}}. \tag{7}$$

From the right side of Eq (7), we can see that we have removed $t^{B}_{\mathrm{test}}$, but we still need to remove the $M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}}$ term, as this term still contains unmeasurable (i.e., true) quantities. We apply the same two steps, described above, by analogy, though these approximations may not always be true (they are, however, true for Gaussian noise):

$$\begin{aligned}
\operatorname{Corr}\!\left(M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right) &\approx \operatorname{Corr}\!\left(s^{B}_{2,\mathrm{test}},\; M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}}\right) \times \operatorname{Corr}\!\left(M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}},\; M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}}\right),\\[4pt]
\operatorname{Corr}\!\left(M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}},\; M(s^{A}_{2,\mathrm{train}};\, s^{B}_{2,\mathrm{train}})_{\mathrm{test}}\right) &\approx \operatorname{Corr}\!\left(M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}},\; M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}}\right)^{2},
\end{aligned}$$

which taken together implies the following:

$$\operatorname{Corr}\!\left(M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right) \approx \frac{\operatorname{Corr}\!\left(M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right)}{\sqrt{\operatorname{Corr}\!\left(M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}},\; M(s^{A}_{2,\mathrm{train}};\, s^{B}_{2,\mathrm{train}})_{\mathrm{test}}\right)}}. \tag{8}$$

Eqs (7) and (8) together imply the final estimated quantity given in Eq (4).

Multiple animals

For multiple animals, we consider, for each target animal $B$, the average of the true quantity in Eq (3) across source animals $A$, taken over ordered pairs $(A, B)$ of animals:

$$\left\langle \operatorname{Corr}\!\left(M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}},\; t^{B}_{\mathrm{test}}\right)\right\rangle_{A \in \mathcal{A}\,:\,(A,B)\in\mathcal{A}\times\mathcal{A}} \approx \left\langle \frac{\operatorname{Corr}\!\left(M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right)}{\sqrt{\widetilde{\operatorname{Corr}}\!\left(M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}},\; M(s^{A}_{2,\mathrm{train}};\, s^{B}_{2,\mathrm{train}})_{\mathrm{test}}\right)\times \widetilde{\operatorname{Corr}}\!\left(s^{B}_{1,\mathrm{test}},\, s^{B}_{2,\mathrm{test}}\right)}}\right\rangle_{A \in \mathcal{A}\,:\,(A,B)\in\mathcal{A}\times\mathcal{A}}.$$

We also bootstrap across trials, and have multiple train/test splits, in which case the average on the right hand side of the equation includes averages across these as well.

Note that each neuron in our analysis will have this single average value associated with it when it was a target animal (B), averaged over source animals/subsampled source neurons, bootstrapped trials, and train/test splits. This yields a vector of these average values, which we can take median and standard error of the mean (s.e.m.) over, as we do with standard explained variance metrics.

RSA

We can extend the above derivations to other commonly used metrics for comparing representations that involve correlation. Since $\operatorname{RSA}(x, y) \equiv \operatorname{Corr}\!\left(\operatorname{RDM}(x), \operatorname{RDM}(y)\right)$, the corresponding quantity in Eq (4) analogously (by transitivity of positive correlations) becomes:

$$\left\langle \operatorname{RSA}\!\left(M(t^{A}_{\mathrm{train}};\, t^{B}_{\mathrm{train}})_{\mathrm{test}},\; t^{B}_{\mathrm{test}}\right)\right\rangle_{A \in \mathcal{A}\,:\,(A,B)\in\mathcal{A}\times\mathcal{A}} \approx \left\langle \frac{\operatorname{RSA}\!\left(M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}},\; s^{B}_{2,\mathrm{test}}\right)}{\sqrt{\widetilde{\operatorname{RSA}}\!\left(M(s^{A}_{1,\mathrm{train}};\, s^{B}_{1,\mathrm{train}})_{\mathrm{test}},\; M(s^{A}_{2,\mathrm{train}};\, s^{B}_{2,\mathrm{train}})_{\mathrm{test}}\right)\times \widetilde{\operatorname{RSA}}\!\left(s^{B}_{1,\mathrm{test}},\, s^{B}_{2,\mathrm{test}}\right)}}\right\rangle_{A \in \mathcal{A}\,:\,(A,B)\in\mathcal{A}\times\mathcal{A}}. \tag{9}$$

Note that in this case, each animal (rather than neuron) in our analysis will have this single average value associated with it when it was a target animal ($B$) (since RSA is computed over images and neurons), where the average is over source animals/subsampled source neurons, bootstrapped trials, and train/test splits. This yields a vector of these average values, which we can take the median and s.e.m. over, across animals $B \in \mathcal{A}$.

For RSA, we can use the identity mapping (since RSA is also computed over neurons, the number of neurons can differ between the source and target animals while still allowing comparison under the identity mapping). As no parameters are fit, we can choose train = test, so that Eq (9) becomes:

$$\left\langle \operatorname{RSA}\!\left(t^{A},\, t^{B}\right)\right\rangle_{A \in \mathcal{A}\,:\,(A,B)\in\mathcal{A}\times\mathcal{A}} \approx \left\langle \frac{\operatorname{RSA}\!\left(s^{A}_{1},\, s^{B}_{2}\right)}{\sqrt{\widetilde{\operatorname{RSA}}\!\left(s^{A}_{1},\, s^{A}_{2}\right)\times \widetilde{\operatorname{RSA}}\!\left(s^{B}_{1},\, s^{B}_{2}\right)}}\right\rangle_{A \in \mathcal{A}\,:\,(A,B)\in\mathcal{A}\times\mathcal{A}}. \tag{10}$$

Pooled source animal

Oftentimes, we may not have enough neurons per animal to ensure that the estimated inter-animal consistency in our data closely matches the “true” inter-animal consistency. In order to address this issue, we hold out one animal at a time and compare it to the pseudo-population aggregated across units from the remaining animals, as opposed to computing the consistencies in a pairwise fashion. Thus, $B$ is still the target held-out animal as in the pairwise case, but now the average over $A$ is over a sole “pooled” source animal constructed from the pseudo-population of the remaining animals.

Spearman-Brown correction

The Spearman-Brown correction can be applied to each of the terms in the denominator individually, as they are each correlations of observations from half the trials of the same underlying process to itself (unlike the numerator). Namely,

$$\widetilde{\operatorname{Corr}}(X, Y) \equiv \frac{2\,\operatorname{Corr}(X, Y)}{1 + \operatorname{Corr}(X, Y)}.$$

Analogously, since $\operatorname{RSA}(X, Y) \equiv \operatorname{Corr}\!\left(\operatorname{RDM}(X), \operatorname{RDM}(Y)\right)$, we define

$$\widetilde{\operatorname{RSA}}(X, Y) \equiv \widetilde{\operatorname{Corr}}\!\left(\operatorname{RDM}(X), \operatorname{RDM}(Y)\right) = \frac{2\,\operatorname{RSA}(X, Y)}{1 + \operatorname{RSA}(X, Y)}.$$

StreamNet architecture variants

We developed shallower, multiple-streamed architectures for mouse visual cortex, shown in Fig 2A. There are three main modules in our architecture: shallow, intermediate, and deep. The shallow and deep modules each consist of one convolutional layer and the intermediate module consists of a block of two convolutional layers. Thus, the longest length of the computational graph, excluding the readout module, is four (i.e., 1 + 2 + 1). Depending on the number of parallel streams in the model, the intermediate module would contain multiple branches (in parallel), each receiving input from the shallow module. The outputs of the intermediate modules are then passed through one convolutional operation (deep module). Finally, the outputs of each parallel branch would be summed together, concatenated across the channels dimension, and used as input for the readout module.

The readout module consists of an (adaptive) average pooling operation that upsamples the inputs to 6 × 6 feature maps. These feature maps are then flattened, so that each image only has a single feature vector. These feature vectors are then fed into a linear layer (i.e., fully-connected layer) either for classification or for embedding them into a lower-dimensional space for the contrastive losses.
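Reading the module descriptions above together with the parameter table below, a much-simplified PyTorch sketch of a StreamNet variant might look as follows. This is illustrative only: the placement of the pooling operations, the ReLU nonlinearities, and the use of summation (rather than concatenation) to aggregate the parallel branches are our assumptions, not the authors' released implementation.

```python
# Much-simplified StreamNet sketch (illustrative only).
import torch
import torch.nn as nn

class StreamNetSketch(nn.Module):
    def __init__(self, n_streams=2, n_classes=1000):
        super().__init__()
        # Shallow module: (64, 11, 4, 2) convolution followed by stride-2 max pooling.
        self.shallow = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        # Intermediate module: N parallel branches of [(192, 5, 1, 2), (384, 3, 1, 1)].
        # The pooling between the two convolutions is our assumption (AlexNet-style),
        # chosen so the output size matches the 3 x 3 listed in the table below.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(64, 192, kernel_size=5, stride=1, padding=2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(192, 384, kernel_size=3, stride=1, padding=1),
                nn.ReLU(inplace=True),
            )
            for _ in range(n_streams)
        ])
        # Deep module: one (256, 3, 1, 1) convolution applied to each branch output.
        self.deep = nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1)
        # Readout: adaptive average pooling to 6 x 6, flatten, then a linear layer.
        self.readout = nn.Sequential(
            nn.AdaptiveAvgPool2d((6, 6)),
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, n_classes),
        )

    def forward(self, x):
        x = self.shallow(x)
        outs = [self.deep(branch(x)) for branch in self.branches]
        return self.readout(torch.stack(outs, dim=0).sum(dim=0))
```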

The table below describes the parameters of three model variants, each containing one (N = 1), two (N = 2), or six (N = 6) parallel branches. Note that one convolutional layer is denoted by a tuple: (number of filters, filter size, stride, padding). The first convolutional layer has max pooling of stride 2, as in AlexNet. A block of convolutional layers is denoted by a list of tuples, where each tuple in the list corresponds to a single convolutional layer. When a list of tuples is followed by “× N”, this means that the convolutional parameters for each of the N parallel branches are the same.

Module Name | Output Size | Single (N = 1) | Dual (N = 2) | Six (N = 6)
Input | 64 × 64 | N/A | N/A | N/A
Shallow | 7 × 7 | (64, 11, 4, 2) | (64, 11, 4, 2) | (64, 11, 4, 2)
Intermediate | 3 × 3 | [(192, 5, 1, 2), (384, 3, 1, 1)] | [(192, 5, 1, 2), (384, 3, 1, 1)] × 2 | [(192, 5, 1, 2), (384, 3, 1, 1)] × 6
Deep | 3 × 3 | (256, 3, 1, 1) if inputs are from Intermediate, otherwise (256, 3, 2, 0) | (256, 3, 1, 1) if inputs are from Intermediate, otherwise (256, 3, 2, 0) | (256, 3, 1, 1) if inputs are from Intermediate, otherwise (256, 3, 2, 0)

Neural network training objectives

In this section, we briefly describe the supervised and self-supervised objectives that were used to train our models.

Supervised training objective

The loss function L used in supervised training is the cross-entropy loss, defined as follows:

$$\mathcal{L}(X; \theta) = -\frac{1}{N}\sum_{i=1}^{N} \log\!\left(\frac{\exp\!\left(X_{i}[c_{i}]\right)}{\sum_{j=0}^{C-1}\exp\!\left(X_{i}[j]\right)}\right), \tag{11}$$

where $N$ is the batch size, $C$ is the number of categories for the dataset, $X \in \mathbb{R}^{N \times C}$ are the model outputs (i.e., logits) for the $N$ images, $X_{i} \in \mathbb{R}^{C}$ are the logits for the $i$th image, $c_{i} \in [0, C-1]$ is the category index of the $i$th image (zero-indexed), and $\theta$ are the model parameters. Eq (11) was minimized using stochastic gradient descent (SGD) with momentum [70].

ImageNet [19]. This dataset contains approximately 1.3 million images in the train set and 50 000 images in the validation set. Each image was previously labeled into C = 1000 distinct categories.

CIFAR-10 [39]. This dataset contains 50 000 images in the train set and 10 000 images in the validation set. Each image was previously labeled into C = 10 distinct categories.

Depth prediction [40]. The goal of this objective is to predict the depth map of an image. We used a synthetically generated dataset of images known as PBRNet [40]. It contains approximately 500 000 images and their associated depth maps. Similar to the loss function used in the sparse autoencoder objective, we used a mean-squared loss to train the models. The output (i.e., depth map) was generated using a mirrored version of each of our StreamNet variants. In order to generate the depth map, we appended one final convolutional layer onto the output of the mirrored architecture in order to downsample the three image channels to one image channel. During training, random crops of size 224 × 224 pixels were applied to the image and depth map (which were both subsequently resized to 64 × 64 pixels). In addition, both the image and depth map were flipped horizontally with probability 0.5. Finally, prior to the application of the loss function, each depth map was normalized such that the mean and standard deviation across pixels were zero and one respectively.

Each of our single-, dual-, and six-stream variants were trained using a batch size of 256 for 50 epochs using SGD with momentum of 0.9, and weight decay of 0.0001. The initial learning rate was set to 10−4 and was decayed by a factor of 10 at epochs 15, 30, and 45.
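As a concrete illustration, the optimization settings in the preceding paragraph correspond roughly to the following PyTorch training loop. This is a sketch with stand-in data; `StreamNetSketch` refers to the illustrative class above and none of this is the authors' released code.

```python
# Sketch of the supervised training setup described above (illustrative only).
import torch
from torch.utils.data import DataLoader, TensorDataset

model = StreamNetSketch(n_streams=2)                  # e.g., the dual-stream sketch
criterion = torch.nn.CrossEntropyLoss()               # Eq (11)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 30, 45], gamma=0.1)

# Stand-in data so the sketch runs end-to-end; replace with 64 x 64 ImageNet crops.
dummy = TensorDataset(torch.randn(512, 3, 64, 64), torch.randint(0, 1000, (512,)))
train_loader = DataLoader(dummy, batch_size=256, shuffle=True)

for epoch in range(50):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                                   # decay by 10x at epochs 15, 30, 45
```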

Self-supervised training objectives

Sparse autoencoder [42]. The goal of this objective is to reconstruct an image from a sparse image embedding. In order to generate an image reconstruction, we used a mirrored version of each of our StreamNet variants. Concretely, the loss function was defined as follows:

$$\mathcal{L}(x; \theta) = \frac{1}{2 \cdot 64^{2}}\left\lVert f(x) - x \right\rVert_{2}^{2} + \frac{\lambda}{128}\left\lVert v \right\rVert_{1}, \tag{12}$$

where $v \in \mathbb{R}^{128}$ is the image embedding, $f$ is the (mirrored) model, $f(x)$ is the image reconstruction, $x$ is a 64 × 64 pixels image, $\lambda$ is the regularization coefficient, and $\theta$ are the model parameters.
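A minimal sketch of Eq (12), in which `encoder` and `decoder` stand in for the StreamNet backbone and its mirrored counterpart (illustrative names only):

```python
# Sketch of the sparse autoencoder objective in Eq (12) (illustrative only).
import torch

def sparse_autoencoder_loss(x, encoder, decoder, lam=5e-4):
    # x: batch of 64 x 64 images, shape (B, 3, 64, 64)
    v = encoder(x)                                    # 128-d embedding per image
    recon = decoder(v)                                # mirrored model, f(x) in Eq (12)
    recon_term = ((recon - x) ** 2).sum(dim=(1, 2, 3)) / (2 * 64 ** 2)
    sparsity_term = lam * v.abs().sum(dim=1) / 128
    return (recon_term + sparsity_term).mean()
```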

Our single-, dual-, and six-stream variants were trained using a batch size of 256 for 100 epochs using SGD with momentum of 0.9 and weight decay of 0.0005. The initial learning rate was set to 0.01 for the single- and dual-stream variants and was set to 0.001 for the six-stream variant. The learning rate was decayed by a factor of 10 at epochs 30, 60, and 90. For all the StreamNet variants, the embedding dimension was set to 128 and the regularization coefficient was set to 0.0005.

RotNet [44]. The goal of this objective is to predict the rotation of an image. Each image of the ImageNet dataset was rotated four ways (0°, 90°, 180°, 270°) and the four rotation angles were used as “pseudo-labels” or “categories”. The cross-entropy loss was used with these pseudo-labels as the training objective (i.e., Eq (11) with C = 4).

Our single-, dual-, and six-stream variants were trained using a batch size of 192 (which is effectively a batch size of 192 × 4 = 768 due to the four rotations for each image) for 50 epochs using SGD with Nesterov momentum of 0.9, and weight decay of 0.0005. An initial learning rate of 0.01 was decayed by a factor of 10 at epochs 15, 30, and 45.

Instance recognition [45]. The goal of this objective is to be able to differentiate between embeddings of augmentations of one image from embeddings of augmentations of other images. Thus, this objective function is an instance of the class of contrastive objective functions.

A random image augmentation is first performed on each image of the ImageNet dataset (random resized cropping, random grayscale, color jitter, and random horizontal flip). Let x be an image augmentation, and f(⋅) be the model backbone composed with a one-layer linear multi-layer perceptron (MLP) of size 128. The image is then embedded onto a 128-dimensional unit-sphere as follows:

$$z = f(x) / \left\lVert f(x) \right\rVert_{2}, \qquad z \in \mathbb{R}^{128}.$$

Throughout model training, a memory bank containing embeddings for each image in the train set is maintained (i.e., the size of the memory bank is the same as the size of the train set). The embedding z will be “compared” to a subsample of these embeddings. Concretely, the loss function L for one image x is defined as follows:

$$h(u) = \frac{\exp(u \cdot z / \tau)/Z}{\exp(u \cdot z / \tau)/Z + (m/N)}, \qquad \mathcal{L}(x; \theta) = -\log h(v) - \sum_{j=1}^{m}\log\!\left(1 - h(v_{j})\right), \tag{13}$$

where $v \in \mathbb{R}^{128}$ is the embedding for image $x$ that is currently stored in the memory bank, $N$ is the size of the memory bank, $m = 4096$ is the number of “negative” samples used, $\{v_{j}\}_{j=1}^{m}$ are the negative embeddings sampled uniformly from the memory bank, $Z$ is a normalization constant, $\tau = 0.07$ is a temperature hyperparameter, and $\theta$ are the parameters of $f$. From Eq (13), we see that we want to maximize $h(v)$, which corresponds to maximizing the similarity between $v$ and $z$ (recall that $z$ is the embedding for $x$ obtained using $f$). We can also see that we want to maximize $1 - h(v_{j})$ (or minimize $h(v_{j})$). This would correspond to minimizing the similarity between $v_{j}$ and $z$ (recall that the $v_{j}$ are the negative embeddings).
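The loss in Eq (13) for a single image can be sketched as follows; the memory bank, the uniform sampling of negatives, and the normalization constant Z are simplified here (Z is treated as a fixed constant, as in the noise-contrastive-estimation approximation), and all names are illustrative.

```python
# Sketch of the instance recognition loss in Eq (13) (illustrative only).
import torch

def instance_recognition_loss(z, v, negatives, bank_size, tau=0.07, Z=1.0):
    # z: (128,) current embedding of image x; v: (128,) its memory-bank embedding;
    # negatives: (m, 128) embeddings sampled uniformly from the memory bank.
    m = negatives.shape[0]                              # m = 4096 in the text

    def h(u):
        p = torch.exp(u @ z / tau) / Z
        return p / (p + m / bank_size)

    return -torch.log(h(v)) - torch.log(1.0 - h(negatives)).sum()
```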

After each iteration of training, the embeddings for the current batch are used to update the memory bank (at their corresponding positions in the memory bank) via a momentum update. Concretely, for image x, its embedding in the memory bank v is updated using its current embedding z as follows:

$$v \leftarrow \lambda v + (1 - \lambda) z, \qquad v \leftarrow v / \left\lVert v \right\rVert_{2},$$

where λ = 0.5 is the momentum coefficient. The second operation on v is used to project v back onto the 128-dimensional unit sphere.

Our single-, dual-, and six-stream variants were trained using a batch size of 256 for 200 epochs using SGD with momentum of 0.9, and weight decay of 0.0005. An initial learning rate of 0.03 was decayed by a factor of 10 at epochs 120 and 160.

MoCov2 [47, 71]. The goal of this objective is to be able to distinguish augmentations of one image (i.e., by labeling them as “positive”) from augmentations of other images (i.e., by labeling them as “negative”). Intuitively, embeddings of different augmentations of the same image should be more “similar” to each other than to embeddings of augmentations of other images. Thus, this algorithm is another instance of the class of contrastive objective functions and is similar conceptually to instance recognition.

Two image augmentations are first generated for each image in the ImageNet dataset by applying random resized cropping, color jitter, random grayscale, random Gaussian blur, and random horizontal flips. Let x1 and x2 be the two augmentations for one image. Let fq(⋅) be a query encoder, which is a model backbone composed with a two-layer non-linear MLP of dimensions 2048 and 128 respectively and let fk(⋅) be a key encoder, which has the same architecture as fq. x1 is encoded by fq and x2 is encoded by fk as follows:

$$v = f_{q}(x_{1}), \qquad k_{0} = f_{k}(x_{2}), \qquad v, k_{0} \in \mathbb{R}^{128}.$$

During each iteration of training, a dictionary of size K of image embeddings obtained from previous iterations is maintained (i.e., the dimensions of the dictionary are K × 128). The image embeddings in this dictionary are used as “negative” samples. The loss function L for one image of a batch is defined as follows:

$$\mathcal{L}(x_{1}, x_{2}; \theta_{q}) = -\log \frac{\exp(v \cdot k_{0}/\tau)}{\sum_{i=0}^{K}\exp(v \cdot k_{i}/\tau)}, \tag{14}$$

where $\theta_{q}$ are the parameters of $f_{q}$, $\tau = 0.2$ is a temperature hyperparameter, $K = 65\,536$ is the number of “negative” samples, and $\{k_{i}\}_{i=1}^{K}$ are the embeddings of the negative samples (i.e., the augmentations for other images, which are encoded using $f_{k}$ and stored in the dictionary). From Eq (14), we see that we want to maximize $v \cdot k_{0}$, which corresponds to maximizing the similarity between the embeddings of the two augmentations of an image.
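A common way to compute Eq (14), which we sketch below, is to place the positive logit at index 0 and apply a cross-entropy loss with an all-zeros label vector; the queue of negative keys and the two encoders are assumed to be maintained elsewhere, and the names are illustrative.

```python
# Sketch of the MoCo-v2 InfoNCE loss in Eq (14) (illustrative only).
import torch
import torch.nn.functional as F

def moco_loss(q, k_pos, queue, tau=0.2):
    # q: (B, 128) query embeddings; k_pos: (B, 128) positive key embeddings;
    # queue: (K, 128) negative keys from previous iterations (K = 65,536 in the text).
    q = F.normalize(q, dim=1)
    k_pos = F.normalize(k_pos, dim=1).detach()
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)        # (B, 1) positive logits
    l_neg = q @ queue.t()                                # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau      # positives at index 0
    labels = torch.zeros(q.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)               # equivalent to Eq (14)
```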

After each iteration of training, the dictionary of negative samples is enqueued with the embeddings from the most recent iteration, while embeddings that have been in the dictionary for the longest are dequeued. Finally, the parameters θk of fk are updated via a momentum update, as follows:

$$\theta_{k} \leftarrow \lambda\,\theta_{k} + (1 - \lambda)\,\theta_{q},$$

where λ = 0.999 is the momentum coefficient. Note that only θq are updated with back-propagation.

Our single-, dual-, and six-stream variants were trained using a batch size of 512 for 200 epochs using SGD with momentum of 0.9, and weight decay of 0.0005. An initial learning rate of 0.06 was used, and the learning rate was decayed to 0.0 using a cosine schedule (with no warm-up).

SimCLR [46]. The goal of this objective is conceptually similar to that of MoCov2, where the embeddings of augmentations of one image should be distinguishable from the embeddings of augmentations of other images. Thus, SimCLR is another instance of the class of contrastive objective functions.

Similar to other contrastive objective functions, two image augmentations are first generated for each image in the ImageNet dataset (by using random cropping, random horizontal flips, random color jittering, random grayscaling and random Gaussian blurring). Let f(⋅) be the model backbone composed with a two-layer non-linear MLP of dimensions 2048 and 128 respectively. The two image augmentations are first embedded into a 128-dimensional space and normalized:

$$z_{1} = f(x_{1}) / \left\lVert f(x_{1}) \right\rVert_{2}, \qquad z_{2} = f(x_{2}) / \left\lVert f(x_{2}) \right\rVert_{2}, \qquad z_{1}, z_{2} \in \mathbb{R}^{128}.$$

The loss function L for a single pair of augmentations of an image is defined as follows:

$$\mathcal{L}(x_{1}, x_{2}; \theta) = -\log \frac{\exp(z_{1} \cdot z_{2}/\tau)}{\sum_{i=1}^{2N}\mathbb{1}_{[i \neq 1]}\exp(z_{1} \cdot z_{i}/\tau)}, \tag{15}$$

where $\tau = 0.1$ is a temperature hyperparameter, $N$ is the batch size, $\mathbb{1}_{[i \neq 1]}$ is equal to 1 if $i \neq 1$ and 0 otherwise, and $\theta$ are the parameters of $f$. The loss defined in Eq (15) is computed for every pair of images in the batch (including their augmentations) and subsequently averaged.
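A batched sketch of Eq (15), mirroring common reference implementations (not necessarily the authors' exact code); it assumes the embeddings passed in are already normalized as above.

```python
# Sketch of the SimCLR (NT-Xent) loss in Eq (15) over a batch (illustrative only).
import torch
import torch.nn.functional as F

def simclr_loss(z1, z2, tau=0.1):
    # z1, z2: (N, 128) normalized embeddings of the two augmentations of each image.
    N = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)                       # (2N, 128)
    sim = (z @ z.t()) / tau                              # (2N, 2N) similarity logits
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # The positive pair for sample i is its other augmentation, offset by N.
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)])
    return F.cross_entropy(sim, targets)
```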

Our single-, dual-, and six-stream variants were trained using a batch size of 4096 for 200 epochs using layer-wise adaptive rate scaling (LARS; [72]) with momentum of 0.9, and weight decay of 10−6. An initial learning rate of 4.8 was used and decayed to 0.0 using a cosine schedule. A linear warm-up of 10 epochs was used for the learning rate with warm-up ratio of 0.0001.

SimSiam [48]. The goal of this objective is to maximize the similarity between the embeddings of two augmentations of the same image. Thus, SimSiam is another instance of the class of contrastive objective functions.

Two random image augmentations (i.e., random resized crop, random horizontal flip, color jitter, random grayscale, and random Gaussian blur) are first generated for each image in the ImageNet dataset. Let x1 and x2 be the two augmentations of the same image, f(⋅) be the model backbone, g(⋅) be a three-layer non-linear MLP, and h(⋅) be a two-layer non-linear MLP. The three-layer MLP has hidden dimensions of 2048, 2048, and 2048. The two-layer MLP has hidden dimensions of 512 and 2048 respectively. Let θ be the parameters for f, g, and h. The loss function L for one image x of a batch is defined as follows (recall that x1 and x2 are two augmentations of one image):

$$p_{1} = h(g(f(x_{1}))), \quad p_{2} = h(g(f(x_{2}))), \quad z_{1} = g(f(x_{1})), \quad z_{2} = g(f(x_{2})),$$
$$\mathcal{L}(x_{1}, x_{2}; \theta) = -\frac{1}{2}\left(\frac{z_{1} \cdot p_{2}}{\left\lVert z_{1} \right\rVert_{2}\left\lVert p_{2} \right\rVert_{2}} + \frac{z_{2} \cdot p_{1}}{\left\lVert z_{2} \right\rVert_{2}\left\lVert p_{1} \right\rVert_{2}}\right), \tag{16}$$

where $z_{1}, z_{2}, p_{1}, p_{2} \in \mathbb{R}^{2048}$. Note that $z_{1}$ and $z_{2}$ are treated as constants in this loss function (i.e., the gradients are not back-propagated through $z_{1}$ and $z_{2}$). This “stop-gradient” method was key to the success of this objective function.
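A sketch of Eq (16) with the stop-gradient made explicit; `f`, `g`, and `h` stand for the backbone, projector, and predictor, and the names are illustrative.

```python
# Sketch of the SimSiam loss in Eq (16) with stop-gradient (illustrative only).
import torch
import torch.nn.functional as F

def simsiam_loss(f, g, h, x1, x2):
    z1, z2 = g(f(x1)), g(f(x2))          # projections (2048-d)
    p1, p2 = h(z1), h(z2)                # predictions (2048-d)

    def neg_cosine(p, z):
        z = z.detach()                   # stop-gradient, key to avoiding collapse
        return -(F.normalize(p, dim=1) * F.normalize(z, dim=1)).sum(dim=1).mean()

    return 0.5 * (neg_cosine(p1, z2) + neg_cosine(p2, z1))
```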

Our single-, dual-, and six-stream variants were trained using a batch size of 512 for 100 epochs using SGD with momentum of 0.9, and weight decay of 0.0001. An initial learning rate of 0.1 was used, and the learning rate was decayed to 0.0 using a cosine schedule (with no warm-up).

Barlow twins [49]. This method is inspired by Horace Barlow’s theory that sensory systems reduce redundancy in their inputs [73]. Let $x_{1}$ and $x_{2}$ be the two augmentations (random crops and color distortions) of the same image, $f(\cdot)$ be the model backbone, and let $h(\cdot)$ be a three-layer non-linear MLP (each of output dimension 8192). Given $z_{1}, z_{2} \in \mathbb{R}^{8192}$, where $z_{1} = h(f(x_{1}))$ and $z_{2} = h(f(x_{2}))$, this method proposes an objective function which tries to make the cross-correlation matrix computed from the twin embeddings $z_{1}$ and $z_{2}$ as close to the identity matrix as possible:

$$z_{1} = h(f(x_{1})), \qquad z_{2} = h(f(x_{2})),$$
$$\mathcal{L}(x_{1}, x_{2}; \theta) = \sum_{i}\left(1 - C_{ii}\right)^{2} + \lambda \sum_{i}\sum_{j \neq i} C_{ij}^{2}, \qquad C_{ij} \equiv \frac{\sum_{b} z_{1}^{b,i}\, z_{2}^{b,j}}{\sqrt{\sum_{b}\left(z_{1}^{b,i}\right)^{2}}\,\sqrt{\sum_{b}\left(z_{2}^{b,j}\right)^{2}}}, \tag{17}$$

where b indexes batch examples and i, j index the embedding output dimension.
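A sketch of Eq (17) is given below. Note that the official Barlow Twins implementation batch-normalizes the embeddings before computing the cross-correlation, whereas this sketch follows the normalization written in Eq (17); the names are illustrative.

```python
# Sketch of the Barlow Twins loss in Eq (17) (illustrative only).
import torch

def barlow_twins_loss(z1, z2, lam=0.0051):
    # z1, z2: (B, D) embeddings of the two augmentations of each image in the batch.
    c = (z1.t() @ z2) / (
        z1.pow(2).sum(dim=0).sqrt().unsqueeze(1) * z2.pow(2).sum(dim=0).sqrt().unsqueeze(0)
    )                                                     # C_ij as defined in Eq (17)
    on_diag = (1 - torch.diagonal(c)).pow(2).sum()        # push C_ii toward 1
    off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()  # push C_ij (i != j) toward 0
    return on_diag + lam * off_diag
```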

We trained AlexNet (with 64 × 64 image inputs) with the recommended hyperparameters of λ = 0.0051, weight decay of 10−6, and batch size of 2048 with the LARS [72] optimizer employing learning rate warm-up of 10 epochs under a cosine schedule. We found that training stably completed after 58 epochs for this particular model architecture.

VICReg [50]. Let $x_{1}$ and $x_{2}$ be the two augmentations (random crops and color distortions) of the same image, $f(\cdot)$ be the model backbone, and let $h(\cdot)$ be a three-layer non-linear MLP (each of output dimension 8192). Given $z_{1}, z_{2} \in \mathbb{R}^{8192}$, where $z_{1} = h(f(x_{1}))$ and $z_{2} = h(f(x_{2}))$, this method proposes an objective function that contains three terms (a code sketch follows the list below):

  1. Invariance: minimizes the mean square distance between the embedding vectors.

  2. Variance: enforces the embedding vectors of samples within a batch to be different, via a hinge loss that keeps the standard deviation of each embedding variable above a given threshold (set to 1).

  3. Covariance: prevents informational collapse through highly correlated variables by attracting the covariances between every pair of embedding variables towards zero.
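The sketch below illustrates the three terms; the per-term weighting coefficients are omitted here (the original VICReg paper weights the terms), and all names are illustrative.

```python
# Sketch of the three VICReg terms described above (illustrative only).
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, eps=1e-4):
    # z1, z2: (B, D) twin embeddings for a batch.
    B, D = z1.shape
    invariance = F.mse_loss(z1, z2)                              # term 1

    def variance_term(z):
        std = (z.var(dim=0) + eps).sqrt()
        return torch.relu(1.0 - std).mean()                      # hinge at threshold 1
    variance = variance_term(z1) + variance_term(z2)             # term 2

    def covariance_term(z):
        z = z - z.mean(dim=0)
        cov = (z.t() @ z) / (B - 1)
        off_diag = cov.pow(2).sum() - torch.diagonal(cov).pow(2).sum()
        return off_diag / D
    covariance = covariance_term(z1) + covariance_term(z2)       # term 3

    return invariance + variance + covariance
```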

We trained AlexNet (with 64 × 64 image inputs) with the recommended hyperparameters of weight decay of 10−6 and batch size of 2048 with the LARS [72] optimizer employing learning rate warm-up of 10 epochs under a cosine schedule, for 1000 training epochs total.

Top-1 validation set performance

Performance of primate models on 224 × 224 pixels and 64 × 64 pixels ImageNet

Here we report the top-1 validation set accuracy of models trained in a supervised manner on 64 × 64 pixels and 224 × 224 pixels ImageNet.

Architecture | Image Size | Objective Function | Top-1 Accuracy
AlexNet | 224 × 224 | Supervised (ImageNet) | 56.52%
AlexNet | 64 × 64 | Supervised (ImageNet) | 36.22%
VGG16 | 224 × 224 | Supervised (ImageNet) | 71.59%
VGG16 | 64 × 64 | Supervised (ImageNet) | 58.32%
ResNet-18 | 224 × 224 | Supervised (ImageNet) | 69.76%
ResNet-18 | 64 × 64 | Supervised (ImageNet) | 53.31%
ResNet-34 | 224 × 224 | Supervised (ImageNet) | 73.31%
ResNet-34 | 64 × 64 | Supervised (ImageNet) | 58.24%
ResNet-50 | 224 × 224 | Supervised (ImageNet) | 76.13%
ResNet-50 | 64 × 64 | Supervised (ImageNet) | 63.57%
ResNet-101 | 224 × 224 | Supervised (ImageNet) | 77.37%
ResNet-101 | 64 × 64 | Supervised (ImageNet) | 64.53%
ResNet-152 | 224 × 224 | Supervised (ImageNet) | 78.31%
ResNet-152 | 64 × 64 | Supervised (ImageNet) | 65.21%

Performance of StreamNet variants on 64 × 64 pixels CIFAR-10 and 64 × 64 pixels ImageNet

Here we report the top-1 validation set accuracy of our model variants trained in a supervised manner on 64 × 64 pixels CIFAR-10 and ImageNet.

Architecture | Dataset | Objective Function | Top-1 Accuracy
Single Stream | CIFAR-10 | Supervised | 76.52%
Single Stream | ImageNet | Supervised | 34.87%
Dual Stream | CIFAR-10 | Supervised | 81.13%
Dual Stream | ImageNet | Supervised | 38.68%
Six Stream | CIFAR-10 | Supervised | 78.73%
Six Stream | ImageNet | Supervised | 34.15%

Transfer performance of StreamNet variants on 64 × 64 pixels ImageNet under linear evaluation for models trained with self-supervised objectives

In this subsection, we report the top-1 ImageNet validation set performance under linear evaluation for models trained with self-supervised objectives. After training each model on a self-supervised objective, the model backbone weights are then held fixed and a linear readout head is trained on top of the fixed model backbone. In the case where the objective function is “untrained”, model parameters were randomly initialized and held fixed while the linear readout head was trained. The image augmentations used during transfer learning were random cropping and random horizontal flipping. The linear readout for every self-supervised model was trained with the cross-entropy loss function (i.e., Eq (11) with C = 1000) for 100 epochs, which was minimized using SGD with momentum of 0.9, and weight decay of 10−9. The initial learning rate was set to 0.1 and reduced by a factor of 10 at epochs 30, 60, and 90.

Reinforcement learning task

A set of state-action-reward-state tuples (i.e., (st, at, rt, st+1)) were generated in prior work [53] and were used to train (in an offline fashion) the reinforcement learning (RL) agent. We used an offline RL algorithm known as critic-regularized regression [56]. Except for the visual encoder, the architecture of the RL agent was identical to that used by Wang et al. [56] (cf. Fig 3 in Wang et al. [56]). Four different visual encoders based on the AlexNet architecture were used:

  • Contrastive ImageNet: AlexNet trained using ImageNet in a contrastive manner (Instance Recognition). Up to the first four layers are implanted into the virtual rodent as its visual system, as these were the layers that best matched mouse visual areas. This visual encoder is the best model of mouse visual cortex (Fig 2). Its weights were held fixed during training of the RL agent.

  • Supervised ImageNet: AlexNet trained using ImageNet in a supervised manner. Up to the first four layers are implanted into the virtual rodent as its visual system, as these were the layers that best matched mouse visual areas. Its weights were held fixed during training of the RL agent.

  • Contrastive Maze: AlexNet trained using the egocentric maze inputs from the virtual rodent reward-based navigation task. Up to the first four layers are implanted into the virtual rodent as its visual system, as these were the layers that best matched mouse visual areas. Its weights were then held fixed during training of the RL agent.

  • Supervised Maze: The first four convolutional layers of AlexNet (ShallowNet) trained end-to-end on the virtual rodent reward-based navigation task.

After training the agent once until policy-loss convergence (reached by 10 000 steps), we evaluated it on 300 episodes. Each model was trained twice (i.e., with two different random seeds), so each model was evaluated on 300 × 2 = 600 episodes in total, and we report the average reward across all 600 episodes.

Below, we report the average rewards (and s.e.m. across the 600 episodes) obtained when each of the visual backbones are used by the RL agent to perform the task.

Evaluating model performance on downstream visual tasks

To evaluate transfer performance on downstream visual tasks, we used the activations from the outputs of the shallow, intermediate, and deep modules of our StreamNet variants. We also included the average-pooling layer in all the variants (the model layer prior to the fully-connected readout layer). The dimensionality of the activations was then reduced to 1000 dimensions using principal components analysis (PCA) whenever the number of features exceeded 1000; otherwise, the activations were used as-is. A linear readout on these features was then used to perform five transfer visual tasks.

For the first four object-centric visual tasks (object categorization, pose estimation, position estimation, and size estimation), we used a stimulus set that was used previously in the evaluation of neural network models of the primate visual system [17, 20, 74]. The stimulus set consists of objects from eight categories, rendered in a variety of poses (object rotations about the x, y, and z axes), positions (vertical and horizontal coordinates of the object), and sizes. We then performed five-fold cross-validation on the training split of the medium and high variation image subsets (“Var3” and “Var6”, defined by Majaj et al. [75]), consisting of 3840 images, and computed the performance (metrics defined below) on the test split of the medium and high variation sets (“Var3” and “Var6”), consisting of 1280 images. Ten different category-balanced train-test splits were randomly selected, and the performance of the best model layer (averaged across train-test splits) was reported for each model. All images were resized to 64 × 64 pixels prior to fitting, to account for the visual acuity adjustment. The final, non-object-centric task was texture recognition, using the Describable Textures Dataset [58].

Object categorization

We fit a linear support vector classifier to the PCA-transformed activations of each model layer. The regularization parameter,

$$C \in \left\{10^{-8},\, 5\times10^{-8},\, 10^{-7},\, 5\times10^{-7},\, 10^{-6},\, 5\times10^{-6},\, 10^{-5},\, 5\times10^{-5},\, 10^{-4},\, 5\times10^{-4},\, 10^{-3},\, 5\times10^{-3},\, 10^{-2},\, 5\times10^{-2},\, 10^{-1},\, 5\times10^{-1},\, 1,\, 5,\, 10^{2},\, 5\times10^{2},\, 10^{3},\, 5\times10^{3},\, 10^{4},\, 5\times10^{4},\, 10^{5},\, 5\times10^{5},\, 10^{6},\, 5\times10^{6},\, 10^{7},\, 5\times10^{7},\, 10^{8},\, 5\times10^{8}\right\}, \tag{18}$$

was chosen by five-fold cross-validation. The categories are Animals, Boats, Cars, Chairs, Faces, Fruits, Planes, and Tables. We reported the classification accuracy averaged across the ten train-test splits.
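A sketch of this categorization readout (PCA followed by a linear SVC, with the regularization grid in (18) chosen by five-fold cross-validation) is shown below; `c_grid` is assumed to hold the values listed in (18), and all names are illustrative rather than the authors' code.

```python
# Sketch of the transfer-evaluation readout described above (illustrative only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def evaluate_layer(layer_feats, labels, train_idx, test_idx, c_grid):
    # layer_feats: (n_images, n_features) activations from one model module.
    feats = layer_feats
    if feats.shape[1] > 1000:                            # reduce to 1000 dims with PCA
        feats = PCA(n_components=1000).fit_transform(feats)

    clf = GridSearchCV(LinearSVC(), {"C": list(c_grid)}, cv=5)   # five-fold CV over C
    clf.fit(feats[train_idx], labels[train_idx])
    return clf.score(feats[test_idx], labels[test_idx])  # test-split accuracy
```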

Position estimation

We predicted both the vertical and the horizontal locations of the object center in the image. We used Ridge regression where the regularization parameter was selected from:

$$\alpha = 1/C, \tag{19}$$

where C was selected from the list defined in (18). For each network, we reported the correlation averaged across both locations for the best model layer.

Pose estimation

This task was similar to the position prediction task, except that the prediction targets were the z-axis (vertical axis) and the y-axis (horizontal axis) rotations, both of which ranged between −90 degrees and 90 degrees. The (0, 0, 0) angle was defined on a per-category basis and was chosen to make the (0, 0, 0) angle “semantically” consistent across different categories. We refer the reader to Hong et al. [55] for more details. We used Ridge regression with α chosen from the range in (19).

Size estimation

The prediction target was the three-dimensional object scale, which was used to generate the image in the rendering process. This target varied from 0.625 to 1.6, relative to a fixed canonical size of 1. When objects were at the canonical size, they occluded around 40% of the image on the longest axis. We used Ridge regression with α chosen from the range in (19).

Texture classification

We trained linear readouts of the model layers on texture recognition using the Describable Textures Dataset [58], which consists of 5640 images organized according to 47 categories, with 120 images per category. We used ten category-balanced train-test splits, provided by their benchmark. Each split consisted of 3760 train-set images and 1880 test-set images. A linear support vector classifier was then fit with C chosen from the range in (18). We reported the classification accuracy averaged across the ten train-test splits.

Supporting information

S1 Fig. StreamNet architecture schematic.

The first four convolutional layers of AlexNet best corresponded to all the mouse visual areas. These convolutional layers were used as the basis for our StreamNet architecture variants. The number of parallel streams, N, was varied to be one (single-stream), two (dual-stream), or six (six-stream).

(TIF)

S2 Fig. Shallow architectures trained with contrastive objective functions yield the best matches to the neural data (calcium imaging dataset).

As in Fig 2, but for the calcium imaging dataset. A. The median and s.e.m. neural predictivity, using PLS regression, across neurons in all mouse visual areas except RL. N = 16228 units in total (RL is excluded, as mentioned in the “Neural Response Datasets” section). Actual neural predictivity performance can be found in Table 2. The “Primate Model Baseline” denotes a supervised VGG16 trained on 224 px inputs, used in prior work [14, 15, 36]. All models, except for the “Primate Model Baseline”, are trained on 64 px inputs. B. Each model’s performance on ImageNet is plotted against its median neural predictivity across all units from each visual area. Inset. Primate ventral visual stream neural predictivity from BrainScore is correlated with ImageNet categorization accuracy (adapted from Schrimpf et al. [17]). All ImageNet performance numbers can be found in Table 2. Color scheme as in A and Fig 2A.

(TIF)

S3 Fig. Shallow architectures trained with contrastive objective functions yield the best matches to the neural data (RSA).

A. The median and s.e.m. noise-corrected neural predictivity, using RSA, across N = 39 and N = 90 animals for the Neuropixels and calcium imaging dataset respectively (across all visual areas, with RL excluded for the calcium imaging dataset, as mentioned in the “Neural Response Datasets” section). The “Primate Model Baseline” denotes a supervised VGG16 trained on 224 px inputs, used in prior work [14, 15, 36]. All models, except for the “Primate Model Baseline”, are trained on 64 px inputs. B. We plot each model’s performance on ImageNet against its median neural predictivity, using RSA, across visual areas. All ImageNet performance numbers can be found in Table 2. Color scheme as in A and Fig 2A.

(TIF)

S4 Fig. Structural and functional factors leading to improved neural response predictivity (calcium imaging dataset).

A. As in Fig 3, our dual stream variant (red) and Contrastive AlexNet (brown) were trained using lower resolution ImageNet images and in a contrastive manner. Each image was downsampled from 224 × 224 pixels, the image size typically used to train primate ventral stream models, to various image sizes. Training models on resolutions lower than 224 × 224 pixels generally led to improved neural predictivity. The median and s.e.m. across neurons in each visual area is reported. As mentioned in the “Neural Response Datasets” section, visual area RL was removed from the calcium imaging neural predictivity results. Refer to Table 1 for N units per visual area. B. As in Fig 2B, AlexNet was either untrained, trained in a supervised manner (ImageNet), or trained in a self-supervised manner (instance recognition). We observe that the first four convolutional layers provide the best fits to the neural responses for all the visual areas, while the latter three layers are not very predictive for any visual area. As mentioned in the “Neural Response Datasets” section, visual area RL was removed from the calcium imaging neural predictivity results.

(TIF)

S5 Fig. Inter-animal consistency as a function of the number of units in the datasets for different mapping transforms.

Inter-animal consistency was extrapolated across the number of units in each dataset using a log-linear function (f(n) = a log10(n) + b, where a and b are fit as parameters via least squares, and n is the sample size factor). This analysis reveals that inter-animal consistency of the Neuropixels dataset approaches 1.0 more rapidly than it does for the calcium imaging dataset. Inter-animal consistency evaluated at a sample size factor of one indicates the consistency when all the existing units in the datasets are used (i.e., inter-animal consistency values reported in Fig 1A).

(TIF)

S6 Fig. Inter-animal consistency can increase with more stimuli.

A. Inter-animal consistency under PLS regression evaluated on the train set (left bars for each visual area) and test set (right bars for each visual area), for both Neuropixels and calcium imaging datasets. The horizontal lines are the internal consistency (split-half reliability). B. Inter-animal consistency under PLS regression on the train set (dotted lines) and test set (solid lines), aggregated across visual areas. Each dot corresponds to the inter-animal consistency evaluated across 10 train-test splits, where each split is a sample of the natural scene image set corresponding to the percentage (x-axis). Note that RL is excluded for calcium imaging, as explained in the text (the “Neural Response Datasets” section). The median and s.e.m. across neurons is reported for both panels. Refer to Table 1 for N units per visual area.

(TIF)

S7 Fig. Data augmentations alone do not lead to improved neural predictivity.

Here we compare the neural predictivity of AlexNet trained in three different ways. Contrastive ImageNet is an AlexNet trained using instance recognition on ImageNet with augmentations that are part of the contrastive algorithm (random crop, random color jitter, random grayscale, random horizontal flip). Supervised ImageNet is an AlexNet trained on ImageNet in a supervised manner with a smaller set of augmentations (random crop and random horizontal flip). Supervised ImageNet (contrastive augmentations) is an AlexNet trained on ImageNet in a supervised manner with the augmentations used in the instance recognition algorithm. This control model allows us to ascertain whether the improved neural predictivity of the Contrastive ImageNet model (red) is due to the contrastive loss function itself or due to the larger set of image augmentations used during model training. In both neural response datasets, we can conclude that data augmentations alone do not contribute to improved correspondence with the mouse visual areas.

(TIF)

S8 Fig. Out-of-distribution task performance across model layers.

For the models trained on images from the maze environment, we plot their transfer performance on a set of out-of-distribution tasks (described in the lower right panel of Fig 6C) across model layers. We find that intermediate model layers are better able to perform the transfer tasks, and that the model layers that attain peak performance on the tasks correspond to those that best predict neural responses in the intermediate/higher mouse visual areas (see Fig 2B).

(TIF)

S9 Fig. Neural predictivity on the Neuropixels dataset while fixing architecture (left) and objective function (right).

The data in the left panel shows how neural predictivity varies when the architecture is fixed to be AlexNet, but the objective function is varied between supervised (object categorization) and self-supervised, contrastive objectives. The data in the right panel shows how neural predictivity varies when the objective function is fixed to be instance recognition, but the architecture is varied, including StreamNets, MouseNets, VGG16, ResNets, and AlexNet.

(TIF)

S10 Fig. Brain hierarchy score.

The brain hierarchy score metric of Nonaka et al. [51] was computed for a set of feedforward CNNs. Self-supervised, contrastive models (and shallower models) have a higher brain hierarchy score, computed using the mapping from model features to electrophysiological responses.

(TIF)

Acknowledgments

We thank Shahab Bakhtiari, Katherine L. Hermann, and Akshay Jagadeesh for helpful discussions, and Eshed Margalit and Xiaoxuan Jia for helpful feedback on an initial draft of the manuscript.

Data Availability

All relevant neural data is accessible at: https://github.com/AllenInstitute/AllenSDK. All source code is available at: https://github.com/neuroailab/mouse-vision.

Funding Statement

A.N. is supported by the K. Lisa Yang Integrative Computational Neuroscience (ICoN) Center at MIT. N.C.L.K. is supported by the Stanford University Ric Weiland Graduate Fellowship. J.L.G. acknowledges support from the Wu Tsai Neurosciences Institute and Institute for Human-Centered AI. A.M.N. is supported by the Stanford Institute for Human Centered Artificial Intelligence. D.L.K.Y. is supported by the James S. McDonnell Foundation (Understanding Human Cognition Award Grant No. 220020469), the Simons Foundation (Collaboration on the Global Brain Grant No. 543061), the Sloan Foundation (Fellowship FG-2018-10963), the National Science Foundation (RI 1703161 and CAREER Award 1844724), the DARPA Machine Common Sense program, and hardware donation from the NVIDIA Corporation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Huberman AD, Niell CM. What can mice tell us about how vision works? Trends in Neurosciences. 2011;34(9):464–473. doi: 10.1016/j.tins.2011.07.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Glickfeld LL, Olsen SR. Higher-order areas of the mouse visual cortex. Annual Review of Vision Science. 2017;3:251–273. doi: 10.1146/annurev-vision-102016-061331 [DOI] [PubMed] [Google Scholar]
  • 3. O’Keefe J, Dostrovsky J. The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain research. 1971;. [DOI] [PubMed] [Google Scholar]
  • 4. Hafting T, Fyhn M, Molden S, Moser MB, Moser EI. Microstructure of a spatial map in the entorhinal cortex. Nature. 2005;436(7052):801–806. doi: 10.1038/nature03721 [DOI] [PubMed] [Google Scholar]
  • 5. Kropff E, Carmichael JE, Moser MB, Moser EI. Speed cells in the medial entorhinal cortex. Nature. 2015;523(7561):419–424. doi: 10.1038/nature14622 [DOI] [PubMed] [Google Scholar]
  • 6. Solstad T, Boccara CN, Kropff E, Moser MB, Moser EI. Representation of geometric borders in the entorhinal cortex. Science. 2008;322(5909):1865–1868. doi: 10.1126/science.1166466 [DOI] [PubMed] [Google Scholar]
  • 7. Sargolini F, Fyhn M, Hafting T, McNaughton BL, Witter MP, Moser MB, et al. Conjunctive representation of position, direction, and velocity in entorhinal cortex. Science. 2006;312(5774):758–762. doi: 10.1126/science.1125572 [DOI] [PubMed] [Google Scholar]
  • 8. Yamins DL, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences. 2014;111(23):8619–8624. doi: 10.1073/pnas.1403112111 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Khaligh-Razavi SM, Kriegeskorte N. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology. 2014;10(11):e1003915. doi: 10.1371/journal.pcbi.1003915 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Güçlü U, van Gerven MA. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience. 2015;35(27):10005–10014. doi: 10.1523/JNEUROSCI.5023-14.2015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Cichy RM, Khosla A, Pantazis D, Torralba A, Oliva A. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports. 2016;6(1):1–13. doi: 10.1038/srep27755 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Cadena SA, Denfield GH, Walker EY, Gatys LA, Tolias AS, Bethge M, et al. Deep convolutional models improve predictions of macaque V1 responses to natural images. PLoS Computational Biology. 2019;15(4):e1006897. doi: 10.1371/journal.pcbi.1006897 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Bashivan P, Kar K, DiCarlo JJ. Neural population control via deep image synthesis. Science. 2019;364 (6439). doi: 10.1126/science.aav9436 [DOI] [PubMed] [Google Scholar]
  • 14. Cadena SA, Sinz FH, Muhammad T, Froudarakis E, Cobos E, Walker EY, et al. How well do deep neural networks trained on object recognition characterize the mouse visual system? NeurIPS Neuro AI Workshop. 2019;. [Google Scholar]
  • 15. de Vries SE, Lecoq JA, Buice MA, Groblewski PA, Ocker GK, Oliver M, et al. A large-scale standardized physiological survey reveals functional organization of the mouse visual cortex. Nature Neuroscience. 2020;23(1):138–151. doi: 10.1038/s41593-019-0550-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Yamins DL, DiCarlo JJ. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience. 2016;19(3):356–365. doi: 10.1038/nn.4244 [DOI] [PubMed] [Google Scholar]
  • 17. Schrimpf M, Kubilius J, Hong H, Majaj NJ, Rajalingham R, Issa EB, et al. Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv. 2018; p. 407007. [Google Scholar]
  • 18. Conwell C, Buice M, Barbu A, Alvarez G. Model Zoology and Neural Taskonomy for Better Characterizing Mouse Visual Cortex. ICLR Bridging AI and Cognitive Science (BAICS) Workshop. 2020;. [Google Scholar]
  • 19.Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2009. p. 248–255.
  • 20. Zhuang C, Yan S, Nayebi A, Schrimpf M, Frank MC, DiCarlo JJ, et al. Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of Sciences. 2021;118(3). doi: 10.1073/pnas.2014196118 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Harris JA, Mihalas S, Hirokawa KE, Whitesell JD, Choi H, Bernard A, et al. Hierarchical organization of cortical and thalamic connectivity. Nature. 2019;575(7781):195–202. doi: 10.1038/s41586-019-1716-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Siegle JH, Jia X, Durand S, Gale S, Bennett C, Graddis N, et al. Survey of spiking in the mouse visual system reveals functional hierarchy. Nature. 2021; p. 1–7. doi: 10.1038/s41586-020-03171-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Felleman DJ, Van Essen DC. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex. 1991;1(1):1–47. [DOI] [PubMed] [Google Scholar]
  • 24. Gămănuţ R, Shimaoka D. Anatomical and functional connectomes underlying hierarchical visual processing in mouse visual system. Brain Structure and Function. 2021; p. 1–19. [DOI] [PubMed] [Google Scholar]
  • 25. Prusky GT, West PW, Douglas RM. Behavioral assessment of visual acuity in mice and rats. Vision Research. 2000;40(16):2201–2209. doi: 10.1016/S0042-6989(00)00081-X [DOI] [PubMed] [Google Scholar]
  • 26. Kiorpes L. Understanding the development of amblyopia using macaque monkey models. Proceedings of the National Academy of Sciences. 2019;116(52):26217–26223. doi: 10.1073/pnas.1902285116 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Kell AJ, Yamins DL, Shook EN, Norman-Haignere SV, McDermott JH. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron. 2018;98(3):630–644. doi: 10.1016/j.neuron.2018.03.044 [DOI] [PubMed] [Google Scholar]
  • 28. Michaels JA, Schaffelhofer S, Agudelo-Toro A, Scherberger H. A goal-driven modular neural network predicts parietofrontal neural dynamics during grasping. Proceedings of the National Academy of Sciences. 2020;117(50):32124–32135. doi: 10.1073/pnas.2005087117 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Nayebi A, Sagastuy-Brena J, Bear DM, Kar K, Kubilius J, Ganguli S, et al. Recurrent Connections in the Primate Ventral Visual Stream Mediate a Tradeoff Between Task Performance and Network Size During Core Object Recognition. Neural Computation. 2022;34:1652–1675. doi: 10.1162/neco_a_01506 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Kriegeskorte N, Mur M, Bandettini PA. Representational similarity analysis-connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience. 2008;2:4. doi: 10.3389/neuro.06.004.2008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Siegle JH, Ledochowitsch P, Jia X, Millman D, Ocker GK, Caldejon S, et al. Reconciling functional differences in populations of neurons recorded with two-photon imaging and electrophysiology. BioRxiv. 2020;. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Bakhtiari S, Mineault P, Lillicrap T, Pack C, Richards B. The functional specialization of visual cortex emerges from training parallel pathways with self-supervised predictive learning. bioRxiv. 2021;. [Google Scholar]
  • 33. Shi J, Buice MA, Shea-Brown E, Mihalas S, Tripp B. A Convolutional Network Architecture Driven by Mouse Neuroanatomical Data. bioRxiv. 2020;. [Google Scholar]
  • 34. Harris KD, Mihalas S, Shea-Brown E. High resolution neural connectivity from incomplete tracing data using nonnegative spline regression. In: Advances in Neural Information Processing Systems. vol. 29; 2016. [Google Scholar]
  • 35. Knox JE, Harris KD, Graddis N, Whitesell JD, Zeng H, Harris JA, et al. High-resolution data-driven model of the mouse connectome. Network Neuroscience. 2018;3(1):217–236. doi: 10.1162/netn_a_00066 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Shi J, Shea-Brown E, Buice M. Comparison against task driven artificial neural networks reveals functional properties in mouse visual cortex. Advances in Neural Information Processing Systems. 2019;32:5764–5774. [Google Scholar]
  • 37.Oord Avd, Li Y, Vinyals O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:180703748. 2018;.
  • 38. Froudarakis E, Cohen U, Diamantaki M, Walker EY, Reimer J, Berens P, et al. Object manifold geometry across the mouse cortical visual hierarchy. bioRxiv. 2020;. [Google Scholar]
  • 39. Krizhevsky A, Hinton G, et al. Learning multiple layers of features from tiny images. 2009;. [Google Scholar]
  • 40.Zhang Y, Song S, Yumer E, Savva M, Lee JY, Jin H, et al. Physically-based rendering for indoor scene understanding using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 5287–5295.
  • 41. Zhuang C, Kubilius J, Hartmann MJ, Yamins D. Toward goal-driven neural network models for the rodent Whisker-Trigeminal system. Advances in Neural Information Processing Systems. 2017;2017:2556–2566. [Google Scholar]
  • 42. Olshausen BA, Field DJ. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature. 1996;381(6583):607–609. doi: 10.1038/381607a0 [DOI] [PubMed] [Google Scholar]
  • 43. Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology. 1962;160(1):106–154. doi: 10.1113/jphysiol.1962.sp006837 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Gidaris S, Singh P, Komodakis N. Unsupervised Representation Learning by Predicting Image Rotations. In: International Conference on Learning Representations; 2018.
  • 45.Wu Z, Xiong Y, Yu SX, Lin D. Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 3733–3742.
  • 46.Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning; 2020. p. 1597–1607.
  • 47.Chen X, Fan H, Girshick R, He K. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:200304297. 2020;.
  • 48.Chen X, He K. Exploring Simple Siamese Representation Learning. arXiv preprint arXiv:201110566. 2020;.
  • 49.Zbontar J, Jing L, Misra I, LeCun Y, Deny S. Barlow twins: Self-supervised learning via redundancy reduction. In: International Conference on Machine Learning. PMLR; 2021. p. 12310–12320.
  • 50.Bardes A, Ponce J, LeCun Y. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:210504906. 2021;.
  • 51. Nonaka S, Majima K, Aoki SC, Kamitani Y. Brain hierarchy score: Which deep neural networks are hierarchically brain-like? IScience. 2021;24(9). doi: 10.1016/j.isci.2021.103013 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Zhuang J, Ng L, Williams D, Valley M, Li Y, Garrett M, et al. An extended retinotopic map of mouse cortex. elife. 2017;6:e18372. doi: 10.7554/eLife.18372 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Merel J, Aldarondo D, Marshall J, Tassa Y, Wayne G, Ölveczky B. Deep neuroethology of a virtual rodent. International Conference on Learning Representations. 2020;.
  • 54.Lindsay GW, Merel J, Mrsic-Flogel T, Sahani M. Divergent representations of ethological visual inputs emerge from supervised, unsupervised, and reinforcement learning. arXiv preprint arXiv:211202027. 2021;.
  • 55. Hong H, Yamins DL, Majaj NJ, DiCarlo JJ. Explicit information for category-orthogonal object properties increases along the ventral stream. Nature Neuroscience. 2016;19(4):613. doi: 10.1038/nn.4247 [DOI] [PubMed] [Google Scholar]
  • 56. Wang Z, Novikov A, Zolna K, Merel JS, Springenberg JT, Reed SE, et al. Critic regularized regression. Advances in Neural Information Processing Systems. 2020;33:7768–7778. [Google Scholar]
  • 57. Gulcehre C, Wang Z, Novikov A, Paine T, Gómez S, Zolna K, et al. RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning. In: Advances in Neural Information Processing Systems. vol. 33; 2020. p. 7248–7259. [Google Scholar]
  • 58.Cimpoi M, Maji S, Kokkinos I, Mohamed S, Vedaldi A. Describing Textures in the Wild. In: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR); 2014.
  • 59. Nayebi A, Bear DM, Kubilius J, Kar K, Ganguli S, Sussillo D, et al. Task-Driven Convolutional Recurrent Models of the Visual System. Advances in Neural Information Processing Systems. 2018;31:5295–5306. [Google Scholar]
  • 60. Kubilius J, Schrimpf M, Kar K, Rajalingham R, Hong H, Majaj N, et al. Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs. Advances in Neural Information Processing Systems. 2019;32:12805–12816. [Google Scholar]
  • 61.Mrowca D, Zhuang C, Wang E, Haber N, Fei-Fei L, Tenenbaum JB, et al. Flexible neural representation for physics prediction. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems; 2018. p. 8813–8824.
  • 62.Haber N, Mrowca D, Wang S, Fei-Fei L, Yamins DL. Learning to play with intrinsically-motivated, self-aware agents. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems; 2018. p. 8398–8409.
  • 63.Lingelbach M, Mrowca D, Haber N, Fei-Fei L, Yamins DL. Towards curiosity-driven learning of physical dynamics. ICLR Bridging AI and Cognitive Science (BAICS) Workshop. 2020;.
  • 64. Dapello J, Marques T, Schrimpf M, Geiger F, Cox D, DiCarlo JJ. Simulating a primary visual cortex at the front of CNNs improves robustness to image perturbations. Advances in Neural Information Processing Systems. 2020;33:13073–13087.
  • 65. Cao J, Bao C, Hao Q, Cheng Y, Chen C. LPNet: Retina Inspired Neural Network for Object Detection and Recognition. Electronics. 2021;10(22):2883. doi: 10.3390/electronics10222883
  • 66. Steinmetz NA, Aydin C, Lebedeva A, Okun M, Pachitariu M, Bauza M, et al. Neuropixels 2.0: A miniaturized high-density probe for stable, long-term brain recordings. Science. 2021;372(6539). doi: 10.1126/science.abf4588
  • 67. Collins J, Sohl-Dickstein J, Sussillo D. Capacity and trainability in recurrent neural networks. In: ICLR; 2017.
  • 68. Bergstra J, Komer B, Eliasmith C, Yamins D, Cox DD. Hyperopt: a Python library for model selection and hyperparameter optimization. Computational Science & Discovery. 2015;8(1).
  • 69. Ritchie JB, Bracci S, Op de Beeck H. Avoiding illusory effects in representational similarity analysis: What (not) to do with the diagonal. NeuroImage. 2017;148:197–200. doi: 10.1016/j.neuroimage.2016.12.079
  • 70. Bottou L. Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Springer; 2010. p. 177–186.
  • 71.He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2020. p. 9729–9738.
  • 72.You Y, Gitman I, Ginsburg B. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888. 2017;.
  • 73. Barlow HB, et al. Possible principles underlying the transformation of sensory messages. Sensory communication. 1961;1(01):217–233.
  • 74. Rajalingham R, Issa EB, Bashivan P, Kar K, Schmidt K, DiCarlo JJ. Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience. 2018;38(33):7255–7269. doi: 10.1523/JNEUROSCI.0388-18.2018
  • 75. Majaj NJ, Hong H, Solomon EA, DiCarlo JJ. Simple learned weighted sums of inferior temporal neuronal firing rates accurately predict human core object recognition performance. Journal of Neuroscience. 2015;35(39):13402–13418. doi: 10.1523/JNEUROSCI.5181-14.2015
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011506.r001

Decision Letter 0

Daniele Marinazzo, Tim Christian Kietzmann

16 May 2023

Dear Dr. Nayebi,

Thank you very much for submitting your manuscript "Mouse visual cortex as a limited resource system that self-learns an ecologically-general representation" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

The three reviewers have done a great job at providing in-depth constructive feedback (while agreeing on many points raised). The addressing of their comments will certainly improve this submission and we look forward to reading the resubmission. Please note that "All code can be accessed by emailing the corresponding authors." is not an acceptable form of code/data sharing for this journal (see PLOS Data policy). Code and data need to be shared via public repositories.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Tim Christian Kietzmann, Dr. rer. nat.

Academic Editor

PLOS Computational Biology

Daniele Marinazzo

Section Editor

PLOS Computational Biology

***********************

The three reviewers have done a great job at providing in-depth constructive feedback (while agreeing on many points raised). The addressing of their comments will certainly improve this submission and I look forward to reading the resubmission. Please note that "All code can be accessed by emailing the corresponding authors." is not an acceptable form of code/data sharing for this journal (see PLOS Data policy). Code and data need to be shared via public repositories.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Nayebi, Kong et al. compared the performance of models trained with supervised and self-supervised learning in predicting the responses of mouse visual areas to natural stimuli. Their results showed that shallow architectures trained with SSL outperformed other models. In addition, the SSL-trained models showed better out-of-distribution generalization, based on which the authors suggest that the mouse visual cortex can be understood as a shallow, general-purpose visual system. I enjoyed reading this paper, and believe it offers a strong contribution to the comparison of visual systems across different species using deep neural networks. I have a few concerns that I elaborate on below.

Major comments:

1) If the focus of the paper is on evaluating the importance of contrastive methods for predicting the responses of mouse visual areas, it would be important to include comparisons with non-contrastive SSL techniques, such as Barlow Twins, VICReg, and others. Alternatively, if such comparisons are not included, it may be more appropriate to use a different terminology rather than 'contrastive' for SSL models.

2) Regarding Section 2.1, I found it unclear why the mapping method that resulted in the highest inter-animal similarity (PLS) was chosen as the similarity measure, especially considering that the correlation values across different metrics are not that different. It would be helpful to understand the range of these differences and why a higher value is considered indicative of true similarity. Additionally, it would be interesting to know how the results might differ if a different metric, such as CKA, were used instead. It is important to explicitly explain the differences in results obtained with different similarity measures in the main text.

3) The authors employed the Contrastive ShallowNet architecture, which comprises the first four layers of AlexNet trained with contrastive SSL. I wonder why the authors did not solely use a 4-layer convolutional ANN and train it with contrastive SSL if a shallow architecture is sufficient for the purpose of explaining the responses of mouse visual areas.

4) I find the argument presented in Figure 3 for the role of depth in predicting mouse visual cortex responses unconvincing, as there are several confounding factors that make the interpretation difficult. E.g., some models listed in fig 3 are based on the AlexNet architecture, while others use the ResNet architecture (also, some architectures other than AlexNet are listed in fig 1 despite the claim that only AlexNet was used in the experiments). Additionally, some models were trained with contrastive objectives, while others were trained with supervised objectives. Even within the contrastive methods, there are different contrastive objectives. It would be more compelling if the effect of architecture depth were demonstrated by controlling for other factors, such as the objective function.

5) The reward-based navigation task presented in fig 6A requires further explanation. What is the goal of the task? What kind of visual features are informative in solving the task effectively? It is hard to conceive how a model trained on static images would be useful for a dynamic behavior like navigation. Moreover, to better understand the differences between SSL and supervised models in relation to this task, it would be helpful to compare them with a model directly trained with RL on the task in terms of the mean episode return.

6) I found the purpose of the experiments with the virtual rat unclear. If similar experiments were conducted using a virtual monkey simulation, would we expect to observe similar results? How would the interpretations differ? Is there anything specific in the rat's movement repertoire or visual experience that necessitates the use of representations learned with SSL?

7) I noticed that the results of the Neuropixel and calcium imaging data are not entirely consistent. Specifically, the differences between the performance of the best SSL and supervised models versus autoencoding models do not appear to be significant with calcium imaging based on fig S3. Could the authors provide any explanation for these discrepancies? I am curious whether the deconvolved spike timings or calcium traces were used in these experiments.

8) Could the intrinsic dimensionality of the learned representations account for the transferability of the SSL model?

Minor comments:

1) What are the error bars when showing models’ predictivity difference (e.g. fig 1A)? I suggest that the authors provide more clarity in reporting the statistical differences between the various models.

2) The color scheme needs to be added to fig 3.

3) The details of the mouse visual cortex data, stimuli, etc. need to be included in the results.

Reviewer #2: This work identifies several important features that make a vision model a better fit for mouse versus primate vision. These features reflect known differences between the two systems such as architecture depth and image resolution. My concerns are mainly around clarity and presentation, with some requests for extra analyses.

Clarity:

Given that aspects of interanimal consistency as a metric are discussed throughout, I think it would make sense to include fig S1 or some variant of it in the main text in order to clarify the metrics and to motivate the focus on Neuropixels over calcium imaging.

I found the choice to refer to supervised VGG-16 as the "prior primate model" confusing, as many different models have been used to predict neural activity in the primate, and giving it this title suggested to me it was somehow a better fit to primate data than a standard VGG-16. I'm fine with the authors explaining this choice as the baseline due to its common use as a primate model, but I think in the figures and discussion of results it should just be labeled as supervised VGG-16.

In section 2.2, there was not an explicit enough distinction between image set and objective. For example, ImageNet could be considered "human-centric" by the nature of the images themselves, not just the fact that category labels are applied to them, and this aspect of the data would impact both supervised and self-supervised methods. The idea of using fewer categories is also conflated with the type of images (CIFAR-10 is not just fewer categories, it is smaller and lower quality images as well). I would like the description of the models in this section to acknowledge when different image sets are used in addition to different training objectives.

I did not find Figure 3 very valuable. The depth of the overall network is not the same as the number of LNL operations that occur before the units responsible for the neural prediction (as 1B shows much of the predictive power is coming from early layers). It seems the latter should be used if the goal is to show that not many LNLs are needed to predict mouse visual responses.

Relatedly, I am confused as to why the authors choose to refer to their best-performing model as "Contrastive ShallowNet" versus simply Contrastive AlexNet. The architecture of the trained model is AlexNet. The fact that the first four layers are most predictive of mouse responses is useful to know, but it does not make the model as it was built any shallower. I imagine many of the other networks also rely primarily on earlier layers for their predictions, but they are not referred to as a separate model as a result of this. If the authors showed that a model trained with a shallower architecture was better, then I would understand, but as it stands now this name change is confusing and is not applied evenly across all models.

In figure 6, why are not all models evaluated on all tasks? At the very least, it is important to include the reward performance of the 'supervised' maze model in Fig 6A in order to contextualize these numbers. Relatedly, I'd like to see the authors discuss how these results contrast with Lindsay et al., which shows a somewhat better match to mouse data with an RL-trained model over supervised/unsupervised methods (knowing how well the RL-trained visual system here works may be important for this comparison).

The authors state "as we find that higher model areas best support these scene understanding transfer tasks". Was this shown somewhere?

Fig 4 inset. Are those all supervised? (they are all black)

"We believe that future work could investigate other appropriate low-pass filters and ecologically-relevant pixel-level transformations to apply to the original image or video stream" - The authors may want to cite some existing work in this direction such as https://www.mdpi.com/2079-9292/10/22/2883 and https://www.biorxiv.org/content/10.1101/2020.06.16.154542v2

I assume the authors will be publishing their code upon publication of the paper (as per PlosCB policy) but right now it says it is only available via email.

Additional analyses:

Looking at the hierarchy score (e.g. https://www.sciencedirect.com/science/article/pii/S2589004221009810), based on the rough hierarchical categorization the authors find in S1B, would be a helpful additional analysis that may reveal other trends in the impact of architecture and training. From what has already been plotted it seems possible that self-supervised learning better recapitulates the visual hierarchy as well.

Why is there no self-supervised MouseNet? (also the plotting of MouseNet performance in Figure 1 has some other bar under it)

Reviewer #3: # Summary Of Contributions

The paper presents a computational model of the mouse visual cortex and finds that a shallow network with a low-resolution input is optimal for modeling mouse visual cortex. The authors find that models trained with self-supervised contrastive objectives are better matches to mouse cortex than models trained on supervised objectives or non-contrastive self-supervised methods. They show that the self-supervised, contrastive objective builds a general-purpose visual representation that transfers better to out-of-distribution visual scene understanding and reward-based navigation tasks.

# Strengths And Weaknesses

## Strengths

+ Paper is overall well organized

+ Convincing evidence presented for the main claims

+ Provides a nice comparative view on mouse vs. monkey visual system

## Weaknesses

- There may be an issue with image size and resolution

- Writing and presentation of results can be improved to make the paper more accessible

# Detailed Comments

The paper presents a thoroughly executed study whose main claims are well supported by the evidence presented. I am generally very supportive of the paper and have only two main concerns which I consider crucial to be addressed in a revision:

1. You make a claim about 64 px being the optimal image size for training. This sounds reasonable given the visual acuity and field of view of the mouse you cite in section 2.3. However, the argument cites a visual field of 60–90°, but the images spanned 120° in the experiments (according to the technical white paper on the dataset). Thus, the image is highly distorted, the resolution depends a lot on the eccentricity and I am not quite sure if the math actually works out. It would be great to look at this a bit more carefully.

2. While the paper is overall very well organized, I found the results more difficult to read and digest than necessary and the methods sometimes incomplete. The following examples illustrate my issues, and I would highly encourage the authors to revise sections 2, 3 and 5 from the perspective of a reader who is not already familiar with the work and all the details.

- Section 2.1 is quite confusing. It's not clear to me what this section wants to accomplish. The claim "We identified the best-performing mapping function by assessing a variety of functions to map responses from one animal to those of another" is unclear to me. Why do you find the best-performing mapping function that way? Why would one want to map responses from one animal to another? If this is so central that you start discussing it as the first point after the intro, why don't you show any data supporting your claim (only something hidden in Fig. S1, but I couldn't figure out what exactly it shows and how it supports the claim).

- The text sometimes refers to the "prior primate model" (e.g. Section 2.2, first paragraph), but I don't think this concept has been clearly introduced. I can only guess "prior primate model" means VGG-16?

- The results of Fig. 1A could be presented in a more systematic way. Rather than just sorting by performance and some rough grouping, I think it would be better to systematically investigate individual factors while keeping all others constant. For instance, the best-performing model is ShallowNet trained with IR. What happens when you replace IR with other objectives? I found a few other contrastive objectives, but where is ImageNet-supervised ShallowNet? Is that the same as supervised AlexNet? If so, why do you call them by different names? Similarly, keeping IR as the objective, how does architecture affect performance? I think in this case the relevant data points are shown in Fig. 1A, but it's quite tedious to find them.

- The discussion of objectives in 2.2 is nice, but again I found it difficult to map to Fig. 1A. Categorization with 10 classes (Krizhevsky 2009) probably refers to CIFAR-10, depth prediction (Zhang 2017) probably to the orange bars, sparse autoencoding (Olshausen and Field 1996) probably pink. Why don't you state these things? What are the single-, two- and six-stream architectures and how do they relate to Olshausen and Field 1996 (which doesn't actually use a CNN or even an explicit encoder)?

- It's not quite clear to me whether "Contrastive ShallowNet" refers to just taking the first four layers of an entire AlexNet trained using IR or whether you take only the first four layers of AlexNet and add a global average pooling + fully connected layer in order to train IR. Please make the description a bit more explicit in Section 2.2 and the caption of Fig. 1.

- Related to the previous point: I couldn't find a description in the methods of how the StreamNet architecture (and ShallowNet) get from (3x3) feature maps to a single latent vector that is needed for, e.g., the contrastive losses or the classification objective. Global average pooling over space? Followed by a fully-connected layer? The sentence "Finally, the outputs of each parallel branch would be summed together, concatenated across the channels dimension, and used as input for the readout module" is very ambiguous. Summed across which dimension? What's the "readout module"?

- The StreamNet architecture is first mentioned in section 2.3, and was not at all clear to me from this section. I had to search for it in the methods to understand the architecture (somewhat). I'm still confused about what you mean by "StreamNet incorporates dense skip connections" - where are these skip connections? They are not mentioned in the methods section 5.4.

- It didn't become clear to me why you can train AlexNet only on 64 px and upwards, but Contrastive StreamNet can be trained on 32 px. You could simply remove, e.g., the last max pooling layer.

## Other Minor Issues:

- Abstract: second sentence appears broken: "However, an overall understanding of the mouse’s visual cortex, and how it supports a range [OF?] behaviors, remains unknown."

- The claim "We also attain neural predicitivity improvements over prior work...": Where is the data supporting this claim? The sentence references five papers, but I can only find MouseNet in Fig. 1. If some of the models in Fig. 1 map onto the references in this sentence, please state it explicitly somewhere so your readers can make the link.

- Fig. 1 caption: "All models are trained on 64 px inputs unless otherwise stated in figure." It's nowhere stated otherwise, only explicitly 64 px for VGG-16. Does this mean all models were trained on 64 px, or did you forget to state it somewhere? CIFAR-10 has 32x32 images. Does that mean you're upsampling them to 64?

- You repeatedly make the point that the categories of ImageNet are human-centric and not relevant for mice. But if that's the case, why would ImageNet be a good approach for monkeys, for whom ImageNet categories should be equally irrelevant?

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No: The models and the analysis codes are not released as supporting information or deposited to a public repository. It is only mentioned that "All code can be accessed by emailing the corresponding authors."

Reviewer #2: No: The code is listed as available by email

Reviewer #3: No: Statement on code availability reads "All code can be accessed by emailing the corresponding authors." I do not consider that publicly available, as it is not deposited in a publicly accessible repository.

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011506.r003

Decision Letter 1

Daniele Marinazzo, Tim Christian Kietzmann

11 Sep 2023

Dear Dr. Nayebi,

We are pleased to inform you that your manuscript 'Mouse visual cortex as a limited resource system that self-learns an ecologically-general representation' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Moreover, we would like to ask you to include the hierarchy score results, as requested by reviewer 2, in the supplement of the paper.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Tim Christian Kietzmann, Dr. rer. nat.

Academic Editor

PLOS Computational Biology

Daniele Marinazzo

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I thank the authors for their thorough responses to my comments! All my concerns are addressed. Based on the revisions made, I recommend this manuscript for publication.

Reviewer #2: I am mostly content with the changes, but I would strongly suggest the authors include the hierarchy score results at least in the supplemental material. Capturing the hierarchical relationship is important for use of these networks as *mechanistic models* of visual cortex rather than just predictors of neural activity and it is therefore good to know which models perform well on this metric.

Reviewer #3: I would like to thank the authors for the thorough revision. They have addressed my comments.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011506.r004

Acceptance letter

Daniele Marinazzo, Tim Christian Kietzmann

25 Sep 2023

PCOMPBIOL-D-23-00393R1

Mouse visual cortex as a limited resource system that self-learns an ecologically-general representation

Dear Dr Nayebi,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Timea Kemeri-Szekernyes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. StreamNet architecture schematic.

    The first four convolutional layers of AlexNet best corresponded to all the mouse visual areas. These convolutional layers were used as the basis for our StreamNet architecture variants. The number of parallel streams, N, was varied to be one (single-stream), two (dual-stream), or six (six-stream).

    (TIF)

    S2 Fig. Shallow architectures trained with contrastive objective functions yield the best matches to the neural data (calcium imaging dataset).

    As in Fig 2, but for the calcium imaging dataset. A. The median and s.e.m. neural predictivity, using PLS regression, across neurons in all mouse visual areas except RL. N = 16228 units in total (RL is excluded, as mentioned in the “Neural Response Datasets” section). Actual neural predictivity performance can be found in Table 2. The “Primate Model Baseline” denotes a supervised VGG16 trained on 224 px inputs, used in prior work [14, 15, 36]. All models, except for the “Primate Model Baseline”, are trained on 64 px inputs. B. Each model’s performance on ImageNet is plotted against its median neural predictivity across all units from each visual area. Inset. Primate ventral visual stream neural predictivity from BrainScore is correlated with ImageNet categorization accuracy (adapted from Schrimpf et al. [17]). All ImageNet performance numbers can be found in Table 2. Color scheme as in A and Fig 2A.

    (TIF)
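
    For readers who want to see how a PLS-based neural predictivity score of the kind reported in this figure can be computed, the sketch below outlines one plausible implementation. It is not the authors' exact pipeline: the number of PLS components, the 10-fold cross-validation scheme, and the absence of noise correction are illustrative assumptions, and the function name is hypothetical.

    ```python
    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import KFold

    def pls_neural_predictivity(model_features, neural_responses,
                                n_components=25, n_splits=10):
        """Median held-out correlation between PLS-predicted and measured responses.

        model_features:   (n_stimuli, n_features) activations from one model layer.
        neural_responses: (n_stimuli, n_neurons) trial-averaged responses.
        n_components is an illustrative choice, not the value used in the paper.
        """
        per_fold = np.zeros((n_splits, neural_responses.shape[1]))
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for i, (train, test) in enumerate(kf.split(model_features)):
            pls = PLSRegression(n_components=n_components)
            pls.fit(model_features[train], neural_responses[train])
            pred = pls.predict(model_features[test])
            # Per-neuron Pearson correlation on the held-out stimuli.
            per_fold[i] = [pearsonr(pred[:, j], neural_responses[test, j])[0]
                           for j in range(neural_responses.shape[1])]
        # Average over folds, then take the median across neurons.
        return np.median(per_fold.mean(axis=0))
    ```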

    S3 Fig. Shallow architectures trained with contrastive objective functions yield the best matches to the neural data (RSA).

    A. The median and s.e.m. noise-corrected neural predictivity, using RSA, across N = 39 and N = 90 animals for the Neuropixels and calcium imaging datasets, respectively (across all visual areas, with RL excluded for the calcium imaging dataset, as mentioned in the “Neural Response Datasets” section). The “Primate Model Baseline” denotes a supervised VGG16 trained on 224 px inputs, used in prior work [14, 15, 36]. All models, except for the “Primate Model Baseline”, are trained on 64 px inputs. B. We plot each model’s performance on ImageNet against its median neural predictivity, using RSA, across visual areas. All ImageNet performance numbers can be found in Table 2. Color scheme as in A and Fig 2A.

    (TIF)
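
    As a rough illustration of the RSA-based comparison used in this figure, the sketch below builds correlation-distance representational dissimilarity matrices (RDMs) for a model layer and a visual area and compares them with a Spearman correlation. The noise correction applied in the paper is omitted here, and the function names are placeholders.

    ```python
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    def rdm(responses):
        """Condensed correlation-distance RDM; responses is (n_stimuli, n_units)."""
        return pdist(responses, metric="correlation")

    def rsa_score(model_features, neural_responses):
        """Spearman correlation between the model and neural RDMs (uncorrected)."""
        rho, _ = spearmanr(rdm(model_features), rdm(neural_responses))
        return rho
    ```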

    S4 Fig. Structural and functional factors leading to improved neural response predictivity (calcium imaging dataset).

    A. As in Fig 3, our dual-stream variant (red) and Contrastive AlexNet (brown) were trained using lower-resolution ImageNet images and in a contrastive manner. Each image was downsampled from 224 × 224 pixels, the image size typically used to train primate ventral stream models, to various image sizes. Training models on resolutions lower than 224 × 224 pixels generally led to improved neural predictivity. The median and s.e.m. across neurons in each visual area is reported. As mentioned in the “Neural Response Datasets” section, visual area RL was removed from the calcium imaging neural predictivity results. Refer to Table 1 for N units per visual area. B. As in Fig 2B, AlexNet was either untrained, trained in a supervised manner (ImageNet), or trained in a self-supervised manner (instance recognition). We observe that the first four convolutional layers provide the best fits to the neural responses for all the visual areas, while the latter three layers are not very predictive for any visual area. As mentioned in the “Neural Response Datasets” section, visual area RL was removed from the calcium imaging neural predictivity results.

    (TIF)

    S5 Fig. Inter-animal consistency as a function of the number of units in the datasets for different mapping transforms.

    Inter-animal consistency was extrapolated across the number of units in each dataset using a log-linear function (f(n) = a log10(n) + b, where a and b are fit as parameters via least squares, and n is the sample size factor). This analysis reveals that inter-animal consistency of the Neuropixels dataset approaches 1.0 more rapidly than it does for the calcium imaging dataset. Inter-animal consistency evaluated at a sample size factor of one indicates the consistency when all the existing units in the datasets are used (i.e., inter-animal consistency values reported in Fig 1A).

    (TIF)
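
    The log-linear extrapolation described in this caption amounts to an ordinary least-squares fit of f(n) = a*log10(n) + b to the measured inter-animal consistencies. The sketch below shows one way to do this with NumPy; the sample-size factors and consistency values in the example are made up for illustration and are not the values in the figure.

    ```python
    import numpy as np

    def fit_log_linear(sample_size_factors, consistencies):
        """Least-squares fit of f(n) = a*log10(n) + b."""
        x = np.log10(np.asarray(sample_size_factors, dtype=float))
        A = np.column_stack([x, np.ones_like(x)])
        (a, b), *_ = np.linalg.lstsq(A, np.asarray(consistencies, dtype=float), rcond=None)
        return a, b

    def extrapolate(a, b, n):
        """Predicted inter-animal consistency at sample size factor n."""
        return a * np.log10(n) + b

    # Hypothetical consistency values at increasing subsample fractions of the units:
    a, b = fit_log_linear([0.25, 0.5, 0.75, 1.0], [0.55, 0.62, 0.66, 0.69])
    print(extrapolate(a, b, 4.0))  # projected consistency at 4x the current unit count
    ```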

    S6 Fig. Inter-animal consistency can increase with more stimuli.

    A. Inter-animal consistency under PLS regression evaluated on the train set (left bars for each visual area) and test set (right bars for each visual area), for both Neuropixels and calcium imaging datasets. The horizontal lines are the internal consistency (split-half reliability). B. Inter-animal consistency under PLS regression on the train set (dotted lines) and test set (solid lines), aggregated across visual areas. Each dot corresponds to the inter-animal consistency evaluated across 10 train-test splits, where each split is a sample of the natural scene image set corresponding to the percentage (x-axis). Note that RL is excluded for calcium imaging, as explained in the text (the “Neural Response Datasets” section). The median and s.e.m. across neurons is reported for both panels. Refer to Table 1 for N units per visual area.

    (TIF)

    S7 Fig. Data augmentations alone do not lead to improved neural predictivity.

    Here we compare the neural predictivity of AlexNet trained in three different ways. Contrastive ImageNet is an AlexNet trained using instance recognition on ImageNet with augmentations that are part of the contrastive algorithm (random crop, random color jitter, random grayscale, random horizontal flip). Supervised ImageNet is an AlexNet trained on ImageNet in a supervised manner with a smaller set of augmentations (random crop and random horizontal flip). Supervised ImageNet (contrastive augmentations) is an AlexNet trained on ImageNet in a supervised manner with the augmentations used in the instance recognition algorithm. This control model allows us to ascertain whether the improved neural predictivity of the Contrastive ImageNet model (red) is due to the contrastive loss function itself or due to the larger set of image augmentations used during model training. In both neural response datasets, we can conclude that data augmentations alone do not contribute to improved correspondence with the mouse visual areas.

    (TIF)
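
    To make the augmentation comparison in this caption concrete, the sketch below shows plausible torchvision pipelines for the "contrastive" and "smaller" augmentation sets. The crop size, jitter strengths, and probabilities are assumptions and may differ from the values the authors actually used.

    ```python
    from torchvision import transforms

    # Augmentations of the kind used by instance recognition (and by the
    # "Supervised ImageNet (contrastive augmentations)" control model).
    contrastive_augmentations = transforms.Compose([
        transforms.RandomResizedCrop(64),  # models in this work see 64 px inputs
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

    # Smaller augmentation set for standard supervised ImageNet training.
    supervised_augmentations = transforms.Compose([
        transforms.RandomResizedCrop(64),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    ```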

    S8 Fig. Out-of-distribution task performance across model layers.

    For the models trained on images from the maze environment, we plot their transfer performance on a set of out-of-distribution tasks (described in lower right panel of Fig 6C) across model layers. We find that intermediate model areas are better able to perform the transfer tasks and that the model layers that attain peak performance on the tasks correspond to those that best predict neural responses in the intermediate/higher mouse visual areas (see Fig 2B).

    (TIF)

    S9 Fig. Neural predictivity on the Neuropixels dataset while fixing architecture (left) and objective function (right).

    The data in the left panel shows how neural predictivity varies when the architecture is fixed to be AlexNet, but the objective function is varied between supervised (object categorization) and self-supervised, contrastive objectives. The data in the right panel shows how neural predictivity varies when the objective function is fixed to be instance recognition, but the architecture is varied, including StreamNets, MouseNets, VGG16, ResNets, and AlexNet.

    (TIF)

    S10 Fig. Brain hierarchy score.

    The brain hierarchy score metric of Nonaka et al. [51] was computed for a set of feedforward CNNs. Self-supervised, contrastive models (and shallower models) have a higher brain hierarchy score, computed using the mapping from model features to electrophysiological responses.

    (TIF)

    Attachment

    Submitted filename: Mouse_Vision_PLoS_Response.pdf

    Data Availability Statement

    All relevant neural data is accessible at: https://github.com/AllenInstitute/AllenSDK. All source code is available at: https://github.com/neuroailab/mouse-vision.

