Abstract
Deep neural network models of sensory systems are often proposed to learn representational transformations with invariances like those in the brain. To reveal these invariances, we generated ‘model metamers’, stimuli whose activations within a model stage are matched to those of a natural stimulus. Metamers for state-of-the-art supervised and unsupervised neural network models of vision and audition were often completely unrecognizable to humans when generated from late model stages, suggesting differences between model and human invariances. Targeted model changes improved human recognizability of model metamers but did not eliminate the overall human–model discrepancy. The human recognizability of a model’s metamers was well predicted by their recognizability by other models, suggesting that models contain idiosyncratic invariances in addition to those required by the task. Metamer recognizability dissociated from both traditional brain-based benchmarks and adversarial vulnerability, revealing a distinct failure mode of existing sensory models and providing a complementary benchmark for model assessment.
The authors test artificial neural networks with stimuli whose activations are matched to those of a natural stimulus. These ‘model metamers’ are often unrecognizable to humans, demonstrating a discrepancy between human and model sensory systems.
Main
A central goal of neuroscience is to build models that reproduce brain responses and behavior. The hierarchical nature of biological sensory systems1 has motivated the use of hierarchical neural network models that transform sensory inputs into task-relevant representations2,3. As such models have become the top-performing machine perception systems over the last decade, they have also emerged as the leading models of both the visual and auditory systems4,5.
One hypothesis for why artificial neural network models might replicate computations found in biological sensory systems is that they instantiate invariances that mirror those in such systems6,7. For instance, visual object recognition must often be invariant to pose and to the direction of illumination. Similarly, speech recognition must be invariant to speaker identity and to details of the prosodic contour. Sensory systems are hypothesized to build up invariances8,9 that enable robust recognition. Such invariances plausibly arise in neural network models as a consequence of optimization for recognition tasks or other training objectives.
Although biological and artificial neural networks might be supposed to have similar internal invariances, there are some known human–model discrepancies that suggest that the invariances of the two systems do not perfectly match. For instance, model judgments are often impaired by stimulus manipulations to which human judgments are invariant, such as additive noise10,11 or small translations of the input12,13. Another such discrepancy is the vulnerability to adversarial perturbations (small changes to stimuli that alter model decisions despite being imperceptible to humans14,15). Although these findings illustrate that current task-optimized models lack some of the invariances of human perception, they leave many questions unresolved. For instance, because the established discrepancies rely on only the model’s output decisions, they do not reveal where in the model the discrepancies arise. It also remains unclear whether observed discrepancies are specific to supervised learning procedures that are known to deviate from biological learning. Finally, because we have lacked a general method to assess model invariances in the absence of a specific hypothesis, it remains possible that current models possess many other invariances that humans lack.
Here, we present a general test of whether the invariances present in computational models of the auditory and visual systems are also present in human perception. Rather than target particular known human invariances, we visualize or sonify model invariances by synthesizing stimuli that produce approximately the same activations in a model. We draw inspiration from human perceptual metamers (stimuli that are physically distinct but that are indistinguishable to human observers because they produce the same response at some stage of a sensory system), which have previously been characterized in the domains of color perception16,17, texture18–20, cue combination21, Bayesian decision-making22 and visual crowding23,24. We call the stimuli we generate ‘model metamers’ because they are metameric for a computational model25.
We generated model metamers from a variety of deep neural network models of vision and audition by synthesizing stimuli that yielded the same activations in a model stage as particular natural images or sounds. We then evaluated human recognition of the model metamers. If the model invariances match those of humans, humans should be able to recognize the model metamer as belonging to the same class as the natural signal to which it is matched.
Across both visual and auditory task-optimized neural networks, metamers from late model stages were nearly always misclassified by humans, suggesting that many of their invariances are not present in human sensory systems. The same phenomenon occurred for models trained with unsupervised learning, demonstrating that the model failure is not specific to supervised classifiers. Model metamers could be made more recognizable to humans with selective changes to the training procedure or architecture. However, late-stage model metamers remained much less recognizable than natural stimuli in every model we tested regardless of architecture or training. Some model changes that produced more recognizable metamers did not improve conventional neural prediction metrics or evaluations of robustness, demonstrating that the metamer test provides a complementary tool to guide model improvements. Notably, the human recognizability of a model’s metamers was well predicted by other models’ recognition of the same metamers, suggesting that the discrepancy with humans lies in idiosyncratic model-specific invariances. Model metamers demonstrate a qualitative gap between current models of sensory systems and their biological counterparts and provide a benchmark for future model evaluation.
Results
General procedure
The goal of our metamer generation procedure (Fig. 1a) was to generate stimuli that produce nearly identical activations at some stage within a model but that were otherwise unconstrained and thus could differ in ways to which the model was invariant. We first measured the activations evoked by a natural image or sound at a particular model stage. The metamer for the natural image or sound was then initialized as a white noise signal (either an image or a sound waveform; white noise was chosen to sample the metamers as broadly as possible subject to the model constraints without biasing the initialization toward a specific object class). The noise signal was then modified to minimize the difference between its activations at the model stage of interest and those for the natural signal to which it was matched. The optimization procedure performed gradient descent on the input, iteratively updating the input while holding the model parameters fixed. Model metamers can be generated in this way for any model stage constructed from differentiable operations. Because the models that we considered are hierarchical, if the image or sound was matched with high fidelity at a particular stage, all subsequent stages were also matched (including the final classification stage in the case of supervised models, yielding the same decision).
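As a concrete illustration, the following PyTorch sketch implements the core of this procedure under simplifying assumptions (a single `model_stage` callable returning the activations to be matched, a plain SGD optimizer and a fixed number of steps); the step-size schedule and convergence criteria actually used are described in the Methods.

```python
# Minimal sketch of metamer generation, assuming `model_stage` is a differentiable
# callable that returns the activations of the matched stage.
import torch

def generate_metamer(model_stage, reference, n_steps=1000, lr=1.0):
    """Synthesize an input whose activations at model_stage match those of `reference`."""
    target = model_stage(reference).detach()             # activations to be matched (fixed)
    metamer = torch.randn_like(reference) * 0.05 + 0.5   # white-noise initialization
    metamer.requires_grad_(True)
    optimizer = torch.optim.SGD([metamer], lr=lr)        # optimize the input; model weights stay fixed
    for _ in range(n_steps):
        optimizer.zero_grad()
        activations = model_stage(metamer)
        # normalized squared error between metamer and reference activations
        loss = (activations - target).pow(2).sum() / target.pow(2).sum()
        loss.backward()
        optimizer.step()
    return metamer.detach()
```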
Experimental logic
The logic of our approach can be related to four sets of stimuli. For a given ‘reference’ stimulus, there is a set of stimuli for which humans produce the same classification judgment as the reference (Fig. 1b). A subset of these are stimuli that are indistinguishable from the reference stimulus (that is, metameric) to human observers. If a model performs a classification task, it will also have a set of stimuli judged to be the same category as the reference stimulus, and a subset of these stimuli will produce the same activations at a given model stage (model metamers). Even if the model does not perform classification, it could instantiate invariances that define sets of model metamers for the reference stimulus at each model stage.
In our experiments, we generate stimuli (sounds or images) that are metameric to a model and present these stimuli to humans performing a classification task (Fig. 1c). Because we have access to the internal representations of the model, we can generate metamers for each model stage (Fig. 1d). In many models there is limited invariance in the early stages (as is believed to be true of early stages of biological sensory systems9), with model metamers closely approximating the stimulus from which they are generated (Fig. 1d, left). But successive stages of a model may build up invariance, producing successively larger sets of model metamers. In a feedforward model, if two distinct inputs map onto the same representation at a given model stage, then any differences in the inputs cannot be recovered in subsequent stages, such that invariance cannot decrease from one stage to the next. If a model replicates a human sensory system, every model metamer from each stage should also be classified as the reference class by human observers (Fig. 1d, top). Such a result does not imply that all human invariances will be shared by the model, but it is a necessary condition for a model to replicate human invariances.
Discrepancies in human and model invariances could result in model metamers that are not recognizable by human observers (Fig. 1d, bottom). The model stage at which this occurs could provide insight into where any discrepancies with humans arise within the model.
Our approach differs from classical work on metamers17 in that we do not directly assess whether model metamers are also metamers for human observers (that is, indistinguishable). The reason for this is that a human judgment of whether two stimuli are the same or different could rely on any representations within their sensory system that distinguish the stimuli (rather than just those that are relevant to a particular behavior). By contrast, most current neural network models of sensory systems are trained to perform a single behavioral task. As a result, we do not expect metamers of such models to be fully indistinguishable to a human, and the classical metamer test is likely to be too sensitive for our purposes. Models might fail the classical test even if they capture human invariances for a particular task. But if a model succeeds in reproducing human invariances for a task, its metamers should produce the same human behavioral judgment on that task because they should be indistinguishable to the human representations that mediate the judgment. We thus use recognition judgments as the behavioral assay of whether model metamers reflect the same invariances that are instantiated in an associated human sensory system. We note that if humans cannot recognize a model metamer, they would also be able to discriminate it from the reference stimulus, and the model would also fail a traditional metamerism test.
We sought to answer several questions. First, we asked whether the learned invariances of commonly used neural network models are shared by human sensory systems. Second, we asked where any discrepancies with human perception arise within models. Third, we asked whether any discrepancies between model and human invariances would also be present in models obtained without supervised learning. Fourth, we explored whether model modifications intended to improve robustness would also make model metamers more recognizable to humans. Fifth, we asked whether metamer recognition identifies model discrepancies that are not evident using other methods of model assessment, such as brain predictions or adversarial vulnerability. Sixth, we asked whether metamers are shared across models.
Metamer optimization
Because metamer generation relies on an iterative optimization procedure, it was important to measure optimization success. We considered the procedure to have succeeded only if it satisfied two conditions. First, measures of the match between the activations for the natural reference stimulus and its model metamer at the matched stage had to be much higher than would be expected by chance, as quantified with a null distribution (Fig. 1e) measured between randomly chosen pairs of examples from the training dataset. This criterion was adopted in part because it is equally applicable to models that do not perform a task. Metamers had to pass this criterion for each of three different measures of the match (Pearson and Spearman correlations and signal-to-noise ratio (SNR) expressed in decibels (dB); Methods). Second, for models that performed a classification task, the metamer had to result in the same classification decision by the model as the reference stimulus. In practice, we trained linear classifiers on top of all unsupervised models, such that we were also able to apply this second criterion for them (to be conservative).
Example distributions of the match fidelity (using Spearman’s ρ in this example) are shown in Fig. 1e. Activations of the matched model stage have a correlation close to 1, as intended, and are well outside the null distribution for random pairs of training examples. As expected, given the feedforward nature of the model, matching at an early stage produces matched activations in a late stage (Fig. 1e). But because the models we consider build up invariances over a series of feedforward stages, stages earlier than the matched stage need not have the same activations and in general these differ from those for the original stimulus to which the metamer was matched (Fig. 1e). The match fidelity of this example was typical, and optimization summaries for each analyzed model are included at https://github.com/jenellefeather/model_metamers_pytorch.
Metamers of standard visual deep neural networks
We generated metamers for multiple stages of five standard visual neural networks trained to recognize objects26–29 (trained on the ImageNet1K dataset30; Fig. 2a). The five models spanned a range of architectural building blocks and depths. Such models have been posited to capture features similar to those of primate visual representations, and, at the time the experiments were run, the five models placed 1st, 2nd, 4th, 11th and 59th on a neural prediction benchmark26,31. We subsequently ran a second experiment on an additional five models pretrained on larger datasets that became available at later stages of the project32–34. To evaluate human recognition of the model metamers, humans performed a 16-way categorization task on the natural stimuli and model metamers (Fig. 2b)10.
Contrary to the idea that the trained neural networks learned human-like invariances, human recognition of the model metamers decreased across model stages, reaching near-chance performance at the latest stages even though the model metamers remained as recognizable to the models as the corresponding natural stimuli, as intended (Fig. 2c,d). This reduction in human recognizability was evident as a main effect of observer and an interaction between the metamer generation stage and the observer, both of which were statistically significant for each of the ten models (P < 0.0001 in all cases; Supplementary Table 1).
From visual inspection, many of the metamers from late stages resemble noise rather than natural images (Fig. 2e and see Extended Data Fig. 1a for metamers generated from different noise initializations). Moreover, analysis of confusion matrices revealed that for the late model stages, there was no detectably reliable structure in participant responses (Extended Data Fig. 2). Although the specific optimization strategies we used had some effect on the subjective appearance of model metamers, human recognition remained poor regardless of the optimization procedure (Supplementary Fig. 1). The poor recognizability of late-stage metamers was also not explained by less successful optimization; the activation matches achieved by the optimization were generally good (for example, with correlations close to 1), and what variation we did observe was not predictive of metamer recognizability (Extended Data Fig. 3).
Metamers of standard auditory deep neural networks
We performed an analogous experiment with two auditory neural networks trained to recognize speech (the word recognition task in the Word–Speaker–Noise dataset25). Each model consisted of a biologically inspired ‘cochleagram’ representation35,36, followed by a convolutional neural network (CNN) whose parameters were optimized during training. We tested two model architectures: a ResNet50 architecture (henceforth referred to as CochResNet50) and a convolutional model with nine stages similar to that used in a previous publication4 (henceforth referred to as CochCNN9). Model metamers were generated for clean speech examples from the validation set. Humans performed a 793-way classification task4 to identify the word in the middle of the stimulus (Fig. 2f).
As with the visual models, human recognition of auditory model metamers decreased markedly at late model stages for both architectures (Fig. 2g), yielding a significant main effect of human versus model observer and an interaction between the model stage and the observer (P < 0.0001 for each comparison; Supplementary Table 1). Subjectively, the model metamers from later stages sound like noise (and appear noise-like when visualized as cochleagrams; Fig. 2h). This result suggests that many of the invariances present in these models are not invariances for the human auditory system.
Overall, these results demonstrate that the invariances of many common visual and auditory neural networks are substantially misaligned with those of human perception, even though these models are currently the best predictors of brain responses in each modality.
Unsupervised models also exhibit discrepant metamers
Biological systems typically do not have access to labels at the scale that is needed for supervised learning37 and instead must rely in large part on unsupervised learning. Do the divergent invariances evident in neural network models result in some way from supervised training with explicit category labels? Metamers are well suited to address this question because their generation does not depend on a classifier; they can therefore be generated for any sensory model.
At present, the leading unsupervised models are ‘self-supervised’, being trained with a loss function favoring representations in which variants of a single training example (different crops of an image, for instance) are similar, whereas those from different training examples are not38 (Fig. 3a). We generated model metamers for four such models38–41 along with supervised comparison models with the same architectures.
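The contrastive objective described above can be sketched as follows: a minimal SimCLR-style loss in which two augmented views of the same image are pulled together in embedding space while other images in the batch are pushed apart. The temperature and normalization choices here are illustrative rather than those of any specific model we tested.

```python
# Sketch of a contrastive ("self-supervised") loss over paired embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: [batch, dim] embeddings of two augmented views of the same images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                    # [2B, dim]
    sim = z @ z.t() / temperature                     # pairwise cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))        # exclude self-similarity
    # the positive for row i is its other augmentation: i + n (first half) or i - n (second half)
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```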
As shown in Fig. 3b–d, the self-supervised models produced results similar to those for supervised models. Human recognition of model metamers declined at late model stages, approaching chance levels for the final stages. Some of the models had more recognizable metamers at intermediate stages (significant interaction between model type and model stage; ResNet50 models: F(21,420) = 16.0, P < 0.0001; IPCL model: F(9,198) = 3.13, P = 0.0018). However, for both architectures, recognition was low in absolute terms, with the metamers bearing little resemblance to the original image they were matched to. Overall, the results suggest that the failure of standard neural network models to pass our metamer test is not specific to the supervised training procedure. This result also demonstrates the generality of the metamer method, as it can be applied to models that do not have a behavioral readout. Analogous results with two classical sensory system models (HMAX3,8 and a spectrotemporal modulation filterbank42), which further illustrate the general applicability of the method, are shown in Extended Data Figs. 4 and 5.
Discrepant metamers are not explained by texture bias
Another commonly noted discrepancy between current models and humans is the tendency for models to base their judgments on texture rather than shape43–45. This ‘texture bias’ can be reduced with training datasets of ‘stylized’ images (Fig. 3e) that increase a model’s reliance on shape cues, making models more human-like in this respect43. To assess whether these changes also serve to make model metamers less discrepant, we generated metamers from two models trained on Stylized ImageNet. As shown in Fig. 3f, these models had metamers that were as unrecognizable to humans as those from models trained on the standard ImageNet1K training set (no interaction between model type and model stage; ResNet50: F(7,140) = 0.225, P = 0.979; AlexNet: F(8,160) = 0.949, P = 0.487). This result suggests that metamer discrepancies are not simply due to texture bias in the models.
Effects of adversarial training on visual model metamers
A known peculiarity of contemporary artificial neural networks is their vulnerability to small adversarial perturbations designed to change the class label predicted by a model14,15. Such perturbations are typically imperceptible to humans due to their small magnitude but can drastically alter model decisions and have been the subject of intense interest in part due to the security risk they pose for machine systems. One way to reduce this vulnerability is via ‘adversarial training’ in which adversarial perturbations are generated during training, and the model is forced to learn to recognize the perturbed images as the ‘correct’ human-interpretable class46 (Fig. 4a). This adversarial training procedure yields models that are less susceptible to adversarial examples for reasons that remain debated47.
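The general form of such adversarial training is sketched below, using a projected-gradient-style attack under an L∞ budget. The perturbation size, step size and number of attack steps are placeholders rather than the values used for the models we evaluated.

```python
# Sketch of adversarial training: generate a worst-case perturbation within a
# budget, then train the model to classify the perturbed input correctly.
import torch
import torch.nn.functional as F

def pgd_perturbation(model, x, y, eps=0.03, step=0.01, n_steps=7):
    """Find a loss-maximizing perturbation of x within an L-infinity ball of radius eps."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(n_steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta = (delta + step * grad.sign()).clamp(-eps, eps)  # ascend, then project
        delta.requires_grad_(True)
    return delta.detach()

def adversarial_training_step(model, optimizer, x, y):
    delta = pgd_perturbation(model, x, y)            # generate perturbations on the fly
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x + delta), y)      # train on perturbed inputs, correct labels
    loss.backward()
    optimizer.step()
    return loss.item()
```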
We asked whether adversarial training would improve human recognition of model metamers. A priori, it was not clear what to expect. Making models robust to adversarial perturbations causes them to exhibit more of the invariances of humans (the shaded orange covers more of the blue outline in Fig. 1b), but it is not obvious that this will reduce the model invariances that are not shared by humans (that is, to decrease the orange outlined regions that do not overlap with the blue shaded region in Fig. 1b). Previous work visualizing latent representations of visual neural networks suggested that robust training might make model representations more human-like48, but human recognition of model metamers had not been behaviorally evaluated.
We first generated model metamers for five adversarially trained vision models48 with different architectures and perturbation sizes. As a control, we also trained models with equal magnitude perturbations in random, rather than adversarial, directions, which are typically ineffective at preventing adversarial attacks49. As intended, adversarially trained models were more robust to adversarial perturbations than the standard-trained model or models trained with random perturbations (Supplementary Fig. 2a,b).
Metamers for the adversarially trained models were in all cases significantly more recognizable than those from the standard model (Fig. 4b–d and Extended Data Fig. 1), evident as a main effect of training type in each case. Training with random perturbations did not yield the same benefit (Supplementary Table 2). Despite some differences across adversarial training variants, all variants that we tried produced a human recognition benefit. It was nonetheless the case that metamers for late stages remained less than fully recognizable to humans for all model variants. We note that performance is inflated by the use of a 16-way alternative forced-choice task, for which above-chance performance is possible even with severely distorted images. See Extended Data Figs. 6 and 7 for an analysis of the consistency of metamer recognition across human observers and examples of the most and least recognizable metamers.
Given that metamers from adversarially trained models look less noise-like than those from standard models and that standard models may overrepresent high spatial frequencies50, we wondered whether the improvement in recognizability could be replicated in a standard-trained model by including a smoothness regularizer in metamer optimization. Such regularizers are common in neural network visualizations51, and although they sidestep the goal of human–model comparison, it was nonetheless of interest to assess their effect. We implemented the regularizer used in a well-known visualization paper51. Adding smoothness regularization to the metamer generation procedure for the standard-trained AlexNet model improved the recognizability of its metamers (Fig. 4e) but not as much as did adversarial training (and did not come close to generating metamers as recognizable as natural images; see Extended Data Fig. 8 for examples generated with different regularization coefficients). This result suggests that the benefit of adversarial training is not simply replicated by imposing smoothness constraints and that discrepant metamers more generally cannot be resolved with the addition of a smoothness prior.
Effects of adversarial training on auditory model metamers
We conducted analogous experiments with auditory models, again using two architectures and several perturbation types. Because the auditory models contain a fixed cochlear stage at their front end, there are two natural places to generate adversarial examples: they can be added to the waveform or the cochleagram. We explored both for completeness and found that adversarial training at either location resulted in adversarial robustness (Supplementary Fig. 2c–f).
We first investigated adversarial training with perturbations to the waveform (Fig. 5a). As with the visual models, human recognition was generally better for metamers from adversarially trained models but not for models trained with random perturbations (Fig. 5b,c and Supplementary Table 2). The model metamers from the robust models were visibly less noise-like when viewed in the cochleagram representation (Fig. 5d).
We also trained models with adversarial perturbations to the cochleagram representation (Fig. 5e). These models had significantly more recognizable metamers than both the standard models and the models adversarially trained on waveform perturbations (Fig. 5f,g and Supplementary Table 2), and the benefit was again specific to models trained with adversarial (rather than random) perturbations. These results suggest that the improvements from intermediate-stage perturbations may in some cases be more substantial than those from perturbations to the input representation.
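A sketch of how such intermediate-stage perturbations can be generated is shown below, assuming the model is split into a fixed `cochleagram_model` front end and a trainable `classifier_head`; the L2 budget and step sizes shown are illustrative rather than the values used in our experiments.

```python
# Sketch of generating an adversarial perturbation at an intermediate
# representation (a cochleagram) rather than at the input waveform.
import torch
import torch.nn.functional as F

def cochleagram_adversary(cochleagram_model, classifier_head, waveform, label,
                          eps=1.0, step=0.25, n_steps=5):
    """Perturb the intermediate cochleagram representation within an L2 budget."""
    coch = cochleagram_model(waveform).detach()        # fixed front-end representation
    delta = torch.zeros_like(coch, requires_grad=True)
    for _ in range(n_steps):
        loss = F.cross_entropy(classifier_head(coch + delta), label)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta = delta + step * grad / (grad.norm() + 1e-12)  # normalized ascent step
            delta = delta * (eps / delta.norm()).clamp(max=1.0)  # project onto the L2 ball
        delta.requires_grad_(True)
    # the later model stages are then trained on the perturbed cochleagram with the correct label
    return (coch + delta).detach()
```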
Overall, these results suggest that adversarial training can cause model invariances to become more human-like in both visual and auditory domains. However, substantial discrepancies remain, as many model metamers from late model stages remain unrecognizable even after adversarial training.
Metamer recognizability dissociates from adversarial robustness
Although adversarial training increased human recognizability of model metamers, the degree of robustness from the training was not itself predictive of metamer recognizability. We first examined all the visual models from Figs. 2–5 and compared their adversarial robustness to the recognizability of their metamers from the final model stage (this stage was chosen because it exhibited considerable variation in recognizability across models). There was a correlation between robustness and metamer recognizability (ρ = 0.73, P < 0.001), but it was mostly driven by the overall difference between two groups of models, those that were adversarially trained and those that were not (Fig. 6a).
The auditory models showed a similar relationship as the visual models (Fig. 6b). When standard and adversarially trained models were analyzed together, metamer recognizability and robustness were correlated (ρ = 0.63, P = 0.004), driven by the overall difference between the two groups of models, but there was no obvious relationship when considering just the adversarially trained models.
To further assess whether variations in robustness produce variation in metamer recognizability, we compared the robustness of a large set of adversarially trained models (taken from a well-known robustness evaluation52) to the recognizability of their metamers from the final model stage. Despite considerable variation in both robustness and metamer recognizability, the two measures were not significantly correlated (ρ = 0.31, P = 0.099; Fig. 6c). Overall, it seems that something about the adversarial training procedure leads to more recognizable metamers but that robustness per se does not drive the effect.
Adversarial training is not the only means of making models adversarially robust. But when examining other sources of robustness, we again found examples where a model’s robustness was not predictive of the recognizability of its metamers. Here, we present results for two models with similar robustness, one of which had much more recognizable metamers than the other.
The first model was a CNN that was modified to reduce aliasing (LowpassAlexNet). Because many traditional neural networks contain downsampling operations (for example, pooling) without a preceding lowpass filter, they violate the sampling theorem25,53 (Fig. 6d). It is nonetheless possible to modify the architecture to reduce aliasing, and such modifications have been suggested to improve model robustness to small image translations12,13. The second model was a CNN that contained an initial processing block inspired by the primary visual cortex in primates54. This block contained hard-coded Gabor filters, had noise added to its responses during training (VOneAlexNet; Fig. 6e) and had been previously demonstrated to increase adversarial robustness55. A priori, it was unclear whether either model modification would improve human recognizability of the model metamers.
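The anti-aliasing modification can be illustrated with a ‘blur-pool’-style operation that applies a lowpass filter before subsampling; the 5 × 5 binomial kernel and its placement here are illustrative assumptions rather than the exact LowpassAlexNet implementation.

```python
# Sketch of lowpass filtering before downsampling to reduce aliasing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowpassDownsample(nn.Module):
    """Blur with a fixed lowpass kernel, then subsample."""
    def __init__(self, channels, stride=2):
        super().__init__()
        self.stride = stride
        self.channels = channels
        k1d = torch.tensor([1., 4., 6., 4., 1.])
        k2d = k1d[:, None] * k1d[None, :]
        k2d = k2d / k2d.sum()
        # one identical lowpass filter per channel, applied depthwise
        self.register_buffer("kernel", k2d.repeat(channels, 1, 1, 1))

    def forward(self, x):
        x = F.conv2d(x, self.kernel, padding=2, groups=self.channels)  # lowpass filter
        return x[:, :, ::self.stride, ::self.stride]                   # then subsample

# Example replacement of an aliasing-prone stride-2 max pool:
# nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=1), LowpassDownsample(channels))
```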
Both architectures were comparably robust and more robust than the standard AlexNet to adversarial perturbations (Fig. 6f) as well as ‘fooling images’14 (Extended Data Fig. 9a) and ‘feature adversaries’56 (Extended Data Fig. 9b,c). However, metamers generated from LowpassAlexNet were substantially more recognizable than metamers generated from VOneAlexNet (Fig. 6g,h). This result provides further evidence that model metamers can differentiate models even when adversarial robustness does not.
These adversarial robustness-related results may be understood in terms of configurations of the four types of stimulus sets originally shown in Fig. 1b (Fig. 6i). Adversarial examples are stimuli that are metameric to a reference stimulus for humans but are classified differently from the reference stimulus by a model. Adversarial robustness thus corresponds to a situation where the human metamers for a reference stimulus fall completely within the set of stimuli that are recognized as the reference class by a model (blue outline contained within the orange shaded region in Fig. 6i, right column). This situation does not imply that all model metamers will be recognizable to humans (orange outline contained within the blue shaded region in the top row). These theoretical observations motivate the use of model metamers as a complementary model test and are confirmed by the empirical observations of this section.
Metamer recognizability and out-of-distribution robustness
Neural network models have also been found to be less robust than humans to images that fall outside their training distribution (for example, line drawings, silhouettes and highpass-filtered images that qualitatively differ from the photos in the common ImageNet1K training set; Fig. 6j)10,57,58. This type of robustness has been found to be improved by training models on substantially larger datasets59. We compared each model’s robustness to such ‘out-of-distribution’ images with the recognizability of its metamers from the final model stage (the model set included several models trained on large-scale datasets taken from Fig. 2d, along with all other models from Figs. 2c, 3 and 4). This type of robustness (measured by two common benchmarks) was again not correlated with metamer recognizability (ImageNet-C: ρ = –0.16, P = 0.227; Geirhos 2021: ρ = –0.17, P = 0.215; Fig. 6k).
Metamer recognizability dissociates from model–brain similarity
Are the differences between models shown by metamer recognizability similarly evident when using standard brain comparison benchmarks? To address this question, we used such benchmarks to evaluate the visual and auditory models described above in Figs. 2–5. For the visual models, we used the Brain-Score platform to measure the similarity of model representations to neural benchmarks for visual areas V1, V2 and V4 and the inferior temporal cortex (IT26,31; Fig. 7a). The platform’s similarity measure combines a set of model–brain similarity metrics, primarily measures of variance explained by regression-derived predictions. For each model, the score was computed for each visual area using the model stage that gave the highest similarity in held-out data for that visual area. We then compared this neural benchmark score to the recognizability of the model’s metamers from the same stage used to obtain the neural predictions. This analysis showed modest correlations between the two measures for V4 and IT, but these were not significant after Bonferroni correction and were well below the presumptive noise ceiling (Fig. 7b). Moreover, the neural benchmark scores were overall fairly similar across models. Thus, most of the variation in metamer recognizability was not captured by standard model–brain comparison benchmarks.
We performed an analogous analysis for the auditory models using a large dataset of human auditory cortical functional magnetic resonance imaging (fMRI) responses to natural sounds60 that had previously been used to evaluate neural network models of the auditory system4,61. We analyzed voxel responses within four regions of interest in addition to all of the auditory cortex, in each case again choosing the best-predicting model stage, measuring the variance it explained in held-out data and comparing that to the recognizability of the metamers from that stage (Fig. 7c). The correlation between metamer recognizability and explained variance in the brain response was not significant when all voxels were considered (ρ = –0.06 and P = 1.0 with Bonferroni correction; Fig. 7d). We did find a modest correlation within one of the regions of interest (ROIs; speech: ρ = 0.58 and P = 0.08 with Bonferroni correction), but it was well below the presumptive noise ceiling (ρ = 0.78).
We conducted analogous analyses using representational similarity analysis instead of regression-based explained variance to evaluate auditory model–brain similarity; these analyses yielded similar conclusions as the regression-based analyses (Extended Data Fig. 10). Overall, the results indicate that the metamer test is complementary to traditional metrics of model–brain fit (and often distinguishes models better than these traditional metrics).
Metamer transfer across models
Are one model’s metamers recognizable by other models? We addressed this issue by taking all the models trained for one modality, holding one model out as the ‘generation’ model and presenting its metamers to each of the other models (‘recognition’ models), measuring the accuracy of their class predictions (Fig. 8a). We repeated this procedure with each model as the generation model. As a summary measure for each generation model, we averaged the accuracy across the recognition models (Fig. 8a and Supplementary Figs. 3 and 4). To facilitate comparison, we analyzed models that were different variants of the same architecture. We used permutation tests to evaluate differences between generation models (testing for main effects).
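The transfer analysis can be sketched as follows, assuming a dictionary of metamers organized by generation stage and a dictionary of recognition models; the data format is hypothetical and simplified relative to the actual analysis.

```python
# Sketch of the metamer-transfer analysis: metamers from one ("generation") model
# are scored by every other ("recognition") model.
import torch

def transfer_accuracy(generation_model_name, metamers, recognition_models):
    """metamers: dict mapping model stage -> list of (stimulus, label) pairs."""
    results = {}
    for stage, examples in metamers.items():
        accs = []
        for name, model in recognition_models.items():
            if name == generation_model_name:
                continue                          # hold out the generation model
            correct = 0
            for stimulus, label in examples:
                with torch.no_grad():
                    pred = model(stimulus.unsqueeze(0)).argmax(dim=1).item()
                correct += int(pred == label)
            accs.append(correct / len(examples))
        results[stage] = sum(accs) / len(accs)    # average over recognition models
    return results
```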
Metamers from late stages of the standard-trained ResNet50 were generally not recognized by other models (Fig. 8b). A similar trend held for the models trained with self-supervision. By contrast, metamers from the adversarially trained models were more recognizable to other models (Fig. 8b; P < 0.0001 compared to either standard or self-supervised models). We saw an analogous metamer transfer boost from the model with reduced aliasing (LowpassAlexNet), for which metamers for intermediate stages were more recognizable to other models (Fig. 8c; P < 0.0001 compared to either standard or VOneAlexNet models). Similar results held for auditory models (Fig. 8d; waveform adversarially trained versus standard, P = 0.011; cochleagram adversarially trained versus standard, P < 0.001), although metamers from the standard-trained CochResNet50 transferred better to other models than did those for the supervised vision model, perhaps due to the shared cochlear representation present in all auditory models, which could increase the extent of shared invariances.
These results suggest that models tend to contain idiosyncratic invariances, in that their metamers vary in ways that render them unrecognizable to other models. This finding is loosely consistent with findings that the representational dissimilarity matrices for natural images can vary between individual neural network models62. The results also clarify the effect of adversarial training. Specifically, they suggest that adversarial training removes some of the idiosyncratic invariances of standard-trained deep neural networks rather than learning new invariances that are not shared with other models (in which case their metamers would not have been better recognized by other models). The architectural change that reduced aliasing had a similar effect, albeit limited to the intermediate model stages.
The average model recognition of metamers generated from a given stage of another model is strikingly similar to human recognition of the metamers from that stage (compare Fig. 8b–d to Figs. 3c, 4b, 5f and 6g). To quantify this similarity, we plotted the average model recognition for metamers from each stage of each generating model against human recognition of the same stimuli, revealing a strong correlation for both visual (Fig. 8e) and auditory (Fig. 8f) models. This result suggests that the human–model discrepancy revealed by model metamers reflects invariances that are often idiosyncratic properties of a specific neural network, leading to impaired recognition by both other models and human observers.
Discussion
We used model metamers to reveal invariances of deep artificial neural networks and compared these invariances to those of humans by measuring human recognition of visual and auditory model metamers. Metamers of standard deep neural networks were dominated by invariances that are absent from human perceptual systems, in that metamers from late model stages were typically completely unrecognizable to humans. This was true across modalities (visual and auditory) and training methods (supervised versus self-supervised training). The effect was driven by invariances that are idiosyncratic to a model, as human recognizability of a model’s metamers was well predicted by their recognizability to other models. We identified ways to make model metamers more human-recognizable in both the auditory and visual domains, including a new type of adversarial training for auditory models using perturbations at an intermediate model stage. Although there was a substantial metamer recognizability benefit from one common training method to reduce adversarial vulnerability, we found that metamers revealed model differences that were not evident by measuring adversarial vulnerability alone. Moreover, the model improvements revealed by model metamers were not obvious from standard brain prediction metrics. These results show that metamers provide a model comparison tool that complements the standard benchmarks that are in widespread use. Although some models produced more recognizable metamers than others, metamers from late model stages remained less recognizable than natural images or sounds in all cases we tested, suggesting that further improvements are needed to align model representations with those of biological sensory systems.
Might humans analogously have invariances that are specific to an individual? This possibility is difficult to explicitly test given that we cannot currently sample human metamers (metamer generation relies on having access to the model’s parameters and responses, which are currently beyond reach for biological systems). If idiosyncratic invariances were also present in humans, the phenomenon we have described here might not represent a human–model discrepancy and could instead be a common property of recognition systems. The main argument against this interpretation is that several model modifications (different forms of adversarial training and architectural modifications to reduce aliasing) substantially reduced the idiosyncratic invariances present in standard deep neural network models. These results suggest that idiosyncratic invariances are not unavoidable in a recognition system. Moreover, the set of modifications explored here was far from exhaustive, and it seems plausible that idiosyncratic invariances could be further alleviated with alternative training or architecture changes in the future.
Relation to previous work
Previous work has also used gradient descent on the input to visualize neural network representations51,63. However, the significance of these visualizations for evaluating neural network models of biological sensory systems has received little attention. One contributing factor may be that model visualizations have often been constrained by added natural image priors or other forms of regularization64 that help make visualizations look more natural but mask the extent to which they otherwise diverge from a perceptually meaningful stimulus. By contrast, we intentionally avoided priors or other regularization when generating model metamers, as they defeat the purpose of the metamer test. When we explicitly measured the benefit of regularization, we found that it did boost recognizability somewhat but that it was not sufficient to render model metamers fully recognizable or reproduce the benefits of model modifications that improve metamer recognizability (Fig. 4e).
Another reason the discrepancies we report here have not been widely discussed within neuroscience is that most studies of neural network visualizations have not systematically measured recognizability to human observers (in part because these visualizations are primarily reported within computer science, where such experiments are not the norm). We found controlled experiments to be essential. Before running full-fledged experiments, we always conducted the informal exercise of generating examples and evaluating them subjectively. Although the largest effects were evident informally, the variability of natural images and sounds made it difficult to predict with certainty how an experiment would turn out. It was thus critical to substantiate informal observation with controlled experiments in humans.
Metamers are also methodologically related to a type of adversarial example generated by adding small perturbations to an image from one class such that the activations of a classifier (or internal stage) match those of a reference image from a different class56,65, despite being seen as different classes by humans when tested informally66,67. Our method differs in probing model invariances without any explicit bias to cause metamers to appear different to humans. We found models in which vulnerability to these adversarial examples dissociated from metamer recognizability (Extended Data Fig. 9), suggesting that metamers may reflect distinct model properties.
Effects of unsupervised training
Unsupervised learning potentially provides a more biologically plausible computational theory of learning41,68 but produced qualitatively similar model metamers as supervised learning. This finding is consistent with evidence that the classification errors of self-supervised models are no more human-like than those of supervised models69. The metamer-related discrepancies are particularly striking for self-supervised models because they are trained with the goal of invariance, being explicitly optimized to become invariant to the augmentations performed on the input. We also found that the divergence with human recognition had a similar dependence on model stage irrespective of whether models were trained with or without supervision. These findings raise the possibility that factors common to supervised and unsupervised neural networks underlie the divergence with humans.
Differences in metamers across stages
The metamer test differs from some other model metrics (for example, behavioral judgments of natural images or sounds, or measures of adversarial vulnerability) in that metamers can be generated from every stage of a model, with the resulting discrepancies associated with particular model stages. For instance, metamers revealed that intermediate stages were more human-like in some models than others. The effects of reducing aliasing produced large improvements in the human recognizability of metamers from intermediate stages (Fig. 6g), consistent with the idea that biological systems also avoid aliasing. By contrast, metamers from the final stages showed little improvement. This result indicates that this model change produces intermediate representations with more human-like invariances despite not resolving the discrepancy introduced at the final model stages. The consistent discrepancies at the final model stages highlight these late stages as targets for model improvements45.
For most models, the early stages produced model metamers that were fully recognizable but that also resemble the original image or sound they were matched to. By contrast, metamers from late stages physically deviated from the original image or sound but for some models nonetheless remained recognizable. This difference highlights two ways that a model’s metamers can pass the recognition test used here, either by being perceptually indistinguishable to humans or by being recognizable to humans as the same class despite being perceptually distinct. This distinction could be quantified in future work by combining a traditional metamer test with our recognition test.
Limitations
Although a model that fails our metamer test is ruled out as a description of human perception, passing the test on its own reveals little. For instance, a model that instantiates the identity mapping would pass our test despite not being able to account for human perceptual abilities. Traditional metrics thus remain critical but on their own are also insufficient (as shown in Figs. 6 and 7). Failing the test also does not imply that the model representations are not present in the brain, only that they are not sufficient to account for the recognition behavior under consideration. For instance, there is considerable evidence for time-averaged auditory statistics in auditory perception19,70 even though they do not produce human-recognizable metamers for speech (Extended Data Fig. 5c). The results point to the importance of a large suite of test metrics for model comparison, including, but not limited to, the model metamer test.
Model metamers are generated via gradient-based optimization of a non-convex loss function and only approximately reproduce the activations of the natural stimulus to which they are matched. We attempted to improve on previous neural network visualization work51,63 by setting explicit criteria for optimization success (Fig. 1e and Extended Data Fig. 4). However, the reliance on optimization may be a limitation in some contexts and with some models.
The metamer optimization process is also not guaranteed to sample uniformly from the set of a model’s metamers. Non-uniform sampling cannot explain the human–model discrepancies we observed but could in principle contribute to differences between the magnitude of discrepancies for some models compared to others, for instance if differences in the optimization landscape make it more or less likely that the metamer generation process samples along a model’s idiosyncratic invariances. We are not aware of any reason to think that this might be the case, but it is not obvious how to fully exclude this possibility.
Future directions
The underlying causes of the human–model discrepancies demonstrated here seem important to understand, both because they may clarify biological sensory systems and because many potential model applications, such as model-based signal enhancement71,72, are likely to be hindered by human-discrepant model invariances. The results of Fig. 8 (showing that human recognition of a model’s metamers can be predicted by the recognition judgments of a set of other models) suggest a way to efficiently screen for discrepant metamers, which should facilitate evaluation of future models.
One explanation for the human–model discrepancies we observed could be that biological sensory systems do not instantiate invariances per se in the sense of mapping multiple different inputs onto the same representation73,74. Instead, they might learn representations that ‘untangle’ behaviorally relevant variables. For instance, a system could represent word labels and talker identity or object identity and pose via independent directions in a representational space. Such a representational scheme could enable invariant classification without invariant representations and might be facilitated by training on multiple tasks or objectives (rather than the single tasks/objectives used for the models we tested). Alternative model architectures may also help address this hypothesis. In particular, ‘generative’ models that estimate the probability of an input signal given a latent variable (rather than the probability of a latent variable for a given input signal as in the ‘discriminative’ models studied here) seem likely to mitigate the metamer discrepancies we found. There are indications that adding generative training objectives can improve the alignment of model representations with humans in models trained on small-scale tasks75. But currently, we lack methods for building such models that can support human-level recognition at scale76,77.
The discrepancies shown here for model metamers contrast with a growing number of examples of human–model similarities for behavioral judgments of natural stimuli. Models optimized for object recognition78, speech recognition4, sound localization79 and pitch recognition80 all exhibit qualitative and often quantitative similarities to human judgments when run in traditional psychophysical experiments with natural or relatively naturalistic stimuli (that fall near their training distribution). However, these same models can exhibit inhuman behavior for signals that fall outside the distribution of natural sounds and images, particularly those derived from the model.
Current deep neural network models are overparametrized, such that training produces one of many functions consistent with the training data. From this perspective it is unsurprising that different systems can perform similarly on natural signals while exhibiting different responses to signals outside the training distribution of natural images or sounds. Yet, we nonetheless found that sensible engineering modifications succeeded in bringing models into better alignment with human invariances. These results demonstrate that divergence between human and model invariances is not inevitable and show how metamers can be a useful metric to guide and evaluate the next generation of brain models.
Methods
All experiments with human participants (both online and in the lab) were approved by the Committee On the Use of Humans as Experimental Subjects at the Massachusetts Institute of Technology (MIT) and were conducted with the informed consent of the participants.
Model implementation
Models were implemented in the PyTorch deep learning library82 and obtained through publicly available checkpoints or trained by the authors on the MIT OpenMind computing cluster. All models and analyses used Python 3.8.2 and PyTorch 1.5.0, except in cases where models or graphics processing unit hardware required operations not present in PyTorch 1.5.0, in which case we used PyTorch 1.12.1. Details of all Python dependencies and package versions are provided in the form of a conda environment at https://github.com/jenellefeather/model_metamers_pytorch.
Additional details of model training and evaluation are provided in Supplementary Modeling Information Note 1 and Supplementary Tables 3 and 4. Full architecture descriptions are provided in Supplementary Modeling Information Note 2.
Metamer generation
Optimization of metamers
Gradient descent was performed on the input signal to minimize the normalized squared error between all activations at a particular model stage (for instance, each x, y and channel value from the output of a convolutional layer) for the model metamer and the corresponding activations for a natural signal,

$$\frac{\lVert A - A' \rVert_2^2}{\lVert A \rVert_2^2},$$
where A represents the activations from the natural signal, and A′ represents the activations from the model metamer (that is, sampling from the preimage of the model activations at the generation stage). The weights of the model remained fixed during the optimization. Each step of gradient descent was constrained to have a maximum L2 norm of η, where η was initialized at 1 and was dropped by a factor of 0.5 after every 3,000 iterations. Optimization was run for a total of 24,000 steps for each generated metamer. The shape of the input stimuli, range of the input and any normalization parameters were matched to those used for testing the model on natural stimuli. Normalization that occurred after data augmentation in the visual models (subtracting channel means and dividing by channel standard deviations) was included as a model component during metamer generation (that is, gradients from these operations contributed to the metamer optimization along with all other operations in the model). For vision models, the input signal was initialized as a sample from a normal distribution with a standard deviation of 0.05 and a mean of 0.5 (or a standard deviation of 10 and a mean of 127.5 in the case of HMAX). For auditory models, the input signal was initialized from a random normal distribution with a standard deviation of 10⁻⁷ and a mean of 0 (or a standard deviation of 10⁻⁵ and a mean of 0 in the case of Spectemp).
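The constrained update and step-size schedule described above can be sketched as follows; `metamer` and `grad` denote the current input estimate and the gradient of the matching loss with respect to it, and the code is a simplified fragment rather than the full optimization loop.

```python
# Sketch of the L2-norm-constrained gradient step and the eta schedule.
import torch

def eta_schedule(iteration, eta0=1.0):
    """Step-size cap: starts at eta0 = 1 and is halved every 3,000 iterations."""
    return eta0 * (0.5 ** (iteration // 3000))

def constrained_update(metamer, grad, eta):
    """Gradient-descent step on the input with L2 norm capped at eta."""
    step = -grad
    norm = step.norm()
    if norm > eta:
        step = step * (eta / norm)
    return (metamer + step).detach().requires_grad_(True)
```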
Criteria for optimization success
Because metamers are derived via a gradient descent procedure, the activations that they produce approach those of the natural signal used to generate them but never exactly match. It was thus important to define criteria by which the optimization would be considered sufficiently successful to include the model metamer in the behavioral experiments.
The first criterion was that the activations for the model metamer had to be matched to those for the natural signal better than would be expected by chance. We measured the fidelity of the match between the activations for the natural stimulus and its model metamer at the matched model stage using three different metrics: Spearman ρ, Pearson R² and the SNR,

$$\mathrm{SNR} = 10\log_{10}\!\left(\frac{\lVert x \rVert_2^2}{\lVert x - y \rVert_2^2}\right),$$
where x is the activations for the original stimulus when comparing metamers (or for a randomly selected stimulus when computing the null distribution), and y is the activations for the comparison stimulus (the model metamer or another randomly selected stimulus). We then ensured that for each of the three measures, the value for the model metamer fell outside a null distribution measured between 1,000,000 randomly chosen image or audio pairs from the training dataset. Metamers that failed this null distribution test for any of the Spearman ρ, Pearson R² or SNR measures at the stage used for the optimization were excluded from the set of experimental stimuli. The only exception was the HMAX model, for which we used only the SNR as the matching criterion.
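These match-fidelity measures and the null-distribution test can be sketched as follows; treating the maximum of the null distribution as the pass threshold is an assumption of this sketch rather than a detail stated above.

```python
# Sketch of the three match-fidelity measures and the null-distribution check.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def match_metrics(x, y):
    """x: activations for the reference stimulus; y: activations for its metamer."""
    x, y = x.ravel(), y.ravel()
    return {
        "spearman_rho": spearmanr(x, y).correlation,
        "pearson_r2": pearsonr(x, y)[0] ** 2,
        "snr_db": 10 * np.log10(np.sum(x ** 2) / np.sum((x - y) ** 2)),
    }

def passes_null_test(metrics, null_metrics):
    """null_metrics: dict of arrays of the same measures computed for random stimulus pairs."""
    return all(metrics[k] > np.max(null_metrics[k]) for k in metrics)
```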
The second criterion was that the models had to produce the same class label for the model metamer and natural signal. For visual models, the model metamer had to result in the same 16-way classification label as the natural signal to which it was matched. For the auditory models, the model metamer had to result in the same word label (of 794 possible labels, including ‘null’) as the natural speech signal to which it was matched. For models that did not have a classifier stage (the self-supervised models, HMAX and the spectrotemporal filter model), we trained a classifier as described in Supplementary Modeling Note 1 for this purpose. The classifier was included to be conservative but in practice could be omitted in future work, as very few stimuli pass the first matching fidelity criterion but not the classifier criterion.
Handling gradients through the ReLU operation
Many neural networks use the ReLU nonlinearity, which yields a partial derivative of 0 if the input is negative. We found empirically that it was difficult to match ReLU layers due to the initialization producing many activations of 0. To improve the optimization when generating a metamer for activations immediately following a ReLU, we modified the derivative of the metamer generation layer ReLU to be 1 for all input values, including values below 0 (ref. 25). ReLU layers that were not the metamer generation layer behaved normally, with a gradient of 0 for input values below 0.
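A minimal PyTorch sketch of this modification is shown below, implemented as a custom autograd function whose backward pass treats the ReLU derivative as 1 everywhere; in practice it would be substituted only at the metamer generation stage.

```python
import torch

class ReLUWithFullGradient(torch.autograd.Function):
    """ReLU whose backward pass passes the gradient through unchanged, i.e., the
    derivative is treated as 1 for all inputs, including negative ones."""

    @staticmethod
    def forward(ctx, x):
        return torch.relu(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # derivative of 1 everywhere

# Usage during metamer optimization (generation-stage ReLU only):
# out = ReLUWithFullGradient.apply(pre_activation)
```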
Metamer generation with regularization
To investigate the effects of regularization on metamer recognizability, we generated metamers with additional constraints on the optimization procedure. We followed the procedures of an earlier paper by Mahendran and Vedaldi51. Two regularization terms were included: (1) a total variation (TV) regularizer and (2) an α-norm regularizer.
The resulting objective function minimized to generate metamers was

$$\frac{\lVert A - A' \rVert_2^2}{\lVert A \rVert_2^2} + \lambda_{\alpha} R_{\alpha}(x) + \lambda_{TV} R_{TV}(x),$$

using the 6-norm for the α-norm regularizer

$$R_{\alpha}(x) = \sum_{i,j,c} \left| x_{i,j,c} - \bar{x} \right|^{\alpha}$$

and using the TV regularizer

$$R_{TV}(x) = \sum_{i,j,c} \left( (x_{i,j+1,c} - x_{i,j,c})^2 + (x_{i+1,j,c} - x_{i,j,c})^2 \right)^{\beta/2}$$

(with the TV exponent β set following Mahendran and Vedaldi51), where A represents the activations evoked by the natural signal at the generation layer, A′ represents the activations evoked by the model metamer at the generation layer, x is the input signal being optimized (the model metamer in the signal domain), λα and λTV are scaling coefficients for the regularizers, and x̄ is the mean of the input signal x (x is normalized according to the typical normalization for the model, subtracting the dataset mean and dividing by the dataset standard deviation for each channel).
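The sketch below illustrates one way the regularized objective could be assembled in code; the default coefficient for the TV term and the TV exponent are placeholders rather than the values used in the experiments.

```python
import torch

def regularized_metamer_loss(activations, target, x, x_mean,
                             lambda_alpha=0.005, lambda_tv=1e-3, alpha=6, beta=2):
    """Sketch of the regularized objective: activation-matching term plus an
    alpha-norm regularizer and a total-variation (TV) regularizer on the input x.
    lambda_tv and beta defaults are placeholders, not values from the paper."""
    match = torch.sum((activations - target) ** 2) / torch.sum(target ** 2)

    # Alpha-norm regularizer on the mean-subtracted (normalized) input signal.
    alpha_norm = torch.sum(torch.abs(x - x_mean) ** alpha)

    # Finite differences over the spatial dimensions (x assumed [channels, H, W]).
    dh = x[:, 1:, :] - x[:, :-1, :]
    dw = x[:, :, 1:] - x[:, :, :-1]
    tv = torch.sum((dh[:, :, :-1] ** 2 + dw[:, :-1, :] ** 2) ** (beta / 2))

    return match + lambda_alpha * alpha_norm + lambda_tv * tv
```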
For the TV regularizer, we generated metamers for three different values of the coefficient λTV. As observed by Mahendran and Vedaldi51, we found that larger TV regularization impaired optimization at early model stages, with resulting stimuli often not passing our metamer optimization success criteria (for instance, only 2/400 metamers generated from relu0 of AlexNet passed these criteria for the largest coefficient value). Thus, for the behavioral experiments, we chose separate coefficient values for each model stage. Specifically, in AlexNet, we used one coefficient value for relu0 and relu1, a second for relu2 and relu3 and a third for relu4, fc0_relu, fc1_relu and final, with smaller coefficients assigned to earlier stages (this is exactly what was done in Mahendran and Vedaldi51 except that the λ values are different due to differences in how the input is normalized, 0–255 in Mahendran and Vedaldi51 compared to 0–1 in our models).
For the α-norm regularizer, we followed the methods used by Mahendran and Vedaldi51, with α = 6, and used a single coefficient of λα = 0.005 for all stages. This coefficient was chosen based on the logic proposed in Mahendran and Vedaldi51 for the starting value, followed by a small sweep around that value (10× up and 10× down) on a small number of examples, in which we subjectively judged which value produced the largest benefit to visual recognizability.
We observed that when these regularizers were used, the default step sizes (initial learning rate of η = 1) used in our metamer generation method resulted in stimuli that looked qualitatively more ‘gray’ than expected, that is, stayed close to the mean. Thus, to maximize the chances of seeing a benefit from the regularization, in a separate condition, we increased the initial step size for metamer generation to be 16 times the default value (initial η = 16).
We found empirically that there was a trade-off between satisfying the goal of matching the metamer activations and minimizing the regularization term. As described above, it was necessary to hand-tune the regularization weights to obtain something that met our convergence criteria, but even when these criteria were met, metamers generated with regularization tended to have worse activation matches than metamers generated without regularization. This observation is consistent with the idea that there is not an easy fix to the discrepancies revealed by metamers that simply involves adding an additional term to the optimization. And in some domains (such as audio), it is not obvious what to use for a regularizer. Although the use of additional criteria to encourage the optimization to stay close to the manifold of ‘natural’ examples likely has useful applications, we emphasize that it is at odds with the goal of testing whether a model on its own replicates the properties of a biological sensory system.
Behavioral experiments
All behavioral experiments presented in the main text were run on Amazon Mechanical Turk. To increase data quality, Amazon Turk qualifications were set to relatively stringent levels. The ‘HIT Approval Rate for all Requesters’ HITs’ had to be at least 97%, and the ‘Number of HITs Approved’ had to exceed 1,000. Blinding was not applicable as the analysis was automated. Example code to run the online experiments is available at https://github.com/jenellefeather/model_metamers_pytorch.
Stimuli: image experiments
Each stimulus belonged to 1 of the 16 entry-level Microsoft Common Objects in Context categories. We used a mapping from these 16 categories to the corresponding ImageNet1K categories (where multiple ImageNet1K categories can map onto a single Microsoft Common Objects in Context category), used in a previous publication10. For each of the 16 categories, we selected 25 examples from the ImageNet1K validation dataset for a total of 400 natural images that were used to generate stimuli. A square center crop was taken for each ImageNet1K image (with the smallest dimension of the image determining the size), and the square image was rescaled to the necessary input dimensions for each ImageNet1K-trained model. Metamers were generated for each of the 400 images to use for the behavioral experiments.
Stimuli: auditory experiments
Stimuli were generated from 2-s speech audio excerpts randomly chosen from the test set of the Word–Speaker–Noise dataset25 (Supplementary Modeling Note 1), constrained such that only clips from unique sources within the Wall Street Journal corpus were used. Sounds were cropped to the middle 2 s of the clip such that the labeled word was centered at the 1-s mark. To reduce ambiguity about the clip onset and offset, we also screened to ensure that the beginning and final 0.25 s of the clip were no more than 20 dB quieter than the full clip. Four hundred clips were chosen subject to these constraints and such that each clip contained a different labeled word. Metamers were generated for each of the 400 clips.
Image behavioral experiment
We created a visual experiment in JavaScript similar to that described in a previous publication10. Participants were tasked with classifying an image into 1 of 16 presented categories (airplane, bear, bicycle, bird, boat, bottle, car, cat, chair, clock, dog, elephant, keyboard, knife, oven and truck). Each category had an associated image icon that participants chose from during the experiment. Each trial began with a fixation cross at the center of the screen for 300 ms, followed by a natural image or a model metamer presented at the center of the screen for 300 ms, a pink noise mask presented for 300 ms and a 4 × 4 grid containing all 16 icons. Participants selected an image category by clicking on the corresponding icon. To minimize effects of internet disruptions, we ensured that the image was loaded into the browser cache before the trial began. To assess whether any timing variation in the online experiment setup might have affected overall performance, we compared recognition performance on natural images to that measured during in-lab pilot experiments (with the same task but different image examples) reported in an earlier conference paper25. The average online performance across all natural images was on par with or higher than that measured in the lab (in-lab proportion correct = 0.888 ± 0.0240) for all experiments.
The experimental session began with 16 practice trials to introduce participants to the task, with 1 trial for each category, each presenting a natural image from the ImageNet1K training set. Participants received feedback for these first 16 trials. Participants then began a 12-trial demo experiment that contained some natural images and some model metamers generated from the ImageNet1K training set. The goal of this demo experiment was twofold: (1) to introduce participants to the types of stimuli they would see in the main experiment and (2) to serve as a screening criterion to remove participants who were distracted, misunderstood the task instructions, had browser incompatibilities or were otherwise unable to complete the task. Participants were only allowed to start the main experiment if they correctly answered at least 7 of the 12 demo trials, which was the minimum score achieved on these same demo stimuli by 16 in-lab participants in a pilot experiment25. In total, 341 of 417 participants passed the demo experiment and chose to move on to the main experiment. Participants received $0.50 for completing the demo experiment.
There were 12 different online image experiments, each including a set of conditions (model stages) to be compared. Participants only saw 1 natural image or metamer for each of the 400 images in the behavioral stimulus set. Participants additionally completed 16 catch trials consisting of the icon image for one of the classes. Participant data were only included in the analysis if the participant got 15 of 16 of these catch trials correct (270 of 341 participants were included across the 12 experiments). Of these participants, 125 self-identified as female, 143 self-identified as male, and 2 did not report. The mean age was 42.1 years, the minimum age was 20 years, and the maximum age was 78 years. For all but the HMAX experiment, participants completed 416 trials, 1 for each of the 400 original images plus the 16 catch trials. The 400 images were randomly assigned to the experiment conditions subject to the constraint that each condition had approximately the same number of trials (Supplementary Table 5). The resulting 416 total trials were then presented in random order across the conditions of the experiment. The HMAX experiment used only 200 of the original 400 images for a total of 216 trials. Participants received an additional $6.50 for completing the experiment (or $3.50 in the case of HMAX).
Model performance on this 16-way classification task was evaluated by measuring the model's predictions for the full 1,000-way ImageNet classification task, restricting attention to the 231 ImageNet1K classes that map onto the 16 entry-level categories, finding the class with the maximum probability among these and scoring the response as the entry-level category to which that class maps.
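A sketch of this evaluation is shown below; the mapping from the 16 entry-level categories to ImageNet1K class indices is assumed to be loaded elsewhere (for example, from the mapping used in ref. 10), and the function name is ours.

```python
import numpy as np

def sixteen_way_prediction(probs_1000, category_to_imagenet_indices):
    """Map a 1,000-way ImageNet probability vector to one of the 16 entry-level
    categories: among the 231 mapped ImageNet classes, find the class with the
    maximum probability and return its entry-level category.

    category_to_imagenet_indices: dict mapping each of the 16 category names to
        a list of ImageNet1K class indices (loaded elsewhere).
    """
    best_cat, best_prob = None, -np.inf
    for cat, indices in category_to_imagenet_indices.items():
        p = np.max(probs_1000[indices])
        if p > best_prob:
            best_cat, best_prob = cat, p
    return best_cat
```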
Auditory behavioral experiment
The auditory experiment was similar to that used in earlier publications4,25. Each participant listened to a 2-s audio clip and chose 1 of 793 word labels corresponding to the word in the middle of the clip (centered at the 1-s mark of the clip). Responses were entered by typing the word label into a response box. As participants typed, word labels matching the letter string they were typing appeared below the response box to help participants identify allowable responses. Once a word was typed that matched 1 of the 793 responses, participants could move on to the next trial.
To increase data quality, participants first completed a short experiment (six trials) that screened for the use of headphones83. Participants received $0.25 for completing this task. If participants scored five of six or higher on this screen (224/377 participants), they moved on to a practice experiment consisting of ten natural audio trials with feedback (drawn from the training set) designed to introduce the task. This was followed by a demo experiment of 12 trials without feedback. These 12 trials contained both natural audio and model metamers25. The audio demo experiment served to introduce participants to the types of stimuli they would hear in the main experiment and to screen out poorly performing participants. A screening criterion was set at 5 of 12, which was the minimum for 16 in-lab participants in earlier work25. In total, 154 of 224 participants passed the demo experiment and chose to move on to the main experiment. Participants received an additional $0.50 for completing the demo experiment. We have repeatedly found that online auditory psychophysical experiments qualitatively and quantitatively reproduce in-lab results, provided that steps such as these are taken to help ensure good audio presentation quality and attentive participants84–87. Here, we found that average online performance on natural stimuli was comparable to in-lab performance reported in Feather et al.25 using the same task with different audio clips (in-lab proportion correct = 0.863 ± 0.0340).
There were six different main auditory experiments. The design of these experiments paralleled that of the image experiments. Participants only heard 1 natural speech or metamer stimulus for each of the 400 excerpts in the behavioral stimulus set. Participants additionally completed 16 catch trials. These catch trials each consisted of a single word corresponding to one of the classes. Participant data were only included in the analysis if the participant got 15 of 16 of these trials correct (this inclusion criterion removed 8 of 154 participants). Some participants chose to leave the experiment early and were excluded from the analysis (23 of 154), and 3 participants were excluded due to self-reported hearing loss, yielding a total of 120 participants across all auditory experiments. Of these participants, 45 self-identified as female, 68 self-identified as male, and 7 chose not to report (mean age = 39 years, minimum age = 22 years, maximum age = 77 years). For all but the Spectemp experiment, participants completed 416 trials, 1 for each of the 400 original excerpts, plus the 16 catch trials. The 400 excerpts were randomly assigned to the experiment conditions subject to the constraint that each condition had approximately the same number of trials (Supplementary Table 6). The resulting 416 total trials were then presented in random order across the conditions of the experiment. The Spectemp experiment used only 200 of the original 400 excerpts for a total of 216 trials. We collected online data in batches until we reached the target number of participants for each experiment. Participants received $0.02 for each trial completed plus an additional $3.50 bonus for completing the full experiment (or $2.00 for the Spectemp experiment).
Statistical tests: difference between human and model recognition accuracy
Human recognition experiments were analyzed by comparing human recognition of a generating model’s metamers to the generating model’s recognition of the same stimuli (its own metamers). Each human participant was run on a distinct set of model metamers; we presented each set to the generation model and measured its recognition performance for that set. Thus, if N human participants performed an experiment, we obtained N model recognition curves. We ran mixed-model, repeated measures ANOVAs with a within-group factor of metamer generation model stage and a between-group factor of observer (human or model observer), testing for both a main effect of observer and an interaction between observer and model stage. Data were non-normal due to a prevalence of values close to 1 or 0 depending on the condition, and so we evaluated statistical significance non-parametrically using permutation tests comparing the observed F statistic to that obtained after randomly permuting the data labels. To test for main effects, we permuted observer labels (model versus human). To test for interactions of observer and model stage, we permuted both observer labels and model stage labels independently for each participant. In each case, we used 10,000 random permutations and computed a P value by comparing the observed F statistic to the null distribution of F statistics from permuted data (that is, the P value was 1 – rank of the observed F statistic/number of permutations). F statistics here and elsewhere were calculated with MATLAB 2021a.
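The permutation test for a main effect of observer could be sketched as follows; the function computing the F statistic is left abstract here (the actual F statistics were computed in MATLAB), and the names are ours.

```python
import numpy as np

def permutation_p_value_main_effect(data, observer_labels, f_stat_fn,
                                    n_perm=10000, seed=0):
    """Non-parametric P value for a main effect of observer (human vs. model).

    data: array [n_curves, n_stages] of recognition accuracies (human and model
        curves stacked along the first axis).
    observer_labels: 0/1 label (human or model) for each row of `data`.
    f_stat_fn: callable returning the ANOVA F statistic for the main effect of
        observer given (data, labels); its implementation is not shown here.
    """
    rng = np.random.default_rng(seed)
    f_observed = f_stat_fn(data, observer_labels)
    null_f = np.array([f_stat_fn(data, rng.permutation(observer_labels))
                       for _ in range(n_perm)])
    # Proportion of permuted F statistics that meet or exceed the observed one.
    return float(np.mean(null_f >= f_observed))
```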
Because the classical models (Extended Data Figs. 4 and 5) did not perform recognition judgments, rather than comparing human and model recognition as in the experiments involving neural network models, we instead tested for a main effect of model stage on human observer recognition. We performed a single-factor repeated measures ANOVA using a within-group factor of model stage, again evaluating statistical significance non-parametrically (we randomly permuted the model stage labels of the recognition accuracy data, independently for each participant, with 10,000 random permutations).
Statistical tests: difference between human recognition of metamers generated from different models
To compare human recognition of metamers generated from different models, we ran a repeated measures ANOVA with within-group factors of model stage and generating model. This type of comparison was only performed in cases where the generating models had the same architecture (so that the model stages were shared between models). We again evaluated statistical significance non-parametrically by comparing the observed F statistic to a null distribution of F statistics from permuted data (10,000 random permutations). To test for a main effect of generating model, we randomly permuted the generating model label independently for each participant. To test for an interaction between generating model and model stage, we permuted both generating model and model stage labels independently for each participant.
Power analysis to determine sample sizes
To estimate the number of participants necessary to be well powered for the planned statistical tests, we ran a pilot experiment comparing the standard versus adversarially trained ResNet50 and CochResNet50 models, as this experiment included the largest number of conditions, and we expected that differences between different adversarially trained models would be subtle, putting an upper bound on the sample sizes needed across experiments.
For the vision experiment, we ran ten participants in a pilot experiment on Amazon Mechanical Turk. The format was identical to that of the main experiments described here, with the exception that we used a screening criterion of 8 of 12 correct for the pilot rather than the 7 of 12 correct used for the main experiment. In this pilot experiment, the smallest effect size out of those we anticipated analyzing in the main experiments was the comparison between the L∞-norm (ε = 8/256) adversarially trained ResNet50 and the L2-norm (ε = 3) adversarially trained ResNet50, with a partial η² value of 0.10 for the interaction. A power analysis with G*Power88 showed that 18 participants were needed to have a 95% chance of seeing an effect of this size at a P < 0.01 significance level. We thus set a target of 20 participants for each online vision experiment.
For the auditory experiments, we ran 14 participants in a pilot experiment on Amazon Mechanical Turk. The format was identical to that of the main experiments in this paper, with the exception that 8 of the 14 participants only received six original audio trials with feedback, whereas in the main experiment, ten trials with feedback were used. The smallest effect size of interest was that for the comparison between the L∞-norm (ε = 0.002) adversarially trained CochResNet50 and the L2-norm (ε = 1) waveform adversarially trained CochResNet50, yielding a partial η² value of 0.37 for the interaction. A power analysis with G*Power indicated that 12 participants were needed to have a 95% chance of seeing an effect of this size at a P < 0.01 significance level. To match the image experiments, we set a target of 20 participants for each main auditory experiment.
Split-half reliability analysis of metamer confusion matrices
To assess whether human participants had consistent error patterns, we compared confusion matrices from split halves of participants. Each row of the confusion matrix (corresponding to a category label) was normalized by the number of trials for that label. We then computed the Spearman correlation between the confusion matrices from each split and compared this correlation to that obtained from confusion matrices from permuted participant responses for the condition. We computed the correlation for 1,000 random splits of participants (splitting the participants in half) and used a different permutation of the response for each split. We counted the number of times that the difference between the true split-half correlation and the shuffled correlation was less than or equal to 0 (noverlap), and the P value was computed as P = noverlap/1,000 (the proportion of splits in which the shuffled correlation was at least as large as the true split-half correlation).
Human consistency of errors for individual stimuli
In the experiment to evaluate the consistency of errors for individual stimuli (Extended Data Fig. 7), we only included four conditions so that we could collect enough data to analyze performance on individual images: natural images, metamers from the relu2 and final stages of the random perturbation-trained AlexNet L2-norm (ε = 1) model and metamers from the final stage of the adversarial perturbation-trained AlexNet L2-norm (ε = 1) model. The rationale for the inclusion of these stages was that the relu2 stage of the random perturbation AlexNet and the final stage of the adversarial perturbation AlexNet had similarly recognizable metamers (Fig. 4c), whereas metamers from the final stage of the random perturbation AlexNet were recognized by humans no better than chance.
To first assess the reliability of the recognizability of individual stimuli (Extended Data Fig. 7a), we measured the Spearman correlation of the recognizability (proportion correct) of each stimulus across splits of participants separately for each of the four conditions. We averaged this correlation over 1,000 random splits of participants. P values were computed non-parametrically by shuffling the participant responses for each condition and each random split and computing the number of times the true average Spearman ρ was lower than the shuffled correlation value. We only included images in the analysis that had at least four trials in each split of participants, and when there were more than four trials in a split, we only included four of the trials, randomly selected, in the average to avoid having some images exert more influence on the result than others.
Most and least recognizable images
To analyze the consistency of the most and least recognizable metamers in each condition (Extended Data Fig. 7b), we used one split of participants to select 50 images that had the highest recognition score and 50 images with the lowest recognition score. We then measured the recognizability of these images in the second split of participants and assessed whether the ‘most’ recognizable images had a higher recognition score than the ‘least’ recognizable images. P values for this comparison were computed by using 1,000 splits of participants and measuring the proportion of splits in which the difference between the two scores was greater than 0.
To select examples of the most and least recognizable images (Extended Data Fig. 7c,d), we only included example stimuli with at least eight responses for both the natural image condition and the model metamer stage under consideration and that had 100% correct responses on the natural image condition. From this set, we selected the ‘most’ recognizable images (as those with scores of 100% correct for the considered condition) and the ‘least’ recognizable images (as those with scores of 0% correct).
Model–brain comparison metrics for visual models
We used the Brain-Score31 platform to obtain metrics of neural similarity in four visual cortical areas of the macaque monkey brain: V1, V2, V4 and IT. For each model considered, we analyzed only the stages that were included in our human metamer recognition experiments. We note that some models may have had higher brain similarity scores had we analyzed all stages. Each of these model stages was fit to a public data split for each visual region, with the best-fitting stage for that region selected for further evaluation. The match of this model stage to brain data was then evaluated on a separate set of evaluation data for that region. Evaluation data for V1 consisted of the average of 23 benchmarks: 22 distribution-based comparison benchmarks from Marques et al.89 and the V1 partial least squares (PLS) regression benchmark from Freeman et al.90. Evaluation data for V2 consisted of the V2 PLS benchmark from Freeman et al.90. Evaluation data for V4 consisted of the average of four benchmarks: the PLS V4 benchmark from Majaj et al.91, the PLS V4 benchmark from Sanghavi and DiCarlo92, the PLS V4 benchmark from Sanghavi et al.93 and the PLS V4 benchmark from Sanghavi et al.94. Evaluation data for IT consisted of the average of four benchmarks: the PLS IT benchmark from Majaj et al.91, the PLS IT benchmark from Sanghavi and DiCarlo92, the PLS IT benchmark from Sanghavi et al.93 and the PLS IT benchmark from Sanghavi et al.94. When comparing metamer recognizability to the Brain-Score results, we used the human recognition of metamers from the model stage selected as the best match for each visual region.
We used Spearman correlations to compare metamer recognizability to the Brain-Score results. The analogous Pearson correlations were lower, and none reached statistical significance. We report Spearman correlations on the grounds that the recognizability was bounded by 0 and 1 and to be conservative with respect to our conclusion that metamer recognizability is not explained by standard model–brain comparison metrics.
We estimated the noise ceiling of the correlation between Brain-Score results and human recognizability of model metamers as the geometric mean of the reliabilities of each quantity. To estimate the reliability of the metamer recognizability, we split the participants for an experiment in half and measured the recognizability of metamers for each model stage used to obtain the Brain-Score results (that is, the best-predicting stage for each model for the brain region under consideration). We then calculated the Spearman correlation between the recognizability for the two splits and Spearman–Brown corrected to account for the 50% reduction in sample size from the split. This procedure was repeated for 1,000 random splits of participants. We then took the mean across the 1,000 splits as an estimate of the reliability. This estimated reliability was 0.917 for V1, 0.956 for V2, 0.924 for V4 and 0.97 for IT. As we did not have access to splits of the neural data used for Brain-Score, we estimated the reliability of the Brain-Score results as the Pearson correlation of the score reported in Kubilius et al.81 for two sets of neural responses to the same images (Spearman-Brown corrected). This estimated reliability was only available for IT (r = 0.87), but we assume that the reliability would be comparable for other visual areas.
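A sketch of the split-half reliability estimate (with Spearman–Brown correction) and the resulting noise ceiling is shown below; the array layout and function names are ours.

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_brown(r):
    """Correct a split-half correlation for the halved sample size."""
    return 2 * r / (1 + r)

def split_half_reliability(scores, n_splits=1000, seed=0):
    """Mean Spearman-Brown-corrected split-half Spearman correlation.

    scores: array [n_participants, n_conditions] of metamer recognizability,
        one column per model/stage condition entering the comparison.
    """
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    rs = []
    for _ in range(n_splits):
        order = rng.permutation(n)
        half_a = scores[order[: n // 2]].mean(axis=0)
        half_b = scores[order[n // 2:]].mean(axis=0)
        rs.append(spearman_brown(spearmanr(half_a, half_b).correlation))
    return float(np.mean(rs))

# Noise ceiling for the correlation between two quantities = geometric mean of
# their reliabilities, e.g.:
# ceiling = np.sqrt(reliability_recognizability * reliability_brain_score)
```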
Model–brain comparison metrics for auditory models
The auditory fMRI analysis closely followed that of a previous publication4, using the fMRI dataset collected in another previous publication60. The essential components of the dataset and analysis methods are replicated here, but for additional details, see refs. 4,60. The text of the sections on fMRI data acquisition and preprocessing is an edited version of similar text from a previous publication4.
Natural sound stimuli
The stimulus set was composed of 165 2-s natural sounds spanning 11 categories (instrumental music, music with vocals, English speech, foreign speech, non-speech vocal sounds, animal vocalization, human non-vocal sound, animal non-vocal sound, nature sound, mechanical sound or environment sound). The sounds were presented in a block design with five presentations of each 2-s sound. A single fMRI volume was collected following each sound presentation (‘sparse scanning’), resulting in a 17-s block. Silence blocks of the same duration as the stimulus blocks were used to estimate the baseline response. Participants performed a sound intensity discrimination task to increase attention. One sound in the block of five was presented 7 dB lower than the other four (the quieter sound was never the first sound), and participants were instructed to press a button when they heard the quieter sound. Sounds were presented with magnetic resonance-compatible earphones (Sensimetrics S14) at 75 dB sound pressure level (SPL) for the louder sounds and 68 dB SPL for the quieter sounds. Blocks were grouped into 11 runs, each containing 15 stimulus blocks and 4 silence blocks.
fMRI data acquisition and preprocessing
Data were acquired in a previous study60. These magnetic resonance data were collected on a 3T Siemens Trio scanner with a 32-channel head coil at the Athinoula A. Martinos Imaging Center of the McGovern Institute for Brain Research at MIT. Repetition time was 3.4 s (acquisition time was only 1 s due to sparse scanning), echo time was 30 ms, and flip angle was 90°. For each run, the five initial volumes were discarded to allow homogenization of the magnetic field. In-plane resolution was 2.1 × 2.1 mm (96 × 96 matrix), and slice thickness was 4 mm with a 10% gap, yielding a voxel size of 2.1 × 2.1 × 4.4 mm. iPAT was used to minimize acquisition time. T1-weighted anatomical images were collected in each participant (1 mm isotropic voxels) for alignment and surface reconstruction. Each functional volume consisted of 15 slices oriented parallel to the superior temporal plane, covering the portion of the temporal lobe superior to and including the superior temporal sulcus.
Functional volumes were preprocessed using FMRIB Software Library and in-house MATLAB scripts. Volumes were corrected for motion and slice time and were skull stripped. Voxel time courses were linearly detrended. Each run was aligned to the anatomical volume using FLIRT and BBRegister. These preprocessed functional volumes were then resampled to vertices on the reconstructed cortical surface computed via FreeSurfer and were smoothed on the surface with a 3-mm full-width at half-maximum two-dimensional Gaussian kernel to improve SNR. All analyses were done in this surface space, but for ease of discussion, we refer to vertices as ‘voxels’ in this paper. For each of the three scan sessions, we estimated the mean response of each voxel (in the surface space) to each stimulus block by averaging the response of the second through the fifth acquisitions after the onset of each block (the first acquisition was excluded to account for the hemodynamic lag). Pilot analyses showed similar response estimates from a more traditional general linear model60. These signal-averaged responses were converted to percent signal change by subtracting and dividing by each voxel’s response to the blocks of silence. These percent signal change values were then downsampled from the surface space to a 2-mm isotropic grid on the FreeSurfer-flattened cortical sheet. Analysis was performed within localized voxels in each participant.
fMRI data
We used the voxel responses from the original Norman-Haignere et al. study60, which measured fMRI responses to each natural sound relative to a silent baseline (as described in the previous section) and selected voxels with a consistent response to sounds from a large anatomical constraint region encompassing the superior temporal and posterior parietal cortex. As in Kell et al.4, within this set of voxels, we localized four ROIs in each participant, consisting of voxels selective for (1) frequency (that is, tonotopy), (2) pitch, (3) speech and (4) music, according to a ‘localizer’ statistical test. We excluded voxels that were selected by more than one localizer. The frequency-selective, pitch and speech localizers used additional fMRI data collected in separate scans. In total, there were 379 voxels in the frequency-selective ROI, 379 voxels in the pitch ROI, 393 voxels in the music ROI and 379 voxels in the speech ROI. The voxel responses and ROI assignments are available at https://github.com/jenellefeather/model_metamers_pytorch.
Frequency-selective voxels were identified from responses to pure tones in six different frequency ranges (center frequencies: 200, 400, 800, 1,600, 3,200 and 6,400 Hz)95,96 as the top 5% of all selected voxels in each participant ranked by P values of an ANOVA across frequency. In practice, most selected voxels centered around Heschl’s gyrus. Pitch-selective voxels were identified from responses to harmonic tones and spectrally matched noise96 as the top 5% of voxels in each participant with the lowest P values from a one-tailed t-test comparing those conditions. Speech-selective voxels were identified from responses to German speech and to temporally scrambled (‘quilted’) speech stimuli generated from the same German source recordings97. The ROI consisted of the top 5% of voxels in each participant with the lowest P values from a one-tailed t-test comparing intact and quilted speech. Music-selective voxels were identified with the music component derived by Norman-Haignere et al.60 as the top 5% of voxels with the most significant component weights.
Voxel-wise encoding analysis
We used the model responses to predict the fMRI responses. Each of the 165 sounds from the fMRI experiment was resampled to 20,000 Hz and passed through each model. To compare the model responses to the fMRI responses, we averaged over the time dimension for all units that had a temporal dimension (all model stages except fully connected layers). Each voxel's time-averaged responses were modeled as a linear combination of these responses. Ten random train–test splits (83 training and 82 test sounds) were taken from the stimulus set. For each split, we estimated a linear mapping using L2-regularized (‘ridge’) linear regression using RidgeCV from the scikit-learn library, version 0.23.1 (ref. 98). The mean response of each feature across sounds was subtracted from the regressor matrix before fitting.
The best ridge regression parameter for each voxel was independently selected using leave-one-out cross-validation across the 83 training sounds in each split, sweeping over 81 logarithmically spaced values (each power of 10 between 10^−40 and 10^40). Holding out one sound in the training set at a time, the mean squared error of the prediction for that sound was computed using regression weights from the other 82 training set sounds for each of the regularization parameter values. The parameter value that minimized the error averaged across the held-out training sounds was used to fit a linear mapping between model responses to all 83 training set sounds and the voxel responses. This mapping was used to predict the voxel response to the 82 test sounds. Fitting fidelity was evaluated with the squared Pearson correlation (r²). Explained variance was computed for voxel responses averaged across the three scans in the original study.
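A simplified sketch of the voxel-wise encoding fit is shown below; for clarity it loops over voxels and relies on RidgeCV's built-in efficient leave-one-out cross-validation, so the implementation details may differ from those of the released code.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_voxel_encoding(features_train, voxels_train, features_test):
    """Sketch: L2-regularized ('ridge') mapping from time-averaged model features
    to each voxel's response, with the ridge parameter chosen per voxel by
    leave-one-out cross-validation over 81 log-spaced values.

    features_*: arrays [n_sounds, n_features]; voxels_train: [n_sounds, n_voxels].
    """
    mu = features_train.mean(axis=0)           # mean-center features across sounds
    X_train, X_test = features_train - mu, features_test - mu

    alphas = np.logspace(-40, 40, 81)          # each power of 10 from 1e-40 to 1e40
    preds = np.empty((features_test.shape[0], voxels_train.shape[1]))
    for v in range(voxels_train.shape[1]):
        # cv=None (the default) uses efficient leave-one-out cross-validation.
        model = RidgeCV(alphas=alphas).fit(X_train, voxels_train[:, v])
        preds[:, v] = model.predict(X_test)
    return preds
```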
This explained variance was corrected for the effects of measurement noise using the reliability of the voxel responses and the reliability of the predicted voxel responses99. Voxel response reliability (r̂v) was computed as the median Spearman–Brown-corrected Pearson correlation between all three pairs of scans, where the Spearman–Brown correction accounts for increased reliability expected from tripling the amount of data100. Voxel response prediction reliability (r̂p) was similarly computed by using the training data for each of the three scans to predict the test data from the same scan and calculating the median Spearman–Brown-corrected correlation between the three pairs of predicted voxel responses. The corrected explained variance is

$$r^{2}_{\mathrm{corrected}} = \frac{r^{2}}{\hat{r}_{v}\,\hat{r}_{p}},$$

where r is the Pearson correlation between the predicted and measured voxel responses to the test data when using the averaged voxel responses across the three scans for fitting and evaluation. If voxels and/or predictions are very unreliable, this can lead to large corrected variance explained measures101. We set a minimum value of 0.182 for r̂v (the value at which the correlation of two 83-dimensional random variables reaches significance at a threshold of P < 0.05, 83 being the number of training data values) and a minimum value of 0.183 for r̂p (the analogous value for 82-dimensional random variables, matching the number of test data values).
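In code, the correction as reconstructed above amounts to the following; the reliability floors are the values given in the text, and the function name is ours.

```python
def corrected_r2(r, r_voxel_reliability, r_prediction_reliability,
                 min_voxel=0.182, min_prediction=0.183):
    """Noise-corrected explained variance: r^2 divided by the product of the
    voxel-response and predicted-response reliabilities, with each reliability
    floored to avoid inflating the estimate for unreliable voxels."""
    rv = max(r_voxel_reliability, min_voxel)
    rp = max(r_prediction_reliability, min_prediction)
    return r ** 2 / (rv * rp)
```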
The corrected variance explained was computed for each voxel using each model stage for each of ten train–test splits of data. We took the median variance explained across the ten splits of data. We computed a summary metric of variance explained across each of the ROIs (Fig. 7d; all auditory voxels, tonotopic voxels, pitch voxels, music voxels and speech voxels) as follows. First, a summary measure for each participant and model stage was computed by taking the median across all voxels of the voxel-wise corrected variance explained values within the ROI. Holding out one participant, we then averaged across the remaining participant values to find the stage with the highest variance explained within the given ROI. We measured the corrected variance explained for this stage in the held-out participant. This cross-validation avoids issues of non-independence when selecting the best stage. This procedure was repeated for each participant, and we report the mean corrected variance explained across the participants. Metamer recognition was measured from the model stage most frequently chosen as the best-predicting model stage across participants (in practice, nearly all participants had the same ‘best’ model stage).
Noise ceiling estimates for correlation between metamer recognizability and fMRI metrics
We estimated the noise ceiling of the correlation between auditory fMRI predictivity and human recognizability of model metamers as the geometric mean of the reliabilities of each quantity. To estimate the reliability of the metamer recognizability, we split the participants for an experiment in half and measured the recognizability of metamers for the model stage that was most frequently chosen (across all participants) as the best-predicting stage for the ROI under consideration (that is, the stages used for Fig. 7d). We then calculated the Spearman correlation between the recognizability for the two splits and Spearman–Brown corrected to account for the 50% reduction in sample size from the split. This procedure was repeated for 1,000 random splits of participants. We then took the mean across the 1,000 splits as an estimate of the reliability. This estimated reliability of the metamer recognizability was 0.811, 0.829, 0.819, 0.818 and 0.801 for the best-predicting stage of all auditory voxels, the tonotopic ROI, the pitch ROI, the music ROI and the speech ROI, respectively. To estimate the reliability of the fMRI prediction metric, we took two splits of the fMRI participants and calculated the mean variance explained for each model using the stage for which recognizability was measured. We then computed the Spearman correlation between the explained variance for the two splits and Spearman–Brown corrected the result. We then repeated this procedure for 1,000 random splits of the participants in the fMRI study and took the mean across the 1,000 splits as the estimated reliability. This reliability of fMRI predictions was 0.923 for all auditory voxels, 0.768 for the tonotopic ROI, 0.922 for the pitch ROI, 0.796 for the music ROI and 0.756 for the speech ROI.
Representational similarity analysis
To construct the model representational dissimilarity matrix (RDM) for a model stage, we computed the dissimilarity (1 – Pearson correlation coefficient) between the model activations evoked by each pair of the 165 sounds for which we had fMRI responses. Similarly, to construct the fMRI RDM, we computed the dissimilarity in voxel responses (1 – Pearson correlation coefficient) between all ROI voxel responses from a participant to each pair of sounds. Before computing the RDMs from the fMRI or model responses, we z scored the voxel or unit responses.
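A sketch of the RDM construction and comparison is shown below; the z-scoring axis (each unit or voxel z-scored across stimuli) and the use of the RDM upper triangle for the Spearman comparison are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import spearmanr

def rdm(responses):
    """Representational dissimilarity matrix: 1 - Pearson correlation between the
    response patterns evoked by each pair of stimuli.

    responses: array [n_stimuli, n_units_or_voxels].
    """
    z = (responses - responses.mean(axis=0)) / responses.std(axis=0)
    return 1.0 - np.corrcoef(z)   # rows (stimuli) are treated as the variables

def rdm_similarity(model_rdm, fmri_rdm):
    """Spearman correlation between the upper triangles of two RDMs."""
    idx = np.triu_indices(model_rdm.shape[0], k=1)
    return spearmanr(model_rdm[idx], fmri_rdm[idx]).correlation
```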
To compute RDM similarity for the model stage that best matched an ROI (Extended Data Fig. 10), we first generated 10 random train–test splits of the 165 sound stimuli into 83 training sounds and 82 test sounds. For each split, we computed the RDMs for each model stage and for each participant’s fMRI data for the 83 training sounds. We then chose the model stage that yielded the highest Spearman ρ between the model stage RDM and the participant’s fMRI RDM. Using this model stage, we measured model and fMRI RDMs from the test sounds and computed the Spearman ρ. We repeated this procedure for each of the ten train–test splits and took the median Spearman ρ. We then computed the mean of this median Spearman ρ across participants for each model. When comparing RDM similarity to metamer recognizability, we measured recognizability from the model stage that was most frequently chosen as the best-matching model stage across participants.
As an estimate of the upper bound for the RDM correlation that could be reasonably expected to be achieved between a model RDM and a single participant’s fMRI RDM given fMRI measurement noise, we calculated the correlation between one participant’s RDM and the average of all the other participants’ RDMs. The RDMs were measured from the same ten train–test splits described in the previous paragraph using the 82 test sounds for each split. We took the median Spearman ρ (between RDMs) across the ten splits of data to yield a single value for each participant. The upper bound shown in Extended Data Fig. 10 is the mean across the measured value for each held-out participant. We used this upper bound rather than noise correcting the human–model RDM correlation to be consistent with prior modeling papers102.
Model recognition of metamers generated from other models
To measure the recognition of a model's metamers by other models, we took the generated image or audio that was used for the human behavioral experiments, provided it as input to a ‘recognition’ model and measured the 16-way image classification (for the visual models) or the 793-way word classification (for the auditory models).
The plots in Fig. 8b show the average recognition by other models of metamers generated from a particular type of ResNet50 model. This curve plots recognition performance averaged across all other vision recognition models (as shown in Fig. 8a). The curve for self-supervised models is also averaged across the three self-supervised generation models (SimCLR, MoCo_V2 and BYOL), and the curve for adversarially trained models is also averaged across the three adversarially trained ResNet50 models (trained with L2-norm (ε = 3), L∞-norm (ε = 4/255) and L∞-norm (ε = 8/255) perturbations, respectively). For these latter two curves, we first computed the average curve for each recognition model across all three generation models, omitting the recognition model from the average if it was the same as the generation model (in practice, this meant that there was one less value included in the average for the recognition models that are part of the generation model group). We then averaged across the curves for each recognition model. The error bars on the resulting curves are the s.e.m. computed across the recognition models.
The graphs in Fig. 8c,d were generated in an analogous fashion. We used one ‘standard’ generation model (the standard supervised AlexNet and CochResNet50, respectively). The curves in Fig. 8c plot results for LowPassAlexNet and VOneAlexNet. In Fig. 8d, the curve for the waveform adversarially trained models was averaged across the three such CochResNet50 models (trained with L2-norm (ε = 0.5), L2-norm (ε = 1) and L∞-norm (ε = 0.002) perturbations, respectively). The curve for the cochleagram adversarially trained models was averaged across the two such CochResNet50 models (trained with L2-norm (ε = 0.5) and L2-norm (ε = 1) perturbations, respectively). The group averages and error bars were computed as in Fig. 8b.
We used permutation tests to evaluate differences between the recognizability of metamers from different types of generation models and measured the statistical significance of a main effect of generation model group. We compared the observed difference between the recognizability of metamers from two generation model groups (averaged across recognition models and model stages) to a null distribution obtained from 10,000 random permutations of the generation model labels (independently permuted for each recognition model). When there was a single generation model in the group (that is, for the standard-trained model), responses were not defined for the recognition model when it was the same as the generation model. In this case, we permuted the recognition model responses as if the value existed but treated the value as missing during the average across recognition models.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41593-023-01442-0.
Acknowledgements
We thank R. Gonzalez for help constructing the Word–Speaker–Noise dataset used for training. We also thank R. Gonzalez and A. Durango for help running in-lab experiments, J. Dapello for guidance on the VOneNet models and M. Schrimpf for help with Brain-Score evaluations. We thank A. Francl and M. Saddler for advice on model training and evaluation, A. Kell and S. Norman-Haignere for help with fMRI data analysis and M. McPherson for help with Amazon Turk experiment design and statistics decisions. This work was supported by National Science Foundation grant number BCS-1634050 to J.H.M., National Institutes of Health grant number R01DC017970 to J.H.M., a Department of Energy Computational Science Graduate Fellowship under grant number DE-FG02-97ER25308 to J.F. and a Friends of the McGovern Institute Fellowship to J.F.
Author contributions
J.F. and J.H.M. conceived the project and designed experiments. J.F. conducted all analyses, ran behavioral experiments and made the figures. G.L. and A.M. assisted with code and experiment design for adversarial models and evaluation. J.F. and J.H.M. drafted the manuscript. All authors edited the manuscript.
Peer review
Peer review information
Nature Neuroscience thanks Justin Gardner, Tim Kietzmann, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Data availability
Human data, trained model checkpoints and an interface to view/listen to the generated metamers used in the human recognition experiments are available at https://github.com/jenellefeather/model_metamers_pytorch. The Word–Speaker–Noise training dataset is available from the authors upon request.
Code availability
Code for generating metamers, training models and running online experiments is available at https://github.com/jenellefeather/model_metamers_pytorch (10.5281/zenodo.8373260). Auditory front-end (cochleagram generation) code is available at https://github.com/jenellefeather/chcochleagram.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Jenelle Feather, Email: jfeather@mit.edu.
Josh H. McDermott, Email: jhm@mit.edu
Extended data
is available for this paper at 10.1038/s41593-023-01442-0.
Supplementary information
The online version contains supplementary material available at 10.1038/s41593-023-01442-0.
References
- 1.Felleman DJ, Van Essen DC. Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex. 1991;1:1–47. doi: 10.1093/cercor/1.1.1. [DOI] [PubMed] [Google Scholar]
- 2.Fukushima K. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 1980;36:193–202. doi: 10.1007/BF00344251. [DOI] [PubMed] [Google Scholar]
- 3.Serre T, Oliva A, Poggio T. A feedforward architecture accounts for rapid categorization. Proc. Natl Acad. Sci. USA. 2007;104:6424–6429. doi: 10.1073/pnas.0700622104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Kell AJE, Yamins DLK, Shook EN, Norman-Haignere SV, McDermott JH. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron. 2018;98:630–644. doi: 10.1016/j.neuron.2018.03.044. [DOI] [PubMed] [Google Scholar]
- 5.Kriegeskorte N. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annu. Rev. Vis. Sci. 2015;1:417–446. doi: 10.1146/annurev-vision-082114-035447. [DOI] [PubMed] [Google Scholar]
- 6.Tacchetti A, Isik L, Poggio TA. Invariant recognition shapes neural representations of visual input. Annu. Rev. Vis. Sci. 2018;4:403–422. doi: 10.1146/annurev-vision-091517-034103. [DOI] [PubMed] [Google Scholar]
- 7.Goodfellow, I., Lee, H., Le, Q., Saxe, A. & Ng, A. Measuring invariances in deep networks. In Advances in Neural Information Processing Systems 22 (eds Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. & Culotta, A.) 646–654 (Curran Associates, Inc., 2009).
- 8.Riesenhuber M, Poggio T. Hierarchical models of object recognition in cortex. Nat. Neurosci. 1999;2:1019–1025. doi: 10.1038/14819. [DOI] [PubMed] [Google Scholar]
- 9.Rust NC, Dicarlo JJ. Selectivity and tolerance (“invariance”) both increase as visual information propagates from cortical area V4 to IT. J. Neurosci. 2010;30:12978–12995. doi: 10.1523/JNEUROSCI.0179-10.2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Geirhos, R., Temme, C. R. M. & Rauber, J. Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems 31 (eds Bengio, S. et al.) 7538–7550 (Curran Associates, Inc., 2018).
- 11.Jang H, McCormack D, Tong F. Noise-trained deep neural networks effectively predict human vision and its neural responses to challenging images. PLoS Biol. 2021;19:e3001418. doi: 10.1371/journal.pbio.3001418. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Zhang, R. Making convolutional networks shift-invariant again. In Proc. 36th International Conference on Machine Learning (eds Chaudhuri K., and Salakhutdinov, R.) 7324-7334 (PMLR, 2019).
- 13.Azulay, A. & Weiss, Y. Why do deep convolutional networks generalize so poorly to small image transformations? J. Mach. Learn. Res. 20, 1−25 (2019).
- 14.Nguyen, A., Yosinski, J. & Clune, J. Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 427–436 (IEEE, 2015).
- 15.Szegedy, C. et al. Intriguing properties of neural networks. In Proc. 2nd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (2014).
- 16.Wandell, B. A. Foundations of Vision (Sinauer Associates, 1995).
- 17.Wyszecki, G. & Stiles, W. S. Color Science 2nd edn (Wiley, 1982).
- 18.Julesz B. Visual pattern discrimination. IEEE Trans. Inf. Theory. 1962;8:84–92. doi: 10.1109/TIT.1962.1057698. [DOI] [Google Scholar]
- 19.McDermott JH, Schemitsch M, Simoncelli EP. Summary statistics in auditory perception. Nat. Neurosci. 2013;16:493–498. doi: 10.1038/nn.3347. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Ziemba CM, Simoncelli EP. Opposing effects of selectivity and invariance in peripheral vision. Nat. Commun. 2021;12:4597. doi: 10.1038/s41467-021-24880-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Hillis JM, Ernst MO, Banks MS, Landy MS. Combining sensory information: mandatory fusion within, but not between, senses. Science. 2002;298:1627–1630. doi: 10.1126/science.1075396. [DOI] [PubMed] [Google Scholar]
- 22.Sohn, H. & Jazayeri, M. Validating model-based Bayesian integration using prior-cost metamers. Proc. Natl Acad. Sci. USA118, e2021531118 (2021). [DOI] [PMC free article] [PubMed]
- 23.Balas B, Nakano L, Rosenholtz R. A summary-statistic representation in peripheral vision explains visual crowding. J. Vis. 2009;9:13.1–13.18. doi: 10.1167/9.12.13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Freeman J, Simoncelli EP. Metamers of the ventral stream. Nat. Neurosci. 2011;14:1195–1201. doi: 10.1038/nn.2889. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Feather, J., Durango, A., Gonzalez, R. & McDermott, J. Metamers of neural networks reveal divergence from human perceptual systems. In Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 10078–10089 (Curran Associates, Inc. 2019).
- 26.Schrimpf, M. et al. Brain-Score: which artificial neural network for object recognition is most brain-like? Preprint at bioRxiv10.1101/407007 (2018).
- 27.Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (eds Bengio Y. & LeCun Y.) (2015)
- 28. He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep residual networks. In Computer Vision – ECCV 2016 (eds Leibe, B., Matas, J., Sebe, N. & Welling, M.) 630–645 (Springer, 2016).
- 29. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (eds Pereira, F., Burges, C. J., Bottou, L. & Weinberger, K. Q.) 1097–1105 (Curran Associates, Inc., 2012).
- 30. Deng, J. et al. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (IEEE, 2009).
- 31. Schrimpf, M. et al. Integrative benchmarking to advance neurally mechanistic models of human intelligence. Neuron 108, 413–423 (2020).
- 32. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).
- 33. Yalniz, I. Z., Jégou, H., Chen, K., Paluri, M. & Mahajan, D. Billion-scale semi-supervised learning for image classification. Preprint at arXiv https://doi.org/10.48550/arXiv.1905.00546 (2019).
- 34. Steiner, A. P. et al. How to train your ViT? Data, augmentation, and regularization in vision transformers. Transactions on Machine Learning Research https://openreview.net/forum?id=4nPswr1KcP (2022).
- 35. Glasberg, B. R. & Moore, B. C. J. Derivation of auditory filter shapes from notched-noise data. Hear. Res. 47, 103–138 (1990).
- 36. McDermott, J. H. & Simoncelli, E. P. Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron 71, 926–940 (2011).
- 37. Lindsay, G. W. Convolutional neural networks as a model of the visual system: past, present, and future. J. Cogn. Neurosci. 33, 2017–2031 (2020).
- 38. Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In Proc. 37th International Conference on Machine Learning (eds Daumé III, H. & Singh, A.) 1597–1607 (PMLR, 2020).
- 39. Chen, X., Fan, H., Girshick, R. & He, K. Improved baselines with momentum contrastive learning. Preprint at arXiv https://doi.org/10.48550/arXiv.2003.04297 (2020).
- 40. Grill, J.-B. et al. Bootstrap your own latent: a new approach to self-supervised learning. In Advances in Neural Information Processing Systems 33 (eds Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F. & Lin, H.) 21271–21284 (Curran Associates, Inc., 2020).
- 41. Konkle, T. & Alvarez, G. A. A self-supervised domain-general learning framework for human ventral stream representation. Nat. Commun. 13, 491 (2022).
- 42. Chi, T., Ru, P. & Shamma, S. A. Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am. 118, 887–906 (2005).
- 43. Geirhos, R. et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Proc. 7th International Conference on Learning Representations (eds Sainath, T., Rush, A., Levine, S., Livescu, K. & Mohamed, S.) (2019).
- 44. Hermann, K., Chen, T. & Kornblith, S. The origins and prevalence of texture bias in convolutional neural networks. In Advances in Neural Information Processing Systems 33 (eds Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F. & Lin, H.) 19000–19015 (Curran Associates, Inc., 2020).
- 45. Singer, J. J. D., Seeliger, K., Kietzmann, T. C. & Hebart, M. N. From photos to sketches—how humans and deep neural networks process objects across different levels of visual abstraction. J. Vis. 22, 4 (2022).
- 46. Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. Towards deep learning models resistant to adversarial attacks. In Proc. 6th International Conference on Learning Representations (eds Bengio, Y., LeCun, Y., Sainath, T., Murray, I., Ranzato, M. & Vinyals, O.) (2018).
- 47. Ilyas, A. et al. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 125–136 (Curran Associates, Inc., 2019).
- 48. Engstrom, L. et al. Adversarial robustness as a prior for learned representations. Preprint at arXiv https://doi.org/10.48550/arXiv.1906.00945 (2019).
- 49. Goodfellow, I., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. In Proc. 3rd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (2015).
- 50. Kong, N. C. L., Margalit, E., Gardner, J. L. & Norcia, A. M. Increasing neural network robustness improves match to macaque V1 eigenspectrum, spatial frequency preference and predictivity. PLoS Comput. Biol. 18, e1009739 (2022).
- 51. Mahendran, A. & Vedaldi, A. Understanding deep image representations by inverting them. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5188–5196 (IEEE, 2015).
- 52. Croce, F. et al. RobustBench: a standardized adversarial robustness benchmark. In Proc. of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (eds Vanschoren, J. & Yeung, S.) (Curran, 2021).
- 53. Hénaff, O. J. & Simoncelli, E. P. Geodesics of learned representations. In Proc. 4th International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (2016).
- 54. Dapello, J. et al. Neural population geometry reveals the role of stochasticity in robust perception. In Advances in Neural Information Processing Systems 34 (eds Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P. S. & Wortman Vaughan, J.) 15595–15607 (Curran Associates, Inc., 2021).
- 55. Dapello, J. et al. Simulating a primary visual cortex at the front of CNNs improves robustness to image perturbations. In Advances in Neural Information Processing Systems 33 (eds Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F. & Lin, H.) 13073–13087 (Curran Associates, Inc., 2020).
- 56. Sabour, S., Cao, Y., Faghri, F. & Fleet, D. J. Adversarial manipulation of deep representations. In Proc. 4th International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (2016).
- 57. Hendrycks, D. & Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In Proc. 7th International Conference on Learning Representations (eds Sainath, T., Rush, A., Levine, S., Livescu, K. & Mohamed, S.) (2019).
- 58. Dodge, S. & Karam, L. A study and comparison of human and deep learning recognition performance under visual distortions. In Proc. 26th International Conference on Computer Communication and Networks (ICCCN), 1–7 (IEEE, 2017).
- 59. Geirhos, R. et al. Partial success in closing the gap between human and machine vision. In Advances in Neural Information Processing Systems 34 (eds Ranzato, M. et al.) 23885–23899 (Curran Associates, Inc., 2021).
- 60. Norman-Haignere, S., Kanwisher, N. G. & McDermott, J. H. Distinct cortical pathways for music and speech revealed by hypothesis-free voxel decomposition. Neuron 88, 1281–1296 (2015).
- 61. Tuckute, G., Feather, J., Boebinger, D. & McDermott, J. H. Many but not all deep neural network audio models capture brain responses and exhibit hierarchical region correspondence. Preprint at bioRxiv https://doi.org/10.1101/2022.09.06.506680 (2022).
- 62. Mehrer, J., Spoerer, C. J., Kriegeskorte, N. & Kietzmann, T. C. Individual differences among deep neural network models. Nat. Commun. 11, 5725 (2020).
- 63. Olah, C., Mordvintsev, A. & Schubert, L. Feature visualization. Distill https://distill.pub/2017/feature-visualization/ (2017).
- 64. Yosinski, J., Clune, J., Nguyen, A., Fuchs, T. & Lipson, H. Understanding neural networks through deep visualization. Preprint at arXiv https://doi.org/10.48550/arXiv.1506.06579 (2015).
- 65. Shafahi, A. et al. Poison frogs! Targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems 31 (eds Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K. & Cesa-Bianchi, N.) 6106–6116 (Curran Associates, Inc., 2018).
- 66. Jacobsen, J.-H., Behrmann, J., Zemel, R. & Bethge, M. Excessive invariance causes adversarial vulnerability. In Proc. 7th International Conference on Learning Representations (eds Sainath, T., Rush, A., Levine, S., Livescu, K. & Mohamed, S.) (2019).
- 67. Jacobsen, J.-H., Behrmann, J., Carlini, N., Tramèr, F. & Papernot, N. Exploiting excessive invariance caused by norm-bounded adversarial robustness. Preprint at arXiv https://doi.org/10.48550/arXiv.1903.10484 (2019).
- 68. Zhuang, C. et al. Unsupervised neural network models of the ventral visual stream. Proc. Natl Acad. Sci. USA 118, e2014196118 (2021).
- 69. Geirhos, R. et al. On the surprising similarities between supervised and self-supervised models. In SVRHM 2020 Workshop @ NeurIPS (2020).
- 70. McWalter, R. & McDermott, J. H. Adaptive and selective time averaging of auditory scenes. Curr. Biol. 28, 1405–1418 (2018).
- 71. Lesica, N. A. et al. Harnessing the power of artificial intelligence to transform hearing healthcare and research. Nat. Mach. Intell. 3, 840–849 (2021).
- 72. Saddler, M. R., Francl, A., Feather, J. & McDermott, J. H. Speech denoising with auditory models. In Proc. Interspeech 2021 (eds Heřmanský, H. et al.) 2681–2685 (2021).
- 73. Hong, H., Yamins, D. L. K., Majaj, N. J. & DiCarlo, J. J. Explicit information for category-orthogonal object properties increases along the ventral stream. Nat. Neurosci. 19, 613–622 (2016).
- 74. Thorat, S., Aldegheri, G. & Kietzmann, T. C. Category-orthogonal object features guide information processing in recurrent neural networks trained for object categorization. In SVRHM 2021 Workshop @ NeurIPS (2021).
- 75. Golan, T., Raju, P. C. & Kriegeskorte, N. Controversial stimuli: pitting neural networks against each other as models of human cognition. Proc. Natl Acad. Sci. USA 117, 29330–29337 (2020).
- 76. Fetaya, E., Jacobsen, J.-H., Grathwohl, W. & Zemel, R. Understanding the limitations of conditional generative models. In Proc. 8th International Conference on Learning Representations (eds Rush, A., Mohamed, S., Song, D., Cho, K. & White, M.) (2020).
- 77. Yang, X., Su, Q. & Ji, S. Towards bridging the performance gaps of joint energy-based models. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15732–15741 (IEEE, 2023).
- 78. Rajalingham, R., Schmidt, K. & DiCarlo, J. J. Comparison of object recognition behavior in human and monkey. J. Neurosci. 35, 12127–12136 (2015).
- 79. Francl, A. & McDermott, J. H. Deep neural network models of sound localization reveal how perception is adapted to real-world environments. Nat. Hum. Behav. 6, 111–133 (2022).
- 80. Saddler, M. R., Gonzalez, R. & McDermott, J. H. Deep neural network models reveal interplay of peripheral coding and stimulus statistics in pitch perception. Nat. Commun. 12, 7278 (2021).
- 81. Kubilius, J. et al. Brain-like object recognition with high-performing shallow recurrent ANNs. In Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 12805–12816 (Curran Associates, Inc., 2019).
- 82. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 8024–8035 (Curran Associates, Inc., 2019).
- 83. Woods, K. J. P., Siegel, M. H., Traer, J. & McDermott, J. H. Headphone screening to facilitate web-based auditory experiments. Atten. Percept. Psychophys. 79, 2064–2072 (2017).
- 84. Woods, K. J. P. & McDermott, J. H. Schema learning for the cocktail party problem. Proc. Natl Acad. Sci. USA 115, E3313–E3322 (2018).
- 85. McPherson, M. J. & McDermott, J. H. Time-dependent discrimination advantages for harmonic sounds suggest efficient coding for memory. Proc. Natl Acad. Sci. USA 117, 32169–32180 (2020).
- 86. Traer, J., Norman-Haignere, S. V. & McDermott, J. H. Causal inference in environmental sound recognition. Cognition 214, 104627 (2021).
- 87. McPherson, M. J., Grace, R. C. & McDermott, J. H. Harmonicity aids hearing in noise. Atten. Percept. Psychophys. 84, 1016–1042 (2022).
- 88. Faul, F., Erdfelder, E., Lang, A.-G. & Buchner, A. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav. Res. Methods 39, 175–191 (2007).
- 89. Marques, T., Schrimpf, M. & DiCarlo, J. J. Multi-scale hierarchical neural network models that bridge from single neurons in the primate primary visual cortex to object recognition behavior. Preprint at bioRxiv https://doi.org/10.1101/2021.03.01.433495 (2021).
- 90. Freeman, J., Ziemba, C. M., Heeger, D. J., Simoncelli, E. P. & Movshon, J. A. A functional and perceptual signature of the second visual area in primates. Nat. Neurosci. 16, 974–981 (2013).
- 91. Majaj, N. J., Hong, H., Solomon, E. A. & DiCarlo, J. J. Simple learned weighted sums of inferior temporal neuronal firing rates accurately predict human core object recognition performance. J. Neurosci. 35, 13402–13418 (2015).
- 92. Sanghavi, S. & DiCarlo, J. J. Sanghavi2020. https://doi.org/10.17605/OSF.IO/CHWDK (2021).
- 93. Sanghavi, S., Jozwik, K. M. & DiCarlo, J. J. SanghaviJozwik2020. https://doi.org/10.17605/OSF.IO/FHY36 (2021).
- 94. Sanghavi, S., Murty, N. A. R. & DiCarlo, J. J. SanghaviMurty2020. https://doi.org/10.17605/OSF.IO/FCHME (2021).
- 95. Humphries, C., Liebenthal, E. & Binder, J. R. Tonotopic organization of human auditory cortex. Neuroimage 50, 1202–1211 (2010).
- 96. Norman-Haignere, S., Kanwisher, N. & McDermott, J. H. Cortical pitch regions in humans respond primarily to resolved harmonics and are located in specific tonotopic regions of anterior auditory cortex. J. Neurosci. 33, 19451–19469 (2013).
- 97. Overath, T., McDermott, J. H., Zarate, J. M. & Poeppel, D. The cortical analysis of speech-specific temporal structure revealed by responses to sound quilts. Nat. Neurosci. 18, 903–911 (2015).
- 98. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
- 99. Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904).
- 100. Spearman, C. Correlation calculated from faulty data. Br. J. Psychol. 3, 271–295 (1910).
- 101. Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E. & Gallant, J. L. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532, 453–458 (2016).
- 102. Khaligh-Razavi, S.-M. & Kriegeskorte, N. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Comput. Biol. 10, e1003915 (2014).
- 103. Santoro, R. et al. Encoding of natural sounds at multiple spectral and temporal resolutions in the human auditory cortex. PLoS Comput. Biol. 10, e1003412 (2014).
- 104. Norman-Haignere, S. V. & McDermott, J. H. Neural responses to natural and model-matched stimuli reveal distinct computations in primary and nonprimary auditory cortex. PLoS Biol. 16, e2005127 (2018).
Data Availability Statement
Human data, trained model checkpoints and an interface to view/listen to the generated metamers used in the human recognition experiments are available at https://github.com/jenellefeather/model_metamers_pytorch. The Word–Speaker–Noise training dataset is available from the authors upon request.
Code for generating metamers, training models and running online experiments is available at https://github.com/jenellefeather/model_metamers_pytorch (https://doi.org/10.5281/zenodo.8373260). Auditory front-end (cochleagram generation) code is available at https://github.com/jenellefeather/chcochleagram.
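To give a concrete sense of what the released metamer-generation code does, the sketch below illustrates the basic procedure: starting from noise, the input is iteratively optimized by gradient descent until the activations at a chosen model stage match those evoked by a natural stimulus. This is a minimal PyTorch illustration, not the repository's actual API; the helper generate_metamer, the choice of layer, the plain mean-squared-error objective and the optimizer settings are assumptions made for brevity, and the released code differs in its losses, constraints and optimization details.

```python
# Minimal sketch of metamer generation by activation matching (illustrative only;
# not the API of the model_metamers_pytorch repository). Assumes a standard
# torchvision classifier and a natural image tensor of shape (1, 3, H, W).
import torch
import torch.nn.functional as F
from torchvision import models


def generate_metamer(model, layer, natural_input, n_steps=2000, lr=0.01):
    """Optimize a noise input so that `layer`'s activations match those
    evoked by `natural_input` (hypothetical helper, for illustration)."""
    feats = {}

    def hook(_module, _inputs, output):
        feats["act"] = output

    handle = layer.register_forward_hook(hook)
    model.eval()
    for p in model.parameters():      # freeze weights; only the input is optimized
        p.requires_grad_(False)

    with torch.no_grad():             # record the target activations once
        model(natural_input)
        target = feats["act"].detach()

    metamer = torch.randn_like(natural_input).requires_grad_(True)
    optimizer = torch.optim.Adam([metamer], lr=lr)

    for _ in range(n_steps):
        optimizer.zero_grad()
        model(metamer)                # forward pass refreshes feats["act"]
        loss = F.mse_loss(feats["act"], target)
        loss.backward()
        optimizer.step()

    handle.remove()
    return metamer.detach()


# Hypothetical usage with a pretrained ResNet-50 and an intermediate stage:
# model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
# metamer = generate_metamer(model, model.layer3, natural_image)
```

In practice the optimization also has to respect the model's input preprocessing (image normalization, or the cochleagram front end and waveform range for audio models), which is omitted from this sketch.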