eLife. 2020 Sep 2;9:e55978. doi: 10.7554/eLife.55978

What do adversarial images tell us about human vision?

Marin Dujmović 1,†, Gaurav Malhotra 1, Jeffrey S Bowers 1
Editors: Gordon J Berman2, Ronald L Calabrese3
PMCID: PMC7467732  PMID: 32876562

Abstract

Deep convolutional neural networks (DCNNs) are frequently described as the best current models of human and primate vision. An obvious challenge to this claim is the existence of adversarial images that fool DCNNs but are uninterpretable to humans. However, recent research has suggested that there may be similarities in how humans and DCNNs interpret these seemingly nonsense images. We reanalysed data from a high-profile paper and conducted five experiments controlling for different ways in which these images can be generated and selected. We show human-DCNN agreement is much weaker and more variable than previously reported, and that the weak agreement is contingent on the choice of adversarial images and the design of the experiment. Indeed, we find there are well-known methods of generating images for which humans show no agreement with DCNNs. We conclude that adversarial images still pose a challenge to theorists using DCNNs as models of human vision.

Research organism: Human

Introduction

Deep convolutional neural networks (DCNNs) have reached, and in some cases exceeded, human performance in many image classification benchmarks such as ImageNet (He et al., 2015). In addition to having obvious commercial implications, these successes raise questions as to whether DCNNs identify objects in a similar way to the inferotemporal cortex (IT) that supports object recognition in humans and primates. If so, these models may provide important new insights into the underlying computations performed in IT. Consistent with this possibility, a number of researchers have highlighted various functional similarities between DCNNs and human vision (Peterson et al., 2018) as well as similarities in patterns of activation of neurons in IT and units in DCNNs (Yamins and DiCarlo, 2016). This has led some authors to make strong claims regarding the theoretical significance of DCNNs to neuroscience and psychology. For example, Kubilius et al., 2018 write: ‘Deep artificial neural networks with spatially repeated processing (a.k.a., deep convolutional [Artificial Neural Networks]) have been established as the best class of candidate models of visual processing in primate ventral visual processing stream’ (p.1).

One obvious problem in making this link is the existence of adversarial images. These are ‘inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake’ (Goodfellow et al., 2017). Figure 1 shows examples of two types of adversarial images. On first impression, it seems inconceivable that these adversarial images would ever confuse humans. There is now a small industry of researchers creating adversarial attacks that produce images which DCNNs classify in bizarre ways (Akhtar and Mian, 2018). The confident classification of these adversarial images by DCNNs suggests that humans and current architectures of DCNNs perform image classification in fundamentally different ways. If this is the case, the existence of adversarial images poses a challenge to research that considers DCNNs as models of human behaviour (e.g., Kubilius et al., 2018; Ritter et al., 2017; Peterson et al., 2017; Cichy and Kaiser, 2019; Dodge and Karam, 2017), or as plausible models of neural firing patterns in primate and human visual cortex (e.g., Khaligh-Razavi and Kriegeskorte, 2014; Cadieu et al., 2014; Rajalingham et al., 2018; Yamins et al., 2014; Eickenberg et al., 2017; Cichy et al., 2016).

Figure 1. Examples of two types of adversarial images.


(a) Fooling adversarial images taken from Nguyen et al., 2015 that do not look like any familiar object. The two images on the left (labelled ‘Electric guitar’ and ‘Robin’) were generated by evolutionary algorithms using indirect and direct encoding, respectively, and are classified confidently by a DCNN trained on ImageNet. The image on the right (labelled ‘1’) was also generated by an evolutionary algorithm using direct encoding and is classified confidently by a DCNN trained on MNIST. (b) An example of a naturalistic adversarial image taken from Goodfellow et al., 2014 that is generated by perturbing a naturalistic image (left, classified as ‘Panda’) with a high-frequency noise mask (middle), yielding an image that is confidently (mis)classified by a DCNN (as a ‘Gibbon’).

However, some recent studies have suggested that there may, in fact, be theoretically relevant overlap between DCNNs and humans in how they process these adversarial images. Zhou and Firestone, 2019 (Z&F from here on) recently reported that humans can reliably decipher fooling adversarial images that, on first viewing, look uninterpretable (as in Figure 1a). The authors took a range of published adversarial images that were claimed to be uninterpretable by humans and, in a series of experiments, they showed those images to human subjects next to the DCNN’s preferred label and various foil labels. They reported that, over the course of an experimental session, a high percentage of participants (often close to 90%) chose the DCNN’s preferred label at above-chance rates. Furthermore, they reported evidence that humans appreciate subtler distinctions made by the machine rather than simply agreeing on the basis of some superficial features (such as predicting ‘bagel’ rather than ‘pinwheel’ when confronted with an image of a round and yellow blob). These results matter because they speak to an important theoretical question that Z&F pose in the first line of their abstract: ‘Does the human mind resemble the machine-learning systems that mirror its performance?’ (p.1). The high level of agreement they reported seems to suggest the answer is ‘yes’.

Here we show that the agreement between humans and DCNNs on adversarial images was weak and highly variable between participants and images. The remaining agreement appeared to reflect participants making educated guesses based on some superficial features (such as colour) within images and the limited response alternatives presented to them. We then carried out five experiments in which we systematically manipulated factors that can contribute to an observed agreement between humans and DCNNs in order to better understand how humans interpret adversarial images. The experiments demonstrate that the overlap between human and DCNN classification is contingent upon various details of the experimental design such as the selection of adversarial images used as stimuli, the response alternatives presented to participants during the experiment, the adversarial algorithm used to generate the images and the dataset on which the model was trained. When we controlled for these factors, we observed that the agreement between humans and DCNNs dropped to near chance levels. Even when adversarial images were selected such that multiple DCNNs confidently assigned the same label to these images, humans seldom agreed with the machine label, especially when they had to choose between response alternatives that contained superficial features present within these images. We also show that it is straightforward to generate adversarial images that fool networks trained on ImageNet but are truly meaningless to human participants, irrespective of how the stimuli are selected or response alternatives are presented to a participant. We take the findings to highlight a dramatic difference between human and DCNN object recognition.

Results

Reassessing the level of agreement in Zhou and Firestone, 2019

Our first step in trying to understand the agreement between humans and DCNNs observed by Z&F was to assess how well their methods reflect the degree of agreement between humans and DCNNs. Z&F conducted seven experiments in which they measured agreement by computing the number of trials on which the participants matched the DCNN’s classification and working out whether this number was numerically above or below chance level. In Experiment 3, for example, a participant was shown an adversarial image on each trial and asked to choose one amongst 48 labels for that image. Each trial was independent, so a participant could choose any of the 48 labels for each image. Chance level was 1/48, so if a participant chose the same label as the DCNN on two or more trials, they were labelled as agreeing with the DCNN. In addition, half of the participants who agreed with the DCNN on only 1/48 trials were also counted towards the number of participants who agreed with the DCNN. When computed in this manner, Z&F calculated that 142 out of 161 (88%) participants in Experiment 3a, and 156 out of 174 (90%) in Experiment 3b, agreed with the DCNN at above chance levels.

This is a reasonable way of measuring agreement if the goal is to determine whether agreement between humans and DCNNs is statistically above chance levels. However, if the goal is to measure the degree of agreement, this method may be misleading and liable to misinterpretation. Firstly, the rates of agreement obtained using this method ignore inter-individual variability and assign the same importance to a participant that agrees on 2 out of 48 trials as a participant who agrees on all 48 trials with the DCNN. Secondly, this method obscures information about the number of trials on which humans and DCNNs disagree. So even if every participant disagreed with the network on 46 out of 48 trials, the rate of agreement, computed in this manner, would be 100% and even a sample of blindfolded participants would show 45% agreement (see Materials and methods). In fact, not a single participant in Experiments 3a and 3b (from a total of 335 participants) agreed with the model on a majority (24 or more) of trials, yet the level of agreement computed using this method is nearly 90%.

A better way of measuring the degree of agreement is to simply report the average agreement. This can be calculated as the mean percentage of images (across participants) on which participants and DCNNs agree. This method overcomes the disadvantages mentioned above: it takes into consideration the level of agreement of each participant (a participant who agrees on 4/48 trials is not treated equivalently to a participant who agrees on 48/48 trials), and it reflects both the levels of agreement and disagreement observed (so a mean agreement of 100% would indeed mean that participants agreed with the DCNN classification on all the trials). Z&F reported mean agreement for the first of their seven experiments, and in Table 1 we report mean agreement levels in all their experiments. Viewed in this manner, it is clear that the degree of agreement in the experiments carried out by Z&F is, in fact, fairly modest and far from the ‘surprisingly universal’ (p.2) or ‘general agreement’ (p.4) the authors reported.

Table 1. Mean DCNN-participant agreement in the experiments conducted by Zhou and Firestone, 2019.

Exp. | Image type | Task | Stimuli source | Mean agreement | Chance
1 | Fooling | 2AFC | N15 | 74.18% (35.61/48 images) | 50%
2 | Fooling | 2AFC | N15 | 61.59% (29.56/48 images) | 50%
3a | Fooling | 48AFC | N15 | 10.12% (4.86/48 images) | 2.08%
3b | Fooling | 48AFC | N15 | 9.96% (4.78/48 images) | 2.08%
4 | TV-static | 8AFC | N15 | 28.97% (2.32/8 images) | 12.5%
5 | Digits | 9AFC | P16 | 16% (1.44/9 images) | 11.11%
6 | Naturalistic | 2AFC | K18 | 73.49% (7.3/10 images) | 50%
7 | 3D Objects | 2AFC | A17 | 59.55% (31.56/53 images) | 50%

* To give the readers a sense of the levels of agreement observed in these experiments, we have also computed the average number of images in each experiment where humans and DCNNs agree as well as the level of agreement expected if participants were responding at chance.

Stimuli sources: N15 - Nguyen et al., 2015; P16 - Papernot et al., 2016; K18 - Karmon et al., 2018; A17 - Athalye et al., 2017.
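To make the difference between the two measures concrete, the following minimal sketch (in Python, using simulated blindfolded guessers rather than the published data) computes both the threshold-based ‘above chance’ proportion and the mean agreement for a hypothetical 161-participant, 48-trial experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary agreement matrix (rows = participants, columns = trials):
# every simulated participant guesses uniformly amongst the 48 labels, so any
# match with the DCNN label is pure chance.
n_participants, n_trials, n_labels = 161, 48, 48
agreements = rng.integers(0, n_labels, size=(n_participants, n_trials)) == 0

# Threshold-based measure: participants agreeing on 2 or more trials, plus half
# of those agreeing on exactly 1 trial, are counted as 'above chance'.
hits = agreements.sum(axis=1)
above_chance = (hits >= 2).sum() + 0.5 * (hits == 1).sum()
print(f"Threshold-based 'agreement': {100 * above_chance / n_participants:.1f}%")

# Mean agreement: average percentage of trials on which a participant's choice
# matched the DCNN label.
print(f"Mean agreement: {100 * agreements.mean():.2f}% (chance = {100 / n_labels:.2f}%)")
```

Even though every simulated participant is guessing at random, the threshold-based measure reports roughly 45% ‘agreement’, whereas mean agreement stays at the roughly 2% chance level.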

Reassessing the basis of the agreement in Zhou and Firestone, 2019

Although the mean agreement highlights a much more modest degree of agreement, it is still the case that the agreement was above chance. Perhaps the most striking result is in Z&F’s Experiment 3 where participants had to choose between 48 response alternatives and mean agreement was ∼10% with chance being ∼2%. Does this consistent, above chance agreement indicate that there are common underlying principles in the way humans and DCNNs perform object classification?

In order to clarify the basis of overall agreement we first assessed the level of agreement for each of the 48 images separately. As shown in Appendix 1—figure 1, the distribution of agreement levels was highly skewed and had a large variance. There was a small subset of images that looked like the target class (such as the Chainlink Fence, which can be seen in Appendix 1—figure 2) and participants showed a high level of agreement with DCNNs on these images. Another subset of images with lower (but statistically significant) levels of agreement contained some features consistent with the target class, such as the Computer Keyboard which contains repeating rectangles. But agreement on many images (21/48) was at or below chance levels. This indicates that the agreement is largely driven by a subset of adversarial images, some of which (such as the Chainlink Fence) simply depict the target class.

We also observed that there was only a small subset of images on which participants showed a clear preference for the response alternative that matched the DCNN’s label. For most adversarial images, the distribution of participant responses across response alternatives was fairly flat (see Appendix 1—figures 4–6) and the most frequent human response did not match the machine label even when agreement between humans and DCNNs was above chance (see Appendix 1—figure 2). In fact, the label assigned to the image by DCNNs was ranked 9th (Experiment 3a) or 10th (Experiment 3b) on average, and for 75% of the adversarial images in Experiment 3a and 79.2% in Experiment 3b the most frequently chosen human label did not match the label chosen by the DCNN (Appendix 1—figures 2 and 3). This indicates that most adversarial images do not contain the features humans require to uniquely identify an object category.

Collectively, these findings suggest that the above chance level of agreement was driven by two subsets of images. A very small subset of images contains features that humans can perceive and that are highly predictive of the target category (e.g., the Chainlink Fence image that no one would call ‘uninterpretable’), and another subset includes visible features that are consistent with the target category as well as with a number of other categories. These category-general features (such as colour or curvature) are what Z&F called ‘superficial commonalities’ between images (Zhou and Firestone, 2019, p. 2). For this second subset of images, the most frequent response chosen by participants does not usually match the label assigned by the DCNN; participants in these cases seem to be making educated guesses, using superficial features of the target images to hedge their bets. For the rest of the images, agreement is at or below chance levels.

In order to more directly test how humans interpret adversarial images we carried out five experiments. First, if participants are making educated guesses based on superficial features, then agreement levels should decrease when presented with response alternatives that do not support this strategy. We test this in Experiment 1. Second, if a DCNN develops human-like representations for a subset of categories (e.g., the Chainlink Fence category for which human-DCNN agreement was high for a specific adversarial image of a chainlink fence), then it should not matter which adversarial image from these categories is used to evaluate agreement. We test this in Experiment 2. Third, if DCNNs are processing images in very different ways to humans, then it should be possible to find situations in which overall agreement levels are at absolute chance levels. In Experiment 3 we show that one class of adversarial images for the MNIST dataset generated overall chance level agreement. In Experiment 4 we show that it is straightforward to generate adversarial images for the ImageNet dataset that produce overall chance level agreement. Finally, in Experiment 5 we show that agreement levels between humans and DCNNs remain low and variable even for images that fool an ensemble of DCNNs. The findings further undermine any claim that DCNNs and humans categorize adversarial images in a similar way.

Experiment 1: Response alternatives

One critical difference between decisions made by DCNNs and human participants in an experiment is the number of response alternatives available. For example, DCNNs trained on ImageNet will choose a response from amongst 1000 alternatives while participants will usually choose from a much smaller cohort. In Experiment 1, we tested whether agreement levels are contingent on how these response alternatives are chosen during an experiment. We chose a subset of ten images from the 48 that were used by Z&F and identified four competitive response alternatives (from amongst the 1000 categories in ImageNet) for each of these images. One of these alternatives was always the category picked by the DCNN and the remaining three were subjectively established as categories which share some superficial visual features with the target adversarial image. For example, one of the adversarial images contains a fluorescent orange curve and is confidently classified by the DCNN as a Volcano. For this image, we chose the set of response alternatives {Lighter, Missile, Table lamp, Volcano}, all of which also contain this superficial visual feature. See Appendix 2—figure 1 for the complete list of images and response alternatives. Participants were then shown each of these ten images and asked to choose one amongst these four competitive response alternatives. Note that if humans possess a ‘machine-theory-of-mind’, it should not matter how one samples response alternatives as a DCNN classifies the fooling adversarial images with high confidence (>99%) in the presence of all 999 alternative labels, including the competing alternatives we have selected. In the control condition, an independent sample of participants completed the same task, but the alternative labels were chosen at random from the 48 used by Z&F.

We observed that agreement levels fell nearly to chance in the competitive condition while being well above chance in the random condition (see Figure 2). The mean agreement level in the competitive condition was 28.5% (SD = 11.67) with chance at 25%. A single sample t-test comparing the mean agreement level to the fixed value of 25% showed that the difference was significant (t(99)=3.00,p=.0034,d=0.30). However, in the random condition mean agreement was 49.8% (SD = 16.02), which was both significantly above chance (t(99)=15.48,p<.0001,d=1.54) and well above agreement in the competitive condition (t(198)=10.75,p<.0001,d=1.52). Both conditions are in stark contrast to the DCNN, which classified these images with a confidence >99% even in the presence of these competing categories.
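For completeness, this style of analysis can be reproduced as follows (a minimal sketch in Python with simulated per-participant agreement scores parameterised by the summary statistics above; the actual data and analysis scripts are on OSF):

```python
import numpy as np
from scipy import stats

# Simulated per-participant agreement scores (% of 10 trials matching the DCNN
# label), parameterised by the means and SDs reported above.
rng = np.random.default_rng(0)
competitive = rng.normal(28.5, 11.67, size=100)
random_cond = rng.normal(49.8, 16.02, size=100)

# Single sample t-tests against the 25% chance level.
print(stats.ttest_1samp(competitive, popmean=25.0))
print(stats.ttest_1samp(random_cond, popmean=25.0))

# Independent samples t-test comparing the two conditions.
print(stats.ttest_ind(random_cond, competitive))

# Cohen's d for the comparison against chance in the competitive condition.
d = (competitive.mean() - 25.0) / competitive.std(ddof=1)
print(f"Cohen's d (competitive vs chance): {d:.2f}")
```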

Figure 2. Average levels of agreement in Experiment 1 (error bars denote 95% confidence intervals).


These results highlight a key contrast between human and DCNN image classification. While the features in each of these adversarial images are sufficient for a DCNN to uniquely identify one amongst 1000 categories, for humans they are not. Instead, the features within these images only allow humans to narrow the choice down to a cohort of categories. Thus, the observed decrease in agreement between the random and competitive conditions supports the hypothesis that participants are making plausible guesses in these experiments, using superficial features (shared amongst a cohort of categories) to eliminate response alternatives.

It should be noted that Z&F were themselves concerned about how the choice of response alternatives may have influenced their results. Therefore, they carried out another experiment where, instead of choosing between the DCNN’s preferred label and another randomly selected label, the participants had to choose between the DCNN’s 1st and 2nd-ranked labels. The problem with this approach is that the DCNN generally has a very high level of confidence (>99%) in its 1st choice. Accordingly, it is not at all clear that the 2nd most confident choice made by the network provides the most challenging response alternative for humans. The results from Experiment 1 show that when the competing alternatives are selected using a different criterion, the agreement between participants and DCNNs does indeed drop to near-chance levels.

Experiment 2: Target adversarial images

Our reanalysis above also showed that there was large variability in agreement between images. One possible explanation for this is that the DCNN learns to represent some categories (such as Chainlink Fence or Computer Keyboard) in a manner that closely relates to human object recognition while representations for other categories diverge. If there were meaningful overlap between human and DCNN representations for a category, we would expect participants to show a similar level of agreement on all adversarial images for this category, since all of them should capture these common features. So replacing an adversarial image from these categories with another image generated in the same manner should lead to little change in agreement. In Experiment 2 we directly tested this hypothesis by sampling two different images (amongst the five images for each category generated by Nguyen et al., 2015) for the same ten categories from Experiment 1. We chose the best and worst representative stimuli for each of the categories by running a pre-study (see the Materials and methods section) and labelled the two conditions as best-case and worst-case. An example of each type of image is shown in Figure 3.

Figure 3. Example of best-case and worst-case images for the same category (‘penguin’) used in Experiment 2.


Figure 4 shows the mean agreement for participants viewing the best-case and worst-case adversarial images. The difference in agreement between the two conditions was highly significant (t(198)=22.28,p<.0001,d=3.15). Both groups showed agreement levels significantly different from chance (which was at 25%). The best-case group was significantly above chance (t(99)=20.12,p<.0001,d=2.01) while the worst-case was significantly below chance (t(99)=10.58,p<.0001,d=0.99).

Figure 4. Average levels of agreement in Experiment 2 (error bars denote 95% confidence intervals).


Thus, we observed a large drop in agreement when we replaced one set of adversarial images with a different set, and there was no evidence for consistent above-chance agreement for all adversarial images from a subset of categories (see Appendix 2—figure 2 for an item-wise breakdown). In other words, we did not observe any support for the hypothesis that DCNNs learn to represent even a subset of categories in a manner that closely relates to human object recognition.

Experiment 3: Different types of adversarial images

Although we can easily reduce DCNN-human agreement to chance by judiciously selecting the targets and foils, it remains the case that a random selection of targets and foils has led to above chance performance on this set of images. In the next experiment, we asked whether this effect is robust across different types of adversarial images. All the images in the experiments above were generated to fool a network that had been trained on ImageNet and belonged to the subclass of regular adversarial images generated by Nguyen et al., 2015 using an indirect encoding evolutionary algorithm. In fact, Nguyen et al., 2015 generated four different types of adversarial images by manipulating the type of encoding – direct or indirect – and the type of database the network was trained on – ImageNet or MNIST (see Figure 5). We noticed that Z&F used images designed to fool DCNNs trained on images from ImageNet, but did not consider the adversarial images designed to fool a network trained on the MNIST dataset. To our eyes, these MNIST adversarial images looked completely uninterpretable and we wanted to test whether the above chance agreement was contingent on which set of images were used in the experiments.

Figure 5. Examples of images from Nguyen et al., 2015 used in the four experimental conditions in Experiment 3.


Images were generated to fool a network trained on either ImageNet or MNIST, using an evolutionary algorithm with either direct or indirect encoding.

Accordingly, we designed a 2 × 2 experiment in which we tested participants on all four conditions corresponding to the four types of images (Figure 5). Since MNIST has ten response categories and we wanted to compare results for the MNIST images with ImageNet images, we used the same 10 categories from Experiments 1 and 2 for the two ImageNet conditions. On each trial, participants were shown an adversarial image and asked to choose one out of ten response alternatives that remained fixed for all trials.

Mean agreement levels in this experiment are shown in Figure 6. We observed a large difference in agreement levels depending on the types of adversarial images. Results of a two-way repeated measures ANOVA revealed a significant effect of dataset on agreement levels (F(1,197)=298.62,p<.0001,ηp²=0.60). Participants agreed with DCNN classification for images designed to fool ImageNet classifiers significantly more than for images designed to fool MNIST classifiers. Participants also showed significantly larger agreement for indirectly-encoded compared to directly-encoded images (F(1,197)=67.57,p<.0001,ηp²=0.26). The most striking observation was that agreement dropped from 26% for ImageNet images to near chance for MNIST images. Participants were slightly above chance for indirectly-encoded MNIST images (t(197)>6.30,p<.0001,d=0.44) and at chance agreement for directly-encoded MNIST images (t(197)=1.03,p=0.31).

Figure 6. Agreement (mean percentage of images on which a participant's choices agree with the DCNN) as a function of experimental condition in Experiment 3 (error bars denote 95% confidence intervals).


In addition to the between-condition differences, we also found high within-condition variability for the ImageNet images. We observed that this was because agreement was driven by a subset of adversarial images (see Appendix 2—figure 3 for a breakdown). Thus, even for these ImageNet images, DCNN representations do not consistently overlap with representations used by humans.

Experiment 4: Generating fooling images for ImageNet

Experiment 3 showed that it is straightforward to obtain overall chance level performance on the MNIST images, which raises the obvious question of whether it is also straightforward to observe chance performance for adversarial images designed to fool ImageNet classifiers. In order to test this, we generated our own irregular (TV-static like) adversarial images using a standard method of generating adversarial images (see Materials and methods section). Each of these images was confidently classified as one of 1000 categories by a network trained on ImageNet. Participants were presented with three of these adversarial images and asked to choose the image that most closely matched the target category (see inset in Figure 7). In half of the trials participants were shown adversarial images that were generated to fool AlexNet while in the other half they were shown adversarial images generated to fool Resnet-18.

Figure 7. Average levels of agreement in Experiment 4 (error bars denote 95% confidence intervals).


The inset depicts a single trial in which participants were shown three fooling adversarial images and naturalistic examples from the target category. Their task was to choose the adversarial image which contained an object from the target category.

Results of the experiment are shown in Figure 7. For both types of images, the agreement between participants and DCNNs was at chance. Additionally, we ran binomial tests for each image to determine whether the number of participants who agreed with the DCNN classification was significantly above chance; not a single image showed agreement significantly above chance. Clearly, participants could not find meaningful features in any of these images, while networks were able to confidently classify each of them.
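A per-image test of this kind can be sketched as follows (the respondent count and agreement counts below are made-up placeholders, not the experimental data; chance is 1/3 because each trial offered three adversarial images):

```python
from scipy.stats import binomtest

n_respondents = 100          # assumed number of responses per image (placeholder)
agree_counts = [36, 31, 29]  # illustrative per-image agreement counts (placeholders)

# One-sided binomial test per image: did more participants pick the image
# matching the DCNN label than expected by chance (p = 1/3)?
for k in agree_counts:
    result = binomtest(k, n=n_respondents, p=1 / 3, alternative='greater')
    print(f"{k}/{n_respondents} agreed, p = {result.pvalue:.3f}")
```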

Experiment 5: Transferable adversarial images

In the experiments above we observed that while DCNNs are vulnerable to adversarial attacks (they classify these images with extremely high confidence), participants do not show such a vulnerability or even a consistent agreement with the DCNN classification. But it does not necessarily follow that DCNNs are poor models of biological vision. In fact, there are many different methods of generating adversarial images (Akhtar and Mian, 2018) and some of these images do not transfer even from one DCNN to another, yet this does not merit the conclusion that the different DCNNs function in fundamentally different ways (indeed, current DCNNs are highly similar to one another, by design). In a similar manner, the fact that adversarial images do not transfer between DCNNs and humans does not, by itself, support the conclusion that the human visual system and DCNNs are fundamentally different.

In order to provide a stronger test of the similarity of DCNNs and human vision, we asked whether adversarial images that fool multiple DCNNs are decipherable by humans. If indeed there are some underlying and reliable similarities in how stimuli are processed by DCNNs and humans, then it might be expected that highly transferable adversarial attacks should also lead to higher human-to-network agreement.

So in the next experiment, we chose 20 adversarial images that 10 DCNNs classified with high confidence and high between-network agreement (see the Materials and methods section for details). The experiment followed the same procedure as Experiment 1: a participant was shown an adversarial image on each trial and asked to choose a label from four response alternatives. As in Experiment 1, participants were assigned to one of two conditions. In the random alternatives condition, participants were shown the network label and three other labels, which were randomly drawn from the remaining 19 labels. In the competitive alternatives condition, participants again had to choose from the network label and three alternative labels. However, in this condition the labels were chosen amongst the 999 remaining category labels in ImageNet such that they contained some superficial features present within these images (see Materials and methods for details). Note that all DCNNs classified these images with high confidence and with all 1000 ImageNet labels present as alternatives.

Results are depicted in Figure 8(b). There was a significant difference between the two conditions (t(198)=16.37,p<.0001,d=2.32). Additionally, both conditions differed significantly from chance: agreement was above chance in the random alternatives condition (t(99)=18.66,p<.0001,d=1.87) and below chance in the competitive alternatives condition (t(99)=3.13,p<.01,d=0.31).

Figure 8. Results for images that are confidently classified with high network-to-network agreement on Alexnet, Densenet-161, GoogLeNet, MNASNet 1.0, MobileNet v2, Resnet 18, Resnet 50, Shufflenet v2, Squeezenet 1.0, and VGG-16.


(a) Examples of images used in the experiment (for all the stimuli see Appendix 2—figures 4 and 5), (b) average levels of agreement between participants and DCNNs under the random and competitive alternatives conditions in Experiment 5, and (c) probability of network-network, human-human, and network-human agreement in the competitive alternatives condition of Experiment 1 and Experiment 5 (error bars denote 95% confidence intervals).

Thus, for these adversarial images, which fool multiple DCNNs, we find levels of agreement very similar to those for the adversarial images from Experiment 1 (compare Figure 8(b) and Figure 2). To further examine how the DCNN-to-DCNN agreement compares to DCNN-to-human agreement, we computed the probability that two randomly sampled networks will agree on an image’s label and compared it to the probability that a randomly sampled network will agree with a randomly sampled participant (see Materials and methods for details). Figure 8(c) shows these probabilities for the competitive condition for both Experiment 1 and Experiment 5. We observed that: (i) even when the probability that two networks agree on an adversarial image is larger than 90%, the probability of network-human agreement is low (∼10%), and (ii) the increase in probability of network-network agreement (between Experiment 1 and Experiment 5) has very little impact on human classification as the probability of human-human and network-human agreement remains much the same in the two experiments. Thus, participants showed very little agreement with DCNNs even when DCNNs agreed with each other. Interestingly, humans showed more agreement amongst themselves, consistent with the hypothesis that participants represent these adversarial images in similar ways, even though we find no evidence that these representations overlap with those of the networks. This again suggests that humans and current DCNNs process these images in fundamentally different ways.

Discussion

Zhou and Firestone, 2019’s claim that humans can robustly decipher adversarial images suggests that there are important similarities in how humans and DCNNs process these images, and objects more generally. However, when we examined their results using an alternative analysis, we found that the level of agreement was rather low, highly variable, and largely driven by a subset of images where participants could eliminate response alternatives based on superficial features present within these images. This was confirmed in a series of experiments that found that agreement between humans and DCNNs was contingent on the adversarial images chosen as stimuli (Experiments 2 and 3) and the response alternatives presented to participants (Experiment 1). We also show that there are well-known methods for generating adversarial images that lead to overall chance level DCNN-human agreement (Experiments 3 and 4), again demonstrating that DCNNs confidently identify images on the basis of features that humans completely ignore. Furthermore, even when humans were presented with adversarial images that fooled at least 9 of 10 DCNNs, the level of agreement between humans and DCNNs remained low and variable (Experiment 5). Indeed, manipulating the level of agreement between DCNNs (by varying the adversarial images) had no impact on the level of agreement between DCNNs and humans, or amongst humans, as highlighted in Figure 8. Taken together, these findings not only refute the claim that there is a robust and reliable similarity in processing these adversarial images, but also suggest that humans and current DCNNs categorize objects in fundamentally different ways.

A similar distinction between human and DCNN classification is made by Ilyas et al., 2019, who argue that current architectures of DCNNs are vulnerable to adversarial attacks due to their tendency to rely on non-robust features present in databases. These are features that are predictive of a category but highly sensitive to small perturbations of the image. It is this propensity for relying on non-robust features that makes it easy to generate adversarial images that are completely uninterpretable by humans but classified confidently by the network (Experiment 4). A striking example of DCNNs picking up on non-robust features was recently reported by Malhotra et al., 2020, who showed that DCNNs trained on a CIFAR-10 dataset modified to contain a single diagnostic pixel per category learn to categorize images based on single pixels, ignoring everything else in the image. Humans, by contrast, tend to use robust features of objects, such as their shape, for classifying images (Biederman and Ju, 1988).

We would like to note that we are not claiming that there is no role played by superficial and non-robust features in human object recognition. In a recent study, Elsayed et al., 2018 asked human participants to classify naturalistic adversarial images (see Figure 1(b)) when these images were briefly flashed (for around 70 ms) on the screen. They found that there is a small, but statistically significant, effect of the adversarial manipulation on choices made by participants (i.e., participants were slightly more likely to classify a ‘cat’ image as a ‘dog’ when the image was adversarially perturbed towards a ‘dog’). Thus, these results seem to suggest that humans are sensitive to the same type of non-robust features that lead to adversarial attacks on DCNNs. However, it is important to note here that the size of these effects is small: while human accuracy drops by less than 10% when normal images are replaced by adversarially perturbed images, DCNNs (mis)classify these adversarially perturbed images with high confidence. These findings are consistent with our observation that some adversarial images capture some superficial features that can be used by participants to make classification decisions, leading to an overall above-chance agreement.

It should also be noted that we have only considered a small fraction of adversarial images here and, as in Experiment 4, there are many other types of adversarial attacks that produce images that seem completely undecipherable to humans. It could be that humans find these images completely uninterpretable due to the difference in acuity of human and machine vision (a line taken by Z&F). There are two reasons why we think a difference in acuity cannot be the primary explanation of the difference between human and machine perception of adversarial images. Firstly, we have shown above that the very same algorithm produced some images that supported above chance agreement and other images that supported no agreement (for example, Appendix 2—figure 2). There is no reason to believe that the two sets of images are qualitatively different, with DCNNs selectively exploiting subliminal features only when overall agreement levels are at chance. Secondly, a wide variety of adversarial attacks clearly do not rely on subtle visual features that are below the human perceptual threshold. These include semantic adversarial attacks in which the colour of an image is changed (Hosseini and Poovendran, 2018), attacks that cause incorrect classification by simply changing the pose of an object (Alcorn et al., 2019), and so on. These are all dramatic examples of differences between DCNNs and humans that cannot be attributed to the acuity of the human perceptual front-end. Rather, they reflect the fact that current architectures of DCNNs often rely on visual features that humans can see but ignore.

Of course, it might be possible to modify DCNNs so that they perform more like humans in our adversarial tasks. For example, training similar models on data sets that are more representative of human visual experience might reduce their susceptibility to adversarial images and lead DCNNs to produce more variable responses in our tasks as a consequence of picking up on superficial visual features. In addition, modifying the architectures of DCNNs or introducing new ones may lead to better DCNN-human agreement on these tasks (for example, capsule networks [Sabour et al., 2017]). But researchers claiming that current DCNNs provide the best models of visual processing in the primate ventral visual processing stream need to address this striking disconnect between the two systems.

To conclude, our findings with fooling adversarial images pose a challenge for theorists using current DCNNs trained on data sets like ImageNet as psychological models of human object identification. An important goal for future research is to develop models that are sensitive to the visual features that humans rely on, but at the same time insensitive to other features that are diagnostic of object category but irrelevant to human vision. This involves identifying objects on the basis of shape rather than texture, colour, or other diagnostic features (Geirhos et al., 2018; Baker et al., 2018), where vertices are the critical components of images (Biederman, 1987), where Gestalt principles are used to organize features (Pomerantz and Portillo, 2011), where relations between parts are explicitly coded (Hummel and Stankiewicz, 1996), and where features and objects are coded independently of retinal position (Blything et al., 2019), size (Biederman and Cooper, 1992), left/right reflection (Biederman and Cooper, 1991), etc. When DCNNs rely on this set of features, we expect they will not be subject to adversarial attacks that seem so bizarre to humans, and will show the same set of strengths and weaknesses (visual illusions) that characterize human vision.

Materials and methods

Reassessing agreement: Blindfolded participants

If a participant is blindfolded and chooses one of 48 options randomly on 48 trials, the probability of them making the same choice as the DCNN on k trials is given by the binomial distribution $\binom{48}{k} p^k (1-p)^{48-k}$, where $p = 1/48$. Substituting different values of k, one can compute that 37.2% of these blindfolded participants will agree with the DCNN on 1 trial, 18.6% will agree on 2 trials, 6% will agree on 3 trials, and so on. To compute the proportion of participants who agree with the DCNN, Zhou and Firestone, 2019 count all participants who agree on 2 or more trials as agreeing with the DCNN (chance is 1 out of 48 trials) and half of the participants that agree on exactly 1 trial. Thus, summing up all the blindfolded participants that agree on 2 or more trials and half of those who agree on exactly 1 trial, this method will show ∼45% agreement between participants and the DCNN.
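The same numbers can be reproduced with a few lines of Python (a minimal sketch using scipy's binomial distribution, not the analysis scripts accompanying either paper):

```python
from scipy.stats import binom

n, p = 48, 1 / 48
probs = binom.pmf(range(n + 1), n, p)

# Probability of a blindfolded guesser agreeing on exactly 1, 2, or 3 trials.
print([f"{probs[k]:.3f}" for k in (1, 2, 3)])     # ~0.372, 0.186, 0.060

# Z&F's criterion: everyone agreeing on 2 or more trials, plus half of those
# agreeing on exactly 1 trial, counts towards 'above chance' agreement.
agree_rate = probs[2:].sum() + 0.5 * probs[1]
print(f"Expected 'agreement' for blindfolded guessers: {agree_rate:.1%}")  # ~45%
```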

Experiment 1

This experiment examined whether agreement between humans and DCNNs depended on the response alternatives presented to participants. We tested N=200 participants and each participant completed 10 trials. On each trial, participants were presented with a fooling adversarial image and four response alternatives underneath the image and asked to choose one of these alternatives. Participants indicated their response by moving their cursor to the response alternative and clicking. We selected 10 fooling adversarial images from amongst the 48 images used by Z&F in their Experiments 1–3. Each of these images was classified with >99% confidence by a DCNN which was trained to classify the ImageNet dataset. We selected these 10 adversarial images to minimise semantic and functional overlap in the labelled categories (for example, we avoided selecting both ‘computer keyboard’ and ‘remote control’). The experiment consisted of two conditions, which differed in how the response alternatives were chosen on each trial. In the ‘Competitive’ condition (N=100) we chose four response alternatives that subjectively seemed to contain one or more visual features that were present in the adversarial image. One of these response alternatives was always the label chosen by the DCNN. The other three were chosen from amongst the 1000 ImageNet class labels, again in a way that minimised semantic or functional overlap with the target class (e.g. an alternative for the ‘baseball’ class was parallel bars but not basketball). All ten images and the four competitive response alternatives for each image are shown in Appendix 2—figure 1. In the ‘Random’ condition (N=100) the three remaining alternative responses were drawn at random (on each trial) from the aforementioned 48 target classes from Experiment 3 in Zhou and Firestone, 2019. We randomised the order of images, as well as the order of the response alternatives, for each participant.

Experiment 2

This experiment was designed to examine whether all fooling adversarial images for a category show similar levels of agreement between humans and DCNNs. The experiment’s design was the same as Experiment 1 above, except participants were now randomly assigned to the ‘best-case’ (N=100) and ‘worst-case’ (N=100) conditions. In each condition, participants again completed 10 trials and on each trial, they saw an adversarial image and four response alternatives. One of these alternatives was the category chosen by a DCNN with >99% confidence and the other three were randomly drawn from amongst the 48 categories used by Z&F in their Experiments 1–3. The difference between the ‘best-case’ and ‘worst-case’ conditions was the adversarial image that was shown to the participants on each trial.

In order to choose the best and worst representative image for each of the categories, we ran a pre-study. Each image used by Z&F in their Experiments 1–3 was chosen from a set of five adversarial images for that category generated by Nguyen et al., 2015. In the pre-study, participants (N=100) were presented with all five fooling images for a category and asked to choose the image that was most like and the image that was least like a member of that category (e.g. most like a computer keyboard). Then, during the study, participants in the ‘best-case’ condition were shown the image from each category that was given the most-like label with the highest frequency. Similarly, participants in the ‘worst-case’ condition were shown images that were labelled as least-like with the highest frequency. DCNNs showed the same confidence in classifying both sets of images. We again randomised the order of presentation of images.

Experiment 3

The experiment consisted of four experimental conditions in a 2 × 2 repeated measures design (every participant completed each condition). The first factor of variation was the database on which the DCNNs were trained – ImageNet or MNIST – with one condition containing images designed to fool ImageNet classifiers and the other containing images designed to fool MNIST classifiers. The second factor of variation was the evolutionary algorithm used to generate the adversarial images – direct or indirect encoding. The indirect encoding method leads to adversarial images which have regular features (e.g. edges) that often repeat, while the direct encoding method leads to noise-like adversarial images. All of the images were from the seminal Nguyen et al., 2015 paper on fooling images. The MNIST dataset consists of ten categories (corresponding to handwritten numbers between 0 and 9), while ImageNet consists of 1000 categories. As we wanted to compare agreement across conditions, we selected ten images (from ten different categories) for both datasets. The indirectly-encoded ImageNet images were the same as the ones in Experiment 1 while the images for the other three conditions were randomly sampled from the images generated by Nguyen et al., 2015. Participants were shown one image at a time and asked to categorize it as one of ten categories (category labels were shown beneath the image). One of these ten categories was the label assigned to the image by a DCNN. Therefore, chance level agreement was 10%. The participants had to click on the label they thought represented what was in the image. The order of conditions was randomized for each participant and the order of images within each condition was randomized as well. A total of N=200 participants completed the study. Two participants were excluded from analysis because their average response times were below 500 ms, indicating random clicking rather than decisions based on the images themselves.

Experiment 4

In this experiment we used the Foolbox package (Rauber et al., 2017) to generate images that fool DCNNs trained on ImageNet. The experiment consisted of two conditions, one with images designed to fool AlexNet (Krizhevsky et al., 2012) and the other with images designed to fool ResNet-18, both trained on ImageNet. We generated our own adversarial images by first generating an image in which each pixel was independently sampled and then successively modifying this image using an Iterative Gradient Attack based on the fast gradient sign method (Goodfellow et al., 2014) until a DCNN classified it as a target category with >99% confidence. The single-trial procedure mirrored Experiment 4 from Z&F. Participants (N=200) were shown three of the generated images and a set of five real-world example images of the target class (see inset in Figure 7). They were asked to choose the adversarial image which contained an object from the target class. The example images were randomly chosen from the ImageNet dataset for each class. Each participant completed both experimental conditions. The order of trials was randomized for each participant.
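The core of this generation procedure can be sketched in plain PyTorch as follows (a simplified illustration of an iterative gradient-sign attack on a randomly initialized image; the actual stimuli were produced with Foolbox, so the step size, iteration count, target class index, and stopping rule shown here are placeholder assumptions):

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Simplified iterative gradient-sign attack: start from random noise and push
# the image towards a chosen target class until the network is >99% confident.
model = models.alexnet(pretrained=True).eval()

target_class = 954                      # arbitrary ImageNet index, for illustration
target = torch.tensor([target_class])
image = torch.rand(1, 3, 224, 224)      # each pixel sampled independently

step_size, n_steps = 0.01, 200
confidence = 0.0
for _ in range(n_steps):
    image.requires_grad_(True)
    loss = F.cross_entropy(model(image), target)
    grad, = torch.autograd.grad(loss, image)
    # Step towards the target class (descend the loss) and keep pixels valid.
    image = (image - step_size * grad.sign()).clamp(0, 1).detach()
    confidence = F.softmax(model(image), dim=1)[0, target_class].item()
    if confidence > 0.99:
        break

print(f"Confidence in target class {target_class}: {confidence:.4f}")
```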

Experiment 5

The experiment mirrors Experiment 1 in procedure and experimental conditions. Participants (N=200) were sequentially shown 20 images and asked to choose one out of four response alternatives for each image. The order of presentation of these images was randomized. The stimuli were chosen from ten independent runs (a total of 10000 images) of the evolutionary algorithm used in Nguyen et al., 2015 which were kindly provided to us by the first author. The images were selected such that at least 9 out of 10 networks classify the images as the same category with high confidence (median confidence of 92.61%). Before settling on the final set of 20 images, the stimuli were checked by the first author in order to exclude any which are not in fact adversarial, but rather exemplars of the category. The DCNN models were pre-trained on ImageNet and are a part of the model zoo of the PyTorch framework. The models are: Alexnet, Densenet-161, GoogLeNet, MNASNet 1.0, MobileNet v2, Resnet 18, Resnet 50, Shufflenet v2, Squeezenet 1.0, and VGG-16. The input images were transformed in accordance with recommendations found in PyTorch documentation: 224 × 224 centre crop and normalization with mean=[0.485,0.456,0.406] and std=[0.229,0.224,0.225] prior to classification.
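A sketch of how such a transferability check can be run with the PyTorch model zoo is shown below (the image path is a placeholder and only four of the ten networks are listed; the remaining constructors follow the same pattern, and the preprocessing matches the centre crop and normalization described above):

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Preprocessing applied before classification: 224 x 224 centre crop followed by
# normalization with the ImageNet means and standard deviations given above.
preprocess = transforms.Compose([
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# A subset of the pre-trained networks listed above.
networks = {
    'alexnet': models.alexnet(pretrained=True).eval(),
    'resnet18': models.resnet18(pretrained=True).eval(),
    'googlenet': models.googlenet(pretrained=True).eval(),
    'vgg16': models.vgg16(pretrained=True).eval(),
}

# 'candidate_image.png' is a placeholder for one of the evolved images.
image = preprocess(Image.open('candidate_image.png').convert('RGB')).unsqueeze(0)

# An image counts as highly transferable if (nearly) all networks assign it the
# same label with high confidence.
with torch.no_grad():
    for name, net in networks.items():
        probs = torch.softmax(net(image), dim=1)
        conf, label = probs.max(dim=1)
        print(f"{name}: class {label.item()} with confidence {conf.item():.2%}")
```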

The two experimental conditions mirror Experiment 1. In the random alternatives condition, for each image, participants (N=100) chose among labels which included the network classification and three alternatives chosen at random from the remaining 19 stimuli labels. In the competitive alternatives condition, participants (N=100) chose among labels which included the network classification and three competitive labels. To determine these competitive labels, we conducted a pre-study in which participants (N=20) were asked to generate three labels for each adversarial image. These labels were then used as a guide to select the three competitive categories from ImageNet while ensuring that these categories did not semantically overlap with the target category. Participants were assigned to one of the two conditions randomly, and the order of images and the label positioning on the screen were randomized for each participant. Stimuli and competitive labels can be seen in Appendix 2—figures 4 and 5.

Statistical analyses

All statistical analyses were two-tailed, with a p-value under 0.05 denoting a significant result. In Experiments 1, 2, 4, and 5 we conducted single sample t-tests to check whether agreement levels differed significantly from a fixed chance level (25% in Experiments 1, 2, and 5; 33.33% in Experiment 4). We additionally ran between-subject t-tests (Experiments 1, 2, and 5) and a within-subject t-test (Experiment 4) to determine whether the difference between experimental conditions was significant. We also conducted binomial tests in Experiment 4 to determine for how many items the agreement level was significantly above chance. In Experiment 3 we ran a two-way repeated measures analysis of variance. In Experiment 5 we ran a mixed two-way analysis of variance. We report effect size measures for all tests (Cohen’s d for t-tests and partial eta squared for ANOVA effects). We also calculated the probability of network-network, human-human, and network-human agreement in the competitive labels condition of Experiments 1 and 5. This was done by calculating the percentage of agreements among all possible comparisons. For example, the total number of comparisons used to calculate the probability of agreement between two networks was 20 (images) × 45 (the number of possible pairs of networks). The number of such comparisons which resulted in agreement between two networks, divided by the total number of comparisons, gives the probability of two networks agreeing when classifying an adversarial image. We conducted the same calculation on data from Experiment 1, since those stimuli were not specifically chosen to be highly transferable between networks.
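The pairwise agreement probabilities can be computed as in the following sketch (the label matrix here is randomly filled and purely hypothetical, standing in for the actual network classifications):

```python
from itertools import combinations
import numpy as np

# Hypothetical label matrix: rows = 10 networks, columns = 20 images; each entry
# is the label a network assigned to an image (placeholder values).
rng = np.random.default_rng(0)
network_labels = rng.integers(0, 20, size=(10, 20))

# Probability that two randomly sampled networks assign the same label to an
# image: agreeing comparisons divided by 20 images x 45 possible network pairs.
pairs = list(combinations(range(10), 2))    # 45 pairs of networks
agreeing = sum((network_labels[i] == network_labels[j]).sum() for i, j in pairs)
p_network_network = agreeing / (len(pairs) * network_labels.shape[1])
print(f"P(network-network agreement) = {p_network_network:.2f}")

# The same calculation applies to human-human and network-human agreement, with
# participants' chosen labels substituted for one or both of the networks.
```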

Power analysis

A sample size of N=200 was chosen for each experiment, mirroring Z&F’s Experiments 1–6, in order to detect similar effects. This allowed us to detect an effect size as low as d=0.18 at α=.05 with 0.80 power in within-subject experiments and d=0.35 in between-subject experiments.

Online recruitment

We conducted all experiments online, with recruitment through the Prolific platform. Each sample was recruited from a pool of registered participants who met the following criteria: fluent English speakers living in the UK, USA, Canada or Australia, of either gender, between the ages of 18 and 50, with normal or corrected-to-normal vision and a high feedback rating on the Prolific platform (above 90). Participants were reimbursed for their time upon successful completion through the Prolific system.

Data availability

Data, scripts, and stimuli from all our experiments are available via the Open Science Framework at https://osf.io/a2sh5/. Stimuli from evolutionary runs producing fooling images by Nguyen et al., 2015 can be found at https://anhnguyen.me/project/fooling/.

Acknowledgements

This research was supported by the European Research Council Grant Generalization in Mind and Machine, ID number 741134. We would like to thank Alex Doumas, Jeff Mitchell, Milton Liera Montero and Brian Sullivan for their insights and feedback.

Appendix 1

Appendix 1—figure 1. Agreement across adversarial images from Experiment 3b in Zhou and Firestone, 2019.


The red line represents the mean, the blue line represents the median, and the black reference line represents chance agreement. The inset contains a histogram of agreement levels across the 48 images.

Appendix 1—figure 2. Participant responses ranked by frequency (Experiment 3b).


Each row contains the adversarial image, the DCNN label for that image, and the top eight participant responses. Shaded cells contain the DCNN choice; when the DCNN choice is not ranked in the top 8, it is shown at the end of the row along with its rank in brackets.

Appendix 1—figure 3. Participant responses ranked by frequency (Experiment 3b).


Continued.

Appendix 1—figure 4. Per-item histograms of response choices from Experiment 3b in Zhou and Firestone, 2019.


Each histogram contains the adversarial stimulus and shows the percentage of responses for each choice (y-axis). The choice labels (x-axis) are ordered in the same way as in Appendix 1—figures 2 and 3, from 1 to 48. Black bars indicate the DCNN choice for a particular adversarial image.

Appendix 1—figure 5. Per-item histograms of response choices from Experiment 3b in Zhou and Firestone, 2019.


Continued.

Appendix 1—figure 6. Per-item histograms of response choices from Experiment 3b in Zhou and Firestone, 2019.


Continued.

Reanalysis of Zhou and Firestone, 2019

Roughly similar agreement levels on most images, accompanied by mean agreement above chance, would indicate some systematic underlying overlap between human and network object recognition. However, as shown in Appendix 1—figure 1, there are vast differences in agreement levels depending on the adversarial image. The inset of Appendix 1—figure 1 also shows that the distribution is skewed and that the mean agreement metric overestimates true human-network agreement, which is better represented by the median. A minority of outlier images drive the agreement, giving credence to the hypothesis, discussed in the main text, that there are two separate sources of agreement.

When trying to determine the nature of human-network agreement, it is important to consider distributions of choices on a per-item level, not merely whether agreement was above chance for a particular stimulus. If there was general agreement between humans and networks, one would expect that for most images the most frequently chosen label would be the one assigned to the stimulus by the network. Additionally, it could be expected that a large percentage of participants choose exactly this label. Appendix 1—figures 2 and 3 show the top eight most frequently chosen labels by participants in Experiment 3b from Zhou and Firestone, 2019. It can clearly be seen that the network label was the most frequently chosen label for only a minority of images. Additionally, there is only one stimulus for which a majority of participants chose the network label (‘chainlink fence’, image number 6 in Appendix 1—figure 2), while only for a few others can it be said that a fairly substantial percentage of participants chose the network label. For many of the stimuli, the network label is not amongst the top eight choices made by participants. It can also be observed that for most stimuli the most frequently made choice was not overwhelmingly favoured by participants, meaning the distribution of label choices for most stimuli is flat, indicating guessing rather than agreement (see Appendix 1—figures 4–6 for more information).

As implied by Appendix 1—figures 2 and 3, Appendix 1—figures 4–6 reveal flat distributions of label choices for the vast majority of stimuli. Indeed, only eight histograms resemble what would be expected if there were systematic human-network agreement on classification: for those stimuli, the most often chosen label was the network label and the percentage of participants who chose it peaks well above the percentage of choices for other labels. There are other examples with similar peaks, but in which the most often chosen label was not the one assigned to the stimulus by the DCNN. Overall, this provides evidence for the hypothesis that agreement derives from two sources. First, some stimuli (e.g. 'chainlink fence') can hardly be called adversarial images since they retain almost all of the features of the target category, as well as the relationships between those features. In those rare cases, agreement is trivial. In the other cases in which agreement is above chance, those levels were likely achieved by excluding labels based on a few superficial features. These features were not sufficient for object recognition but did allow participants to exclude labels that do not contain them (e.g. dismissing 'monarch butterfly' when viewing the 'projector' stimulus). We believe this pattern of data supports such a hypothesis.

Appendix 2

Supplementary information for Experiments 1–5

Appendix 2—figure 1. Experiment 1 stimuli and competitive alternative labels.

Appendix 2—figure 2. An item-wise breakdown of agreement levels in Experiment 2 as a function of experimental condition and category.

Average agreement levels for each category in each condition, with 95% CIs, are presented in (a); the black line indicates chance agreement. The best-case stimuli are presented in (b); these stimuli were judged to contain the most features in common with the target category (out of the five generated by Nguyen et al., 2015). The worst-case stimuli are presented in (c); these were judged to contain the fewest features in common with the target category.

Appendix 2—figure 3. An item-wise breakdown of agreement levels for the four conditions in Experiment 3.

Each bar shows the agreement level for a particular image, that is, the percentage of participants who agreed with the DCNN classification of that image. Each sub-figure also shows the images that correspond to the highest (blue) and lowest (red) levels of agreement under that condition.

Appendix 2—figure 4. Experiment 5 stimuli and competitive alternative labels.

Appendix 2—figure 5. Experiment 5 stimuli and competitive alternative labels.

Continued.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Marin Dujmović, Email: marin.dujmovic@bristol.ac.uk.

Gordon J Berman, Emory University, United States.

Ronald L Calabrese, Emory University, United States.

Funding Information

This paper was supported by the following grant:

  • H2020 European Research Council 741134 to Jeffrey S Bowers.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing.

Conceptualization, Resources, Supervision, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing.

Conceptualization, Resources, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing.

Ethics

Human subjects: Participants were informed about the nature of the study, and their right to withdraw during the study or to withdraw their data from analysis. The participants gave consent for anonymized data to be used for research and available publicly. The project has been approved by the IRB at the University of Bristol (application ID 76741).

Additional files

Transparent reporting form

Data availability

Data, scripts and stimuli for all of the experiments are available via the Open Science Framework (https://osf.io/a2sh5).

The following dataset was generated:

Dujmovic M, Malhotra G, Bowers JS. 2020. What do adversarial images tell us about human vision? Open Science Framework. a2sh5

References

  1. Akhtar N, Mian A. Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access. 2018;6:14410–14430. doi: 10.1109/ACCESS.2018.2807385. [DOI] [Google Scholar]
  2. Alcorn MA, Li Q, Gong Z, Wang C, Mai L, Ku W-S, Nguyen A. Strike (with) a pose: neural networks are easily fooled by strange poses of familiar objects. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019. pp. 4845–4854. [DOI] [Google Scholar]
  3. Athalye A, Engstrom L, Ilyas A, Kwok K. Synthesizing robust adversarial examples. arXiv. 2017 https://arxiv.org/abs/1707.07397
  4. Baker N, Lu H, Erlikhman G, Kellman PJ. Deep convolutional networks do not classify based on global object shape. PLOS Computational Biology. 2018;14:e1006613. doi: 10.1371/journal.pcbi.1006613. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Biederman I. Recognition-by-components: a theory of human image understanding. Psychological Review. 1987;94:115–147. doi: 10.1037/0033-295X.94.2.115. [DOI] [PubMed] [Google Scholar]
  6. Biederman I, Cooper EE. Evidence for complete translational and reflectional invariance in visual object priming. Perception. 1991;20:585–593. doi: 10.1068/p200585. [DOI] [PubMed] [Google Scholar]
  7. Biederman I, Cooper EE. Size invariance in visual object priming. Journal of Experimental Psychology: Human Perception and Performance. 1992;18:121–133. doi: 10.1037/0096-1523.18.1.121. [DOI] [Google Scholar]
  8. Biederman I, Ju G. Surface versus edge-based determinants of visual recognition. Cognitive Psychology. 1988;20:38–64. doi: 10.1016/0010-0285(88)90024-2. [DOI] [PubMed] [Google Scholar]
  9. Blything R, Vankov I, Ludwig C, Bowers J. Extreme translation tolerance in humans and machines. Conference on Cognitive Computational Neuroscience.2019. [Google Scholar]
  10. Cadieu CF, Hong H, Yamins DL, Pinto N, Ardila D, Solomon EA, Majaj NJ, DiCarlo JJ. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLOS Computational Biology. 2014;10:e1003963. doi: 10.1371/journal.pcbi.1003963. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Cichy RM, Khosla A, Pantazis D, Torralba A, Oliva A. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports. 2016;6:27755. doi: 10.1038/srep27755. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Cichy RM, Kaiser D. Deep neural networks as scientific models. Trends in Cognitive Sciences. 2019;23:305–317. doi: 10.1016/j.tics.2019.01.009. [DOI] [PubMed] [Google Scholar]
  13. Dodge S, Karam L. A study and comparison of human and deep learning recognition performance under visual distortions. 2017 26th International Conference on Computer Communication and Networks (ICCCN) IEEE; 2017. pp. 1–7. [DOI] [Google Scholar]
  14. Eickenberg M, Gramfort A, Varoquaux G, Thirion B. Seeing it all: convolutional network layers map the function of the human visual system. NeuroImage. 2017;152:184–194. doi: 10.1016/j.neuroimage.2016.10.001. [DOI] [PubMed] [Google Scholar]
  15. Elsayed G, Shankar S, Cheung B, Papernot N, Kurakin A, Goodfellow I, Sohl-Dickstein J. Adversarial examples that fool both computer vision and time-limited humans. Advances in Neural Information Processing Systems; 2018. pp. 3910–3920. [Google Scholar]
  16. Geirhos R, Rubisch P, Michaelis C, Bethge M, Wichmann FA, Brendel W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv. 2018 https://arxiv.org/abs/1811.12231
  17. Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. arXiv. 2014 https://arxiv.org/abs/1412.6572
  18. Goodfellow IJ, Papernot N, Huang S, Duan R, Abbeel P, Clark J. Attacking machine learning with adversarial examples. [June 18, 2002];2017 https://openai.com/blog/adversarial-example-research/
  19. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision; 2015. pp. 1026–1034. [DOI] [Google Scholar]
  20. Hosseini H, Poovendran R. Semantic adversarial examples. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; 2018. pp. 1614–1619. [DOI] [Google Scholar]
  21. Hummel JE, Stankiewicz BJ. Categorical relations in shape perception. Spatial Vision. 1996;10:201–236. doi: 10.1163/156856896X00141. [DOI] [PubMed] [Google Scholar]
  22. Ilyas A, Santurkar S, Tsipras D, Engstrom L, Tran B, Madry A. Adversarial examples are not bugs, they are features. arXiv. 2019 https://arxiv.org/abs/1905.02175
  23. Karmon D, Zoran D, Goldberg Y. Lavan: localized and visible adversarial noise. arXiv. 2018 https://arxiv.org/abs/1801.02608
  24. Khaligh-Razavi SM, Kriegeskorte N. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLOS Computational Biology. 2014;10:e1003915. doi: 10.1371/journal.pcbi.1003915. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems; 2012. pp. 1097–1105. [Google Scholar]
  26. Kubilius J, Schrimpf M, Nayebi A, Bear D, Yamins DL, DiCarlo JJ. CORnet: modeling the neural mechanisms of core object recognition. bioRxiv. 2018 doi: 10.1101/408385. [DOI]
  27. Malhotra G, Evans BD, Bowers JS. Hiding a plane with a pixel: examining shape-bias in CNNs and the benefit of building in biological constraints. Vision Research. 2020;174:57–68. doi: 10.1016/j.visres.2020.04.013. [DOI] [PubMed] [Google Scholar]
  28. Nguyen A, Yosinski J, Clune J. Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. pp. 427–436. [DOI] [Google Scholar]
  29. Papernot N, McDaniel P, Jha S, Fredrikson M, Celik ZB, Swami A. The limitations of deep learning in adversarial settings. 2016 IEEE European Symposium on Security and Privacy (EuroS&P); 2016. pp. 372–387. [DOI] [Google Scholar]
  30. Peterson JC, Abbott JT, Griffiths TL. Adapting deep network features to capture psychological representations: an abridged report. International Joint Conference on Artificial Intelligence; 2017. pp. 4934–4938. [Google Scholar]
  31. Peterson JC, Abbott JT, Griffiths TL. Evaluating (and improving) the correspondence between deep neural networks and human representations. Cognitive Science. 2018;42:2648–2669. doi: 10.1111/cogs.12670. [DOI] [PubMed] [Google Scholar]
  32. Pomerantz JR, Portillo MC. Grouping and emergent features in vision: toward a theory of basic gestalts. Journal of Experimental Psychology: Human Perception and Performance. 2011;37:1331–1349. doi: 10.1037/a0024330. [DOI] [PubMed] [Google Scholar]
  33. Rajalingham R, Issa EB, Bashivan P, Kar K, Schmidt K, DiCarlo JJ. Large-Scale, High-Resolution comparison of the core visual object recognition behavior of humans, monkeys, and State-of-the-Art deep artificial neural networks. The Journal of Neuroscience. 2018;38:7255–7269. doi: 10.1523/JNEUROSCI.0388-18.2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Rauber J, Brendel W, Bethge M. Foolbox: a Python toolbox to benchmark the robustness of machine learning models. arXiv. 2017 https://arxiv.org/abs/1707.04131
  35. Ritter S, Barrett DG, Santoro A, Botvinick MM. Cognitive psychology for deep neural networks: a shape bias case study. Proceedings of the 34th International Conference on Machine Learning, JMLR.org; 2017. [Google Scholar]
  36. Sabour S, Frosst N, Hinton GE. Dynamic routing between capsules. Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc; Red Hook, United States. 2017. pp. 3859–3869. [Google Scholar]
  37. Yamins DL, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. Performance-optimized hierarchical models predict neural responses in higher visual cortex. PNAS. 2014;111:8619–8624. doi: 10.1073/pnas.1403112111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Yamins DL, DiCarlo JJ. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience. 2016;19:356–365. doi: 10.1038/nn.4244. [DOI] [PubMed] [Google Scholar]
  39. Zhou Z, Firestone C. Humans can decipher adversarial images. Nature Communications. 2019;10:1334. doi: 10.1038/s41467-019-08931-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision letter

Editor: Gordon J Berman1
Reviewed by: Chaz Firestone2

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

The authors address the very timely and important question of whether primate vision functions in a similar manner to modern deep learning methods, in particular deep convolutional neural networks (DCNNs). Several prominent groups have argued that the firing patterns of DCNNs resemble those seen in neural recordings of primate vision, lending some evidence to the idea that DCNNs and primate vision may be similar. By reanalyzing the data from a previous study claiming that "Humans can decipher adversarial images" and by performing a new set of experiments, the authors show that humans and DCNNs only weakly agree and that agreement is highly dependent on experimental design choices.

Decision letter after peer review:

Thank you for submitting your article "What do adversarial images tell us about human vision?" for consideration by eLife. Your article has been reviewed by Ronald Calabrese as the Senior Editor, a Reviewing Editor, and three reviewers.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require a modest amount of additional new data, as they do with your paper, we are asking that the manuscript be revised to either limit claims to those supported by data in hand, or to explicitly state that the relevant conclusions require additional supporting data.

Our expectation is that the authors will eventually carry out the additional experiments and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.

Summary:

In their paper, Dujmović and collaborators address the very timely and important question of whether primate vision functions in a similar manner to modern deep learning methods, in particular deep convolutional neural networks (DCNNs). There is tremendous interest in this idea. Several prominent groups have argued that the firing patterns of DCNNs resemble those seen in neural recordings of primate vision, lending some evidence to the idea that DCNNs and primate vision may be similar. One common objection to this analogy is that DCNNs and humans seem to have very different responses to "adversarial images that fool DCNNs but are uninterpretable to humans." The existence of these adversarial images is often used to argue that DCNNs and primate vision function in very different ways. However, this view was recently challenged by Zhou and Firestone, 2019, who argued that "Humans can decipher adversarial images".

This submission challenges that claim. The manuscript starts by reanalyzing the data from the Zhou and Firestone, 2019 experiments. The authors point out that the original analysis used a summary statistic on individual responses, rather than the full distribution of responses, arguing that the degree of agreement between DCNNs and human vision in the original experiments was much less than presented in Zhou and Firestone, 2019. In addition, the authors designed four new experiments that used the same basic experimental design as Zhou and Firestone, 2019 but with a different approach. In particular, in Experiment 1, they address the idea that superficial features were driving the similarity in classifying adversarial examples between DCNNs and humans by using "less adversarial" images. In Experiment 2, they checked whether a subset of categories were represented in a similar manner between humans and DCNNs, or whether the agreement in Zhou and Firestone, 2019 on certain images arose simply because a particular adversarial image used by Zhou and Firestone, 2019 was superficially similar to the real image. Finally, they show in Experiments 3 and 4 that it is possible to generate adversarial images that humans and DCNNs recognize in completely different ways.

Essential revisions:

While the reviewers felt that the paper was of sufficient interest for publication (and a fraction were very enthusiastic about its contributions to the literature and its eventual acceptance at eLife), there were five key points that would be essential to address before the article could be accepted, enumerated below. Additionally, we also ask the authors to address the specific comments in the section below.

1) All three reviewers felt strongly that the language/framing of the paper should be tempered to more constructively reflect the relationship of this paper to the literature and to prevent misinterpretation. Specifically, although the discussion about which statistical methods should be computed to support the claims is key to the manuscript and an important point, there are several places (see specific comments for details) where the reader could get the impression that the authors are claiming that Zhou and Firestone, 2019 computed statistical tests incorrectly (rather than taking a different approach than the one suggested by the authors here), for example by using language such as "statistically unsound". Discussion of the question of which tests to compute would be valuable to this field at the interface of ML and psychology, and so we ask the authors to provide clarity in the text accordingly.

2) In their original paper, Zhou and Firestone, 2019 write "We conclude that human intuition is a more reliable guide to machine (mis)classification than has typically been imagined." We ask the authors to comment on whether their re-analyses (which still show significantly above-chance classification in Zhou and Firestone, 2019's experiments; e.g., in the authors' Table 2) support this modest conclusion, even though those analyses are still consistent with a fairly low overall level of human-DCNN agreement.

3) Similarly, the reviewers often had difficulty understanding how the subjectively "best" and "worst" adversarial images were chosen for the experiments. Ideally, this would be validated by, say, polling a random group of people: would everyone agree (measured statistically) with these rankings? In the absence of such polling, though, we ask the authors to state more explicitly which set of adversarial images is used where and to discuss more fully how these choices lead to the conclusions they draw.

4) The authors consider the existence of adversarial images as a strong counterpoint to the possibility that humans and DCNNs categorize images in the same way. An alternative is that, for many cases, humans and DCNNs are categorizing images in similar ways, but that for some edge cases, DCNNs fail, and we have not yet figured out how to build a DCNN that recognizes when it is being fooled. But what does it mean to have the same way of categorizing images? In this paper, the implicit view is that if humans and the DCNN disagree on this suite of particular experiments, then they are working in fundamentally different ways. Under this definition, do all humans categorize images in one particular way? Do all DCNNs, including DCNNs with different architectures? To make their claim, the authors need to set clear criteria by which to judge that image categorization is the same across networks, and show that variability across DCNNs is smaller than the variability between humans and DCNNs.

5) Relatedly, the particular adversarial images to which a network is sensitive reflect the training set that determined the DCNN. The DCNN in this study was trained on the ImageNet dataset. How would another artificial network, trained on a different image dataset, classify these adversarial images? This would establish which of their findings are expected of any pair of networks with different training sets.

To address 4) and 5), it would be helpful to observe the performance of additional DCNNs, trained on completely separate image banks, on each of the four experiments. For example, in Experiment 4, participants had to distinguish between adversarial images generated for ResNet-18 and AlexNet. For 3), how well does ResNet-18 identify AlexNet adversarial images? For 4) If you had two instantiations of AlexNet ("AlexNet1" and "AlexNet2") but trained them on disjoint sets of images from the same classes, would AlexNet1 agree with AlexNet2 on adversarial images generated for AlexNet2? If it turns out that, for instance, AlexNet1 agrees with AlexNet2 at chance level, then the interpretation of chance-level human performance is not valid.
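As an illustration of the kind of cross-network comparison requested here, the following is a minimal sketch (not code from the study) that loads ImageNet-pretrained AlexNet and ResNet-18 from torchvision and counts how often the two networks assign the same label to a directory of adversarial images. The directory name and the preprocessing pipeline are assumptions and would need to match how the images were generated; comparing two AlexNets trained on disjoint image sets would follow the same pattern with different checkpoints.

```python
import torch
from torchvision import models, transforms
from PIL import Image
from pathlib import Path

# Standard ImageNet preprocessing (an assumption; match the generation pipeline).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

alexnet = models.alexnet(pretrained=True).eval()
resnet18 = models.resnet18(pretrained=True).eval()

paths = sorted(Path("adversarial_images").glob("*.png"))  # hypothetical directory
agree = 0
with torch.no_grad():
    for path in paths:
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        pred_a = alexnet(x).argmax(dim=1).item()    # AlexNet's top-1 class index
        pred_r = resnet18(x).argmax(dim=1).item()   # ResNet-18's top-1 class index
        agree += int(pred_a == pred_r)

print(f"cross-model agreement: {agree}/{len(paths)}")
```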

Note: comments #12 and onward were written by reviewer #1.

1) DMB's summary of Zhou and Firestone, 2019's result in the Introduction is ambiguous and open to misreading. Readers might hear DMB's paragraph as stating that Zhou and Firestone, 2019 reported that humans agreed with machines 90% of the time they were asked, which of course Zhou and Firestone, 2019 do not report. A more precise and accurate summary of Zhou and Firestone, 2019's result would be something like this:

2) "The authors took a range of published adversarial images that were claimed to be uninterpretable by humans, and in a series of experiments, they showed those images to human subjects next to the DCNN's preferred label and various foil labels. They reported that, over the course of an experimental session, a high percentage of participants (often close to 90%) chose the DCNN's preferred label more often than would be expected by chance."

3). The final sentence of the Abstract needs to be tempered somewhat. DCNNs are a "promising model," in particular for many aspects of lower-level vision, but they are not a complete model.

4) Introduction: Correct the spelling of inferotemporal cortex.

5) Introduction: The paper revolves on the analysis of the classification of adversarial images, yet beyond showing examples of types of adversarial image, there is not a clear definition of what an adversarial image is and to what extent adversarial images are classified into types (e.g., 'fooling' or 'naturalistic'). Are these all the types of adversarial images?

6) Subsection “Reassessing the level of agreement in Zhou and Firestone (2019)” and Materials and methods section: Why are choices independent in the null model? Was the task constructed so that choices may repeat? If so, then this is fine. On the other hand, if only 48 images and 48 labels are presented, a participant may treat this as a 'matching' task, inducing correlations between choices. For 48 items, the correction to the null distribution of the number of correct choices is minor, but it would not be for a smaller number of items. Either way, a clearer description of the task design is warranted.
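To make this point concrete, the following minimal simulation (an illustration, not part of the study) compares the two null models for 48 images and 48 labels: independent uniform choices versus a matching task in which each label is used exactly once (a random permutation). Both give an expected number of correct choices of about one, which is why the correction is minor at this set size.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_sims = 48, 50_000

# Null 1: independent choices (the same label can be chosen for several images).
indep = rng.integers(0, n_items, size=(n_sims, n_items))
correct_indep = (indep == np.arange(n_items)).sum(axis=1)

# Null 2: matching task (each label used exactly once, i.e. a random permutation).
correct_match = np.array([
    (rng.permutation(n_items) == np.arange(n_items)).sum() for _ in range(n_sims)
])

print("independent choices:  mean", correct_indep.mean(), "var", correct_indep.var())
print("matching/permutation: mean", correct_match.mean(), "var", correct_match.var())
```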

7) Subsection “Reassessing the level of agreement in Zhou and Firestone (2019)”and Table 1: Please explain what the 'Images' column means in this table. Up to this point, the text has referred only to the participant-level analysis.

8) Subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)”: grammar – missing definite article

9) Subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)”: grammar – colloquial usage

10) Discussion section: This final conclusion, that DCNNs that also incorporate a number of additional features would be less sensitive to adversarial attacks, is interesting. I would recommend reading and possibly citing work on "capsule networks" from Hinton/DeepMind.

11) Figure Appendix 1—figure 4: check cross-reference – there's a "?"

12) Critiques of scientific research, including this one, are important, valuable, and welcome. At the same time, a core obligation of any such critique is to present its target charitably, and to engage its strongest arguments. I'm worried that this paper does not do that in its current form. A key piece of context that, in my opinion, DMB's paper loses sight of is that Zhou and Firestone, 2019 were directly motivated by Nguyen et al.'s statement that their fooling images were "completely" and "totally unrecognizable to human eyes" – emphasis on "completely" and "totally". Zhou and Firestone, 2019 noticed that many of the images did seem to have some features in common with their target class, and that Nguyen et al. had collected no human data on this; and so Zhou and Firestone, 2019 simply set out to show that the fooling images in Nguyen et al., were not, as a rule, "totally unrecognizable". It's my strong opinion, even after reading and accepting much of what DMB say, that Zhou and Firestone, 2019's data and analyses (and even DMB's reanalyses) still support Zhou and Firestone, 2019's modest conclusion, and I strongly encourage the authors to consider the same. Note that even Nguyen himself agreed; he wrote to us after reading Zhou and Firestone, 2019's paper to say: "After reading this paper, I develop a stronger intuition that humans can decipher very well those robust, and model-transferrable AXs" and that "I personally agree that the phrase 'totally unrecognizable to human eyes' might have been an overstatement. Thanks to your work for pointing it out!". I agree with that statement too.

Indeed, throughout the paper, DMB write statements like Zhou and Firestone, 2019's "conclusions are not justified" [Introduction], but without actually stating what Zhou and Firestone, 2019's conclusions are. In fact, Zhou and Firestone, 2019's conclusions are clear and explicit:

- "We conclude that human intuition is a more reliable guide to machine (mis)classification than has typically been imagined" (final paragraph of Zhou and Firestone, 2019's Introduction, where "than has typically been imagined" refers to Nguyen et al.'s claim).

- "This implies at least some meaningful degree of similarity in the image features that humans and machines prioritize – or can prioritize – when associating an image with a label" (first paragraph of Zhou and Firestone, 2019's Discussion).

and

- "The present results suggest that this particular challenge to notions of human-machine similarity may not be as simple as it appears (though there may of course be other reasons to doubt the similarity of humans and machines)" (first paragraph of Zhou and Firestone, 2019's Discussion section).

In other words, Zhou and Firestone, 2019's conclusion is not that there's a high level of agreement (though see more on that below), or that DCNNs are good models of human vision, etc.; it is simply that the images are not "totally unrecognizable" in the way Nguyen et al. had claimed. And there can be no doubt that this is Zhou and Firestone, 2019's conclusion, since the only time Zhou and Firestone, 2019 even use the word "conclude" is in the above quote.

It is my strong opinion, perhaps my strongest opinion throughout this review, that DMB have not painted that conclusion in an accurate, fair, or charitable way. And so it is also my very strong opinion that the paper must be revised to state that conclusion accurately, and to consider whether the data as a whole (including Zhou and Firestone, 2019's and DMB's analyses) truly are inconsistent with it. Human-machine agreement can be low (as DMB suggest) while still being consistent with Zhou and Firestone, 2019's conclusion that the images are not "totally unrecognizable". There is not as much disagreement between Zhou and Firestone, 2019 and DMB as DMB makes it seem, and there's no reason not to acknowledge that agreement when it exists.

13) The authors claim that Zhou and Firestone, 2019 reported a "surprisingly large agreement between humans and DCNNs" [Introduction]; "the reported level of agreement is nearly 90%" [Results section]. These statements are inaccurate, or at least imprecise. Zhou and Firestone, 2019 do not report a "level of agreement of nearly 90%": "level of agreement" is a new phrase that DMB use (appearing nowhere in Zhou and Firestone, 2019's paper) that makes it seem as though Zhou and Firestone, 2019 reported something they did not, and that also makes it seem that Zhou and Firestone, 2019's measure is aimed at the same quantity as DMB's measure. Additional clarity here would be appreciated, since this comes up later (e.g., in saying that the agreement was "lower than reported", which I worry is misleading – the two measures simply report different quantities).

14) Similarly, DMB wait too long to note that Zhou and Firestone, 2019 already computed DMB's "average agreement" measure; indeed, it was the very first analysis in Zhou and Firestone, 2019's paper, but DMB only later acknowledge this, instead starting with Zhou and Firestone, 2019's Experiment 3 two pages earlier. In Experiment 1, Zhou and Firestone, 2019 state: "Classification "accuracy" (i.e., agreement with the machine's classification) was 74%, well above chance accuracy of 50% (95% confidence interval: [72.9, 75.5%]; two-sided binomial probability test: p < 0.001)". Indeed, this alone shows how it is misleading of DMB to claim that Zhou and Firestone, 2019 report a "level of agreement" near ceiling; Zhou and Firestone, 2019 are clear that "agreement with the machine's classification" is 74%, just as DMB report.

15) I stated above that I still think Zhou and Firestone, 2019's analyses support their conclusions. To see why, consider an analogy. Suppose someone claimed that dieting is "completely" and "totally" ineffective for losing weight, as Nguyen et al., claimed that their fooling images were "totally unrecognizable to human eyes". If you were skeptical (as Zhou and Firestone, 2019 were), you could assign subjects to a diet and see if they lose weight. There are then two standard approaches to evaluating whether the diet was "totally ineffective". You could (1) ask how much weight the cohort lost as a whole, and whether that amount deviates significantly from what would be expected under the null; or you could (2) ask how many people lost weight, and ask whether that number significantly deviates from what would be expected under the null. For example, you could (1) discover that the cohort lost 10% of their bodyweight on average, and that this number deviated significantly from 0%; or you could (2) discover that 95% of dieters lost some amount of weight (whether that amount was a lot or a little), and that this number significantly deviated from 50%.

Zhou and Firestone, 2019 take both approaches – (1) and (2) – in their Experiment 1, but then settle on (2) for the rest, because it seemed better suited to refute the claim that adversarial images are "completely unrecognizable to humans". By contrast, (1) is an approach DMB prefer, as is evident from Table 2. That's fine; as DMB show in that table, it also confirms Zhou and Firestone, 2019's conclusions by showing significantly above-chance average agreement. (I comment on Table 1 below.) What should be clear, however, is that these two approaches are both valid, and indeed have certain strengths and weaknesses over one another. For example, if approach (1) finds that subjects lost 10% of their bodyweight on average, that still might not refute the claim that the diet is ineffective, since it could be that most subjects gained a small amount of weight while a minority of subjects lost a very large amount of weight – a possibility that would make this measure ill-tailored to refuting claims of ineffectiveness (since it causes weight gain in most people). Conversely, if you used approach (2) and found that 95% of people lost some amount of weight, then that wouldn't tell you how much weight the cohort (or any one person) lost; but it is still informative, because if 95% of people lose weight on a diet (whether that amount is 1 pound or 50), then whatever you think of this diet, it's hard to plausibly claim that it's "completely" and "totally" ineffective (even if it might not be very effective overall). And, of course, both approaches combined would be best. But, again, they're both valid: If 98% of subjects pick the machine's label numerically more often than not, and if 98% is significantly higher than 50%, then it just can't be true that these images are "completely" and "totally" undecipherable – even if any given subject is only agreeing a bit, and even if that agreement is not even statistically significant within a single subject (just like one needn't say of any given dieter that they lost a "significant" amount of weight; if 95% of dieters lose weight, then that is telling, as long as 95% significantly differs from 50% in the sample).
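For clarity, the two approaches described in this analogy can be written down directly. The sketch below is hypothetical (the chance level and the file name are assumptions, and ties would push the null proportion for approach (2) somewhat below 50%, as discussed elsewhere in this letter): it tests (1) whether mean per-participant agreement exceeds chance and (2) whether more participants than expected agree numerically more often than chance.

```python
import numpy as np
from scipy import stats

# Hypothetical data: one agreement rate per participant (proportion of trials
# on which they chose the DCNN's label); chance depends on the experiment.
agreement = np.load("per_participant_agreement.npy")  # assumed file
chance = 1.0 / 48

# Approach (1): is mean agreement above chance?
t, p1 = stats.ttest_1samp(agreement, popmean=chance, alternative="greater")

# Approach (2): do more than 50% of participants agree numerically above chance?
# (Ties make the exact null proportion slightly lower than 0.5 with few trials.)
n_above = int((agreement > chance).sum())
p2 = stats.binomtest(n_above, n=len(agreement), p=0.5, alternative="greater").pvalue

print(f"(1) mean = {agreement.mean():.3f}, chance = {chance:.3f}, p = {p1:.4f}")
print(f"(2) {n_above}/{len(agreement)} participants above chance, p = {p2:.4f}")
```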

So DMB are being too critical. There are multiple ways to understand these data, depending on the researchers' goals. DMB are correct to say that Zhou and Firestone, 2019's analyses in Experiment 3 "assign the same importance to a participant that agrees on 2 out of 48 trials as a participant who agrees on all 48 trials with the DCNN" [84]; but this just isn't a problem for Zhou and Firestone, 2019's research question. Zhou and Firestone, 2019's goal was to refute the claim that the fooling images in Nguyen et al. were "completely unrecognizable to humans". Perhaps DMB do not share this goal; but that is no reason to call Zhou and Firestone, 2019's approach "misleading and statistically unsound". Both Zhou and Firestone, 2019's approach and DMB's approach are fine. And, again, both approaches ended up coming out Zhou and Firestone, 2019's way, as DMB confirm in Table 2.

The paper should be revised to reflect this. There is not actually a large disagreement here between Zhou and Firestone, 2019 and DMB as regards to how Zhou and Firestone, 2019's data interact with Nguyen et al.'s claim (which is what Zhou and Firestone, 2019 were after). To give an example (though of course DMB could use language other than mine), the paper could say something like "Whereas the analyses Zhou and Firestone, 2019 use may have been sufficient to refute previous claims that such images were "totally unrecognizable to human eyes", those analyses are still consistent with a fairly low overall level of human-machine agreement." (But of course, much else would then have to change too, since DMB are so persistent in calling Zhou and Firestone, 2019's conclusions unjustified.)

16) A central issue with the re-analyses is how much DMB's discussion focuses on Zhou and Firestone, 2019's Experiment 3b. In my opinion, this was a strange experiment to choose, especially when it comes to giving a charitable critique. The entire purpose of Zhou and Firestone, 2019's Experiment 3b was to make human-machine agreement as difficult as possible, and this is the single least-discussed experiment (of 8) in Zhou and Firestone, 2019's entire paper – not the cornerstone of Zhou and Firestone, 2019's claims in any way. When Zhou and Firestone, 2019 observed above-chance human-machine agreement in Experiment 1, there was a worry that above-chance agreement when given two alternatives wasn't so impressive. So, Experiments 3a and 3b presented all the labels for every image at once, to see if subjects could show above-chance agreement even in extremely taxing circumstances. Zhou and Firestone, 2019 were very explicit about this: Zhou and Firestone, 2019 say "These results suggest that humans show general agreement with the machine even in the taxing and unnatural circumstance of choosing their classification from dozens of labels displayed simultaneously". So Zhou and Firestone, 2019 already described this task as "taxing and unnatural" and unlikely to produce high human-machine agreement. And indeed it's certain that subjects do not actually read all the labels before making judgments, because they often answer in just a few seconds – not enough time to have looked at all the labels. So, of course agreement will be low – that was the point of that experiment.

Indeed it is possible to see here – http://www.czf.perceptionresearch.org/adversarial/expts/texture48.html – how imposing that is, whereas Experiment 1 – here http://www.czf.perceptionresearch.org/adversarial/expts/texture.html – feels much more natural. Yet, subjects in Experiment 3a and 3b still showed a mean agreement of ~10%, rather than the chance level of ~2%. We thought that was impressive, given the circumstances. DMB may disagree, but that is simply a difference in opinion or taste, not an undermining of Zhou and Firestone, 2019's claims. (Indeed, DMB later describe this result as "striking"; strikingly high? If so, why not say so earlier?)

To be fairer and more accurate, the relevant claims in DMB's paper should also include Zhou and Firestone, 2019's Experiment 1 as an example. For example, subsection “Reassessing the level of agreement in Zhou and Firestone (2019)” should first describe how Zhou and Firestone, 2019 analyzed Experiment 1, and then describe Experiment 3. Figure 1.1 should include Experiment 1 and Experiment 3. And so on. Otherwise, I worry that DMB might be cherry picking, choosing the examples from Zhou and Firestone, 2019 that they feel are weakest, rather than the ones they feel are strongest. For example, DMB state in subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)” that "For the rest of the images agreement is at or below chance levels". But that's only true of Experiment 3; for Zhou and Firestone, 2019's Experiment 1, which was not designed to be so taxing on subjects, even DMB's binomial analysis shows that agreement is significantly above chance on over 85% of images. It is potentially misleading (if not simply false) to claim that "For the rest of the images agreement is at or below chance levels". This discussion should be more balanced, and either focus on Experiment 1 or at least include parallel analyses for Experiment 1's results whenever DMB discuss Experiment 3.

17) Beyond the re-analyses, I'm worried that DMB present their views as alternatives to Zhou and Firestone, 2019, when in fact many of these views were already explicitly articulated by Zhou and Firestone, 2019. I've already emphasized how this is true of Zhou and Firestone, 2019's main conclusion. But another example is DMB's suggestion that human-machine agreement on these images reflects "participants making educated guesses based on some superficial features (such as colour) within images and the limited response alternatives presented to them" [Introduction]. Truly, this is already Zhou and Firestone, 2019's exact view about those experiments. Again, with quotes, Zhou and Firestone, 2019 say that subjects "may have achieved this reliable classification not by discerning any meaningful resemblance between the images and their CNN-generated labels, but instead by identifying very superficial commonalities between them (e.g., preferring "bagel" to "pinwheel" for an orange yellow blob simply because bagels are also orange-yellow in color)". That applies to Experiments 1 and 3, which use the same labels.

DMB subtly acknowledge this elsewhere, but otherwise they present this hypothesis as if it is original to them. Consider subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)” or subsection “Experiment 5: Transferable adversarial images”, which says of Elsayed et al., that "These findings are consistent with our observation that some adversarial images capture some superficial features that can be used by participants to make classification decisions" (emphasis added). But this is already Zhou and Firestone, 2019's exact hypothesis. Both Zhou and Firestone, 2019 and DMB believe subjects are making educated guesses. (Indeed, Zhou and Firestone, 2019 think that may be the right way to describe what the CNN is doing too: Zhou and Firestone, 2019 write "both the CNNs' behavior and the humans' behavior might be readily interpreted as simply playing along with picking whichever label is most appropriate for an image", and "CNNs are.… forced to play the game of picking whichever label in their repertoire best matches an image (as were the humans in our experiments)".

Perhaps "consistent with Zhou and Firestone, 2019's and our observation" would be a better statement for this and other examples. But it's simply not right to portray this as DMB's own or original interpretation; it is precisely Zhou and Firestone, 2019's interpretation, as these quotes show. This is another example where DMB portray disagreement that may not really exist, and so should attribute this interpretation to Zhou and Firestone, 2019.

18) "Using a statistic that treats 45% agreement as chance is liable to be misinterpreted by readers" [subsection “Reassessing the level of agreement in Zhou and Firestone (2019)”]. Perhaps this is true; I agree that Zhou and Firestone, 2019 could have been clearer. But doesn't this support the opposite of DMB's point? The reader that DMB are imagining would normally take chance to be 50% (as Zhou and Firestone, 2019's figures show, perhaps unclearly), whereas the true value of above-chance-performing participants is 88% or 90%. So the fact that chance is 45% rather than 50% makes the result more impressive, not less impressive, since 88% is even farther above chance than the reader was led to believe.

19) In the very next paragraph, DMB make the exact kind of misleading statement that they just charged Zhou and Firestone, 2019 with, and in a way that, I worry, makes this part of the reanalysis potentially misleading. DMB say "Measured in this manner (independent 2-tailed binomial tests with a critical p-value of 0.05 for each participant), the agreement in Experiment 3a between DCNN and participants drops from 88% to 57.76%" [Results section]. But in fact there is no "drop", because chance changes across these two measures. For the 88% measure, chance is 45%; but for the 57.76% measure, chance is 5% (or even 2.5%), since that's the α value of the test that DMB run. In other words, you would expect 5% of subjects to perform significantly different from chance under the null. So DMB have moved the standard without properly alerting the reader. Indeed, 57.76% when chance is 5% is, if anything, more impressive than 88% when chance is 50%. This paragraph should be revised to reflect this. It could say something like, "Measured in this manner, the agreement in Experiment 3a between DCNN and participants moves from 88% (where chance is 45%) to 57.76% (where chance is 5%)"; it would then be clear that "drops" isn't appropriate. They should also report this for Experiment 1, in line with my comment #3 above.
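The chance level implied by this per-participant analysis can also be checked by simulation: under a null of pure guessing, no more than roughly α of participants (and, given the discreteness of the binomial, typically fewer) would be flagged as agreeing significantly above chance. A minimal, purely illustrative sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_participants, n_trials, n_labels = 200, 48, 48
p_chance = 1.0 / n_labels

# Simulated null participants who guess at random on every trial.
hits = rng.binomial(n_trials, p_chance, size=n_participants)

# Two-tailed binomial test per participant, alpha = 0.05; count above-chance flags.
sig_above = sum(
    stats.binomtest(int(k), n_trials, p_chance).pvalue < 0.05 and k > n_trials * p_chance
    for k in hits
)
print(f"flagged as significantly above chance under the null: {sig_above}/{n_participants}")
```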

20) This same issue appears in Table 1. To be more interpretable, that table should not only report these %s but should also report the chance level for each %, just as DMB do for Table 2. That will help the reader understand how to interpret what DMB call the "drop" from higher numbers to lower numbers. I strongly recommend this; the table is, I worry, highly misleading without this correction – i.e., a column for something like "what would be expected by chance" (which, I gather, is 5%). Indeed, a problem with the analysis in Table 1 is that, in some sense, there's no "null hypothesis"; there's no standard by which the analysis can decide whether the cohort of images was "totally unrecognizable" or not. The value of Zhou and Firestone, 2019's analysis, and DMB's Table 2, is that it's clear how to reject the null hypothesis: i.e., if significantly more images are numerically above chance than are numerically below chance (Zhou and Firestone, 2019), or if mean agreement is significantly above chance (Zhou and Firestone, 2019 and DMB). But DMB's Table 1 analysis doesn't really work that way. So that's why it's crucial to portray what chance is for those tests, in the table itself, so that readers can see that it was 85% agreement when chance was 5%, or 57% when chance was 5%, etc.

21) A final issue with the reanalyses themselves, and in some ways the biggest one, is just that Zhou and Firestone, 2019's conclusions survive them, in ways that make it extremely confusing to me why DMB draw the conclusions they do. As Table 2 shows, every one of Zhou and Firestone, 2019's experiments continues to show significantly above-chance human-machine agreement, even on the analytical approach that DMB prefer. And DMB acknowledge this: They say "it is nevertheless the case that even these methods show that the overall agreement was above chance" [subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)”]. First, DMB should not wait this long to say so; they should say as early as possible that their analyses, like ours, show significantly above-chance agreement. But second, this demonstrates that Zhou and Firestone, 2019's conclusions, as stated above, are secure after all. I must repeat again that Zhou and Firestone, 2019's claims, made explicitly in their paper, are as follows: "that human intuition is a more reliable guide to machine (mis)classification than has typically been imagined", and that the results "impl[y] at least some meaningful degree of similarity in the image features that humans and machines prioritize-or can prioritize-when associating an image with a label". DMB confirm that these conclusions are even more robust than Zhou and Firestone, 2019 suggested, since they are supported by Zhou and Firestone, 2019's approach and DMB's.

22) Moving to the experiments: DMB's abstract states that "it is easy to generate images with no agreement" [Abstract]. I don't see how this claim is justified by their experiments. First, half or more of DMB's experiments reflect a failure to "generate images with no agreement". Experiment 1 shows above-chance agreement: "A single sample t-test comparing the mean agreement level to the fixed value of 25% did show the difference was significant (t(99) = 3.00, p = .0034, d = 0.30)" [179]. Experiment 2 does so as well: when the best and worst cases are combined, their average agreement is above chance – it's only through picking images that DMB think look undecipherable that they were able to find undecipherable images (note that Zhou and Firestone, 2019 never state that all images will be undecipherable, but rather that the procedure for producing such images will tend to produce decipherable images in general; Experiment 2 confirms this). Experiment 3 does so as well: "Participants were slightly above chance for indirectly-encoded MNIST images (t(197) > 6.30, p <.0001, d = 0.44)" [subsection “Experiment 3: Different types of adversarial images”]. (Only Experiment 4 does not; I'll return to that later.)

This was confusing. Why, if it is "easy" to generate images with no agreement, did DMB so frequently fail to do so? This conclusion should be altered to something like "While difficult, it is possible to generate images with no agreement"; but surely not "easy"!

23) Experiment 1 is hard to interpret as run; or, if it is, it doesn't show a lack of human-machine agreement. The authors chose labels that they subjectively felt were good competitors for the DCNN's label, and found that agreement dropped but was still significantly above chance. This is unsurprising and no threat to Zhou and Firestone, 2019's view, for at least three reasons.

First, this conclusion was already explicitly reached by Zhou and Firestone, 2019, who also considered this and whose Experiment 2 showed that human-machine agreement drops with more competitive labels. DMB mention Zhou and Firestone, 2019's Experiment 2, and criticize it on other grounds; but those grounds are independent of this aspect of Zhou and Firestone, 2019's discussion. In other words, even granting DMB's criticism of Zhou and Firestone, 2019's Experiment 2, Zhou and Firestone, 2019 still perfectly anticipate the conclusion of DMB's Experiment 1. But again, as above, DMB do not credit Zhou and Firestone, 2019 with this, and instead present this as though it is original to DMB. DMB should acknowledge that their Experiment 1 confirms the results of Zhou and Firestone, 2019's Experiment 2 – that more competitive labels should reduce agreement.

Second, DMB's version (but not Zhou and Firestone, 2019's version) is much more difficult to interpret because it is almost a form of "double-dipping": The researchers, DMB, are human beings with visual systems who can appreciate which images look least like certain labels; so they picked the foil labels they thought fit best, and then discovered that other humans (their subjects) agreed. But Zhou and Firestone, 2019 never claimed that humans would pick the DCNN's label as the literal #1 label among all 1000! Again, Zhou and Firestone, 2019's claim is only that humans favor the DCNN's label better than would be expected if the images were "totally unrecognizable to human eyes". So, this experiment is perfectly consistent with Zhou and Firestone, 2019's view. I keep returning to this because DMB have made Zhou and Firestone, 2019's paper the cornerstone of their own paper. If, instead, they just presented four new interesting experiments, they could interpret those experiments in their chosen way, and readers could decide to be persuaded or not. But instead, DMB present these experiments as refuting Zhou and Firestone, 2019. In that case, they have to get Zhou and Firestone, 2019's claims right. All Zhou and Firestone, 2019 need is for the humans to think the DCNN's label fits better than previously thought; Zhou and Firestone, 2019 don't need it to be the very best label.

Third, and most crucially, DMB's Experiment 1 still shows above-chance agreement! So it seemingly shows the opposite of DMB's claim (that their experiments show it is easy to generate images with "no agreement"), and continues to support Zhou and Firestone, 2019.

24) Experiment 2 is also hard to interpret, for the same reason as Experiment 1, and for an additional reason. It shows that some images are easy to decipher (showing above chance classification), but some are hard to decipher (showing below chance classification). Note that Zhou and Firestone, 2019 explicitly predict this, but again DMB fail to acknowledge this. Zhou and Firestone, 2019 state: "A small minority of the images in the present experiments.… had CNN-generated labels that were actively rejected by human subjects, who failed to pick the CNN's chosen label even compared to a random label drawn from the image set. Such images better meet the ideal of an adversarial example, since the human subject actively rejects the CNN's label". So it is no surprise that it is possible to find images of the "worst-case" type by having a human pick them out; Zhou and Firestone, 2019 already said that should happen, when they wrote "An important question for future work will be whether adversarial attacks can ever be refined to produce only those images that humans cannot decipher, or whether such attacks will always output a mix of human-classifiable and human-unclassifiable images; it may well be that human validation will always be required to produce such truly adversarial images". DMB's Experiment 2 is perfect evidence for Zhou and Firestone, 2019's prediction; they show, just as Zhou and Firestone, 2019 say, that a process that tends to generate decipherable images will also generate some undecipherable ones that a researcher could pick out. This is yet another example of DMB making a claim as if it is original to them, rather than crediting Zhou and Firestone, 2019 with that claim and noting that DMB's results confirm Zhou and Firestone, 2019's claims or predictions. It continues to confuse me why DMB frame things this way. There is no need to write as though there is a disagreement here.

25) Indeed, a relevant difference between Zhou and Firestone, 2019's experiments and DMB's Experiment 2 is that Zhou and Firestone, 2019 used images that they didn't even "choose"; they simply used the ones Nguyen et al. displayed in their paper. DMB show that, if the researcher selects a subset of those images with the intent to pick the undecipherable ones, then it is possible to do so. But that's not what's at issue; what's at issue is whether the algorithmic process that generates such images tends to produce only undecipherable images, or a mix of decipherable and undecipherable images. Zhou and Firestone, 2019 chose their images in an unbiased/random way (at least, relying on Nguyen et al.'s presentation), and found decipherability. That's a crucial difference: "unbiased" vs "biased" selection of images.

26) Can the authors make their experiment code available? I apologize if I missed it, but I only saw the data and images in their OSF archive.

27) Experiment 3 also contradicts DMB's stated conclusions, and confirms Zhou and Firestone, 2019's hypothesis yet again; even though "To our [DMB's] eyes, these MNIST adversarial images looked completely uninterpretable" [subsection “Experiment 3: Different types of adversarial images”], they still were interpretable! As DMB state, "Participants were slightly above chance for indirectly-encoded MNIST images (t(197) > 6.30, p <.0001, d = 0.44)" [subsection “Experiment 3: Different types of adversarial images”]. Indeed, DMB later contradict this result by saying "Experiment 3 showed that it is straightforward to obtain overall chance level performance on the MNIST images"; but that is not what happened, as subsection “Experiment 3: Different types of adversarial images” shows. It was not straightforward; the images, as a group, were deciphered above chance.

28) Another issue with Experiment 3 is that the study used some labels that would be highly unfamiliar to subjects (or, at least, it seems to have done so; again, the experiment code would be helpful). For example, Appendix 2—figure 3 highlights that subjects were reluctant to call a certain image a "Lesser Panda". The authors seem to interpret this as meaning that subjects believed the image did not look like a Lesser Panda. But, of course, an alternative is that the subjects don't know what a Lesser Panda is. Isn't this an alternative explanation? If so, it would have nothing to do with decipherability, and instead to do with whether subjects know what certain ImageNet labels refer to. Indeed, ImageNet gives "Red Panda" as an alternative, but DMB chose to use "Lesser Panda"; why? The image contains a central red patch; I'd strongly predict that subjects would have classified it above chance if DMB hadn't chosen the much more obscure label "Lesser Panda".

29) Experiment 4 is the most interesting contribution of the paper. Indeed, considering everything I have written above – which I acknowledge has been quite negative – Experiment 4 seems interpretable and really does show chance-level classification. This is the part of the paper that could make a new and meaningful contribution, in a way that is not misleading, does not misconstrue Zhou and Firestone, 2019's conclusions, and does not show above-chance deciphering. Zhou and Firestone, 2019 do have a reply to this – it is a version of the "acuity" reply, which DMB consider for their indirectly encoded images (and rightly reject), but not for their directly encoded ones. DMB cite evidence that suggests this: Elsayed et al. showed that when properties of the primate retina are incorporated into a CNN that is adversarially attacked, those attacks do look quite decipherable to humans. But the fact that we would give this reply doesn't really bear on this review of DMB's paper, or its publishability. So even though I disagree with DMB's interpretation of Experiment 4, I have no problem with it in the way I do with the rest of the paper.

30) DMB's item-level analysis was described in a way I found confusing or maybe even misleading. Consider subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)”: "agreement on many images (21/48) was at or below chance levels. This indicates that the agreement is largely driven by a subset of adversarial images". But what DMB call a "subset" was in fact a majority of images! 27/48 here, and 41/48 in Experiment 1. (Indeed, it's not made explicit where this number comes from; in Experiment 3, 39/48 are numerically above chance and 9/48 are numerically below chance. The authors should clarify when they are referring to numerically above chance and when significantly above chance, and should always report what chance is for these analyses.) Again, I don't mean to say this disrespectfully, but it frequently feels that DMB are going out of their way to describe Zhou and Firestone, 2019's results in uncharitable ways. DMB note that 85% of images in Experiment 1 are significantly agreed-with above chance, and that 57% are significantly above chance in Experiment 3b. (And if they combined the data from 3a and 3b, which they do elsewhere, they would find that this total is closer to 70%). Calling a majority of images (85%, 70%, or 57%) a "subset" is of course literally true, but it implies that it's somehow a small number of images, when in fact it's most of the images! Especially when chance for all of these statistics is only 5%. Please refer to it that way, rather than imply it's some kind of small number.

30) Please also annotate the lines in Appendix 1—figure 1 with labels, so that it is immediately clear to the reader that the black line is chance.

31) The authors should always make clear whether the stimuli they show to readers were chosen algorithmically or by the authors' own subjective impression of them. For example, Figure 3 could give the impression that the authors have some procedure to generate best-case and worst-case images; indeed, I originally interpreted the figure that way. But in fact, I now understand that this just reflects their own choices about which images look least like their target class. So this caption should say so – something like, "Example of best-case and worst-case images for the same category ('penguin'), as judged by the present authors, and as used in Experiment 2". And so on elsewhere, including the generation of labels. Another example is Appendix 2—figure 2, which says "these were judged to contain the least number of features in common with the target category". It would be clearer to say "we judged these images to contain the least…". I know this does happen in some places, but even there it is confusing (for example, DMB say they picked images from "each category"; but in fact I believe each image comes from one of only 10 categories, right? It's worth being especially clear here on both counts).

32) Similarly, in subsection “Experiment 1: Response alternatives” ("We chose a subset of ten images from the 48 that were used by Zhou and Firestone, 2019 and identified four competitive response alternatives (from amongst the 1000 categories in ImageNet) for each of these images"), the authors should state the procedure they used to do this. Why did they choose only 10 images? How did they choose the response alternatives? They give some examples of their negative criteria (e.g., no semantic overlap), but that still leaves out a lot of the selection process. Of course, if the answer is just that they selected in advance the images and labels that they thought would show low agreement, it's important to say so explicitly (though that would, of course, undermine aspects of the experiment's interpretation). I'm also confused why DMB excluded "basketball" from the "baseball" image; they are both kinds of balls, but of course people would be unlikely to visually confuse them – isn't this a (likely unintentionally) self-serving experimental decision?

33) I may well have an overly sensitive ear here, but there is a feeling throughout the paper that DMB believe Zhou and Firestone, 2019 behaved in a sneaky or obfuscatory way in reporting their results. Multiple colleagues have shared with me a similar reading of DMB's paper (after seeing their publicly posted preprint), wondering why it is so sharply worded and insinuatory. I have to say I agree. This is especially unfortunate because nothing could be farther from the truth: Zhou and Firestone, 2019 were transparent about all of these analyses, and just in case we weren't, we proactively made all of our data publicly available so that researchers could know exactly what we did – that, of course, is how DMB acquired the data in the first place. (DMB do not mention this either; it would be informative to the reader, and perhaps more collegial of DMB, to state that the reason they were able to reanalyze Zhou and Firestone, 2019's data was because of Zhou and Firestone, 2019's proactive transparency in making them public.) I hope that a revision, whether here or elsewhere, can be more respectful of other researchers' motivations and not insinuate hidden analyses or selective reporting.

For me, the tone is present throughout, such that it is hard to point out every example. Here are some:

- "Our first step, in trying to understand the surprisingly large agreement between humans and DCNNs observed by Zhou and Firestone, 2019, was to reassess how they measured this agreement" [subsection “Reassessing the level of agreement in Zhou and Firestone (2019)”]. But there was no need for DMB to find themselves "trying to understand" these analyses, as if those analyses were somehow obscure or hidden; the analyses, code, and data were made publicly available alongside the paper itself.

- "We noticed that Z and F used images designed to fool DCNNs trained on images from ImageNet, but did not consider the adversarial images designed to fool a network trained on MNIST dataset" [subsection “Experiment 3: Different types of adversarial images”]. Again, DMB write as if they are suspicious or something. But the reasons are simply that (a) Nguyen et al., highlight their ImageNet images much more (e.g., in their Figure 1), and (b) MNIST-trained networks aren't usually claimed to resemble human vision in the same way as ImageNet-trained networks. Moreover, Zhou and Firestone, 2019 do "consider the adversarial images designed to fool a network trained on MNIST dataset"; that's Zhou and Firestone, 2019's Experiment 5. So this language is not only unnecessary, but also even false.

- "when we examined their results more carefully, the level of agreement was much lower than reported" [Discussion section]. There are two problems here. First, what does "more carefully" mean in this context? More carefully than Zhou and Firestone, 2019? That really seems to imply that Zhou and Firestone, 2019 made an error or something, which DMB do not in fact believe as far as I know. DMB simply prefer another measure, not a "more careful" one, and as DMB acknowledge, Zhou and Firestone, 2019 already carry out some of their preferred analyses. And "much lower than reported" is simply inaccurate; it's fine that DMB prefer a different measure, but that's not the same as Zhou and Firestone, 2019 falsely or inaccurately reporting theirs. All instances of "lower than reported" simply must be revised; Zhou and Firestone, 2019 reported everything accurately – DMB just prefer a different analysis.

- The editor and other reviewers have also flagged "statistically unsound" and related language; I agree that this is inappropriate as well.

34) The Discussion section says "If human classification of these images strongly correlates with DCNNs, as Zhou and Firestone, (2019) observed". But Zhou and Firestone, 2019 do not observe this, for all of the reasons stated above. And this is an especially unfortunate example, since it not only misunderstands Zhou and Firestone, 2019 but also uses a technical term in our field – "correlate", and even "strongly correlate" – that doesn't correspond to anything Zhou and Firestone, 2019 did. Again, Zhou and Firestone, 2019's conclusion is that there is more overlap than would be expected by chance.

eLife. 2020 Sep 2;9:e55978. doi: 10.7554/eLife.55978.sa2

Author response


Essential revisions:

While the reviewers felt that the paper was of sufficient interest for publication (and a fraction were very enthusiastic about its contributions to the literature and its eventual acceptance at eLife), there were five key points that would be essential to address before the article could be accepted, enumerated below. Additionally, we also ask the authors to address the specific comments in the section below.

1) All three reviewers felt strongly that the language/framing of the paper should be tempered to more constructively reflect the relationship of this paper to the literature and to prevent misinterpretation. Specifically, although the discussion about which statistical methods should be computed to support the claims is key to the manuscript and an important point, there are several places (see specific comments for details) where the reader could get the impression that the authors are claiming that Zhou and Firestone, 2019 computed statistical tests incorrectly (rather than taking a different approach than suggested by the authors here) by using language such as "statistically unsound". Discussion of the question of which tests to compute would be valuable to this field at the interface of ML and psychology, and so we ask the authors to provide clarity to the text accordingly.

We agree that the term “statistically unsound” is inappropriate and we have removed this phrase. We have also changed some other terms that, on reflection, were too strong. In order to be more constructive, we have now motivated the section on ‘Reassessing the level of agreement’ differently and refrained from comparing the percent of agreement measured using the two methods. We now make it clear that the two types of statistical analyses answer different questions: while the method of computing agreement used by [Zhou and Firestone, 2019] may be suitable for assessing whether agreement between humans and DCNNs was statistically above chance, it is liable to be mistaken as a suitable method for determining the degree of agreement. We have given examples of why measuring agreement in this manner is inappropriate if the goal is to measure the degree of agreement and why the alternative measure, the mean agreement, overcomes these misinterpretations. We hope this discussion of different methods of assessing agreement will be useful for future research investigating agreement between humans and ML algorithms.
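To make this distinction concrete, the following is a minimal simulation sketch (ours, purely for illustration; the participant numbers, trial counts and per-trial agreement probabilities are hypothetical and not taken from either study). It shows that the percentage of participants counted as agreeing at above-chance rates can approach 100% both when mean agreement is barely above chance and when it is very high:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(agree_prob, n_participants=200, n_trials=48, chance=1/48):
    """Simulate per-trial human-DCNN agreement (True = same label as the DCNN)."""
    agree = rng.random((n_participants, n_trials)) < agree_prob
    per_participant = agree.mean(axis=1)

    # Measure A: mean agreement across trials and participants.
    mean_agreement = per_participant.mean()

    # Measure B: percentage of participants whose agreement rate is numerically
    # above chance (no significance test applied in this sketch).
    pct_above_chance = (per_participant > chance).mean() * 100
    return mean_agreement, pct_above_chance

# A population that agrees only slightly more often than chance (1/48, ~2%)...
print(simulate(agree_prob=0.10))
# ...versus a population that agrees with the DCNN most of the time.
print(simulate(agree_prob=0.60))
```

In both simulated populations nearly every participant is counted as agreeing above chance, even though the mean agreement differs by a factor of six; this is the sense in which the two statistics answer different questions.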

2) In their original paper, Zhou and Firestone, 2019 write "We conclude that human intuition is a more reliable guide to machine (mis)classification than has typically been imagined." We ask the authors to comment upon whether the authors' re-analyses (which still show significantly above-chance classification in Zhou and Firestone, 2019's experiments; e.g., in the authors' Table 2) support this modest conclusion, even though those analyses are still consistent with a fairly low overall level of human-DCNN agreement.

We indeed find that the agreement between human and DCNN classification is above chance in many cases and have discussed in detail why one may obtain this result. In short, our experiments show that this above-chance agreement may be due to (a) differences between the experimental designs under which humans and DCNNs are tested, and (b) some superficial features, such as colour, present within the adversarial images.

If Zhou and Firestone, 2019 were only testing the claim that all adversarial images are completely uninterpretable, then yes, our findings are consistent with their conclusion. However, we would like to note that ZF were not just making this very modest claim. Rather, they were also claiming that “human and machine classification are robustly related” (Abstract), that “human intuition is a surprisingly reliable guide to machine (mis)classification” (Abstract), that their results suggest a “surprisingly universal agreement with the machine’s choices”, and that “adversarial examples truly do share core visual features with the images they are mistaken for” (emphasis added in all cases).

These claims are important as they directly address the central question ZF want to investigate: “Does the human mind resemble the machine-learning systems that mirror its performance?” (Abstract). It is important to emphasize that ZF claimed that the observed agreement was not due to superficial features of the adversarial images. This was the main conclusion of their Experiment 2. So the main theoretical claim of ZF is that the agreement they reported highlights some important “meaningful” similarities between CNNs and humans. We show that the design of Experiment 2 was flawed (they did not use a good measure for selecting foil images, as detailed in our article) and that the observed agreement was indeed due to superficial features, or to images that were not adversarial at all. In sum, our findings are inconsistent with their stronger conclusions (e.g., “surprisingly universal agreement”), and challenge their claim of how the agreement comes about. This is important given that a growing number of neuroscientists, psychologists, and computer scientists are claiming that CNNs are the best current model of human vision.

3) Similarly, the reviewers often had difficulty understanding how the "best" and "worst" adversarial images were subjectively chosen for the experiments. Ideally, this would be done by, say, polling a random group of people: would everyone agree (measured statistically) with these rankings? In the absence of such polling, though, we ask the authors to more explicitly discuss which set of adversarial images is used where and to discuss more fully how these choices lead to the conclusions they draw.

We have now carried out the experiment suggested by the reviewers, where we first polled a random group of participants for “best” and “worst” adversarial images and then used these images for testing a second group of participants. The results from this experiment echo our previous results in Experiment 2: we again find that agreement drops from above-chance to below-chance when “best” images are replaced by “worst” ones, even though both sets of images are confidently classified by DCNNs. As this experiment is better controlled, we have replaced the experiment reported in the previous version of the manuscript with the new experiment.

We would also like to note here that, in hindsight, the labels “best” and “worst” weren’t the best choice (though they are now appropriate given the pre-study). What we wanted to examine was whether agreement between humans and DCNNs is robust for some categories, irrespective of the adversarial image chosen from that category. So all we wanted was an “alternative” adversarial image, rather than the “worst” one. What we showed (and now replicate) is that the specific adversarial image chosen matters – i.e. the agreement between humans and DCNNs is not robust even for a subset of categories.

4) The authors consider the existence of adversarial images as a strong counterpoint to the possibility that humans and DCNNs categorize images in the same way. An alternative is that, for many cases, humans and DCNNs are categorizing images in similar ways, but that for some edge cases, DCNNs fail, and we have not yet figured out how to build a DCNN that recognizes when it is being fooled. But what does it mean to have the same way of categorizing images? In this paper, the implicit view is that if humans and the DCNN disagree on this suite of particular experiments, then they are working in fundamentally different ways. Under this definition, do all humans categorize images in one particular way? Do all DCNNs, including DCNNs with different architectures? To make their claim, the authors need to set clear criteria by which to judge that image categorization is the same across networks, and show that variability across DCNNs is smaller than the variability between humans and DCNNs.

The reviewers raise two very interesting issues and we respond to them in order:

Are adversarial images edge cases? We don’t believe so. There are several reasons: (i) the adversarial images considered in this study are classified by DCNNs with high confidence amongst 1000 output categories, showing that the network makes no distinction between these images and other images within the test set, (ii) there are a large variety of adversarial attacks (see Akhtar and Mian, [2018]) and, indeed, countless adversarial images for a given output class, and (iii) adversarial attacks are not limited to one particular architecture, but are pervasive across different manifestations of convolutional networks.

Still, research on adversarial attacks is ongoing and it is possible that, as architectures and learning algorithms improve and image databases increase in size, it becomes increasingly difficult to generate adversarial images. Therefore, we have revised our manuscript to make sure that we are not claiming that adversarial images are a fundamental problem for all DCNNs, only that they continue to be a problem for current architectures.

Does poor agreement necessarily mean humans and DCNNs work in fundamentally different ways? It is true that we took the poor human-CNN agreement as evidence that CNN and human object classification are very different, and the reviewer’s point is well taken in this regard. Indeed, we do not wish to suggest that DCNNs necessarily agree with each other on adversarial attacks. In our experience, there are many adversarial attacks on which there is no agreement between DCNNs.

However, it is also the case that many adversarial attacks are transferable across networks. In fact, a number of studies are trying to investigate why and under what conditions adversarial attacks transfer (see Goodfellow et al., [2014]; Tramèr et al., [2017]; Demontis et al., [2019]). So, as the reviewers suggest, a stronger test for judging whether humans and DCNNs are working in fundamentally different ways would be to choose adversarial images that transfer across networks and test human-DCNN agreement on these images. We have now carried out exactly this experiment.

We chose 10 state-of-the-art DCNNs and 20 adversarial images that at least 9 out of 10 networks classify with high confidence in the same way (high between-network agreement). In an experiment similar to Experiment 1, we observed that even when network-network agreement is high, human-network agreement remained poor, and the degree of agreement between networks and humans did not depend on the amount of agreement between networks. We have added Experiment 5 to the manuscript, where we report these results.
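For concreteness, a minimal sketch of this selection criterion (not our actual analysis code; the network names, labels and confidences below are placeholders) might look as follows:

```python
from collections import Counter

def select_transferable(predictions, min_networks=9, min_confidence=0.9):
    """Keep images for which at least `min_networks` networks assign the same
    top-1 label, each with confidence >= `min_confidence`.

    `predictions` maps image id -> {network name: (top-1 label, confidence)}.
    """
    selected = {}
    for image_id, per_net in predictions.items():
        confident = [(label, conf) for label, conf in per_net.values()
                     if conf >= min_confidence]
        if not confident:
            continue
        label, count = Counter(label for label, _ in confident).most_common(1)[0]
        if count >= min_networks:
            selected[image_id] = label
    return selected

# Toy usage with made-up values (a real run would use the top-1 outputs of
# 10 ImageNet-trained DCNNs on the candidate adversarial images):
toy = {
    "img_001": {f"net{i}": ("ostrich", 0.97) for i in range(10)},
    "img_002": {f"net{i}": ("ostrich" if i < 5 else "bagel", 0.95) for i in range(10)},
}
print(select_transferable(toy))  # only img_001 meets the 9-out-of-10 criterion
```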

5) Related, the particular adversarial images to which a network is sensitive are reflective of the training set that determined the DCNN. The DCNN in this study was trained on the ImageNet dataset. How would another artificial network, trained on a different image dataset, classify these adversarial images? This would establish which of their findings are expected of any pair of networks with different training sets.

To address 4) and 5), it would be helpful to observe the performance of additional DCNNs, trained on completely separate image banks, on each of the four experiments. For example, in Experiment 4, participants had to distinguish between adversarial images generated for ResNet-18 and AlexNet. For 3), how well does ResNet-18 identify AlexNet adversarial images? For 4) If you had two instantiations of AlexNet ("AlexNet1" and "AlexNet2") but trained them on disjoint sets of images from the same classes, would AlexNet1 agree with AlexNet2 on adversarial images generated for AlexNet2? If it turns out that, for instance, AlexNet1 agrees with AlexNet2 at chance level, then the interpretation of chance-level human performance is not valid.

Again, this is an excellent point that did not occur to us. In response to the related point 4, we have conducted a new experiment (Experiment 5) to compare DCNN-DCNN to human-DCNN agreement. The reviewers’ suggestion about comparing networks trained on different datasets is also a good one. There is emerging evidence from machine learning that suggests that many adversarial examples from one training set should transfer to other training sets. See, for example, Goodfellow et al., [2014], who note that “To explain why multiple classifiers assign the same class to adversarial examples, we hypothesize that neural networks trained with current methodologies all resemble the linear classifier learned on the same training set. This reference classifier is able to learn approximately the same classification weights when trained on different subsets of the training set, simply because machine learning algorithms are able to generalize. The stability of the underlying classification weights in turn results in the stability of adversarial examples.”

However, the influence of training sets on adversarial attacks (and DCNN representations, in general) is still an active field of investigation, so we have modified the manuscript to acknowledge that the different visual experiences of humans and DCNNs could be one of the factors that influences the poor human-DCNN agreement (see the penultimate paragraph in the Discussion section). In that case, improving the training of DCNNs (and perhaps additionally modifying their architectures) may lead to higher agreement and to DCNNs that provide a better theory of the ventral visual pathway.

1) DMB's summary of Zhou and Firestone, 2019's result in the Introduction is ambiguous and open to misreading. Readers might hear DMB's paragraph as stating that Zhou and Firestone, 2019 reported that humans agreed with machines 90% of the time they were asked, which of course Zhou and Firestone, 2019 do not report. A more precise and accurate summary of Zhou and Firestone, 2019's result would be something like this:

2). "The authors took a range of published adversarial images that were claimed to be uninterpretable by humans, and in a series of experiments, they showed those images to human subjects next to the DCNN's preferred label and various foil labels. They reported that, over the course of an experimental session, a high percentage of participants (often close to 90%) chose the DCNN's preferred label more often than would be expected by chance."

We have reworded the section to remove any ambiguity. The new wording is “…in a series of experiments, they showed those images to human subjects next to the DCNN’s preferred label and various foil labels. They reported that, over the course of an experimental session, a high percentage of participants (often close to 90%) chose the DCNN’s preferred label at above-chance rates.” Please note that we have not used the phrase “than would be expected by chance” suggested by the reviewers, but instead “at above-chance rates”, for two reasons: firstly, this phrase is an exact quote from Zhou and Firestone, 2019, where they state, “98% of observers chose the machine’s label at above-chance rates”, and secondly, because the 90% figure does not refer to the percentage of participants who responded more often than would be expected by chance, since ZF counted participants who agreed exactly at chance levels as well as participants who did not agree significantly above chance (please see subsection ‘Reassessing the level of agreement in Zhou and Firestone, 2019’).

3). The final sentence of the Abstract needs to be tempered somewhat. DCNNs are a "promising model," in particular for many aspects of lower-level vision, but they are not a complete model.

We have now revised the final sentence of the abstract to “We conclude that adversarial images still pose a challenge to theorists using DCNNs as models of human vision.”

4) Introduction: Correct the spelling of inferotemporal cortex.

Done.

5) Introduction: The paper revolves around the analysis of the classification of adversarial images, yet beyond showing examples of types of adversarial image, there is not a clear definition of what an adversarial image is, nor of the extent to which adversarial images are classified into types (e.g., 'fooling' or 'naturalistic'). Are these all the types of adversarial images?

We have now provided a definition and a reference for adversarial attacks in the Introduction. There is no formal classification into types; we use ’fooling’ as the term introduced by Nguyen et al., [2015] for images which contain no objects but are confidently classified as a specific class, and ’naturalistic’ as our term for adversarial attacks which do contain objects seen in the real world but are perturbed in some way to become adversarial.
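As an illustration only, one well-known way to produce a directly-encoded fooling-style image is gradient ascent on a target class score starting from noise. The sketch below is not the evolutionary procedure used by Nguyen et al., [2015]; the network, target class index, learning rate and number of steps are arbitrary choices, and it assumes a recent torchvision (0.13 or later) for the weights argument:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Minimal, illustrative sketch: maximise the score of one class for an image
# that starts as pure noise. All hyperparameters here are arbitrary.
model = models.alexnet(weights="IMAGENET1K_V1").eval()
normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

target_class = 945                                    # hypothetical ImageNet index
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    logits = model(normalize(image))
    loss = -logits[0, target_class]                   # maximise the target logit
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)                        # keep pixels in a valid range

confidence = torch.softmax(model(normalize(image)), dim=1)[0, target_class].item()
print(f"softmax confidence for the target class: {confidence:.3f}")
```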

6) Subsection “Reassessing the level of agreement in Zhou and Firestone (2019)” and Materials and methods section: Why are choices independent in the null model? Was the task constructed so that choices may repeat? If so, then this is fine. On the other hand, if only 48 images and 48 labels are presented, a participant may treat this as a 'matching' task, inducing correlations between choices. For 48 items, the correction to the null distribution of the number of correct choices is minor, but it would not be for a smaller number of items. Either way, a clearer description of the task design is warranted.

Indeed, the task is designed in such a way that the trials are independent and the participant can independently choose a label (amongst 48) on each trial. In order to clarify this, we have added the following line: “Each trial is independent, so a participant can choose any of the 48 labels for each image”.
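As an illustration of why this design detail matters for the null model (ours, not part of the manuscript), the sketch below contrasts the null distribution of the number of agreements under independent choices with the null distribution under a forced matching task, confirming that the correction the reviewer mentions is minor for 48 items:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_sims = 48, 100_000

# Null 1: independent choices -- any of the 48 labels can be chosen on each trial,
# so the number of agreements follows Binomial(48, 1/48).
independent = rng.binomial(n_items, 1 / n_items, size=n_sims)

# Null 2: a forced matching task -- each label is used exactly once, so the number
# of agreements is the number of fixed points of a random permutation.
matching = np.array([np.sum(rng.permutation(n_items) == np.arange(n_items))
                     for _ in range(n_sims)])

for name, sample in [("independent", independent), ("matching", matching)]:
    print(f"{name}: mean = {sample.mean():.3f}, variance = {sample.var():.3f}")
# Both means are ~1 agreement out of 48; the variances differ only slightly
# (47/48 vs. ~1), which is the minor correction referred to above.
```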

7) Subsection “Reassessing the level of agreement in Zhou and Firestone (2019)”and Table 1: Please explain what the 'Images' column means in this table. Up to this point, the text has referred only to the participant-level analysis.

We have now removed this Table to avoid problems with incorrect characterization of Zhou and Firestone, 2019’s methods (see response to comment (1) under Essential revisions above).

8) Subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)”: grammar – missing definite article.

Added.

9) Subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)”: grammar – colloquial usage.

We are not quite sure what the reviewer meant. Could you kindly clarify?

10) Discussion section: This final conclusion, that DCNNs that also incorporate a number of additional features would be less sensitive to adversarial attacks, is interesting. I would recommend reading and possibly citing work on "capsule networks" from Hinton/DeepMind.

Thanks – we have added a citation to capsule networks when discussing how different architectures may solve some of the current issues (Discussion section).

11) Appendix 1—figure 4: check cross-reference – there's a "?"

Done.

12) Critiques of scientific research, including this one, are important, valuable, and welcome. At the same time, a core obligation of any such critique is to present its target charitably, and to engage its strongest arguments. I'm worried that this paper does not do that in its current form. A key piece of context that, in my opinion, DMB's paper loses sight of is that Zhou and Firestone, 2019 were directly motivated by Nguyen et al.'s statement that their fooling images were "completely" and "totally unrecognizable to human eyes" – emphasis on "completely" and "totally". Zhou and Firestone, 2019 noticed that many of the images did seem to have some features in common with their target class, and that Nguyen et al. had collected no human data on this; and so Zhou and Firestone, 2019 simply set out to show that the fooling images in Nguyen et al., were not, as a rule, "totally unrecognizable". It's my strong opinion, even after reading and accepting much of what DMB say, that Zhou and Firestone, 2019's data and analyses (and even DMB's reanalyses) still support Zhou and Firestone, 2019's modest conclusion, and I strongly encourage the authors to consider the same. Note that even Nguyen himself agreed; he wrote to us after reading Zhou and Firestone, 2019's paper to say: "After reading this paper, I develop a stronger intuition that humans can decipher very well those robust, and model-transferrable AXs" and that "I personally agree that the phrase 'totally unrecognizable to human eyes' might have been an overstatement. Thanks to your work for pointing it out!". I agree with that statement too.

Indeed, throughout the paper, DMB write statements like Zhou and Firestone, 2019's "conclusions are not justified" [Introduction], but without actually stating what Zhou and Firestone, 2019's conclusions are. In fact, Zhou and Firestone, 2019's conclusions are clear and explicit:

- "We conclude that human intuition is a more reliable guide to machine (mis)classification than has typically been imagined" (final paragraph of Zhou and Firestone, 2019's Introduction, where "than has typically been imagined" refers to Nguyen et al.'s claim).

- "This implies at least some meaningful degree of similarity in the image features that humans and machines prioritize – or can prioritize – when associating an image with a label" (first paragraph of Zhou and Firestone, 2019's Discussion).

and

- "The present results suggest that this particular challenge to notions of human-machine similarity may not be as simple as it appears (though there may of course be other reasons to doubt the similarity of humans and machines)" (first paragraph of Zhou and Firestone, 2019's Discussion section).

In other words, Zhou and Firestone, 2019's conclusion is not that there's a high level of agreement (though see more on that below), or that DCNNs are good models of human vision, etc.; it is simply that the images are not "totally unrecognizable" in the way Nguyen et al. had claimed. And there can be no doubt that this is Zhou and Firestone, 2019's conclusion, since the only time Zhou and Firestone, 2019 even use the word "conclude" is in the above quote.

It is my strong opinion, perhaps my strongest opinion throughout this review, that DMB have not painted that conclusion in an accurate, fair, or charitable way. And so it is also my very strong opinion that the paper must be revised to state that conclusion accurately, and to consider whether the data as a whole (including Zhou and Firestone, 2019's and DMB's analyses) truly are inconsistent with it. Human-machine agreement can be low (as DMB suggest) while still being consistent with Zhou and Firestone, 2019's conclusion that the images are not "totally unrecognizable". There is not as much disagreement between Zhou and Firestone, 2019 and DMB as DMB makes it seem, and there's no reason not to acknowledge that agreement when it exists.

We are not privy to the authors’ intentions, nor to the personal correspondence with Anh Nguyen and, like most researchers in the field, only have what’s written in the paper to go by. It is true that Nguyen et al., (2015) had used the term “totally unrecognizable” in their paper. However, they did not make the claim that all images produced by their algorithm are totally unrecognizable; rather, “It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects” (emphasis added). Our investigation shows that this statement still holds. There are many images on which agreement is at chance or below-chance levels. The reviewer claims that the entire intent of Zhou and Firestone, 2019 was to show that some images had some features that were common with the target class. But this observation is already present in Nguyen et al., (2015), who write: “In this paper we focus on the fact that there exist images that DNNs declare with near-certainty to be of a class, but are unrecognizable as such. However, it is also interesting that some generated images are recognizable as members of their target class once the class label is known.”

The motivation underlying our manuscript is simply to investigate the theoretical question: are there important similarities between CNN and human object categorisation? This also seems to be the question that interests Zhou and Firestone, 2019 – the first line of their Abstract reads: “Does the human mind resemble the machine-learning systems that mirror its performance?” When ZF find that the agreement between humans and DCNNs is above chance, presumably this is interesting because they think their results help answer this question – i.e. the above-chance agreement implies that the human mind does resemble the DCNN. This is the theoretically important position on which we differ. We believe our results show that the above-chance agreement can arise even though the two systems fundamentally differ from each other.

Still, it is not our goal or our place to speculate on Zhou and Firestone’s intentions. So we have revised the manuscript, so that it is clear that the motivation underlying our work is the above theoretical question. We have also substantially rewritten the section where we reanalyse results from Zhou and Firestone, 2019. Instead of focusing on how our analysis compares to theirs, we have discussed the relative merits (and goals) of the two analyses.

13) The authors claim that Zhou and Firestone, 2019 reported a "surprisingly large agreement between humans and DCNNs" [Introduction]; "the reported level of agreement is nearly 90%" [Results section]. These statements are inaccurate, or at least imprecise. Zhou and Firestone, 2019 do not report a "level of agreement of nearly 90%": "level of agreement" is a new phrase that DMB use (appearing nowhere in Zhou and Firestone, 2019's paper) that makes it seem as though Zhou and Firestone, 2019 reported something they did not, and that also makes it seem that Zhou and Firestone, 2019's measure is aimed at the same quantity as DMB's measure. Additional clarity here would be appreciated, since this comes up later (e.g., in saying that the agreement was "lower than reported", which I worry is misleading – the two measures simply report different quantities).

As noted above we have reworded this point about 90% agreement to avoid any ambiguity. We would like to note here that, even though Zhou and Firestone, 2019 may not have used the term “level of agreement”, they do make statements that imply a large degree of agreement throughout their manuscript. For example, in their abstract, they state that “Human intuition may be a surprisingly reliable guide to machine (mis)classification” and in the main text they state “98% of observers chose the machine’s label at above-chance rates, suggesting surprisingly universal agreement with the machine’s choices”. In our manuscript (subsection “Reassessing the level of agreement in Zhou and Firestone (2019)”), we have clarified why the statistic used by Zhou and Firestone, 2019 should not be used to make such statements.

14) Similarly, DMB wait too long to note that Zhou and Firestone, 2019 already computed DMB's "average agreement" measure; indeed, it was the very first analysis in Zhou and Firestone, 2019's paper, but DMB only later acknowledge this, instead starting with Zhou and Firestone, 2019's Experiment 3 two pages earlier. In Experiment 1, Zhou and Firestone, 2019 state: "Classification "accuracy" (i.e., agreement with the machine's classification) was 74%, well above chance accuracy of 50% (95% confidence interval: [72.9, 75.5%]; two-sided binomial probability test: p < 0.001)". Indeed, this alone shows how it is misleading of DMB to claim that Zhou and Firestone, 2019 report a "level of agreement" near ceiling; Zhou and Firestone, 2019 are clear that "agreement with the machine's classification" is 74%, just as DMB report.

Zhou and Firestone, 2019 only report average agreement for Experiment 1, and we make this point clearly. Please also see responses to comments (12) and (13) above.

15) I stated above that I still think Zhou and Firestone, 2019's analyses support their conclusions. To see why, consider an analogy. Suppose someone claimed that dieting is "completely" and "totally" ineffective for losing weight, as Nguyen et al., claimed that their fooling images were "totally unrecognizable to human eyes". If you were skeptical (as Zhou and Firestone, 2019 were), you could assign subjects to a diet and see if they lose weight. There are then two standard approaches to evaluating whether the diet was "totally ineffective". You could (1) ask how much weight the cohort lost as a whole, and whether that amount deviates significantly from what would be expected under the null; or you could (2) ask how many people lost weight, and ask whether that number significantly deviates from what would be expected under the null. For example, you could (1) discover that the cohort lost 10% of their bodyweight on average, and that this number deviated significantly from 0%; or you could (2) discover that 95% of dieters lost some amount of weight (whether that amount was a lot or a little), and that this number significantly deviated from 50%.

Zhou and Firestone, 2019 take both approaches – (1) and (2) – in their Experiment 1, but then settle on (2) for the rest, because it seemed better suited to refute the claim that adversarial images are "completely unrecognizable to humans". By contrast, (1) is an approach DMB prefer, as is evident from Table 2. That's fine; as DMB show in that table, it also confirms Zhou and Firestone, 2019's conclusions by showing significantly above-chance average agreement. (I comment on Table 1 below.) What should be clear, however, is that these two approaches are both valid, and indeed have certain strengths and weaknesses over one another. For example, if approach (1) finds that subjects lost 10% of their bodyweight on average, that still might not refute the claim that the diet is ineffective, since it could be that most subjects gained a small amount of weight while a minority of subjects lost a very large amount of weight – a possibility that would make this measure ill-tailored to refuting claims of ineffectiveness (since it causes weight gain in most people). Conversely, if you used approach (2) and found that 95% of people lost some amount of weight, then that wouldn't tell you how much weight the cohort (or any one person) lost; but it is still informative, because if 95% of people lose weight on a diet (whether that amount is 1 pound or 50), then whatever you think of this diet, it's hard to plausibly claim that it's "completely" and "totally" ineffective (even if it might not be very effective overall). And, of course, both approaches combined would be best. But, again, they're both valid: If 98% of subjects pick the machine's label numerically more often than not, and if 98% is significantly higher than 50%, then it just can't be true that these images are "completely" and "totally" undecipherable – even if any given subject is only agreeing a bit, and even if that agreement is not even statistically significant within a single subject (just like one needn't say of any given dieter that they lost a "significant" amount of weight; if 95% of dieters lose weight, then that is telling, as long as 95% significantly differs from 50% in the sample).

So DMB are being too critical. There are multiple ways to understand these data, depending on the researchers' goals. DMB are correct to say that Zhou and Firestone, 2019's analyses in Experiment 3 "assign the same importance to a participant that agrees on 2 out of 48 trials as a participant who agrees on all 48 trials with the DCNN" [84]; but this just isn't a problem for Zhou and Firestone, 2019's research question. Zhou and Firestone, 2019's goal was to refute the claim that the fooling images in Nguyen et al. were "completely unrecognizable to humans". Perhaps DMB do not share this goal; but that is no reason to call Zhou and Firestone, 2019's approach "misleading and statistically unsound". Both Zhou and Firestone, 2019's approach and DMB's approach are fine. And, again, both approaches ended up coming out Zhou and Firestone, 2019's way, as DMB confirm in Table 2.

The paper should be revised to reflect this. There is not actually a large disagreement here between Zhou and Firestone, 2019 and DMB as regards to how Zhou and Firestone, 2019's data interact with Nguyen et al.'s claim (which is what Zhou and Firestone, 2019 were after). To give an example (though of course DMB could use language other than mine), the paper could say something like "Whereas the analyses Zhou and Firestone, 2019 use may have been sufficient to refute previous claims that such images were "totally unrecognizable to human eyes", those analyses are still consistent with a fairly low overall level of human-machine agreement." (But of course, much else would then have to change too, since DMB are so persistent in calling Zhou and Firestone, 2019's conclusions unjustified.)

We have responded to this point above (comments (2), (12) and (13)) – the reviewer is mischaracterizing the claim of Nguyen et al., as well as the claims made by Zhou and Firestone, 2019.

16) A central issue with the re-analyses is how much DMB's discussion focuses on Zhou and Firestone, 2019's Experiment 3b. In my opinion, this was a strange experiment to choose, especially when it comes to giving a charitable critique. The entire purpose of Zhou and Firestone, 2019's Experiment 3b was to make human-machine agreement as difficult as possible, and this is the single least-discussed experiment (of 8) in Zhou and Firestone, 2019's entire paper – not the cornerstone of Zhou and Firestone, 2019's claims in any way. When Zhou and Firestone, 2019 observed above-chance human-machine agreement in Experiment 1, there was a worry that above-chance agreement when given two alternatives wasn't so impressive. So, Experiments 3a and 3b presented all the labels for every image at once, to see if subjects could show above-chance agreement even in extremely taxing circumstances. Zhou and Firestone, 2019 were very explicit about this: Zhou and Firestone, 2019 say "These results suggest that humans show general agreement with the machine even in the taxing and unnatural circumstance of choosing their classification from dozens of labels displayed simultaneously". So Zhou and Firestone, 2019 already described this task as "taxing and unnatural" and unlikely to produce high human-machine agreement. And indeed it's certain that subjects do not actually read all the labels before making judgments, because they often answer in just a few seconds – not enough time to have looked at all the labels. So, of course agreement will be low – that was the point of that experiment.

Indeed it is possible to see here – http://www.czf.perceptionresearch.org/adversarial/expts/texture48.html – how imposing that is, whereas Experiment 1 – here http://www.czf.perceptionresearch.org/adversarial/expts/texture.html – feels much more natural. Yet, subjects in Experiment 3a and 3b still showed a mean agreement of ~10%, rather than the chance level of ~2%. We thought that was impressive, given the circumstances. DMB may disagree, but that is simply a difference in opinion or taste, not an undermining of Zhou and Firestone, 2019's claims. (Indeed, DMB later describe this result as "striking"; strikingly high? If so, why not say so earlier?)

To be fairer and more accurate, the relevant claims in DMB's paper should also include Zhou and Firestone, 2019's Experiment 1 as an example. For example, subsection “Reassessing the level of agreement in Zhou and Firestone (2019)” should first describe how Zhou and Firestone, 2019 analyzed Experiment 1, and then describe Experiment 3. Figure 1.1 should include Experiment 1 and Experiment 3. And so on. Otherwise, I worry that DMB might be cherry picking, choosing the examples from Zhou and Firestone, 2019 that they feel are weakest, rather than the ones they feel are strongest. For example, DMB state in subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)” that "For the rest of the images agreement is at or below chance levels". But that's only true of Experiment 3; for Zhou and Firestone, 2019's Experiment 1, which was not designed to be so taxing on subjects, even DMB's binomial analysis shows that agreement is significantly above chance on over 85% of images. It is potentially misleading (if not simply false) to claim that "For the rest of the images agreement is at or below chance levels". This discussion should be more balanced, and either focus on Experiment 1 or at least include parallel analyses for Experiment 1's results whenever DMB discuss Experiment 3.

The reason for focusing on Experiment 3 was, in fact, exactly because the results there were most impressive. In our view, this is the experiment that comes closest to testing participants under similar conditions to DCNNs: participants see a large set of foils to the target class. Experiment 1 only shows one foil and, as demonstrated by our Experiment 1, choosing a random foil may inflate the degree of agreement. See responses (12) and (13) for other points about reanalysis.

17) Beyond the re-analyses, I'm worried that DMB present their views as alternatives to Zhou and Firestone, 2019, when in fact many of these views were already explicitly articulated by Zhou and Firestone, 2019. I've already emphasized how this is true of Zhou and Firestone, 2019's main conclusion. But another example is DMB's suggestion that human-machine agreement on these images reflects "participants making educated guesses based on some superficial features (such as colour) within images and the limited response alternatives presented to them" [Introduction]. Truly, this is already Zhou and Firestone, 2019's exact view about those experiments. Again, with quotes, Zhou and Firestone, 2019 say that subjects "may have achieved this reliable classification not by discerning any meaningful resemblance between the images and their CNN-generated labels, but instead by identifying very superficial commonalities between them (e.g., preferring "bagel" to "pinwheel" for an orange yellow blob simply because bagels are also orange-yellow in color)". That applies to Experiments 1 and 3, which use the same labels.

DMB subtly acknowledge this elsewhere, but otherwise they present this hypothesis as if it is original to them. Consider subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)” or subsection “Experiment 5: Transferable adversarial images”, which says of Elsayed et al., that "These findings are consistent with our observation that some adversarial images capture some superficial features that can be used by participants to make classification decisions" (emphasis added). But this is already Zhou and Firestone, 2019's exact hypothesis. Both Zhou and Firestone, 2019 and DMB believe subjects are making educated guesses. (Indeed, Zhou and Firestone, 2019 think that may be the right way to describe what the CNN is doing too: Zhou and Firestone, 2019 write "both the CNNs' behavior and the humans' behavior might be readily interpreted as simply playing along with picking whichever label is most appropriate for an image", and "CNNs are.… forced to play the game of picking whichever label in their repertoire best matches an image (as were the humans in our experiments)".

Perhaps "consistent with Zhou and Firestone, 2019's and our observation" would be a better statement for this and other examples. But it's simply not right to portray this as DMB's own or original interpretation; it is precisely Zhou and Firestone, 2019's interpretation, as these quotes show. This is another example where DMB portray disagreement that may not really exist, and so should attribute this interpretation to Zhou and Firestone, 2019.

The reviewer’s statement that ZF claimed that agreement was based on educated guesses from superficial features is false. The quote mentioned by the reviewer (“may have achieved... superficial commonalities between them”) is taken from the motivation of their Experiment 2, which was designed to test whether agreement was driven by superficial features, and the findings were taken to refute this hypothesis. The authors wrote: “Again, human observers agreed with the machine’s classifications: 91% of observers tended to choose the machine’s first choice over its second choice, and 71% of the images showed human-machine agreement (Figure 2D). Evidently, humans can appreciate deeper features within adversarial images that distinguish the CNN’s primary classification from closely competing alternatives.” Thus, the view that participants are making educated guesses based on some superficial features is the exact opposite of the conclusion drawn by Zhou and Firestone, 2019.

18) "Using a statistic that treats 45% agreement as chance is liable to be misinterpreted by readers" [subsection “Reassessing the level of agreement in Zhou and Firestone (2019)”]. Perhaps this is true; I agree that Zhou and Firestone, 2019 could have been clearer. But doesn't this support the opposite of DMB's point? The reader that DMB are imagining would normally take chance to be 50% (as Zhou and Firestone, 2019's figures show, perhaps unclearly), whereas the true value of above-chance-performing participants is 88% or 90%. So the fact that chance is 45% rather than 50% makes the result more impressive, not less impressive, since 88% is even farther above chance than the reader was led to believe.

What are the possible interpretations of the statement “98% of observers chose the machine’s label at above-chance rates, suggesting surprisingly universal agreement with the machine’s choices” (Zhou and Firestone, 2019, emphasis added)? One possible interpretation is that if you ask 100 humans to classify an adversarial image, 98 humans (on average) will choose the same label as the machine. This would indeed be a surprisingly universal agreement with the machine’s choice. However, this interpretation turns out to be false. The crucial bit of the statement lies in the phrase “at above-chance rates” – i.e., 98 out of 100 participants agree with the machine if agreement is evaluated as same choice on 50% or more of trials. In each of these trials, participants were given two choices, one of which was the label chosen by the machine and the other one a random label from ImageNet. Put this way, the results are less surprising. As the reviewer agrees (point 17 above) such levels of agreement can arise out of participants making educated guesses based on superficial features present within these images. Indeed, when various experimental factors are controlled, the agreement is scarcely above chance. We did not want to imply that Zhou and Firestone, 2019 intentionally mischaracterized their results (though perhaps the use of the phrase “surprisingly universal” was unfortunate), but that such a misinterpretation of their results and analysis is possible. Our reanalysis tries to present various facets of the data to minimize such misinterpretation. We have revised the manuscript to make this more clear.
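To make the arithmetic behind the chance level explicit, here is a small illustrative calculation (ours; it assumes a two-alternative design with 48 trials per participant for concreteness, and the exact counting rules used in each experiment are described in the respective papers). Under random guessing, the probability that a participant is counted as agreeing “at above-chance rates” depends on how ties at exactly 50% are treated, which is why the chance level for this statistic is not simply 50%:

```python
from scipy.stats import binom

# Hypothetical design: 48 two-alternative trials per participant, random guessing.
n_trials, p_null = 48, 0.5

strictly_more = binom.sf(n_trials // 2, n_trials, p_null)      # P(agree on > 50% of trials)
half_or_more = binom.sf(n_trials // 2 - 1, n_trials, p_null)   # P(agree on >= 50% of trials)

print(f"P(> 50% of trials under the null)  = {strictly_more:.3f}")  # ~0.44
print(f"P(>= 50% of trials under the null) = {half_or_more:.3f}")   # ~0.56
```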

19) In the very next paragraph, DMB make the exact kind of misleading statement that they just charged Zhou and Firestone, 2019 with, and in a way that, I worry, makes this part of the reanalysis potentially misleading. DMB say "Measured in this manner (independent 2-tailed binomial tests with a critical p-value of 0.05 for each participant), the agreement in Experiment 3a between DCNN and participants drops from 88% to 57.76%" [94]. But in fact there is no "drop", because chance changes across these two measures. For the 88% measure, chance is 45%; but for the 57.76% measure, chance is 5% (or even 2.5%), since that's the α value of the test that DMB run. In other words, you would expect 5% of subjects to perform significantly different from chance under the null. So DMB have moved the standard without properly alerting the reader. Indeed, 57.76% when chance is 5% is, if anything, more impressive than 88% when chance is 50%. This paragraph should be revised to reflect this. It could say something like, "Measured in this manner, the agreement in Experiment 3a between DCNN and participants moves from 88% (where chance is 45%) to 57.76% (where chance is 5%)"; it would then be clear that "drops" isn't appropriate. They should also report this for Experiment 1, in line with my comment #3 above.

We have now deleted this paragraph as well as Table 1 and rewritten the Reanalysis section to reflect that the two statistical methods allow one to answer different questions.

20) This same issue appears in Table 1. To be more interpretable, that table should not only report these %s but should also report the chance level for each %, just as DMB do for Table 2. That will help the reader understand how to interpret what DMB call the "drop" from higher numbers to lower numbers. I strongly recommend this; the table is, I worry, highly misleading without this correction – i.e., a column for something like "what would be expected by chance" (which, I gather, is 5%). Indeed, a problem with the analysis in Table 1 is that, in some sense, there's no "null hypothesis"; there's no standard by which the analysis can decide whether the cohort of images was "totally unrecognizable" or not. The value of Zhou and Firestone, 2019's analysis, and DMB's Table 2, is that it's clear how to reject the null hypothesis: i.e., if significantly more images are numerically above chance than are numerically below chance (Zhou and Firestone, 2019), or if mean agreement is significantly above chance (Zhou and Firestone, 2019 and DMB). But DMB's Table 1 analysis doesn't really work that way. So that's why it's crucial to portray what chance is for those tests, in the table itself, so that readers can see that it was 85% agreement when chance was 5%, or 57% when chance was 5%, etc.

We have now removed what was Table 1 and instead focused this section on discussing mean agreement as the more appropriate measure of the degree of agreement between humans and DCNNs.

21) A final issue with the reanalyses themselves, and in some ways the biggest one, is just that Zhou and Firestone, 2019's conclusions survive them, in ways that make it extremely confusing to me why DMB draw the conclusions they do. As Table 2 shows, every one of Zhou and Firestone, 2019's experiments continues to show significantly above-chance human-machine agreement, even on the analytical approach that DMB prefer. And DMB acknowledge this: They say "it is nevertheless the case that even these methods show that the overall agreement was above chance" [subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)”]. First, DMB should not wait this long to say so; they should say as early as possible that their analyses, like ours, show significantly above-chance agreement. But second, this demonstrates that Zhou and Firestone, 2019's conclusions, as stated above, are secure after all. I must repeat again that Zhou and Firestone, 2019's claims, made explicitly in their paper, are as follows: "that human intuition is a more reliable guide to machine (mis)classification than has typically been imagined", and that the results "impl[y] at least some meaningful degree of similarity in the image features that humans and machines prioritize-or can prioritize-when associating an image with a label". DMB confirm that these conclusions are even more robust than Zhou and Firestone, 2019 suggested, since they are supported by Zhou and Firestone, 2019's approach and DMB's.

We indeed find that the mean agreement is above chance for the experiments carried out by Zhou and Firestone, 2019. We dedicate a substantial portion of the paper to discussing why exactly this may be. The experiments we carry out are designed to tease apart these reasons. We show that the above-chance mean agreement is partly due to how the experiments are designed and the stimulus sets are chosen, and partly due to (some superficial) properties of the adversarial images.

22) Moving to the experiments: DMB's abstract states that "it is easy to generate images with no agreement" [16]. I don't see how this claim is justified by their experiments. First, half or more of DMB's experiments reflect a failure to "generate images with no agreement". Experiment 1 shows above-chance agreement: "A single sample t-test comparing the mean agreement level to the fixed value of 25% did show the difference was significant (t(99) = 3.00, p = .0034, d = 0.30)" [179]. Experiment 2 does so as well: when the best and worst cases are combined, their average agreement is above chance – it's only through picking images that DMB think look undecipherable that they were able to find undecipherable images (note that Zhou and Firestone, 2019 never state that all images will be undecipherable, but rather that the procedure for producing such images will tend to produce decipherable images in general; Experiment 2 confirms this). Experiment 3 does so as well: "Participants were slightly above chance for indirectly-encoded MNIST images (t(197) > 6.30, p <.0001, d = 0.44)" [subsection “Experiment 3: Different types of adversarial images”]. (Only Experiment 4 does not; I'll return to that later.)

This was confusing. Why, if it is "easy" to generate images with no agreement, did DMB so frequently fail to do so? This conclusion should be altered to something like "While difficult, it is possible to generate images with no agreement"; but surely not "easy"!

The phrase “easy to generate images with no agreement” pertains to Experiment 4. This is the only experiment in which we generated adversarial images ourselves. All other images were taken from Nguyen et al. (2015). However, we agree that “easy” is an informal expression, so we have replaced this with the more precise phrase: “we find that there are well-known methods of generating adversarial images where humans show no agreement with DCNNs.”

23) Experiment 1 is hard to interpret as run; or, if it is, it doesn't show a lack of human-machine agreement. The authors chose labels that they subjectively felt were good competitors for the DCNN's label, and found that agreement dropped but was still significantly above chance. This is unsurprising and no threat to Zhou and Firestone, 2019's view, for at least three reasons.

First, this conclusion was already explicitly reached by Zhou and Firestone, 2019, who also considered this and whose Experiment 2 showed that human-machine agreement drops with more competitive labels. DMB mention Zhou and Firestone, 2019's Experiment 2, and criticize it on other grounds; but those grounds are independent of this aspect of Zhou and Firestone, 2019's discussion. In other words, even granting DMB's criticism of Zhou and Firestone, 2019's Experiment 2, Zhou and Firestone, 2019 still perfectly anticipate the conclusion of DMB's Experiment 1. But again, as above, DMB do not credit Zhou and Firestone, 2019 with this, and instead present this as though it is original to DMB. DMB should acknowledge that their Experiment 1 confirms the results of Zhou and Firestone, 2019's Experiment 2 – that more competitive labels should reduce agreement.

Second, DMB's version (but not Zhou and Firestone, 2019's version) is much more difficult to interpret because it is almost a form of "double-dipping": The researchers, DMB, are human beings with visual systems who can appreciate which images look least like certain labels; so they picked the foil labels they thought fit best, and then discovered that other humans (their subjects) agreed. But Zhou and Firestone, 2019 never claimed that humans would pick the DCNN's label as the literal #1 label among all 1000! Again, Zhou and Firestone, 2019's claim is only that humans favor the DCNN's label better than would be expected if the images were "totally unrecognizable to human eyes". So, this experiment is perfectly consistent with Zhou and Firestone, 2019's view. I keep returning to this because DMB have made Zhou and Firestone, 2019's paper the cornerstone of their own paper. If, instead, they just presented four new interesting experiments, they could interpret those experiments in their chosen way, and readers could decide to be persuaded or not. But instead, DMB present these experiments as refuting Zhou and Firestone, 2019. In that case, they have to get Zhou and Firestone, 2019's claims right. All Zhou and Firestone, 2019 need is for the humans to think the DCNN's label fits better than previously thought; Zhou and Firestone, 2019 don't need it to be the very best label.

Third, and most crucially, DMB's Experiment 1 still shows above-chance agreement! So it seemingly shows the opposite of DMB's claim (that their experiments show it is easy to generate images with "no agreement"), and continues to support Zhou and Firestone, 2019.

The goal of Experiment 1 was not to discredit Zhou and Firestone but to understand whether humans would agree with DCNN classifications if they made their decisions under the same conditions as the DCNN. Obviously, it’s impractical to run an experiment in which humans have to choose amongst 1000 labels. Therefore, Zhou and Firestone chose the alternative labels randomly. Choosing alternative labels in this fashion is understandable. However, the results from Experiment 1 show that, had the experiment with 1000 alternative labels been feasible, it would have shown a much lower level of agreement, as responses would have been distributed over these competing labels. Zhou and Firestone may not be interested in this question, but we feel that it is an important question to address when considering whether there are meaningful similarities in human and DCNN object recognition.

The reviewer is, however, entirely correct about the dangers of double-dipping in this experiment. To mitigate this, we chose alternative competitive labels so that they were not semantically related to each other (see Appendix 2—figure 1). Even so, we agree that the near-chance agreement we find in Experiment 1 does not necessarily mean that agreement would be close to chance if participants saw all 1000 categories. Therefore, we do not make this claim in our manuscript. The critical finding of this experiment is that agreement drops considerably when labels are not randomly selected, which shows that participants cannot clearly identify a single category from these images that DCNNs classify with 99% confidence.
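
As a purely illustrative aside, the comparison against a fixed chance level that the reviewer quotes (25% with four response alternatives) can be sketched as follows in Python/SciPy. The agreement scores below are simulated placeholders, not the experimental data, and the variable names are ours.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Placeholder per-participant agreement scores for the competitive-label
# condition (10 images, four alternatives, so chance = 0.25); these values
# are simulated for illustration and are NOT the experimental data.
n_participants, n_images, chance = 100, 10, 0.25
agreement = rng.binomial(n_images, 0.28, size=n_participants) / n_images

# One-sample t-test against the chance level, plus a one-sample Cohen's d,
# mirroring the kind of analysis quoted by the reviewer.
t, p = stats.ttest_1samp(agreement, chance)
d = (agreement.mean() - chance) / agreement.std(ddof=1)
print(f"mean = {agreement.mean():.3f}, t({n_participants - 1}) = {t:.2f}, "
      f"p = {p:.4f}, d = {d:.2f}")
```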

24) Experiment 2 is also hard to interpret, for the same reason as Experiment 1, and for an additional reason. It shows that some images are easy to decipher (showing above-chance classification), but some are hard to decipher (showing below-chance classification). Note that Zhou and Firestone, 2019 explicitly predict this, but again DMB fail to acknowledge this. Zhou and Firestone, 2019 state: "A small minority of the images in the present experiments … had CNN-generated labels that were actively rejected by human subjects, who failed to pick the CNN's chosen label even compared to a random label drawn from the image set. Such images better meet the ideal of an adversarial example, since the human subject actively rejects the CNN's label". So it is no surprise that it is possible to find images of the "worst-case" type by having a human pick them out; Zhou and Firestone, 2019 already said that should happen, when they wrote "An important question for future work will be whether adversarial attacks can ever be refined to produce only those images that humans cannot decipher, or whether such attacks will always output a mix of human-classifiable and human-unclassifiable images; it may well be that human validation will always be required to produce such truly adversarial images". DMB's Experiment 2 is perfect evidence for Zhou and Firestone, 2019's prediction; they show, just as Zhou and Firestone, 2019 say, that a process that tends to generate decipherable images will also generate some undecipherable ones that a researcher could pick out. This is yet another example of DMB making a claim as if it is original to them, rather than crediting Zhou and Firestone, 2019 with that claim and noting that DMB's results confirm Zhou and Firestone, 2019's claims or predictions. It continues to confuse me why DMB frame things this way. There is no need to write as though there is a disagreement here.

Again, the point of Experiment 2 is not to discredit ZF but to examine the following important question: do humans and DCNNs consistently agree with each other on a subset of image categories, or only on a subset of images? For example, do humans and DCNNs agree on what a ‘Tile roof’ looks like? The adversarial image we used in Experiment 1 (random labels condition) for Tile roof showed an agreement of 75% between participants and the DCNN. So, in Experiment 2 we chose a different adversarial image for the same category. In this case, we found that the agreement dropped to ∼17% (i.e. below chance), showing that the agreement wasn’t consistent for this category. This same pattern is reproduced for many categories, and overall agreement for the alternative images is below chance.

We think that this is informative and, incidentally, contradicts Zhou and Firestone’s claim that human intuition is a reliable guide to machine (mis)classification.

25) Indeed, a relevant difference between Zhou and Firestone, 2019's experiments and DMB's Experiment 2 is that Zhou and Firestone, 2019 used images that they didn't even "choose"; they simply used the ones Nguyen et al. displayed in their paper. DMB show that, if the researcher selects a subset of those images with the intent to pick the undecipherable ones, then it is possible to do so. But that's not what's at issue; what's at issue is whether the algorithmic process that generates such images tends to produce only undecipherable images, or a mix of decipherable and undecipherable images. Zhou and Firestone, 2019 chose their images in an unbiased/random way (at least, relying on Nguyen et al.'s presentation), and found decipherability. That's a crucial difference: "unbiased" vs "biased" selection of images.

Again, the question is not whether DMB’s choice or Zhou and Firestone’s choice (or Nguyen’s choice, for that matter) is the better one, but whether the choice of image for a given category matters. Clearly it does. This is problematic if one takes agreement on certain categories to mean that humans and DCNNs share representations for those categories. Also, we do not claim that all adversarial images are undecipherable. On the contrary, we argue that many adversarial images are clearly interpretable (subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)”).

26) Can the authors make their experiment code available? I apologize if I missed it, but I only saw the data and images in their OSF archive.

We have uploaded all materials (stimuli) as well as the collected data to OSF. The reason for not uploading the code is that the experiments were conducted in PsyToolkit and PsychoPy (through Pavlovia), both of which have undergone several version updates that would render the code unworkable. We have therefore provided detailed descriptions of the experimental procedure in the Materials and methods section which, along with the uploaded materials, should be sufficient for easy replication. Should the reviewer or any reader need the code that we used, we will be happy to provide it upon request.

27) Experiment 3 also contradicts DMB's stated conclusions, and confirms Zhou and Firestone, 2019's hypothesis yet again; even though "To our [DMB's] eyes, these MNIST adversarial images looked completely uninterpretable" [subsection “Experiment 3: Different types of adversarial images”], they still were interpretable! As DMB state, "Participants were slightly above chance for indirectly-encoded MNIST images (t(197) > 6.30, p <.0001, d = 0.44)" [subsection “Experiment 3: Different types of adversarial images”]. Indeed, DMB later contradict this result by saying "Experiment 3 showed that it is straightforward to obtain overall chance level performance on the MNIST images"; but that is not what happened, as subsection “Experiment 3: Different types of adversarial images” shows. It was not straightforward; the images, as a group, were deciphered above chance.

There is no contradiction in our claims and conclusions, and Experiment 3 provides no support for Zhou and Firestone, 2019's claims. We found 13.53% agreement for indirectly and 10.43% agreement for directly generated images (when chance was 10%). Appendix 2—figure 3(C) illustrates the true nature of the “interpretable” indirectly encoded MNIST results. There was a tendency for participants to choose the digit ‘0’ (it was the most frequent choice for 6/20 images and made up a total of 16.8% of responses) and the digit ‘1’ (a total of 14.26%). Looking at 16/20 stimuli, after accounting for this preference, the average agreement falls to exactly chance (10.2%). Looking at all the stimuli reveals quite large variability, which also negates any claim of general interpretability of these images. The results with the directly and indirectly encoded MNIST images clearly challenge the claim that humans have meaningful insights into how DCNNs classify these images. We believe that simply pointing out significance without considering variability, systematic but unrelated response tendencies, and effect sizes is not informative enough.
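
To make the item-level reasoning above concrete, the following minimal sketch (Python/NumPy, operating on simulated placeholder responses rather than the actual OSF data files) shows one way to compute per-image agreement against the 10% chance level, inspect overall response frequencies for a bias towards particular digits, and recompute mean agreement after setting aside the images whose targets are the bias-favoured digits. It illustrates the kind of check described above, not the exact analysis reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder response matrix (participants x images) of digit choices 0-9;
# in the real analysis this would be read from the OSF data files.
n_subjects, n_images = 198, 20
targets = np.repeat(np.arange(10), 2)                 # two adversarial images per digit
responses = rng.integers(0, 10, size=(n_subjects, n_images))   # simulated, NOT real data

# Per-image agreement with the DCNN's label and the overall mean (chance = 0.10)
per_image = (responses == targets).mean(axis=0)
print("overall agreement:", round(per_image.mean(), 3))

# Response-frequency check: a bias towards particular digits (e.g. '0' and '1')
# shows up as those digits being chosen far more often than 10% of the time
choice_freq = np.bincount(responses.ravel(), minlength=10) / responses.size
print("choice frequencies:", np.round(choice_freq, 3))

# Mean agreement restricted to images whose targets are not the bias-favoured
# digits (one way of accounting for such a preference)
favoured = [0, 1]
mask = ~np.isin(targets, favoured)
print("agreement excluding favoured targets:", round(per_image[mask].mean(), 3))
```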

28) Another issue with Experiment 3 is that the study used some labels that would be highly unfamiliar to subjects (or, at least, it seems to have done so; again, the experiment code would be helpful). For example, Appendix 2—figure 3 highlights that subjects were reluctant to call a certain image a "Lesser Panda". The authors seem to interpret this as meaning that subjects believed the image did not look like a Lesser Panda. But, of course, an alternative is that the subjects don't know what a Lesser Panda is. Isn't this an alternative explanation? If so, it would have nothing to do with decipherability, and instead to do with whether subjects know what certain ImageNet labels refer to. Indeed, ImageNet gives "Red Panda" as an alternative, but DMB chose to use "Lesser Panda"; why? The image contains a central red patch; I'd strongly predict that subjects would have classified it above chance if DMB hadn't chosen the much more obscure label "Lesser Panda".

We don’t think the obscurity of the label is an explanation for why we observe chance-level or below-chance agreement on some of the images. Firstly, the labels for other images in this experiment where we observe at- or below-chance performance are ‘cheetah’, ‘golden retriever’, ‘stopwatch’ and ‘soccer ball’. There is no reason to suspect that participants know what these categories look like any less than the categories that show above-chance agreement. (All these category labels are already available along with the images uploaded on OSF.) Secondly, even if participants don’t know what a ‘Lesser Panda’ is, surely they know what a Panda is – that is enough information to distinguish it from alternative labels such as ‘centipede’, ‘stopwatch’, ‘cheetah’, etc.

Moreover, if the reviewer is correct and agreement increases when the label is swapped from ‘Lesser Panda’ to ‘Red Panda’ because the image contains a central red patch, this is entirely in line with our argument in the manuscript: participants choose labels by making educated guesses based on superficial features (such as colour) present within these images. It also directly contradicts the conclusion of Experiment 2 in ZF, where they tested whether agreement was due to “superficial commonalities” and found instead that “humans can appreciate deeper features within adversarial images that distinguish the CNN’s primary classification from closely competing alternatives” (Zhou and Firestone, 2019).

29) Experiment 4 is the most interesting contribution of the paper. Indeed, considering everything I have written above – which I acknowledge has been quite negative – Experiment 4 seems interpretable and really does show chance-level classification. This is the part of the paper that could make a new and meaningful contribution, in a way that is not misleading, does not misconstrue Zhou and Firestone, 2019's conclusions, and does not show above-chance deciphering. Zhou and Firestone, 2019 do have a reply to this – it is a version of the "acuity" reply, which DMB consider for their indirectly encoded images (and rightly reject), but not for their directly encoded ones. DMB cite evidence that suggests this: Elsayed et al. showed that when properties of the primate retina are incorporated into a CNN that is adversarially attacked, those attacks do look quite decipherable to humans. But the fact that we would give this reply doesn't really bear on this review of DMB's paper, or its publishability. So even though I disagree with DMB's interpretation of Experiment 4, I have no problem with it in the way I do with the rest of the paper.

We are pleased the reviewer likes this experiment.

30) DMB's item-level analysis was described in a way I found confusing or maybe even misleading. Consider subsection “Reassessing the basis of the agreement in Zhou and Firestone (2019)”: "agreement on many images (21/48) was at or below chance levels. This indicates that the agreement is largely driven by a subset of adversarial images". But what DMB call a "subset" was in fact a majority of images! 27/48 here, and 41/48 in Experiment 1. (Indeed, it's not made explicit where this number comes from; in Experiment 3, 39/48 are numerically above chance and 9/48 are numerically below chance. The authors should clarify when they are referring to numerically above chance and when to significantly above chance, and should always report what chance is for these analyses.) Again, I don't mean to say this disrespectfully, but it frequently feels that DMB are going out of their way to describe Zhou and Firestone, 2019's results in uncharitable ways. DMB note that 85% of images in Experiment 1 are significantly agreed-with above chance, and that 57% are significantly above chance in Experiment 3b. (And if they combined the data from 3a and 3b, which they do elsewhere, they would find that this total is closer to 70%). Calling a majority of images (85%, 70%, or 57%) a "subset" is of course literally true, but it implies that it's somehow a small number of images, when in fact it's most of the images! Especially when chance for all of these statistics is only 5%. Please refer to it that way, rather than imply it's some kind of small number.

27/48 is a subset, and we provide the numbers in the relevant text. We don’t see how any of this is misleading.
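
As an aside, the distinction the reviewer draws between images that are numerically above chance and images that are significantly above chance can be illustrated with a per-image binomial test against the 5% chance level. The sketch below (Python/SciPy; it assumes SciPy ≥ 1.7 for `binomtest`) uses hypothetical counts, not our data.

```python
from scipy import stats

# Hypothetical per-image counts of participants (out of n) who chose the DCNN's
# label, tested against a 1-in-20 (5%) chance level; these values are NOT real data.
n_participants, chance = 200, 0.05
chose_dcnn_label = [18, 25, 9, 31, 12, 7, 22, 40, 11, 15]

numerically_above = [k / n_participants > chance for k in chose_dcnn_label]
significantly_above = [
    stats.binomtest(k, n_participants, chance, alternative="greater").pvalue < 0.05
    for k in chose_dcnn_label
]
print(f"numerically above chance:   {sum(numerically_above)}/{len(chose_dcnn_label)}")
print(f"significantly above chance: {sum(significantly_above)}/{len(chose_dcnn_label)}")
```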

30) Please also annotate the lines in Appendix 1—figure 1, with labels, so that it is immediately clear to the reader that the black line is chance.

Done.

31) The authors should always make clear whether the stimuli they show to readers were chosen algorithmically or by the authors' own subjective impression of them. For example, Figure 3 could give the impression that the authors have some procedure to generate best-case and worst-case images; indeed, I originally interpreted the figure that way. But in fact, I now understand that this just reflects their own choices about which images look least like their target class. So this caption should say so – something like, "Example of best-case and worst-case images for the same category ('penguin'), as judged by the present authors, and as used in Experiment 2". And so on elsewhere, including the generation of labels. Another example is Appendix 2—figure 2, which says "these were judged to contain the least number of features in common with the target category". It would be clearer to say "we judged these images to contain the least…". I know this does happen in some places, but even there it is confusing (for example, DMB say they picked images from "each category"; but in fact I believe it's just one image from each of 10 categories, right? It's worth being especially clear here on both counts).

Please see response (3) to Essential revisions above.

32) Similarly, in subsection “Experiment 1: Response alternatives” ("We chose a subset of ten images from the 48 that were used by Zhou and Firestone, 2019 and identified four competitive response alternatives (from amongst the 1000 categories in ImageNet) for each of these images"), the authors should state the procedure they used to do this. Why did they choose only 10 images? How did they choose the response alternatives? They give some examples of their negative criteria (e.g., no semantic overlap), but that still leaves out a lot of the selection process. Of course, if the answer is just that they selected in advance the images and labels that they thought would show low agreement, it's important to say so explicitly (though that would, of course, undermine aspects of the experiment's interpretation). I'm also confused about why DMB excluded "basketball" as a response alternative for the "baseball" image; they are both kinds of balls, but of course people would be unlikely to visually confuse them – isn't this a (likely unintentionally) self-serving experimental decision?

The number of images (10) was chosen to match Experiment 3, which includes MNIST stimuli comprising 10 categories. Since that experiment had a 2x2 design, we wanted all conditions to have the same number of stimuli. Consequently, the same categories were used in both experiments. The images were selected at random from the 48 used by Zhou and Firestone (Experiment 3b) and then checked to ensure they did not include images like ‘chainlink fence’, which trivially look like exemplars of the target category. Additionally, we checked the average agreement as computed by Zhou and Firestone to make sure the selection was not biased towards low agreement. In fact, 9 out of 10 of the chosen images show above-chance agreement as computed by Zhou and Firestone (90%, compared to 81.25% of the overall set from their experiment). The average agreement for the selected images was 7.87%, which was not significantly different from the average for the entire 48-image set (9.96%). The label choice for the competitive-label condition excluded objects from categories closely related to the target category precisely in order not to bias results in favour of the hypothesis that competitive labels would decrease agreement levels (e.g. not choosing ‘acoustic guitar’ as the foil for ‘electric guitar’).
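
One way to formalise the check that the selection was not biased towards low agreement is to compare the selected subset against random 10-image subsets of the full 48-image set. The sketch below (Python/NumPy) illustrates such a permutation comparison using simulated per-image agreement values in place of the Zhou and Firestone item-level data; it is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder per-image agreement scores (as computed by Zhou and Firestone)
# for the full 48-image set, and the indices of a 10-image selection.
full_set = rng.uniform(0.0, 0.25, size=48)        # simulated, NOT the real item-level values
selected_idx = rng.choice(48, size=10, replace=False)
observed_diff = full_set[selected_idx].mean() - full_set.mean()

# Permutation check: how often does a random 10-image subset deviate from the
# full-set mean by at least as much as the subset actually chosen?
null_diffs = np.array([
    full_set[rng.choice(48, size=10, replace=False)].mean() - full_set.mean()
    for _ in range(10_000)
])
p_value = (np.abs(null_diffs) >= np.abs(observed_diff)).mean()
print(f"observed difference = {observed_diff:.3f}, permutation p = {p_value:.3f}")
```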

33) I may well have an overly sensitive ear here, but there is a feeling throughout the paper that DMB believe Zhou and Firestone, 2019 behaved in a sneaky or obfuscatory way in reporting their results. Multiple colleagues have shared with me a similar reading of DMB's paper (after seeing their publicly posted preprint), wondering why it is so sharply worded and insinuatory. I have to say I agree. This is especially unfortunate because nothing could be farther from the truth: Zhou and Firestone, 2019 were transparent about all of these analyses, and just in case we weren't, we proactively made all of our data publicly available so that researchers could know exactly what we did – that, of course, is how DMB acquired the data in the first place. (DMB do not mention this either; it would be informative to the reader, and perhaps more collegial of DMB, to state that the reason they were able to reanalyze Zhou and Firestone, 2019's data was because of Zhou and Firestone, 2019's proactive transparency in making them public.) I hope that a revision, whether here or elsewhere, can be more respectful of other researchers' motivations and not insinuate hidden analyses or selective reporting.

For me, the tone is present throughout, such that it is hard to point out every example. Here are some:

- "Our first step, in trying to understand the surprisingly large agreement between humans and DCNNs observed by Zhou and Firestone, 2019, was to reassess how they measured this agreement" [subsection “Reassessing the level of agreement in Zhou and Firestone (2019)”]. But there was no need for DMB to find themselves "trying to understand" these analyses, as if those analyses were somehow obscure or hidden; the analyses, code, and data were made publicly available alongside the paper itself.

- "We noticed that Z and F used images designed to fool DCNNs trained on images from ImageNet, but did not consider the adversarial images designed to fool a network trained on MNIST dataset" [subsection “Experiment 3: Different types of adversarial images”]. Again, DMB write as if they are suspicious or something. But the reasons are simply that (a) Nguyen et al., highlight their ImageNet images much more (e.g., in their Figure 1), and (b) MNIST-trained networks aren't usually claimed to resemble human vision in the same way as ImageNet-trained networks. Moreover, Zhou and Firestone, 2019 do "consider the adversarial images designed to fool a network trained on MNIST dataset"; that's Zhou and Firestone, 2019's Experiment 5. So this language is not only unnecessary, but also even false.

- "when we examined their results more carefully, the level of agreement was much lower than reported" [Discussion section]. There are two problems here. First, what does "more carefully" mean in this context? More carefully than Zhou and Firestone, 2019? That really seems to imply that Zhou and Firestone, 2019 made an error or something, which DMB do not in fact believe as far as I know. DMB simply prefer another measure, not a "more careful" one, and as DMB acknowledge, Zhou and Firestone, 2019 already carry out some of their preferred analyses. And "much lower than reported" is simply inaccurate; it's fine that DMB prefer a different measure, but that's not the same as Zhou and Firestone, 2019 falsely or inaccurately reporting theirs. All instances of "lower than reported" simply must be revised; Zhou and Firestone, 2019 reported everything accurately – DMB just prefer a different analysis.

- The editor and other reviewers have also flagged "statistically unsound" and related language; I agree that this is inappropriate as well.

We have changed some of the wording that the reviewer objects to (e.g. “statistically unsound”), but we do not understand the reaction. None of the other phrases seem to us inappropriate, nor do they suggest that Zhou and Firestone, 2019 behaved in a sneaky or obfuscatory way. To address reviewer 1’s concern we have also changed the sentence “when we examined their results more carefully” to “when we examined their results in more detail”.

34) The Discussion section says "If human classification of these images strongly correlates with DCNNs, as Zhou and Firestone, (2019) observed". But Zhou and Firestone, 2019 do not observe this, for all of the reasons stated above. And this is an especially unfortunate example, since it not only misunderstands Zhou and Firestone, 2019 but also uses a technical term in our field – "correlate", and even "strongly correlate" – that doesn't correspond to anything Zhou and Firestone, 2019 did. Again, Zhou and Firestone, 2019's conclusion is that there is more overlap than would be expected by chance.

Please see our responses (12) and (13) above and (2) in Essential revisions.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Citations

    1. Dujmovic M, Malhotra G, Bowers JS. 2020. What do adversarial images tell us about human vision? Open Science Framework. a2sh5

    Supplementary Materials

    Transparent reporting form

    Data Availability Statement

    Data, scripts, and stimuli from all our experiments are available via the Open Science Framework at https://osf.io/a2sh5/. Stimuli from evolutionary runs producing fooling images by Nguyen et al., 2015 can be found at https://anhnguyen.me/project/fooling/.


    The following dataset was generated:

    Dujmovic M, Malhotra G, Bowers JS. 2020. What do adversarial images tell us about human vision? Open Science Framework. a2sh5

