PLOS Computational Biology. 2020 Jun 15;16(6):e1007973. doi: 10.1371/journal.pcbi.1007973

XDream: Finding preferred stimuli for visual neurons using generative networks and gradient-free optimization

Will Xiao 1,2,*, Gabriel Kreiman 2,3
Editor: Alona Fyshe
PMCID: PMC7316361  PMID: 32542056

Abstract

A longstanding question in sensory neuroscience is what types of stimuli drive neurons to fire. The characterization of effective stimuli has traditionally been based on a combination of intuition, insights from previous studies, and luck. A new method termed XDream (EXtending DeepDream with real-time evolution for activation maximization) combined a generative neural network and a genetic algorithm in a closed loop to create strong stimuli for neurons in the macaque visual cortex. Here we extensively and systematically evaluated the performance of XDream. We used ConvNet units as in silico models of neurons, enabling experiments that would be prohibitive with biological neurons. We evaluated how the method compares to brute-force search and how well it generalizes to different neurons and processing stages. We also explored design and parameter choices. XDream can efficiently find preferred features for visual units without any prior knowledge about them. XDream extrapolates to different layers, architectures, and developmental regimes, performing better than brute-force search, and often better than exhaustive sampling of >1 million images. Furthermore, XDream is robust to the choice of image generator, optimization algorithm, and hyperparameters, suggesting that its performance is locally near-optimal. Lastly, we found no significant advantage to problem-specific parameter tuning. These results establish expectations and provide practical recommendations for using XDream to investigate neural coding in biological preparations. Overall, XDream is an efficient, general, and robust algorithm for uncovering neuronal tuning preferences using a vast and diverse stimulus space. XDream is implemented in Python, released under the MIT License, and works on Linux, Windows, and macOS.


This is a PLOS Computational Biology Software paper.

Introduction

What stimuli excite a neuron, and how can we find them? Consider vision as a paradigmatic example: the selection of stimuli to probe neural activity has shaped our understanding of how visual neurons represent information. It is practically impossible to exhaustively evaluate neuronal responses to images, due to the combinatorially large number of possible images. Instead, investigators have traditionally selected stimuli guided by natural image statistics, behavioral relevance, theoretical postulates about internal representations, intuitions from previous studies, and serendipitous findings. Stimuli selected in this way underlie our current understanding of how circular center-surround receptive fields [1] give rise to orientation tuning [2], then to encoding of more complex shapes such as curvatures [3, 4], and further to selective responses to complex objects such as faces [5–7].

Despite the progress made in understanding visual cortex by testing limited sets of hand-chosen stimuli, these experiments could be missing the true feature preferences of neurons. In other words, there could be other images that drive visual neurons better than those found so far. Such images could lead us to revisit our current descriptions of feature tuning in visual cortex.

A recently introduced method shows promise to begin bridging this gap. Named XDream (EXtending DeepDream with real-time evolution for activation maximization), this method combines a genetic algorithm and a deep generative neural network [8]—both inspired by previous work [9–12]—to evolve images that trigger high activation in neurons [13]. XDream can generate strong stimuli for neurons in macaque inferior temporal (IT) and primary visual cortex (V1).

The performance and design options of XDream have not been thoroughly evaluated, owing to the time-intensiveness of neuronal recordings and the difficulty of fully controlling experimental variables. To overcome these challenges, here we test the performance of XDream using state-of-the-art in silico models of visual neurons in lieu of real neurons, in the same spirit as [14]. Specifically, we use convolutional neural networks (ConvNets) pre-trained on visual recognition tasks as an approximation to the computations performed along the ventral visual cortex [15–17]. Using these models as a proxy for real neurons allows us to compare synthetic stimuli with a large set of reference images, to evaluate XDream’s performance across processing stages, model architectures, and training regimes, to empirically optimize algorithm design and parameter choices in a systematic fashion, and to disentangle the effects of neuronal response stochasticity.

Although there is a rich literature in computer science on feature visualization [18–21], we focus on the more biologically relevant scenario where there is no information about the architecture and weights of the target model, and where we only have access to a few, potentially stochastic, activation values from the neurons. These conditions reflect those prevailing in neuronal recordings and are fundamentally different from the assumptions made in computer science studies.

Under these realistic constraints, we show that XDream still reliably and efficiently uncovers preferred features of units with a wide range of response properties, generalizing to different processing stages within a network, different network architectures, and different training datasets. Furthermore, XDream performed equally well with a wide range of algorithmic and parameter choices. Based on these results, we suggest parameters to use and results that can be expected when using XDream to investigate neuronal tuning properties. Our findings suggest that XDream is a general and robust method for investigating neuronal preferences in visual cortex.

Design and implementation

Overview

XDream combines an image generator (e.g., the generator in a generative adversarial network), a target neuron (e.g., a unit in a ConvNet), and a non-gradient-based optimization algorithm (e.g., a genetic algorithm) in a closed loop. In each iteration, the optimization algorithm proposes a set of codes, the image generator synthesizes the codes into images, the images are evaluated by the target neuron to produce one scalar score per image, and the scores are used by the optimization algorithm to propose a new set of codes (Fig 1). Importantly, no optimization gradient is needed from the neuron.
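To make the loop concrete, here is a minimal sketch in Python. It is illustrative only: the names `generator`, `score_fn`, and the `propose`/`tell` optimizer interface are hypothetical, not the actual XDream API.

```python
# Minimal sketch of the XDream closed loop (illustrative; names are
# hypothetical, not the actual XDream API). Any image generator, scorer,
# and gradient-free optimizer with these interfaces would fit.
import numpy as np

def xdream_loop(generator, score_fn, optimizer, n_generations=500):
    """generator: code (1D array) -> image (H x W x 3 array).
    score_fn: image -> scalar activation (no gradients needed).
    optimizer: exposes propose() -> list of codes, and tell(scores)."""
    best_code, best_score = None, -np.inf
    for _ in range(n_generations):
        codes = optimizer.propose()                 # e.g., 20 codes per generation
        images = [generator(c) for c in codes]      # synthesize images from codes
        scores = np.array([score_fn(im) for im in images])
        optimizer.tell(scores)                      # optimizer updates its state
        i = int(scores.argmax())
        if scores[i] > best_score:
            best_code, best_score = codes[i], scores[i]
    return best_code, best_score
```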

Fig 1. Overview of the XDream method.


a), XDream combines an image generator, a target neuron, and a non-gradient-based optimization algorithm. b,c), An example experiment targeting CaffeNet layer fc8, unit 1. b), Mean activation achieved over 500 generations, 20 images per generation (10,000 total image presentations). c), Images obtained at a few example generations indicated by minor x-ticks in b). The activation to each image is labeled above the image and indicated by the color of the margin. d), The top 5 images among 10,000 random images from ImageNet (ILSVRC12 dataset, >1.4 M images). The number of random images is matched to the number of images presented during optimization. The top image among all >1.4 M images is shown in Fig 2b.

The image generators and optimization algorithms are detailed below. The code is implemented in Python 3 and runs on Linux, Windows, and macOS, although GPU acceleration is available only on Linux and Windows. The main dependency is Caffe [22] (https://caffe.berkeleyvision.org/) or PyTorch (https://pytorch.org/), either of which is required for neural network computation. Other dependencies are standard Python packages listed in requirements.txt in the repository, including numpy, h5py, opencv-python, scipy, and scikit-image.

Image generators

An image generator is a function that outputs an image given some representation of that image (an image code) as input. We tested the family of DeePSiM generators developed in [8]; they are generative adversarial networks trained to invert individual layers of AlexNet [23]. The pre-trained models are available at https://lmb.informatik.uni-freiburg.de/people/dosovits/code.html. We have converted the models into PyTorch to facilitate future research; links to the converted models are available in the code repository (see Code availability below). We used the image generator inverting the fc6 layer by default, except in S4 Fig, where we compared different generators. An alternative version of the DeePSiM-fc6 generator was trained on the Places-365 dataset using code from [8] and a pre-trained classifier [24].
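As a sketch of how such a generator can be wrapped as the `generator` function in the loop above: the snippet below assumes a converted DeePSiM-fc6-like PyTorch module; the 4096-dimensional code and the roughly [0, 255] output range are assumptions based on the fc6 layer it inverts, not guaranteed properties of the released models.

```python
# Sketch of wrapping a PyTorch generator (e.g., a converted DeePSiM-fc6
# model) as a code -> image function. Code size and output scaling are
# assumptions for illustration.
import numpy as np
import torch

def make_generator(net, code_dim=4096, device="cpu"):
    net = net.to(device).eval()

    @torch.no_grad()
    def generate(code):
        z = torch.as_tensor(code, dtype=torch.float32, device=device)
        img = net(z.view(1, code_dim))          # (1, 3, H, W), roughly [0, 255]
        img = img.squeeze(0).permute(1, 2, 0)   # -> (H, W, 3)
        return img.clamp(0, 255).cpu().numpy().astype(np.uint8)

    return generate
```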

Fitness function

The key metric XDream optimizes is a scalar value we refer to as fitness, which is associated with each image. In the neuroscience context, a fitness function can be the stimulus-evoked spike count for a neuron in visual cortex. In the current study, the fitness function is the activation of the target unit in a ConvNet.
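The sketch below shows one way such a fitness function can be built for an in silico target, using a torchvision AlexNet as a stand-in for CaffeNet and a PyTorch forward hook to read out one unit. The layer and unit choices are arbitrary examples, and the torchvision API shown (weights argument, torchvision ≥ 0.13) is an assumption, not the paper's exact setup.

```python
# Sketch of a fitness function: the activation of one ConvNet unit,
# read out with a PyTorch forward hook. Layer/unit indices are examples.
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

def make_fitness(net, layer, unit):
    captured = {}
    layer.register_forward_hook(lambda module, inp, out: captured.update(out=out))

    @torch.no_grad()
    def fitness(image_uint8):
        x = TF.to_tensor(image_uint8)           # (3, H, W) in [0, 1]
        x = TF.normalize(x, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        net(x.unsqueeze(0))                     # hook captures the layer output
        act = captured["out"][0, unit]
        return float(act.mean())                # average over space for conv layers

    return fitness

net = models.alexnet(weights="IMAGENET1K_V1").eval()   # torchvision >= 0.13
# e.g., unit 1 of the output layer (analogous to CaffeNet fc8 unit 1)
fitness = make_fitness(net, net.classifier[6], unit=1)
```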

Optimization algorithms

An optimization algorithm in the context of XDream is a function that iteratively proposes a set of n image codes (real-valued vectors), c_1, …, c_n, and then uses their corresponding fitness values y_1, …, y_n to propose a new set of codes expected to have higher fitness. We used a genetic algorithm by default, but also considered two other algorithms: finite-difference gradient descent (FDGD) and natural evolution strategies (NES) [25]. Implementation details for the optimization algorithms are available in S1 Text.
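For illustration, here is a minimal genetic algorithm implementing the `propose`/`tell` interface sketched earlier (selection, uniform crossover, sparse mutation). The hyperparameter values are placeholders, not XDream's tuned defaults (those are in S2 Table).

```python
# Minimal genetic algorithm sketch; hyperparameters are placeholders.
import numpy as np

class SimpleGA:
    def __init__(self, n=20, dim=4096, n_parents=10,
                 mut_rate=0.25, mut_size=0.75, seed=0):
        self.rng = np.random.default_rng(seed)
        self.codes = self.rng.standard_normal((n, dim)).astype(np.float32)
        self.n_parents, self.mut_rate, self.mut_size = n_parents, mut_rate, mut_size

    def propose(self):
        return list(self.codes)

    def tell(self, scores):
        order = np.argsort(scores)[::-1]
        parents = self.codes[order[:self.n_parents]]    # fittest codes survive
        n, dim = self.codes.shape
        kids = []
        for _ in range(n):
            pa, pb = parents[self.rng.integers(len(parents), size=2)]
            mask = self.rng.random(dim) < 0.5           # uniform crossover
            child = np.where(mask, pa, pb)
            mut = self.rng.random(dim) < self.mut_rate  # sparse Gaussian mutation
            child = child + mut * self.rng.normal(0, self.mut_size, dim)
            kids.append(child.astype(np.float32))
        self.codes = np.array(kids)
```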

Computing environment

Neural network computations were performed on NVIDIA GPUs. Portions of this research were conducted on the O2 High Performance Compute Cluster supported by the Research Computing Group, at Harvard Medical School. See http://rc.hms.harvard.edu for more information.

Results

Random exploration of stimulus space is inefficient

A common approach for exploring neuronal selectivity is to use arbitrarily selected images, often from a limited number of categories (for example, in [7, 26]). We therefore considered random exploration as a baseline for comparison. We used the AlexNet architecture as the target model [23] (implemented as CaffeNet; S1 Table) and sampled images from ImageNet [27] (ILSVRC12 dataset, 1,431,167 images), a large dataset widely used in computer vision that includes the training set of CaffeNet. We randomly sampled n images either from all of ImageNet or from 10 categories randomly selected from the 1,000 training categories in ImageNet (n/10 images per category). For units in different layers of the network, we evaluated the activation values in response to these images and calculated the relative activation, defined as the ratio between the activation elicited by an image and the maximum activation elicited by any image in ImageNet. By definition, the relative activation for the best image in ImageNet is 1, which is also an upper bound on the relative activation values observed with random sampling. Randomly selected images typically yielded relative activation values well below 1 (S1 Fig). As expected, the maximum observed relative activation increased with n, but only slowly, with near-logarithmic growth. Moreover, for later layers (e.g., fc8), sampling from only 10 categories yielded significantly worse results than sampling completely at random, which we hypothesize is because the small number of categories imposes a bottleneck on the diversity of high-level features represented. In neuroscience studies, category selection is clearly not completely random: investigators may have intuitions and prior knowledge about the types of stimuli that are more likely to be effective. To the extent that those intuitions are correct, they can enhance the search process. However, those intuitions are seldom guided by systematic examination of stimulus space and could well miss important types of stimuli.
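The baseline statistic is simple to state in code. A minimal sketch, assuming `activations` is a precomputed vector of one unit's responses to every ImageNet image:

```python
# Sketch of the random-sampling baseline: relative activation of the
# best of n randomly sampled images, relative to the best over the set.
import numpy as np

def max_relative_activation(activations, n, seed=0):
    rng = np.random.default_rng(seed)
    sample = rng.choice(activations, size=n, replace=False)
    return sample.max() / activations.max()   # <= 1 by construction
```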

XDream can find strong stimuli for neurons

XDream has three key components: an image generator, which defines the search space; an objective function, given by the activation of the target unit, which guides the search; and an optimization algorithm, which performs the search (Fig 1a). In each generation, the generator creates images from their latent representations (codes), the target unit activation is evaluated for each of the generated images, and the optimizer refines the codes based on the activation values. Initialized with random codes (examples shown in Fig 1a), the algorithm is iterated for 10,000 total image presentations, a number that is feasible in a typical neuroscience experiment [13]. Crucially, the algorithm does not use any prior knowledge about the architecture or weights of the target model.

An example experiment with unit 1 in the output layer (layer fc8) of CaffeNet is shown in Fig 1b and 1c. In 500 generations of 20 images each, the activation of the target unit increased rapidly and saturated at approximately generation 300. Fig 1c shows example images at a few generations (log-spaced to show a range of activations), illustrating the evolution of the images from the initial noise pattern to the final image. In the following analyses, we concentrate on the best image in the last generation, which we refer to as the optimized image. However, it is worth noting that the responses to all 10,000 unique images presented during the evolution may illuminate features of the neuron’s tuning (see Discussion).

How strong was the activation achieved by XDream-generated images? We compared the optimized image to images from ImageNet. Unit 1 in layer fc8 was trained to be a “goldfish” detector. Correspondingly, when we randomly sampled 10,000 images from ImageNet, the best images were photos of goldfish (Fig 1d). The highest activation value observed in this random sample was 30.67. The best image from ImageNet for this unit was a picture of a goldfish and elicited an activation of 40.55 (Fig 2b). Consistent with S1 Fig, the best image found by random sampling produced a much lower activation value than the best example in ImageNet. In comparison, the optimized image generated by XDream elicited an activation of 72.42. In other words, using a limited number of presentations, XDream generated images that elicited higher activation than any natural image in ImageNet. We refer to such images with relative activation > 1 as super stimuli.

Fig 2. XDream generalizes across layers, architectures, and training sets.


a), Violin plot showing the distributions of relative activation (activation of optimized stimulus relative to highest activation in >1.4 M ImageNet images) over 100 randomly selected units per layer. For each target model, we investigated early, middle, late, and output layers (see S1 Table for the specific layers). The violin contours indicate kernel density estimates of the distributions, white circles indicate the medians, thick bars indicate first and third quartiles, and whiskers indicate 1.5× interquartile ranges. For comparison, grey boxes (interquartile ranges) and lines (medians) show the distribution of maximum relative activation for 10,000 random ImageNet images. The horizontal dashed line corresponds to the best ImageNet image. b), Optimized (top row) and best ImageNet (bottom row) images and activations for 10 example units across layers and architectures. For output units, corresponding category labels are shown below the images.

XDream generalizes across layers, architectures, and training sets

The default generative network used in XDream was trained to invert the internal representations at layer fc6 of CaffeNet, which was in turn trained on ImageNet [8]. Could this generator allow XDream to generalize to other network layers, architectures, and training sets? If XDream is specific to certain layers and architectures, or specific to ImageNet-trained networks, this may limit its applicability to real neurons.

We first assessed whether XDream could extrapolate to other layers in CaffeNet by selecting 100 units each from the early, middle, late, and output layers of CaffeNet (Fig 2a). XDream was able to find optimized images that were better than the best randomly selected images across all layers (p < 10^−16, false discovery rate (FDR) corrected for 28 tests in this section). The optimized images were also significantly better than the best images in ImageNet (p < 10^−9, FDR corrected).

Next, we tested 100 units from each of 4 layers in 5 different network architectures: the ResNet-v2 152- and 269-layer variants [28], Inception-v3 [29], Inception-v4, and Inception-ResNet-v2 [30]. These models were all trained on ImageNet. XDream was able to generate better images than the best random images for the vast majority of units across all layers and architectures (Fig 2a; p < 10^−8 across layers) except the early layer of Inception-v3 (p = 0.2) and of Inception-ResNet-v2 (p = 0.09). With the same exceptions, XDream generated super stimuli for all other tested layers (p = 0.01 for the early layer of Inception-v4, p = 2 × 10^−4 for the middle layer of Inception-ResNet-v2, and p < 10^−9 for all other layers). Example optimized images for units in different layers and architectures are shown in Fig 2b and S2 Fig. Furthermore, several generators trained on different layer representations performed equally well across classifier layers (S4 Fig).

Finally, we tested the ability of XDream to optimize unit responses when the generator and target networks are trained on different datasets. We tested PlacesCNN [31], a network with the same architecture as CaffeNet but trained on a different dataset, Places. The Places dataset also contains photographic images, but they mainly depict scenes rather than objects. Again, XDream was able to find super stimuli across all layers in this network (Fig 2a, last four distributions; p < 10^−6 across layers), even when using a generative network trained on different images. Conversely, when using a generator trained on the Places dataset, XDream still performed similarly well in optimizing CaffeNet and PlacesCNN (S4 Fig).

These results show that XDream can efficiently create images that trigger high activations in a target unit without making assumptions about the type of images a unit may prefer and without any knowledge of the target model architecture or connectivity, suggesting that XDream may well be applicable to biological neurons. Furthermore, XDream generalizes across layers in a ConvNet, and different layers roughly correspond to areas along the ventral visual stream [17, 32, 33], suggesting that XDream may also generalize to several ventral stream areas. Consistent with this observation, results from [13] indicated that XDream can find optimized stimuli for V1 as well as IT neurons.

XDream is robust to different initial conditions

XDream starts the search from an initial generation of image codes. In Fig 2, we always initialized the algorithm using the same set of 20 random image codes, 6 of which are shown in Fig 1a. Does the choice of initial conditions affect the results?

To address this question, we first tested how much the particular choice of random initial codes matters. For each target unit, we repeated the experiment using 10 different random initializations and compared the optimized relative activation to that of the original random initialization. Different initial conditions produced slightly better or worse relative activation values centered around a mean difference of 0, and the standard deviation of the fractional change was lower than 10% (Fig 3a).

Fig 3. Comparison of different initializations.


a,b,c), Effect of using different random initializations. a), Distributions of fractional change in optimized activation if 10 different random initializations are used. b), Left, relative activation in response to images interpolated (in the code space) between two optimized images from two different random initial conditions. Right, activation normalized to the endpoints (location 0 or 1), highlighting the change in activation away from the endpoints. c), Optimized images from different initializations for 3 example units in the output layer (one unit per row). Activation values are shown above each image. d), Good versus bad initializations. For each target unit, its best, middle, or worst 20 images from ImageNet were used as the initial generation. The images were converted to the image code space using either an optimization method (“opt”) or an inversion method (“ivt”; Methods). Left to right within the opt and ivt groups are results from initialization with the worst, middle, and best 20 images. Random initialization is shown for comparison. The open and solid violins show the distributions, in the first and last generation respectively, of relative activation over 100 units in each layer.

Similar activation values notwithstanding, the optimized images were different at the pixel level (Fig 3c); they may comprise an “invariance manifold” that contains similar, but not identical, images eliciting comparable activation values. What might this invariance manifold look like? To explore this question, in CaffeNet layers conv2, conv4, fc6, and fc8, we linearly interpolated between two separately optimized images (from different initializations) in the image code space, and measured the target unit activation in response to the interpolated images (Fig 3b). The interpolated images were much stronger stimuli than the majority of ImageNet images. However, particularly in layers fc6 and fc8, the interpolation midpoint activated the units less strongly than either endpoint, suggesting either that the sets of strong stimuli are disjoint, or that the invariance manifolds may have non-convex geometry. Studies have reported visual neurons that prefer seemingly unrelated stimuli. It remains an interesting open question whether there exists a feature representation space in which neuronal tuning functions have “simple” geometry.
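The interpolation probe itself is a one-liner given the components above. A minimal sketch, reusing the hypothetical `generator` and `fitness` functions from earlier:

```python
# Sketch of probing the "invariance manifold": linearly interpolate
# between two optimized codes and score the generated images.
import numpy as np

def interpolation_curve(code_a, code_b, generator, fitness, n_steps=11):
    alphas = np.linspace(0.0, 1.0, n_steps)        # 0 -> code_a, 1 -> code_b
    return [(a, fitness(generator((1 - a) * code_a + a * code_b)))
            for a in alphas]
```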

Next, we tested whether there are particularly good or bad ways of choosing the initial stimuli. We selected, separately for each target unit, the 20 ImageNet images that led to the highest, middle, and lowest activation values and used those images to form the initial population (Fig 3d). To convert images into the image codes comprising the initial population, we used either the “opt” or the “ivt” algorithm (Methods). Initializing with better or worse natural images did not improve the optimized images in the conv2 layer (p = 0.87 and 0.19 for “opt” and “ivt,” respectively, FDR-corrected for 8 tests in this and the next sentence). In higher layers, initializing with the best natural images led to slightly higher relative activation values (Fig 3d; Table 1; p < 5 × 10^−3 for “opt” and p < 10^−10 for “ivt” across layers). We speculate that the improvement in higher layers arises because units in deeper layers are progressively more selective, making it more difficult to optimize their responses, so that better initializations become beneficial. However, in an actual neurophysiology experiment, it is unlikely that the investigator would know, a priori, stimuli as good as the best of >1.4 M images. Meanwhile, initializing with the middle or worst natural stimuli was similar to initializing with random image codes. Therefore, initializing randomly seems reasonable (see the encoding sketch after Table 1).

Table 1. Effect of using good vs. bad initialization.

Encoding alg.   Measure     conv2     conv4         fc6           fc8
opt             slope       0.010     0.037         0.047         0.056
opt             p-value     0.87      0.004         0.004         7 × 10^−5
ivt             slope       0.044     0.113         0.241         0.353
ivt             p-value     0.19      7 × 10^−11    4 × 10^−22    5 × 10^−77

For each unit, the 20 worst, middle, and best images from ImageNet, as ranked by that unit, were used to initialize the genetic algorithm. The images were converted to image codes using one of two encoding algorithms, “opt” or “ivt” (see Methods). The slope was calculated by linear regression of relative activation (median across 100 random units in each layer) on the initialization condition ({0, 1, 2} for {worst, middle, best}, respectively). Thus, the slope quantifies the improvement in relative activation when a better initialization is used (worst → middle or middle → best).

To summarize, initializing the algorithm with different random conditions resulted in only a small variation in the optimized image activation, and the optimized images were similar, although not identical, at the pixel level. Initializing with prior knowledge had little to no effect on the optimized image activation, except in later layers and only when the seed images were comparable to the best image among >1 M images.
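For reference, here is a sketch of the “opt” encoding idea as we understand it: gradient-descend a code so that the generator reproduces a target image under pixel-space mean squared error. The step count and learning rate are hypothetical placeholders (the actual procedure is described in Methods/S1 Text). Using the generator's gradients is fine here because encoding is an offline preprocessing step, not part of the gradient-free closed loop.

```python
# Sketch of "opt"-style encoding: optimize a code so the generator
# reproduces a target image (pixel MSE). Hyperparameters are placeholders.
import torch

def encode_opt(target, generator_net, code_dim=4096, steps=200, lr=0.1):
    """target: (1, 3, H, W) tensor; generator_net: differentiable code -> image."""
    code = torch.zeros(1, code_dim, requires_grad=True)
    optim = torch.optim.Adam([code], lr=lr)
    for _ in range(steps):
        optim.zero_grad()
        loss = torch.mean((generator_net(code) - target) ** 2)
        loss.backward()       # gradients flow through the generator only
        optim.step()
    return code.detach()
```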

Different optimization algorithms can be incorporated into XDream, but the genetic algorithm consistently works well

An important component of XDream is the optimization algorithm. The results shown thus far were based on using a genetic algorithm as the optimization algorithm, a choice inspired by previous work [9–11]. Here, we compared the genetic algorithm to two additional algorithms, a naïve finite-difference gradient descent algorithm (FDGD; Methods) and Natural Evolution Strategies (NES; [25], Methods). NES has been used in a related problem [34]. FDGD and NES were significantly worse than the genetic algorithm in CaffeNet conv2 (p < 10^−13, FDR corrected for 20 tests here and in the next section) and conv4 layers (p < 10^−3). Yet, both FDGD and NES were significantly better than the genetic algorithm in CaffeNet fc6 (p < 10^−16), fc8 (p < 10^−16), and Inception-ResNet-v2 classifier layers (p < 10^−12; Fig 4a).
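To illustrate the flavor of FDGD, here is a sketch of one update step: estimate a gradient from antithetic perturbation pairs, then take an ascent step. In this sketch, `score_fn` maps a code directly to fitness (the generator composed with the target unit), and the hyperparameters are placeholders (see S1 Text and S3 Table for the actual details).

```python
# Sketch of naive finite-difference gradient descent (FDGD): estimate
# the gradient from antithetic perturbation pairs, then step uphill.
import numpy as np

def fdgd_step(code, score_fn, n_pairs=10, sigma=1.0, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(code)
    for _ in range(n_pairs):
        u = rng.standard_normal(code.shape)
        delta = score_fn(code + sigma * u) - score_fn(code - sigma * u)
        grad += delta / (2 * sigma) * u        # finite-difference estimate
    return code + lr * grad / n_pairs          # ascend the averaged estimate
```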

Fig 4. Comparison of optimization algorithms and their robustness to noise.


a), We compared 3 gradient-free optimization algorithms (Methods): a genetic algorithm, finite-difference gradient descent (FDGD), and Natural Evolution Strategies (NES; [25]). The left and right halves of each violin correspond to noiseless and noisy units, respectively. Dashed lines inside the violins indicate quartiles of the distribution. Otherwise, the format of the plot is as in Fig 2a. b), The performance of the genetic algorithm gradually improves with decreasing amounts of noise within a neurophysiologically relevant range. The format of the plot is as in Fig 2a, except that the violins are horizontal. On the right, 3 alternative scales for the y-axis are shown, for comparison with common ways of assessing noise.

XDream is robust to noise in neuronal responses

An important difference between model units and real neurons is the lack of noise in model unit activations. Upon presenting the same image, a model unit returns a deterministic activation value. In contrast, in biological neurons, the same image can evoke different responses on repeated presentations (even though trial-averaged responses may be highly consistent; see [35]). To test whether XDream could still find super stimuli with noisy units, we implemented a simple model of stochasticity in the units by using the true activation value to control the rate of a homogeneous Poisson process, from which the “observed” activation value on a single trial was drawn (Methods). Homogeneous Poisson processes have been used extensively to model stochasticity in cortical neurons [36].
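A minimal sketch of this noise model, wrapping a deterministic fitness function: the activation sets the rate of a Poisson draw, scaled so that a “good” stimulus yields `expected_max` spikes on average. The rescaling back into activation units is our assumption for illustration.

```python
# Sketch of the Poisson noise model: the deterministic activation
# controls the rate of a Poisson draw on each "trial".
import numpy as np

def make_noisy(fitness, good_activation, expected_max=20, seed=0):
    rng = np.random.default_rng(seed)
    scale = expected_max / good_activation        # activation units -> spikes

    def noisy_fitness(image):
        rate = max(fitness(image), 0.0) * scale   # Poisson rates are non-negative
        return rng.poisson(rate) / scale          # single-trial "observed" value

    return noisy_fitness
```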

As expected, performance deteriorated when noise was added (Fig 4a, noisy condition). However, XDream using the genetic algorithm was still able to find optimized stimuli better than random exploration for most layers (p < 10^−10 for all tested layers except p = 0.19 for CaffeNet fc8, FDR-corrected for 5 tests) and was also able to find super stimuli for some layers (p < 10^−5 for CaffeNet conv4 and fc6 layers; FDR-corrected for 5 tests).

Noise in the unit activations affected different optimization algorithms to different extents. The genetic algorithm was at least as good as, and often superior to, both alternative optimization algorithms when considering noisy units. The NES algorithm performed similarly to the genetic algorithm in the CaffeNet fc8 layer and the Inception-ResNet-v2 classifier layer (p = 0.03 and 0.65, respectively), but was worse in the other 3 tested layers (p < 10^−14). The FDGD algorithm was particularly sensitive to noise, performing worse than the genetic algorithm in all layers tested (p < 10^−6) and frequently failing to find good stimuli.

In the noisy conditions examined thus far, we assumed that in each presentation, model units yielded approximately 20 spikes for a “good” stimulus (defined as the expected best image among 2,500 random ImageNet images). This choice was motivated by what may be realistically expected when recording from biological neurons (e.g., a firing rate of 100 spikes per second to a good stimulus over a 200 ms observation window), but this number will depend on individual neurons and specific experimental designs. This number matters because, for a homogeneous Poisson process, the standard deviation-to-mean ratio is inversely proportional to the square root of the rate parameter (the expected number of spikes), so a higher firing rate means a higher signal-to-noise ratio. To characterize the performance of XDream under different noise conditions, we varied the rate parameter, as defined by the expected maximum spike count, and measured XDream performance at each noise level (Fig 4b). The empirical level of noise was quantified with commonly used measures such as trial-to-trial self-correlation, standard deviation-to-mean ratio, and signal-to-noise ratio (SNR). As the amount of noise decreased, the performance of XDream gradually approached its noiseless performance. Notably, even with a high level of noise (5 spikes for a good stimulus, self-correlation of 0.08, and SNR of 2), XDream was able to find super stimuli for around half of the target units in all but the deepest layer (fc8) tested.

Availability and future directions

The code for XDream can be obtained directly from https://github.com/willwx/XDream/.

In the computer science literature, activation maximization is a well-known approach for visualizing features represented by units in a ConvNet [12, 21, 37–39]. However, these techniques are only applicable to networks that provide optimization gradients; in other words, perfect knowledge is assumed of the target network architecture and weights. Clearly, such requirements are not met in current neuroscience experiments.

Recently, several other studies have pursued goals similar to those of XDream, but with a different approach [32, 33, 40, 41]. In that approach, a ConvNet-based model is first fitted to predict neuronal responses to a set of training images; standard white-box activation maximization techniques are then applied to the fitted model. The relation between this approach and XDream parallels the relation, in research on black-box adversarial attacks, between the so-called “substitute model” approach and what we may call, by comparison, a “direct” approach. A promising future direction is to combine the two approaches to leverage their respective advantages: unlimited queries (after training) and efficient optimization with substitute models, and the avoidance of model extrapolation and transferability problems with direct optimization.

The results presented here are based on maximizing activation values, whereas the results shown in [13] are based on maximizing spike counts. Activation values and firing rates are commonly used proxies for internal representation in machine learning and neuroscience, respectively. However, other putative neural codes can be studied, such as pooling activation across multiple units, increasing the sparseness of the representation across units, matching a pre-specified pattern of population firing, increasing correlated firing, synchronizing the firing of nearby units, or maximizing power in a certain frequency band of local field potentials. XDream is agnostic to the underpinning of the objective function as long as it is image-specific, quantitatively defined, and computable in real time. Thus, the same algorithm can be readily applied to investigate different putative neural coding mechanisms.
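To make this concrete, here are sketches of a few such alternative objectives; each maps a vector of unit activations to one scalar fitness value and can be dropped into the closed loop unchanged. The specific sparseness index (Treves-Rolls-style) is our choice for illustration; any sparseness measure would serve.

```python
# Sketches of population-level objectives; each returns one scalar fitness.
import numpy as np

def pooled_activation(acts):
    return float(np.mean(acts))                      # drive a population together

def sparseness(acts):
    # Negative Treves-Rolls index, so higher fitness = sparser (an assumption)
    a = np.asarray(acts, dtype=float)
    return -float(a.mean() ** 2 / (np.mean(a ** 2) + 1e-12))

def match_pattern(acts, target):
    # Negative squared error, so higher fitness = closer to the target pattern
    return -float(np.sum((np.asarray(acts) - target) ** 2))
```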

Finally, it is worth remembering that the identification of an optimal stimulus, or even a diverse set of them, still does not automatically lead to a full characterization of the function of a neuron. Finding preferred stimuli, or “feature visualization” in computer science parlance, has guided thinking about the function of individual neurons in both neuroscience and deep learning [6, 21]. However, optimal stimuli reflect but do not disentangle critical issues like tuning features, invariant features, and context dependence; these questions need to be distinguished by subsequent hypothesis-driven investigation [4244]. A method to automatically find preferred stimuli of neurons can suggest initial hypotheses about a poorly-understood visual area, or motivate re-thinking about an extensively-studied region. Of note, during the optimization process, XDream does test thousands of related images, covering the target unit’s response levels both widely and densely (Fig 1c). Closer analyses of these images may reveal richer information about the tuning surface of a neuron (e.g., invariances) than what is reflected by the single best image.

In summary, XDream is able to discover preferred features of visual units without assuming any knowledge about the structure or connectivity of the system under study. Thus, XDream can be a powerful tool for elucidating the tuning properties of neurons in a variety of visual areas in different species, even where there is no prior knowledge about the neuronal preferences. Furthermore, we speculate that the general framework of XDream can be extended to other sensory domains, such as sounds, language, and music, as long as good generative networks can be built.

Supporting information

S1 Fig. Expected maximum relative activation in response to random natural images.

We measured the maximum relative activation expected under two random sampling schemes. “Random” refers to picking a given number of images randomly from the ImageNet dataset (blue). “10 categories” refers to first randomly picking 10 categories out of the 1,000 ImageNet categories and then picking images randomly from those categories (gray). We considered 4 layers from the CaffeNet architecture. Lines indicate the median relative activation (activation divided by the highest activation over all ImageNet images). Shading indicates the 25th to 75th percentiles among 100 random units per layer.

(TIF)

S2 Fig. Optimized and best ImageNet images for other example neurons across architectures and layers.

Two neurons were randomly selected per layer per architecture (S1 Table). Format is the same as in Fig 2.

(TIF)

S3 Fig. The image generator can approximate arbitrary images, and XDream can find these images using only scalar distance as a loss function.

This figure reproduces Supplementary Figure 1 in [13]. The generative network is challenged to synthesize arbitrary target images (row 1) using one of two encoding methods, “opt” (row 2) and “ivt” (row 3; Methods). In addition, XDream can discover the target image efficiently (within 10,000 test image presentations) by using the genetic algorithm to minimize the mean squared difference between the target image and any test image as a loss function, either in pixel space (row 4) or in CaffeNet pool5 representation space (row 5).

(TIF)

S4 Fig. Comparison of image generators.

a), We tested each generator in the family of image generators from [8] as the image generator in XDream, together with a generator directly representing images as pixels. The format of the plot is the same as in Fig 2a. b), The same generator architecture (DeePSiM-fc6) was trained on ImageNet and Places-365, respectively, and tested on classifiers trained on either dataset. Each half of a violin corresponds to one generator, and dashed lines inside the violins indicate quartiles of the distribution; otherwise, the format of the plot is the same as in Fig 2a.

(TIF)

S5 Fig. Comparison of hyperparameters in the genetic algorithm.

In each plot, one hyperparameter was varied while the others were held constant at default values indicated by the open circles. Dots indicate the mean of relative activation across 40 target neurons, 10 neurons each in 4 layers specified in S4 Table. Blue and orange lines indicate noiseless and noisy target units, respectively. Light colored lines indicate the mean across the 10 units within each architecture and layer. Light gray shading indicates the linear portion of a symmetrical log plot, which is used in order to show zero values.

(TIF)

S6 Fig. Testing XDream on a toy model that mimics the extra-classical effect of surround suppression.

We took two feature channels (first column, rows 2 & 3) from the conv1 layer of AlexNet and tiled each spatially with positive and negative weights to create a central, circular excitatory region and a concentric suppressive ring, analogous to an excitatory classical receptive field (RF) and a suppressive extraclassical RF (first row). By maximizing responses of the constructed units, XDream created stimuli that are spatially confined and agreed with the varying RF sizes (rows 2 & 3). We also created a unit that preferred a horizontal pattern in the center and a vertical pattern in the surround; XDream was able to uncover this preference pattern as well (row 4).

(TIF)

S1 Table. Target networks and layers.

For each network, 4 layers from roughly the early, middle, and late stages of processing, together with the output layer before softmax, were selected as targets. PlacesCNN has the same architecture as CaffeNet but is trained on the Places-205 dataset [31]. CaffeNet is as implemented in https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet, PlacesCNN as in [31], and the remaining networks as in https://github.com/GeekLiB/caffe-model.

(PDF)

S2 Table. Optimized hyperparameter values for the genetic algorithm.

Hyperparameters used in the experiments in this paper, obtained as described in Methods separately for each generative network and for noiseless and noisy targets.

(PDF)

S3 Table. Optimized hyperparameter values for the FDGD and NES algorithms.

Hyperparameters used in the experiments in this paper, obtained as described in Methods separately for the noiseless and noisy case. The generative network was always DeePSiM-fc6.

(PDF)

S4 Table. Inferior temporal cortex-like layers.

From each layer, 10 units were randomly selected and used in hyperparameter evaluation.

(PDF)

S1 Text. Methods and additional experiments & discussion.

(PDF)

Data Availability

All data underlying the findings described in this manuscript have been made available in the publicly available GitHub repository at https://github.com/willwx/XDream.

Funding Statement

W.X. and G.K. are supported by the Center for Brains, Minds and Machines, funded by NSF STC award CCF-1231216, and by NIH R01EY026025. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. https://www.nsf.gov, https://www.nih.gov.

References

1. Kuffler S. Discharge patterns and functional organization of mammalian retina. Journal of Neurophysiology. 1953;16:37–68. doi:10.1152/jn.1953.16.1.37
2. Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology. 1962;160(1):106–154. doi:10.1113/jphysiol.1962.sp006837
3. Gallant JL, Braun J, Van Essen DC. Selectivity for polar, hyperbolic, and Cartesian gratings in macaque visual cortex. Science. 1993;259(5091):100–103. doi:10.1126/science.8418487
4. Pasupathy A, Connor CE. Population coding of shape in area V4. Nature Neuroscience. 2002;5(12):1332–1338. doi:10.1038/972
5. Logothetis NK, Sheinberg DL. Visual object recognition. Annual Review of Neuroscience. 1996;19:577–621. doi:10.1146/annurev.ne.19.030196.003045
6. Desimone R, Albright T, Gross C, Bruce C. Stimulus-selective properties of inferior temporal neurons in the macaque. Journal of Neuroscience. 1984;4(8):2051–2062. doi:10.1523/JNEUROSCI.04-08-02051.1984
7. Tsao DY, Freiwald WA, Tootell RBH, Livingstone MS. A Cortical Region Consisting Entirely of Face-Selective Cells. Science. 2006;311(5761):670–674. doi:10.1126/science.1119983
8. Dosovitskiy A, Brox T. Generating Images with Perceptual Similarity Metrics based on Deep Networks. In: Advances in Neural Information Processing Systems; 2016. p. 658–666.
9. Yamane Y, Carlson ET, Bowman KC, Wang Z, Connor CE. A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nature Neuroscience. 2008;11(11):1352–1360. doi:10.1038/nn.2202
10. Carlson ET, Rasquinha RJ, Zhang K, Connor CE. A Sparse Object Coding Scheme in Area V4. Current Biology. 2011;21(4):288–293. doi:10.1016/j.cub.2011.01.013
11. Vaziri S, Carlson ET, Wang Z, Connor CE. A Channel for 3D Environmental Shape in Anterior Inferotemporal Cortex. Neuron. 2014;84(1):55–62. doi:10.1016/j.neuron.2014.08.043
12. Nguyen A, Dosovitskiy A, Yosinski J, Brox T, Clune J. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In: Advances in Neural Information Processing Systems; 2016. p. 3387–3395.
13. Ponce CR, Xiao W, Schade P, Hartmann TS, Kreiman G, Livingstone MS. Evolving Images for Visual Neurons Using a Deep Generative Network Reveals Coding Principles and Neuronal Preferences. Cell. 2019;177:999–1009. doi:10.1016/j.cell.2019.04.005
14. Pospisil DA, Pasupathy A, Bair W. ‘Artiphysiology’ reveals V4-like shape tuning in a deep network trained for image classification. eLife. 2018;7:e38242. doi:10.7554/eLife.38242
15. Yamins DLK, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences. 2014;111(23):8619–8624. doi:10.1073/pnas.1403112111
16. Schrimpf M, Kubilius J, Hong H, Majaj NJ, Rajalingham R, Issa EB, et al. Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like? bioRxiv. 2018.
17. Cadena SA, Denfield GH, Walker EY, Gatys LA, Tolias AS, Bethge M, et al. Deep convolutional models improve predictions of macaque V1 responses to natural images. PLOS Computational Biology. 2019;15(4):1–27. doi:10.1371/journal.pcbi.1006897
18. Erhan D, Bengio Y, Courville A, Vincent P. Visualizing Higher-Layer Features of a Deep Network. University of Montreal. 2009;1341(3).
19. Zeiler MD, Fergus R. Visualizing and Understanding Convolutional Networks. In: European Conference on Computer Vision. Springer; 2014. p. 818–833.
20. Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, et al. Intriguing properties of neural networks. arXiv. 2013.
21. Olah C, Mordvintsev A, Schubert L. Feature Visualization. Distill. 2017.
22. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, et al. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv. 2014.
23. Krizhevsky A, Sutskever I, Hinton GE. ImageNet Classification with Deep Convolutional Neural Networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in Neural Information Processing Systems 25; 2012. p. 1097–1105.
24. Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A. Places: A 10 million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;40(6):1452–1464. doi:10.1109/TPAMI.2017.2723009
25. Wierstra D, Schaul T, Glasmachers T, Sun Y, Peters J, Schmidhuber J. Natural Evolution Strategies. Journal of Machine Learning Research. 2014;15:949–980.
26. Liu H, Agam Y, Madsen J, Kreiman G. Timing, timing, timing: Fast decoding of object information from intracranial field potentials in human visual cortex. Neuron. 2009;62:281–290. doi:10.1016/j.neuron.2009.02.025
27. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision. 2015;115(3):211–252. doi:10.1007/s11263-015-0816-y
28. He K, Zhang X, Ren S, Sun J. Identity Mappings in Deep Residual Networks. Lecture Notes in Computer Science. 2016. p. 630–645. doi:10.1007/978-3-319-46493-0_38
29. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the Inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 2818–2826.
30. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In: Thirty-First AAAI Conference on Artificial Intelligence; 2017. p. 4278–4284.
31. Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A. Learning Deep Features for Scene Recognition using Places Database. In: Advances in Neural Information Processing Systems; 2014. p. 487–495.
32. Bashivan P, Kar K, DiCarlo JJ. Neural population control via deep image synthesis. Science. 2019;364(6439). doi:10.1126/science.aav9436
33. Abbasi-Asl R, Chen Y, Bloniarz A, Oliver M, Willmore BDB, Gallant JL, et al. The DeepTune framework for modeling and characterizing neurons in visual cortex area V4. bioRxiv. 2018.
34. Ilyas A, Engstrom L, Athalye A, Lin J. Black-box Adversarial Attacks with Limited Queries and Information. In: Dy J, Krause A, editors. Proceedings of the 35th International Conference on Machine Learning. vol. 80 of Proceedings of Machine Learning Research; 2018. p. 2137–2146.
35. Koch C. Biophysics of Computation. New York: Oxford University Press; 1999.
36. Gabbiani F, Cox S. Mathematics for Neuroscientists. London: Academic Press; 2010.
37. Simonyan K, Vedaldi A, Zisserman A. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv. 2013.
38. Olah C, Satyanarayan A, Johnson I, Carter S, Schubert L, Ye K, et al. The Building Blocks of Interpretability. Distill. 2018.
39. Carter S, Armstrong Z, Schubert L, Johnson I, Olah C. Activation Atlas. Distill. 2019.
40. Walker EY, Sinz FH, Froudarakis E, Fahey PG, Muhammad T, Ecker AS, et al. Inception in visual cortex: in vivo-silico loops reveal most exciting images. bioRxiv. 2018.
41. Malakhova K. Visualization of information encoded by neurons in the higher-level areas of the visual system. Journal of Optical Technology. 2018;85(8):494–498. doi:10.1364/JOT.85.000494
42. Kobatake E, Tanaka K. Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. Journal of Neurophysiology. 1994;71:856–867. doi:10.1152/jn.1994.71.3.856
43. Chang L, Tsao DY. The Code for Facial Identity in the Primate Brain. Cell. 2017;169:1013–1028.e14. doi:10.1016/j.cell.2017.05.011
44. Geirhos R, Rubisch P, Michaelis C, Bethge M, Wichmann FA, Brendel W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv. 2018.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007973.r001

Decision Letter 0

Wolfgang Einhäuser, Alona Fyshe

25 Jan 2020

Dear Mr. Xiao,

Thank you very much for submitting your manuscript "XDream: finding preferred stimuli for visual neurons using generative networks and gradient-free optimization" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the constructive reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We apologize for the length of time these reviews took, but I can assure you we were working to obtain reviews for the entire time your paper was under review.  We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Alona Fyshe, Ph.D.

Associate Editor

PLOS Computational Biology

Wolfgang Einhäuser

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors produce an analysis of the XDream method that uses generative networks with a genetic algorithm for obtaining stimuli that will optimally drive neurons. They test this method by using it on pre-trained deep neural networks. They show that XDream can consistently find stimuli that will drive units in pre-trained networks to responses greater than those of the image set from which the networks were trained. This result is approximately independent of depth in the network and is independent of initial conditions. It is relatively robust to the choice of generative network, as long as that generator produces a "high-level" representation. These results are somewhat robust to the choice of optimization but are impacted by adding noise to the representation.

This looks at an important issue for visual neuroscience, that of finding stimuli that drive neurons in a manner that is not plagued by various experimental and theoretical biases (cf. Olshausen-Field "What is the other 85% of V1 doing?" and Carandini, et al 2005, "Do we know what the early visual system does?"). The XDream tool is an example of an emerging methodology for dealing with this problem and studying its capabilities and drawbacks is therefore important and useful to the field. The paper is clear. I recommend publication after the authors have addressed the feedback below.

1) The most important issue the authors should address is to provide some acknowledgement or discussion of the notion of optimal or "preferred" stimuli in the first place. By their methodology, they are looking for a single image that will optimally drive neurons, but this is probably not the right way to think about the coding problem, certainly not for neurons and probably not for the units in deep networks as well. The "single best image" approach is going to gloss over issues like contextual dependence. These optimal stimuli may in fact be a very small set that drives responses because they combine some "feature" for which the neuron/unit codes with exactly the right context to enhance the response. Depending on the size of this space and the nature of the context, one might wind up with an image that is not very informative of the actual coding properties of the neuron or unit. The authors use the phrase "true feature preference" in the introduction, but need to acknowledge that their method may not find it either. Furthermore, this problem is compounded in the case of real neurons, in which there are feedback and lateral inputs that produce extra-classical effects. These effects may or may not appear in the models that the authors are testing, since they are all feedforward networks. This set of questions will almost certainly lead to the "invariance manifold" that the authors discussed.

None of this takes away from the authors' work, but users of the method should be aware of these issues.

2) Related to point 1, the authors should comment on the possible ethological relevance of super-stimuli. It is not surprising that such stimuli exist. Given that each unit produces a 1D parameterized function of image space, one should be able to find a point in the input space that produces a more extreme result than any finite set of inputs. Do these have a useful meaning in terms of describing that function?

3) The authors showed that the generative model in XDream was "expressive", but those demonstrations also show what the model does not capture, namely high-frequency content of images (fine details are lost in all the examples in (what should be) Figure 3). Is this a limitation for the methodology? It seems so, since the model shouldn't then be able to capture optimal stimuli for neurons that respond to fine details. I suspect this is not a severe practical limitation, but it is there.

4) The authors should acknowledge the bias that is built into their methodology by looking at samples from fixed image databases. ImageNet does have a particular structure and, unless I'm mistaken, all the tested generators were trained on this structure (implicitly by inverting AlexNet). This will bias towards finding features similar to those necessary for describing ImageNet in particular. This is not just an issue of whether the generative model can reproduce an image when forced to (e.g. the Figure that should be Figure 3) but whether it will tend to promote certain features over others. ImageNet will provide an implicit bias.

5) The claim about robustness to noise is overstated. Figure 6 makes the technique look quite susceptible to Poisson noise, depending upon target layer.

Minor issues:

1) Figure 1b and 1c: there are 9 minor tick marks and ten example images.

2) Figure 2 and 3 appear to be swapped (at least in the pdf I received).

3) Figure 2: label the random ImageNet samples in the Figure (grey boxes). label the dotted line as "maximum ImageNet response" or something similar. This will make the figure easier to read and digest.

4) line 112: ...we qualitative[ly] assessed...

5) lines 136-7: delete either "it can generate" or "can be generated"

6) Figure 4: label the open and solid violin plots on the Figure

7) Table 1: the caption appears to be a Latinate placeholder.

8) lines 211-9: this section needs to be rewritten. There needs to be a reference to Table 1, otherwise "slope" is introduced before the reader has seen such a thing. It isn't in a figure or preceding discussion. One is left to infer what variables the linear regression is being performed upon.

Reviewer #2: I will upload the pdf with my comments in it, where I have found typos and made wording suggestions.

Review of XDream: finding preferred stimuli for visual neurons using generative networks and gradient-free optimization (Will Xiao and Gabriel Kreiman)

by Gary Cottrell

This paper uses an existing method for finding optimal visual stimuli for neurons that has previously been applied in Macaque in a paper in Cell in 2019. The goal here is to elucidate how robust the method is by applying it to several deep network models with varying architectures and at different layers of the networks. Using network models allows for extensive experimentation that is impractical in biological preparations. The model proves to be very robust to various parameter regimes, and finds stimuli that drive the neuron much more than any of the over 1 million images in the ImageNet dataset. One of the most interesting findings here is that, using different image generators or different initial conditions, the model finds multiple images that drive the neurons similarly, and these images resemble one another to the human eye. Hence they suggest that there is an optimal image manifold in the latent space of images. Unfortunately, this point was already made in the prior paper with actual monkey visual neurons.

Since the authors postulate that there is an invariance manifold, it would be useful to test this idea by looking at what is generated by a linear interpolation of the codes found from different initial conditions, and at how well those interpolated images drive the model neuron. While the manifold is unlikely to be linear, the images are similar, so their codes are probably nearby in this space.
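
A minimal sketch of this test, assuming hypothetical stand-ins `generator` (latent code to image) and `unit_activation` (image to scalar) for the XDream generator and target unit; the interpolation is in latent space, not pixel space:

    import numpy as np

    def activations_along_path(code_a, code_b, generator, unit_activation,
                               n_steps=11):
        """Score images generated along the line between two latent codes.

        If the two optima lie on a connected invariance manifold, the
        activations along the path should stay near the endpoint values.
        """
        alphas = np.linspace(0.0, 1.0, n_steps)
        codes = [(1 - a) * code_a + a * code_b for a in alphas]
        return [unit_activation(generator(c)) for c in codes]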

In the paragraph on other things one could study with this approach, such as correlated firing, synchronized firing, LFPs, etc., it would be helpful to say what you would optimize in a couple of cases. E.g., you might say, “for example, we could optimize based on increased correlations in firing rates between neurons” or some such.
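
One possible concrete reading, as a minimal sketch: with stochastic units, the trial-by-trial (noise) correlation of two units’ responses to repeated presentations of the same image is a per-image scalar that XDream could maximize. `unit_a` and `unit_b` are hypothetical stand-ins for two recorded units:

    import numpy as np

    def noise_correlation_score(unit_a, unit_b, image, n_repeats=20):
        """Trial-by-trial correlation of two stochastic units' responses
        to repeated presentations of the same image (higher is better)."""
        a = np.array([unit_a(image) for _ in range(n_repeats)])
        b = np.array([unit_b(image) for _ in range(n_repeats)])
        return float(np.corrcoef(a, b)[0, 1])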

The paper could be improved by moving more of the information into the methods section or into supplementary material. There are many details that make the exposition rather tough sledding for the reader. In particular, the section on the effects of different generators has a couple of caveats, e.g., “except for CaffeNet conv2”, and “The pixel-based image generator, compared to generative neural networks, worked more poorly in all target layers other than CaffeNet conv2 except when compared to deepsim-norm2 (p = 1 compared to deepsim-norm2; p > 0.14 compared to other generators in CaffeNet conv2; p < 10^−4 in all other comparisons; FDR-corrected for 32 tests comparing each generator to raw-pixel in each target layer).” This amount of detail makes my eyes glaze over. I think this section doesn’t add a lot to the paper, and could profitably be moved to the supplementary material.

Similarly, the section on different optimizers doesn’t flow well. The GA works better on some layers and the other two algorithms work better in some other layers. I’m not sure I care, and I’m not sure what the take-home message is. Again, there are similar “this works better in this layer and that works better in other layers” results in the noise experiments. I think the noise experiments are important, as this is a more realistic case. Unfortunately, the results in Figure 6 are not encouraging. I would disagree with the header for this section: “XDream is robust to noise in neuronal responses.” It doesn’t seem that robust. I wonder if there is a fix for this - for example, can you average the rate over some time interval of the Poisson process instead of simply sampling from it? Unfortunately, I don’t know enough about Poisson processes to know if this is a good suggestion or not.
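
For what it is worth, the averaging suggestion is sound under a Poisson spike-count model: averaging n repeated presentations of the same image shrinks the relative noise by a factor of sqrt(n), at the cost of n-fold more presentations per candidate. A minimal sketch, with the repeat count as an illustrative choice of ours:

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_score(rate, n_repeats=1):
        """Mean of n Poisson spike counts whose expectation is `rate`."""
        return rng.poisson(rate, size=n_repeats).mean()

    rate = 5.0  # low rates are the hard, noise-dominated case
    single = np.array([noisy_score(rate) for _ in range(10_000)])
    avg4 = np.array([noisy_score(rate, n_repeats=4) for _ in range(10_000)])
    print(single.std() / rate)  # about 1/sqrt(5), i.e. ~0.45
    print(avg4.std() / rate)    # about 1/sqrt(20), i.e. ~0.22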

One concern: the authors state that “we focus on the more biologically relevant scenario where there is no information about the architecture and weights of the target model, and where we only have access to a few, potentially stochastic, activation values from the neurons.” In fact, they don’t have access to only a few activations: the model is used to generate 10,000 images to find the optimal one. This seems biologically unrealistic. I skimmed the previous paper, and there they used many fewer generations for the monkeys - 200. Since this is what is apparently possible in biological preparations, it seems like they should evaluate how much is gained in 200 generations and compare that to the 500 generations they used here. This would provide a better estimate of what is possible today.
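
A sketch of the suggested comparison, assuming a hypothetical `run_one_generation` callback that performs one XDream generation and returns its best activation; the idea is simply to read the running maximum off at a monkey-scale budget as well as the full in silico budget:

    def best_activation_at(run_one_generation, checkpoints=(200, 500)):
        """Run to the largest checkpoint, recording the running best at each."""
        best, results = float("-inf"), {}
        for gen in range(1, max(checkpoints) + 1):
            best = max(best, run_one_generation())
            if gen in checkpoints:
                results[gen] = best
        return results  # e.g. {200: ..., 500: ...}; the gap is the extra gain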

Originality

The originality is tempered somewhat by the fact that this is investigating an existing method that has already been used in Macaque visual cortex. Obviously, this is different in the sense that using convnets as the preparation allows for much more extensive experimentation with the method than would be possible with a biological one, which is the point of this paper. For example, they can test the neuron’s response to the over one million images in ImageNet, and then compare the results of this brute force search to the effectiveness of XDream. They vary just about everything and still find that the method works well. So it is original in that sense, but I have to think that the big-font headline is the first paper.

Innovation

Again, the innovation here is to benchmark their method using in silico models. This kind of thing has been done before with other analyses of deep net features.

High importance to researchers in the field


To the extent that other researchers may start to use this method, the importance here is that neuroscientists can get some assurance that the method is robust, and that they needn’t worry too much about optimizing the meta-parameters.

“Similar activation values notwithstanding, the optimized images were different on a pixel level (Fig 4b); they may comprise an ‘invariance manifold’ for each neuron that contains similar but not identical images eliciting comparable activation values (see Discussion).” I think this is an important point, so it would be good to test the hypothesis by looking along a line between some of the image codes to see if there are optimal images between these. This would be of interest to researchers using this method.

Significant biological and/or methodological insight


There is not much in the way of biological insight here, but the paper does demonstrate that this methodology is robust.

Rigorous methodology

This is quite a rigorous test of the model.

Substantial evidence for its conclusions

The evidence that this approach works well over a variety of convnet architectures and layers is extensive. On the other hand, I think the evidence for robustness to noise is weak. This is a point the authors should address in a revision of the paper.

Minor comments:

I’m not quite sure what this sentence means: “the standard deviation was lower than 10% of the activation values (Fig 4a).”

I’m not sure the slopes in Table 1 give a very intuitive idea of how much improvement you get by using good vs. bad initializations. A bar chart might be better. Also, while the text suggests that there isn’t much difference, the slopes seem fairly big as you go deeper in the net using the ivt method. I don’t know if that conclusion of mine is warranted. I.e., I would better understand this if the actual differences in medians were shown instead of the slope. Perhaps another violin plot would work here? It seems slightly counterintuitive to conclude that there is only slight variation in the optimized image activation while the p-values for the differences are on the order of 10^-77.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Garrison W Cottrell

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

Attachment

Submitted filename: PCOMPBIOL-D-19-01642_reviewer.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007973.r003

Decision Letter 1

Wolfgang Einhäuser, Alona Fyshe

12 May 2020

Dear Mr. Xiao,

Thank you very much for submitting your manuscript "XDream: finding preferred stimuli for visual neurons using generative networks and gradient-free optimization" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

It looks like we are very close to consensus on this paper. Reviewer two has a few last comments that need to be addressed.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Alona Fyshe, Ph.D.

Associate Editor

PLOS Computational Biology

Wolfgang Einhäuser

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed my concerns. I am happy to support publication.

Reviewer #2: Review of XDream, revision 1.

by Gary Cottrell

The authors have adequately addressed most of my comments, so this version is improved over the previous version, but there are still a few things that need to be addressed, and unfortunately, on the second reading, I have a couple of new questions for you.

I will also upload my marked-up version of the pdf for wording and typos I found.

I should say I only skimmed the supplementary material, but I did have one comment on the discussion: In the second paragraph, you refer to “the direct method”, which is not mentioned in the previous paragraph. You need to reintroduce this idea here, as a neuro person will not have any idea what you are talking about without going back to the main article.

In the paper, you linearly interpolate between the optimized images. In the response, you mention that you don’t do this in pixel space, but in the latent space. You should mention that here.

BTW, David Sheinberg showed some data at CNS in 2003 that may be relevant here. He told me it didn’t make it into a paper, but he found a cell in macaque that responded to both car images and butterfly images, with the strongest response to a particular car and a particular butterfly - very disjoint stimuli! It would be cool to stick that in here somewhere; I’ll upload the data with my review. I should mention that these images were very well known to the monkey.

There’s a typo on page 8/14, line 262, where you refer to Figure 3c, but I think you mean Figure 3d. On that same page, you again mention Table 1, without explanation. The explanation does appear in a caption under the table (you should also mention how you calculated the slope - I assume linear regression). Tables don’t have captions, so you will still need to move this into the main text, presumably before you mention the table.

Line 272 page 8:

With the “ivt” method, these initializations worked similarly in layers conv2 and conv4, not in layers fc6 (p=0.0014) and fc8 (p=2x10^-15).

This doesn’t say how they worked in layers fc6 and fc8 (again, these details are boring and irrelevant, since, as you say, PIs are not likely to have optimal images to start with. I still recommend leaving them to the supplementary material). But I’m not even sure what you are saying here. What are you referring to? Is it that opt is better than ivt in these layers? When I looked at the figure, I thought you were referring to how the best initialization gave bigger improvements here over medium and worst initializations for these layers. Please clarify this.

Likewise, the next sentence says that random initializations worked better, but you don’t say better than what, and it sure doesn’t look like they are better than the other results in Figure 3d, at least for medium and bad initializations.

Again, this paragraph doesn’t flow well. The message you want to get across is: In a realistic situation, where the PI doesn’t have access to optimal starting images, random initialization works about as well as anything else. Start by saying that starting with the optimal image does help in some cases, but given that investigators are not likely to have access to these, using random initialization is sufficient. Details in the supplementary material.

Figure 3d has some issues. In the caption, you give the order as best, middle, worst, but you’ve reversed the order from the previous manuscript - it’s now worst, middle, best. Also, the y-axis caption is incorrect - it should be “Relative Activation”, not “Target CaffeNet layer.”

Figure 4B has four different y-axis labels (!). This is not explained in the figure caption (or the text), so you may as well leave the extra three out, as they are confusing and a distraction from the main point.

Page 9: optimization methods: There are some significant differences between the algorithms. Can you comment on why you think that is?

Page 9: noise: There is a big difference in the effect of noise on the algorithm in the hidden layers versus the output layer. Any idea why? This seems odd, given that you’ve optimized your metaparameters on the output neurons, if I understood the methods correctly.

I trust that the editor can enforce these changes/clarifications; I don’t need to review this again.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Michael Buice

Reviewer #2: Yes: Garrison W Cottrell

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods

Attachment

Submitted filename: PCOMPBIOL-D-19-01642_R1_reviewer_gwc_comments.pdf

Attachment

Submitted filename: simon_selectivity.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007973.r005

Decision Letter 2

Wolfgang Einhäuser, Alona Fyshe

21 May 2020

Dear Mr. Xiao,

We are pleased to inform you that your manuscript 'XDream: finding preferred stimuli for visual neurons using generative networks and gradient-free optimization' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Alona Fyshe, Ph.D.

Associate Editor

PLOS Computational Biology

Wolfgang Einhäuser

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007973.r006

Acceptance letter

Wolfgang Einhäuser, Alona Fyshe

9 Jun 2020

PCOMPBIOL-D-19-01642R2

XDream: finding preferred stimuli for visual neurons using generative networks and gradient-free optimization

Dear Dr Xiao,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Laura Mallard

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Expected maximum relative activation in response to random natural images.

    We measured the max relative activation expected under two random sampling schemes. “Random” refers to picking a given number of images randomly from the ImageNet dataset (blue). “10 categories” refers to first randomly picking 10 categories out of the 1000 ImageNet categories and then picking randomly from those categories (gray). We considered 4 layers from the CaffeNet architecture. Lines indicate the median relative activation (activation divided by the highest activation for all ImageNet images). Shading indicates the 25th to 75th percentiles among 100 random units per layer.

    (TIF)
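
    A minimal sketch of the “Random” scheme, assuming a hypothetical precomputed array `acts` holding one unit’s activations to every ImageNet image:

        import numpy as np

        def median_max_relative(acts, n_images, n_draws=100, seed=0):
            """Median, across random draws, of the best relative activation
            among `n_images` images sampled without replacement."""
            rng = np.random.default_rng(seed)
            rel = np.array(acts, dtype=float)
            rel /= rel.max()
            maxima = [rel[rng.choice(rel.size, n_images, replace=False)].max()
                      for _ in range(n_draws)]
            return float(np.median(maxima))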

    S2 Fig. Optimized and best ImageNet images for other example neurons across architectures and layers.

    Two neurons were randomly selected per layer per architecture (S1 Table). Format is the same as in Fig 2.

    (TIF)

    S3 Fig. The image generator can approximate arbitrary images, and XDream can find these images using only scalar distance as a loss function.

    This figure reproduces Supplementary Figure 1 in [13]. The generative network is challenged to synthesize arbitrary target images (row 1) using one of two encoding methods, “opt” (row 2) and “ivt” (row 3; Methods). In addition, XDream can discover the target image efficiently (within 10,000 test image presentations) by using the genetic algorithm to minimize the mean squared difference between the target image and any test image as a loss function, either in pixel space (row 4) or in CaffeNet pool5 representation space (row 5).

    (TIF)
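
    Rows 4 and 5 amount to swapping the unit’s activation for a scalar similarity score. A minimal sketch of the pixel-space variant, with `xdream_optimize` as a hypothetical stand-in for the optimization loop, which only ever sees the scalar:

        import numpy as np

        def make_mse_objective(target_image):
            """Return a score function in which higher is better (negated MSE)."""
            target = np.asarray(target_image, dtype=float)
            def score(candidate_image):
                diff = np.asarray(candidate_image, dtype=float) - target
                return -np.mean(diff ** 2)
            return score

        # Hypothetical driver: best = xdream_optimize(make_mse_objective(img))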

    S4 Fig. Comparison of image generators.

    a) We tested each member of the family of image generators from [8] as the image generator in XDream, together with a generator directly representing images as pixels. Format of the plot is the same as in Fig 2a. b) The same generator architecture (DeePSiM-fc6) was trained on ImageNet and Places365, respectively, and tested on classifiers trained on either dataset. Each half of a violin corresponds to one generator, and dashed lines inside the violins indicate quartiles of the distribution; otherwise, the format of the plot is the same as in Fig 2a.

    (TIF)

    S5 Fig. Comparison of hyperparameters in the genetic algorithm.

    In each plot, one hyperparameter was varied while the others were held constant at default values indicated by the open circles. Dots indicate the mean of relative activation across 40 target neurons, 10 neurons each in 4 layers specified in S4 Table. Blue and orange lines indicate noiseless and noisy target units, respectively. Light colored lines indicate the mean across the 10 units within each architecture and layer. Light gray shading indicates the linear portion of a symmetrical log plot, which is used in order to show zero values.

    (TIF)
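
    A minimal sketch of this one-at-a-time sweep, assuming a hypothetical `evaluate(params)` that runs XDream with the given hyperparameters and returns the mean relative activation across target units:

        def one_at_a_time_sweep(evaluate, defaults, grid):
            """Vary one hyperparameter at a time, holding the others at their
            default values; return the score for each (name, value) setting."""
            results = {}
            for name, values in grid.items():
                for value in values:
                    params = dict(defaults, **{name: value})
                    results[(name, value)] = evaluate(params)
            return results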

    S6 Fig. Testing XDream on a toy model that mimics the extra-classical effect of surround suppression.

    We took two feature channels (first column, rows 2 & 3) from the conv1 layer of AlexNet and tiled each spatially with positive and negative weights to create a central, circular excitatory region and a concentric suppressive ring, analogous to an excitatory classical receptive field (RF) and a suppressive extraclassical RF (first row). By maximizing responses of the constructed units, XDream created stimuli that were spatially confined and agreed with the varying RF sizes (rows 2 & 3). We also created a unit that preferred a horizontal pattern in the center and a vertical pattern in the surround; XDream was able to uncover this preference pattern as well (row 4).

    (TIF)
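
    A minimal sketch of constructing such a toy unit, assuming a hypothetical `feature_map(image)` that returns one conv1 channel’s (H, W) response map; the radii and surround weight are illustrative choices:

        import numpy as np

        def center_surround_weights(h, w, r_center, r_surround, w_surround=-1.0):
            """+1 weights inside a central disk, negative weights in a
            concentric ring out to r_surround, zero elsewhere."""
            yy, xx = np.mgrid[:h, :w]
            dist = np.hypot(yy - (h - 1) / 2, xx - (w - 1) / 2)
            weights = np.zeros((h, w))
            weights[dist <= r_center] = 1.0
            weights[(dist > r_center) & (dist <= r_surround)] = w_surround
            return weights

        def toy_unit(image, feature_map, r_center=4.0, r_surround=8.0):
            fmap = feature_map(image)  # (H, W) response map of one channel
            w = center_surround_weights(*fmap.shape, r_center, r_surround)
            return float((w * fmap).sum())  # the scalar XDream would maximize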

    S1 Table. Target networks and layers.

    For each network, 4 layers from roughly the early, middle, and late stages of processing, together with the output layer before softmax, were selected as targets. PlacesCNN has the same architecture as CaffeNet but is trained on the Places-205 dataset [31]. CaffeNet is as implemented in https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet, PlacesCNN as in [31], and the remaining networks as in https://github.com/GeekLiB/caffe-model.

    (PDF)

    S2 Table. Optimized hyperparameter values for the genetic algorithm.

    Hyperparameters used in the experiments in this paper, obtained as described in Methods separately for each generative network and for noiseless and noisy targets.

    (PDF)

    S3 Table. Optimized hyperparameter values for the FDGD and NES algorithms.

    Hyperparameters used in the experiments in this paper, obtained as described in Methods separately for the noiseless and noisy case. The generative network was always deepsim-fc6.

    (PDF)

    S4 Table. Inferior temporal cortex-like layers.

    From each layer, 10 units were randomly selected and used in hyperparameter evaluation.

    (PDF)

    S1 Text. Methods and additional experiments & discussion.

    (PDF)

    Attachment

    Submitted filename: PCOMPBIOL-D-19-01642_reviewer.pdf

    Attachment

    Submitted filename: response2.pdf

    Attachment

    Submitted filename: PCOMPBIOL-D-19-01642_R1_reviewer_gwc_comments.pdf

    Attachment

    Submitted filename: simon_selectivity.pdf

    Attachment

    Submitted filename: response.pdf

    Data Availability Statement

    All data underlying the findings described in the manuscript have been made available in the publicly available GitHub repository at https://github.com/willwx/XDream.

