Abstract
While many models of biological object recognition share a common set of “broad-stroke” properties, the performance of any one model depends strongly on the choice of parameters in a particular instantiation of that model—e.g., the number of units per layer, the size of pooling kernels, exponents in normalization operations, etc. Since the number of such parameters (explicit or implicit) is typically large and the computational cost of evaluating one particular parameter set is high, the space of possible model instantiations goes largely unexplored. Thus, when a model fails to approach the abilities of biological visual systems, we are left uncertain whether this failure is because we are missing a fundamental idea or because the correct “parts” have not been tuned correctly, assembled at sufficient scale, or provided with enough training. Here, we present a high-throughput approach to the exploration of such parameter sets, leveraging recent advances in stream processing hardware (high-end NVIDIA graphics cards and the PlayStation 3's IBM Cell Processor). In analogy to high-throughput screening approaches in molecular biology and genetics, we explored thousands of potential network architectures and parameter instantiations, screening those that showed promising object recognition performance for further analysis. We show that this approach can yield significant, reproducible gains in performance across an array of basic object recognition tasks, consistently outperforming a variety of state-of-the-art purpose-built vision systems from the literature. As the scale of available computational power continues to expand, we argue that this approach has the potential to greatly accelerate progress in both artificial vision and our understanding of the computational underpinnings of biological vision.
Author Summary
One of the primary obstacles to understanding the computational underpinnings of biological vision is its sheer scale—the visual system is a massively parallel computer, composed of billions of elements. While this scale has historically been beyond the reach of even the fastest super-computing systems, recent advances in commodity graphics processors (such as those found in the PlayStation 3 and high-end NVIDIA graphics cards) have made unprecedented computational resources broadly available. Here, we describe a high-throughput approach that harnesses the power of modern graphics hardware to search a vast space of large-scale, biologically inspired candidate models of the visual system. The best of these models, drawn from thousands of candidates, outperformed a variety of state-of-the-art vision systems across a range of object and face recognition tasks. We argue that these experiments point to a new way forward, both in the creation of machine vision systems and in providing insights into the computational underpinnings of biological vision.
Introduction
The study of biological vision and the creation of artificial vision systems are naturally intertwined—exploration of the neuronal substrates of visual processing provides clues and inspiration for artificial systems, and artificial systems, in turn, serve as important generators of new ideas and working hypotheses. The results of this synergy have been powerful: in addition to providing important theoretical frameworks for empirical investigations (e.g. [1]–[6]), biologically-inspired models are routinely among the highest-performing artificial vision systems in practical tests of object and face recognition [7]–[12].
However, while neuroscience has provided inspiration for some of the “broad-stroke” properties of the visual system, much is still unknown. Even for those qualitative properties that most biologically-inspired models share, experimental data currently provide little constraint on their key parameters. As a result, even the most faithfully biomimetic vision models necessarily represent just one of many possible realizations of a collection of computational ideas.
Truly evaluating the set of biologically-inspired computational ideas is difficult, since the performance of a model depends strongly on its particular instantiation–the size of the pooling kernels, the number of units per layer, exponents in normalization operations, etc. Because the number of such parameters (explicit or implicit) is typically large, and the computational cost of evaluating one particular model is high, it is difficult to adequately explore the space of possible model instantiations. At the same time, there is no guarantee that even the “correct” set of principles will work when instantiated on a small scale (in terms of dimensionality, amount of training, etc.). Thus, when a model fails to approach the abilities of biological visual systems, we cannot tell whether this is because the ideas are wrong, or because they are simply not put together correctly or on a large enough scale.
As a result of these factors, the availability of computational resources plays a critical role in shaping what kinds of computational investigations are possible. Traditionally, this bound has grown according to Moore's Law [13]; recently, however, advances in highly-parallel graphics processing hardware (such as high-end NVIDIA graphics cards, and the PlayStation 3's IBM Cell processor) have disrupted this status quo for some classes of computational problems. In particular, this new class of modern graphics processing hardware has enabled more than hundred-fold speed-ups in some of the key computations that most biologically-inspired visual models share in common. As is already occurring in other scientific fields [14],[15], the large quantitative performance improvements offered by this new class of hardware hold the potential to effect qualitative changes in how science is done.
In the present work, we take advantage of these recent advances in graphics processing hardware [16],[17] to more expansively explore the range of biologically-inspired models–including models of larger, more realistic scale. In analogy to high-throughput screening approaches in molecular biology and genetics, we generated and trained thousands of potential network architectures and parameter instantiations, and we “screened” the visual representations produced by these models using tasks that engage the core problem of object recognition–tolerance to image variation [10]–[12],[18],[19]. From these candidate models, the most promising were selected for further analysis.
We show that this large-scale screening approach can yield significant, reproducible gains in performance in a variety of basic object recognition tasks and that it holds the promise of offering insight into which computational ideas are most important for achieving this performance. Critically, such insights can then be fed back into the design of candidate models (constraining the search space and suggesting additional model features), further guiding evolutionary progress. As the scale of available computational power continues to expand, high-throughput exploration of ideas in computational vision holds great potential both for accelerating progress in artificial vision, and for generating new, experimentally-testable hypotheses for the study of biological vision.
Methods
A Family of Candidate Models
In order to generate a large number of candidate model instantiations, it is necessary to parameterize the family of all possible models that will be considered. A schematic of the overall architecture of this model family, and some of its parameters, is shown in Figure 2. The parameterization of this family of models was designed to be as inclusive as possible–that is, the set of model operations and parameters was chosen so that the family of possible models would encompass (as special cases) many of the biologically-inspired models already described in the extant literature (e.g. [1]–[4],[7],[9]). For instance, the full model includes an optional “trace” term, which allows learning behavior akin to that described in previous work (e.g. [4], [20]–[22]). While some of the variation within this family of possible models might best be described as variation in parameter tuning within a fixed model architecture, many parameters produce significant architectural changes in the model (e.g. number of filters in each layer). The primary purpose of this report is to present an overarching approach to high-throughput screening. While precise choices of parameters and parameter ranges are clearly important, one could change which parameters were explored, and over what ranges, without disrupting the integrity of the overarching approach. An exhaustive description of specific model parameters used here is included in the Supplemental Text S1, and is briefly described next.
Model parameters were organized into four basic groups. The first group of parameters controlled structural properties of the system, such as the number of filters in each layer and their sizes. The second group of parameters controlled the properties of nonlinearities within each layer, such as divisive normalization coefficients and activation functions. The third group of parameters controlled how the models learned filter weights in response to video inputs during an Unsupervised Learning Phase (this class includes parameters such as learning rate, trace factors, etc.; see Phase 2: Unsupervised Learning below). A final set of parameters controlled details of how the resulting representation vectors were classified during screening and validation (e.g. parameters of dimensionality reduction, classification parameters, etc.). For the purposes of the work presented here, this class of classification-related parameters was held constant for all analyses below. Briefly, the output values of the final model layer corresponding to each test example image were “unrolled” into a vector, their dimensionality was reduced using Principal Component Analysis (PCA) keeping as many dimensions as there were data points in the training set, and labeled examples were used to train a linear Support Vector Machine (SVM).
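To make this fixed classification stage concrete, the following is a minimal sketch of the pipeline, using scikit-learn purely for illustration (the choice of library, function names, and default settings here are assumptions, not the implementation used in this study):

```python
# Minimal sketch of the classification stage: unroll model outputs, reduce
# dimensionality with PCA (as many components as training examples), and
# train/test a linear SVM. scikit-learn is used here only for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def train_and_test(train_features, train_labels, test_features, test_labels):
    # Unroll each model output (e.g. a stack of final-layer unit responses) into a vector.
    X_train = np.array([f.ravel() for f in train_features])
    X_test = np.array([f.ravel() for f in test_features])

    # Keep as many principal components as there are training examples
    # (bounded above by the feature dimensionality).
    n_components = min(X_train.shape[0], X_train.shape[1])
    pca = PCA(n_components=n_components)
    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)

    # Train a linear SVM on the labeled, re-represented training images.
    clf = LinearSVC()
    clf.fit(X_train, train_labels)
    return clf.score(X_test, test_labels)  # fraction correct on held-out images
```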
Each model consisted of three layers, with each layer consisting of a “stack” of between 16 and 256 linear filters that were applied at each position to a region of the layer below. At each stage, the output of each unit was normalized by the activity of its neighbors within a parametrically-defined radius. Unit outputs were also subject to parameterized threshold and saturation functions, and the output of a given layer could be spatially resampled before being given to the next layer as input. Filter kernels within each layer's stack were initialized to random starting values, and learned their weights during the Unsupervised Learning Phase (see below, see Supplemental Text S1). Briefly, during this phase, under parametric control, a “winning” filter or filters were selected for each input patch, and the kernels of these filters were adapted to more closely resemble that patch, achieving a form of online non-parametric density estimation. Building upon recent findings from visual neuroscience [18],[23],[24], unsupervised learning could also be biased by temporal factors, such that filters that “won” in previous frames were biased to win again (see Supplemental Text S1 for details).
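The sketch below illustrates, in simplified form, the kind of per-layer computation described above (filtering by a stack of kernels, threshold/saturation, divisive normalization by neighboring activity, and spatial resampling). The specific parameter values and the exact normalization formula shown here are illustrative assumptions; the actual operations and parameter ranges are given in Supplemental Text S1.

```python
# Schematic sketch of one layer's feedforward computation.
import numpy as np
from scipy.signal import correlate2d

def layer_forward(inputs, filters, threshold=0.0, saturation=1.0,
                  norm_radius=3, norm_exponent=1.0, resample_stride=2):
    """inputs: 2D array; filters: stack of 2D kernels (one per filter in the layer)."""
    # 1) Apply each linear filter in the stack at every position.
    responses = np.stack([correlate2d(inputs, k, mode='valid') for k in filters])

    # 2) Parameterized threshold and saturation nonlinearity.
    responses = np.clip(responses, threshold, saturation)

    # 3) Divisive normalization by the activity of neighboring units
    #    within a given radius (here pooled across the filter stack).
    size = 2 * norm_radius + 1
    box = np.ones((size, size)) / size**2
    local_energy = np.stack([correlate2d(r**2, box, mode='same') for r in responses])
    responses = responses / (local_energy.sum(axis=0) ** norm_exponent + 1e-6)

    # 4) Spatially resample the output before passing it to the next layer.
    return responses[:, ::resample_stride, ::resample_stride]
```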
It should be noted that while the parameter set describing the model family is large, it is not without constraints. While our model family includes a wide variety of feed-forward architectures with local intrinsic processing (normalization), we have not yet included long-range feedback mechanisms (e.g. layer to layer). While such mechanisms may very well turn out to be critically important for achieving the performance of natural visual systems, the intent of the current work is to present a framework to approach the problem. Other parameters and mechanisms could be added to this framework, without loss of generality. Indeed, the addition of new mechanisms and refinement of existing ones is a major area for future research (see Discussion).
Parallel Computing Using Commodity Graphics Hardware
While details of the implementation of our model class are not essential to the theoretical implications of our approach, attention must nonetheless be paid to speed in order to ensure practical tractability, since the models used here are large (i.e. they have many units), and because the space of possible models is enormous. Fortunately, the computations underlying our particular family of candidate models are intrinsically parallel at a number of levels. In addition to coarse-grain parallelism at the level of individual model instantiations (e.g. multiple models can be evaluated at the same time) and video frames (e.g. feedforward processing can be done in parallel on multiple frames at once), there is a high degree of fine-grained parallelism in the processing of each individual frame. For instance, when a filter kernel is applied to an image, the same filter is applied to many regions of the image, and many filters are applied to each region; these operations are largely independent. The large number of arithmetic operations per region of image also results in high arithmetic intensity (the number of arithmetic operations per memory fetch), which is desirable for high-performance computing, since memory accesses are typically several orders of magnitude less efficient than arithmetic operations (when arithmetic intensity is high, caching of fetched results leads to better utilization of a processor's compute resources). These considerations are especially important for making use of modern graphics hardware (such as the Cell processor and GPUs), where many processors are available. Highly-optimized implementations of core operations (e.g. linear filtering, local normalization) were created for both the IBM Cell Processor (PlayStation 3), and for NVIDIA graphics processing units (GPUs) using the Tesla Architecture and the CUDA programming model [25]. These implementations achieve highly significant speed-ups relative to conventional CPU-based implementations (see Figure 1 and Supplemental Figure S1). High-level “outer loop” coordination of these highly optimized operations was accomplished using the Python programming language (e.g. using PyCUDA [26]), allowing for a favorable balance between ease of programming and raw speed (see Supplemental Text S2). In principle, all of the analyses presented here could have been performed using traditional computational hardware; however, the cost (in terms of time and/or money) of doing so with current CPU hardware is prohibitive.
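As an illustration of the fine-grained parallelism described above, the following is a minimal PyCUDA sketch in which each GPU thread computes one output position of a single-filter correlation. It assumes a CUDA-capable GPU and the pycuda package, and is not the optimized implementation used in this study:

```python
# One GPU thread per output pixel: illustrates per-position parallelism and
# the high arithmetic intensity (kh*kw multiply-adds per output) of filtering.
import numpy as np
import pycuda.autoinit            # initializes a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

kernel_src = """
__global__ void filter2d(const float *img, const float *kern, float *out,
                         int h, int w, int kh, int kw)
{
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int oh = h - kh + 1, ow = w - kw + 1;
    if (y >= oh || x >= ow) return;
    float acc = 0.0f;
    for (int i = 0; i < kh; ++i)
        for (int j = 0; j < kw; ++j)
            acc += img[(y + i) * w + (x + j)] * kern[i * kw + j];
    out[y * ow + x] = acc;
}
"""
filter2d = SourceModule(kernel_src).get_function("filter2d")

h, w, kh, kw = 200, 200, 9, 9
img  = np.random.rand(h, w).astype(np.float32)
kern = np.random.rand(kh, kw).astype(np.float32)
out  = np.empty((h - kh + 1, w - kw + 1), dtype=np.float32)

block = (16, 16, 1)
grid  = ((out.shape[1] + 15) // 16, (out.shape[0] + 15) // 16)
filter2d(drv.In(img), drv.In(kern), drv.Out(out),
         np.int32(h), np.int32(w), np.int32(kh), np.int32(kw),
         block=block, grid=grid)
```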
Figure 1 shows the relative speedup and performance/cost of each implementation (IBM Cell on Sony's PlayStation 3 and several NVIDIA GPUs) relative to traditional MATLAB and multi-threaded C code for the linear filtering operation (more details, such as the raw floating point performance, can be found in the Supplemental Figure S1). This operation is not only a key component of the candidate model family (see below), but is also the most computationally demanding, accounting for up to 94% of the total processing time (for the PlayStation 3 implementation), depending on model parameters (28% on average). The use of commodity graphics hardware affords orders-of-magnitude increases in performance. In particular, it should be noted that the data presented in this work took approximately one week to generate using our PlayStation 3-based implementation (222x speedup with one system) on a cluster of 23 machines. We estimate that producing the same results at the same cost using a conventional MATLAB implementation would have taken more than two years (see Figure S1).
Screening for Good Forms of Representation
Our approach is to sample a large number of model instantiations, using a well-chosen “screening” task to find promising architectures and parameter ranges within the model family. Our approach to this search was divided into four phases (see Figure 3): Candidate Model Generation, Unsupervised Learning, Screening, and Validation/Analysis of high-performing models.
Phase 1: candidate model generation
Candidate model parameter sets were randomly sampled with a uniform distribution from the full space of possible models in the family considered here (see Figure 2 and Figure S2 for a schematic diagram of the models, and Supplemental Materials for an exhaustive description of model parameters and value ranges that were explored; Supplemental Text S1).
Phase 2: unsupervised learning
All models were subjected to a period of unsupervised learning, during which filter kernels were adapted to spatiotemporal statistics of a stream of input images. Since the family of models considered here includes features designed to take advantage of the temporal statistics of natural inputs (see Supplementary Methods), models were learned using video data. In the current version of our family of models, learning influenced the form of the linear kernels of units at each layer of the hierarchy, but did not influence any other parameters of the model.
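The following is a highly simplified sketch of one step of this kind of trace-biased, winner-take-all kernel adaptation (operating on flattened input patches). The parameter names and the exact update rule shown here are illustrative assumptions; the actual learning rules are described in Supplemental Text S1.

```python
# One step of a trace-biased, winner-take-all kernel update (illustrative only).
import numpy as np

def unsupervised_update(filters, patch, trace, learning_rate=0.01, trace_factor=0.5):
    """filters: (n_filters, patch_size) array; patch: flattened input patch;
    trace: per-filter bias carried over from previous frames."""
    # Select the "winning" filter: the best match to the current patch,
    # biased toward filters that won on recent frames (temporal trace).
    similarity = filters @ patch + trace
    winner = np.argmax(similarity)

    # Move the winning kernel toward the input patch (online density estimation).
    filters[winner] += learning_rate * (patch - filters[winner])
    filters[winner] /= np.linalg.norm(filters[winner]) + 1e-12  # keep kernels normalized

    # Update the trace so that recent winners are favored on the next frame.
    trace *= trace_factor
    trace[winner] += 1.0 - trace_factor
    return filters, trace
```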
We used three video sets for unsupervised learning: “Cars and Planes”, “Boats”, and “Law and Order”. The “Law and Order” video set consisted of clips from the television program of the same name (Copyright NBC Universal), taken from DVDs, with clips selected to avoid the inclusion of text subtitles. These clips included a variety of objects moving through the frame, including characters' bodies and faces.
The “Cars and Planes” and “Boats” video sets consisted of 3D ray-traced cars, planes and boats undergoing 6-degree-of-freedom view transformations (roughly speaking, “tumbling” through space). These same 3D models were also used in a previous study [11]. Video clips were generated where an object would appear for approximately 300 frames, performing a random walk in position (3 degrees of freedom) and rotation (3 degrees of freedom) for a total of 15,000 frames. Examples are shown in Figures 4A and 4B.
For the sake of convenience, we refer to each unsupervised learning video set as a “petri dish,” carrying forward the analogy to high-throughput screening from biology. In the results presented here, 2,500 model instantiations were independently generated in each “petri dish” by randomly drawing parameter values from a uniform distribution (a total of 7,500 models were trained). Examples of filter kernels resulting from this unsupervised learning procedure are shown in Supplemental Figures S3, S4, S5 and S6.
After the end of the Unsupervised Learning Phase, the linear filter kernels were not modified further, and the resulting model was treated as a fixed transformation (e.g. a static image is given as input, and a vector of responses from the units of the final layer is produced as output).
Phase 3: screening
Following the Unsupervised Learning Phase, each “petri dish” was subjected to a Screening Phase to determine which model instantiations produced image representations that are well-suited for performing invariant object recognition tasks.
During the Screening Phase, individual static images were supplied as input to each model, and the vector of responses from the units of its final layer was taken as that model's “representation” of the image. The labeled, “re-represented” images were then reduced in dimensionality by PCA and taken as inputs (training examples) for a classifier (in our case, a linear SVM).
We used a simple “Cars vs. Planes” synthetic object recognition test as a screening task (see [11] for details). In this task, 3D models from two categories (cars and planes) were rendered across a wide range of variation in position, scale, view, and background. The rendered grayscale images (200 by 200 pixels) were provided as input to each model, and a classifier was trained to distinguish car images from plane images (150 training images per category). Performance of each model was then tested on a new set of unlabeled re-represented car and plane images (150 testing images per category). This recognition test has the benefit of being relatively quick to evaluate (because it only contains two classes), while at the same time having previous empirical grounding as a challenging object recognition test due to the large amount of position, scale, view, and background variation [11] (see Figure 5A).
Phase 4: validation
The best models selected during the Screening Phase were submitted to validation tests using other image sets, to determine if the representations generated by the models were useful beyond the immediate screening task. For the present work, four validation sets were used: 1) a new set of rendered cars and planes (generated by the same random process that generated the screening set, but with different specific exemplars), 2) a set of rendered boats and animals, 3) a set of rendered images of two synthetic faces (one male, one female, [10],[12]), and 4) a modified subset of the standard MultiPIE face recognition test set ([27]; here dubbed the “MultiPIE Hybrid” set). In the case of the rendered sets (sets 1–3), as with the screening set, the objects were rendered across a wide range of views, positions, and scales.
For the “MultiPIE Hybrid” set, 50 images each of two individuals from the standard MultiPIE set were randomly selected from the full range of camera angles, lighting, expressions, and sessions included in the MultiPIE set. These faces were manually removed from their backgrounds, further transformed in scale, position, and planar rotation, and composited onto random natural backgrounds. Examples of the resulting images are shown in Figure 5.
For all sets (as with the screening set) classifiers were trained with labeled examples to perform a two-choice task (i.e. Cars vs. Planes, Boats vs. Animals, Face 1 vs. Face 2), and were subsequently tested with images not included in the training set.
While a number of standardized “natural” object and face recognition test sets exist [28]–[34], we made a deliberate choice not to use these sets. Previous investigations [10]–[12],[35],[36] have raised concerns with many of these sets, calling into question whether they appropriately capture the problem of interest. As a result, we chose to focus here on image sets that include substantial image variation by design, be they synthetic (as in our rendered set) or natural (as in the MultiPIE Hybrid set) in origin.
Performance Comparison with Other Algorithms
“V1-like” baseline
Since object recognition performance measures are impossible to interpret in a vacuum, we used a simple V1-like model to serve as one baseline against which model performance can be compared. This V1-like model was taken, without modification, from Pinto et al. [11], and was shown previously to match or exceed the performance of a variety of purpose-built vision systems on the popular (but, we argue, flawed as a test of invariant object recognition) Caltech101 object recognition set and a wide variety of standard face recognition sets (ORL, Yale, CVL, AR, and Labeled Faces in the Wild [10],[12]). Importantly, this model is based on only a first-order description of the first stage of visual processing in the brain, and it contains no mechanisms that should allow it to tolerate the substantial image variation that makes object recognition hard in the first place [11],[19]. Here, this model serves as a lower bound on the amount of trivial regularity that exists in the test set. To be considered promising object recognition systems, models should at least exceed the performance of the V1-like model.
Comparison with state-of-the-art algorithms
To facilitate comparison with other models in the literature, we obtained code for, or re-implemented, five “state of the art” object recognition algorithms from the extant literature: “Pyramid Histogram of Oriented Gradients” (PHOG) [37], “Pyramid Histogram of Words” (PHOW) (also known as the Spatial Pyramid [38]), the “Geometric Blur” shape descriptors [39], the descriptors from the “Scale Invariant Feature Transformation” (SIFT) [40], and the “Sparse Localized Features” (SLF) of Mutch and Lowe [8] (a sparse extension of the C2 features from the Serre et al. HMAX model [7]). In all cases, we were able to reproduce or exceed the authors' reported performance for each system on the Caltech101 test set, which served as a sanity check that we had correctly implemented and used each algorithm as intended by its creators.
Each algorithm was applied using an identical testing protocol to our validation sets. In cases where an algorithm from the literature dictated that filters be optimized relative to each training set (e.g. [38] and [8]), we remained faithful to the authors' published descriptions and allowed this optimization, resulting in a different individually tailored model for each validation set. This was done even though our own high-throughput-derived models were not allowed such per-set optimizations (i.e. the same representation was used for all validation sets), and could therefore theoretically be “handicapped” relative to the state-of-the-art models.
Results
Object Recognition Performance
As a first exploration of our high-throughput approach, we generated 7,500 model instantiations, in three groups of 2,500, with each group corresponding to a different class of unsupervised learning videos (“petri dishes”; see Methods). During the Screening Phase, we used the “Cars vs. Planes” object discrimination task [11] to assess the performance of each model, and the most promising five models from each set of 2,500 models were submitted to further analysis. The raw computation required to generate, train and screen these 7,500 models was completed in approximately one week, using 23 PlayStation 3 systems [41]. Results for models trained with the “Law and Order” petri dish during the Unsupervised Learning Phase are shown in Figure 6A. As expected, the population of randomly-generated models exhibited a broad distribution of performance on the screening task, ranging from chance performance (50%) to better than 80% correct. Figure 6B shows the performance of the best five models drawn from the pool of 2,500 models in the “Law and Order” petri dish. These models consistently outperformed the V1-like model baseline (Figure 7), and this performance was roughly maintained even when the model was retrained with a different video set (e.g. a different clip from Law and Order), or with a different random initialization of the filter kernel weights (Figure 6C).
Since these top models were selected for their high performance on the screening task, it is perhaps not surprising that they all show a high level of performance on that task. To determine whether the performance of these models generalized to other test sets, a series of Validation tests were performed. Specifically, we tested the best five models from each Unsupervised Learning petri dish on four test sets: two rendered object sets, one rendered face set, and a modified subset of the MultiPIE face recognition image set (see Validation Phase in Methods). Performance across each of these validation sets is shown in Figure 7 (black bars). While the exact ordering of model performance varied somewhat from validation set to validation set, the models selected during the Screening Phase performed well across the range of validation tasks.
The top five models found by our high-throughput screening procedure generally outperformed state-of-the-art models from the literature (see Methods) across all sets, with the best model found by the high-throughput search uniformly yielding the highest performance across all validation sets. Even greater performance was achieved by a simple summing of the SVM kernels from the top five models (red bar, Figure 7). Of note, the nearest contender from the set of state-of-the-art models is another biologically-inspired model [7],[8].
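For illustration, the sketch below shows one simple way such a blend can be computed: the linear kernel (Gram) matrices from several models are summed and used to train a single precomputed-kernel SVM. The use of scikit-learn here is an assumption for illustration, not the implementation used in this study.

```python
# Blend several models by summing their linear SVM kernel (Gram) matrices
# and training one classifier on the combined kernel.
import numpy as np
from sklearn.svm import SVC

def blend_and_classify(train_reps, test_reps, train_labels, test_labels):
    """train_reps/test_reps: lists (one entry per model) of feature matrices."""
    # Sum the linear kernel matrices across models.
    K_train = sum(X @ X.T for X in train_reps)                       # (n_train, n_train)
    K_test = sum(Xt @ X.T for Xt, X in zip(test_reps, train_reps))   # (n_test, n_train)

    clf = SVC(kernel='precomputed')
    clf.fit(K_train, train_labels)
    return clf.score(K_test, test_labels)
```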
Interestingly, a large performance advantage of our high-throughput-derived models over the state-of-the-art models was observed for the MultiPIE Hybrid set, even though this set is arguably the most different from the task used for screening, since it is composed of natural images (photographs), rather than synthetic (rendered) ones. It should be noted that several of the state-of-the-art models, including the sparse C2 features (“SLF” in Figure 7), which was consistently the nearest competitor to our models, used filters that were individually tailored to each validation test–i.e. the representation used for “Boats vs. Animals” was optimized for that set, and was different from the representation used for the MultiPIE Hybrid set. This is in contrast to our models, which learned their filters from a completely unrelated video data set (Law and Order) and were screened using an unrelated task (“Cars vs. Planes”, see Methods). While even better performance could no doubt be obtained by screening with a subset taken from each individual validation test, the generalizability of performance across a range of different tasks argues that our approach may be uncovering features and representations that are broadly useful. Such generality is in keeping with the models' biological inspiration, since biological visual representations must be flexible enough to represent a massive diversity of objects in order to be useful.
Results for the 2,500 models in each of the other two “petri dishes” (i.e. models trained with alternate video sets during unsupervised learning) were appreciably similar, and are shown in Supplemental Figures S7 and S8, using the same display conventions set forth in Figures 6 and 7.
Discussion
We have demonstrated a high-throughput framework, within which a massive number of candidate vision models can be generated, screened, and analyzed. Models discovered in this way consistently outperformed an experimentally-motivated baseline model (a V1-like model; [10]–[12]), and the representations of visual space instantiated by these models were found to be generally useful across a variety of object recognition tasks. The best of these models and the blend of the five best models were both found to consistently outperform a variety of state-of-the-art machine vision systems for all of the test sets explored here, even without any additional optimization.
This work builds on a long tradition of machine vision systems inspired by biology (e.g. [1]–[4],[7],[9]). However, while this past work has generated impressive progress towards building artificial visual systems, it has explored only a few examples drawn from the larger space of biologically-inspired models. While the task of exploring the full space of possible model instantiations remains daunting (even within the relatively restricted “first-order” class of models explored here), our results suggest that even a relatively simple, brute-force high-throughput search strategy is effective in identifying promising models for further study. In the parameter space used here, we found that a handful of model instantiations performed substantially better than the rest, with these “good” models occurring at a rate of approximately one in five-hundred. The relative rarity of these models underscores the importance of performing large-scale experiments with many model instantiations, since these models would be easy to miss in a “one-off” mode of exploration. Importantly, these rare, high-performing models performed well across a range of object recognition tasks, indicating that our approach does not simply optimize for a given task, but can uncover visual representations of general utility.
Though not conceptually critical to our approach, modern graphics hardware played an essential role in making our experiments possible. In approximately one week, we were able to test 7,500 model instantiations, which would have taken approximately two years using a conventional (e.g. MATLAB-based) approach. While it is certainly possible to use better-optimized CPU-based implementations, GPU hardware provides large increases in attainable computational power (see Figure 1 and Supplemental Figure S1).
An important theme in this work is the use of parametrically controlled objects as a way of guiding progress. While we are ultimately interested in building systems that tolerate image variation in real-world settings, such sets are difficult to create, and many popular currently-available “natural” object sets have been shown to lack realistic amounts of variation [10]–[12]. Our results show that it is possible to design a small synthetic set to screen and select models that generalize well across various visual classification tasks, suggesting that parametric sets can capture the essence of the invariant object recognition problem. Another critical advantage of the parametric screening approach presented here is that task difficulty can be increased on demand–that is, as models are found that succeed for a given level of image variation, the level of variation (and therefore the level of task difficulty), can be “ratcheted up” as well, maintaining evolutionary “pressure” towards better and better models.
While we have used a variety of synthetic (rendered) object image sets, images need not be synthetic to meet the requirements of our approach. The modified subset of the MultiPIE set used here (“MultiPIE Hybrid”, Figure 5) is an example of how parametric variation can also be achieved using carefully controlled photography.
Future Directions
While our approach has yielded a first crop of promising biologically-inspired visual representations, it is another, larger task to understand how these models work, and why they are better than other alternatives. While such insights are beyond the scope of the present paper, our framework provides a number of promising avenues for further understanding.
One obvious direction is to directly analyze the parameter values of the best models in order to understand which parameters are critical for performance. Figure 8 shows distributions of parameter values for four arbitrarily chosen parameters. While in no way conclusive, there are hints that some particular parameter values may be more important for performance than others (for quantitative analysis of the relationship between model parameters and performance, see Supplemental Text S3, Figures S9 and S10). The speed with which large collections of models can be evaluated opens up the possibility of running large-scale experiments where given parameters are held fixed, or varied systematically. Insights derived from such experiments can then be fed back into the next round of high-throughput search, either by adjusting the parameter search space or by fundamentally adjusting the algorithm itself. Such iterative refinement is an active area of research in our group.
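As one illustrative example of such an analysis (not the analysis reported in Supplemental Text S3), the sketch below ranks parameters by the Spearman rank correlation between their sampled values and screening performance across a pool of models:

```python
# Rank sampled parameters by their (rank) correlation with screening performance.
import numpy as np
from scipy.stats import spearmanr

def rank_parameters(param_matrix, performance, param_names):
    """param_matrix: (n_models, n_params) sampled values; performance: (n_models,) scores."""
    results = []
    for name, values in zip(param_names, param_matrix.T):
        rho, p = spearmanr(values, performance)
        results.append((name, rho, p))
    # Parameters most associated with performance (by absolute correlation) come first.
    return sorted(results, key=lambda r: abs(r[1]), reverse=True)
```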
The search procedure presented here has already uncovered promising visual representations; however, it represents just the simplest first step one might take in conducting a large-scale search. For the sake of minimizing conceptual complexity, and maximizing the diversity of models analyzed, we chose to use a random, brute-force search strategy. However, a rich set of search algorithms exists for potentially increasing the efficiency with which this search is done (e.g. genetic algorithms [42], simulated annealing [43], and particle swarm techniques [44]). Interestingly, our brute-force search found strong models with relatively high probability, suggesting that, while these models would be hard to find by “manual” trial-and-error, they are not especially rare in the context of our high-throughput search.
While better search algorithms will no doubt find better instances from the model class used here, an important future direction is to refine the parameter ranges searched and to refine the algorithms themselves. While the model class described here is large, the class of all models that would count as “biologically-inspired” is even larger. A critical component of future work will be to adjust existing mechanisms to achieve better performance, and to add new mechanisms (including more complex features such as long-range feedback projections). Importantly, the high-throughput search framework presented here provides a coherent means to find and compare models and algorithms, without being unduly led astray by weak sampling of the potential parameter space.
Another area of future work is the application of high-throughput screening to new problem domains. While we have here searched for visual representations that are good for object recognition, our approach could also be applied to a variety of other related problems, such as object tracking, texture recognition, gesture recognition, feature-based stereo-matching, etc. Indeed, to the extent that natural visual representations are flexibly able to solve all of these tasks, we might likewise hope to mine artificial representations that are useful in a wide range of tasks.
Finally, as the scale of available computational resources steadily increases, our approach naturally scales as well, allowing more numerous, larger, and more complex models to be examined. This will give us both the ability to generate more powerful machine vision systems, and to generate models that better match the scale of natural systems, providing more direct footing for comparison and hypothesis generation. Such scaling holds great potential to accelerate both artificial vision research, as well as our understanding of the computational underpinnings of biological vision.
Supporting Information
Acknowledgments
We would like to thank Tomaso Poggio and Thomas Serre for helpful discussions; Roman Stanchak, Youssef Barhomi and Jennie Deutsch for technical assistance, and Andreas Klöckner for supporting PyCUDA. Hardware support was generously provided by the NVIDIA Corporation.
Footnotes
The authors have declared that no competing interests exist.
This study was funded in part by The National Institutes of Health (NEI R01EY014970), The McKnight Endowment for Neuroscience, Dr. Gerald Burnett and Marjorie Burnett, and The Rowland Institute of Harvard. Hardware support was generously provided by the NVIDIA Corporation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
1. Fukushima K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics. 1980;36:193–202. doi: 10.1007/BF00344251.
2. Bishop C. Neural Networks for Pattern Recognition. Oxford University Press; 2005.
3. Riesenhuber M, Poggio T. Hierarchical models of object recognition in cortex. Nature Neuroscience. 1999;2:1019–1025. doi: 10.1038/14819.
4. Rolls ET, Milward T. A model of invariant object recognition in the visual system: learning rules, activation functions, lateral inhibition, and information-based performance measures. Neural Computation. 2000;12:2547–2572. doi: 10.1162/089976600300014845.
5. Rolls ET, Deco G. Computational Neuroscience of Vision. New York: Oxford University Press; 2002.
6. Haykin S. Neural Networks: A Comprehensive Foundation. Upper Saddle River (NJ): Prentice Hall PTR; 1998.
7. Serre T, Wolf L, Bileschi S, Riesenhuber M, Poggio T. Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2007;29:411. doi: 10.1109/TPAMI.2007.56.
8. Mutch J, Lowe DG. Object class recognition and localization using sparse features with limited receptive fields. International Journal of Computer Vision (IJCV). 2008;80:45–57.
9. LeCun Y, Huang F, Bottou L. Learning methods for generic object recognition with invariance to pose and lighting. Computer Vision and Pattern Recognition Conference (CVPR); 2004.
10. Pinto N, DiCarlo JJ, Cox DD. Establishing good benchmarks and baselines for face recognition. European Conference on Computer Vision (ECCV); 2008.
11. Pinto N, Cox DD, DiCarlo JJ. Why is real-world visual object recognition hard? PLoS Computational Biology. 2008;4:e27. doi: 10.1371/journal.pcbi.0040027.
12. Pinto N, DiCarlo JJ, Cox DD. How far can you get with a modern face recognition test set using only simple features? Computer Vision and Pattern Recognition Conference (CVPR); 2009.
13. Moore GE. Cramming more components onto integrated circuits. Electronics. 1965;38:114–117.
14. Kurzak J, Buttari A, Luszczek P, Dongarra J. The PlayStation 3 for high-performance scientific computing. Computing in Science & Engineering. 2008;10:84–87.
15. Yang R. Scientific computing on commodity graphics hardware. Lecture Notes in Computer Science. 2004:1100–1105.
16. Owens JD, Houston M, Luebke D, Green S, Stone JE, et al. GPU computing. Proceedings of the IEEE. 2008;96:879–899.
17. Owens JD, Luebke D, Govindaraju N, Harris M, Krüger J, et al. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum. 2007;26:80–113.
18. Cox DD, Meier P, Oertelt N, DiCarlo JJ. Breaking position-invariant object recognition. Nature Neuroscience. 2005;8:1145. doi: 10.1038/nn1519.
19. DiCarlo JJ, Cox DD. Untangling invariant object recognition. Trends in Cognitive Sciences. 2007;11:333–341. doi: 10.1016/j.tics.2007.06.010.
20. Foldiak P. Learning invariance from transformation sequences. Neural Computation. 1991;3:194–200. doi: 10.1162/neco.1991.3.2.194.
21. Einhäuser W, Hipp J, Eggert J, Körner E, König P. Learning viewpoint invariant object representations using a temporal coherence principle. Biological Cybernetics. 2005;93:79–90. doi: 10.1007/s00422-005-0585-8.
22. Franzius M, Wilbert N, Wiskott L. Invariant object recognition with Slow Feature Analysis. Proceedings of the 18th International Conference on Artificial Neural Networks (ICANN'08), Prague; 2008. doi: 10.1162/NECO_a_00171.
23. Li N, DiCarlo JJ. Unsupervised natural experience rapidly alters invariant object representation in visual cortex. Science. 2008;321:1502. doi: 10.1126/science.1160028.
24. Yao H, Dan Y. Stimulus timing-dependent plasticity in cortical processing of orientation. Neuron. 2001;32:315–323. doi: 10.1016/s0896-6273(01)00460-3.
25. Lindholm E, Nickolls J, Oberman S, Montrym J. NVIDIA Tesla: a unified graphics and computing architecture. IEEE Micro. 2008;28:39–55.
26. PyCUDA. http://mathema.tician.de/software/pycuda.
27. Gross R, Matthews I, Cohn J, Kanade T, Baker S. The CMU multi-pose, illumination, and expression (Multi-PIE) face database. Technical Report TR-07-08, Robotics Institute, Carnegie Mellon University; 2007.
28. Fei-Fei L, Fergus R, Perona P. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Computer Vision and Pattern Recognition Conference (CVPR); 2004.
29. Griffin G, Holub A, Perona P. The Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology; 2007. http://authors.library.caltech.edu/7694.
30. Huang GB, Ramesh M, Berg T, Learned-Miller E. Labeled Faces in the Wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst; October 2007.
31. ORL Face Set. http://www.cl.cam.ac.uk/Research/DTG/attarchive/facedatabase.html.
32. Yale Face Set. http://cvc.yale.edu.
33. CVL Face Set. http://www.lrv.fri.uni-lj.si/facedb.html.
34. AR Face Set. http://cobweb.ecn.purdue.edu/aleix/ar.html.
35. Shamir L. Evaluation of face datasets as tools for assessing the performance of face recognition methods. International Journal of Computer Vision (IJCV). 2008;79:225–230. doi: 10.1007/s11263-008-0143-7.
36. Ponce J, Berg TL, Everingham M, Forsyth DA, Hebert M, et al. Dataset issues in object recognition. Lecture Notes in Computer Science. 2006;4170:29.
37. Bosch A, Zisserman A, Munoz X. Representing shape with a spatial pyramid kernel. International Conference on Image and Video Retrieval (CIVR); 2007.
38. Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. Computer Vision and Pattern Recognition Conference (CVPR); 2006.
39. Berg AC, Malik J. Geometric blur and template matching. Computer Vision and Pattern Recognition Conference (CVPR); 2001.
40. Lowe DG. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV). 2004;60:91–110.
41. Cox DD, Pinto N, Doukhan D, Corda B, DiCarlo JJ. A high-throughput screening approach to discovering good forms of visual representation. Computational and Systems Neuroscience (COSYNE); 2008.
42. Deb K. Multi-Objective Optimization Using Evolutionary Algorithms. Wiley; 2001.
43. Rutenbar R. Simulated annealing algorithms: an overview. IEEE Circuits and Devices Magazine. 1989;5:19–26.
44. Kennedy J, Eberhart R. Particle swarm optimization. IEEE International Conference on Neural Networks; 1995.