Current Opinion in Behavioral Sciences. 2019 Dec;30:100–108. doi: 10.1016/j.cobeha.2019.07.004

Learning to see stuff

Roland W Fleming 1, Katherine R Storrs 1
PMCID: PMC6919301  PMID: 31886321

Highlights

  • Unsupervised deep learning is a powerful framework for studying visual perception.

  • Natural images are structured by ‘latent variables’ (e.g. lighting, reflectance).

  • Learning to encode and predict image structure discovers statistical regularities.

  • These regularities teach the brain about the outside world.

  • Neural networks may reveal cues the brain uses to represent complex materials.

Abstract

Materials with complex appearances, like textiles and foodstuffs, pose challenges for conventional theories of vision. But recent advances in unsupervised deep learning provide a framework for explaining how we learn to see them. We suggest that perception does not involve estimating physical quantities like reflectance or lighting. Instead, representations emerge from learning to encode and predict the visual input as efficiently and accurately as possible. Neural networks can be trained to compress natural images or to predict frames in movies without ‘ground truth’ data about the outside world. Yet, to succeed, such systems may automatically discover how to disentangle distal causal factors. Such ‘statistical appearance models’ potentially provide a coherent explanation of both failures and successes in perception.


Current Opinion in Behavioral Sciences 2019, 30:100–108

This review comes from a themed issue on Visual perception

Edited by Hannah E Smithson and John S Werner

For a complete overview see the Issue and the Editorial

Available online 13th August 2019

https://doi.org/10.1016/j.cobeha.2019.07.004

2352-1546/© 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Beyond inverse optics

Materials such as tweed, leather or scrambled eggs have richly detailed visual appearances (Figure 1a). When we view such materials, we enjoy a vivid impression of their characteristics, such as how they would feel if touched [1,2,3,4••,5]. Yet, due to their physical complexity, they pose profound challenges for traditional ‘inverse optics’ theories of perception [6,7,8,9,77]. Most theories assume the brain’s goal is to estimate physical quantities, like surface reflectance, orientation or depth [10, 11, 12]. Yet when we perceive complex materials, what exactly is the brain ‘estimating’? Many visual properties—such as how faded denim appears, the ripeness of a pear, or the gracefulness of a ballet dancer—are hard to define in physical terms (cf. ‘tertiary properties’ [13,14]; or ‘affordances’ [15]). Moreover, we cannot be born knowing all these properties and how to infer them—denim and ballet dancers did not exist during evolution. Instead, their appearance characteristics must somehow be learned. For properties like these, we must not only learn how to estimate distal properties from image data, but also what to estimate in the first place.

Figure 1.


Learning to see stuff.

(a) Substances such as tweed, leather, and scrambled eggs evoke rich material impressions. (b) Physical parameters (here, azimuth and elevation angle) determine the retinal image (‘forward optics’). Neighbouring physical parameters can give rise to wildly different images (the tangled pink grid), and most possible images look like meaningless noise (cyan dots). Unsupervised learning can discover ‘statistical appearance models’, comprising latent variables that efficiently capture the variation among natural images. (c) Deep neural networks can learn powerful latent codes capturing natural image variations. After training to encode 70 000 real human faces from the FFHQ dataset ([75]; https://github.com/NVlabs/ffhq-dataset; images are public domain as defined under the Creative Commons CC0 1.0 license), a network was able to generate completely novel face images such as the nine shown, which do not correspond to any existing person (generated by Jordan Suchow using the PixelVAE network described in Ref. [29••]).

This leads to a fundamental question. How do we learn to see the outside world? It cannot be primarily through supervised learning because we never get detailed information about the true state of the world. Most points in the visual field are beyond reach, and we cannot feel, taste or smell the colour of a surface. Indeed, all sensory signals are highly ambiguous, so no one sensory modality can provide the ground truth for the others. Although motor actions allow us to probe the world to learn about how it behaves, we can only ever detect the effects of our actions via the senses. Thus, learning how to see the outside world must somehow proceed without explicit labelled training data (or at best exceedingly sparse data; see also [16] for related arguments). Together, these considerations indicate we need an alternative formulation of vision that goes beyond ‘inverse optics’.

We suggest that perception of complex material and object properties does not arise primarily through densely supervised learning, nor indeed through estimating predefined physical quantities. Rather, perceptual representations emerge through learning to encode and predict the visual input as accurately and efficiently as possible. This may seem like a paradoxical claim, yet we propose that the best way to learn how to infer the distal stimulus (i.e. properties of the outside world) is to get really good at describing the proximal stimulus. Recent advances in unsupervised deep learning (see Box 1: The Modern Deep Learning Framework) provide a powerful framework for implementing and testing this conjecture.

Box 1. The modern deep learning framework.

Deep learning is machine learning using deep neural networks. Neural networks are computer models consisting of many interconnected neuron-like units, usually arranged in processing stages or layers. Each unit combines and non-linearly transforms its input signals to produce a numerical output. Deep neural networks (DNNs) consist of multiple layers, allowing a series of intermediate representations between the network’s input and output. The transition from shallow to deeper networks has enabled ground-breaking progress in simulating human-like perceptual [43,44], cognitive [45,46,47] and linguistic [48,49] abilities, and provides a promising modelling approach in perceptual and cognitive neuroscience [50,51].

Like brains, neural networks need to learn how to perform the tasks that are required of them. Knowledge is embodied in the weights with which each unit combines its inputs, which are initially random. The weights are then incrementally updated via a learning algorithm, such as backpropagation, which adjusts the network’s parameters to fulfil a specific objective function—typically, minimizing error on a particular task.

In supervised learning, the network’s weights are adjusted to bring the outputs for the training inputs (e.g. photos of objects) closer to desired outputs (e.g. corresponding object names). Thus, supervised learning involves labelled training data (e.g., [79]). After training, the network can generalise by assigning appropriate outputs to novel inputs (e.g. recognise an object in a photo it has not seen before).

In unsupervised learning, the network learns to capture high-order statistics of its training inputs, rather than return specific desired outputs. For example, an autoencoder network learns to compress high-dimensional input data within a lower-dimensional ‘latent code’ layer, in such a way that the original input can be reconstructed with minimal distortion (Figure 2). Unsupervised learning signals are generally richer than supervised ones; an image autoencoder derives a training signal for every pixel of its attempted reconstruction, whereas a supervised object-recognition network might receive only one object label per image. Unsupervised models therefore attempt to learn all regularities in their data, not only those relevant for a predefined task. This makes unsupervised learning a good candidate method through which to learn visual representations upon which many natural tasks can be performed.
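
To illustrate the contrast in the richness of the two training signals, consider the following minimal sketch (PyTorch assumed; the layer sizes and random data are illustrative placeholders, not any published model):

```python
import torch
import torch.nn as nn

x = torch.rand(32, 3 * 64 * 64)          # a batch of flattened 64x64 colour images
labels = torch.randint(0, 10, (32,))     # one class label per image (supervised case)

# Supervised: a single scalar target per image yields one error term per image.
classifier = nn.Sequential(nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, 10))
supervised_loss = nn.CrossEntropyLoss()(classifier(x), labels)

# Unsupervised autoencoder: every pixel of the reconstruction carries an error
# signal, so each image contributes 3*64*64 constraints rather than one.
autoencoder = nn.Sequential(
    nn.Linear(3 * 64 * 64, 128), nn.ReLU(),    # encoder: compress to a 128-D latent code
    nn.Linear(128, 3 * 64 * 64), nn.Sigmoid()  # decoder: reconstruct the input
)
reconstruction_loss = nn.MSELoss()(autoencoder(x), x)
```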


Statistical appearance models

In contrast to inverse optics, we suggest that through the vast visual diet of our infancy, we learn to parse visual experience in some less physically principled but more ecologically feasible way. In support of this, psychophysical data suggest material perception is often best explained by recourse to features in images, rather than ground-truth physical properties [17,18,19,78]. For example, in gloss perception, variations of shape and lighting cause surfaces with identical reflectances to appear differently glossy [18,78]. Such illusions are difficult to explain if the goal is to recover physical surface reflectance. Yet image features such as the size, contrast and sharpness of highlights predict the ‘erroneous’ gloss judgments well. This suggests that visual processes seek to capture and parameterise statistical variations in the proximal image data, rather than estimate distal scene parameters per se.

Specifically, rather than learning mappings between image quantities (‘cues’) and physical quantities, we learn to represent the dimensions of variation within and among natural images, which in turn arise from the systematic effects that distal properties have on the image. For example, a salient difference between images of surfaces with different reflectance properties is that the size, contrast and sharpness of highlights tend to vary. Thus, the visual system learns to separate surfaces with low-contrast bright blotches, from those with high-contrast blotches. All other things being equal, this is a valid way of distinguishing low-gloss from high-gloss materials. Importantly, however, these systematic variations in highlights can be discovered just by observing images, without knowing a priori that there is a distal factor—specular reflectance—that is responsible for the variations. The discovered dimensions of variation might sometimes roughly align with such physical factors, but may also combine and conflate several physical parameters, leading to what seem like ‘illusions’ from an inverse optics perspective. We call internal representations of the ways images vary statistical appearance models [4••,5]. We suggest such internal models provide an efficient and robust representation, on the basis of which many different estimation tasks can be performed.
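
As a toy illustration of the kind of proximal-image statistics involved, the sketch below computes the size, contrast and sharpness of candidate highlight regions in a single image. The threshold and the three measures are our own illustrative choices, not the specific features established in the cited psychophysical studies:

```python
import numpy as np

def highlight_statistics(luminance, threshold=0.9):
    """luminance: 2D array scaled to [0, 1]; returns (coverage, contrast, sharpness)."""
    mask = luminance > threshold                   # candidate specular-highlight pixels
    if not mask.any() or mask.all():
        return 0.0, 0.0, 0.0                       # no usable highlight/background split
    coverage = mask.mean()                         # proxy for highlight size
    contrast = luminance[mask].mean() - luminance[~mask].mean()
    gy, gx = np.gradient(luminance)                # local luminance gradients
    sharpness = np.hypot(gx, gy)[mask].mean()      # proxy for highlight edge sharpness
    return coverage, contrast, sharpness

# Example: a synthetic 'surface' with one bright blotch standing in for a highlight.
img = np.zeros((100, 100))
img[40:50, 40:50] = 1.0
print(highlight_statistics(img))
```

All other things being equal, higher values of such statistics would be read as higher gloss, regardless of the true reflectance that produced the image.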

This concept of statistical appearance models is somewhat abstract. How, in practice, can the brain learn the ‘natural degrees of variation’ between images? Deep learning provides a rigorous means to implement this idea in image-computable form, and to compare such models to human judgments. To appreciate why, it is useful to consider the statistical distribution of natural images.

Learning about real-world images…

Representing and distinguishing between the images we are likely to experience poses a challenging statistical problem (Figure 1b). To be efficient, visual representations should span the tiny subspace occupied by real-world images—capturing all their possible variations—but not too much more.

Consider a visual ‘world’ of 100 × 100 pixel colour images (typical for website thumbnails). We can usually recognise the content of such images, yet there are 30 000 dimensions of possible variation (one for each of the 100 × 100 pixels in each of three colour channels). Importantly, however, only a tiny proportion of the images in this space represent plausible scenes and objects from the real world. This is because the physical and optical generative processes that create natural images give rise to statistical dependencies across pixels. When three-dimensional objects are illuminated and projected onto the retina they yield images with high-order correlations between pixels.

Because most images in the space are highly unlikely as real images, brains need not encode them in a way that allows us to easily differentiate among them. Indeed, almost all possible images look like near-indistinguishable random noise to human observers. This means that the brain can use a lower-dimensional ‘latent code’ to represent real images more efficiently.

There is a long history of posing sensory processing in terms of efficient coding theory. Since Attneave [20] and Barlow [21], a prominent idea has been that neural response properties are determined by the goal of efficiently encoding input data [22, 23, 24]. This has been quite successful at explaining aspects of low-level vision, but until very recently it has not been a feasible approach to higher-level perception of objects and materials.

… discovers latent variables

The key insight that allows us to bridge efficient coding and high-level vision is as follows. Natural images derive their structure from all the generative processes of the natural world: everything from the laws of physics—perspective projection and specular reflection—to the fact that faces have two eyes and bicycles two wheels. The best way to get really efficient and accurate at representing the set of natural images is to discover latent variables that structure those images—not in terms of physical laws, but in terms of statistical relationships between elements in the image.

It is more efficient to represent images in terms of latent variables because they describe variations in a much more compact code. For example, viewing an object from different directions generates widely varying images (see Figure 1b). But all these images occupy a 2D manifold, as all possible variations among them can be summarised with just two parameters (uniquely specifying the viewing angle), given a particular object and a fixed viewing distance. A compact representation that specifies relationships between these images in terms of two numbers has learnt something important about the outside world, even if those two latent variables do not correspond exactly to azimuth and elevation.
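
To make the manifold idea concrete, here is a minimal numpy sketch in which a hypothetical ‘renderer’ (a stand-in for real image formation) maps two angles to an image:

```python
import numpy as np

def render(azimuth, elevation, size=100):
    """Hypothetical 'renderer': a smooth image pattern controlled by two angles."""
    u, v = np.meshgrid(np.linspace(0, 2 * np.pi, size),
                       np.linspace(0, 2 * np.pi, size))
    return np.sin(u * np.cos(azimuth) + v * np.sin(azimuth) + elevation)

# 500 views: each image is a 10 000-dimensional pixel vector (grayscale here for
# simplicity), yet the whole family is generated by just two numbers per image.
angles = np.random.uniform(0, 2 * np.pi, size=(500, 2))
images = np.stack([render(az, el).ravel() for az, el in angles])
print(images.shape)   # (500, 10000): huge ambient dimension, 2-D latent manifold
```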

Thus, discovering latent variables does not necessarily require densely supervised learning, where the visual system is taught explicit mappings between image cues and physical properties. It can also be achieved through unsupervised learning, in which a system discovers regularities in its input data by itself. Through an objective function that seeks to capture the variations in proximal image data as well as possible, we may end up with internal representations that are well suited for describing the distal scene factors that created those images. Although the latent variables discovered by unsupervised learning may not correspond perfectly to the true physical factors of the world, they may provide the basis for the perceptual dimensions that emerge when observers are asked to perform specific tasks (e.g. gloss judgements). Learning such representations requires the inference powers of deep learning... as well as lots and lots of training data.

Case studies in unsupervised visual learning

To acquire statistical appearance models, we need a learning framework that knows nothing about the outside world but gets to observe (potentially many) samples drawn from it. Here we highlight a few promising implementations of unsupervised learning in deep neural networks (although, see also Box 2: Caveats and Open Challenges).

Box 2. Caveats and open challenges.

We have argued that unsupervised learning is an important part of how our rich perceptual impressions emerge, but it is likely not the full story of how we see.

  • Not all visual competences are learned within a single lifetime. Humans can discriminate between simple visual patterns at birth [52, 53, 54] and perhaps even before [55], and fundamental elements of spatial vision, such as the ability to segment objects in depth, may be innate [56]. Adult visual abilities combine those ‘baked in’ by evolution with those learned from experience. Even before birth, spontaneous neural activity may help structure our visual systems via unsupervised learning rules [57,58]. Functional specialisation represents another challenge for theories based purely on learning [59].

  • We have concentrated on learning through passive visual observation, but there is evidence that active exploration facilitates visual development [42,60,61,62,63]. However, many of our visual abilities develop before motor control allows us to make precise modifications to the world. Moreover, if vision could not be learnt without action, then congenital tetraplegics—who can barely alter the world through actions during development—would have devastating perceptual deficits, whereas their deficits are in fact mild [64]. This suggests that while motor control refines our visual abilities, it is not a sine qua non for seeing the outside world.

  • Reinforcement is another learning objective that does not require access to ground-truth-labelled data. During training, models learn to output actions (e.g. movements) that yield rewards. This approach provides sparser training signals than objectives like reconstruction or prediction, since rewards are relatively scarce, yet can still lead to rich perceptual representations. Recent successes, for example in the domain of video game playing, point to its potential power [45,65]. Evolution may also be thought of as a type of reinforcement learning.

  • Concerns are often raised about the power of neural networks as explanatory models [66, 67, 68]. Yet, like animal models of psychiatric disorders, they provide a useful experimental platform. Neural networks should not be thought of as black boxes, as researchers have been able to discover much about what is represented within the latent codes of trained networks (e.g. [32••,35••]), and use them to make quantitative predictions for perception of novel stimuli (e.g. [33,69••]). Image-computable models can be compared to biological visual systems at many levels of abstraction, from predicting neural activity to behaviour. A good model should exhibit detailed patterns of behaviour (e.g. errors, response times, and sensitivity to specific stimulus manipulations) similar to those found in humans [51,70,71,80]. However, behavioural similarity is necessary but not sufficient to show that a model does visual tasks in the same way brains do. Further refining methods for interrogating, distilling and interpreting the computational strategies learned by networks is an ongoing challenge [72, 73, 74].


Data compression

One potentially important objective is to encode images as compactly as possible. Autoencoders [25••] are feedforward networks that reconstruct inputs after compressing them via a ‘bottleneck’ consisting of many fewer units than their inputs (Figure 2). Because of the bottleneck, they must learn a low-dimensional representation from which the originals can still be accurately reconstructed. As a result, they tend to discover latent variables that are good at capturing complex statistical variations across images. This can allow them to disentangle distinct causal contributions to observed data. In Figure 2 we show a ‘toy’ example of an autoencoder learning to separate material classes without explicit labels.
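
A minimal sketch of such an autoencoder follows (PyTorch assumed; the channel counts and image size are our own placeholder choices, not the exact architecture behind Figure 2):

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: four strided convolutions with successively fewer channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Conv2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),   # 16x16 -> 8x8
            nn.Conv2d(8, 4, 4, stride=2, padding=1),                # 8x8 -> 4x4 latent code
        )
        # Decoder: mirror the encoder back to the original dimensionality.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
images = torch.rand(8, 3, 64, 64)                # stand-in batch of texture images
loss = nn.MSELoss()(model(images), images)       # pixelwise reconstruction objective
loss.backward()
```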

Figure 2.


Unsupervised image compression can discover natural material types.

Top right: schematic of an autoencoder network trained on images of natural textures. Images are passed through four convolutional layers with successively fewer units, before being expanded back to the original dimensionality. The learning objective is to minimise the pixelwise difference between original and reconstructed images. Bottom left: by applying the dimensionality reduction method tSNE [76] to 3000 images depicting fur, gravel, or wool, we see that these categories are highly intermixed in image space. The tSNE algorithm embeds high-dimensional data into two dimensions for visualisation, while preserving local distances between nearby points as faithfully as possible. Bottom right: when the same algorithm is applied to the representations of the images within the trained autoencoder’s latent code, strong clusters emerge corresponding to the natural material types.
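
The analysis in the bottom panels can be sketched as follows (scikit-learn assumed; this reuses the hypothetical model and images from the previous sketch, whereas the figure used 3000 real texture images):

```python
import torch
from sklearn.manifold import TSNE

with torch.no_grad():
    latents = model.encoder(images).flatten(start_dim=1)   # one latent code per image

# Embed the same images twice: once from raw pixels, once from latent codes.
# (Perplexity is set low only because this toy batch is tiny.)
pixel_embedding = TSNE(n_components=2, perplexity=5).fit_transform(
    images.flatten(start_dim=1).numpy())
latent_embedding = TSNE(n_components=2, perplexity=5).fit_transform(latents.numpy())
# With a trained network and enough data, material categories should cluster in
# latent_embedding while remaining intermixed in pixel_embedding.
```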

Prediction in space

A closely related objective deals not with reconstructing inputs pixel-for-pixel, but with predicting pixels from their local neighbourhoods. For example, autoregressive networks like PixelCNN [26,27] and PixelVAE [28] learn a high-order statistical representation of the training set in their latent codes. They are exceptionally good at generating novel images that emulate the structure of natural images. For example, a network trained on thousands of portrait photos synthesises completely new human faces that are close to photographic quality, some samples of which are shown in Figure 1c [29••]. Importantly, the latent space representations are systematically organised, such that similar latent values yield similar faces, gradually changing appearance from one identity to another via physically plausible variations (e.g. the nose gradually widens and eyebrows thicken, accurately rendered in the image). The richness of the generative model suggests such networks are well suited to representing complex natural materials. It will be intriguing to investigate how their latent representations relate to human perceptual judgments of feature appearance and stimulus similarity.
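
The mechanism that lets such networks predict each pixel from its neighbourhood is a masked convolution, which hides the current pixel and everything below or to the right of it. A simplified single-channel sketch of the idea (PyTorch assumed; a schematic fragment, not the published PixelCNN architecture):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose receptive field covers only 'past' pixels (above/left)."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        k = self.kernel_size[0]
        mask = torch.ones_like(self.weight)
        mask[:, :, k // 2, k // 2:] = 0    # hide the current pixel and those to its right
        mask[:, :, k // 2 + 1:, :] = 0     # hide all rows below the current pixel
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask      # re-apply the causal mask at every step
        return super().forward(x)

layer = MaskedConv2d(1, 16, kernel_size=5, padding=2)
out = layer(torch.rand(1, 1, 28, 28))      # each output location sees only prior pixels
```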

Prediction in time

Temporal prediction may be fundamental to how brains learn and perceive. Predictive coding theories propose that an internal generative model creates predictions of future sensory signals, and then differences between the predictions and subsequent sensory feedback (prediction error signals) update and refine the model [30,31]. A recurrent deep neural network trained by predictive error coding, PredNet [32••,33], learnt to predict (i.e. synthesise) future frames in videos of natural environments (Figure 3). PredNet consists of a hierarchy of stages with bidirectional connections, allowing more abstract representations of the movie content to be inferred in deeper layers and to influence predictions at the local pixel level. The network uses long short-term memory (LSTM) units to keep track of long-range temporal dependencies [34]. Intriguingly, PredNet spontaneously discovered, in its deeper layers, higher-level properties of the objects depicted in videos, such as facial identity and pose [32••], and its individual units reproduced certain temporal dynamics of primate visual neurons [33]. Thus, to get good at representing the proximal stimulus unfolding over time, the networks tend to infer distal causes. For example, predicting the un-occlusion of previously invisible shape features as an object rotates, or the sudden expansion of specular highlights as they rush across a moving surface, may require a deep understanding of how the world works.
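
As a schematic of the objective only (PyTorch assumed; PredNet's hierarchical error-coding architecture is far richer than this, and all sizes here are illustrative), a minimal next-frame predictor might look like this:

```python
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    def __init__(self, height=32, width=32, hidden=256):
        super().__init__()
        self.pixels = height * width
        self.lstm = nn.LSTM(self.pixels, hidden, batch_first=True)  # temporal memory
        self.readout = nn.Linear(hidden, self.pixels)                # render the prediction

    def forward(self, frames):               # frames: (batch, time, height*width)
        states, _ = self.lstm(frames)
        return self.readout(states)          # predicted next frame at every timestep

model = NextFramePredictor()
video = torch.rand(4, 10, 32 * 32)           # stand-in grayscale clips
predicted = model(video[:, :-1])             # predict frames 2..10 from frames 1..9
prediction_error = nn.MSELoss()(predicted, video[:, 1:])   # the only training signal
prediction_error.backward()
```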

Figure 3.


Unsupervised video prediction can discover physical scene properties.

A recurrent network of the PredNet architecture [32••] trained to predict the next frame in a simple simulated world of rotating checkered cubes. Deeper layers attempt to predict activation in preceding layers (green feedback arrows), while lower layers send up prediction errors (red feedforward arrows) and each layer propagates its current state to the next time point using LSTM units (purple recurrent arrows). Top right: Visualised activations of individual units in response to three frames of a video (brighter pixel values indicate stronger activation to a location in the frame). The unit visualised in the first row responds almost exclusively to the shadow cast by the object, but not to other shadows in the environment or to dark regions on the object. The unit visualised in the second row responds almost exclusively to moving reflectance edges on the object, but not to moving shadow edges or to still edges.

Another impressive variant on the theme of prediction is Generative Query Networks [35••]. During training, the network is queried to render an image of a simulated 3D environment not simply at the next timepoint, but as it would appear from a different viewpoint. This again encouraged high-level latent scene representations, from which it was possible to decode object identities, positions, shapes and colours without any explicit labelled training data.

‘Curiosity-based’ learning is an example of the exciting possibilities that emerge when visual learning is embodied in an agent that does not merely observe passively, but can also act on the world it observes. Motivated by animal learning, such networks actively seek out the most informative parts of their environments during learning [36••]. The network outputs both an action (a movement of itself or another object in the scene), and a pixelwise prediction of what its sensory input should be after performing that action. The ‘curiosity’ objective is implemented by training the network to select actions for which it has minimum confidence in its visual prediction—that is, actions whose consequences it does not yet know.
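
A minimal sketch of this action-selection loop (PyTorch assumed; all module sizes and names are hypothetical illustrations, not the implementation of Ref. [36••]):

```python
import torch
import torch.nn as nn

obs_dim, act_dim, n_candidates = 64, 4, 16

world_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
                            nn.Linear(128, obs_dim))   # predicts the next observation
error_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
                            nn.Linear(128, 1))         # predicts its own prediction error

obs = torch.rand(obs_dim)
candidates = torch.rand(n_candidates, act_dim)          # sampled candidate actions
inputs = torch.cat([obs.expand(n_candidates, -1), candidates], dim=1)

predicted_errors = error_model(inputs).squeeze(1)
chosen_action = candidates[predicted_errors.argmax()]   # the most 'curious' choice
# After acting, the observed outcome supervises world_model, and the realised
# prediction error supervises error_model, closing the curiosity loop.
```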

We can expect significant advances in unsupervised and weakly-supervised learning in the coming years as they receive increasing attention in machine learning research [37, 38, 39, 40]. Both biological and artificial visual systems may profit by employing hybrid strategies, in which unsupervised learning creates a robust representation of the structures in our visual worlds, and sparse supervision or reward signals tweak these for the performance of specific tasks [41,42].

Conclusions

It is tempting to formulate vision as the estimation of physical quantities, such as size, distance or reflectance. But to understand complex appearance, we need to let go of ‘inverse optics’. The brain does not estimate predefined physical properties. Instead it represents the ‘typical appearance’ of surfaces and objects in the proximal image. That is, it identifies and represents the statistical ways that images differ from one another. When presented with a bobbly woollen sweater, what would it mean to estimate its ‘bobbliness’? And how would we learn to estimate it with no way of ever knowing the ground truth? This is the key insight. In learning to describe the proximal stimulus efficiently and accurately, the visual system discovers latent variables that are responsible for the image structure: everything from the physics of specular reflectance, to the fact that shirts have buttons evenly spaced in a vertical line. The physical properties of materials are just one example of latent factors that give images their structure. By identifying parameters of variation in the proximal stimulus, statistical appearance models provide a route to inferring the outside world without labelled training data.

Unsupervised deep learning offers a framework for implementing this idea. Here we have suggested that learning objectives such as prediction, compression and curiosity give rise to rich internal representations upon which many estimation tasks can be performed. Finding the right unsupervised learning objectives may be the key to explaining both the successes and failures of human material perception, and vision more broadly.

Conflict of interest statement

Nothing declared.

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:

  • of special interest

  •• of outstanding interest

Acknowledgements

This work was supported by the DFG (SFB-TRR-135: ‘Cardinal Mechanisms of Perception’, project number 222641018) and an ERC Consolidator Award (ERC-2015-CoG-682859: ‘SHAPE’). The authors are extremely grateful to Jordan Suchow for extensive exchanges, for retraining the PixelVAE network on the FFHQ training set, and for providing the resulting images presented in Fig. 1.

References

  • 1.Adelson E.H. On seeing stuff: the perception of materials by humans and machines. Rogowitz B.E., Pappas T.N., editors. Proceedings SPIE Human Vision and Electronic Imaging VI. 2001;vol 4299:1–12. [Google Scholar]
  • 2.Anderson B.L. Visual perception of materials and surfaces. Curr Biol. 2011;21:R978–R983. doi: 10.1016/j.cub.2011.11.022. [DOI] [PubMed] [Google Scholar]
  • 3.Zaidi Q. Visual inferences of material changes: color as clue and distraction. Wiley Interdiscip Rev Cogn Sci. 2011;2:686–700. doi: 10.1002/wcs.148. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4••.Fleming R.W. Visual perception of materials and their properties. Vis Res. 2014;94:62–75. doi: 10.1016/j.visres.2013.11.004. [DOI] [PubMed] [Google Scholar]; This article argued that rather than estimating the physical properties of objects and materials, the visual system may instead infer ‘statistical appearance models’: perceptual representations that describe the ways that the proximal stimulus associated with a given material typically varies.
  • 5.Fleming R.W. Material perception. Ann Rev Vis Sci. 2017;3:365–388. doi: 10.1146/annurev-vision-102016-061429. [DOI] [PubMed] [Google Scholar]
  • 6.Marr D. Freeman; San Francisco: 1982. Vision. [Google Scholar]
  • 7.Poggio T., Torre V., Koch C. Computational vision and regularization theory. Nature. 1985;317:314–319. doi: 10.1038/317314a0. [DOI] [PubMed] [Google Scholar]
  • 8.Pizlo Z. Perception viewed as an inverse problem. Vis Res. 2001;41:3145–3161. doi: 10.1016/s0042-6989(01)00173-0. [DOI] [PubMed] [Google Scholar]
  • 9•.Kersten D., Mamassian P., Yuille A. Object perception as Bayesian inference. Ann Rev Psychol. 2004;55:271–304. doi: 10.1146/annurev.psych.55.090902.142005. [DOI] [PubMed] [Google Scholar]; One of the clearest articulations of the ‘inverse optics’ framework, posing the perception of surfaces and objects as the optimal inference of physical properties, taking into account inherent uncertainties and ambiguities.
  • 10.Maloney L.T., Wandell B.A. Color constancy: a method for recovering surface spectral reflectance. J Opt Soc Am A. 1986;3:29–33. doi: 10.1364/josaa.3.000029. [DOI] [PubMed] [Google Scholar]
  • 11.Kim S., Burge J. The lawful imprecision of human surface tilt estimation in natural scenes. eLife. 2018;7:31448. doi: 10.7554/eLife.31448. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Howard I.P., Rogers B.J. vol 2. University of Toronto Press; 2002. Seeing in depth. (Depth Perception). [Google Scholar]
  • 13.Koffka K. Problems in the psychology of art. Bernheimer R., Carpenter R., Koffka K., Nahm M.C., editors. Art, a Bryn Mawr Symposium. 1940:180–272. Oriole Editions. [Google Scholar]
  • 14.Sinico M. Tertiary qualities, from Galileo to Gestalt psychology. Hist Hum Sci. 2015;28:68–79. [Google Scholar]
  • 15.Gibson J.J. Lawrence Erlbaum Associates; Hillsdale, NJ: 1979. The Ecological Approach to Visual Perception. [Google Scholar]
  • 16.Anderson B.L. Can computational goals inform theories of vision? Top Cogn Sci. 2015;7:274–286. doi: 10.1111/tops.12136. [DOI] [PubMed] [Google Scholar]
  • 17.Doerschner K., Fleming R.W., Yilmaz O., Schrater P.R., Hartung B., Kersten D. Visual motion and the perception of surface material. Curr Biol. 2011;21:1–7. doi: 10.1016/j.cub.2011.10.036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Fleming R.W. Visual heuristics in the perception of glossiness. Curr Biol. 2012;22:R865–R866. doi: 10.1016/j.cub.2012.08.030. [DOI] [PubMed] [Google Scholar]
  • 19.Muryy A.A., Fleming R.W., Welchman A.E. ‘Proto-rivalry’: how the binocular brain identifies gloss. Proc R Soc B. 2016;283 doi: 10.1098/rspb.2016.0383. 20160383. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Attneave F. Some informational aspects of visual perception. Psychol Rev. 1954;61:183–193. doi: 10.1037/h0054663. [DOI] [PubMed] [Google Scholar]
  • 21.Barlow H.B. Possible principles underlying the transformation of sensory messages. In: Rosenblith W.A., editor. Sensory Communication. MIT Press; Cambridge, MA: 1961. pp. 217–236. [Google Scholar]
  • 22.Olshausen B.A., Field D.J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature. 1996;381:607. doi: 10.1038/381607a0. [DOI] [PubMed] [Google Scholar]
  • 23.Olshausen B.A., Field D.J. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vis Res. 1997;37:3311–3325. doi: 10.1016/s0042-6989(97)00169-7. [DOI] [PubMed] [Google Scholar]
  • 24.Simoncelli E.P., Olshausen B.A. Natural image statistics and neural representation. Ann Rev Neurosci. 2001;24:1193–1216. doi: 10.1146/annurev.neuro.24.1.1193. [DOI] [PubMed] [Google Scholar]
  • 25••.Hinton G.E., Salakhutdinov R.R. Reducing the dimensionality of data with neural networks. Science. 2006;313:504–507. doi: 10.1126/science.1127647. [DOI] [PubMed] [Google Scholar]; This article introduced autoencoders, and explored their possibilities for nonlinear dimensionality reduction and classification.
  • 26.Van den Oord A., Kalchbrenner N., Espeholt L., Vinyals O., Graves A. Advances in Neural Information Processing Systems. 2016. Conditional image generation with pixelCNN decoders; pp. 4790–4798. [Google Scholar]
  • 27.Salimans T., Karpathy A., Chen X., Kingma D.P. PixelCNN++: improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv. 2017 preprint, arXiv:1701.05517. [Google Scholar]
  • 28.Gulrajani I., Kumar K., Ahmed F., Taiga A.A., Visin F., Vazquez D., Courville A. PixelVAE: a latent variable model for natural images. arXiv. 2016 preprint, arXiv:1611.05013. [Google Scholar]
  • 29••.Suchow J.W., Peterson J.C., Griffiths T.L. Learning a face space for experiments on human identity. arXiv. 2018 preprint, arXiv:1805.07653. [Google Scholar]; A PixelVAE trained on a database of 3000 frontal portrait photos can synthesise novel images that are practically indistinguishable from real photos. Skin tone and texture, hair, physiognomy and human-like eye glints are all faithfully reproduced. Importantly, the latent space is systematically organised, with meaningful similarities between nearby samples from the space.
  • 30•.Rao R.P., Ballard D.H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat Neurosci. 1999;2:79. doi: 10.1038/4580. [DOI] [PubMed] [Google Scholar]; A hierarchical neural circuit is formulated that implements predictive coding, representing at each step only that part of the incoming data which has not been successfully predicted. The circuit model is shown to display emergent behaviour consistent with certain properties of cortical visual neurons, and served as the basis for the PredNet recurrent deep neural network [32••].
  • 31.Srinivasan M.V., Laughlin S.B., Dubs A. Predictive coding: a fresh view of inhibition in the retina. Proc R Soc Lond B. 1982;216:427–459. doi: 10.1098/rspb.1982.0085. [DOI] [PubMed] [Google Scholar]
  • 32••.Lotter W., Kreiman G., Cox D. Deep predictive coding networks for video prediction and unsupervised learning. arXiv. 2016 preprint, arXiv:1605.08104. [Google Scholar]; This article introduced PredNet, a network which learns to predict forthcoming frames in movies. When trained on footage from car-top cameras, it not only correctly predicts how visible objects will move in the image as the car drives forwards, it also anticipates the arrival into view of trees and buildings as the car turns corners. Importantly, when trained on computer graphics images, underlying scene parameters can be decoded from the network’s latent representation.
  • 33.Lotter W., Kreiman G., Cox D. A neural network trained to predict future video frames mimics critical properties of biological neuronal responses and perception. arXiv. 2018 preprint, arXiv:1805.10734. [Google Scholar]
  • 34.Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735. [DOI] [PubMed] [Google Scholar]
  • 35••.Eslami S.A., Rezende D.J., Besse F., Viola F., Morcos A.S., Garnelo M., Reichert D.P. Neural scene representation and rendering. Science. 2018;360:1204–1210. doi: 10.1126/science.aar6170. [DOI] [PubMed] [Google Scholar]; This article introduced Generative Query Networks, a deep learning framework within which networks learn to predict the visual appearance of a simulated environment from a specific novel viewpoint. Trained networks were able to render impressively accurate images from new viewpoints, even when only provided with one or two images of an environment. Furthermore, networks learned factorised latent representations from which distal properties such as object colour, shape and position could be read out.
  • 36••.Haber N., Mrowca D., Fei-Fei L., Yamins D.L. Emergence of structured behaviors from curiosity-based intrinsic motivation. arXiv. 2018 preprint, arXiv:1802.07461. [Google Scholar]; Here ‘curiosity’ is implemented in a deep learning framework. A neural network controls an agent in a simple simulated environment, and learns both to predict the next state of its environment, as well as predicting its own error in this prediction. The agent then chooses to perform actions with the most uncertain consequences, actively seeking out the most informative training data.
  • 37.Garnelo M., Shanahan M. Reconciling deep learning with symbolic artificial intelligence: representing objects and relations. Curr Opin Behav Sci. 2019;29:17–23. [Google Scholar]
  • 38.Hassabis D., Kumaran D., Summerfield C., Botvinick M. Neuroscience-inspired artificial intelligence. Neuron. 2017;95:245–258. doi: 10.1016/j.neuron.2017.06.011. [DOI] [PubMed] [Google Scholar]
  • 39.LeCun Y., Bengio Y., Hinton G. Deep learning. Nature. 2015;521(7553):436. doi: 10.1038/nature14539. [DOI] [PubMed] [Google Scholar]
  • 40.Marcus G. Deep learning: a critical appraisal. arXiv. 2018 preprint, arXiv:1801.00631. [Google Scholar]
  • 41.Bengio Y., Courville A., Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35:1798–1828. doi: 10.1109/TPAMI.2013.50. [DOI] [PubMed] [Google Scholar]
  • 42.Smith L.B., Slone L.K. A developmental approach to machine learning? Front Psychol. 2017;8:2124. doi: 10.3389/fpsyg.2017.02124. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43•.Krizhevsky A., Sutskever I., Hinton G.E. Advances in Neural Information Processing Systems. 2012. Imagenet classification with deep convolutional neural networks; pp. 1097–1105. [Google Scholar]; This landmark study introduced AlexNet, an eight-layer supervised network, which achieved a step-change improvement in the challenging ImageNet object recognition task, in which 1000 different classes of object must be recognised from real world photographs.
  • 44.He K., Zhang X., Ren S., Sun J. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Deep residual learning for image recognition; pp. 770–778. [Google Scholar]
  • 45•.Jaderberg M., Czarnecki W.M., Dunning I., Marris L., Lever G., Castaneda A.G., Sonnerat N. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv. 2018 doi: 10.1126/science.aau6249. preprint, arXiv:1807.01281. [DOI] [PubMed] [Google Scholar]; A breakthrough in reinforcement learning of complex tasks, demonstrating human-level performance in the 3D video game Quake III by deep recurrent neural networks which receive as input only screen pixels and signals indicating key game events.
  • 46.Santoro A., Raposo D., Barrett D.G., Malinowski M., Pascanu R., Battaglia P., Lillicrap T. Advances in Neural Information Processing Systems. 2017. A simple neural network module for relational reasoning; pp. 4967–4976. [Google Scholar]
  • 47•.Silver D., Huang A., Maddison C.J., Guez A., Sifre L., Van Den Driessche G., Dieleman S. Mastering the game of Go with deep neural networks and tree search. Nature. 2016;529:484. doi: 10.1038/nature16961. [DOI] [PubMed] [Google Scholar]; A famous example of neural networks achieving performance that exceeds humans on a challenging task. The game of Go is combinatorically too large for conventional game theory solutions (as used in chess). Yet a neural network trained through millions of simulated and recorded games to recognise winning patterns could beat the world champion human player.
  • 48.Wu Y., Schuster M., Chen Z., Le Q.V., Norouzi M., Macherey W., Klingner J. Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv. 2016 preprint, arXiv:1609.08144. [Google Scholar]
  • 49.Radford A., Jozefowicz R., Sutskever I. Learning to generate reviews and discovering sentiment. arXiv. 2017 preprint, arXiv:1704.01444. [Google Scholar]
  • 50•.Yamins D.L., DiCarlo J.J. Using goal-driven deep learning models to understand sensory cortex. Nat Neurosci. 2016;19:356. doi: 10.1038/nn.4244. [DOI] [PubMed] [Google Scholar]; A comprehensive and accessible review of work comparing deep neural networks to biological visual and auditory cortex, with a focus on models trained via supervised learning.
  • 51•.Storrs K.R., Kriegeskorte N. Deep learning for cognitive neuroscience. In: Gazzaniga M., editor. The Cognitive Neurosciences. edn 6. MIT Press; Boston: 2019. [Google Scholar]; This chapter provides a detailed review of deep learning methods and their application to perception and neuroscience.
  • 52.Cassia V.M., Turati C., Simion F. Can a nonspecific bias toward top-heavy patterns explain newborns’ face preference? Psychol Sci. 2004;15:379–383. doi: 10.1111/j.0956-7976.2004.00688.x. [DOI] [PubMed] [Google Scholar]
  • 53.Fantz R.L. Pattern vision in newborn infants. Science. 1963;140:296–297. doi: 10.1126/science.140.3564.296. [DOI] [PubMed] [Google Scholar]
  • 54.Valenza E., Simion F., Cassia V.M., Umiltà C. Face preference at birth. J Exp Psychol Hum Percept Perform. 1996;22:892. doi: 10.1037//0096-1523.22.4.892. [DOI] [PubMed] [Google Scholar]
  • 55.Reid V.M., Dunn K., Young R.J., Amu J., Donovan T., Reissland N. The human fetus preferentially engages with face-like visual stimuli. Curr Biol. 2017;27:1825–1828. doi: 10.1016/j.cub.2017.05.044. [DOI] [PubMed] [Google Scholar]
  • 56.Spelke E.S. Principles of object perception. Cogn Sci. 1990;14:29–56. [Google Scholar]
  • 57.Ackman J.B., Crair M.C. Role of emergent neural activity in visual map development. Curr Opin Neurobiol. 2014;24:166–175. doi: 10.1016/j.conb.2013.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Penn A.A., Shatz C.J. Brain waves and brain wiring: the role of endogenous and sensory-driven neural activity in development. Pediatr Res. 1999;45:447. doi: 10.1203/00006450-199904010-00001. [DOI] [PubMed] [Google Scholar]
  • 59.Kanwisher N. Functional specificity in the human brain: a window into the functional architecture of the mind. Proc Natl Acad Sci U S A. 2010;107:11163–11170. doi: 10.1073/pnas.1005062107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Bushnell E.W., Boudreau J.P. Motor development and the mind: the potential role of motor abilities as a determinant of aspects of perceptual development. Child Dev. 1993;64:1005–1021. [PubMed] [Google Scholar]
  • 61.Gibson E.J. Exploratory behavior in the development of perceiving, acting, and the acquiring of knowledge. Ann Rev Psychol. 1988;39:1–41. [Google Scholar]
  • 62.Held R., Hein A. Movement-produced stimulation in the development of visually guided behavior. J Comp Physiol Psychol. 1963;56:872. doi: 10.1037/h0040546. [DOI] [PubMed] [Google Scholar]
  • 63.Schwarzer G. How motor and visual experience shape infants’ processing of objects and faces. Child Dev Perspect. 2014;8:213–217. [Google Scholar]
  • 64.Stiers P., Vanderkelen R., Vanneste G., Coene S., De Rammelaere M., Vandenbussche E. Visual–perceptual impairment in a random sample of children with cerebral palsy. Dev Med Child Neurol. 2002;44:370–382. doi: 10.1017/s0012162201002249. [DOI] [PubMed] [Google Scholar]
  • 65•.Mnih V., Kavukcuoglu K., Silver D., Rusu A.A., Veness J., Bellemare M.G., Petersen S. Human-level control through deep reinforcement learning. Nature. 2015;518:529–533. doi: 10.1038/nature14236. [DOI] [PubMed] [Google Scholar]; Like [45], a breakthrough in reinforcement learning of complex tasks. Here deep recurrent networks learn to play 2D arcade video games at a professional human level, receiving only screen pixels and the game score as training input.
  • 66.Kay K.N. Principles for models of neural information processing. NeuroImage. 2017;180:101–109. doi: 10.1016/j.neuroimage.2017.08.016. [DOI] [PubMed] [Google Scholar]
  • 67.Lake B.M., Ullman T.D., Tenenbaum J.B., Gershman S.J. Building machines that learn and think like people. Behav Brain Sci. 2017;40 doi: 10.1017/S0140525X16001837. [DOI] [PubMed] [Google Scholar]
  • 68.Lipton Z.C. The mythos of model interpretability. arXiv. 2016 preprint, arXiv:1606.03490. [Google Scholar]
  • 69••.Gonçalves N.R., Welchman A.E. “What not” detectors help the brain see in depth. Curr Biol. 2017;27:1403–1412. doi: 10.1016/j.cub.2017.03.074. [DOI] [PMC free article] [PubMed] [Google Scholar]; This is a beautiful example of how neural networks can be used to discover general principles of biological information processing. The authors trained a neural network to solve binocular stereopsis. Then, by investigating the statistics of connections within the network, they were able to describe the central principle underlying its success in a single equation, which makes novel predictions about stereopsis at the neural and perceptual levels.
  • 70.Geirhos R., Temme C.R., Rauber J., Schütt H.H., Bethge M., Wichmann F.A. Advances in Neural Information Processing Systems. 2018. Generalisation in humans and deep neural networks; pp. 7538–7550. [Google Scholar]
  • 71.Rahwan I., Cebrian M., Obradovich N., Bongard J., Bonnefon J.F., Breazeal C., Jennings N.R. Machine behaviour. Nature. 2019;568:477. doi: 10.1038/s41586-019-1138-y. [DOI] [PubMed] [Google Scholar]
  • 72.Chen X., Duan Y., Houthooft R., Schulman J., Sutskever I., Abbeel P. Advances in Neural Information Processing Systems. 2016. Infogan: interpretable representation learning by information maximizing generative adversarial nets; pp. 2172–2180. [Google Scholar]
  • 73.Bau D., Zhou B., Khosla A., Oliva A., Torralba A. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. Network dissection: quantifying interpretability of deep visual representations; pp. 6541–6549. [Google Scholar]
  • 74.Montavon G., Samek W., Müller K.R. Methods for interpreting and understanding deep neural networks. Digit Signal Process. 2018;73:1–15. [Google Scholar]
  • 75.Karras T., Laine S., Aila T. A style-based generator architecture for generative adversarial networks. arXiv. 2019 doi: 10.1109/TPAMI.2020.2970919. preprint, arXiv:1812.04948v3. [DOI] [PubMed] [Google Scholar]
  • 76.Van der Maaten L., Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–2605. [Google Scholar]
  • 77.Von Helmholtz H. Handbuch der physiologischen Optik. In: Karsten G., editor. Vol. 9. Leipzig; Voss: 1867. (Allgemeine Encyklopädie der Physik). [Google Scholar]
  • 78.Marlow P.J., Kim J., Anderson B.L. The perception and misperception of specular surface reflectance. Curr Biol. 2012;22:1909–1913. doi: 10.1016/j.cub.2012.08.009. [DOI] [PubMed] [Google Scholar]
  • 79•.Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Berg A.C. Imagenet large scale visual recognition challenge. Int J Comput Vis. 2015;115:211–252. [Google Scholar]; This article presents the renowned ‘ImageNet’ object recognition benchmark competition, which has spurred global progress in computer vision, and against which visual deep learning progress has primarily been measured this decade.
  • 80.Tamura H., Prokott K.E., Fleming R.W. Distinguishing mirror from glass: a ‘big data’ approach to material perception. arXiv. 2019 preprint, arXiv:1903.01671. doi: 10.1167/jov.22.4.4. [DOI] [PMC free article] [PubMed] [Google Scholar]
