Nat Commun. 2021 Nov 9;12:6456. doi: 10.1038/s41467-021-26751-5

Fig. 8. Schematic of model architectures.

Blue, trainable neural network units free to represent anything. Pink, latent representation units used for comparison with neurons in response to the 2100 face images. Grey, units representing class probabilities. CNN, convolutional neural network; FC, fully connected neural network; N, number of latent units.

(a) Self-supervised models: β-VAE [17], autoencoder (AE) [35] and variational autoencoder (VAE) [36, 58]. Models were trained on mirror-flipped versions of the 2100 faces presented to the primates. Face image reproduced with permission from Gao et al. [57].

(b) Classifier baseline. The encoder network is the same as in (a). The model was trained to differentiate between the 2100 unique face identities using mirror-flipped versions of the 2100 faces augmented with 5 × 5 pixel translations. Face image reproduced with permission from Gao et al. [57].

(c) VGG baseline [32]. The encoder network has larger and deeper CNN and FC modules than in (a) and (b). Representation dimensionality is reduced to match the other models either by projection onto the first N principal components (PCs) (VGG (PCA)), or by taking a random subset of N units without replacement (VGG (raw)). VGG was trained to differentiate between 2622 unique faces using a face dataset [32] unrelated to the 2100 faces presented to the primates. The face image is representative of the images used to train the model and is reproduced with permission from Liu et al. [52].

(d) Active appearance model (AAM) [3]. Keypoints were manually placed on the 2100 face images. The first N/2 PCs over the keypoint locations formed the "shape" latent units; the first N/2 PCs over the shape-normalised images formed the "appearance" latent units. Figure adapted with permission from Chang et al. [3].
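
Panel (a) describes a shared CNN + FC encoder that feeds N latent units. Below is a minimal PyTorch sketch of such an encoder; the layer sizes, the default N = 50, and the class name `VAEEncoder` are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the panel (a) encoder; layer sizes and n_latents are
# placeholder assumptions, not the paper's exact dimensions.
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):  # hypothetical name
    def __init__(self, n_latents=50):
        super().__init__()
        # "Blue" trainable units: CNN module followed by an FC module
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU())
        # "Pink" units: the N-dimensional latent representation
        self.mu = nn.Linear(256, n_latents)
        self.logvar = nn.Linear(256, n_latents)

    def forward(self, x):
        h = self.fc(self.cnn(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # VAE/beta-VAE sample via the reparameterisation trick; a plain AE
        # would instead emit a deterministic N-unit code.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```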
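Panel (b) augments the training faces with 5 × 5 pixel translations. A plausible reading, assumed here, is all 25 integer shifts (dx, dy) with dx, dy ∈ {−2, …, 2}; the sketch below implements that interpretation with zero-fill at the edges.

```python
# Sketch of 5 x 5 pixel-translation augmentation for panel (b), assuming
# "5 x 5" means the 25 integer shifts with dx, dy in {-2, ..., 2}.
import numpy as np

def translate(img: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """Shift an (H, W, C) image by (dx, dy) pixels, zero-filling the edges."""
    out = np.zeros_like(img)
    H, W = img.shape[:2]
    out[max(dy, 0):H + min(dy, 0), max(dx, 0):W + min(dx, 0)] = \
        img[max(-dy, 0):H + min(-dy, 0), max(-dx, 0):W + min(-dx, 0)]
    return out

face = np.zeros((128, 128, 3), dtype=np.float32)  # placeholder image
augmented = [translate(face, dx, dy)
             for dx in range(-2, 3) for dy in range(-2, 3)]  # 25 copies
```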
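Panel (c) reduces the VGG representation to N dimensions in two ways. A minimal NumPy/scikit-learn sketch, where `features` stands in for the VGG activations on the 2100 faces and the array shapes and N = 50 are placeholders:

```python
# Sketch of the two dimensionality reductions in panel (c); `features`
# is a placeholder for VGG activations on the 2100 faces.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.standard_normal((2100, 4096))  # placeholder activations
N = 50  # illustrative latent dimensionality

# VGG (PCA): projection onto the first N principal components
vgg_pca = PCA(n_components=N).fit_transform(features)       # (2100, N)

# VGG (raw): random subset of N units, drawn without replacement
cols = rng.choice(features.shape[1], size=N, replace=False)
vgg_raw = features[:, cols]                                 # (2100, N)
```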
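Panel (d) builds the AAM latents from two separate PCA decompositions and concatenates them. A sketch under placeholder inputs; the keypoint count (68) and image size (64 × 64) are assumptions, as the caption does not state them.

```python
# Sketch of the AAM latents in panel (d); keypoint count and image size
# are placeholder assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
N = 50  # illustrative total latent dimensionality
keypoints = rng.standard_normal((2100, 68 * 2))         # manual keypoints, flattened
shape_norm_imgs = rng.standard_normal((2100, 64 * 64))  # shape-normalised images, flattened

# First N/2 PCs of keypoint locations -> "shape" latent units;
# first N/2 PCs of shape-normalised images -> "appearance" latent units.
shape_latents = PCA(n_components=N // 2).fit_transform(keypoints)
appearance_latents = PCA(n_components=N // 2).fit_transform(shape_norm_imgs)
aam_latents = np.hstack([shape_latents, appearance_latents])  # (2100, N)
```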