Significance
Humans can rapidly discriminate among many highly similar facial identities across identity-preserving image transformations (e.g., changes in facial expression), an ability that requires the system to rapidly transform image-based inputs into a more abstract, identity-based representation. We used magnetoencephalography to provide a temporally precise description of this transformation within human face-selective cortical regions. We observed a transition from an image-based representation toward an identity-based representation after ∼200 ms, a result suggesting that, rather than computing a single representation, a given face-selective region may represent multiple distinct types of information about face identity at different times. Our results advance our understanding of the microgenesis of fine-grained, high-level neural representations of object identity, a process critical to human visual expertise.
Keywords: face processing, magnetoencephalography, decoding, representational similarity analysis, face identity
Abstract
Humans’ remarkable ability to quickly and accurately discriminate among thousands of highly similar complex objects demands rapid and precise neural computations. To elucidate the process by which this is achieved, we used magnetoencephalography to measure spatiotemporal patterns of neural activity with high temporal resolution during visual discrimination among a large and carefully controlled set of faces. We also compared these neural data to lower level “image-based” and higher level “identity-based” model-based representations of our stimuli and to behavioral similarity judgments of our stimuli. Between ∼50 and 400 ms after stimulus onset, face-selective sources in right lateral occipital cortex and right fusiform gyrus and sources in a control region (left V1) yielded successful classification of facial identity. In all regions, early responses were more similar to the image-based representation than to the identity-based representation. In the face-selective regions only, responses were more similar to the identity-based representation at several time points after 200 ms. Behavioral responses were more similar to the identity-based representation than to the image-based representation, and their structure was predicted by responses in the face-selective regions. These results provide a temporally precise description of the transformation from low- to high-level representations of facial identity in human face-selective cortex and demonstrate that face-selective cortical regions represent multiple distinct types of information about face identity at different times over the first 500 ms after stimulus onset. These results have important implications for understanding the rapid emergence of fine-grained, high-level representations of object identity, a computation essential to human visual expertise.
Humans can discriminate among thousands of highly similar and complex visual patterns, such as individual faces, in less than half a second (1, 2). Efficient within-category discrimination of facial identity is important for real-world decisions (e.g., classifying a person as a friend or stranger) and social interactions. Progress has been made in elucidating the neural mechanisms underlying the discrimination of individual face identities in humans: fMRI studies have demonstrated that individual face identities are represented by spatially distributed patterns of neural activity within occipitotemporal cortex (3–13). Because of the poor temporal resolution of fMRI (typically around 2 s), however, our understanding of the neural basis of discrimination among complex visual patterns in humans remains limited. For example, within a given region, different information relevant to discrimination may be represented at different times over the first few hundred milliseconds after stimulus onset. However, current models of the neural basis of face recognition in humans do not typically allow for this possibility, because they usually assign a single functional role to each face-selective region (14).
A few previous studies (15–18) have explored the temporal properties of the neural representation of individual face identities in humans. However, the measurements of representations in these studies were based either on variations in the amplitude of neural activity at just one or two time points, typically sampling from different sensors for different time points (16–18), or on the temporal dynamics of signal from a small number of intracranial electrodes in fusiform gyrus (15). Furthermore, those studies that have investigated the nature of the facial identity information encoded in the neural data have used analyses that are limited to relatively low-level visual information (e.g., manual pixel-based measurements of eye or cheek color) (15, 17, 18) and/or analyses that sample from different brain regions for different time points (18). Hence, these studies provide limited information about the temporal dynamics of neural representations of facial identity. To allow fast and accurate discrimination of face identity in the real world, the human visual system must rapidly (within the first few hundred milliseconds) transform image-based inputs into a more abstract, less image-based representation with greater tolerance to identity-preserving image transformations (19, 20). The computations underlying the temporal emergence of these high-level representations of facial identity remain largely unexplored. Hence, the neural basis of human face recognition cannot be fully understood without further examination of the temporal dimension of the neural representation of face identity.
In the current study, we investigated three important and unanswered questions about the neural basis of within-category discrimination among a large and carefully controlled set of facial identities in humans. (i) When do spatiotemporal patterns of activity within face-selective cortex carry information sufficient for discrimination of facial identity across changes in facial expression? (ii) What types of information about facial identity (e.g., low-level image-based information, higher level identity-based information, or information encoded in human behavioral similarity judgments) are represented by these spatiotemporal patterns of activity, and when? (iii) Where in the brain are these different types of information represented (e.g., in face-selective or control regions)? To investigate these questions, we developed a paradigm that permits the characterization of the representation of a large set of individual face identities from spatiotemporal patterns of neural activity, with extremely high temporal resolution. We used a small-sample design inspired by single-cell recording studies in nonhuman primates (21–23) and psychophysics (24, 25). We recorded whole-head brain activity with magnetoencephalography (MEG) in four adult human participants while they viewed face images from a large, carefully controlled set (91 face identities, with two facial expressions per identity; Fig. 1), with a sufficiently large number of trials for each face identity (104–112 trials per face identity, 9,464–10,192 trials per participant) to be able to evaluate the representation of individual face identities in each participant (26). We used MEG because it has excellent temporal resolution and sufficient spatial resolution for decoding of fine visual information from spatial patterns of neural activity (26, 27). In each participant, we used an independent functional localizer task in MEG to identify face-selective regions in right lateral occipital cortex and right fusiform gyrus. We also used an anatomical atlas to localize left V1. We selected V1 to serve as a control area because it is known to encode relatively low-level visual information, and we used left V1 instead of right V1 because we expected that left V1 would be less likely to be influenced by interactions with the aforementioned right-hemisphere face-selective regions. Note, however, that the structure of representations in right and left V1 seems to be qualitatively similar (see Results, Left Hemisphere).
To evaluate the extent to which spatiotemporal patterns of activity in each of the aforementioned regions discriminated among the 91 face identities, we used a pairwise k-nearest-neighbor classifier to classify all possible pairs of face identities across changes in facial expression. We then evaluated what information was encoded in the neural data by comparing the pairwise dissimilarity structure of the neural data within each region of interest to each of two representations: (i) an “image-based” representation computed from a neural model simulating the response of simple cells in V1 (29) and (ii) an “identity-based” representation, in which all face pairs have dissimilarity of 0 if they are the same identity, and 1 otherwise (Fig. 2). For each postbaseline time point (where each time point is the starting point of a 60-ms sliding window used for all analyses) and each region of interest, we then examined which of these two representations was more similar to the neural data. To examine the extent to which the spatiotemporal patterns measured in the neural data could account for behavior, we also compared the pairwise structure of the neural data to pairwise behavioral judgments of a subset of the stimuli presented during the MEG experiment.
Results
Behavior During MEG Face Identity Task.
In each of the 26–28 blocks of the task, participants viewed each of 91 face identities four times (twice per expression) while brain activity was recorded with MEG. Participants were instructed to maintain fixation and to press a button whenever they saw the same face identity repeated, regardless of facial expression. Across all participants and blocks, mean d-prime was 2.21 (SD = 0.52).
MEG.
Functional and anatomical regions of interest.
To localize source points in the MEG source space that responded selectively to faces, we used a one-back localizer task with a block design and with stimuli from five different categories: faces, houses, objects, scrambled objects, and words. Activations from this task (faces > objects) were used to identify face-selective source points within right lateral occipital cortex (rLO-faces) and right fusiform gyrus (rFG-faces), two regions commonly implicated in neuroimaging studies of face perception (see Materials and Methods for details). These two face-selective regions are shown in Fig. 3. In each participant, we also used an anatomical atlas in Freesurfer (30) to identify source points within left V1. We restricted all further analyses to these regions, and to corresponding regions in the opposite hemisphere.
Classification of facial identity.
To examine the extent to which each region encoded information about facial identity, we used a binary k-nearest-neighbor classifier (k = 1) to classify each possible pair of facial identities based on the spatiotemporal pattern of activity within each region, and within a 60-ms sliding temporal window. All classification was performed across a change in facial expression (Materials and Methods).
Fig. 4 shows classification accuracy for each region of interest. In all three regions, classification accuracy exceeded chance after ∼50 ms, reaching a peak between 100 and 200 ms (with a clear secondary peak in rLO-faces and rFG-faces at ∼250 ms) and decreasing back to chance by ∼400 ms. Note that here and elsewhere in this paper time is expressed as the beginning of the 60-ms sliding temporal window used for classification and other analyses of the neural data, as in a previous study using similar methods (15). Hence, the results presented for a given time point can reflect measurements from that time point and up to 60 ms afterward. Between 100 and 200 ms, accuracy was higher in lV1 than in the other two regions. However, there were several periods after 200 ms during which accuracy was higher in the two face-selective regions than in lV1. Together, these results indicate that in all three regions there was sufficient information for cross-expression classification of facial identity between ∼50 ms and 400 ms after stimulus onset.
Comparison with model-based representations.
To investigate what information was encoded in each region, and when, we compared the representational structure of the neural data within each region to two representations with known properties: an image-based representation based on relatively low-level visual properties of the stimuli and a higher level identity-based representation, which solely encodes whether or not two face images differ in identity and is not sensitive to any other property of the images (Fig. 2).
Correlations between the neural data and the image- and identity-based representations are shown in Fig. 5. In lV1, the neural data were significantly more similar to the image-based representation than to the identity-based representation at nearly all time points after stimulus onset. In rLO-faces, the neural data were more similar to the image-based representation between 100 and 200 ms after stimulus onset but were more similar to the identity-based representation at several time points between 200 and 300 ms after stimulus onset. The transition observed after 200 ms seems to reflect a drop in the correlation with the image-based representation, with no corresponding drop in the correlation with the identity-based representation. A similar pattern was observed in rFG-faces, with the exception that the transition to the identity-based representation was less pronounced and did not occur until after 300 ms. Together, these results suggest that spatiotemporal patterns of activity in both early visual cortical regions such as lV1 and face-selective occipital and temporal regions primarily represent image-based properties of face identity between 100 and 200 ms after stimulus onset, and that only the face-selective regions transition toward a higher level, more identity-specific representation after 200 ms (after 300 ms in rFG-faces). We observed qualitatively similar patterns of correlations (particularly in rLO-faces and lV1) in a comparison of the neural data to layers of a deep neural network trained on our stimuli (Supporting Information and Fig. S1).
Behavioral Similarity Ratings.
Behavioral dissimilarity ratings (Fig. 6) were strongly and positively correlated with both the identity- (r = 0.89) and image-based (r = 0.79) representations but were significantly more strongly correlated with the former than with the latter, P < 0.0001 (31). Correlations between the behavioral and neural data were statistically significant at most postbaseline time points (Fig. 6). Correlations were not significantly stronger for the face-selective regions than for the control region at any postbaseline time points. However, a multiple regression analysis indicated that responses in both face-selective regions predicted behavioral responses after controlling for responses in the control region (lV1) between ∼100 and 250 ms after stimulus onset, and also between 350 and 400 ms in rLO-faces (see Materials and Methods for details). Overall, this pattern indicates that behavioral responses primarily reflect an identity-based representation but may also reflect image-based properties to a lesser extent, and that these behavioral responses can be predicted by responses in face-selective regions during earlier (i.e., between 100 and 200 ms) and later (after 200 ms) time periods in which these regions seem to represent lower and higher level information about facial identity, respectively.
Left Hemisphere.
We extended the analyses described above to the left hemisphere (see Supporting Information and Fig. S2 for details).
Discussion
In the current study, we investigated when spatiotemporal patterns of activity within face-selective cortex carry information sufficient for discrimination of facial identity across changes in facial expression, what type of information about facial identity (e.g., low-level image-based information, higher level identity-based information, and information encoded in pairwise behavioral judgments of the stimuli) is represented in these patterns, and where in the brain (e.g., in face-selective or control regions) these different types of information are represented at different points in time. In two face-selective regions (rLO-faces and rFG-faces) and one control region (lV1) we first measured pairwise classification among all possible pairs of 91 face identities, across changes in facial expression so as to tap into a more abstract representation, invariant over the geometry of the input. We then compared the pairwise similarity structure of the data in each of these regions to an image-based representation based on relatively low-level visual information and to a higher level identity-based representation. We also compared the neural data to behavioral similarity judgments of the stimuli. Between ∼50 and 400 ms, we were able to decode face identity successfully in each of the three regions, with accuracy first peaking at values above 70% between 100 and 200 ms, and with a secondary peak in the face-selective regions between 200 and 300 ms. In all regions, neural responses were more similar to the image-based representation than to the identity-based representation until ∼200 ms. In the face-selective regions only, the correlation with the image-based representation dropped at several time points after 200 ms, so that responses were more similar to the identity-based representation. Behavioral responses were more similar to the identity-based representation than to the image-based representation, and their structure was predicted by responses in the face-selective regions between 100 and 250 ms, after controlling for responses in the control region.
Our finding of successful face identity classification across changes in facial expression between ∼50 and 400 ms in all regions suggests that all regions carried information relevant to discrimination over a relatively large span of time after stimulus onset. This result provides evidence that even very early visual regions can encode information sufficient for discriminating among highly similar complex visual patterns, with some degree of tolerance to identity-preserving transformations. Note, however, that the face images in the current study were carefully aligned to remove obvious cues to identity caused by misalignment, and that this alignment may amplify the extent to which relatively low-level image-based differences can be used for identity discrimination, even across changes in facial expression. Hence, it is not surprising that we were able to decode facial identity across changes in facial expression, even in left V1. That decoding accuracy reached an initial peak between 100 and 200 ms is consistent with previous findings that the N170 component in EEG and the corresponding M170 component in MEG are correlated with characteristics of individual face identities (16–18). In addition, the finding that decoding performance remained above chance at several time points after 200 ms is consistent with findings from intracranial recordings from fusiform gyrus that facial identity could be decoded between 200 and 500 ms after stimulus onset (15). Also, the finding of secondary peaks in classification accuracy in the face-selective regions at around 250 ms seems to be compatible with data showing that subordinate-level processing of faces selectively enhances the N250 ERP component in EEG, which peaks ∼250 ms after stimulus onset (32, 33). In sum, our decoding analysis based on spatiotemporal patterns of activity captured many existing findings in the EEG and MEG literatures, while allowing further analyses of the temporal dynamics of the similarity structure of the neural representation of facial identity.
A particularly compelling aspect of our results is the transition from image-based to identity-based representations observed after 200–300 ms in the face-selective regions, but not in the control region (lV1), and the relation between the timing of this transition and that of classification accuracy. In rLO-faces, there were two obvious peaks in which decoding accuracy exceeded 70%, one between 100 and 200 ms and the second between 200 and 300 ms. At the first peak, the similarity structure of the neural data was more similar to an image-based representation, whereas at the second peak the correlation with the image-based representation dropped, so that the neural data were more similar to the identity-based representation. This pattern was not observed in the control region (lV1), because the data were more similar to the image-based representation than to the identity-based representation at all time points. Previous studies have demonstrated that signals in occipitotemporal regions between 100 and 200 ms are related to image-based properties of faces (16–18). However, the transition toward an identity-based representation within face-selective cortical regions after 200 ms has not been observed in previous studies of discrimination of facial identity, because previous studies either used fMRI (3–13), which lacks the temporal resolution required to resolve the temporal patterns observed in the current study, and/or because their analyses of the neural representations were based on different sensors for different time points (18), and/or were limited to image-based properties of the stimuli (15, 17, 18). The observed pattern provides evidence that spatiotemporal patterns of activity in at least some face-selective regions in human cortex encode qualitatively different information about face identity at different times over the first few hundred milliseconds after stimulus onset, with a transition from a lower level representation to a higher level representation occurring around 200–300 ms. This pattern suggests that models of the neural basis of face recognition that assign a single function to each face-selective cortical region (14) are likely to be incomplete, because they do not account for the possibility that a given face-selective region may play different functional roles at different times. Given that the identity-based representation used in the current study represents any exemplar of the same identity identically, whereas the image-based representation does not, this late transition could reflect the temporal emergence of a neural representation with high tolerance to identity-preserving transformations (19). Such a representation is highly relevant for real-world behavior, because many situations require the system to track a single identity and/or discriminate between identities across identity-preserving image transformations. Further support for the behavioral relevance of this representation comes from our finding that pairwise behavioral judgments of the stimuli were more strongly correlated with the identity-based representation than with the image-based representation, and that responses in the face-selective regions predicted behavior during the temporal periods in which these regions transitioned toward an identity-based representation.
Given that the observed transition occurs relatively late, and that it corresponds with a secondary later peak in the classification accuracy function, it seems somewhat unlikely that it arises as a consequence of a single initial feedforward sweep (34); instead, it more likely reflects recurrent/feedback processing (19, 35).
One remaining question is how the face-selective regions identified in the current study (rLO-faces and rFG-faces) are related to the corresponding face-selective regions typically identified in fMRI studies of face processing: occipital face area (OFA) and fusiform face area (FFA). Given that all source points within rLO-faces and rFG-faces in the current study are face-selective and are located within the same anatomical subregions as OFA and FFA, respectively, it seems possible that their representations would overlap with those of OFA and FFA. However, at least two differences are likely to limit the degree of overlap. First, MEG and fMRI measure different aspects of neural activity and have different spatial signal distributions, and are therefore likely to be sensitive to different spatial patterns of activity. For example, MEG is less sensitive to signals from deeper and more gyral sources than it is to more superficial and sulcal sources (36). Given that the rFG-faces region used in the current study is deeper and more gyral than rLO-faces, it seems possible that rLO-faces would capture signals from the corresponding fMRI-defined region to a greater extent than rFG-faces. This could account for the lower sensitivity to face identity observed in rFG-faces than in rLO-faces. Second, our MEG data have rich temporal structure, but fMRI data do not, and so the MEG data are unlikely to correspond to the fMRI data at all time points. Hence, although it is possible that the representations measured from rLO-faces and rFG-faces in the current study reflect activity from OFA and FFA, they are likely to reflect different aspects of the representation of facial identity than those standardly measured in OFA and FFA with fMRI.
Taken together, our results provide important information about when spatiotemporal patterns in face-selective cortical regions discriminate among a large and carefully controlled set of face identities across changes in facial expression, and about what type of information is represented in each region, and when. Specifically, our results indicate that spatiotemporal patterns of activity in both face-selective and control regions encode information about facial identity between ∼50 and 400 ms after stimulus onset. However, the face-selective regions, but not the control region, seem to encode qualitatively different information about facial identity at different times, with a transition from an image-based representation toward an identity-based representation after 200–300 ms. As described above, these results have implications for understanding the microgenesis of fine-grained, high-level neural representations of object identity, a process critical to human visual expertise (19), and perhaps for distinguishing between feedforward versus recurrent/feedback accounts of visual processing. Overall, the current investigation represents a critical advancement toward understanding the temporal dynamics of visual pattern recognition in the human brain.
Materials and Methods
Participants.
All participants were Caucasian (white European), right-handed, and had normal or corrected-to-normal visual acuity and no history of eye problems. Participants in the MEG experiment were four adults (one female), aged 23–27 y. Participants in the behavioral experiment were seven adult humans (five female), aged 18–28 y, none of whom participated in the MEG experiment. No participants were excluded after testing. Protocols were approved by institutional review boards at Carnegie Mellon University and the University of Pittsburgh. All participants provided written informed consent before each session and received monetary compensation for their participation.
MEG.
Localizer task.
Each participant completed four or five 3.5-h MEG sessions. During the final MEG session, each participant completed a block design category localizer adapted from an existing fMRI localizer used in previous work (12) (Supporting Information). The localizer data were used to identify source points that responded significantly more strongly to faces than to objects (FDR < 0.05) within two anatomical regions defined with an atlas in Freesurfer (30): right lateral occipital cortex and right fusiform gyrus (see Fig. 3 and see Supporting Information for details).
Face identity task.
In each block, participants viewed each of the 91 face identities four times (twice per expression). Participants were instructed to maintain fixation and to respond whenever they saw the same face identity repeated, regardless of facial expression (Supporting Information). During each of the MEG sessions, except for the last part of the final MEG session, participants performed this task while MEG signals were recorded. Each participant completed between 26 and 28 blocks of the task.
MEG data acquisition and processing.
MEG data were acquired at the University of Pittsburgh Medical Center Brain Mapping Center, with a 306-channel Neuromag (Elekta AB) system. Data were preprocessed using both spatial and temporal filtering approaches. Each participant’s MEG data were then projected onto a cortical surface reconstructed from their anatomical MRI scan (Supporting Information).
Classification of facial identity.
For each region and time point we used a binary k-nearest-neighbor classifier to classify all possible pairs of facial identities across a change in facial expression. We computed a statistical threshold for accuracy by shuffling the labels in the neural data. Accuracy is significantly greater than chance where observed values exceed this threshold (Supporting Information).
Comparison with model-based representations.
For each brain region and time point, we measured the correlation between the neural data and each of two model-based representations: an image-based representation based on low-level visual information and an identity-based representation that was sensitive to face identity but was not sensitive to image-based information (Fig. 2 and Supporting Information). We then compared the correlation between the two types of representations to examine which type was more similar to the neural data (31).
Behavioral Similarity Ratings.
In each of two 1-h sessions, participants viewed a subset of pairs of faces from the stimulus set used in the MEG experiment (Fig. 6) and rated the similarity of the face identities on an 8-point scale, with a value of 1 indicating very different face identities and a value of 8 indicating the same face identity. Within a pair, face images always differed in facial expression (Supporting Information). We used the methods described above to compare the behavioral data to the neural data and model-based representations. To examine whether neural data from the face-selective regions could predict the behavioral data after controlling for responses in lV1, we fit a multiple linear regression model in which the distance values for lV1 and a face-selective region were used to predict the behavioral data (37).
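This regression can be sketched as follows. The snippet below is a minimal illustration in Python, not the authors' code: variable names are hypothetical, and the behavioral dissimilarity for each rated face pair is predicted from the corresponding neural distance values in lV1 and in one face-selective region.

```python
import numpy as np

def regress_behavior(behav_dissim, dist_v1, dist_face_region):
    """Predict behavioral dissimilarities for the rated face pairs from neural
    distance values in lV1 and in one face-selective region.
    All inputs are 1D arrays with one entry per rated face pair."""
    # Design matrix: intercept, lV1 distances, face-region distances.
    X = np.column_stack([np.ones_like(dist_v1), dist_v1, dist_face_region])
    # Ordinary least squares fit; the coefficient on the face-region distances
    # indexes its contribution after controlling for lV1.
    betas, _, _, _ = np.linalg.lstsq(X, behav_dissim, rcond=None)
    predicted = X @ betas
    return betas, predicted

# Hypothetical usage with placeholder data for 377 rated face pairs.
rng = np.random.default_rng(0)
behav = rng.random(377)
betas, predicted = regress_behavior(behav, rng.random(377), rng.random(377))
```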
MEG.
Acquisition.
To allow correction of eye movement artifacts, we recorded the electrooculogram (EOG) from four electrodes. To allow correction of heartbeat artifacts, we recorded the ECG from a pair of electrodes. Four head position indicator (HPI) coils were used to monitor the participant’s head position within the MEG helmet. At the beginning of each session, a digitizing pen was used to record the shape of the participant’s head and the locations of the HPI coils on the head in 3D space. For all participants, head position was recorded from the HPI coils at the beginning of each block. For participants 1, 3, and 4, continuous HPI (cHPI) signals were recorded during each block, to allow movement compensation during preprocessing. cHPI signals were not recorded for participant 2 because enabling cHPI recording seemed to produce excessive artifacts. A Panasonic PT-D7700U projector (1,024 × 768 resolution, 60-Hz refresh rate) presented the stimuli at the center of a back-projection screen placed 120 cm from the participant. Face images were high and wide at this viewing distance. To track stimulus timing, we used a photodiode that emitted a continuous signal when the stimulus was on the screen. In addition, the experimental software sent a signal to the MEG acquisition computer whenever a stimulus was presented. Participants entered responses by pressing a button with their right index fingers.
Preprocessing.
We first applied signal space separation (SSS) in Maxfilter (Elekta AB), with head movement compensation enabled where applicable. The temporal extension of SSS (tSSS) was enabled for subject 4, for whom this extension was required to remove an artifact caused by an orthodontic appliance. tSSS was also enabled for the localizer data in subject 3, for whom this extension substantially improved signal quality. We then carried out temporal filtering and artifact rejection and correction in MNE-Python (38) and applied a band-pass filter with lower and upper cutoffs set to 1 and 100 Hz, respectively. To remove power line noise, we applied notch filters at 60, 120, and 180 Hz. Empty room data were used to create signal space projectors, which were applied to the filtered raw data to remove environmental artifacts (39). To correct eye movement and heartbeat artifacts, we used MNE-Python to fit an independent components analysis model to the MEG data and removed components whose time courses were correlated with the EOG and ECG recordings. Finally, trials with signals exceeding standard thresholds (gradiometer = 4,000e−13, magnetometer = 4e−12) in at least one channel were rejected (38). For the main face identity task, at least 101, 92, 101, and 97 trials per face identity were retained for participants 1–4, respectively.
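For illustration, a minimal MNE-Python sketch of this preprocessing sequence is given below. This is not the authors' code; the file names, event coding, and ICA settings are assumptions, and the SSS/tSSS step is assumed to have been run beforehand in Maxfilter.

```python
import mne

# Hypothetical file names; SSS/tSSS and movement compensation were applied earlier.
raw = mne.io.read_raw_fif("subject1_task_sss.fif", preload=True)

# Band-pass 1-100 Hz and notch filters at the power-line frequency and harmonics.
raw.filter(l_freq=1.0, h_freq=100.0)
raw.notch_filter(freqs=[60, 120, 180])

# Signal-space projectors computed from empty-room data to suppress environmental noise.
empty_room = mne.io.read_raw_fif("empty_room_sss.fif", preload=True)
raw.add_proj(mne.compute_proj_raw(empty_room, n_grad=2, n_mag=2))

# ICA-based removal of components correlated with the EOG and ECG channels.
ica = mne.preprocessing.ICA(n_components=0.95, random_state=0)
ica.fit(raw)
eog_idx, _ = ica.find_bads_eog(raw)
ecg_idx, _ = ica.find_bads_ecg(raw)
ica.exclude = eog_idx + ecg_idx
raw = ica.apply(raw)

# Epoch around stimulus onset and reject trials exceeding amplitude thresholds.
events = mne.find_events(raw)
epochs = mne.Epochs(raw, events, tmin=-0.2, tmax=0.5, baseline=(None, 0),
                    reject=dict(grad=4000e-13, mag=4e-12), preload=True)
```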
Source Localization.
For each participant, we acquired a T1-weighted MPRAGE anatomical MRI scan on a Siemens Verio 3T scanner (voxel size 1 mm3, flip angle , TE = 1.97 ms, TR = 2,300 ms, FOV = 256 × 256 × 176 mm). All scans were carried out at the Scientific Imaging and Brain Research Center at Carnegie Mellon University. Freesurfer reconstructions based on the anatomical scan were used in source modeling of MEG signals.
For each participant, we used dynamic statistical parametric mapping (dSPM) (40) to project single-trial MEG data onto the cortical surface reconstructed from each participant’s MRI data. This approach allowed us to align and combine each participant’s data across sessions and allowed us to restrict our analyses to signals estimated to originate in specific regions on the cortical surface. We used the MNE watershed tool (39) and Freesurfer output to generate boundary element models (BEMs). In mne-analyze, we used the digitizer data and the BEMs to align each participant’s MEG data to their MRI anatomical image. In MNE, we generated a source space on the reconstructed cortical surface, with source points spaced 5 mm apart. We then generated a forward solution, which maps the MEG sensor space to the source space (39). Using this forward solution, we performed dSPM source localization in MNE-Python [single-trial data, noise covariance matrix calculated from baseline period (−200 to 0 ms), signal-to-noise ratio 3, no depth prior] (38, 39). Note that the estimates of activity provided by our inverse model do not directly measure neural activity, but instead represent a probability density function for activity across the source space. Because the inverse model provides an estimate of activity at a given point in the source space, we use “activity” to refer to this estimate throughout the paper, for simplicity.
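A minimal MNE-Python sketch of this source-modeling pipeline is shown below for concreteness. It is not the authors' code: the paths, the -trans coregistration file, and the 'oct6' source spacing (used here as a stand-in for the 5-mm grid described above) are assumptions, and the `epochs` object is carried over from the preprocessing sketch. A signal-to-noise ratio of 3 corresponds to a regularization parameter of lambda2 = 1/3².

```python
import mne

subject, subjects_dir = "subject1", "/path/to/freesurfer/subjects"  # hypothetical paths

# Single-layer BEM and a surface source space from the Freesurfer reconstruction.
bem_model = mne.make_bem_model(subject=subject, subjects_dir=subjects_dir,
                               conductivity=(0.3,))
bem = mne.make_bem_solution(bem_model)
src = mne.setup_source_space(subject, spacing="oct6", subjects_dir=subjects_dir)

# Forward solution mapping cortical sources to MEG sensors.
fwd = mne.make_forward_solution(epochs.info, trans="subject1-trans.fif",
                                src=src, bem=bem, meg=True, eeg=False)

# Noise covariance from the prestimulus baseline, then the dSPM inverse.
noise_cov = mne.compute_covariance(epochs, tmax=0.0)
inv = mne.minimum_norm.make_inverse_operator(epochs.info, fwd, noise_cov,
                                             depth=None)  # no depth prior

# Apply the inverse to single trials; SNR = 3 gives lambda2 = 1/9.
stcs = mne.minimum_norm.apply_inverse_epochs(epochs, inv, lambda2=1.0 / 9.0,
                                             method="dSPM")
```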
Face Identity Task.
Stimuli.
Stimuli were color frontal photographs of 91 young Caucasian (white European) male face identities selected from six databases (28, 41–44), including PICS (see Fig. 1 for examples). The following male identities from the Karolinska Directed Emotional Faces database were included: 2, 3, 5, 6, 7, 10, 11, 12, 13, 17, 18, 21, 23, 24, 25, 26, 27, 28, 31, 32, and 35. The stimulus set included two expressions (neutral or happy) per identity. To eliminate global luminance and color cues, we converted each image to CIELAB color space and set the mean of the L∗ (luminance), a∗ (red–green), and b∗ (yellow–blue) channels to the mean values across all identities. We also set the rms contrast of the L∗ channel to the mean across all identities. To minimize differences in alignment, face images were transformed without altering aspect ratio, so that the eyes were in the same positions in each image. To eliminate hair cues, we applied an oval mask of constant size to each face image. To minimize other obvious cues, we excluded faces with facial hair, uneven lighting, and/or aspect ratios (defined here as the ratio of the horizontal interocular distance to the vertical eye–mouth distance) greater than two standard deviations from the group mean.
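The luminance and color normalization can be illustrated with the sketch below. This is our illustration rather than the authors' code: it uses scikit-image for the CIELAB conversion, treats the standard deviation of L∗ as the rms contrast, and takes the function name and array layout as assumptions.

```python
import numpy as np
from skimage import color

def normalize_faces(rgb_images):
    """Equate mean L*, a*, b* and L* rms contrast across a stack of face images.
    rgb_images: float array of shape (n_images, height, width, 3) in [0, 1]."""
    lab = np.stack([color.rgb2lab(img) for img in rgb_images])
    target_mean = lab.mean(axis=(0, 1, 2))                 # grand mean per channel
    target_rms = np.mean([im.std() for im in lab[..., 0]])  # mean L* rms contrast
    out = np.empty_like(lab)
    for i, img in enumerate(lab):
        img = img - img.mean(axis=(0, 1)) + target_mean    # shift channel means
        L = img[..., 0]
        img[..., 0] = (L - L.mean()) / L.std() * target_rms + L.mean()  # set L* contrast
        out[i] = img
    return np.stack([color.lab2rgb(img) for img in out])
```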
Design.
Before the MEG experiment, each participant first completed a single 1-h session of behavioral training on the task that they would complete during subsequent MEG sessions. In each block of training, participants viewed each of the 91 face identities four times (twice per expression), for a total of 364 trials. Participants were instructed to maintain fixation and to press a key on a computer keyboard whenever they saw the same face identity repeated, regardless of facial expression. During each of the MEG sessions, except for the last part of the final MEG session, participants did the same behavioral task as in the training session while MEG signals were recorded.
The order of trials was randomized for each block, with the exception that the same face identity was presented twice in a row on 36 trials per block. The two face images had the same expression on half of these repeat trials and had different expressions on the other half. The positions of the repeat trials within the block and the face identities to be presented on these repeat trials were randomly selected for each block. Each trial began with a white fixation dot presented for 500 ms, followed by a face image presented for 500 ms, then a blank response screen for 1,500 ms.
Localizer Task.
Each run of the MEG localizer consisted of 15 blocks, with a fixation baseline (8 s) between blocks. Within each run, there were three blocks for each of five categories of images (faces, objects, scrambled objects, houses, and words) presented in a random order. Within each block, 16 images from a single category were presented in a row (900 ms per image, 100-ms interstimulus interval), in a random order. Each participant completed six runs, for a total of 288 trials per category. Participants were instructed to press a button on the response glove with their right index fingers to indicate the presence of one repeated image within each block.
To identify face-selective source points in each participant, we used the statcond function in EEGLAB (45) to carry out a nonparametric, one-way permutation test on single-trial data for each participant, source point, and postbaseline time point (10,000 permutations per test). Each test yielded a P value indicating the extent to which activity differed between faces and objects. For each participant, P values were FDR-corrected across all source points and time points. A source point was considered to be face-selective if it responded significantly more strongly to faces than to objects (FDR < 0.05) at one or more time points and did not respond significantly more strongly to objects than to faces at any time point.
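The original analysis used the statcond function in EEGLAB (MATLAB). The sketch below illustrates the same logic in Python for a single source point and time point; it is our illustration, with the test statistic assumed to be the difference in mean activity between conditions.

```python
import numpy as np
from mne.stats import fdr_correction

def face_selectivity_pval(face_trials, object_trials, n_perm=10000, seed=0):
    """Permutation test of faces vs. objects at one source point and time point.
    face_trials / object_trials: 1D arrays of single-trial activity estimates."""
    rng = np.random.default_rng(seed)
    observed = face_trials.mean() - object_trials.mean()
    pooled = np.concatenate([face_trials, object_trials])
    n_face = len(face_trials)
    null = np.empty(n_perm)
    for i in range(n_perm):
        rng.shuffle(pooled)                                  # shuffle condition labels
        null[i] = pooled[:n_face].mean() - pooled[n_face:].mean()
    # Two-sided p value; the sign of the observed difference is checked separately
    # when deciding whether a source point counts as face-selective.
    return (np.abs(null) >= np.abs(observed)).mean()

# p values across all source points and time points would then be corrected with
# fdr_correction(p_values, alpha=0.05).
```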
Classification of Face Identity.
For each brain region, time point, participant, and face identity, we extracted the spatiotemporal pattern of neural activity across all source points within the region and across all time points between the current time point and a time point 60 ms after the current time point. We then computed the pairwise Euclidean distance for each possible pair of identities. To ensure that, to the extent possible, this analysis captured information invariant to changes in facial expression (i.e., was specific to identity), all Euclidean distance values were computed across a change in facial expression. To increase power, we averaged Euclidean distance values across participants.
For a given time point, region of interest, and pair of face identities, inputs to the k-nearest-neighbor classifier were the cross-expression within-identity Euclidean distance value for each of the two identities, and the cross-expression, cross-identity distance value. When spatiotemporal patterns of activity contain sufficient information to distinguish between face identities across a change in facial expression, the cross-identity distance will be larger than the within-identity distance. Hence, for each face in the pair, the classifier compared the within-identity distance to the cross-identity distance and decided that the smaller of the two (i.e., the nearest neighbor) was the within-identity distance. Overall classification accuracy was computed as the proportion of correct decisions across all face pairs. Because each decision is binary, accuracy fluctuates around 0.5 (50% correct) when there is no signal (i.e., in the baseline period; Fig. 4).
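A minimal sketch of this pairwise nearest-neighbor (k = 1) decision rule is given below. It is our illustration, under the assumption that a single cross-expression, cross-identity distance is used per pair; the variable names are ours.

```python
import numpy as np

def pairwise_knn_accuracy(within, between):
    """within: length-91 vector of cross-expression distances between the two
    exemplars of the same identity (e.g., d(neutral_i, happy_i)).
    between: (91, 91) matrix of cross-expression distances between exemplars of
    different identities (e.g., d(neutral_i, happy_j)); the diagonal is ignored.
    Returns the proportion of correct nearest-neighbor decisions across all
    pairs of identities (two decisions per pair)."""
    n = len(within)
    correct, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            cross = between[i, j]
            # A decision is correct when the within-identity distance is the
            # nearest neighbor (i.e., smaller than the cross-identity distance).
            correct += int(within[i] < cross) + int(within[j] < cross)
            total += 2
    return correct / total
```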
To evaluate whether classification accuracy was significantly greater than chance, we computed a statistical threshold for accuracy [Bonferroni-corrected alpha = 0.001 (46)] from a null distribution with 10,000 values. Each value in the null distribution was generated by shuffling the labels in the neural data and recomputing classification performance. To allow for the possibility that different regions might have different statistical properties, and therefore different null distributions, we computed statistical thresholds separately for each region. The threshold represents the accuracy level associated with a probability of 0.001 under the null hypothesis that the neural data do not reliably encode information about facial identity.
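Building on the sketch above, the permutation threshold can be approximated at the distance-matrix level by breaking the identity correspondence, as below. This is a simplification: in the actual analysis the labels were shuffled in the neural data itself before distances were recomputed.

```python
import numpy as np

def accuracy_threshold(within, between, n_perm=10000, alpha=0.001, seed=0):
    """Null distribution for classification accuracy obtained by independently
    permuting the rows and columns of the distance matrix (so that 'within'
    cells no longer correspond to true within-identity distances)."""
    rng = np.random.default_rng(seed)
    n = len(within)
    full = between.copy()
    np.fill_diagonal(full, within)              # place within-identity distances on the diagonal
    null = np.empty(n_perm)
    for p in range(n_perm):
        shuffled = full[np.ix_(rng.permutation(n), rng.permutation(n))]
        null[p] = pairwise_knn_accuracy(np.diag(shuffled), shuffled)
    # Accuracy exceeding this quantile is significant at the chosen alpha.
    return np.quantile(null, 1 - alpha)
```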
Comparison with Model-Based Representations.
To generate an image-based representation of the stimuli, we first converted each face image to CIELAB color space and extracted the luminance component of the image. We then extracted activations for each image from the first layer (S1) of HMAX, which simulates responses of a population of simple cells in V1 (29). We used the vector of activations for each image to compute a cross-expression Euclidean distance matrix of the same form used for the classification analysis described above. The identity-based representation was a 91 × 91 matrix in which each row and column represents a facial identity, and each cell represents a comparison between two facial identities (Fig. 2). All cells representing comparisons within identities (e.g., row 1, column 1) received a distance value of 0, which indicates identical representations. The remaining cells, which represent comparisons between identities, received a distance value of 1. This value indicates that representations of exemplars of different identities differ to the same extent.
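For illustration, the two model representations can be constructed as follows. This is a sketch: the extraction of HMAX S1 activations is not shown, and `features_neutral` and `features_happy` are assumed to hold one model feature vector per identity.

```python
import numpy as np

# Identity-based representation: a 91 x 91 dissimilarity matrix that is 0 for
# same-identity comparisons and 1 for different-identity comparisons.
n_identities = 91
identity_rdm = 1.0 - np.eye(n_identities)

def cross_expression_rdm(features_neutral, features_happy):
    """Cross-expression Euclidean distance matrix computed from model feature
    vectors (e.g., HMAX S1 activations), arrays of shape (91, n_features)."""
    diff = features_neutral[:, None, :] - features_happy[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```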
For each brain region and postbaseline time point (from 0 ms onward) we computed the correlation between the cross-expression distance matrix for the neural data and that for each of the representations described above. We excluded the baseline period (−200 to 0 ms) from this analysis because, during this period, classification accuracy did not exceed chance, and so there was no evidence that information about face identity was encoded. We then tested whether the correlation was stronger for the identity-based representation or the image-based representation (31). The resulting P values for each region were FDR-corrected (47) across all time points.
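Because the two correlations share the neural data, their difference can be tested with a test for dependent correlations such as the one described by Steiger (31). The sketch below implements one common formulation (Williams' test); it is our illustration, we have not verified that it is the exact variant used, and n is assumed to be the number of face pairs entering the correlations.

```python
import numpy as np
from scipy import stats

def williams_test(r_jk, r_jh, r_kh, n):
    """Compare two dependent correlations r_jk (neural vs. identity-based RDM)
    and r_jh (neural vs. image-based RDM) that share variable j (the neural RDM);
    r_kh is the correlation between the two model RDMs, n the number of observations."""
    det = 1 - r_jk**2 - r_jh**2 - r_kh**2 + 2 * r_jk * r_jh * r_kh
    rbar = (r_jk + r_jh) / 2
    t = (r_jk - r_jh) * np.sqrt(((n - 1) * (1 + r_kh)) /
                                (2 * ((n - 1) / (n - 3)) * det
                                 + rbar**2 * (1 - r_kh)**3))
    p = 2 * stats.t.sf(abs(t), df=n - 3)      # two-sided p value
    return t, p

# The p values across all postbaseline time points for a region would then be
# corrected with mne.stats.fdr_correction(p_values, alpha=0.05).
```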
Behavioral Similarity Ratings.
In each of two 1-h sessions each participant provided similarity ratings for 377 face pairs. These face pairs included all 91 within-identity pairs and an additional 286 between-identity pairs. The latter were randomly selected from among all possible between-identity face pairs. To allow aggregation of data across participants and sessions, we used the same procedure and face pairs for each participant and session. In each session, participants rated each face pair twice. For each face pair, the two face images were always presented with different facial expressions, with all possible combinations of facial expressions presented within each session. On each trial, the two faces in a pair were presented sequentially, in a random order. Each face image was presented for 500 ms, with a 500-ms interval between images.
To allow comparisons between behavioral similarity ratings and neural and model-based representations, we converted similarity ratings to dissimilarity ratings by subtracting each similarity value from 9, so that a similarity rating of 1 was converted to a dissimilarity rating of 8, and so on.
Left Hemisphere.
In the left hemisphere, we carried out the primary analyses performed for face-selective regions in the right hemisphere. In three of four participants we were able to identify face-selective regions in left lateral occipital cortex (lLO-faces) and fusiform gyrus (lFG-faces). As with the right hemisphere analyses, we used V1 in the opposite hemisphere (in this case, right V1) as a control region. With data from these regions, we performed classification of face identity and compared the neural data to model-based representations and behavioral judgments of the stimuli. Results for analyses of face-selective regions in the left hemisphere are shown in Fig. S1. All analyses of neural data include only data from the three participants in which face-selective regions could be identified in the left hemisphere.
The positions of face-selective regions in the left hemisphere seemed to vary between participants to a greater extent than the corresponding regions in the right hemisphere. Classification of face identity was successful in both left hemisphere regions, but the pattern of correlations with the model-based representations seemed to be less temporally stable than in the right hemisphere. In the right hemisphere regions, all significant differences between the models favored the image-based representation before 200 ms and favored the identity-based representation after 200 ms. In contrast, differences in the left hemisphere fluctuated between the two representations after 200 ms, a result suggesting that the transition toward the identity-based representation after 200 ms may be less stable in the left hemisphere. Similar to lV1, significant differences between the two representations in rV1 consistently favored the image-based model, with the exception of a very early period during which classification performance in rV1 had only just begun to exceed chance. This pattern suggests that V1 primarily encoded relatively low-level image-based information in each hemisphere.
Comparison of Neural Data to Deep Neural Network.
To understand the nature of the information represented in each region of interest we compared the similarity structure of the neural representations to those learned by a deep neural network trained to recognize versions of the experimental face stimuli. Input to the network consisted of 8,918 grayscale face images of 48 × 65 pixels. Forty-nine versions of each of 182 original face images (91 individuals × 2 expressions) were created by jittering the position of the image up to 3 pixels in both the x and y directions. For each such input the network was trained to activate the one of 91 output units corresponding to the face’s identity.
The architecture of the network had four intermediate or “hidden” layers between the input and output, connected in a feedforward manner. The first hidden layer (H1) had 2,684 (44 × 61) units with 5 × 5 rectangular receptive fields (RFs) from the input with a stride of 1 (i.e., RFs were positioned every one input unit in both the x and y directions). The second hidden layer (H2) had 580 (20 × 29) units with 7 × 7 RFs from H1 with a stride of 2, and the third (H3) had 84 (7 × 12) units with 9 × 9 RFs from H2 with a stride of 2. The fourth and final hidden layer (H4) consisted of 20 units that received connections from all H3 units and sent connections to all 91 output units. In total, the network had 108,570 connections (including bias connections to all noninput units). Unlike a convolutional neural network (29, 48), positional invariance was not imposed within layers (by using weight-sharing among features and having separate “pooling” layers); rather, it was left to the network to learn to be sensitive or insensitive to positional information at each level of representation to the degree it supports effective recognition. Hidden units applied a sigmoid activation function to their net inputs; the output group consisted of normalized or “soft-max” units whose activities were constrained to sum to 1.0 (49).
The network was trained with back-propagation (50) using momentum descent with accumulated gradients clipped at 1.0, with a learning rate of 0.05 and momentum of 0.8. After 6,000 presentations of each image, performance had reached near-asymptote: The network produced virtually no error (mean cross-entropy error per image of 0.00013) and recognition performance was perfect.
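A structural sketch of this architecture and training loop in PyTorch is given below for concreteness. It is an approximation rather than the network that was actually trained: nn.Conv2d shares weights across positions, whereas the network described above used locally connected weights without sharing, so only the layer sizes, receptive fields, strides, activation functions, and training hyperparameters are reproduced; the padding values are our assumption to match the stated layer sizes.

```python
import torch
import torch.nn as nn

class FaceNet(nn.Module):
    """Approximation of the network above; the original used untied (locally
    connected) weights rather than shared convolutional kernels."""
    def __init__(self):
        super().__init__()
        self.h1 = nn.Conv2d(1, 1, kernel_size=5, stride=1)             # 48x65 -> 44x61
        self.h2 = nn.Conv2d(1, 1, kernel_size=7, stride=2, padding=1)  # 44x61 -> 20x29
        self.h3 = nn.Conv2d(1, 1, kernel_size=9, stride=2, padding=1)  # 20x29 -> 7x12
        self.h4 = nn.Linear(7 * 12, 20)
        self.out = nn.Linear(20, 91)                                   # one unit per identity
        self.act = nn.Sigmoid()

    def forward(self, x):                       # x: (batch, 1, 48, 65)
        x = self.act(self.h1(x))
        x = self.act(self.h2(x))
        x = self.act(self.h3(x))
        x = self.act(self.h4(x.flatten(1)))
        return self.out(x)                      # logits; softmax is applied by the loss

model = FaceNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.8)
loss_fn = nn.CrossEntropyLoss()                 # cross-entropy over soft-max outputs

def train_step(images, labels):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    torch.nn.utils.clip_grad_value_(model.parameters(), 1.0)  # clip gradients at 1.0
    optimizer.step()
    return loss.item()
```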
Representational similarity analysis was then carried out for each layer of the network by computing the degree of similarity (correlation) of the activation patterns produced by each pair of face images (without jitter). These similarities were then correlated with analogous similarities for the neural representations in various brain regions across time. We focused on H1 and H4, because the former largely reflects visual (image) similarity whereas the latter reflects primarily identity information. Indeed, for H4, the mean correlation is 0.097 for different identities and 0.939 for same identities. Although H1 and H4 seem to primarily represent image and identity information, respectively, the representations from these layers are likely to be more closely related to each other than the image-based and identity-based representations described in the main text, because layers H1 and H4 are both trained to perform the same task. Indeed, the correlation between H1 and H4 (Pearson r = 0.46) was slightly higher than that between the identity- and image-based representations in the main text (Pearson r = 0.40).
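The layer-wise comparison can be sketched as follows (our illustration): each layer's pairwise similarity structure is computed from its activation patterns and then correlated with the neural similarity structure over the selected image pairs. The function names, the use of 1 − r as a dissimilarity, and the `pair_mask` argument (e.g., selecting only cross-expression pairs) are our assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

def layer_rdm(activations):
    """Correlation-based dissimilarity (1 - Pearson r) between the activation
    patterns evoked by each pair of face images; activations: (n_images, n_units)."""
    return 1.0 - np.corrcoef(activations)

def rdm_correlation(layer_activations, neural_rdm, pair_mask):
    """Correlate a layer's RDM with a neural RDM over a set of image pairs
    (pair_mask: boolean matrix selecting, e.g., only cross-expression pairs)."""
    model_rdm = layer_rdm(layer_activations)
    r, _ = pearsonr(model_rdm[pair_mask], neural_rdm[pair_mask])
    return r
```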
Fig. S2 shows the correlations over time of each hidden layer with the two face-selective regions (rLO-faces and rFG-faces) and the control region (lV1), considering only image pairs that differ in expression. For rLO-faces the correlation with H4 was stronger than the correlation with H1 at a series of time points between 50 and 100 ms, and a later series between 200 and 300 ms. In contrast, lV1 responses were more strongly correlated with H1 than H4 between 350 ms and 450 ms, with no significant differences in the opposite direction. In rFG-faces, the correlation with H4 was stronger at only one early period (60–90 ms). Hence, as in the comparison with image-based and identity-based representations in the main text, responses in rLO-faces seem to be more similar to a higher level representation, whereas responses in lV1 seem to be more similar to a lower level, more image-based representation. rFG-faces seemed to show an intermediate response in the current analyses, as it did for most time points in the analyses presented in the main text.
Acknowledgments
This work was supported by Natural Sciences and Engineering Research Council PDF Award 471687-2015 (to M.D.V.), a Small Grant from the Temporal Dynamics of Learning Center (to M.D.V. and M.B.), Pennsylvania Department of Health’s Commonwealth Universal Research Enhancement Program Grant SAP-14282-012 (to D.C.P.), National Science Foundation Grant BCS0923763 (to M.B. and D.C.P.), and Temporal Dynamics of Learning Center Grant SMA-1041755 (principal investigator: G. Cottrell) (to M.B.).
Footnotes
The authors declare no conflict of interest.
Data deposition: Data reported in this paper are available on figshare at https://figshare.com/articles/FST_raw_data/4233107 (doi: 10.6084/m9.figshare.4233107.v1).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1614763114/-/DCSupplemental.
References
- 1.Barragan-Jason G, Besson G, Ceccaldi M, Barbeau EJ. Fast and Famous: Looking for the fastest speed at which a face can be recognized. Front Psychol. 2013;4(4):100. doi: 10.3389/fpsyg.2013.00100.
- 2.Ramon M, Caharel S, Rossion B. The speed of recognition of personally familiar faces. Perception. 2011;40(4):437–449. doi: 10.1068/p6794.
- 3.Anzellotti S, Fairhall SL, Caramazza A. Decoding representations of face identity that are tolerant to rotation. Cereb Cortex. 2014;24(8):1988–1995. doi: 10.1093/cercor/bht046.
- 4.Axelrod V, Yovel G. Successful decoding of famous faces in the fusiform face area. PLoS One. 2015;10(2):e0117126. doi: 10.1371/journal.pone.0117126.
- 5.Cowen AS, Chun MM, Kuhl BA. Neural portraits of perception: Reconstructing face images from evoked brain activity. Neuroimage. 2014;94:12–22. doi: 10.1016/j.neuroimage.2014.03.018.
- 6.Gao X, Wilson HR. The neural representation of face space dimensions. Neuropsychologia. 2013;51:1787–1793. doi: 10.1016/j.neuropsychologia.2013.07.001.
- 7.Goesaert E, Op de Beeck HP. Representations of facial identity information in the ventral visual stream investigated with multivoxel pattern analyses. J Neurosci. 2013;39(19):8549–8558. doi: 10.1523/JNEUROSCI.1829-12.2013.
- 8.Kriegeskorte N, Formisano E, Sorger B, Goebel R. Individual faces elicit distinct response patterns in human anterior temporal cortex. Proc Natl Acad Sci USA. 2007;104(51):20600–20605. doi: 10.1073/pnas.0705654104.
- 9.Natu VS, et al. Dissociable neural patterns of facial identity across changes in viewpoint. J Cogn Neurosci. 2010;22(7):1570–1582. doi: 10.1162/jocn.2009.21312.
- 10.Nestor A, Plaut DC, Behrmann M. Unraveling the distributed neural code of facial identity through spatiotemporal pattern analysis. Proc Natl Acad Sci USA. 2011;108(24):9998–10003. doi: 10.1073/pnas.1102433108.
- 11.Nestor A, Behrmann M, Plaut DC. The neural basis of visual word form processing: A multivariate investigation. Cereb Cortex. 2013;23(7):1673–1684. doi: 10.1093/cercor/bhs158.
- 12.Nestor A, Plaut DC, Behrmann M. Feature-based face representations and image reconstruction from behavioral and neural data. Proc Natl Acad Sci USA. 2016;113(2):416–421. doi: 10.1073/pnas.1514551112.
- 13.Verosky SC, Todorov A, Turk-Browne NB. Representations of individuals in ventral temporal cortex defined by faces and biographies. Neuropsychologia. 2013;51(11):2100–2108. doi: 10.1016/j.neuropsychologia.2013.07.006.
- 14.Freiwald W, Duchaine B, Yovel G. Face processing systems: From neurons to real-world social perception. Annu Rev Neurosci. 2016;39:325–346. doi: 10.1146/annurev-neuro-070815-013934.
- 15.Ghuman AS, et al. Dynamic encoding of face information in the fusiform gyrus. Nat Commun. 2014;5:5672. doi: 10.1038/ncomms6672.
- 16.Liu J, Harris A, Kanwisher N. Stages of processing in face perception: An MEG study. Nat Neurosci. 2002;5(9):910–916. doi: 10.1038/nn909.
- 17.Rousselet G, Hannah H, Ince R, Schyns P. The N170 is mostly sensitive to pixels in the contralateral eye area. J Vis. 2015;15:687.
- 18.Zheng X, Mondloch CJ, Nishimura M, Vida MD, Segalowitz SJ. Telling one face from another: Electrocortical correlates of facial characteristics among individual female faces. Neuropsychologia. 2011;49(12):3254–3264. doi: 10.1016/j.neuropsychologia.2011.07.030.
- 19.DiCarlo JJ, Cox DD. Untangling invariant object recognition. Trends Cogn Sci. 2007;11(8):333–341. doi: 10.1016/j.tics.2007.06.010.
- 20.Sugase-Miyamoto Y, Matsumoto N, Kawano K. Role of temporal processing stages by inferior temporal neurons in face recognition. Front Psychol. 2011;2:141. doi: 10.3389/fpsyg.2011.00141.
- 21.Freiwald WA, Tsao DY. Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science. 2010;330(6005):845–851. doi: 10.1126/science.1194908.
- 22.Sugase-Miyamoto Y, Yamane S, Ueno S, Kawano K. Global and fine information coded by single neurons in the temporal visual cortex. Nature. 1999;400(6747):869–873. doi: 10.1038/23703.
- 23.Tsao DY, Freiwald WA, Tootell RB, Livingstone MS. A cortical region consisting entirely of face-selective cells. Science. 2006;311(5761):670–674. doi: 10.1126/science.1119983.
- 24.Braun C, Schweizer R, Elbert T, Birbaumer N, Taub E. Differential activation in somatosensory cortex for different discrimination tasks. J Neurosci. 2000;20(1):446–450. doi: 10.1523/JNEUROSCI.20-01-00446.2000.
- 25.Freeman TC, Fowler TA. Unequal retinal and extra-retinal motion signals process different perceived slants of moving surfaces. Vision Res. 2000;40(14):1857–1868. doi: 10.1016/s0042-6989(00)00045-6.
- 26.Isik L, Meyers EM, Leibo JZ, Poggio T. The dynamics of invariant object recognition in the human visual system. J Neurophysiol. 2014;111(1):91–102. doi: 10.1152/jn.00394.2013.
- 27.Cichy RM, Ramirez FM, Pantazis D. Can visual information encoded in cortical columns be decoded from magnetoencephalography data in humans. Neuroimage. 2015;121:193–204. doi: 10.1016/j.neuroimage.2015.07.011.
- 28.Tottenham N, et al. The NimStim set of facial expressions: Judgments from untrained research participants. Psychiatry Res. 2009;168(3):242–249. doi: 10.1016/j.psychres.2008.05.006.
- 29.Riesenhuber M, Poggio T. Hierarchical models of object recognition in cortex. Nat Neurosci. 1999;2(11):1019–1025. doi: 10.1038/14819.
- 30.Destrieux C, Fischl B, Dale A, Halgren E. Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. Neuroimage. 2010;53:1–15. doi: 10.1016/j.neuroimage.2010.06.010.
- 31.Steiger JH. Tests for comparing elements of a correlation matrix. Psychol Bull. 1980;87(2):245–251.
- 32.Scott LS, Tanaka JW, Sheinberg DL, Curran T. A reevaluation of the electrophysiological correlates of expert object processing. J Cogn Neurosci. 2006;18(9):1453–1465. doi: 10.1162/jocn.2006.18.9.1453.
- 33.Tanaka JW, Curran T, Porterfield AL, Collins D. Activation of preexisting and acquired face representations: The N250 event-related potential as an index of face familiarity. J Cogn Neurosci. 2006;18(9):1488–1497. doi: 10.1162/jocn.2006.18.9.1488.
- 34.Serre T, Oliva A, Poggio T. A feedforward architecture accounts for rapid categorization. Proc Natl Acad Sci USA. 2007;104(15):6424–6429. doi: 10.1073/pnas.0700622104.
- 35.Wyatte D, Jilk DJ, O’Reilly RC. Early recurrent feedback facilitates visual object recognition under challenging conditions. Front Psychol. 2014;5:674. doi: 10.3389/fpsyg.2014.00674.
- 36.Hillebrand A, Barnes GR. A quantitative assessment of the sensitivity of whole-head MEG to activity in the adult human cortex. Neuroimage. 2002;16(3 Pt 1):638–650. doi: 10.1006/nimg.2002.1102.
- 37.Freckleton RP. On the misuse of residuals in ecology: Regression of residuals vs. multiple regression. J Anim Ecol. 2002;71(3):542–545.
- 38.Gramfort A, et al. MEG and EEG data analysis with MNE-Python. Front Neurosci. 2013;7:267. doi: 10.3389/fnins.2013.00267.
- 39.Gramfort A, et al. MNE software for processing MEG and EEG data. Neuroimage. 2014;86:446–460. doi: 10.1016/j.neuroimage.2013.10.027.
- 40.Dale AM, et al. Dynamic statistical parametric mapping: Combining fMRI and MEG for high-resolution imaging of cortical activity. Neuron. 2000;26(1):55–67. doi: 10.1016/s0896-6273(00)81138-1.
- 41.Gross R, Matthews I, Cohn J, Kanade T, Baker S. Multi-PIE. Proc Int Conf Autom Face Gesture Recognit. 2010;28:807–813. doi: 10.1016/j.imavis.2009.08.002.
- 42.Langner O, et al. Presentation and validation of the Radboud Faces Database. Cognit Emot. 2010;24(8):1377–1388.
- 43.Goeleven E, De Raedt R, Leyman L, Verschuere B. The Karolinska Directed Emotional Faces: A validation study. Cognit Emot. 2008;22(6):1094–1118.
- 44.Martinez AM, Benavente R. The AR face database. CVC Tech Rep 24. 1998.
- 45.Delorme A, Makeig S. EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics. J Neurosci Methods. 2004;134(1):9–21. doi: 10.1016/j.jneumeth.2003.10.009.
- 46.Dunn OJ. Multiple comparisons among means. J Am Stat Assoc. 1961;56(293):52–64.
- 47.Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc B. 1995;57(1):289–300.
- 48.Fukushima K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern. 1980;36:193–202. doi: 10.1007/BF00344251.
- 49.Hinton GE. Connectionist learning procedures. Artif Intell. 1989;40:185–234.
- 50.Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323:533–536.