Significance
Past work on the neural mechanisms of face processing has focused on how a network of face-selective areas in the macaque monkey brain extracts facial identity. Here, we study the processing of identity, expression, and gaze information. We find that all of these signals are extracted in populations of face cells located in a face area outside the classical network. In this area, we discover a single-cell representation of facial expression that is preserved even across changes of identity and head orientation. This face area, previously thought to be specialized for the processing of naturalistic facial movements, harbors a heterogeneous population of face cells that can process static faces to extract a wide range of socially meaningful facial information.
Keywords: face-processing systems, facial expression, gaze, facial identity, fMRI-targeted electrophysiology
Abstract
The last two decades have established that a network of face-selective areas in the temporal lobe of macaque monkeys supports the visual processing of faces. Each area within the network contains a large fraction of face-selective cells, and each area encodes facial identity and head orientation differently. A recent brain-imaging study discovered an area outside of this network selective for naturalistic facial motion, the middle dorsal (MD) face area. This finding offers the opportunity to determine whether coding principles revealed inside the core network would generalize to face areas outside the core network. We investigated the encoding of static faces and objects, facial identity, and head orientation, dimensions which had been studied in multiple areas of the core face-processing network before, as well as facial expressions and gaze. We found that MD populations form a face-selective cluster with a degree of selectivity comparable to that of areas in the core face-processing network. MD encodes facial identity robustly across changes in head orientation and expression, head orientation robustly across changes in identity and expression, and expression robustly across changes in identity and head orientation. These three dimensions are encoded in a separable manner. Furthermore, MD encodes the direction of gaze in addition to head orientation. MD thus encodes both structural properties (identity) and changeable ones (expression and gaze) and thereby provides information about another animal’s direction of attention (head orientation and gaze). MD contains a heterogeneous population of cells that establish a multidimensional code for faces.
Faces are important social stimuli for primates, providing rich and diverse information about others like their identity, mood, or direction of attention (1). Primates have thus evolved specialized neural mechanisms for the processing of faces. These are currently best understood in the macaque visual system, where functional MRI (fMRI)-localized, face-selective areas, as in the human brain, are positioned at anatomically stereotypical locations (2–4). Subsequent electrophysiological single-unit recordings from these fMRI-identified face areas showed that they contain very high fractions of face-selective neurons (5, 6). These recordings also showed that face cells in different face areas have qualitatively different properties. Cells in the middle lateral (ML) and middle fundus (MF) face patches (Fig. 1A) are strongly tuned to viewpoint and carry little identity information across viewpoints; cells in the anterior lateral (AL) face area are also strongly tuned to viewpoint but exhibit mirror-symmetric tuning to multiple viewpoints; and cells in the anterior medial (AM) face area exhibit weak tuning to viewpoint and strong identity selectivity (6). Face areas were later shown to be selectively connected to form a dedicated face-processing network (7, 8).
Fig. 1.
fMRI-targeted electrophysiological recording in dorsal face area MD and example stimuli from FOB. (A) Cartoon of the location of MD in the macaque face-processing system. Blue denotes core face areas. The face areas are the following: ML (middle lateral), MF (middle fundus), MD (middle dorsal), AL (anterior lateral), and AM (anterior medial). STS: superior temporal sulcus; D: dorsal; and A: anterior. (B) Coronal MRI images showing electrodes targeting MD in the left hemisphere of monkey M1 (Left) and the right hemisphere of M2 (Right). Statistical maps of the faces-versus-objects contrast are overlaid on the anatomical images (false discovery rate corrected). AP: anterior–posterior stereotaxic coordinates. (Scale bars, 10 mm.) (C) Example stimuli from the six categories in the FOB stimulus set. Monkey faces were obtained from six individuals making four facial expressions, shot simultaneously from five different viewpoints. A set of 4 × 5 expression and viewpoint combinations is shown for one individual; there are six individuals in total in the stimulus set. A three-dimensional cube is shown as a schematic of the monkey faces in FOB (Top Right). Human faces consisted of five viewpoints from eight individuals with neutral expression. Exp: expression and ID: identity.
These studies established a fundamental principle of face-processing systems, which is that face areas are functional clusters: Each area is thought to contain a high density of face cells with a homogeneous role. We will refer to this property as the clustering principle. This principle is based on findings from a specific set of interconnected face areas (blue areas in Fig. 1A). Yet we do not know whether it generalizes to other face areas to which no connections have yet been found. One such area is the middle dorsal face area, MD (red area in Fig. 1A).
The clustering principle of face processing has been established in studies focused on the coding of one meaningful facial dimension, identity (6, 9–12). But faces contain much more socially important information, transmitted primarily through facial expression and gaze (13). While facial identity is a structural property (14) of faces that only changes on a very slow time scale, expression and gaze are changeable facial aspects (13) that can change fast. It is thus possible that these two dimensions are coded differently from identity. This is currently unknown, because inside the core face-processing system no evidence for the encoding of these dimensions has been found yet (15).
In studies of single-cell and population codes for facial identity, one important question was whether identity was encoded in a way that depended on, or in a way that was robust against, changes in other image properties like size, position, and head orientation (6). Changes in size and position are affine transformations, while changes in head orientation are nonaffine. While the former are, theoretically, straightforward to compensate for in image processing (16), the latter are more difficult to achieve robustness against, as changes in head orientation bring new features into view and remove others from sight. Similarly, facial expression and gaze are nonaffine transformations that change the shape of a face. It is currently not known whether, and where, identity can be encoded robustly against another nonaffine transformation, expression, and vice versa. Expression-selective cells have been found (17–19), but their robustness has not been determined.
Faces and other visual objects are compositions of multiple features. One fundamental question about object representations is how feature combinations are encoded (20). One possibility is a separable code in which the joint tuning function to multiple stimulus dimensions can be predicted from the combination of the marginal one-dimensional tuning curves (21). In the case of facial features and objects, evidence for such a coding scheme has been found (9, 22). In both cases, multiplicative combinations provided slightly better explanations than additive ones. This small difference, however, was shown to translate into a big computational advantage for the population decoding of identity (22). However, in the case of faces, it is not known how multiple, meaningful facial dimensions are combined. Similarly, if multiple facial dimensions are encoded in a population, it is unknown whether this is accomplished through specialized subpopulations or through a heterogeneous population in which cells carry signatures of multiple dimensions.
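To make the notion of separability concrete, in our own notation (not the authors'): a cell's joint tuning r to two dimensions, say viewpoint v and identity d, is separable if it can be predicted from the marginal tuning curves f(v) and g(d), combined either multiplicatively or additively:

```latex
r_{\mathrm{mult}}(v, d) \approx f(v)\, g(d)
\qquad\text{versus}\qquad
r_{\mathrm{add}}(v, d) \approx f(v) + g(d)
```

The question addressed below is which of these two combination rules better describes MD cells when the dimensions are viewpoint, expression, and identity.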
Here, we used the recent discovery of dorsal face area MD (23) to address these questions directly through electrophysiological single-cell recordings (Fig. 1B). We asked the following: 1) whether MD concentrates face cells at a similarly high concentration as areas of the core face-processing network; 2) whether MD encodes multiple facial dimensions, in particular identity, expression, gaze, and head orientation; and 3) if so, how MD encodes combinations of meaningful facial dimensions.
Results
MD Contains as High a Fraction of Face-Selective Cells as Face Areas in the Ventral Face-Processing System.
The clustering principle of face-processing systems implies that each face area contains a high concentration of face-selective cells (5). To determine if MD follows this rule like the ventral face areas, we recorded from 193 neurons (SI Appendix, Table S1) in two hemispheres (Fig. 1B) during the presentation of a set of static images, including monkey and human faces, bodies, and objects (faces, objects, and bodies [FOB]; Fig. 1C and SI Appendix, SI Materials and Methods). Human faces varied independently for viewpoint and identity and macaque monkey faces for viewpoint, identity, and expression (Fig. 1C).
Fig. 2A shows an example response profile of an MD cell. The cell was selective for subsets of monkey and human faces but did not respond to objects. This was typical for the MD population as a whole (Fig. 2B): The majority of cells (157 out of 193, 81%) were visually responsive (Fig. 2 B, Top), and the average MD population response exhibited high selectivity to monkey and human faces, while being suppressed by other, nonface categories (Fig. 2C). We quantified the degree of face selectivity of each visually responsive cell with face selectivity indices, which compare responses to monkey (or human) faces with those to nonface objects (SI Appendix, SI Materials and Methods). The distributions of both indices were heavily skewed to the right (median: 0.99 for monkey faces, Fig. 2D; median: 0.69 for human faces, Fig. 2E). In total, 84% (132 out of 157) were face selective, responding at least twice as strongly to faces as to nonface objects (24, 25). These numbers are very similar to those in ventral face areas (6). Thus, MD is a bona fide face area.
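For illustration, a minimal Python sketch of such a selectivity index, assuming the standard contrast form implied by the legend of Fig. 2 (an index of 0.33 corresponds to a 2:1 response ratio); the exact definition, including any baseline correction, is given in SI Appendix:

```python
import numpy as np

def face_selectivity_index(resp_faces, resp_objects):
    """Contrast index comparing mean responses to faces vs. nonface objects.

    With this form, an index of 0.33 corresponds to a 2:1 response ratio
    (faces:objects), the criterion used here to call a cell face selective.
    """
    f = np.mean(resp_faces)    # mean response to face images (spikes/s)
    o = np.mean(resp_objects)  # mean response to nonface object images
    return (f - o) / (f + o)

# Hypothetical cell firing ~30 spikes/s to faces and ~10 spikes/s to objects:
fsi = face_selectivity_index(np.array([32.0, 28.0, 30.0]), np.array([11.0, 9.0, 10.0]))
print(fsi)  # 0.5 -> face selective (>= 0.33)
```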
Fig. 2.
MD is a face-selective area with a high fraction of face-selective cells. (A) Spike density function of an example MD cell’s responses to FOB image set. Stimuli are sorted from top to bottom, and stimulus categories are indicated by example pictures on the left. Minor ticks of the y-axis separate five viewpoints. Stimuli were presented for 200 ms, followed by 200 ms blank screen. The vertical gray line denotes the end of stimulus presentation. (B) Population response matrices to FOB image set for visually responsive (Top) and nonresponsive (Bottom) cells. Cells are sorted by the third PC of their responses to monkey faces. Vertical lines separate categories and minor ticks of the x-axis separate the faces of different viewpoints. (C) Population-averaged responses to the six categories in FOB across all recorded cells (n = 193). Error bars denote mean ± SEM. (D) Distribution of face selectivity indices (FSIs) for visually responsive cells (n = 157). (E) Distribution of human face selectivity indices (hFSIs) for visually responsive cells. In D and E, dashed lines indicate FSI or hFSI of 0.33, corresponding to a response ratio of 2:1 to faces versus nonface objects.
MD Cells Encode Viewpoint of Faces in Multiple Formats.
Previous studies showed that each area of the face-processing network has a characteristic representation of viewpoint that differs from the others (6). We found similar tuning in MD: Many MD neurons were strongly tuned to one specific viewpoint (Fig. 3A). In total, 129 (of 132, 98%) face-selective cells showed significant viewpoint tuning (Fig. 3C). Many other cells in MD responded preferentially to two mirror-symmetric viewpoints (Fig. 3B). To determine the fraction of mirror-symmetrically view-tuned cells, we calculated a mirror symmetry index (MSI) and classified viewpoint-tuned cells with an MSI larger than 0.4 as mirror symmetric (SI Appendix, SI Materials and Methods and Fig. S2). In total, 27 out of 132 (21%) cells exhibited mirror-symmetric tuning (Fig. 3C).
Fig. 3.
MD represents viewpoint in multiple formats. (A) Responses to the five viewpoints of example viewpoint-specific cells. (B) Responses to the five viewpoints of example mirror-symmetric cells. In A and B, responses were averaged across identity and expression for illustration purposes. We did not average cell responses across other dimensions when performing statistical tests. (C) Distribution of face-selective cells with or without mirror-symmetric viewpoint tuning. (D) Decoding accuracy of viewpoint from the face-selective cell population. Decoding was performed using linear support vector machine (SVM) classifiers trained on the population response of face-selective cells (n = 132). The dashed line depicts chance level. ***P < 0.001, bootstrap. (E) Confusion matrix of the linear SVM classifier in D decoding viewpoint. The color of each division in the matrix denotes the fraction of images of a viewpoint (row) classified as the targeted viewpoint (column). (F) Projections of population responses of face-selective cells to the monkey faces in FOB onto the first two PCs (50.1% of total variance explained). Each dot represents a face stimulus. Colors indicate viewpoints shown by images next to clusters. (G) Neural distance of pairs of faces that were mirror symmetric to each other. Neural distance was calculated based on three populations of cells: the full population of all face cells (gray bar); the population with the 27 mirror-symmetric cells removed (dark blue bar); and the population with 27 nonmirror-symmetric cells removed (light blue bar; SI Appendix, SI Materials and Methods). MS, mirror-symmetric; ns, not significant (P > 0.05); and ****P < 10−21, two-sided paired t tests. Error bars denote mean ± SEM in all plots.
The strength of viewpoint tuning suggests that even a small population might be able to encode viewpoint well. We performed a decoding analysis in which the viewpoint of a face was predicted from its population response vector (SI Appendix, SI Materials and Methods). The population of recorded cells reached a 100% correct classification rate (Fig. 3 D and E). Furthermore, the two kinds of viewpoint tuning in MD are not only strong but systematic, and because of this property, the MD population response systematically separated faces by viewpoint. The lack of mirror confusion in Fig. 3E could result from the decoder putting more weight on viewpoint-specific than on mirror-symmetric cells. To visualize the whole population in an unbiased way, we performed a principal component (PC) analysis on the population response matrix and projected the response to each face stimulus onto the first two PCs (Fig. 3F). We found the following: First, responses to faces primarily cluster by viewpoint. Second, neighboring viewpoints generate more similar responses than more distant viewpoints. Third, responses to mirror-symmetric faces (left and right profile, and left and right half profile) were significantly more similar to each other in the entire population (gray bar in Fig. 3G) than in a population with mirror-symmetric cells removed (dark blue bar in Fig. 3G; P < 10−21, two-sided paired Student’s t test). To ensure this difference did not result from a difference in cell number, we performed another control in which we instead removed nonmirror-symmetric cells to reduce the population to the same size (SI Appendix, SI Materials and Methods). This manipulation did not increase the neural distance between mirror-symmetric faces (light blue bar in Fig. 3G). Thus, MD encodes faces, strongly and systematically, by viewpoint.
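The decoding and visualization steps can be sketched as follows with synthetic data (the authors report using linear SVM classifiers and PC projections; the array shapes, cross-validation scheme, and scikit-learn usage here are our assumptions, with details in SI Appendix):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical population response matrix: one row per monkey-face image
# (6 identities x 4 expressions x 5 viewpoints = 120), one column per cell.
n_images, n_cells = 120, 132
X = rng.poisson(10.0, size=(n_images, n_cells)).astype(float)
viewpoint = np.tile(np.arange(5), n_images // 5)  # viewpoint label of each image

# Linear readout of viewpoint from the population response vector.
clf = LinearSVC(max_iter=10_000)
accuracy = cross_val_score(clf, X, viewpoint, cv=5).mean()
print(f"viewpoint decoding accuracy: {accuracy:.2f} (chance = 0.20)")

# Unsupervised view of the same code: project each image's population
# response onto the first two principal components, as in Fig. 3F.
pcs = PCA(n_components=2).fit_transform(X)
```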
MD Encodes Gaze Direction.
Viewpoint conveys the head orientation of a face. This is an important clue toward another individual’s direction of attention (26). Head orientation is particularly important in macaque monkeys (26), because their eyes lack a large contrast between center and periphery (27). But macaque monkeys can also use eye gaze as a cue for inferring attentional direction and for joint attention (28). We thus wondered whether MD contains cells encoding gaze direction in addition to head orientation, representing one major intrinsically changeable dimension of faces. We tested for explicit tuning to gaze direction in a second experiment, in which we placed different gaze directions into otherwise identical face images (Fig. 4A and SI Appendix, SI Materials and Methods).
Fig. 4.
MD cells encode gaze direction in addition to head orientation. (A) Example stimuli of gaze conditions with condition labels. Conditions are labeled as “head orientation: gaze direction.” Within the same head orientation, gaze conditions differ only in the eye region (dashed boxes). L: left, F: front, and R: right. Left and right are relative to the viewer. Colored text denotes gaze direction. (B and C) Example cells with gaze tuning. (Left) Spike density functions to gaze directions within a specific head orientation. (Right) Responses to all seven experimental conditions. (D) Example cell with similar responses to left gaze across head orientations. Conventions are the same as in B and C. From B to D, gray bands denote time periods of stimulus presentation. Shading and error bars represent mean ± SEM; ns, not significant (P > 0.05); and **P < 0.01, ***P < 10−5; see SI Appendix, SI Materials and Methods for tests. (E) Distribution of face-selective cells tuned to gaze. (F) Statistical test results (P values) of all cells with gaze tuning (false discovery rate controlled).
Out of the 116 face-selective cells recorded and tested in this experiment, 28 (24%) showed a significant modulation by gaze direction in at least one head orientation context (Fig. 4 B–E; Wilcoxon rank-sum tests for half profile viewpoints or Kruskal–Wallis tests for front viewpoints; P < 0.05, false discovery rate corrected). In most cells, tuning to gaze direction was confined to one specific head orientation (Fig. 4F). Yet we asked whether there might be cells representing gaze direction in a head orientation–invariant manner (i.e., cells that responded similarly to one preferred gaze direction across different head orientations). Fig. 4D shows one such example with similar responses to the left gaze direction for both front and left faces. Among the 28 gaze-tuned cells, only two showed this property. Together, these results indicate that despite the small and spatially restricted difference between images with different gaze directions, MD is sensitive to gaze direction in addition to head orientation. These results provide direct, single-unit evidence that a face area outside the core face network processes changeable aspects of faces (13, 29).
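The per-cell statistics can be sketched as follows (synthetic firing rates; the grouping of the seven conditions into head-orientation contexts and the Benjamini-Hochberg procedure are our assumptions, with the exact tests and correction scope described in SI Appendix):

```python
import numpy as np
from scipy.stats import ranksums, kruskal
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)

# Hypothetical single-trial firing rates of one cell, grouped by gaze direction
# within each head-orientation context (two gaze conditions for half-profile
# heads, three for the frontal head).
left_head  = {"L": rng.poisson(12, 10), "F": rng.poisson(8, 10)}
front_head = {"L": rng.poisson(10, 10), "F": rng.poisson(10, 10), "R": rng.poisson(10, 10)}
right_head = {"F": rng.poisson(9, 10),  "R": rng.poisson(9, 10)}

# One test per head-orientation context: rank-sum for two gaze conditions,
# Kruskal-Wallis for three, as described in the text.
pvals = [
    ranksums(left_head["L"], left_head["F"]).pvalue,
    kruskal(front_head["L"], front_head["F"], front_head["R"]).pvalue,
    ranksums(right_head["F"], right_head["R"]).pvalue,
]

# Control the false discovery rate over this family of tests.
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(reject, p_adjusted)
```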
MD Contains Codes for Facial Expression.
A second key, changeable dimension of faces is facial expression (13). We thus tested directly if MD represents facial expression. We used our FOB stimulus set, which contained monkey face stimuli varying orthogonally along expression, viewpoint, and identity (Fig. 1C).
We found MD cells that were tuned to specific expressions. Fig. 5 C and E show two example cells: The one in Fig. 5C selectively responds to threat and feargrin expressions and is slightly suppressed by neutral and lipsmack expressions. The cell in Fig. 5E shows graded responses across all four expressions, preferring the neutral expression the most and the threat expression the least. Curiously, this cell even differentiates the lipsmack from the neutral expression, even though these two are physically highly similar in our stimulus set (Figs. 5A and 1C). For both cells, the tuning to expression was largely maintained across identity and viewpoint, as shown by joint tuning functions (Fig. 5 D and F). Of the 132 face-selective MD cells tested, 49 (37%) showed significant expression modulation (Kruskal–Wallis test, P < 0.05; Fig. 5B). We further quantified how bimodal each cell’s response profile is with a bimodality index (SI Appendix, SI Materials and Methods) and found that 18 of the 49 (37%) expression-tuned cells have a bimodal response; the others have a graded response profile.
Fig. 5.
MD represents facial expression independently from viewpoint and identity. (A) Example stimulus of each facial expression. (B) Distribution of face-selective cells tuned to expression. (C) Spike density function of an example cell showing expression tuning. Responses were averaged across viewpoint and identity for illustration purposes. Gray bands denote time periods of stimulus presentation. Shading represents mean ± SEM. (D) Joint tuning to expression and identity (Left) and to expression and viewpoint (Middle) of the example cell in C. The response shown was averaged across viewpoint (Left) or identity (Middle). Tuning to expression averaged across other dimensions is shown at the right. ****P < 10−8, Kruskal–Wallis test. (E and F) Spike density function in E and joint tuning plots in F for another example cell. Conventions are the same as in C and D. (G) Schematic of expression decoding. Expression was decoded across all variations of viewpoint and identity. (H) Decoding accuracy of expression from the whole face-selective cell population. Decoding was performed using linear support vector machine (SVM) classifiers trained on the population response of face-selective cells (n = 132). The dashed line depicts chance level. ***P < 0.001, bootstrap. (I) Confusion matrix of the linear SVM classifier in H decoding expression. The color of each division in the matrix denotes the fraction of images of a specific expression (row) classified as the targeted expression (column). The matrix represents the decoding of expression across all viewpoints and identities. Note that lipsmack and neutral expressions are quite similar in our stimulus set (see Figs. 1C and 5A). In all plots, error bars denote mean ± SEM.
To quantify how well MD represents expression, we performed a decoding analysis. Importantly, we assessed the decodability of expression independently from the other two facial dimensions and not after averaging responses across them (SI Appendix, SI Materials and Methods): Each expression category included all images of that expression, varying along viewpoints and identities (Fig. 5G), and decoding considered responses to all of these images, no matter their viewpoint or identity, within one expression against responses to images of other expressions (SI Appendix, SI Materials and Methods). Such decoding is challenging because images of the same facial expression vary strongly with viewpoint and, to a lesser degree, with identity (Fig. 1C). Thus, for expression to be decodable, neural populations in MD need to “recognize” expressions robustly across changes in viewpoint and identity. We found that face-selective MD populations encode facial expression robustly across identity and viewpoint (accuracy: 54% and chance level: 25%; Fig. 5H). The confusion matrix of expression decoding (Fig. 5I) revealed that the main limitation to decoding accuracy was a high rate of misclassification between the physically very similar neutral and lipsmack expressions.
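The crucial point is how the class labels are constructed: every image of a given expression, at every viewpoint and identity, belongs to the same class, so a successful linear readout must generalize across those nuisance dimensions. A minimal sketch with synthetic data (the classifier choice, cross-validation scheme, and scikit-learn usage are our assumptions; the authors' procedure is in SI Appendix):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(2)

# Labels for the 6 x 4 x 5 monkey-face images (identity x expression x viewpoint).
identity, expression, viewpoint = (a.ravel() for a in np.meshgrid(
    np.arange(6), np.arange(4), np.arange(5), indexing="ij"))
X = rng.poisson(10.0, size=(identity.size, 132)).astype(float)  # hypothetical responses

# Decode expression: each class pools all viewpoints and identities, so the
# readout has to be robust to those variations (cf. Fig. 5G).
y_pred = cross_val_predict(LinearSVC(max_iter=10_000), X, expression, cv=5)
cm = confusion_matrix(expression, y_pred, normalize="true")  # rows: true expression
print(cm.round(2))

# Using `identity` as the label instead gives the identity decoding of Fig. 6G.
```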
MD Contains Codes for Facial Identity.
We next investigated, using our FOB stimulus set (Fig. 1C), whether MD encodes identity. Surprisingly, we found MD cells that were tuned to identity across viewpoint and expression (Fig. 6 C–F), and there were many cells with significant tuning to identity (72 of 132 face-selective cells, 55%; Kruskal–Wallis test, P < 0.05; Fig. 6B). Tuning to identity took different forms. Some cells were tuned broadly, responding preferentially to more than one identity (Fig. 6C), while others (11 of 72 identity-tuned cells, 15%) even showed strong selectivity to a single identity (Fig. 6E). Joint tuning functions show that both example cells maintained their tuning to identity across expression and viewpoint (Fig. 6 D and F). Thus, despite the physically relatively small differences between identities at the same viewpoint, MD extracts this dimension of structural information. In fact, a larger fraction of MD cells was tuned to identity than to expression or gaze.
Fig. 6.
MD represents facial identity independently from viewpoint and expression. (A) Example stimuli from each facial identity. ID: identity. (B) Distribution of face-selective cells tuned to identity. (C) Spike density function of an example cell showing identity tuning. Responses were averaged across viewpoint and expression for illustration purposes. Gray bands denote time periods of stimulus presentation. Shading represents mean ± SEM. (D) Joint tuning to identity and expression (Left) and to identity and viewpoint (Middle) of the example cell in C. The response shown was averaged across viewpoint (Left) or expression (Middle). Tuning to identity averaged across other dimensions is shown at the right. ****P < 10−12, Kruskal–Wallis test. (E and F) Spike density function in E and joint tuning plots in F for another example cell. Conventions are the same as in C and D. (G) Schematic of identity decoding. Identity was decoded across all variations of viewpoint and expression. (H) Decoding accuracy of identity from the whole face-selective cell population. Decoding was performed using linear support vector machine (SVM) classifiers trained on the population response of face-selective cells (n = 132). The dashed line depicts chance level. ***P < 0.001, bootstrap. (I) Confusion matrix of the linear SVM classifier in H decoding identity. The color of each division in the matrix denotes the fraction of images of a specific identity (row) classified as the targeted identity (column). The matrix represents the decoding of identity across all viewpoints and expressions. In all plots, error bars denote mean ± SEM.
This finding suggests that populations of MD cells might encode identity robustly across expression and viewpoint. To test this, we performed a decoding analysis quantifying how well identity was represented by the MD population independently from facial expression and viewpoint, without averaging responses across these dimensions: Each identity category included all images of that identity, varying along viewpoints and expressions (Fig. 6G), and decoding considered responses to all of these images, no matter their viewpoint or expression, within one identity against responses to images of other identities (SI Appendix, SI Materials and Methods). This provides a strong test for the robustness of coding for identity. We found that MD populations indeed encode identity across expressions and viewpoints robustly (accuracy: 63% and chance level: 17%; Fig. 6H). The confusion matrix of identity decoding confirmed that MD populations systematically separated all six identities in the stimulus set (Fig. 6I).
Thus, face area MD encodes structural properties of the face. Specifically, more than half of the MD population encodes facial identity. This capability requires the differentiation of fine spatial detail. Because the population encodes facial identity even under variation along two other dimensions, viewpoint and expression, this encoding of fine detail is also robust.
Multidimensional Encoding in MD.
We found many MD cells tuned to multiple dimensions of faces: viewpoint, identity, expression, and gaze (SI Appendix, Fig. S4). To quantify the multidimensional encoding of facial dimensions in MD cells, we performed a three-way ANOVA on the responses of each face-selective cell, with viewpoint, expression, and identity as factors. We found that most face cells (131 out of 132) show at least one significant main effect (Fig. 7A). Viewpoint is the factor that elicits the largest number of significant main effects (n = 129), followed by identity (n = 90) and then expression (n = 68). Although 27 cells show a main effect for viewpoint only, cells with a significant identity or expression main effect rarely lack a significant viewpoint effect. This is in accordance with our other analyses showing that viewpoint is strongly represented in MD (Fig. 3).
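Per cell, this analysis can be sketched as follows (synthetic trial-wise rates on a balanced 5 × 4 × 6 design; the use of an F-test-based ANOVA via statsmodels is our assumption, with the authors' exact procedure in SI Appendix):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(3)

# Hypothetical single-cell data: firing rate on each of 5 repeats of every
# viewpoint x expression x identity combination (5 x 4 x 6 = 120 images).
view, expr, ident, rep = np.meshgrid(
    np.arange(5), np.arange(4), np.arange(6), np.arange(5), indexing="ij")
df = pd.DataFrame({
    "viewpoint": view.ravel().astype(str),
    "expression": expr.ravel().astype(str),
    "identity": ident.ravel().astype(str),
    "rate": rng.poisson(10.0, view.size).astype(float),
})

# Three-way ANOVA with all main effects and interactions for one cell.
model = ols("rate ~ C(viewpoint) * C(expression) * C(identity)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table["PR(>F)"])  # p-values for the 3 main effects and 4 interaction terms
```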
Fig. 7.
Combination of multiple facial dimensions in MD. (A) Venn chart showing the number of face-selective cells that show main effects of viewpoint, expression, or identity in the three-way ANOVA analysis. Among all 132 face-selective cells, only one cell (not shown in the chart) did not show any main effect. (B, Left) Three joint tuning functions of an example cell to facial expression, identity, and viewpoint. The response shown in each matrix was averaged across the left-out dimension. Tuning curves to each dimension are shown as marginals. (Right) Predicted joint tuning functions from the multiplicative model for the cell. (C) Mean correlation between model prediction and true data (normalized by data reliability) averaged across all face-selective cells for the multiplicative (red) and additive (blue) models. In each model, expression, viewpoint, and identity are fit together. ***P < 10−3, Wilcoxon signed-rank test. (D) Model performance (same as in C) averaged across expression- or identity-tuned cells for the multiplicative (red) and additive (blue) models. Results for each pair of the three facial dimensions (indicated above plots) are shown. ns, not significant (P > 0.05); and ***P < 10−3, Wilcoxon signed-rank tests, false discovery rate corrected. In all plots, error bars denote mean ± SEM. See SI Appendix, Fig. S3 for unnormalized model performance.
Fig. 7A also shows that a large fraction of cells show main effects for multiple factors. For example, 54 cells showed significant effects for all three factors. However, there is also a clear division of labor between identity and expression within the whole population: 36 cells with an identity effect are insensitive to expression, and 14 cells with an expression effect are insensitive to identity. These results indicate that although the majority of cells encode viewpoint, there are at least four subpopulations with respect to sensitivity to identity and expression: one encodes both identity and expression, two encode either identity or expression, and one encodes viewpoint but neither identity nor expression.
We next studied the interaction effects of the three dimensions, summarized in Table 1. In Table 1, the columns split the whole population into eight nonoverlapping groups according to the main effects the cells show. For each group of cells, we then counted the number of cells with significant interaction effects. We found that cells that encode multiple factors also show more prominent interaction effects between the encoded main factors, yet weaker interactions involving factors that are not encoded. For example, many of the cells with main effects for all three factors (column eight) show various interaction effects among the three factors. Similarly, for cells encoding both viewpoint and identity but not expression (column six), the majority shows an interaction effect between viewpoint and identity. For the other groups of cells, interaction effects are much less prominent. For example, more than half of the cells that only encode viewpoint (column two), and nearly half of those that encode only viewpoint and expression (column five), do not show any interaction effect (“No interaction” row in Table 1).
Table 1.
Number of cells showing significant main and interaction effects in three-way ANOVA testing viewpoint, expression, and identity
| Interaction effect | View only | Exp only | ID only | View & Exp only | View & ID only | Exp & ID only | View & Exp & ID | No main effect | Sum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| View × Exp | 2 | 1 | 1 | 3 | 12 | 0 | 31 | 0 | 50 |
| View × ID | 9 | 1 | 1 | 5 | 30 | 0 | 48 | 0 | 94 |
| Exp × ID | 1 | 1 | 0 | 2 | 7 | 0 | 34 | 1 | 46 |
| View × Exp × ID | 3 | 0 | 0 | 1 | 9 | 0 | 25 | 0 | 38 |
| No interaction | 15 | 0 | 0 | 6 | 3 | 0 | 3 | 0 | 27 |
| Total | 27 | 1 | 1 | 13 | 35 | 0 | 54 | 1 | 132 |
Across columns, cells are split into nonoverlapping groups according to the main effects they show. Each row gives the number of cells showing a given interaction effect within the cell group indicated by the column. Exp, expression; ID, identity.
In summary, these results indicate that there are different populations of cells within MD encoding various numbers of facial dimensions. While some cells encode all factors interactively, there are also cells encoding only subsets of the facial dimensions or encoding them independently.
Combinations of Multiple Facial Dimensions in MD.
How are multiple dimensions combined in individual MD cells? It has been shown previously that the multiplicative mixing of image features leads to better decodability of features compared to additive mixing (22). We thus tested whether MD cells combine multiple facial dimensions by multiplication or addition of marginal tuning. We found that the multiplicative model fit the empirical joint tuning functions well in many cells (see Fig. 7B for an example). In fact, across the population, the performance of the multiplicative model (red bars in Fig. 7 C and D) is close to the upper bounds set by the reliability of the data.
To quantify the multiplicative separability of the joint tuning of MD cells, we calculated a multiplicative separability index (SI Appendix, SI Materials and Methods) (22). We found a high mean separability of 0.8 (SD: 0.1) across all face-selective cells in MD. However, this result does not necessarily imply that multiplication is the only way MD cells combine multiple dimensions: an additive-mixing model provided similar performance (blue bars in Fig. 7 C and D and SI Appendix, SI Materials and Methods) to that of the multiplicative one. Still, the multiplicative model had a small but significant advantage over the additive one when considering viewpoint, expression, and identity together (normalized correlation ± SEM: 1.11 ± 0.06 for the multiplicative, 1.10 ± 0.06 for the additive model; P < 10−3, Wilcoxon signed-rank test; Fig. 7C; see also SI Appendix, Fig. S3A). To determine which, if any, pair of dimensions benefits from multiplicative over additive combination, we fit the two models separately for each pair of facial dimensions (SI Appendix, SI Materials and Methods). Across all pairs of facial dimensions, the multiplicative model had a slight but significant advantage over the additive model only for the joint tuning of identity and viewpoint in identity-tuned cells (Fig. 7D and SI Appendix, Fig. S3B). Thus, our results are consistent with previous findings in general object-selective parts of inferotemporal cortex that joint tuning to identity and viewpoint is slightly better explained by multiplicative than by additive mixing (22). In contrast, joint tuning of the changeable dimension, expression, with viewpoint or identity did not show this advantage of multiplicative mixing.
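For a single pair of dimensions, the two models can be sketched as follows (a simplified fitting procedure on a hypothetical joint tuning matrix; the full three-dimensional fits, the separability index, and the normalization by data reliability follow SI Appendix and ref. 22):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical joint tuning of one cell: mean response to each of the
# 6 identities x 5 viewpoints.
R = rng.poisson(10.0, size=(6, 5)).astype(float)

# Marginal tuning curves.
id_tuning = R.mean(axis=1)    # tuning to identity, averaged over viewpoint
view_tuning = R.mean(axis=0)  # tuning to viewpoint, averaged over identity

# Multiplicative model: outer product of the marginals, scaled by the grand mean.
mult_pred = np.outer(id_tuning, view_tuning) / R.mean()
# Additive model: sum of the marginals, re-centered by the grand mean.
add_pred = id_tuning[:, None] + view_tuning[None, :] - R.mean()

# Compare the models by how well they correlate with the measured joint tuning.
def corr(a, b):
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

print(f"multiplicative r = {corr(R, mult_pred):.3f}, additive r = {corr(R, add_pred):.3f}")
```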
Discussion
Area MD is located in the dorsal bank of the superior temporal sulcus (STS), dorsal to the other middle face areas ML and MF. Previous fMRI studies in both humans and monkeys have revealed that dorsal face areas, but not ventral face areas, show strong face–motion selectivity (23, 30, 31). MD, a dorsal face area, was discovered as a face–motion area (23). MD does not appear to be directly connected to the ventral face-processing areas (7, 8). These properties suggested that MD might encode facial information differently from the areas of the ventral face-processing network.
Recording single-unit activity from fMRI-identified area MD, we find that MD contains just as high a fraction of face cells as the extensively studied ventral face areas (5, 6, 32). MD cells are highly heterogeneous: First, we found both single-viewpoint selectivity and mirror-symmetric viewpoint selectivity in MD cells. In contrast, each ventral face area contains homogeneous cells showing one tuning scheme to viewpoint: single-viewpoint selectivity in the middle face areas ML and MF and mirror-symmetric viewpoint selectivity in the AL face area (6). Second, subpopulations of MD cells encode fundamentally different properties of faces, including both changeable dimensions like expression and gaze and structural dimensions like identity. Thus, MD implements a basic version of the clustering principle, harboring a high fraction of face cells but not homogeneity of tuning.
Facial expressions are one dimension of changeable facial features (13). In our recordings from MD, we found direct, single-cell evidence for facial expression coding in a face area. Even more importantly, we found expression-selective cells that are not simply selective for a single image of an expression but code expression across other dimensions of variation, including viewpoint and identity. Previous studies have documented cells sensitive to facial expression in the temporal lobe but not in any fMRI-defined face areas (17–19). These studies, either ignoring head orientation or mixing viewpoints and facial expressions without controlling viewpoint across different facial expressions, did not establish expression selectivity robust to head orientation. Furthermore, how well those cells as a population could encode facial expression was previously not known. Here, by using a rigorously controlled face stimulus set in which faces varied independently along three facial dimensions (viewpoint, expression, and identity), we provide direct evidence for cells in the temporal cortex building a strong and robust representation of facial expressions. It is possible that expression-tuned cells in MD also code nonemotional facial gestures (e.g., chewing, yawning, etc.). Such tuning would further emphasize tuning to changeable facial features in MD.
The second changeable dimension of faces of high social importance is gaze (13, 33). Sensitivity to static gaze direction has been reported before in the STS (25, 34) but not in any known, localized face areas. Here, we provide single-unit evidence that MD encodes this second dimension of changeable facial features as well. Importantly, we find that there are cells in a face area that exhibit tuning to gaze in addition to head orientation. Previous studies have found high sensitivity to the eye region in the posterior lateral (PL) face area (32) and the middle face areas ML and MF (9). However, in these studies, gaze direction was not manipulated and gaze tuning was not investigated. High sensitivity to the eye region does not imply tuning for gaze direction, since it could be due to the high salience of the eyes and their importance for face detection (32). Here, we find that MD contains neural tuning to both head orientation and gaze direction. While the former can be inferred based on coarse features alone (35), the latter requires high spatial acuity not expected of a motion-selective area. Taken together, MD contains rich information that can be used to infer the attentional state of others.
We find that MD encodes facial identity as well as expression and gaze. Thus, MD encodes both structural and changeable features of faces. We show this directly, at the level of populations of single cells, in a face area. What might the utility of such a rich representation be? A partial answer might come from considering the informational requirements of action selection in social contexts. Social interactions require the processing of changeable aspects of faces, which reveal others’ internal states like direction of attention (from head and gaze direction) and emotion (from facial expression). Social interactions also require analyzing structural properties of faces, which reveal not only identity but also other qualities, like gender or dominance rank (36), that determine the nature of social actions. This multidimensional representation could also be used for performing nonsocial actions. For example, during foraging, a subordinate animal needs to know whether a more dominant peer is watching the same source of food before collecting it.
Here, as in most previous studies on facial identity coding (6, 11, 12), we focus on tuning to global facial dimensions. Only a few studies [e.g., Freiwald et al. (9)] investigated tuning to specific facial features like intereye distance. Thus, we lack a mechanistic understanding of which shape selectivity gives rise to robust tuning to different global dimensions, like head orientation, identity, or expression, and to their combinations. Taking expression as an example, in our FOB stimulus set, differences between expressions are most prominent in the mouth region. Expression-tuned cells in MD might be particularly sensitive to the mouth region. But sensitivity to other features related to facial expressions (e.g., eyebrows) might exist within MD as well. Thus, the expression-tuned cells we found might constitute a subset of the entire expression-coding population.
Yet we did gain understanding of how multiple facial dimensions are encoded. First, MD encodes them through a population of highly heterogeneously specialized cells. We did not find subpopulations encoding identity fully separated from subpopulations for expression or for gaze; rather, we found various combinations of tuning to these different dimensions (SI Appendix, Fig. S4). Some cells were even selective for all three dimensions. And this selectivity was combined with the heterogeneous encoding of head orientation. Thus, while MD cells share selectivity for faces, they encode faces in very different ways. Second, we found individual cells and the MD population as a whole to encode identity, expression, and head orientation in a separable manner. Similar to earlier studies, we found additive and multiplicative combinations of marginal tuning to yield similar predictions (9, 22). And, like one recent study, we found a slight advantage for multiplicative combination, in the case of identity and viewpoint selectivity in the subpopulation of identity-selective cells (22). This result suggests that head orientation and identity, but not expression, should be easily separable at the population level. Expression and identity, a changeable and a structural dimension, might not even be completely discernible in the MD code, possibly because both affect the shape of the face, and some features could support the coding of both expression and identity (37). Sensitivity to such common features might be important for the decoding of expression and identity. Other sources of information exist as well (e.g., the cells we found to be exclusively tuned to expression [or identity], from which expression [or identity] could also be decoded significantly above chance levels) (SI Appendix, Fig. S6).
It is important, yet daunting, to consider how one cortical area can simultaneously encode such heterogeneous information. We do not know of face-selective inputs to MD. Prior studies of connectivity between fMRI-identified face areas (7, 8) did not include MD and thus might have missed such connections. However, these studies did not provide evidence for connections from other face-selective regions even to the general part of the upper STS that harbors MD. Classical connectivity and lesion studies showed inputs from multiple sources, including V1, MT, and the superior colliculus, to this large upper STS region (38–40), suggesting that MD might be in a position to combine rich visual information rather than inherit face information from earlier face-processing stages. Although the latter is certainly not impossible, it appears more plausible that MD computes face selectivity and the four major facial qualities de novo. Subpopulations in MD might be sensitive to different features of faces: Cells processing structural information might be sensitive to the configuration of facial components (41) and cells processing expression or gaze to local features of faces. These considerations have broad implications for understanding face processing and, indeed, object recognition systems and the computational depth of cortex in general. It would also be interesting for future studies to compare the representations of multiple facial dimensions in deep networks [e.g., Jacob et al. (42)] with the ones we found in MD.
In summary, our study shows heterogeneous and multidimensional codes for faces in area MD (Fig. 8). This pattern of population selectivity stands in stark contrast to the principles assumed for face areas in general. Our results provide inroads into the mechanistic understanding of fundamentally important, but little studied, facial information like expression and gaze and their use for social interactions. A deeper understanding of how MD accomplishes so many tasks is likely to provide insights into cortical computation.
Fig. 8.
MD represents multidimensional facial information, including both structural (identity) and changeable aspects (expression and gaze). D: dorsal; and A: anterior. Conventions as in Fig. 1A.
Materials and Methods
Data were obtained from two male rhesus monkeys (Macaca mulatta, 8 to 10 y old, 12 to 15 kg). Electrophysiological recordings were taken from fMRI-localized area MD. All procedures conformed to federal and state regulations and followed the NIH Guide for Care and Use of Laboratory Animals (43). See SI Appendix for details of experimental design, data collection, and analysis.
Supplementary Material
Acknowledgments
This work was supported by the National Eye Institute of the NIH under Award Number R01 EY021594 to W.A.F. and the New York Stem Cell Foundation. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We thank W. Zarco, C. Fisher, A. Gonzalez, S. Landi, S. Shepherd, and I. Sani for help with animal training, monkey fMRI, and stimulus preparation; B. Deen for help with MRI sequences; S. Serene for sharing analysis tools; L. Yin and J. Cheng for logistics support; and the veterinary team of the Rockefeller University for the care of the subjects.
Footnotes
The authors declare no competing interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2108283118/-/DCSupplemental.
Data Availability
Cell response data have been deposited in Figshare (https://doi.org/10.6084/m9.figshare.15087096) (44). All other study data are included in the article and/or SI Appendix.
References
- 1. Bruce V., Young A. W., Face Perception (Psychology Press, London, UK, 2012).
- 2. Pinsk M. A., DeSimone K., Moore T., Gross C. G., Kastner S., Representations of faces and body parts in macaque temporal cortex: A functional MRI study. Proc. Natl. Acad. Sci. U.S.A. 102, 6996–7001 (2005).
- 3. Tsao D. Y., Freiwald W. A., Knutsen T. A., Mandeville J. B., Tootell R. B. H., Faces and objects in macaque cerebral cortex. Nat. Neurosci. 6, 989–995 (2003).
- 4. Tsao D. Y., Moeller S., Freiwald W. A., Comparing face patch systems in macaques and humans. Proc. Natl. Acad. Sci. U.S.A. 105, 19514–19519 (2008).
- 5. Tsao D. Y., Freiwald W. A., Tootell R. B. H., Livingstone M. S., A cortical region consisting entirely of face-selective cells. Science 311, 670–674 (2006).
- 6. Freiwald W. A., Tsao D. Y., Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science 330, 845–851 (2010).
- 7. Moeller S., Freiwald W. A., Tsao D. Y., Patches with links: A unified system for processing faces in the macaque temporal lobe. Science 320, 1355–1359 (2008).
- 8. Grimaldi P., Saleem K. S., Tsao D., Anatomical connections of the functionally defined “face patches” in the macaque monkey. Neuron 90, 1325–1342 (2016).
- 9. Freiwald W. A., Tsao D. Y., Livingstone M. S., A face feature space in the macaque temporal lobe. Nat. Neurosci. 12, 1187–1196 (2009).
- 10. Leopold D. A., Bondar I. V., Giese M. A., Norm-based face encoding by single neurons in the monkey inferotemporal cortex. Nature 442, 572–575 (2006).
- 11. Koyano K. W., et al., Dynamic suppression of average facial structure shapes neural tuning in three macaque face patches. Curr. Biol. 31, 1–12.e5 (2021).
- 12. Chang L., Tsao D. Y., The code for facial identity in the primate brain. Cell 169, 1013–1028.e14 (2017).
- 13. Haxby J. V., Hoffman E. A., Gobbini M. I., The distributed human neural system for face perception. Trends Cogn. Sci. 4, 223–233 (2000).
- 14. Bruce V., Young A., Understanding face recognition. Br. J. Psychol. 77, 305–327 (1986).
- 15. Hesse J. K., Tsao D. Y., The macaque face patch system: A turtle’s underbelly for the brain. Nat. Rev. Neurosci. 21, 695–716 (2020).
- 16. Riesenhuber M., Poggio T., Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019–1025 (1999).
- 17. Hasselmo M. E., Rolls E. T., Baylis G. C., The role of expression and identity in the face-selective responses of neurons in the temporal visual cortex of the monkey. Behav. Brain Res. 32, 203–218 (1989).
- 18. Morin E. L., Hadj-Bouziane F., Stokes M., Ungerleider L. G., Bell A. H., Hierarchical encoding of social cues in primate inferior temporal cortex. Cereb. Cortex 25, 3036–3045 (2015).
- 19. Sugase Y., Yamane S., Ueno S., Kawano K., Global and fine information coded by single neurons in the temporal visual cortex. Nature 400, 869–873 (1999).
- 20. Sripati A. P., Olson C. R., Responses to compound objects in monkey inferotemporal cortex: The whole is equal to the sum of the discrete parts. J. Neurosci. 30, 7948–7960 (2010).
- 21. Grunewald A., Skoumbourdis E. K., The integration of multiple stimulus features by V1 neurons. J. Neurosci. 24, 9185–9194 (2004).
- 22. Ratan Murty N. A., Arun S. P., Multiplicative mixing of object identity and image attributes in single inferior temporal neurons. Proc. Natl. Acad. Sci. U.S.A. 115, E3276–E3285 (2018).
- 23. Fisher C., Freiwald W. A., Contrasting specializations for facial motion within the macaque face-processing system. Curr. Biol. 25, 261–266 (2015).
- 24. Baylis G. C., Rolls E. T., Leonard C. M., Selectivity between faces in the responses of a population of neurons in the cortex in the superior temporal sulcus of the monkey. Brain Res. 342, 91–102 (1985).
- 25. Perrett D. I., et al., Visual cells in the temporal cortex sensitive to face view and gaze direction. Proc. R. Soc. Lond. B Biol. Sci. 223, 293–317 (1985).
- 26. Marciniak K., Dicke P. W., Thier P., Monkeys head-gaze following is fast, precise and not fully suppressible. Proc. Biol. Sci. 282, 20151020 (2015).
- 27. Kobayashi H., Kohshima S., Unique morphology of the human eye. Nature 387, 767–768 (1997).
- 28. Shepherd S. V., Following gaze: Gaze-following behavior as a window into social cognition. Front. Integr. Neurosci. 4, 5 (2010).
- 29. Hoffman E. A., Haxby J. V., Distinct representations of eye gaze and identity in the distributed human neural system for face perception. Nat. Neurosci. 3, 80–84 (2000).
- 30. Pitcher D., Dilks D. D., Saxe R. R., Triantafyllou C., Kanwisher N., Differential selectivity for dynamic versus static information in face-selective cortical regions. Neuroimage 56, 2356–2363 (2011).
- 31. Polosecki P., et al., Faces in motion: Selectivity of macaque and human face processing areas for dynamic stimuli. J. Neurosci. 33, 11768–11773 (2013).
- 32. Issa E. B., DiCarlo J. J., Precedence of the eye region in neural processing of faces. J. Neurosci. 32, 16666–16682 (2012).
- 33. Carlin J. D., Calder A. J., The neural basis of eye gaze processing. Curr. Opin. Neurobiol. 23, 450–455 (2013).
- 34. Perrett D. I., Hietanen J. K., Oram M. W., Benson P. J., Organization and functions of cells responsive to faces in the temporal cortex. Philos. Trans. R. Soc. Lond. B Biol. Sci. 335, 23–30 (1992).
- 35. Chen J., Wu J., Richter K., Konrad J., Ishwar P., “Estimating head pose orientation using extremely low resolution images” in 2016 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI) (IEEE, Santa Fe, NM, 2016), pp. 65–68.
- 36. Todorov A., Olivola C. Y., Dotsch R., Mende-Siedlecki P., Social attributions from faces: Determinants, consequences, accuracy, and functional significance. Annu. Rev. Psychol. 66, 519–545 (2015).
- 37. Calder A. J., Young A. W., Understanding the recognition of facial identity and facial expression. Nat. Rev. Neurosci. 6, 641–651 (2005).
- 38. Bruce C. J., Desimone R., Gross C. G., Both striate cortex and superior colliculus contribute to visual properties of neurons in superior temporal polysensory area of macaque monkey. J. Neurophysiol. 55, 1057–1075 (1986).
- 39. Gross C. G., Contribution of striate cortex and the superior colliculus to visual function in area MT, the superior temporal polysensory area and the inferior temporal cortex. Neuropsychologia 29, 497–515 (1991).
- 40. Pitcher D., Ungerleider L. G., Evidence for a third visual pathway specialized for social perception. Trends Cogn. Sci. 25, 100–110 (2021).
- 41. Maurer D., Grand R. L., Mondloch C. J., The many faces of configural processing. Trends Cogn. Sci. 6, 255–260 (2002).
- 42. Jacob G., Pramod R. T., Katti H., Arun S. P., Qualitative similarities and differences in visual object representations between brains and deep networks. Nat. Commun. 12, 11872 (2021).
- 43. National Research Council, Guide for the Care and Use of Laboratory Animals (National Academies Press, Washington, DC, ed. 8, 2011).
- 44. Yang Z., Freiwald W. A., Joint encoding of facial identity, orientation, gaze, and expression in the middle dorsal face area. Figshare. Dataset. https://doi.org/10.6084/m9.figshare.15087096. Deposited 31 July 2021.