Highlights
• Dynamic actions are represented in posterior and middle STS regions.
• Categorization learning modulates action representations in STS body patches.
• Dynamic body information and learning-related effects were also present in face patch ML.
• Our findings highlight the role of macaque STS in action recognition.
Keywords: Action observation, Category learning, Rhesus macaque, fMRI, MVPA
Abstract
Neuroimaging and single cell recordings have demonstrated the presence of STS body category-selective regions (body patches) containing neurons that respond to the presentation of static bodies and body parts. To date, it remains unclear whether these body patches and additional STS regions respond during observation of different categories of dynamic actions and to what extent categorization learning influences representations of observed actions in the STS. In the present study, we trained monkeys to discriminate videos depicting three different action categories (grasping, touching and reaching) using a forced-choice action categorization task. Before and after categorization training, we performed fMRI recordings while monkeys passively observed the same action videos. At the behavioral level, after categorization training, monkeys generalized to untrained action exemplars, in particular for grasping actions. Before training, uni- and/or multivariate fMRI analyses suggested a broad representation of dynamic action categories, in particular in posterior and middle STS. Univariate analysis further suggested action category-specific training effects in middle and anterior body patches, face patch ML and posterior STS regions MT and FST. Overall, our fMRI experiments suggest a widespread representation of observed dynamic bodily actions in the STS that can be modulated by visual learning, supporting its proposed role in action recognition.
1. Introduction
Recognition of conspecifics’ actions is a crucial aspect of social cognition and behavior in human and non-human primates. While neurons tuned to observed actions are found in a wide range of brain regions, including parieto-frontal pathways devoted to planning and executing one's own actions (Gallese et al., 1996; Nelissen et al., 2005; Lanzilotto et al., 2019), a large number of electrophysiological and neuroimaging investigations have shown that macaque STS in particular contains several regions devoted to the visual analysis of observed bodies and bodily actions. Electrophysiological investigations in monkeys demonstrated the existence of STS neurons tuned to different types of body movements involving the torso, leg, arm, hand or fingers (Bruce et al., 1981; Perrett et al., 1985; Jellema and Perrett, 2006; Barraclough et al., 2006). In addition to these intransitive body movements, populations of STS cells were also found to respond selectively to goal-directed or manipulative hand actions (Perrett et al., 1989; Barraclough et al., 2009). Later fMRI studies confirmed that observed hand actions yield responses in different portions of lower, upper bank and fundus of macaque STS (Nelissen et al., 2006; 2011; Sliwa and Freiwald, 2017; Sharma et al., 2018; Fiave et al., 2018).
Besides responses to actions, macaque STS also houses several regions showing category-selective responses to presentation of stationary bodies or body parts. Using fMRI and a range of stimuli and contrasts, several groups have now demonstrated the existence of body selective regions or so-called body patches in monkey STS (Tsao et al., 2003; Bell et al., 2009; Pinsk et al., 2005; Popivanov et al., 2012; Fischer and Freiwald, 2015). The most consistently observed body patches across these monkey fMRI studies are the middle STS body patch (MSB), located near the ML face patch and the anterior STS body patch (ASB), located close to the AL face patch (Tsao et al., 2003; Fischer and Freiwald, 2015). Recent electrophysiological investigations have confirmed the existence of body or body part selective neurons in these fMRI defined body patches (Popivanov et al., 2014, 2016; Kumar et al., 2017; Bao and Tsao, 2018) and suggest these neurons provide information with respect to body shape, posture, orientation and identity (for review Vogels, 2022). Overall, aforementioned studies provide evidence for a key role of macaque STS in computing observed bodies and bodily actions, relevant for social interactions with others as well as survival in a competitive environment (Pitcher and Ungerleider, 2021).
While monkey STS body patches have mostly been studied using static stimuli, only a few studies have examined body patch responses to dynamic actions (Jastorff et al., 2012; Sliwa and Freiwald, 2017), and it remains unclear to what extent distinct dynamic action categories yield fMRI responses throughout different portions of the STS. Moreover, while several groups have examined to what extent motor expertise influences observed action representations (Calvo-Merino et al., 2004; Balser et al., 2014), it is not known whether visual learning alters action representations in macaque STS regions devoted to the visual analysis of bodies and bodily actions.
In the present monkey fMRI study, we sought to examine to what extent action categories are represented in the STS body patches and adjacent STS regions and whether categorization learning changes these representations. Macaque monkeys were initially scanned (pre-categorization training) while observing videos depicting an animated monkey model performing six different actions (Fig. 1B). These videos varied across two stimulus dimensions, action category and effector. With respect to action category, the videos depicted one of three different goals or end-states: transitive grasp or touch actions, or intransitive reach actions (Fig. 1B). For each of these three action categories, we created two versions that involved a different effector performing the action (either hand/arm or tail). This dimension 'effector' served as a control to examine the specificity of the discrimination training, since categorization training involved discriminating the stimuli across action categories, with effector being irrelevant. To examine monkeys' action categorization abilities and the effect of this visual discrimination learning on action representations in the brain, monkeys were trained on a three-alternative forced-choice categorization task requiring them to discriminate between grasp, touch and reach actions. Initial categorization training involved a small set of videos depicting human actors performing hand grasp, hand touch or hand reach actions similar to those depicted in the monkey action videos. Once categorization performance exceeded 80% correct trials, we tested the extent to which monkeys generalized to different exemplars of human action videos, as well as to the monkey action videos used in the fMRI action observation experiments. Since monkeys did not generalize completely from the learned human action videos to the monkey action videos, they were additionally trained to categorize the six monkey action videos until performance reached above 80% correct. After this categorization training, monkeys were rescanned while observing the same monkey action videos (post-categorization training action observation experiment), as was done before categorization training. Univariate and multivariate fMRI analyses were employed on the pre- and post-categorization training fMRI action observation datasets to examine to what extent action categories (grasp, touch and reach) were represented in the STS and whether experience due to categorization learning led to changes in these action representations (Jastorff et al., 2009; Apšvalka et al., 2018).
Fig. 1.
Experimental setup and paradigm. A. Monkey fMRI experimental setup. Monkey subjects sat in a sphinx position in a plastic custom-made monkey chair (I) facing a MR-compatible translucent (fMRI scanner setup) or LCD (training setup) screen (II) while they performed either action observation or categorization tasks. Eye movements were monitored using an MR-compatible eye camera (III). During the action observation task, monkey subjects were rewarded for fixating a red fixation dot in the center of the screen while different action videos and controls were shown. B. Key frames (cropped here for display purposes) from the different action videos and static control conditions used in the action observation task. Actions depicted either a hand grasp, hand touch, hand reach, tail grasp, tail touch, or tail reach. Static frames of the hand and tail action videos were used as control conditions for the hand and tail actions, respectively. C. Example trial of the forced-choice three-alternative categorization task. A trial started with fixation of a centrally positioned red fixation target for 500 ms, followed by fixation of a 3 s video (with the red fixation target superimposed at the level of the object) displaying a human actor performing a hand grasp, hand touch or hand reach action. Afterwards, the red fixation dot and video disappeared, while three green targets were presented and monkeys were required to make saccades to the correct target (indicated here for illustration purposes with a white arrow; this arrow was not shown during the actual action categorization task) in order to receive a liquid reward. D. Key frames from the grasping action videos used in the categorization task training: ① to ④ side view of a female and male actor grasping a white or blue ball from the right side of the object with their left hand. For display purposes here, the red fixation target that was superimposed on the object during the categorization trials is not shown. E. Key frames from the grasping action videos used during the different generalization tests. The same female actor (as in ① and ②) either grasped a similar sized green ball ⑤, a small sized white ball ⑥, or a white ring ⑦ with her left hand. In addition, the same female actor grasped the white ball with her right hand while seated on the right side of the object ⑧. We also tested generalization to a mirrored video of training video ①, hence showing the same female actor grasping the white ball with her right hand while seated at the left side of the object ⑨. Finally, we tested whether monkeys generalized to other female ⑩ or male ⑪ actors, or to the hand grasp ⑫ and tail grasp ⑬ videos depicting the 3D monkey model, the latter of which were used in the action observation task. For each of the grasping action videos, there was a corresponding touch and reach action video (not shown). See suppl. Videos 1–2 for examples of the three types of actions. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
2. Material and methods
2.1. Subjects
Two male (M1, M2) rhesus monkeys (Macaca mulatta, 5–7 kg, 5–7 years old) participated in the present study. All experimental procedures and animal care followed the national and European guidelines and were approved by the animal ethical committee of the KU Leuven.
2.2. Action observation tasks
2.2.1. Action observation (fixation) task training
Monkey subjects were trained to sit in a sphinx position in a custom-made MR-compatible chair (Fig. 1A, I). Monkeys directly faced a liquid crystal display (LCD) screen (Fig. 1A, II) that was positioned at 57 cm from their eyes. Monkeys were required to maintain fixation within a 2 × 2° window centered on a red dot (0.2 × 0.2°) in the middle of the screen. Eye position was monitored at 120 Hz through pupil position and corneal reflection (ISCAN Inc., Fig. 1A, III). Rewards (juice drops) were delivered continuously if monkeys maintained fixation on the red dot (Fig. 1A, II).
2.2.2. Stimuli
We made videos of a 3D animated monkey model (Fig. 1B) using open-source Blender software (https://www.blender.org/), depicting three types of actions (grasp, touch, or reach) with two effectors (hand or tail). The 'grasp' action involves the monkey model reaching and grasping a centrally positioned object (white ball), either with its left hand or tail. The 'touch' action shows the same monkey model reaching and touching the bottom part of the object with its hand or tail. The 'reach' action displays the monkey model performing a similar reaching movement towards the object, as in the touch action, but without touching the object. Static images of the monkey model in a standing position, either facing towards or facing away from the object, were used as control conditions for the hand and tail actions, respectively. Supplementary video 1 shows all six hand and tail actions (the text displayed in the upper left corner is only for illustration purposes and was not shown during the task). The white object was positioned at the center of the screen with the animated monkey model positioned on the right half of the screen. Stimulus dimensions were 20 × 15°. Each video lasted 6 s.
2.2.3. Experimental design
We performed two separate fMRI scan sessions for the action observation task before (pre-training scan) and after (post-training scan) monkeys completed categorization training and generalization testing of the action categorization task (see below). Note that before the post-training scan, monkeys specifically practiced categorization with the 3D monkey videos until their performance for all action categories reached 80% or above (Fig. 2E). The timeline of the experimental procedure is shown in Fig. 2A.
Fig. 2.
Experimental timeline and behavioral results. A. Experimental timeline. The experiment involved pre- and post-categorization training scan sessions while monkeys performed the action observation task with the animated monkey action videos. Between these two scan sessions, monkeys were trained on the action categorization task in three stages: 1) initial learning of the three-way categorization task; 2) generalization tests; 3) categorization training with the monkey videos. B. Performance progress of both monkeys at the initial categorization learning stage. Black arrows with numbers on top indicate the introduction of new sets of videos (each set includes a grasp, touch and reach action video). These numbers correspond to the respective videos shown in Fig. 1D. C. Behavioral results for generalization tests with videos depicting human actors for monkey M1 and monkey M2. The diagonal of each color-coded confusion matrix represents categorization accuracies for the corresponding conditions, and the remaining cells indicate error rates for incorrect selections of the other two targets. 'Left', 'up', and 'right' indicate the location of the corresponding action category target in the categorization task. D. Behavioral results for generalization tests with animated monkey action videos. E. Behavioral results for categorization of monkey videos following additional categorization training but before the post-training action observation scan. Performance for hand and tail actions is averaged. M1 – monkey M1, M2 – monkey M2. Asterisks in C, D and E indicate significant generalization (p<0.05, binomial test).
The fMRI recordings were performed using a block design with alternating blocks of nine conditions: fixation only, and observation of hand-grasp, tail-grasp, hand-touch, tail-touch, hand-reach, tail-reach, hand-static, and tail-static stimuli. We generated ten pseudo-random orders of the nine condition blocks; each order was used for one run and presented twice within that run. Each block consisted of 15 volumes (30 s). A complete run lasted 9 min and 10 s, during which 275 whole-brain volumes were acquired, including 5 start volumes at the beginning. We applied the same design for both pre- and post-training scans. In the pre-training scan, 39 and 36 runs were collected for monkeys M1 and M2, respectively, of which 5 and 1 runs were discarded due to poor performance (<90%) on the fixation task. In the post-training scan, 40 and 39 runs were acquired, of which 6 and 4 runs were excluded because of poor fixation performance for M1 and M2, respectively.
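As a quick sanity check of the reported run timing, the following minimal Python sketch generates pseudo-random block orders and verifies that 18 condition blocks of 15 volumes plus 5 start volumes give 275 volumes (9 min 10 s at a 2 s TR). The repetition scheme (each shuffled order of the nine conditions presented twice within a run) is our reading of the design, and the function names are illustrative only.

```python
import random

TR_S = 2             # repetition time (s)
VOLS_PER_BLOCK = 15  # 30 s blocks
N_START_VOLS = 5
CONDITIONS = ["fixation", "hand-grasp", "tail-grasp", "hand-touch", "tail-touch",
              "hand-reach", "tail-reach", "hand-static", "tail-static"]

def make_block_order(seed):
    """One pseudo-random order of the nine condition blocks,
    presented twice within a run (18 blocks in total, assumed scheme)."""
    rng = random.Random(seed)
    order = CONDITIONS[:]
    rng.shuffle(order)
    return order * 2

orders = [make_block_order(seed) for seed in range(10)]  # ten pseudo-random orders

# Run length check: 18 blocks x 15 volumes + 5 start volumes = 275 volumes = 9 min 10 s
n_vols = len(orders[0]) * VOLS_PER_BLOCK + N_START_VOLS
print(n_vols, "volumes,", (n_vols * TR_S) // 60, "min", (n_vols * TR_S) % 60, "s")
```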
2.3. Action categorization tasks
2.3.1. Categorization task and training
We trained monkeys with a three-alternative forced-choice action categorization task. Monkeys had to categorize three types of action videos (grasp, touch, and reach) by making saccades towards corresponding targets (Fig. 1C). Each categorization trial started with fixation of a centrally positioned target for 500 ms, upon which a video depicting one of the 3 actions was displayed for 3 s. During video display, monkeys had to hold fixation until the video disappeared. Afterwards, three targets (0.8 × 0.8°, green) showed up replacing the central fixation target and the video. Monkeys were required to make a saccade towards one of the three targets in order to receive a drop of juice reward. The next trial started automatically once monkeys made a saccade to one of the targets or if no response was made after 3 s. Trials with no response or trials that were aborted due to failure to keep fixation until the end of the video presentation were excluded from the data analysis. The three saccade targets were located at 10.5° to the left, up, and right of the center of the screen. For monkey M1 grasp, touch, and reach actions were associated with respectively the left, upper and right target. For monkey M2, the left, upper, and right targets were associated with respectively the grasp, reach, and touch actions.
Training on the categorization task involved three steps. In initial training trials, only the correct target corresponding to the specific action category was presented after video presentation. This way monkeys learned to associate each action category (grasp, touch, or reach) with its corresponding target. Next, monkeys were trained to categorize the three actions by selecting the correct target when all three targets were presented after the video. For initial categorization training, we used 3 action videos depicting a female actor performing a grasp, touch or reach action towards a white ball (Fig. 1D, ①). Later, 3 additional video sets (Fig. 1D, ②-④), involving either a different object color or a different (male) actor, were introduced gradually. Thus, in total 12 videos were used for the entire action categorization training (3 action categories, 2 actors, 2 objects). For each categorization trial, videos were randomly picked from the set of 12 training videos, and the same video was never repeated more than three times consecutively. In addition, a response-bias-correction procedure (Vangeneugden et al., 2010) was applied: if the monkey failed to categorize a video correctly, a trial with the same video was repeated until it was categorized correctly. Both monkeys performed around 400–800 trials per category for each training session. Once the performance of both monkeys remained above 80% accuracy (per action category) for a few consecutive training sessions (Fig. 2B), we proceeded with generalization testing.
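The sketch below illustrates the trial-scheduling logic described above (random selection from the 12 training videos, at most three consecutive presentations of the same video, and the response-bias-correction rule of repeating an incorrectly categorized video). It is a minimal Python illustration under these assumptions, not the actual training software; the set names and simulated monkey responses are hypothetical.

```python
import random

VIDEO_SETS = {f"set{i}": ["grasp", "touch", "reach"] for i in range(1, 5)}  # 4 sets x 3 categories
ALL_VIDEOS = [(s, cat) for s, cats in VIDEO_SETS.items() for cat in cats]   # 12 training videos

def next_trial(history, last_outcome_correct, rng=random):
    """Pick the next training video.

    - Response-bias correction: repeat the previous video until it is
      categorized correctly.
    - Otherwise pick randomly, never showing the same video more than
      three times in a row.
    """
    if history and not last_outcome_correct:
        return history[-1]                      # repeat the incorrectly categorized video
    while True:
        video = rng.choice(ALL_VIDEOS)
        if history[-3:] != [video] * 3:         # at most three consecutive repeats
            return video

# Toy usage: simulate a few trials with random (hypothetical) response outcomes
history, correct = [], True
for _ in range(10):
    video = next_trial(history, correct)
    history.append(video)
    correct = random.random() > 0.3             # stand-in for the monkey's response
print(history)
```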
2.3.2. Generalization tests
We next tested if the monkeys were able to generalize the learned categorization rule to untrained novel videos of the three action categories (grasp, touch, or reach) including the 3D animated monkey videos used in the action observation task.
Each generalization testing session contained ∼90% trials depicting familiar, trained videos, and the rest of the trials (∼10%) consisted of novel untrained videos. During generalization testing, monkeys were rewarded for the untrained videos no matter which target (left, up or right) they chose, in order to avoid introducing learning effects. Response-bias correction was not used during the generalization tests. In total, generalization testing comprised nine different sessions with nine sets of action videos (Fig. 1E), each session including one set of 3 novel videos (one for each action category). We performed 6–10 testing runs (15–20 min/run) for every testing session. Each testing run contained 50 and 5 trials per category of trained and untrained videos, respectively. Thus, in total, monkeys were tested with 900–1500 trials of trained videos and 90–150 trials of novel untrained videos per testing session.
2.3.3. Stimuli
Besides the animated 3D monkey videos (see Action observation tasks), we recorded the three types of actions (grasp, touch, and reach) performed by different human actors, using objects with various colors and shapes. Example key frames of grasping actions are shown in Fig. 1D and E for the training session and generalization tests, respectively. In total, the videos included four different actors (2 male and 2 female) performing the actions (with their left hand) towards different objects. These objects consisted of either a medium sized ball (three different colors), a small ball, or a ring. In addition, one female actor also performed the actions with her other (right) hand (Fig. 1E ⑧). Finally, we also used a flipped version of video set ①, depicting the female actor performing the actions from the left side of the object (flipped view, Fig. 1E ⑨). Supplementary video 2 shows all three actions performed by an actor with the medium sized white ball (the text in the upper-left corner is only for illustration purposes and was not shown during the task). The videos (13.9 × 8.4°) were presented in the center of the screen with the fixation dot overlaid on the object. For the generalization test, the action videos depicting the 3D animated monkey model were resized and mean-luminance adjusted to match the human action videos used in the categorization training and generalization testing. Suppl. Fig. 1 shows a key frame of all action videos depicting human actors used during the initial categorization training and during the different generalization tests.
2.4. Localiser of body and face patches
To investigate in detail the location of action observation responses in the STS body and face patches, we performed localiser scans in both monkeys that allowed us to functionally localize the STS body patches (BPs) and face patches (FPs). We followed the same procedure as reported by Popivanov et al. (2012) with three sets of their stimuli: monkey bodies, monkey faces, and objects matched in aspect ratio to the monkey bodies. Consistent with their results and other findings (Taubert et al., 2015; Tsao et al., 2008), our localizer allowed us to functionally delineate the MSB and ASB (anterior and posterior) body patches, as well as the middle (ML) and anterior (AL) face patches. The functional ROIs were extracted for both hemispheres in individual monkey brains using a threshold of p < 0.05 corrected for multiple comparisons (family-wise error, FWE). The contrasts we used to define the ROIs were: for BPs, monkey bodies vs. (monkey faces + monkey objects); for FPs, monkey faces vs. (monkey bodies + monkey objects) (Popivanov et al., 2012).
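As an illustration of the localizer contrasts described above, the snippet below defines contrast weight vectors for the body patch (bodies vs faces + objects) and face patch (faces vs bodies + objects) contrasts. This is a schematic Python sketch; the condition order and beta values are stand-ins, not the actual SPM contrast specification used in the study.

```python
import numpy as np

# Assumed condition order in the localizer GLM: monkey bodies, monkey faces, objects
conditions = ["bodies", "faces", "objects"]

# Contrast weights: condition of interest vs the average of the other two conditions
bp_contrast = np.array([1.0, -0.5, -0.5])   # body patches: bodies vs (faces + objects)
fp_contrast = np.array([-0.5, 1.0, -0.5])   # face patches: faces vs (bodies + objects)

betas = np.array([1.2, 0.4, 0.3])           # stand-in beta estimates for one voxel
print("BP contrast value:", bp_contrast @ betas)
print("FP contrast value:", fp_contrast @ betas)
```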
2.5. fMRI data acquisition and data analysis
2.5.1. fMRI data acquisition
fMRI data were acquired with a 3 Tesla full body scanner (Siemens, Prisma fit) using a gradient-echo T2∗-weighted echo-planar imaging sequence of 40 horizontal slices (repetition time [TR], 2 s; echo time [TE], 17 ms; 1.25 × 1.25 × 1.25 mm3 isotropic voxels) with a custom-built 8-channel phased-array receive coil and a saddle-shaped, radial transmit-only surface coil. Before each scanning session, an iron contrast agent (Molday ION, BioPAL) was injected intravenously (9–12 mg/kg) to enhance the spatial selectivity of the MR signal changes and accordingly improve the signal-to-noise ratio (Vanduffel et al., 2001). We inverted the sign of all beta values to account for the difference between iron contrast agent CBV and BOLD activation maps (i.e., increased brain activation produces a decrease in MR signal in iron contrast agent CBV maps).
2.5.2. fMRI data pre-processing and GLM fitting
Pre-processing started with realignment using statistical parametric mapping (SPM12) software. The motion-corrected images were next processed with non-rigid co-registration (using JIP, http://www.nmr.mgh.harvard.edu/∼jbm/jip/) to a template anatomy (M12, Ekstrom et al., 2008) for both individual and group data analysis. Functional volumes were then resliced to 1 mm3 isotropic and smoothed with a 1.5 mm (FWHM) Gaussian kernel using SPM12. A general linear model (GLM) was used for estimating the response amplitude at each voxel (using SPM12) following previously described procedures (Friston et al., 1995; Vanduffel et al., 2001). Stimulus conditions were modelled as boxcar functions convolved with a MION hemodynamic response function (Vanduffel et al., 2001). All nine conditions (see above) were modelled as regressors of interest. To account for head-motion and eye-movement related artifacts, six regressors corresponding to the three rotations and three translations along the x, y, and z axes and three regressors corresponding to the horizontal and vertical components of eye position and pupil diameter were included as covariates of no interest.
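To make the GLM setup concrete, the following Python sketch builds a toy design matrix with one boxcar condition regressor convolved with a gamma-shaped MION response kernel, nuisance regressors for motion and eye signals, and a sign flip of the resulting betas (since the MION CBV signal decreases with increased activation). The kernel parameters, onsets and data are illustrative assumptions; the actual analysis used SPM12 with the MION response function of Vanduffel et al. (2001).

```python
import numpy as np

TR, N_SCANS = 2.0, 275
t = np.arange(N_SCANS) * TR

def mion_kernel(time, tau=8.0, alpha=1.5):
    """Illustrative gamma-shaped MION response kernel (parameters are
    assumptions, not the exact kernel used in the study)."""
    h = (time / tau) ** alpha * np.exp(-time / tau)
    return h / h.sum()

def boxcar(onsets, duration, frame_times):
    """1 during condition blocks, 0 elsewhere."""
    reg = np.zeros_like(frame_times)
    for onset in onsets:
        reg[(frame_times >= onset) & (frame_times < onset + duration)] = 1.0
    return reg

# Example: one condition with two 30-s blocks (onsets hypothetical)
cond = np.convolve(boxcar([40.0, 220.0], 30.0, t), mion_kernel(t))[:N_SCANS]

# Nuisance regressors: 6 motion parameters + 3 eye signals (random stand-ins here)
nuisance = np.random.randn(N_SCANS, 9)
X = np.column_stack([cond, nuisance, np.ones(N_SCANS)])  # design matrix with constant

y = np.random.randn(N_SCANS)           # stand-in voxel time course
beta = np.linalg.lstsq(X, y, rcond=None)[0]
beta_cbv = -beta                        # sign inverted: CBV signal drops with activation
```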
2.5.3. Univariate whole-brain based analysis
We performed a fixed-effects group analysis with combined data from both monkeys, for both the pre- and post-training datasets. The contrast of all actions versus static controls was calculated and the significance level was set at p < 0.001, uncorrected. For display purposes, SPM T-contrast maps of the pre- and post-training scans were superimposed on each other and presented on a flattened 3D image of both template hemispheres (template M12) using Caret software (Fig. 3).
Fig. 3.
Whole-brain fMRI results for action observation. Univariate group (n = 2) brain activations (rendered on flattened M12 template brain) depicting main effect of monkey action observation versus static controls in left and right hemispheres, for pre- (orange) and post-training (turquoise) action observation fMRI scans. Yellow colors indicate overlap between pre- and post-training fMRI SPM T-maps. Maps are thresholded at p<0.001 (uncorrected). LuS – lunate sulcus, IOS – inferior occipital sulcus, IPS – intraparietal sulcus, STS – superior temporal sulcus, LS – lateral sulcus, CS – central sulcus, CiS – cingulate sulcus, PS – principal sulcus, ArS – arcuate sulcus. LH – left hemisphere, RH – right hemisphere. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
2.5.4. Definition of regions-of-interest (ROIs)
We defined several ROIs along the posterior to anterior extent of the entire STS. In the posterior portion of the STS, we examined a ROI corresponding to MT(V5) and adjacent FST. Delineation of these ROIs was based upon a previous macaque fMRI study examining motion and action-related responses in the STS (Nelissen et al., 2006). Based upon the same study of Nelissen et al. (2006), we delineated a ROI corresponding to the middle portion of STP in the upper bank of the STS, as well as the anterior portion of STP (termed UB1 in Nelissen et al., 2011). Using the body/face patch localizer (see above) by Popivanov et al. (2012), we functionally delineated the middle STS body patch MSB and the anterior STS body patch ASB by contrasting headless bodies vs objects and faces. Contrasting faces vs objects and bodies on the other hand allowed delineation of neighbouring face patches ML and AL in the lower bank of the STS. Finally, using anatomical landmarks, we defined a ROI located anterior to ASB on the lower bank and adjacent convexity of the inferotemporal cortex, corresponding to a portion of TEr. In order to examine in particular voxels yielding visual responses to our action videos, we selected those voxels from aforementioned STS ROIs that yielded responses for the main contrast of all actions vs static controls at the threshold of p < 0.05, uncorrected, from either the pre- or post-training fMRI datasets.
2.5.5. Univariate ROI-based analysis
Percentage signal change (PSC) per run within each ROI was calculated (using MarsBaR) for all action conditions against the corresponding control conditions, i.e., hand actions vs hand static, tail actions vs tail static. The PSCs were then used for further statistical analysis. For the pre- and post-training data sets independently, PSCs of hand and tail actions were averaged per action category per ROI to show the overall responses of each action category in the ROIs. Significantly stronger responses during action observation compared to static controls were assessed per ROI using one-tailed paired t-tests across runs on the averaged PSCs for each action category vs static controls. P-values less than 0.05, after FDR correction, were declared significant. To statistically investigate the categorization training effect on action observation, we performed a three-factor ANOVA on the PSCs per ROI, including both pre- and post-training action observation data sets. The three-factor ANOVA was a 2 (training: pre-training and post-training) × 3 (action category: grasp, touch, reach) × 2 (effector type: hand and tail) repeated-measures design. We were particularly interested in the main effect of training and the interaction effect between 'training' and 'action category'. The interaction effect between 'training' and 'effector type' was also examined as a control, since monkeys learned to categorize across action categories with effector type being irrelevant for the task.
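A minimal sketch of the 2 × 3 × 2 repeated-measures ANOVA described above, using statsmodels' AnovaRM on simulated per-run PSC values. For illustration we treat the run index as the repeated unit across all factor combinations, which is a simplification of the actual analysis (where pre- and post-training PSCs come from different runs); column names and values are stand-ins.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
rows = []
for run in range(1, 31):                          # runs as the repeated unit (simplification)
    for training in ["pre", "post"]:
        for category in ["grasp", "touch", "reach"]:
            for effector in ["hand", "tail"]:
                rows.append({"run": run, "training": training,
                             "category": category, "effector": effector,
                             "psc": rng.normal(0.5, 0.2)})   # stand-in PSC values
df = pd.DataFrame(rows)

# 2 (training) x 3 (action category) x 2 (effector) repeated-measures ANOVA
res = AnovaRM(df, depvar="psc", subject="run",
              within=["training", "category", "effector"]).fit()
print(res)   # main effects and interactions, incl. training x category
```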
2.5.6. Multivariate ROI-based decoding
We applied multi-voxel pattern analysis (MVPA) using a Matlab-based Decoding Toolbox (Hebart et al., 2015). Using a GLM fitting procedure similar to that described above (SPM12), T-contrast maps for the individual action conditions per run were computed to serve as input to a linear support vector machine (SVM) classifier for the ROI-based MVPA decoding. The contrasts for all actions were hand grasp vs hand static, hand touch vs hand static, hand reach vs hand static, tail grasp vs tail static, tail touch vs tail static, and tail reach vs tail static. We then applied a leave-one-run-out cross-validation scheme, where at each iteration the extracted t-value features from all-but-one runs were used to train the classifier while those from the remaining run were held out for testing, until all runs had been tested once. The classification accuracy averaged across all iterations was used as the decoding result. To determine the statistical significance of the ROI classification performance, we applied a permutation test in which the condition labels associated with each feature were randomly shuffled and the cross-validation classification scheme was applied at each iteration. This procedure was repeated 1000 times, yielding 1000 classification performance values and thus a null distribution of classification performance. P-values were calculated based on the performance of the original classification relative to this null distribution. To statistically compare decoding performance pre- vs post-training, we subtracted the classification accuracies of the two data sets, for both the original (labelled) classifications and the respective permutation analyses. P-values indicating similar or different classification performance between the pre- and post-training data sets were then calculated based on these subtractions (post-training minus pre-training) of the original classifications relative to the corresponding null distributions. ROIs with p-values less than 0.05, after FDR correction, were declared significant.
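The sketch below approximates the decoding pipeline described above in Python with scikit-learn: a linear SVM trained on per-run t-value patterns, leave-one-run-out cross-validation, and a label-permutation test for significance. The feature matrix is random stand-in data, and for brevity the example uses 100 permutations rather than the 1000 used in the study; this is an illustration of the scheme, not the Matlab Decoding Toolbox pipeline actually used.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut, permutation_test_score

rng = np.random.default_rng(1)
n_runs, n_conditions, n_voxels = 30, 6, 200              # stand-in dimensions
X = rng.normal(size=(n_runs * n_conditions, n_voxels))   # per-run, per-condition t-value patterns
y = np.tile(np.arange(n_conditions), n_runs)             # condition labels
groups = np.repeat(np.arange(n_runs), n_conditions)      # run index -> leave-one-run-out folds

clf = LinearSVC(max_iter=10000)
accuracy, perm_scores, p_value = permutation_test_score(
    clf, X, y, groups=groups, cv=LeaveOneGroupOut(),
    n_permutations=100, scoring="accuracy")               # 1000 permutations in the study
print(f"decoding accuracy = {accuracy:.3f}, permutation p = {p_value:.3f}")
```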
We performed ROI-based MVPA for all pairs of the six action conditions, as well as 'between action category', 'grasp vs non-grasp actions' and 'within action category' comparisons. For pair-wise binary decoding among all action conditions, 15 pairwise classifications were performed in total (Fig. 5; e.g., hand grasp vs hand touch, tail touch vs tail reach, etc.). For multi-class 'between action category' decoding, t-value features were generated by grouping conditions based on action categories irrespective of effector types, i.e., grasp action = hand grasp + tail grasp, touch action = hand touch + tail touch, reach action = hand reach + tail reach. The three t-value features were used for both training and testing the classifier. Since learning-induced changes in multi-class action decoding were weak, if present, and behavioural testing showed that generalization effects were strongest for grasping actions (compared to touch and reach), we also examined whether learning effects were more visible in changes in action representations for grasping vs the two other action classes. For this 'grasp vs non-grasp actions' decoding, t-value features for grasp observation vs touch + reach observation, respectively, were used for both training and testing the classifier. Since subjects learned to categorize actions irrespective of effector (hand or tail), we also examined to what extent categorization training might lead to changes in representations between effector types. Using multi-class 'within action category' decoding, we tested generalization across effectors within all three action categories (cross-effector decoding). For example, ROI t-maps of hand actions (grasp, touch, and reach) were used for training, while data from tail actions were used for testing the classifier. Classification performance from both directions (training with hand actions and testing with tail actions, and vice versa) was averaged.
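For the cross-effector ('within action category') decoding, a minimal sketch of the train-on-one-effector, test-on-the-other scheme with averaging over both directions is shown below; the data and dimensions are again stand-ins rather than the actual toolbox implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def cross_effector_accuracy(X_hand, y_hand, X_tail, y_tail):
    """Train on one effector's action-category patterns, test on the other,
    and average the accuracies from both directions."""
    accs = []
    for (Xtr, ytr, Xte, yte) in [(X_hand, y_hand, X_tail, y_tail),
                                 (X_tail, y_tail, X_hand, y_hand)]:
        clf = LinearSVC(max_iter=10000).fit(Xtr, ytr)
        accs.append(clf.score(Xte, yte))
    return np.mean(accs)

# Stand-in data: 30 runs x 3 categories per effector, 200 voxel features
rng = np.random.default_rng(2)
y = np.tile(np.arange(3), 30)                    # grasp / touch / reach labels
X_hand = rng.normal(size=(90, 200))
X_tail = rng.normal(size=(90, 200))
print(cross_effector_accuracy(X_hand, y, X_tail, y))
```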
Fig. 5.
Action decoding before and after categorization training in left hemisphere STS ROIs. Heatmaps showing single-subject decoding accuracies (monkey M1 lower left, monkey M2 upper right) for pair-wise decoding of the six actions (hand grasp - hG, tail grasp - tG, hand touch - hT, tail touch - tT, hand reach - hR, tail reach - tR) in the different STS ROIs. Per ROI, the left and right heatmaps indicate decoding before and after categorization training, respectively. Posterior = posterior portion of STS, anterior = anterior portion of STS, lower = lower bank of STS, upper = upper bank of STS.
2.5.7. Analysis of behavioural data for action categorization tasks
To monitor monkeys' learning progress during the categorization task, we only included completed trials and computed monkeys' performance for each action category per training session. Accuracy as a function of training session was plotted for each monkey (Fig. 2B). To summarize monkeys' performance during the categorization-generalization tests, we made confusion matrices (Fig. 2C-E) by computing the proportion of category choices for each presented category per test. Each confusion matrix shows the monkeys' categorization performance for the novel videos during the different generalization tests. The diagonal of a confusion matrix represents categorization accuracies for the corresponding conditions, and the remaining cells indicate error rates for incorrect selections of the other two targets.
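A small illustration of how such a confusion matrix can be computed from trial records (presented category vs chosen category), here with a handful of hypothetical trials using pandas; the diagonal gives the per-category accuracy and the off-diagonal cells the error rates.

```python
import pandas as pd

# Stand-in trial records from one generalization test
trials = pd.DataFrame({
    "presented": ["grasp", "grasp", "touch", "touch", "reach", "reach", "grasp", "reach"],
    "chosen":    ["grasp", "grasp", "reach", "touch", "reach", "touch", "grasp", "reach"],
})

# Rows: presented category; columns: chosen category; values: proportion of choices
confusion = pd.crosstab(trials["presented"], trials["chosen"], normalize="index")
print(confusion)   # diagonal = categorization accuracy, off-diagonal = error rates
```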
During the generalization tests, both monkeys showed a strong bias for the upper target for trials involving novel videos (a typical example: generalization test of ⑨ in M1, Fig. 2C). To demonstrate this bias, we calculated the selection ratio of the three targets from incorrect trials of the novel videos (Suppl. Fig. 2). We used binomial tests to quantify monkeys' performance during the generalization tests. Binomial testing was performed on the proportion of correct trials for each action category per generalization test against chance level (33.33%), using the normal approximation z = (p̂ − p₀) / √(p₀(1 − p₀)/n), where p̂ is the proportion of correct trials, p₀ the chance level, and n the total number of trials. One-tailed p-values were obtained from the corresponding z values. P-values less than 0.05 indicate significantly higher performance than chance level.
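The sketch below implements this test for a hypothetical trial count, both with the normal-approximation z formula given above and, for comparison, with scipy's exact one-tailed binomial test; the counts are purely illustrative.

```python
from math import sqrt
from scipy.stats import norm, binomtest

def generalization_p(n_correct, n_trials, chance=1/3):
    """One-tailed test that performance exceeds chance: the normal
    approximation used in the text, plus scipy's exact binomial test."""
    p_hat = n_correct / n_trials
    z = (p_hat - chance) / sqrt(chance * (1 - chance) / n_trials)
    p_normal = norm.sf(z)                                  # one-tailed p from z
    p_exact = binomtest(n_correct, n_trials, chance, alternative="greater").pvalue
    return z, p_normal, p_exact

# Example: 60 correct out of 120 novel grasp trials (hypothetical counts)
print(generalization_p(60, 120))
```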
2.5.8. Analysis of eye movement data from action observation fMRI experiments
To quantify the eye movement behavior of the monkey subjects, we calculated the overall % fixation and the average number of saccades per minute during observation of each of the six types of action videos, both for the pre- and post-training action observation fMRI runs. A one-way repeated-measures ANOVA performed on the eye data revealed no significant differences in fixation behavior across the individual action conditions either pre-training (% fixation: M1= 95.94%, p = 0.63; M2 = 95.42%, p = 0.79; # sacc/min: M1= 8.22, p = 0.6; M2 = 6.01, p = 0.66) or post-training (% fixation: M1= 95.68%, p = 0.095; M2 = 96.24%, p = 0.66; # sacc/min: M1= 8.92, p = 0.1; M2 = 6.02, p = 0.11). Comparing pre- and post-training eye data (2 × 3 × 2 factorial repeated-measures ANOVA, main effect of 'training') showed no significant difference in the number of saccades per minute in either subject (M1: 8.22 vs 8.92, p = 0.19; M2: 6.01 vs 6.02, p = 0.97) and a slightly better % fixation post-training in M2 only (M1: 95.94% vs 95.68%, p = 0.46; M2: 95.42% vs 96.24%, p = 0.021).
3. Results
3.1. Action categorization training and generalization testing
Monkeys were trained to categorize videos (Fig. 1C) depicting human actors performing either a grasp, touch, or reach hand action (Fig. 1D). Training started with only one set of action videos (depicting a female actor performing a grasp, touch or reach action; Fig. 1D①). As monkeys' performance progressed, three additional video sets (depicting either the same female actor with a novel object or a novel actor, Fig. 1D②-④) were introduced gradually. Fig. 2B shows the learning curves of both monkeys during initial task training. Black arrows and numbers in Fig. 2B indicate the sessions at which the respective new videos were introduced. We trained both monkeys with the four video sets (12 videos), which included two actors and two objects (Fig. 1D), for a total of 39 and 34 sessions (for monkeys M1 and M2, respectively), until both monkeys reached a stable performance level of above 80% correct responses per action category (Fig. 2B). Afterwards, we applied generalization tests with untrained videos to evaluate if and to what extent monkeys could generalize to untrained and novel examples of the action categories. These generalization tests included videos depicting human actors (Fig. 2C), as well as the monkey 3D videos used in the action observation fMRI experiments (Fig. 2D). For the generalization tests involving videos of human actors, we tested effects of changing objects (different color or shape of object), actors (female and male), effector (right hand instead of left hand) and visual field/viewpoint (mirror image of trained videos). Both monkeys' performance during these generalization tests is shown in Fig. 2C-D. Asterisks indicate significant generalization (binomial test, p<0.05). Both subjects generalized to novel examples (Fig. 2C). In both subjects, generalization was observed in most instances for novel examples of grasping actions and to a lesser extent for novel touch and reach stimuli involving human actors (Fig. 2C). Generalization testing involving the animated monkey action videos also showed that monkeys could categorize in particular the monkey hand and tail grasping actions (Fig. 2D). During the generalization tests, both monkeys showed a strong bias towards the upper target. Supplementary Fig. 2 shows that both monkeys selected the upper target in most incorrect trials across all tests (on average: M1: 3.3% target left, 84.21% target up, 12.49% target right; M2: 4.13% target left, 83.11% target up, 12.75% target right). Due to this upper target bias, except for the generalization tests in which monkeys categorized the three action categories correctly, it is difficult to conclude to what extent the monkeys could discriminate between the visually similar touch and reach actions (Fig. 2C-D). After generalization testing, additional categorization training sessions with all six monkey action videos were conducted until both monkeys could correctly categorize all monkey action videos (Fig. 2E).
3.2. Whole brain fMRI responses to observed actions before and after categorization training
We first employed univariate analyses to examine to what extent observation of grasp, touch and reach actions yielded responses throughout the monkey brain, both before and after categorization training. Due to the visual field asymmetry in the presentation of the action stimuli (Fig. 1A), fMRI activity was biased to the contralateral (left) hemisphere (Fig. 3). In line with previous monkey fMRI action observation studies (Nelissen et al., 2005, 2011; Sliwa and Freiwald, 2017; Fiave et al., 2018; Sharma et al., 2018, 2019; Cui and Nelissen, 2021; Fiave and Nelissen, 2021), action observation (main effect of hand and tail actions compared to static controls, fixed-effects, n = 2) yielded responses in a typical broad network including early visual, STS, parietal, somatosensory, premotor and prefrontal regions (Fig. 3). In terms of spatial extent, overall brain responses during action observation before (Fig. 3, orange) and after (Fig. 3, turquoise) categorization training were to a large extent overlapping (Fig. 3, yellow).
3.3. Action representations in STS regions-of-interest before and after categorization training
In order to examine action representations and potential training-induced changes throughout the STS in more detail, we examined fMRI brain activity in different STS ROIs, including posterior STS motion sensitive regions (MT and FST), body patches (MSB and ASB) and adjacent face patches (ML and AL), upper bank STS regions STPm and STPa, and anterior TE (TEr) located in the lower bank and adjacent convexity anterior to ASB (see methods; Fig. 4A). Before categorization training, observation of grasp, touch or reach actions yielded significant responses in most left hemisphere STS ROIs. In general, grasping actions yielded the strongest responses, followed by touch and reach actions, in most ROIs (Fig. 4B). In posterior STS, MT and FST yielded significant responses to all three action categories (Fig. 4B). Also, both body patches MSB and ASB showed robust responses to all action categories, while weaker action observation-related responses were observed in neighbouring face patches ML and AL. In the upper bank, in particular the middle portion of STP (STPm) yielded strong responses to the three hand actions, with more modest responses for STPa. Anterior TE (TEr) showed overall modest (monkey M1) to weak (monkey M2) responses to the action videos compared to the other STS regions (Fig. 4B).
Fig. 4.
ROI-based univariate results in left hemisphere STS regions. A. Illustration of the locations of our predefined ROIs in/near the STS on the inflated/flattened M12 left hemisphere template brain. These ROIs included MT and FST in the posterior STS, body patches MSB and ASB, face patches ML and AL, STPm and STPa in the upper bank of the STS, and TEr in the anterior portion of the lower bank and adjacent convexity. B. Histogram plots showing % MR signal change in the different left hemisphere STS ROIs for observing grasp, touch and reach actions (hand and tail combined) versus static controls, from pre-training (orange colors) and post-training (turquoise colors) action observation data. M1 – monkey M1, M2 – monkey M2. Asterisks indicate significantly stronger responses to dynamic action observation compared to static controls (one-tailed paired t-test, FDR corrected). C. Main effect of training from the three-factor ANOVA (see methods). D. Interaction effect between 'training' and 'action category' from the three-factor ANOVA (see methods). Asterisks indicate significance level p<0.05 (FDR corrected). Yellow shaded regions indicate ROIs that showed significant effects in both subjects. E. Interaction effect between 'training' and 'effector type' from the three-factor ANOVA. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Action observation responses were in general larger post-training compared to pre-training in both monkey subjects (Fig. 4B), as evidenced by a main effect of training that was found in both animals in posterior STS ROIs MT and FST, STPm in the middle portion of the upper bank, middle STS face patch ML and both the middle (MSB) and anterior (ASB) body patches (Fig. 4C; Suppl. Table 1). In both monkey subjects, a significant interaction effect between 'training' and 'action category' (Fig. 4D, Suppl. Table 1) was observed in MT, FST, body patches MSB and ASB, as well as face patch ML. As expected, due to the asymmetry in stimulus presentation, effects were more modest in the right hemisphere (Suppl. Fig. 3A, Suppl. Table 1), yielding a consistent main effect of training in both subjects in FST and the two body patches MSB and ASB (Suppl. Fig. 3B, Suppl. Table 1). Consistent action category × training interaction effects were, on the other hand, observed in both subjects in right hemisphere MT, FST, face patch ML and body patch ASB (Suppl. Fig. 3C, Suppl. Table 1). Importantly, none of the left or right hemisphere ROIs showed an effector × training interaction effect (Fig. 4E; Suppl. Fig. 3D; Suppl. Table 1), indicating that interaction effects were limited to the trained stimulus dimension (action category).
In addition to the univariate analyses, we employed MVPA to examine to what extent the three action categories could be decoded from the different STS ROIs and to assess whether categorization training led to significant changes in these decoding results. First, we employed binary decoding analyses for all pairs of the six action conditions to examine to what extent different actions could be decoded from each other. Fig. 5 shows that before categorization training, most pairs of action conditions could be decoded well above chance in particular from the lower bank/fundus of posterior and middle STS, including MT, FST, body patch MSB and face patch ML (Fig. 5, left heatmaps per ROI). Pair-wise decoding was much weaker in upper bank regions STPm and STPa, as well as in the anterior lower bank regions, including body patch ASB, face patch AL and TEr. Post-categorization training, pair-wise decoding of the action conditions (Fig. 5, right heatmaps per ROI) did not reveal consistent changes across subjects in any of the STS ROIs. In line with the univariate results, pair-wise decoding of all six action conditions for the right hemisphere ROIs (Suppl. Fig. 4) was in general weaker compared to the corresponding left hemisphere ROIs.
To examine in more detail the possible changes in action representations due to categorization training, we performed additional specific decoding analyses. First, we examined multi-class decoding testing the distinctness of representations of the three action categories (grasping, touching and reaching) and possible changes in these representations due to training (between-category separation). Multi-class decoding suggested that before categorization training, action categories could be decoded above chance from left hemisphere MT, FST, face patches ML and AL, body patches MSB and ASB, in addition to STPa, in both subjects (Suppl. Table 2, 'Between action category'). Categorization training did not result in significant changes in the distinctness of multi-voxel action representations in these left hemisphere ROIs (Suppl. Table 2, 'Between action category'). In the right hemisphere, action categories could be decoded above chance pre-training in both subjects in FST, face patches ML and AL and body patch MSB (Suppl. Table 2, 'Between action category'). Although after categorization training, multi-class decoding in right hemisphere FST, face patch ML and body patches MSB and ASB showed increased classification accuracies compared to pre-training in both subjects, these training effects were modest and in general did not reach significance when directly comparing pre- vs post-training accuracies (Suppl. Table 2; 'Between action category'). Since multi-class decoding between action categories did not suggest clear training-related changes in action representations and given that generalization testing showed the most robust effects for grasping actions in particular, we also examined binary decoding of grasping actions vs touch + reach actions before and after training (Suppl. Table 2; 'Grasp vs non-grasp actions'). Grasping actions could be discriminated from the other two categories in most STS ROIs both before and after training. Although STS body patches showed increased decoding accuracies with training in most instances (7 out of 8 cases), this effect was only significant in right hemisphere ASB and STP ROIs in M2 (Suppl. Table 2; 'Grasp vs non-grasp actions'), similar to the training effects found in multi-class between action category decoding (Suppl. Table 2, 'Between action category').
Finally, we performed cross-exemplar MVPA to look for evidence of within-category compression after training by examining to what extent action categories could be decoded across exemplars (train on hand actions, test on tail actions, and vice versa) before and after training. In the left hemisphere, significant within action category cross-decoding was possible from the pre-training data set in both subjects from MT and FST, face patch ML, body patch MSB, TEr and STPm (Suppl. Table 2, 'Within action category'). Post-training, cross-exemplar decoding of the action categories yielded significant decoding accuracies in both subjects in MT, FST, face patches ML and AL and the middle (MSB) and anterior (ASB) body patches. However, direct comparison between pre- and post-training decoding accuracies did not reveal consistent significant increases in cross-exemplar decoding in either the left or right hemisphere (Suppl. Table 2, 'Within action category').
4. Discussion
4.1. Representation of bodies and actions in macaque STS
Some of the first evidence suggesting the role of the STS in representing observed body parts and actions was already obtained more than four decades ago. Since this early demonstration of STS neurons showing responses to a walking person (Bruce et al., 1981), subsequent single cell studies showed that the macaque STS contained neurons in both banks and fundus responding to a range of stimuli involving bodies or body parts, either presented statically or in motion (Perrett et al., 1989; Oram and Perrett, 1996; Barraclough et al., 2006; Jellema and Perrett, 2006; Vangeneugden et al., 2009; Singer and Sheinberg, 2010).
In line with these previous reports, our findings suggest that observation of dynamic actions yields responses throughout a large portion of the lower bank and fundus, including posterior MT and FST, in addition to the middle and anterior body patches. In the upper bank, on the other hand, the middle portion of STP yielded stronger fMRI activity compared to more anterior parts. Also, MVPA analyses showed that different observed dynamic actions could be decoded above chance in particular from the posterior/middle STS ROIs in the lower bank and fundus. While cell recordings suggest many STS neurons respond to the mere presentation of moving bodies or body parts, the STS also contains neurons showing sensitivity to body part–object interactions. For instance, neural responses in the STS to manipulative hand actions involving reach and grasp or touch were shown first by single cell recordings (Perrett et al., 1989) and later also demonstrated using monkey fMRI (Nelissen et al., 2006, 2011; Sharma et al., 2018; Sliwa and Freiwald, 2017; Fiave and Nelissen, 2021). Most of the STS ROIs we examined showed overall stronger responses to observation of grasping actions (vs static controls) compared to the two other hand actions (Fig. 4; Suppl. Fig. 3). These findings fit with previous electrophysiology findings from several other parts of the action observation network, including parietal and premotor cortex, suggesting that within the population of neurons responding to observed manipulative hand actions, grasping seems overrepresented (Lanzilotto et al., 2019; Gallese et al., 1996).
Functional MRI investigations in awake macaques also showed that, besides the presence of face patches, the STS also houses body category-selective patches (Tsao et al., 2003). Since this initial demonstration, several groups have shown the presence of these STS body patches, using a range of different stimuli and contrasts (for extensive review see Vogels, 2022). While the exact number of macaque STS body patches is currently not clear, it seems that in particular the middle STS body patch (MSB), neighbouring face patch ML, and the anterior body patch (ASB), found close to the anterior face patch AL, are the two most consistently observed body patches across studies (Vogels, 2022; Popivanov et al., 2012; Fischer and Freiwald, 2015).
Overall, our results confirm the key role of middle and anterior STS ROIs in processing actions. While area MT represents the first step of motion processing within macaque STS, visual information is sent from area MT to multiple higher order motion sensitive STS regions, giving rise to at least two distinct motion pathways. While MST areas in the posterior portion of the STS receive input from MT and are involved in control of locomotion/heading and pursuit of small objects (Saito et al., 1986; Komatsu and Wurtz, 1988), area MT also projects to more anterior and ventral regions, including FST, STP and more anterior lower bank regions, dedicated to the visual analysis of actions (Nelissen et al., 2006; 2011; Pitcher and Ungerleider, 2021). Although area FST in the fundus of the STS has not been fully examined for the presence of body or action selective responses, FST neurons' sensitivity to three-dimensional shape from motion and actions suggests a role in the processing of biological motion (Mysore et al., 2010; Vanrie and Verfaillie, 2006). Previous single cell recordings have suggested that area STP, which occupies the middle and anterior portion of the upper bank of the STS, plays a role in processing actions and biological motion (for review see Puce and Perrett, 2003; Oram and Perrett, 1994), yet it is not clear from these earlier single cell examinations to what extent action responses are more prevalent in specific parts or are otherwise spread out across the entire length of area STP. As with motion-sensitive responses, which seem to be most evident in the middle portion of STP (Nelissen et al., 2006), conspecifics' actions also seem to elicit the strongest visual responses in this middle portion of STP, in line with previous macaque fMRI studies (Nelissen et al., 2006; 2011).
So far, most studies examining responses in body patches used static stimuli and only a few studies have examined to what extent monkey STS body patches respond to dynamic bodies or actions. Jastorff et al. (2012) found that MSB and ASB respond more strongly to biological motion point-lights compared to scrambled or translation control stimuli. Although it is currently not clear to what extent motion cues contribute to responses of neurons in body patches, our finding that at the voxel level both STS body patches respond more strongly to dynamic actions of a conspecific as compared to a static image of this conspecific might reflect the fact that action videos drive many subpopulations of body patch neurons, each tuned to specific silhouettes of a body (Popivanov et al., 2015). Only a few electrophysiology studies have been conducted to examine the similarities and differences between MSB and ASB body patch cells with respect to coding bodies and body parts (for review see Vogels, 2022), and the detailed computations performed by both regions and their contribution to body and action recognition are still not clear. Single cell investigations suggest that while both MSB and ASB neurons are selective for body postures and their identity, body posture and identity can be better decoded from a population of ASB neurons than from MSB neurons (Kumar et al., 2017). Our findings suggest that at the population level, different dynamic action categories can also be decoded from both body patches, with seemingly stronger and more elaborate pairwise action decoding in the middle compared to the anterior body patch (Fig. 5). Since making claims about underlying single cell properties from fMRI voxel-based decoding results is not always straightforward (Dubois et al., 2015), future invasive recordings could shed further light on to what extent body patches represent different action categories at the single cell level and whether these representations differ between middle and anterior STS body patches. The finding that face patches ML and AL also showed modestly stronger responses to bodily actions compared to a static body is in line with previous findings suggesting heterogeneous responses in category-selective fMRI patches (Popivanov et al., 2014; Bell et al., 2011; Meyers et al., 2015) and fits with the finding that body information is not processed completely independently of the face patch system but is integrated with it (Fischer and Freiwald, 2015). Although our current dynamic and static stimuli always contained the same (static) face presented parafoveally, it is possible that action responses in the ML and AL face patches were driven partly by the fact that dynamic bodies performing actions (compared to a static image) might provide stronger contextual cues indicative of the presence of a face, given previous findings suggesting context can cause face patch activity modulations even in the absence of a face stimulus (Cox et al., 2004; Arcaro et al., 2020).
4.2. Action categorization training and generalization
In line with previous results showing that monkeys can learn to discriminate grasping from non-grasping hand actions (Nelissen and Vanduffel, 2017) and left vs right or forward vs backward walking actions (Vangeneugden et al., 2010), with a certain degree of generalization to untrained exemplars, our current findings show that monkeys are able to discriminate between grasp, touch and reach actions after training. Although we trained with only a small stimulus set for a limited period of time, both monkeys became proficient in the three-way discrimination task after 30 to 40 training sessions. While the grasp category seemed the most straightforward to learn, monkeys had more difficulty discriminating the subtle difference between an effector touching an object versus merely reaching towards this object. It is difficult to conclude to what extent monkeys relied on local spatial or motion cues to solve the categorization task and generalization tests. Behavioral results from the generalization tests showed that monkeys had most problems generalizing to novel exemplars of the touch and reach categories, which indeed were highly similar with respect to shape and local motion features. The fact that the action category that was most distinct in terms of local shape features (grasping) also resulted in the most prominent generalization suggests monkeys might have relied on the shape of the hand during the final part of the action. This assumption is in line with a previous behavioral study in monkeys that examined generalization after categorization training to leftward vs rightward or forward vs backward walking (Vangeneugden et al., 2010). The results of that study suggest that monkeys learned discriminations much faster when they could rely on form cues rather than motion cues to solve the task. It should be noted that in this study we only tested generalization across a limited set of features such as object color, object identity, actor identity and effector type. Future behavioral studies are needed to examine the extent of generalization across other highly relevant dimensions such as viewpoint (Fiave and Nelissen, 2021), object affordances or social context. It is plausible that using more visually distinct action categories, as well as a longer training duration in combination with a larger and more diverse stimulus training set, might result in more extensive generalization to novel exemplars (Fabre-Thorpe, 2003).
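As a simple illustration of how such generalization performance can be summarized per category, the hedged sketch below computes a confusion matrix over the three-way choices, restricted to novel exemplars. The trial representation (tuples of presented category, chosen category and a novel-exemplar flag) is an assumption for illustration only, not the format of our behavioral logs.

```python
# Minimal sketch: per-category confusion matrix for the three-way categorization,
# computed over novel (untrained) exemplars only.
import numpy as np

CATEGORIES = ["grasp", "touch", "reach"]

def confusion_matrix(trials, novel_only=True):
    """Rows: presented category, columns: chosen category, values: proportions."""
    counts = np.zeros((3, 3))
    for true_cat, chosen_cat, is_novel in trials:
        if novel_only and not is_novel:
            continue
        counts[CATEGORIES.index(true_cat), CATEGORIES.index(chosen_cat)] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)  # avoid division by zero
```

The diagonal of this matrix gives per-category generalization accuracy, and the off-diagonal touch/reach cells make confusions between the two visually similar categories directly visible.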
4.3. Categorization training-induced effects in STS
Univariate analysis suggested a main effect of training in several STS ROIs, including MT, FST, middle face patch ML, the middle and anterior body patches and the middle portion of STP (Nelissen et al., 2006). Except for STPm, these regions also showed a consistent action × training interaction effect. Importantly, this interaction effect was limited to the trained stimulus dimension (action category), since no training × effector interaction effects were observed in any of the ROIs. Our findings are in agreement with an earlier human study employing fMRI adaptation to examine changes in fMRI brain responses due to discrimination training of complex motion patterns (Jastorff et al., 2009). Jastorff et al. (2009) demonstrated learning-related effects not only in visual regions known to be involved in processing biological motion (pSTS and FBA), but also in earlier areas like human MT/V5+, which is involved in processing simple motion patterns. The findings by Jastorff et al. (2009) and our current results suggest a role of human and macaque STS in visual learning of biological motion or actions that is crucial for action recognition. Although differences in attention between pre- and post-training action observation scans cannot be ruled out completely, analysis of eye movement behavior (see methods) did not reveal a significant difference in the number of saccades between pre- and post-training scans, and showed only a slightly higher fixation percentage post-training in one subject (M2: average of 95.42% pre-training vs 96.24% post-training). Furthermore, the specificity of the observed interaction effects (Fig. 4D-E; an interaction effect for training × action category but no interaction effect for training × effector) does not suggest a general attention effect during the post-training scans was the sole cause of the increased fMRI responses post-training.
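The logic of the interaction test can be illustrated at the ROI level with a small sketch: a two-way analysis of variance with factors action category and training stage, where a significant action × training term indicates that training modulated responses differently across action categories. This is a schematic illustration assuming a long-format table of ROI response estimates (a pandas DataFrame `df` with columns 'response', 'action' and 'training'), not the voxel-wise GLM analysis actually used here.

```python
# Minimal sketch of an action x training interaction test on ROI response amplitudes.
# `df` is assumed to be a pandas DataFrame with one row per condition estimate and
# columns: 'response' (float), 'action' (grasp/touch/reach), 'training' (pre/post).
import statsmodels.api as sm
from statsmodels.formula.api import ols

def interaction_test(df):
    model = ols("response ~ C(action) * C(training)", data=df).fit()
    # The row labeled 'C(action):C(training)' holds the interaction term.
    return sm.stats.anova_lm(model, typ=2)
```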
Learned categorical perception involves members of different categories being perceived as more dissimilar (between-category separation) and/or members of the same category being perceived as more similar (within-category compression) (Harnad, 1987; Livingston et al., 1998; Goldstone and Hendrickson, 2010; Juárez et al., 2019). In order to examine the neural correlates of these phenomena, we also employed multivariate fMRI techniques to examine specifically whether between-category separation (increased decoding of action categories after training) was evident when comparing pre- and post-categorization training data across the STS ROIs. In addition, we employed cross-condition MVPA to look for evidence of within-category compression, by examining whether cross-condition (hand to tail, or vice versa) decoding of action category or identity (grasp vs touch vs reach) became stronger after categorization training. Although both body patches, in addition to several other STS regions, allowed between-category action decoding, no clear between-category separation effects (as evident from significantly increased between-category decoding between pre- and post-categorization training datasets) were observable in the multi-voxel patterns (Suppl. Table 2, ‘Between action category’). Since behavioral tests (Fig. 2) showed that distinctions between the visually similar touch and reach actions (Fig. 1B) were more difficult to learn, we also examined whether training effects might be more evident when comparing the distinctness of voxel patterns for observation of grasping vs non-grasping (touch + reach) actions between pre- and post-training data. Although the STS body patches in particular showed increased decoding accuracies post-training in both subjects (7/8 cases), these differences were in general weak (Suppl. Table 2, ‘Grasp vs non-grasp actions’).
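One way to formalize the "increased decoding after training" comparison is sketched below: a permutation test on the difference between pre- and post-training cross-validated decoding accuracies. The per-fold accuracy arrays are assumed inputs (e.g. produced by the pairwise decoding sketch above), and the approach is illustrative rather than the exact statistics reported in Suppl. Table 2.

```python
# Minimal sketch: permutation test on the difference between mean decoding
# accuracies obtained pre- and post-training for a given ROI and contrast.
import numpy as np

def permutation_diff_test(scores_pre, scores_post, n_perm=10000, seed=0):
    """Return observed post-minus-pre difference and a one-sided permutation p-value."""
    rng = np.random.default_rng(seed)
    observed = scores_post.mean() - scores_pre.mean()
    pooled = np.concatenate([scores_pre, scores_post])
    n_pre = len(scores_pre)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(pooled)                    # shuffle pre/post assignment
        null[i] = perm[n_pre:].mean() - perm[:n_pre].mean()
    p_value = (null >= observed).mean()                   # how often the null matches or exceeds it
    return observed, p_value
```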
As with the weak between-category separation effects in the multi-voxel patterns, no clear within-category compression effects (increased cross-exemplar decoding between pre- and post-categorization training datasets) were readily observed in the multi-voxel patterns either (Suppl. Table 2, ‘Within action category’). Although we focused on learning-induced changes in macaque STS in this study, preliminary analysis suggests that a consistent main effect of training and an action × training interaction effect were also present in parietal area AIP. Future studies combining mapping of the functional motor networks related to reach, touch and grasp actions with action categorization tasks in the same subjects would be needed to examine in detail to what extent categorization learning also yields changes in the motor nodes of the AON.
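The cross-condition decoding approach referred to above can be sketched as follows: an action-category classifier is trained on patterns from one effector (e.g. hand) and tested on the other (e.g. tail), so that above-chance generalization must rely on effector-invariant action information. Variable names and the data layout are again assumptions made for illustration, not our implementation.

```python
# Minimal sketch of cross-condition (cross-effector) decoding of action category.
# Assumes `patterns` (n_trials x n_voxels), `action_labels` (grasp/touch/reach per
# trial) and `effector_labels` ('hand' or 'tail' per trial) as numpy arrays.
import numpy as np
from sklearn.svm import LinearSVC

def cross_effector_decoding(patterns, action_labels, effector_labels):
    """Train on one effector, test on the other, average over both directions."""
    accuracies = []
    for train_eff, test_eff in [("hand", "tail"), ("tail", "hand")]:
        train = effector_labels == train_eff
        test = effector_labels == test_eff
        clf = LinearSVC(C=1.0, max_iter=10000)
        clf.fit(patterns[train], action_labels[train])
        accuracies.append(clf.score(patterns[test], action_labels[test]))
    return float(np.mean(accuracies))  # chance = 1/3 for the three action categories
```

Comparing this cross-effector accuracy between pre- and post-training datasets (for instance with the permutation test sketched earlier) is one way to quantify whether within-category generalization strengthened with learning.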
In line with our findings, a recent human study that examined observational learning-induced changes in fronto-parietal neural representations of action sequences also did not find strong changes in representational dissimilarity, either between trained and untrained sequences or between pre- and post-training data, using MVPA ROI or searchlight techniques (Apšvalka et al., 2018). In addition, previous studies examining the effect of categorization learning on shape representations in IT neurons suggest these effects are subtle and may easily be missed using fMRI (Freedman et al., 2003; de Baene et al., 2008; Op de Beeck et al., 2008). Guided by fMRI, more focal techniques that allow recording from multiple single units, or mesoscale optical imaging using viral techniques (Li et al., 2017; Kim and Schnitzer, 2022), might be crucial to examine to what extent categorization training-related changes in action representations occur in the STS body patches or adjacent regions. Moreover, these invasive techniques would allow tracking changes in action representations at a much finer temporal scale, even during the actual categorization training stages. Ultimately, the combination of active categorization tasks as used in our study with precise focal perturbation techniques like optogenetics or microstimulation and fMRI will be crucial for shedding light on the causal role of body patches or other STS regions in action recognition and the extent to which their integrity is necessary for normal functioning of downstream action observation related brain regions.
CRediT authorship contribution statement
Ding Cui: Conceptualization, Methodology, Investigation, Formal analysis, Visualization, Writing – original draft, Writing – review & editing. Lotte Sypré: Investigation, Methodology. Mathias Vissers: Methodology, Investigation. Saloni Sharma: Investigation. Rufin Vogels: Methodology, Writing – original draft. Koen Nelissen: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Supervision, Funding acquisition.
Acknowledgments
We thank W. Depuydt, M. De Paep, P.A. Fiave, S. Kumar, I. Popivanov, Y. Zafirova, T. Yao, C. Fransen, A. Hermans, P. Kayenbergh, G. Meulemans, I. Puttemans, C. Ulens, M. Verbeeck and S. Verstraeten for technical assistance.
Funding
This work was supported by Fonds Wetenschappelijk Onderzoek Vlaanderen (G.0.622.08; G.0.593.09; G.0.854.19) to KN, the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 856495) to RV, and KU Leuven (C14/17/109; C14/21/111) to KN and RV.
Appendix. Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.neuroimage.2022.119780.
Data availability
Data will be made available on request.
References
- Apšvalka D., Cross E.S., Ramsey R. Observing action sequences elicits sequence-specific neural representations in frontoparietal brain regions. J. Neurosci. 2018;38:10114–10128. doi: 10.1523/JNEUROSCI.1597-18.2018.
- Arcaro M.J., Ponce C., Livingstone M. The neurons that mistook a hat for a face. Elife. 2020;9:e53798. doi: 10.7554/eLife.53798.
- Balser N., Lorey B., Pilgramm S., Stark R., Bischoff M., Zentgraf K., Williams A.M., Munzert J. Prediction of human actions: expertise and task-related effects on neural activation of the action observation network. Hum. Brain Mapp. 2014;35:4016–4034. doi: 10.1002/hbm.22455.
- Bao P., Tsao D.Y. Representation of multiple objects in macaque category-selective areas. Nat. Commun. 2018;9:1–16. doi: 10.1038/s41467-018-04126-7.
- Barraclough N.E., Keith R.H., Xiao D., Oram M.W., Perrett D.I. Visual adaptation to goal-directed hand actions. J. Cogn. Neurosci. 2009;21:1806–1820. doi: 10.1162/jocn.2008.21145.
- Barraclough N.E., Xiao D., Oram M.W., Perrett D.I. The sensitivity of primate STS neurons to walking sequences and to the degree of articulation in static images. Prog. Brain Res. 2006;154:135–148. doi: 10.1016/S0079-6123(06)54007-5.
- Bell A.H., Hadj-Bouziane F., Frihauf J.B., Tootell R.B.H., Ungerleider L.G. Object representations in the temporal cortex of monkeys and humans as revealed by functional magnetic resonance imaging. J. Neurophysiol. 2009;101:688–700. doi: 10.1152/jn.90657.2008.
- Bell A.H., Malecek N.J., Morin E.L., Hadj-Bouziane F., Tootell R.B.H., Ungerleider L.G. Relationship between functional magnetic resonance imaging-identified regions and neuronal category selectivity. J. Neurosci. 2011;31:12229–12240. doi: 10.1523/JNEUROSCI.5865-10.2011.
- Bruce C., Desimone R., Gross C.G. Visual properties of neurons in a polysensory area in superior temporal sulcus of the macaque. J. Neurophysiol. 1981;46:369–384. doi: 10.1152/jn.1981.46.2.369.
- Calvo-Merino B., Glaser D.E., Grèzes J., Passingham R.E., Haggard P. Action observation and acquired motor skills: an fMRI study with expert dancers. Cereb. Cortex. 2004;15:1243–1249. doi: 10.1093/cercor/bhi007.
- Cox D., Meyers E., Sinha P. Contextually evoked object-specific responses in human visual cortex. Science. 2004;304:115–117. doi: 10.1126/science.1093110.
- Cui D., Nelissen K. Examining cross-modal fMRI adaptation for observed and executed actions in the monkey brain. Neuroimage. 2021;233. doi: 10.1016/j.neuroimage.2021.117988.
- de Baene W., Ons B., Wagemans J., Vogels R. Effects of category learning on the stimulus selectivity of macaque inferior temporal neurons. Learn. Mem. 2008;15:717–727. doi: 10.1101/lm.1040508.
- Dubois J., de Berker A.O., Tsao D.Y. Single-unit recordings in the macaque face patch system reveal limitations of fMRI MVPA. J. Neurosci. 2015;35:2791–2802. doi: 10.1523/JNEUROSCI.4037-14.2015.
- Ekstrom L.B., Roelfsema P.R., Arsenault J.T., Bonmassar G., Vanduffel W. Bottom-up dependent gating of frontal signals in early visual cortex. Science. 2008;321:414–417. doi: 10.1126/science.1153276.
- Fabre-Thorpe M. Visual categorization: accessing abstraction in non-human primates. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2003;358:1215–1223. doi: 10.1098/rstb.2003.1310.
- Fiave P.A., Nelissen K. Motor resonance in monkey parietal and premotor cortex during action observation: influence of viewing perspective and effector identity. Neuroimage. 2021;224. doi: 10.1016/j.neuroimage.2020.117398.
- Fiave P.A., Sharma S., Jastorff J., Nelissen K. Investigating common coding of observed and executed actions in the monkey brain using cross-modal multi-variate fMRI classification. Neuroimage. 2018;178:306–317. doi: 10.1016/j.neuroimage.2018.05.043.
- Fischer C., Freiwald W.A. Whole-agent selectivity within the macaque face-processing system. Proc. Natl. Acad. Sci. U. S. A. 2015;112:14717–14722. doi: 10.1073/pnas.1512378112.
- Freedman D.J., Riesenhuber M., Poggio T., Miller E.K. A comparison of primate prefrontal and inferior temporal cortices during visual categorization. J. Neurosci. 2003;23:5235–5246. doi: 10.1523/JNEUROSCI.23-12-05235.2003.
- Friston K.J., Holmes A.P., Worsley K.J., Poline J.-P., Frith C.D., Frackowiak R.S.J. Statistical parametric maps in functional imaging: a general linear approach. Hum. Brain Mapp. 1995;2:189–210.
- Gallese V., Fadiga L., Fogassi L., Rizzolatti G. Action recognition in the premotor cortex. Brain. 1996;119:593–609. doi: 10.1093/brain/119.2.593.
- Goldstone R.L., Hendrickson A.T. Categorical perception. Wiley Interdiscip. Rev. Cogn. Sci. 2010;1:69–78. doi: 10.1002/wcs.26.
- Harnad S. Category induction and representation. Cognit. Brain Theory. 1987;5:535–565.
- Hebart M.N., Görgen K., Haynes J.-D. The Decoding Toolbox (TDT): a versatile software package for multivariate analyses of functional imaging data. Front. Neuroinform. 2015;8:1–18. doi: 10.3389/fninf.2014.00088.
- Jastorff J., Popivanov I.D., Vogels R., Vanduffel W., Orban G.A. Integration of shape and motion cues in biological motion processing in the monkey STS. Neuroimage. 2012;60:911–921. doi: 10.1016/j.neuroimage.2011.12.087.
- Jastorff J., Kourtzi Z., Giese M.A. Visual learning shapes the processing of complex movement stimuli in the human brain. J. Neurosci. 2009;29:14026–14038. doi: 10.1523/JNEUROSCI.3070-09.2009.
- Jellema T., Perrett D.I. Neural representations of perceived bodily actions using a categorical frame of reference. Neuropsychologia. 2006;44:1535–1546. doi: 10.1016/j.neuropsychologia.2006.01.020.
- Juárez F.P.G., Sicotte T., Thériault C., Harnad S. Category learning can alter perception and its neural correlates. PLoS One. 2019;14. doi: 10.1371/journal.pone.0226000.
- Kim T.H., Schnitzer M.J. Fluorescence imaging of large-scale neural ensemble dynamics. Cell. 2022;185:9–41. doi: 10.1016/j.cell.2021.12.007.
- Komatsu H., Wurtz R.H. Relation of cortical areas MT and MST to pursuit eye movements. I. Localization and visual properties of neurons. J. Neurophysiol. 1988;60:580–603. doi: 10.1152/jn.1988.60.2.580.
- Kumar S., Kaposvari P., Vogels R. Encoding of predictable and unpredictable stimuli by inferior temporal cortical neurons. J. Cogn. Neurosci. 2017;29:1445–1454. doi: 10.1162/jocn_a_01135.
- Lanzilotto M., Ferroni C.G., Livi A., Gerbella M., Maranesi M., Borra E., Passarelli L., Gamberini M., Fogassi L., Bonini L., Orban G.A. Anterior intraparietal area: a hub in the observed manipulative action network. Cereb. Cortex. 2019;29:1816–1833. doi: 10.1093/cercor/bhz011.
- Li L., Patki P.G., Kwon Y.B., Stelmakh V., Campbell B.D., Annamalai M., Lakoba T.I., Vasilyev M. All-optical regenerator of multi-channel signals. Nat. Commun. 2017;8:1–11. doi: 10.1038/s41467-017-00874-0.
- Livingston K.R., Andrews J.K., Harnad S. Categorical perception effects induced by category learning. J. Exp. Psychol. Learn. Mem. Cogn. 1998;24:732–753. doi: 10.1037//0278-7393.24.3.732.
- Meyers E.M., Borzello M., Freiwald W.A., Tsao D. Intelligent information loss: the coding of facial identity, head pose, and non-face information in the macaque face patch system. J. Neurosci. 2015;35:7069–7081. doi: 10.1523/JNEUROSCI.3086-14.2015.
- Mysore S.G., Vogels R., Raiguel S.E., Todd J.T., Orban G.A. The selectivity of neurons in the macaque fundus of the superior temporal area for three-dimensional structure from motion. J. Neurosci. 2010;30:15491–15508. doi: 10.1523/JNEUROSCI.0820-10.2010.
- Nelissen K., Borra E., Gerbella M., Rozzi S., Luppino G., Vanduffel W., Rizzolatti G., Orban G.A. Action observation circuits in the macaque monkey cortex. J. Neurosci. 2011;31:3743–3756. doi: 10.1523/JNEUROSCI.4803-10.2011.
- Nelissen K., Luppino G., Vanduffel W., Rizzolatti G., Orban G.A. Observing others: multiple action representation in the frontal lobe. Science. 2005;310:332–336. doi: 10.1126/science.1115593.
- Nelissen K., Vanduffel W. Action categorization in rhesus monkeys: discrimination of grasping from non-grasping manual motor acts. Sci. Rep. 2017;7:1–10. doi: 10.1038/s41598-017-15378-6.
- Nelissen K., Vanduffel W., Orban G.A. Charting the lower superior temporal region, a new motion-sensitive region in monkey superior temporal sulcus. J. Neurosci. 2006;26:5929–5947. doi: 10.1523/JNEUROSCI.0824-06.2006.
- Op de Beeck H.P., Deutsch J.A., Vanduffel W., Kanwisher N.G., DiCarlo J.J. A stable topography of selectivity for unfamiliar shape classes in monkey inferior temporal cortex. Cereb. Cortex. 2008;18:1676–1694. doi: 10.1093/cercor/bhm196.
- Oram M.W., Perrett D.I. Integration of form and motion in the anterior superior temporal polysensory area (STPa) of the macaque monkey. J. Neurophysiol. 1996;76:109–129. doi: 10.1152/jn.1996.76.1.109.
- Oram M.W., Perrett D.I. Responses of anterior superior temporal polysensory (STPa) neurons to “biological motion” stimuli. J. Cogn. Neurosci. 1994;6:99–116. doi: 10.1162/jocn.1994.6.2.99.
- Perrett D.I., Harries M.H., Bevan R., Thomas S., Benson P.J., Mistlin A.J., Chitty A.J., Hietanen J.K., Ortega J.E. Frameworks of analysis for the neural representation of animate objects and actions. J. Exp. Biol. 1989;146:87–113. doi: 10.1242/jeb.146.1.87.
- Perrett D.I., Smith P.A.J., Mistlin A.J., Chitty A.J., Head A.S., Potter D.D., Broennimann R., Milner A.D., Jeeves M.A. Visual analysis of body movements by neurones in the temporal cortex of the macaque monkey: a preliminary report. Behav. Brain Res. 1985;16:153–170. doi: 10.1016/0166-4328(85)90089-0.
- Pinsk M.A., DeSimone K., Moore T., Gross C.G., Kastner S. Representations of faces and body parts in macaque temporal cortex: a functional MRI study. Proc. Natl. Acad. Sci. U. S. A. 2005;102:6996–7001. doi: 10.1073/pnas.0502605102.
- Pitcher D., Ungerleider L.G. Evidence for a third visual pathway specialized for social perception. Trends Cogn. Sci. 2021;25:100–110. doi: 10.1016/j.tics.2020.11.006.
- Popivanov I.D., Jastorff J., Vanduffel W., Vogels R. Tolerance of macaque middle STS body patch neurons to shape-preserving stimulus transformations. J. Cogn. Neurosci. 2015;27:1001–1016. doi: 10.1162/jocn_a_00762.
- Popivanov I.D., Jastorff J., Vanduffel W., Vogels R. Heterogeneous single-unit selectivity in an fMRI-defined body-selective patch. J. Neurosci. 2014;34:95–111. doi: 10.1523/JNEUROSCI.2748-13.2014.
- Popivanov I.D., Jastorff J., Vanduffel W., Vogels R. Stimulus representations in body-selective regions of the macaque cortex assessed with event-related fMRI. Neuroimage. 2012;63:723–741. doi: 10.1016/j.neuroimage.2012.07.013.
- Popivanov I.D., Schyns P.G., Vogels R. Stimulus features coded by single neurons of a macaque body category selective patch. Proc. Natl. Acad. Sci. 2016;113. doi: 10.1073/pnas.1520371113.
- Puce A., Perrett D. Electrophysiology and brain imaging of biological motion. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2003;358:435–445. doi: 10.1098/rstb.2002.1221.
- Saito H.A., Yukie M., Tanaka K., Hikosaka K., Fukada Y., Iwai E. Integration of direction signals of image motion in the superior temporal sulcus of the macaque monkey. J. Neurosci. 1986;6:145–157. doi: 10.1523/JNEUROSCI.06-01-00145.1986.
- Sharma S., Fiave P.A., Nelissen K. Functional MRI responses to passive, active, and observed touch in somatosensory and insular cortices of the macaque monkey. J. Neurosci. 2018;38:3689–3707. doi: 10.1523/JNEUROSCI.1587-17.2018.
- Sharma S., Mantini D., Vanduffel W., Nelissen K. Functional specialization of macaque premotor F5 subfields with respect to hand and mouth movements: a comparison of task and resting-state fMRI. Neuroimage. 2019;191:441–456. doi: 10.1016/j.neuroimage.2019.02.045.
- Singer J.M., Sheinberg D.L. Temporal cortex neurons encode articulated actions as slow sequences of integrated poses. J. Neurosci. 2010;30:3133–3145. doi: 10.1523/JNEUROSCI.3211-09.2010.
- Sliwa J., Freiwald W.A. A dedicated network for social interaction processing in the primate brain. Science. 2017;356:745–749. doi: 10.1126/science.aam6383.
- Taubert J., van Belle G., Vanduffel W., Rossion B., Vogels R. Neural correlate of the Thatcher face illusion in a monkey face-selective patch. J. Neurosci. 2015;35:9872–9878. doi: 10.1523/JNEUROSCI.0446-15.2015.
- Tsao D.Y., Freiwald W.A., Knutsen T.A., Mandeville J.B., Tootell R.B.H. Faces and objects in macaque cerebral cortex. Nat. Neurosci. 2003;6:989–995. doi: 10.1038/nn1111.
- Tsao D.Y., Moeller S., Freiwald W.A. Comparing face patch systems in macaques and humans. Proc. Natl. Acad. Sci. 2008;105:19514–19519. doi: 10.1073/pnas.0809662105.
- Vanduffel W., Fize D., Mandeville J.B., Nelissen K., van Hecke P., Rosen B.R., Tootell R.B.H., Orban G.A. Visual motion processing investigated using contrast agent-enhanced fMRI in awake behaving monkeys. Neuron. 2001;32:565–577. doi: 10.1016/s0896-6273(01)00502-5.
- Vangeneugden J., Pollick F., Vogels R. Functional differentiation of macaque visual temporal cortical neurons using a parametric action space. Cereb. Cortex. 2009;19:593–611. doi: 10.1093/cercor/bhn109.
- Vangeneugden J., Vancleef K., Jaeggli T., Van Gool L., Vogels R. Discrimination of locomotion direction in impoverished displays of walkers by macaque monkeys. J. Vis. 2010;10:1–19. doi: 10.1167/10.4.22.
- Vanrie J., Verfaillie K. Perceiving depth in point-light actions. Percept. Psychophys. 2006;68:601–612. doi: 10.3758/bf03208762.
- Vogels R. More than the face: representations of bodies in the inferior temporal cortex. Annu. Rev. Vis. Sci. 2022. doi: 10.1146/annurev-vision-100720-113429.