eLife. 2022 May 24;11:e75027. doi: 10.7554/eLife.75027

Social-affective features drive human representations of observed actions

Diana C Dima 1, Tyler M Tomita 2, Christopher J Honey 2, Leyla Isik 1
Editor: Chris I Baker
PMCID: PMC9159752  PMID: 35608254

Abstract

Humans observe actions performed by others in many different visual and social settings. What features do we extract and attend when we view such complex scenes, and how are they processed in the brain? To answer these questions, we curated two large-scale sets of naturalistic videos of everyday actions and estimated their perceived similarity in two behavioral experiments. We normed and quantified a large range of visual, action-related, and social-affective features across the stimulus sets. Using a cross-validated variance partitioning analysis, we found that social-affective features predicted similarity judgments better than, and independently of, visual and action features in both behavioral experiments. Next, we conducted an electroencephalography experiment, which revealed a sustained correlation between neural responses to videos and their behavioral similarity. Visual, action, and social-affective features predicted neural patterns at early, intermediate, and late stages, respectively, during this behaviorally relevant time window. Together, these findings show that social-affective features are important for perceiving naturalistic actions and are extracted at the final stage of a temporal gradient in the brain.

Research organism: Human

Introduction

In daily life, we rely on our ability to recognize a range of actions performed by others in a variety of different contexts. Our perception of others’ actions is both efficient and flexible, enabling us to rapidly understand new actions no matter where they occur or who is performing them. This understanding plays a part in complex social computations about the mental states and intentions of others (Jamali et al., 2021; Spunt et al., 2011; Thornton et al., 2019; Thornton and Tamir, 2021; Weaverdyck et al., 2021). Visual action recognition also interacts cross-modally with language-based action understanding (Bedny and Caramazza, 2011; Humphreys et al., 2013). However, there are two important gaps in our understanding of action perception in realistic settings. First, we still do not know which features of the visual world underlie our representations of observed actions. Second, we do not know how different types of action-relevant features, ranging from visual to social, are processed in the brain, and especially how they unfold over time. Answering these questions can shed light on the computational mechanisms that support action perception. For example, are different semantic and social features extracted in parallel or sequentially?

Relatively few studies have investigated the temporal dynamics of neural responses to actions. During action observation, a distributed network of brain areas extracts action-related features ranging from visual to abstract, with viewpoint-invariant responses emerging as early as 200 ms (Isik et al., 2018). Visual features include the spatial scale of an action (i.e., fine-scale manipulations like knitting vs. full-body movements like running) represented throughout visual cortex (Tarhan and Konkle, 2020), and information about biological motion, thought to be extracted within 200 ms in superior temporal cortex (Giese and Poggio, 2003; Hirai et al., 2003; Hirai and Hiraki, 2006; Johansson, 1973; Jokisch et al., 2005; Vangeneugden et al., 2014). Responses in occipito-temporal areas have been shown to reflect semantic features like invariant action category (Hafri et al., 2017; Lingnau and Downing, 2015; Tucciarelli et al., 2019; Tucciarelli et al., 2015; Wurm and Caramazza, 2019; Wurm and Lingnau, 2015), as well as social features like the number of agents and sociality of actions (Tarhan and Konkle, 2020; Wurm et al., 2017; Wurm and Caramazza, 2019).

Among the visual, semantic, and social features thought to be processed during action observation, it is unclear which underlie our everyday perception in naturalistic settings. Mounting evidence suggests that naturalistic datasets are key to improving ecological validity and reliability in visual and social neuroscience (Haxby et al., 2020; Nastase et al., 2020; Redcay and Moraczewski, 2020). Most action recognition studies to date have used controlled images and videos showing actions in simple contexts (Isik et al., 2018; Wurm and Caramazza, 2019). However, presenting actions in natural contexts is critical as stimulus–context interactions have been shown to modulate neural activity (Willems and Peelen, 2021). Recent attempts to understand naturalistic action perception, however, have yielded mixed results, particularly with regard to the role of social features. For example, one recent study concluded that sociality (i.e., presence of a social interaction) was the primary organizing dimension of action representations in the human brain (Tarhan and Konkle, 2020). Another, however, found that semantic action category explained the most variance in fMRI data, with little contribution from social features (Tucciarelli et al., 2019).

Here, we combined a new large-scale dataset of everyday actions with a priori feature labels to comprehensively sample the hypothesis space defined by previous work. This is essential in light of the conflicting results from previous studies, as it allowed us to disentangle the contributions of distinct but correlated feature spaces. We used three-second videos of everyday actions from the “Moments in Time” dataset (Monfort et al., 2020) and replicated our results across two different stimulus sets. Action videos were sampled from different categories based on the American Time Use Survey (ATUS, 2019) and were highly diverse, depicting a variety of contexts and people. We quantified a wide range of visual, action-related, and social-affective features in the videos and, through careful curation, ensured that they were minimally confounded across our dataset.

We used this dataset to probe the behavioral and neural representational space of human action perception. To understand the features that support natural action viewing, we predicted behavioral similarity judgments using the visual, action-related, and social-affective feature sets. Next, to investigate the neural dynamics of action perception, we recorded electroencephalography (EEG) data while participants viewed the stimuli, and we used the three sets of features to predict time-resolved neural patterns.

We found that social-affective features predict action similarity judgments better than, and independently of, visual and action-related features. Visual and action-related features explained less variance in behavior, even though they included fundamental features such as the scene setting and the semantic category of each action. Neural patterns revealed that behaviorally relevant features are automatically extracted by the brain in a progression from visual to action to social-affective features. Together, our results reveal the importance of social-affective features in how we represent other people’s actions, and show that these representations emerge in the brain along a temporal gradient.

Results

Disentangling visual, action, and social-affective features in natural videos

We curated two sets of naturalistic three-second videos of everyday actions from the Moments in Time dataset (Monfort et al., 2020). The videos were selected from a larger set so that features of interest were minimally correlated, and they represented 18 common activities based on the Bureau of Labor Statistics’ American Time Use Survey (ATUS, 2019) (Table 1; see section ‘Behavior: Stimuli’). The two stimulus sets contained 152 videos (eight videos per activity and eight additional videos with no agents, included to add variation in the dataset; see section ‘Behavior: Stimuli’) and 65 videos (three or four videos per activity), respectively. The second set was used to replicate the behavioral results in a separate experiment with different stimuli and participants.

Table 1. Activities from the American Time Use Survey (ATUS) included in each of the two stimulus sets, with the average number of hours per day spent performing each activity and the corresponding verb labels from the Moments in Time dataset.

Note that control videos were only included in the first dataset. Fighting and hiking were added for variation in valence and action setting.

Activity | Hours | Verb labels (Moments in Time)
Childcare/taking care of children | 0.37 | Crying, cuddling, feeding, giggling, socializing
Driving | 1.17 | Driving, socializing
Eating | 1.06 | Chewing, eating
Fighting | — | Fighting
Gardening | 0.17 | Gardening, mowing, planting, shoveling, weeding
Grooming | 0.68 | Bathing, brushing, combing, trimming, washing
Hiking | — | Hiking
Housework | 0.53 | Cleaning, dusting, repairing, scrubbing, vacuuming
Instructing and attending class | 0.22 | Instructing, teaching
Playing games | 0.26 | Gambling, playing+fun, playing+videogames, socializing
Preparing food | 0.60 | Barbecuing, boiling, chopping, cooking, frying, grilling, rinsing, stirring
Reading | 0.46 | Reading
Religious activities | 0.14 | Praying, preaching
Sleeping | 8.84 | Resting, sleeping
Socializing and social events | 0.64 | Celebrating, dancing, marrying, singing, socializing, talking
Sports | 0.34 | Exercising, playing+sports, swimming, throwing
Telephoning | 0.16 | Calling, telephoning
Working | 3.26 | Working
Control videos | — | Blowing, floating, raining, shaking

Naturalistic videos of actions can vary along numerous axes, including visual features (e.g., the setting in which the action takes place or objects in the scene), action-specific features (e.g., semantic action category), and social-affective features (e.g., the number of agents involved or perceived arousal). For example, an action like ‘eating’ may vary in terms of context (in the kitchen vs. at a park), object (eating an apple vs. a sandwich), and number of agents (eating alone vs. together). Drawing these distinctions is crucial to disambiguate between context, actions, and agents in natural events. To evaluate these different axes, we quantified 17 visual, action-related, and social-affective features using image properties, labels assigned by experimenters, and behavioral ratings collected in online experiments (Figure 1a). Visual features ranged from low-level (e.g., pixel values) to high-level features related to scenes and objects (e.g., activations from the final layer of a pretrained neural network). Action-related features included transitivity (object-relatedness), activity (the amount of activity in a video), effectors (body parts involved), and action category based on the ATUS (ATUS, 2019). Finally, social-affective features included sociality, valence, arousal, and number of agents (see section ‘Representational similarity analysis’). Representational dissimilarity matrices (RDMs) were created for each feature by calculating pairwise Euclidean distances between all videos.

Figure 1. Quantifying visual, social-affective, and action features in the two stimulus sets.


(a) Correlations between feature representational dissimilarity matrices (RDMs). Note the low correlations between visual features and action/social-affective features (white rectangle). (b) Behavioral rating distributions in the two stimulus sets. The z-scored ratings were visualized as raincloud plots (Allen et al., 2019) showing the individual data points, as well as probability density estimates computed using MATLAB’s ksdensity function.

In both video sets, there were only weak correlations between visual features and the higher-level action/social-affective features (Figure 1a). The highest correlations were those within each of the three sets of features, including visual features (Experiment 1: Conv1 and image saturation/gist, τA = 0.29; Experiment 2: Conv1 and image hue, τA = 0.32), action features (Experiment 1: arousal and activity, τA = 0.31; Experiment 2: activity and effectors, τA = 0.33), and social features (sociality and number of agents; Experiment 1: τA = 0.31, Experiment 2: τA = 0.3).
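For readers who wish to reproduce this type of comparison, the following is a minimal Python sketch (not the authors' code, which used MATLAB in places) of Kendall's τA computed between the vectorized upper triangles of two RDMs; the example RDMs are randomly generated placeholders.

```python
# Illustrative sketch: Kendall's tau-a between two representational dissimilarity matrices.
import numpy as np

def upper_triangle(rdm):
    """Vectorize the upper triangle of a square RDM, excluding the diagonal."""
    rows, cols = np.triu_indices(rdm.shape[0], k=1)
    return rdm[rows, cols]

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    num = 0.0
    for i in range(n - 1):
        # Sign product is +1 for concordant pairs, -1 for discordant, 0 for ties.
        num += np.sum(np.sign(x[i + 1:] - x[i]) * np.sign(y[i + 1:] - y[i]))
    return num / (n * (n - 1) / 2)

# Example with two hypothetical 152 x 152 feature RDMs (random placeholders).
rng = np.random.default_rng(0)
rdm_a, rdm_b = rng.random((152, 152)), rng.random((152, 152))
rdm_a, rdm_b = (rdm_a + rdm_a.T) / 2, (rdm_b + rdm_b.T) / 2  # symmetrize
tau = kendall_tau_a(upper_triangle(rdm_a), upper_triangle(rdm_b))
print(f"Kendall's tau-a = {tau:.3f}")
```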

The distributions of action and social-affective features were not significantly different between the two stimulus sets (all Mann–Whitney z < 1.08, p>0.28). The width of these distributions suggests that the stimuli spanned a wide range along each feature (Figure 1b). In both experiments, transitivity was notable through its bimodal distribution, likely reflecting the presence or absence of objects in scenes, while other features had largely unimodal distributions.

Behaviorally rated features differed in reliability in Experiment 1 (F(4,819) = 22.35, p<0.001), with sociality being the most reliable and arousal the least reliable (Figure 3—figure supplement 1). In Experiment 2, however, there was no difference in reliability (F(4,619) = 0.76, p=0.55). Differences in reliability were mitigated by our use of feature averages to generate feature RDMs.

Individual feature contributions to behavioral similarity

To characterize human action representations, we measured behavioral similarity for all pairs of videos in each set in two multiple arrangement experiments (see section ‘Multiple arrangement’). Participants arranged videos according to their similarity inside a circular arena (Figure 2). The task involved arranging different subsets of 3–8 videos until sufficiently reliable distance estimates were reached for all pairs of videos. Videos would play on hover, and participants had to play and move each video to proceed to the next trial. In Experiment 1, participants arranged different subsets of 30 videos out of the total 152, while in Experiment 2, participants arranged all 65 videos. To emphasize natural behavior, participants were not given specific criteria to use when judging similarity. Behavioral RDMs containing the Euclidean distances between all pairs of stimuli were reconstructed from each participant’s multiple arrangement data using inverse MDS (Kriegeskorte and Mur, 2012).

Figure 2. Experimental and analysis pipeline for evaluating the contribution of different features to action representations.


Above: a multiple arrangement task was used to generate behavioral representational dissimilarity matrices (RDMs) in the two behavioral experiments. Below: electroencephalography (EEG) data was recorded during a one-back task, and time-resolved neural RDMs were generated using pairwise decoding accuracies. Cross-validated variance partitioning was used to assess the unique contributions of visual, social-affective, and action features to the behavioral and neural RDMs, quantified as the predicted squared Kendall’s τA. The stimuli in this figure are public domain images similar to the types of videos used in the experiments.

The multiple arrangement task was unconstrained, which meant that participants could use different criteria. Although this may have introduced some variability, the adaptive algorithm used in the multiple arrangement task enabled us to capture a multidimensional representation of how actions are intuitively organized in the mind, while at the same time ensuring sufficient data quality. Data reliability was quantified using leave-one-subject-out correlations of the dissimilarity estimates and was above chance in both experiments (Kendall’s τA = 0.13 ± 0.08 and 0.18 ± 0.08 respectively, both p<0.001, permutation testing; Figure 3—figure supplement 1a). Reliability was significantly higher in Experiment 2 than in Experiment 1 (Mann–Whitney z = 3.21, p=0.0013), potentially reflecting differences in both participant pools and sampling methods (subsets of videos in Experiment 1 vs. full video dataset in Experiment 2; see section ‘Multiple arrangement’).
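A minimal sketch of the leave-one-subject-out reliability computation, assuming a hypothetical array `subject_rdms` of vectorized behavioral RDMs; the authors' implementation may differ in detail, and SciPy's kendalltau returns τ-b, which coincides with τA when there are no ties.

```python
# Illustrative sketch: leave-one-subject-out reliability of behavioral RDMs.
import numpy as np
from scipy.stats import kendalltau

def loso_reliability(subject_rdms):
    """subject_rdms: (n_subjects, n_pairs) vectorized RDMs; pairs a participant
    did not arrange (as in Experiment 1) can be NaN."""
    taus = []
    for s in range(subject_rdms.shape[0]):
        left_out = subject_rdms[s]
        others = np.nanmean(np.delete(subject_rdms, s, axis=0), axis=0)
        valid = ~np.isnan(left_out) & ~np.isnan(others)
        tau, _ = kendalltau(left_out[valid], others[valid])
        taus.append(tau)
    return np.array(taus)  # one reliability value per participant
```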

We assessed the contribution of 17 different visual, social, and action features to behavior in both experiments by correlating each feature RDM to each participant’s behavioral RDM (Supplementary file 1b). In Experiment 1 (Figure 3), only two visual features were significantly correlated with the behavioral RDMs (environment and activations from the final fully connected layer FC8 of AlexNet). However, there were significant correlations between behavioral RDMs and all action-related RDMs (action category, effectors, transitivity, and activity), as well as all social-affective RDMs (valence, arousal, sociality, and number of agents).

Figure 3. Feature contributions to behavioral similarity.

Feature-behavior correlations are plotted against the noise ceiling (gray). Each dot is the correlation between an individual participant’s behavioral representational dissimilarity matrix (RDM) and each feature RDM. Asterisks denote significance (p<0.005, sign permutation testing). The reliability of the data and feature ratings is presented in Figure 3—figure supplement 1.


Figure 3—figure supplement 1. Reliability of behavioral data.


(a) Reliability of behavioral similarity estimates (leave-one-subject-out correlations) in the two datasets. (b) Reliability of features measured in behavioral experiments (leave-one-subject-out correlations). Since most ratings were collected using the larger video set, the video sets rated by each participant differed both in number and content. Only participants who rated at least five videos from each set were included in the analysis.

In Experiment 2, the only visual feature that was moderately correlated with behavior was the final fully connected layer of AlexNet (p=0.006, which did not survive our significance threshold of p<0.005). Among action features, only effectors and activity were significantly correlated with the behavioral RDMs. However, we found significant correlations with all social-affective features. The results thus converge across both experiments in suggesting that social-affective and, to a lesser extent, action-related features, rather than visual properties, explain behavioral similarity.

Social-affective features explain the most unique variance in behavioral representations

We performed a cross-validated variance partitioning analysis (Groen et al., 2018; Lescroart et al., 2015; Tarhan et al., 2021) to determine which features contributed the most unique variance to behavior (see section ‘Variance partitioning’). We selected the 10 features that contributed significantly to behavior in either experiment, that is, two visual features (environment and layer FC8 of AlexNet) and all action and social-affective features. To keep the analysis tractable and understand the contribution of each type of information, we grouped these features according to their type (visual, action, and social-affective) and used them as predictors in a cross-validated hierarchical regression (Figure 4). Note that there was no collinearity among the 10 predictors, with an average variance inflation factor of 1.34 (Experiment 1) and 1.37 (Experiment 2).
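The logic of the cross-validated variance partitioning can be sketched as follows (illustrative Python, not the authors' code): the unique variance of a feature group is the drop in held-out prediction performance when that group is removed from the full regression model. The input names are hypothetical, and performance is scored here as a squared Pearson correlation rather than the squared Kendall's τA used in the paper.

```python
# Illustrative sketch of cross-validated variance partitioning over feature groups.
import numpy as np

def fit_predict(X_train, y_train, X_test):
    # Ordinary least squares with an intercept term.
    Xtr = np.column_stack([np.ones(len(y_train)), X_train])
    Xte = np.column_stack([np.ones(X_test.shape[0]), X_test])
    beta, *_ = np.linalg.lstsq(Xtr, y_train, rcond=None)
    return Xte @ beta

def perf(pred, target):
    # Squared correlation between predicted and observed dissimilarities.
    return np.corrcoef(pred, target)[0, 1] ** 2

def unique_variance(feature_groups, behavior_train, behavior_test):
    """feature_groups: dict of group name -> (n_pairs, n_predictors) vectorized RDMs.
    behavior_train / behavior_test: vectorized behavioral RDMs from independent
    halves of the participants (the predictors are the same for both halves;
    cross-validation here is over participants)."""
    X_full = np.column_stack(list(feature_groups.values()))
    full_perf = perf(fit_predict(X_full, behavior_train, X_full), behavior_test)
    unique = {}
    for name in feature_groups:
        X_red = np.column_stack([v for k, v in feature_groups.items() if k != name])
        red_perf = perf(fit_predict(X_red, behavior_train, X_red), behavior_test)
        unique[name] = full_perf - red_perf  # drop in held-out performance
    return unique
```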

Figure 4. Social-affective features explain behavior better than visual and action features.

The unique variance explained by visual, action, and social-affective features is plotted against the split-half reliability of the data (gray). Significant differences are marked with asterisks (all p<0.001, Wilcoxon signed-rank tests). For control analyses on how individual features (e.g., action category and the number of agents) and their assignment to groups affect the results, see Figure 4—figure supplements 1–5.


Figure 4—figure supplement 1. Using a more detailed semantic model based on WordNet similarities between video labels does not increase the contribution of action features.


(a) Correlations between the WordNet representational dissimilarity matrix (RDM) and the behavioral RDMs. (b) Variance partitioning results obtained after replacing the action category RDM with the WordNet RDM.
Figure 4—figure supplement 2. Quantifying motion energy as a visual feature does not change the pattern of variance partitioning results.


(a) A motion energy model correlates with the behavioral similarity data in Experiment 1, but not in Experiment 2. (b) Adding the motion energy model to the group of visual features does not change the pattern of results in our variance partitioning analyses.
Figure 4—figure supplement 3. Unique variance explained by visual and action features (environment, FC8, activity, transitivity, effectors, action category) and social-affective features (number of agents, sociality, valence, arousal) in the behavioral data.


This shows a striking contribution from social-affective features even when pitted against all other features.
Figure 4—figure supplement 4. Unique variance explained by the number of agents, action features (action category, effectors, transitivity, activity), and other social-affective features (sociality, valence, and arousal) in the behavioral data.


This control analysis shows that higher-level social-affective features explain most of the unique variance in our data, above and beyond the number of agents.
Figure 4—figure supplement 5. Unique variance explained by the number of agents, sociality, and affective features (valence and arousal) in the behavioral data.


This control analysis suggests that affective features explain most of the variance in the data, alongside the number of agents (particularly in Experiment 2). However, this analysis does not account for the contributions of other visual and action features.

Together, the 10 predictors explained most of the systematic variance in behavior. In Experiment 1, the predicted squared Kendall’s τA of the full model (τA² = 0.06 ± 0.001) was higher on average than the true split-half squared correlation (τA² = 0.04 ± 0.002). This is likely due to the lower reliability of the behavioral similarity data in this experiment and suggests that the 10 predictors explain the data well despite the overall lower prediction accuracy. In Experiment 2, the full model achieved a predicted τA² of 0.18 ± 0.1 on average, compared to a true squared correlation of 0.25 ± 0.1, suggesting that the 10 predictors explain most of the variance (73.21%) in the behavioral data.

In both experiments, social-affective features contributed significantly more unique variance to behavior than visual or action features (Figure 4, all Wilcoxon z > 5.5, all p<0.001). While all three groups of features contributed unique variance to behavior in Experiment 1 (all p<0.001, randomization testing), in Experiment 2, only social-affective features contributed significantly to behavior (p<0.001), while visual and action features did not (p=0.06 and 0.47, respectively). Shared variance between feature groups was not a significant contributor in either dataset. Although the effect sizes were relatively low, social-affective features explained more than twice as much unique variance as either the visual or action features in Experiment 1, and six times as much in Experiment 2. Furthermore, given the limits placed on predictivity by the reliability of the behavioral data, social-affective features predicted a large portion of the explainable variance in both experiments.

The semantic RDM included among the action features was a categorical model based on activity categories (ATUS, 2019). To assess whether a more detailed semantic model would explain more variance in behavior, we generated a feature RDM using WordNet similarities between the verb labels corresponding to the videos in the Moments in Time dataset. However, replacing the action category RDM with the WordNet RDM did not increase the variance explained by action features (Figure 4—figure supplement 1).

Similarly, our decision to quantify motion and image properties separately by using an optic flow model may have reduced the explanatory power of motion features in our data. Indeed, a motion energy model (Adelson and Bergen, 1985; Nunez-Elizalde et al., 2021) significantly correlated with behavior in Experiment 1, but not in Experiment 2. However, the addition of this model did not change the pattern of unique feature contributions (Figure 4—figure supplement 2).

Although the assignment of features to domains was not always straightforward, our results were robust to alternative assignment schemes. For example, high-level visual features can be seen as bordering the semantic domain, while features like the number of agents or the amount of activity can be seen as visual. However, feature assignment was not the main factor driving our results, which stayed the same even when the activity feature was assigned to the visual group. More strikingly, the social-affective feature group explained significantly more variance than all other features grouped together in both experiments (Figure 4—figure supplement 3). This is a particularly stringent test as it pits the unique and shared contributions of all visual, semantic, and action features against the four social-affective features. In Experiment 1, the combined contribution of visual and action features approached that of social-affective features, while in Experiment 2 the difference was larger. Together with the larger contribution of the number of agents in Experiment 2 (Figure 4—figure supplement 4, Figure 4—figure supplement 5), this suggests that Experiment 2 may have captured more social information, potentially thanks to the exhaustive sampling of the stimuli that allowed each participant to arrange the videos according to different criteria.

Among the social-affective features we tested, the number of agents could be seen as straddling the visual and social domains. To assess whether our results were driven by this feature, we performed a control variance partitioning analysis pitting the number of agents against the other, higher-level social-affective features (Figure 4—figure supplement 4). In both experiments, the higher-level features (sociality, valence, and arousal) contributed more unique variance than the number of agents, suggesting that our results are not explained by purely visual factors.

Furthermore, an additional analysis looking at the separate contributions of the number of agents, sociality, and affective features (valence and arousal) found that the affective features contributed the greatest variance in both experiments (Figure 4—figure supplement 5). For technical reasons, this analysis compared the joint contribution of both affective features to each single social feature and did not discount the impact of variance shared with visual or action-related features. Despite these limitations, the results suggest that the contribution of the social-affective feature group is not driven by the number of agents or the variance it shares with sociality, and highlight the role of affective features (valence and arousal) in explaining behavior.

EEG patterns reflect behavioral similarity

We performed an EEG experiment to investigate how action-relevant features are processed over time. Participants viewed 500 ms segments of the 152 videos from Experiment 1 and performed a one-back action task in which they detected repetitions of the action category (see section ‘EEG: Experimental procedure’). To relate neural patterns to behavioral and feature RDMs, we computed time-resolved neural RDMs for each participant using decoding accuracies between all pairs of videos (Figures 2 and 5a). The time course of decoding performance was similar to that observed in previous E/MEG studies using still visual stimuli (Carlson et al., 2013; Cichy et al., 2014; Dima et al., 2018; Greene and Hansen, 2018; Isik et al., 2014). Decoding accuracy rose above chance at 50 ms after video onset, reached its maximum at 98 ms (63.88 ± 6.82% accuracy), and remained above chance until 852 ms after video onset (cluster-corrected p<0.05, sign permutation testing).
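As an illustration of how such time-resolved neural RDMs can be computed, here is a simplified Python sketch (not the authors' pipeline, which may use pseudo-trials or a different classifier); `epochs` and `video_ids` are hypothetical inputs.

```python
# Illustrative sketch: time-resolved pairwise decoding to build neural RDMs.
import numpy as np
from itertools import combinations
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def time_resolved_rdm(epochs, video_ids, n_folds=5):
    """epochs: (n_trials, n_channels, n_times) EEG array; video_ids: condition
    label per trial. Returns an (n_videos, n_videos, n_times) RDM of
    cross-validated pairwise decoding accuracies."""
    videos = np.unique(video_ids)
    n_videos, n_times = len(videos), epochs.shape[2]
    rdm = np.zeros((n_videos, n_videos, n_times))
    for i, j in combinations(range(n_videos), 2):
        mask = np.isin(video_ids, [videos[i], videos[j]])
        X_pair, y_pair = epochs[mask], video_ids[mask]
        for t in range(n_times):
            acc = cross_val_score(LinearDiscriminantAnalysis(),
                                  X_pair[:, :, t], y_pair, cv=n_folds).mean()
            rdm[i, j, t] = rdm[j, i, t] = acc
    return rdm
```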

Figure 5. The features that explain behavioral action representations also contribute to neural representations.

(a) Time course of video decoding accuracy, averaged across all pairs of videos and participants (in gray: SEM across participants). The horizontal line marks above-chance performance (sign permutation testing, cluster-corrected p<0.05). (b) Behavioral similarity correlates with the neural representational dissimilarity matrices (RDMs). The noise ceiling is shown in light blue (leave-one-subject-out correlation, mean ± SEM). Horizontal lines mark significant time windows (sign permutation testing, cluster-corrected p<0.05). (c) The distribution of significant correlation onsets for each feature model across 1000 bootstrapping iterations (sign permutation testing, cluster-corrected p<0.05). Color rectangles show 90% confidence intervals. The time courses of all feature correlations are shown in Figure 5—figure supplement 1. The average electroencephalography (EEG) evoked response is visualized in Figure 5—figure supplement 2.


Figure 5—figure supplement 1. Correlations between features and the time-resolved neural representational dissimilarity matrices (RDMs).


The first layer of AlexNet (Conv1) was included since it is known to provide the best match to early visual neural responses.
Figure 5—figure supplement 2. Average electroencephalography (EEG) evoked response.


(a) Grand average evoked response across participants. The average time course of the amplitude at each electrode is shown in black, with the global field power shown in red and the stimulus duration in light gray. (b) Average topography of the evoked response (re-referenced to the median), averaged across 0.2 s time windows.

To assess brain–behavior correlations, we related the average behavioral RDM obtained in Experiment 1 to the time-resolved neural RDMs (Kendall’s τA). The behavioral RDM correlated significantly with neural patterns during a cluster between 62 and 766 ms after video onset (Figure 5b), suggesting that the features guiding the intuitive categorization of naturalistic actions also underlie their neural organization.

Neural timescale of individual feature representations

We assessed the correlations between EEG patterns and the 10 feature RDMs found to contribute to behavior in Experiment 1. We also included an additional feature RDM based on the first convolutional layer of AlexNet, which best captures early visual neural responses (Figure 5—figure supplement 1; see section ‘Multivariate analysis’). The feature RDMs that contributed to behavioral similarity also correlated with the EEG patterns (Figure 5—figure supplement 1), with a single exception (sociality).

A bootstrapping analysis of the cluster onsets of these correlations (Figure 5c) suggests a progression from visual to action and social-affective features. Visual predictors correlated with the neural patterns between 65 ± 15 ms (mean ± SD, Conv1) and 84 ± 62 ms (Environment), while action category also had an early onset (58 ± 9 ms). Other action-related features, however, emerged later (transitivity: 170 ± 67 ms, effectors: 192 ± 94 ms, activity: 345 ± 133 ms). Among social-affective features, the number of agents had the earliest correlation onset (178 ± 81 ms), while valence and arousal emerged later (395 ± 81 and 404 ± 91 ms, respectively). Importantly, these features are spontaneously extracted in the brain, as none of them, with the exception of action category, were directly probed in the one-back task performed by participants. In addition, all features were extracted during behaviorally relevant time windows (Figure 5b).

A temporal hierarchy in action perception

A cross-validated variance partitioning analysis revealed different stages in the processing of naturalistic actions (Figure 6). Visual features dominated the early time windows (66–138 ms after video onset). Action features also contributed a significant amount of unique variance (162–598 ms), as well as variance shared with social-affective features (354–598 ms; Figure 6—figure supplement 1). Finally, social-affective features independently predicted late neural responses (446–782 ms). Importantly, visual features did not share a significant amount of variance with either action or social-affective features.

Figure 6. Hierarchical processing of visual, action, and social-affective features.

(a) Unique variance explained by each group of features over time. The split-half reliability of the data is shown in gray (shaded area; see also Figure 5b). Horizontal lines mark significant time windows (sign permutation testing, cluster-corrected p<0.05). The time course of shared variance is displayed in Figure 6—figure supplement 1. See Figure 6—figure supplement 2 for the results of a fixed-effects analysis. Figure 6—figure supplement 3 shows how the addition of a motion energy model affects these results. (b) The distribution of effect onsets across 100 split-half iterations (sign permutation testing, cluster-corrected p<0.05). Color rectangles show 90% confidence intervals.


Figure 6—figure supplement 1. Shared variance among visual, action, and social predictors in the cross-validated variance partitioning analysis.


(a) Time course of shared variance. (b) Onsets (with bootstrapped 90% CIs) for all significant portions of unique and shared variance.
Figure 6—figure supplement 2. Fixed-effects variance partitioning results (stacked area plot).


A hierarchical regression was performed with the average time-resolved representational dissimilarity matrix (RDM) as a response variable. Only the unique contributions of the three predictor groups are shown. Horizontal lines mark significant time windows (cluster-corrected p<0.001, permutation testing).
Figure 6—figure supplement 3. The contribution of motion energy to the neural data.


(a) The motion energy model correlates with neural responses. (b) Including the motion energy model among visual features reduces the unique contribution of action features, but not social features, in the cross-validated variance partitioning analysis. (c) The temporal hierarchy, however, remains the same in the fixed-effects analysis.

An analysis of effect onsets across 100 split-half iterations points to the hierarchical processing of these features, with a progression from visual to action to social-affective features. Social-affective features (mean onset 418 ± 89 ms) contributed unique variance significantly later than other feature sets, while action features (245 ± 104 ms) came online later than visual features (65 ± 8 ms; all Wilcoxon z > 7.27, p<0.001; Figure 6b). A fixed-effects analysis revealed the same order of feature information with larger effect sizes (Figure 6—figure supplement 2).

Motion has been shown to drive the response of visual areas to naturalistic stimuli (Russ and Leopold, 2015; Nishimoto et al., 2011). To better assess the effect of motion on EEG responses, we performed an additional analysis including the motion energy model. There was a sustained correlation between motion energy and EEG patterns beginning at 62 ms (Figure 6—figure supplement 3). In the variance partitioning analysis, the addition of motion energy increased the unique contribution of visual features and decreased that of action features, indicating that the action features share variance with motion energy. However, the three stages of temporal processing were preserved in the fixed-effects analysis even with the addition of motion energy, suggesting that the three feature groups made distinct contributions to the neural patterns. Importantly, the unique contribution of social-affective features was unchanged in both analyses by the addition of the motion energy model.

Discussion

Here, we used a large-scale naturalistic stimulus set to disentangle the roles of different features in action perception. Two novel findings emerge from our study. First, our behavioral results suggest that social-affective features play the most important role in how we organize naturalistic everyday actions, above and beyond fundamental visual and action features like scene setting or action category. Second, these behaviorally relevant features are spontaneously extracted in the brain and follow a hierarchical sequence from visual to action-related features, culminating in social-affective features. These results offer an account of how internal representations of everyday actions emerge in the mind and brain.

Behavioral representations: What features support action perception?

Across two separate multiple arrangement experiments with large-scale naturalistic stimulus sets, we found that social-affective features predicted similarity judgments better than, and independently of, visual and action-related features. By sampling a comprehensive feature space ranging from low-level to conceptual, we were able to distinguish between components that often covary, such as scene setting and action category or sociality and transitivity. Previous studies have operationalized features in different ways, and an exhaustive investigation is thus difficult; however, our approach of including several important features from each group mitigated this limitation, as suggested by the large proportion of behavioral variance that our features collectively explained.

Our work adds to a growing body of evidence for the importance of social-affective features in action perception and extends it by disentangling the contributions of specific social and semantic features. Previous work has highlighted sociality as an essential feature in neural action representations (Tarhan and Konkle, 2020; Wurm et al., 2017; Wurm and Caramazza, 2019), and a recent study (Tarhan et al., 2021) found that behavioral action similarity judgments were better explained by similarity in actors’ goals than by visual similarity. In line with this work, we found a minimal contribution of visual features to action similarity judgments. In contrast, all of our social-affective features – the number of agents, sociality, valence, and arousal – were significantly correlated with behavioral similarity. Furthermore, only two individual action-related features replicated across the two experiments: the amount of activity and the effector (body part) feature, the latter of which is highly relevant to the actors’ goals. This could be interpreted as further evidence for the importance of socially relevant features in our internal representations of actions, and identifies specific social and goal-related features that are important for action understanding.

A hypothesis-driven approach will always pose challenges due to practical limitations in the number of feature spaces one can feasibly test. Our approach of grouping predictors together based on theoretical distinctions made it possible to rigorously evaluate the unique contributions of different types of features, which is an essential first step in understanding naturalistic action representations. This analysis revealed that social-affective features contributed the most unique variance in both experiments, suggesting that they robustly predict behavioral similarity judgments, while visual and action features explained little unique variance in either experiment (Figure 4). An exploratory follow-up analysis showed that this effect was primarily driven by affective features (valence and arousal), with the number of agents as a secondary contributor. Recent work found that affective features drive the perceived similarity of memories of real-life events (Tomita et al., 2021), suggesting that these features bridge the action, event, and memory domains in organizing mental representations.

Among our social-affective features, the number of agents could be construed as a perceptual precursor to sociality. Indeed, previous fMRI work has suggested that neural representations of actions in the visual system reflect perceptual precursors of social features rather than higher-level social features (Wurm and Caramazza, 2019). Here, we found that high-level social-affective features (particularly valence and arousal) contributed significantly to behavior independently of the number of agents. Further, affective features explained significantly more unique variance in behavior than the number of agents in both experiments (Figure 4—figure supplements 4 and 5). Our findings suggest that high-level social-affective features, and in particular valence and arousal, uniquely drive human action representations.

Neural representations: How does action perception unfold over time?

Using EEG, we tracked the temporal dynamics of naturalistic action perception. Using naturalistic stimuli and a rich feature space enabled us to disentangle the contributions of different features and investigate their relative timing. Visual, action, and social-affective features made unique contributions to the EEG patterns at different processing stages, revealing a representational hierarchy of spontaneously extracted features.

Almost all behaviorally relevant features correlated with the EEG patterns, with action-related and social-affective features emerging later than visual features (Figure 5c). Most action-related features emerged within 200 ms, on the timescale of feedforward processing, which is consistent with prior work showing invariant responses to actions as early as 200 ms (Isik et al., 2018; Tucciarelli et al., 2015), and action transitivity processing as early as 250 ms (Wamain et al., 2014). Among social-affective features, the number of agents emerged earliest (162 ms), pointing to the role of this feature as a perceptual precursor in social perception (Papeo, 2020; Wurm and Caramazza, 2019). Valence and arousal emerged later, around 400 ms after video onset. Interestingly, sociality, which has been highlighted as an important dimension in previous fMRI work on action perception (Tarhan and Konkle, 2020; Wurm et al., 2017), did not correlate with the EEG patterns. This null result is unlikely to be driven by low measurement reliability, as sociality was rated more reliably than all other behaviorally rated features in Experiment 1 (Figure 3—figure supplement 1). While the absence of an effect does not preclude the possibility that this feature is being processed, it is possible that prior work has confounded sociality with other correlated social-affective features (such as the number of agents or arousal). Alternatively, our operationalization of this feature (which was broader than in some previous studies, e.g., Tucciarelli et al., 2019; Wurm et al., 2017) may have led to differences in the information captured. Note that this finding is mirrored in our behavioral results, where we observed larger unique contributions from valence, arousal, and the number of agents than from sociality (Figure 4—figure supplement 5).

Importantly, these features emerged spontaneously as the one-back task performed during the EEG recordings only related to action category. However, the semantic processing required to perform the task may have contributed to these computations. The emergence of features irrelevant to the task at hand (action category is not correlated with any other features in the dataset) suggests that this temporal hierarchy would also emerge in the absence of a task; however, future work can more directly test the impact of implicit and explicit (e.g., social-affective) processing on these neural dynamics.

Variance partitioning revealed a clear temporal progression from visual features (~100 ms) to action features (~150–600 ms) to social-affective features (~400–800 ms). Importantly, these processing stages emerged after partialling out the contributions of other groups of predictors in a cross-validated analysis, validating our a priori distinctions between feature classes. These findings suggest that the extraction of visual features occurs rapidly, within 200 ms, and is likely supported by feedforward computations. The social-affective features that support behavioral representations, however, were extracted last. This is consistent with theories suggesting that internal visual experience reverses the stages of perceptual processing (Dijkstra et al., 2020; Hochstein and Ahissar, 2002). Specifically, it was the final, social-affective stage of neural processing that was reflected in the intuitive behavioral representations, and not the initially extracted visual features. Furthermore, action-related features were extracted significantly before social-affective features, suggesting the two are not extracted in parallel, but instead pointing to a hierarchy in which both visual and action-related features may contribute to socially relevant computations. Given the short duration of our videos and the relatively long timescale of neural feature processing, it is possible that social-affective features are the result of ongoing processing relying on temporal integration of the previously extracted features. However, more research is needed to understand how these temporal dynamics change with continuous visual input (e.g., a natural movie), and whether social-affective features rely on previously extracted information.

Our results add temporal characterization to previous fMRI findings, suggesting that the seemingly conflicting features revealed by previous studies, like the number of agents (Wurm and Caramazza, 2019), sociality (Tarhan and Konkle, 2020), or semantic action category (Tucciarelli et al., 2019), emerge at different stages during action observation. Thus, the existence of different organizing dimensions can be explained not just through spatial segregation within and across brain areas, but also through a temporal gradient starting with visual features and concluding with behaviorally relevant social and affective representations. More work is needed to understand where these dynamic representations emerge in the brain, and whether they are supported by overlapping or distinct networks. Future research could test this using EEG-fMRI fusion to track the spatiotemporal dynamics of action representations.

Actions in context

As real-world actions tend to occur in a rich social context, studies of action perception should consider socially relevant features and the interactions between different systems for perceiving actions, agents, and their mental states (Quadflieg and Koldewyn, 2017). Recent work suggests that social perception enhances visual processing (Bellot et al., 2021; Papeo, 2020) and recruits dedicated neural circuits (Isik et al., 2017; Pitcher and Ungerleider, 2021). Our findings open exciting new avenues for connecting these areas of research. For example, future studies could more explicitly disentangle the perceptual and conceptual building blocks of social and affective features, such as body posture or facial expression, and their roles in action and interaction perception.

One fundamental question that lies at the root of this work is how actions should be defined and studied. Here, we adopted a broad definition of the term, focusing on activities as described in the ATUS (ATUS, 2019). Although our stimuli were selected to clearly depict short, continuous actions performed by visible agents, their naturalistic and context-rich nature means that they could be understood as ‘events,’ encompassing elements that are not singularly specific to actions. A wealth of evidence has shown that context changes visual processing in a nonadditive way (Bar, 2004; Willems and Peelen, 2021), and emerging evidence suggests that the same is true for actions (Wurm et al., 2012). Studying actions in context holds promise for understanding how semantically rich representations emerge in naturalistic vision. This, in turn, will pave the way towards a computational understanding of the neural processes that link perception and cognition.

Materials and methods

Behavior: Stimuli

We curated two stimulus sets containing three-second videos of everyday actions from the Moments in Time dataset (Monfort et al., 2020). To broadly sample the space of everyday actions, we first identified the most common activities from the Bureau of Labor Statistics’ American Time Use Survey (ATUS, 2019). Our final dataset included 18 social and nonsocial activities that lend themselves to visual representation (Table 1), to ensure a diverse and balanced stimulus set representative of the human everyday action space. We note that the ATUS distinctions are based on performed rather than observed actions. While imperfect, they provide ecologically relevant and objective criteria with which to define our action space.

Action categories were selected from the second-level activities identified in the ATUS. We used a minimum cutoff of 0.14 hr/day to select common actions (Table 1). To diversify our dataset, we added a ‘hiking’ category (to increase variability in scene setting) and a ‘fighting’ category (for variability along affective dimensions). In addition, ‘driving’ was selected as a more specific instance of the ‘travel’ categories in ATUS as it is the most common form of transportation in the United States. Some adjustments were also made to the ‘relaxing and leisure’ category by selecting two specific activities that were easy to represent and distinguish visually, as well as above threshold (‘reading’ and ‘playing games’). In addition, our ‘reading’ category included both ‘reading for personal interest’ and ‘homework and research’ as this distinction is difficult to convey visually. We omitted three leisure categories that were difficult to represent in brief videos (‘watching TV,’ ‘relaxing and thinking,’ and ‘computer use for leisure, excluding games’), as well as the ‘consumer goods purchases’ category.

We curated an initial set of 544 videos from the Moments in Time dataset by identifying the verb labels relevant to our chosen activities. We chose videos that were horizontally oriented (landscape) and of reasonable image quality, clearly represented the activity in question, clearly depicted an agent performing the activity, and varied in terms of the number of agents, the gender and ethnicity of the agents, and the scene setting. Although some categories were represented by fewer verb labels than others in the final set, our curation procedure aimed to balance important features within and across action categories. We also curated a set of control videos depicting natural and indoor scenes.

We then selected two subsets of videos (1) that sampled all activities in a balanced manner and (2) where sociality (as assessed through behavioral ratings, see section ‘Behavioral ratings’) was minimally correlated to the number of agents (experimenter-labeled). This was done by randomly drawing 10,000 subsets of videos that sampled all activities equally and selecting the video set with the lowest correlation between sociality and the number of agents. These two features are difficult to disentangle in naturalistic stimulus sets, and we were able to minimize, though not fully eliminate, this correlation (Figure 1a).

The first stimulus set contained 152 videos (eight videos per activity and eight additional videos with no agents) and was used in Experiment 1. The videos with no agents were included to provide variation along visual properties that did not pertain to actions or agents, as well as variation in the overall number of agents per video. From the remaining videos, a second set of 76 videos was sampled and manually adjusted to remove videos without agents (in the interest of experimental time) and any videos that were too similar visually to other videos in the same category (e.g., involving a similar number of agents in similar postures). The second stimulus set thus contained 65 videos (three or four videos per activity) and was used in Experiment 2. The videos were preprocessed to a frame rate of 24 frames per second using the VideoWriter object in MATLAB and resized to 600 × 400 pixels. This was done by first resizing the videos to meet the dimension criteria (using MATLAB’s imresize function with bicubic interpolation). The videos were then cropped to the correct aspect ratio either centrally or using manually determined coordinates to make sure the action remained clear after cropping.
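For illustration, the resize-then-crop preprocessing described above could be approximated in Python with OpenCV as sketched below; the authors used MATLAB, the paths are hypothetical, and the frame-rate handling here is simplified (frames are written at the target fps without resampling).

```python
# Illustrative sketch of video preprocessing (resize with bicubic interpolation,
# then center-crop to 600 x 400); not the authors' MATLAB pipeline.
import cv2

def preprocess(in_path, out_path, width=600, height=400, fps=24):
    cap = cv2.VideoCapture(in_path)
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Scale so both dimensions reach the target size, then crop centrally
        # (manual crop coordinates could be used instead to keep the action visible).
        h, w = frame.shape[:2]
        scale = max(width / w, height / h)
        frame = cv2.resize(frame, (int(round(w * scale)), int(round(h * scale))),
                           interpolation=cv2.INTER_CUBIC)
        h, w = frame.shape[:2]
        x0, y0 = (w - width) // 2, (h - height) // 2
        writer.write(frame[y0:y0 + height, x0:x0 + width])
    cap.release()
    writer.release()
```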

Behavior: Participants

Behavioral ratings

A total of 256 workers (202 after exclusions, located in the United States, worker age and gender not recorded) from the online platform Amazon Mechanical Turk provided sociality, valence, arousal, and activity ratings of the video stimuli, and 43 workers (35 after exclusions) provided transitivity ratings.

Multiple arrangement

Two separate online multiple arrangement experiments were performed on each of the two stimulus sets. A total of 374 workers from Amazon Mechanical Turk took part in Experiment 1 (300 after exclusions, located in the United States, worker age and gender not recorded). Experiment 2 involved 58 participants (53 after exclusions, 31 female, 20 male, 1 non-binary, 1 not reported, mean age 19.38 ± 1.09) recruited through the Department of Psychological and Brain Sciences Research Portal at Johns Hopkins University.

All procedures for online data collection were approved by the Johns Hopkins University Institutional Review Board (protocol number HIRB00009730), and informed consent was obtained from all participants.

Behavior: Experimental procedure

Behavioral ratings

Participants viewed subsets of 30–60 videos from the initially curated large-scale set and rated the events depicted on a five-point scale. In a first set of experiments, the dimensions rated were sociality (how social the events were, from 1 – not at all to 5 – very social); valence (how pleasant the events were, from 1 – very unpleasant to 5 – very pleasant); arousal (how intense the events were, from 1 – very calm to 5 – very intense); and activity (how active they were, from 1 – no action to 5 – very active). In separate experiments, participants provided transitivity ratings for the two final stimulus sets (i.e., to what extent the actions involved a person or people interacting with an object, from 1 – not at all to 5 – very much). Participants were excluded if they responded incorrectly to catch trials (approximately 10% of trials) requiring them to label the action shown in the prior video, or if they provided overly repetitive ratings (e.g., using two or fewer of the five possible rating values throughout the entire experiment). This amounted to an average of 17.46 ± 2.14 ratings per video (Experiment 1) and 18.22 ± 2.09 ratings per video (Experiment 2). The experiments were implemented in JavaScript using the jsPsych library (de Leeuw, 2015).

Multiple arrangement

To characterize human action representations, we measured behavioral similarity using two multiple arrangement experiments. The experiments were conducted on the Meadows platform (https://meadows-research.com/) and required participants to arrange the videos according to their similarity inside a circular arena. Participants were free to use their own criteria to determine similarity, so as to encourage natural behavior.

Each trial started with the videos arranged around the circular arena. The videos would start playing on hover, and the trial would not end until all videos were played and dragged-and-dropped inside the arena (Figure 2). Different sets of videos were presented in different trials. An adaptive ‘lift-the-weakest’ algorithm was used to resample the video pairs placed closest together, so as to gather sufficient evidence (or improve the signal-to-noise ratio) for each pair. This procedure was repeated until an evidence criterion of 0.5 was reached for each pair or until the experiment timed out (Experiment 1: 90 min; Experiment 2: 120 min). By asking participants to zoom into the subsets previously judged as similar, the task required the use of different contexts and criteria to judge relative similarities. Compared to other methods of measuring similarity, multiple arrangement thus combines efficient sampling of a large stimulus set with adaptive behavior that can recover a multi-dimensional similarity structure (Kriegeskorte and Mur, 2012).

In Experiment 1, participants arranged different subsets of 30 videos from the 152-video set, with a maximum of 7 videos shown in any one trial. The stimuli were sampled in a balanced manner across participants. The task took on average 32 ± 14.4 min and 86.8 ± 22.6 trials.

In Experiment 2, all participants arranged the same 65 videos (entire 65-video set), with a maximum of 8 videos shown in any one trial. The task took on average 87.5 ± 24.6 min, including breaks, and 289.7 ± 57.3 trials.

The experiments included a training trial in which participants arranged the same seven videos (in Experiment 1) or eight videos (in Experiment 2) before beginning the main task. In both experiments, these videos were hand-selected to represent clear examples from four categories. Participants were excluded from further analysis if there was a low correlation between their training data and the average of all other participants’ data (over 2 SDs below the mean). They were also excluded if they responded incorrectly to a catch trial requiring them to label the action in previously seen videos.

Inverse MDS was used to construct behavioral dissimilarity matrices containing normalized Euclidean distances between all pairs of videos (Kriegeskorte and Mur, 2012). In Experiment 1, the behavioral RDM contained 11,476 pairs with an average of 11.37 ± 3.08 estimates per pair; in Experiment 2, there were 2080 pairs arranged by all 53 participants.

Behavior: Data analysis

Representational similarity analysis

Everyday actions can be differentiated along numerous axes. Perceptually, they can differ in terms of visual properties, like the setting in which they take place. They can also be characterized through action-related features, like semantic action category, or through social features, like the number of agents involved. Understanding how these features contribute to natural behavior can shed light on how naturalistic action representations are organized. Here, we used representational similarity analysis (RSA) to assess the contribution of visual, action, and social-affective features to the behavioral similarity data.

We quantified features of interest using image properties, labels assigned by experimenters (Supplementary file 1a), and behavioral ratings (provided by participants, see section ‘Behavioral ratings’). We calculated the Euclidean distances between all pairs of stimuli in each feature space, thus generating 17 feature RDMs.
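
For illustration, a single feature RDM can be computed from a stimulus-by-dimension feature matrix using pairwise Euclidean distances. The sketch below is a minimal MATLAB example with hypothetical variable names (featureMatrix is a placeholder, not the original analysis code).

```matlab
% Minimal sketch: build one feature RDM from a videos-by-dimensions feature
% matrix via pairwise Euclidean distances. Variable names are hypothetical.
nVideos = 152;                                % Experiment 1 stimulus set
featureMatrix = rand(nVideos, 512);           % placeholder feature vectors
distVec = pdist(featureMatrix, 'euclidean');  % 11,476 pairwise distances
featureRDM = squareform(distVec);             % 152 x 152 symmetric RDM
```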

To quantify visual features, image properties were extracted separately for each frame of each video and averaged across frames. These included pixel value (luminance), hue, saturation, optic flow (the magnitude of the optic flow estimated using the Horn–Schunck method), and the spatial envelope of each image quantified using GIST (Oliva and Torralba, 2001). We also extracted activations from the first convolutional layer and last fully-connected layer of a pretrained feedforward convolutional neural network (AlexNet; Krizhevsky et al., 2012). These features were vectorized prior to computing Euclidean distances between them (see Supplementary file 1b for the dimensionality of each feature). Two additional experimenter-labeled features were included: scene setting (indoors/outdoors) and the presence of a watermark. To assess whether a motion energy model (Adelson and Bergen, 1985; Nishimoto et al., 2011; Watson and Ahumada, 1985) would better capture the impact of motion, we performed control analyses by computing motion energy features for each video using a pyramid of spatio-temporal Gabor filters with the pymoten package (Nunez-Elizalde et al., 2021).
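
The per-frame extraction and averaging could be implemented along the lines of the sketch below, which assumes a hypothetical file name and uses MATLAB's VideoReader, the Computer Vision Toolbox Horn–Schunck optic flow estimator, and the Deep Learning Toolbox AlexNet model; it is a simplified stand-in rather than the exact pipeline used in the study.

```matlab
% Minimal sketch (assumed file name): average low-level properties over the
% frames of one video, and extract AlexNet activations for a single frame.
v = VideoReader('example_video.mp4');            % hypothetical video file
of = opticalFlowHS;                              % Horn-Schunck optic flow estimator
lum = []; flowMag = [];
while hasFrame(v)
    frame = readFrame(v);
    hsv = rgb2hsv(frame);
    lum(end+1) = mean(hsv(:,:,3), 'all');        % pixel value (luminance proxy)
    flow = estimateFlow(of, rgb2gray(frame));
    flowMag(end+1) = mean(flow.Magnitude, 'all'); % mean optic flow magnitude
end
meanLum = mean(lum); meanFlow = mean(flowMag);   % averaged across frames

% AlexNet activations for one frame (last fully-connected layer):
net = alexnet;                                   % requires the AlexNet support package
fc8 = activations(net, imresize(frame, [227 227]), 'fc8');
fc8 = fc8(:)';                                   % vectorize before computing distances
```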

Action feature RDMs were based on transitivity and activity ratings (provided by participants, see above), as well as action category (a binary RDM clustering the stimuli into activity categories based on the initial dataset designations) and effectors (experimenter-labeled). The latter consisted of binary vectors indicating the involvement of body parts in each action (face/head, hands, arms, legs, and torso). To assess whether a more detailed semantic model would capture more information, we also performed a control analysis using a feature RDM based on WordNet similarities between the verb labels in the ‘Moments in Time’ dataset (Figure 4—figure supplement 1).

Social-affective feature RDMs were based on sociality, valence, and arousal ratings (all provided by participants, see section ‘Behavioral ratings’ above) and the number of agents in each video, which was labeled by experimenters on a four-point scale (from 0, no agent present, to 3, three or more agents present).

Each participant’s behavioral RDM was correlated to the feature RDMs, and the resulting Kendall’s τA values were tested against chance using one-tailed sign permutation testing (5000 iterations). P-values were omnibus-corrected for multiple comparisons using a maximum correlation threshold across all models (Nichols and Holmes, 2002).
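
A simplified version of this step is sketched below; subjectRDMs and featureRDMvec are assumed variable names, Kendall's τA is computed with a straightforward O(n^2) helper (MATLAB's built-in 'Kendall' option returns the tie-corrected τB instead), and the omnibus maximum-correlation correction is omitted for brevity.

```matlab
% Minimal sketch (assumed variables): correlate each participant's behavioral
% RDM with one feature RDM (Kendall's tau-A), then run a one-tailed sign
% permutation test on the group of correlations.
nSubj = numel(subjectRDMs);                      % cell array of vectorized RDMs
tau = zeros(nSubj, 1);
for s = 1:nSubj
    tau(s) = kendallTauA(subjectRDMs{s}, featureRDMvec);
end

nPerm = 5000;
obs = mean(tau);
nullDist = zeros(nPerm, 1);
for p = 1:nPerm
    flips = sign(rand(nSubj, 1) - 0.5);          % randomly flip each subject's sign
    nullDist(p) = mean(tau .* flips);
end
pval = (sum(nullDist >= obs) + 1) / (nPerm + 1); % one-tailed p-value

function t = kendallTauA(x, y)
% Kendall's tau-A: (concordant - discordant pairs) / total pairs; ties count as zero.
x = x(:); y = y(:);
n = numel(x); s = 0;
for i = 1:n-1
    s = s + sum(sign(x(i) - x(i+1:n)) .* sign(y(i) - y(i+1:n)));
end
t = s / (n*(n-1)/2);
end
```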

A noise ceiling was calculated by correlating each subject’s RDM to the average RDM (upper bound), as well as to the average RDM excluding the left-out subject (lower bound; Nili et al., 2014).
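
A sketch of the noise ceiling computation is shown below, assuming a pairs-by-subjects matrix of vectorized RDMs (allRDMs is a hypothetical variable); the built-in Kendall correlation is used here as a stand-in for τA, which it approximates when ties are rare.

```matlab
% Minimal sketch (assumed variables): lower and upper noise-ceiling bounds.
% allRDMs: nPairs x nSubj matrix of vectorized single-subject RDMs.
nSubj = size(allRDMs, 2);
upperB = zeros(nSubj, 1); lowerB = zeros(nSubj, 1);
grandMean = mean(allRDMs, 2);
for s = 1:nSubj
    others = mean(allRDMs(:, setdiff(1:nSubj, s)), 2);  % leave-one-subject-out mean
    upperB(s) = corr(allRDMs(:, s), grandMean, 'type', 'Kendall');
    lowerB(s) = corr(allRDMs(:, s), others, 'type', 'Kendall');
end
noiseCeiling = [mean(lowerB), mean(upperB)];             % [lower, upper] bounds
```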

Variance partitioning

Despite low correlations between features of interest in both stimulus sets (Figure 1a), shared variance could still contribute to the RSA results. To estimate the unique contributions of the three primary groups of features, we performed a cross-validated variance partitioning analysis, excluding individual features that did not correlate with the behavioral data in the above RSA analysis. The three groups included visual features (scene setting and the last fully-connected layer of AlexNet), action features (action category, effectors, transitivity, and activity), and social-affective features (number of agents, sociality, valence, and arousal).

The behavioral data were randomly split into training and test sets (100 iterations) by leaving out half of the individual similarity estimates for each pair of videos in Experiment 1 (since different participants saw different subsets of videos) or half of the participants in Experiment 2. We fit seven different regression models using the average training RDM (with every possible combination of the three groups of features), and we calculated the squared Kendall’s τA between the predicted responses and the average test RDM. These values were then used to calculate the unique and shared portions of variance contributed by the predictors (Groen et al., 2018; Lescroart et al., 2015; Tarhan et al., 2021).
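
The sketch below illustrates one cross-validation fold of this procedure. V, A, and S stand for the visual, action, and social-affective predictor matrices (one column per vectorized feature RDM), and trainRDM/testRDM are the average training and test behavioral RDMs; these are assumed names rather than the original code.

```matlab
% Minimal sketch (assumed variables): variance partitioning for one split.
% V, A, S: nPairs x nFeatures predictor matrices (vectorized feature RDMs);
% trainRDM, testRDM: nPairs x 1 average behavioral RDMs for this split.
models = {V, A, S, [V A], [V S], [A S], [V A S]};       % 7 regression models
r2 = zeros(1, 7);
for m = 1:7
    X = [ones(size(trainRDM, 1), 1) models{m}];         % add intercept
    b = X \ trainRDM;                                   % least-squares fit on training data
    pred = X * b;                                       % predicted dissimilarities
    r2(m) = corr(pred, testRDM, 'type', 'Kendall')^2;   % squared predicted correlation
end
% Unique variance of each group, by inclusion-exclusion:
uniqueVisual = r2(7) - r2(6);   % full model minus the model without visual features
uniqueAction = r2(7) - r2(5);
uniqueSocial = r2(7) - r2(4);
```

Because the split is over similarity estimates or participants rather than stimuli, the predictor matrices are identical in both halves; only the response RDM changes between training and test.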

The resulting values were tested against chance using one-tailed sign permutation testing (5000 iterations, omnibus-corrected for multiple comparisons). Differences between groups of features were assessed with two-sided Wilcoxon signed-rank tests.

EEG: Stimuli

The EEG experiment used the stimulus set from behavioral Experiment 1, containing 152 videos spanning 18 action categories plus control videos with no agents. The three-second stimuli were trimmed to a duration of 0.5 s centered on the action, as determined by visual inspection, to ensure that the shorter videos remained easily understandable. This improved time-locking to the EEG signals and allowed for a condition-rich experimental design. An additional 50 videos were included as catch stimuli (25 easily identifiable pairs depicting the same action, manually chosen from the larger stimulus set).

EEG: Participants

Fifteen participants (six female, nine male, mean age 25.13 ± 6.81) took part in the EEG experiment. All participants were right-handed and had normal or corrected-to-normal vision. Informed consent was obtained in accordance with the Declaration of Helsinki, and all procedures were approved by the Johns Hopkins University Institutional Review Board (protocol number HIRB00009835).

EEG: Experimental procedure

Continuous EEG recordings with a sampling rate of 1000 Hz were made with a 64-channel Brain Products ActiCHamp system using actiCAP electrode caps in a Faraday chamber. Electrode impedances were kept below 25 kΩ when possible, and the Cz electrode was used as an online reference.

Participants were seated upright while viewing the videos on a back-projector screen situated approximately 45 cm away. The 152 videos were shown in pseudorandom order in each of 10 blocks with no consecutive repetition allowed. In addition, four repetitions of the 25 catch video pairs were presented at random times during the experiment. The video pairs were presented in different orders to minimize learning effects, so that for each video pair (V1, V2), half of the presentations were in the order V1-V2 and half of them were in the order V2-V1. Participants performed a one-back task and were asked to press a button on a Logitech game controller when they detected two consecutive videos showing the same action. Participants were not instructed on what constituted an action, beyond being given ‘eating’ as a simple example. There was a break every 150 trials, and participants could continue the experiment by pressing a button. In total, the experiment consisted of 1720 trials (1520 experimental trials and 200 catch trials) and took approximately 45 min.

The stimuli were presented using an Epson PowerLite Home Cinema 3000 projector with a 60 Hz refresh rate. Each trial started with a black fixation cross presented on a gray screen for a duration chosen from a uniform distribution between 1 and 1.5 s, followed by a 0.5 s video. The stimuli were presented on the same gray background and subtended approximately 15 × 13 degrees of visual angle. The fixation cross remained on screen, and participants were asked to fixate throughout the experiment. A photodiode was used to accurately track on-screen stimulus presentation times and account for projector lag. The paradigm was implemented in MATLAB R2019a using the Psychophysics Toolbox (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997).

EEG: Data analysis

Preprocessing

EEG data preprocessing was performed using MATLAB R2020b and the FieldTrip toolbox (Oostenveld et al., 2011). First, the EEG data were aligned to stimulus onset using the photodiode data to correct for any lag between stimulus triggers and on-screen presentation. The aligned data were segmented into 1.2 s epochs (0.2 s pre-stimulus to 1 s post-stimulus onset), baseline-corrected using the 0.2 s prior to stimulus onset, and high-pass filtered at 0.1 Hz.

Artifact rejection was performed using a semi-automated pipeline. First, the data were filtered between 110 and 140 Hz and Hilbert-transformed to detect muscle artifacts; segments with a z-value cutoff above 15 were removed. Next, channels and trials with high variance were manually rejected based on visual inspection of a summary plot generated using the ft_rejectvisual function in FieldTrip. Finally, independent component analysis (ICA) was performed to identify and remove eye movement components from the data.

Catch trials were removed from the data together with any trials that elicited a button response (13.74 ± 1.82% of all trials). Of the remaining trials, 8.36 ± 5.01% (ranging between 25 and 275 trials) were removed during the artifact rejection procedure. A maximum of two noisy electrodes were removed from eight participants’ datasets.

Prior to further analysis, the data were re-referenced to the median across all electrodes, low-pass filtered at 30 Hz to investigate evoked responses, and downsampled to 500 Hz.
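
These final steps could be expressed with FieldTrip-style configuration options roughly as follows; this is a sketch with assumed variable names, and the option names are taken from the FieldTrip documentation rather than the original scripts.

```matlab
% Minimal sketch (assumed variables): median re-referencing, low-pass
% filtering, and downsampling of the artifact-rejected data in FieldTrip.
cfg            = [];
cfg.reref      = 'yes';
cfg.refchannel = 'all';
cfg.refmethod  = 'median';            % re-reference to the median across electrodes
cfg.lpfilter   = 'yes';
cfg.lpfreq     = 30;                  % low-pass filter at 30 Hz
dataFiltered   = ft_preprocessing(cfg, dataCleaned);   % dataCleaned: assumed input

cfg            = [];
cfg.resamplefs = 500;                 % downsample to 500 Hz
dataFinal      = ft_resampledata(cfg, dataFiltered);
```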

Multivariate analysis

We performed multivariate analyses to investigate (1) whether EEG patterns reflected behavioral similarity and (2) whether different visual, action, and social-affective features explained variance in the neural data.

First, time-resolved decoding of every pair of videos was performed using a linear support vector machine classifier as implemented in the LibSVM library (Chang and Lin, 2011). Split-half cross-validation was used to classify each pair of videos in each participant’s data. To do this, the single-trial data were divided into two halves for training and testing, while ensuring that each condition was equally represented in both halves. To improve the signal-to-noise ratio, we averaged multiple trials corresponding to the same video into pseudotrials, separately within the training and test sets. As each video was shown 10 times, a maximum of five trials were averaged to create a pseudotrial. Multivariate noise normalization was performed using the covariance matrix of the training data (Guggenmos et al., 2018). Classification between all pairs of videos was performed separately for each time point; as the data were sampled at 500 Hz, each time point corresponded to a nonoverlapping 2 ms window. Voltage values from all EEG channels were entered as features to the classification model.
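
For a single video pair and time point, the core of this procedure might look like the sketch below, which uses the LibSVM MATLAB interface (svmtrain/svmpredict) and replaces the shrinkage covariance estimator of Guggenmos et al. (2018) with a simple ridge term; all variable names are assumed.

```matlab
% Minimal sketch (assumed variables): decoding of one video pair at one time
% point. trialsA, trialsB: 10 trials x 64 channels of voltage values.
trainIdx = 1:5;  testIdx = 6:10;                        % one split-half fold
trainRaw = [trialsA(trainIdx,:); trialsB(trainIdx,:)];

% Multivariate noise normalization from the training data only (simplified:
% a small ridge term stands in for the shrinkage covariance estimator).
sigma = cov(trainRaw);
W = inv(sqrtm(sigma + 0.01 * trace(sigma)/size(sigma,1) * eye(size(sigma,1))));

% Pseudotrials: average the (up to five) trials in each half for each video.
trainData = [mean(trialsA(trainIdx,:), 1); mean(trialsB(trainIdx,:), 1)] * W;
testData  = [mean(trialsA(testIdx,:),  1); mean(trialsB(testIdx,:),  1)] * W;
labels = [1; 2];

model = svmtrain(labels, trainData, '-s 0 -t 0 -q');    % linear SVM (LibSVM)
pred  = svmpredict(labels, testData, model, '-q');
pairAccuracy = mean(pred == labels);                    % one entry of the neural RDM
```

Repeating this over all video pairs, time points, and random splits yields the decoding accuracies that populate the time-resolved neural RDMs described below.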

The entire procedure, from dataset splitting to classification, was repeated 10 times with different data splits. The average decoding accuracies between all pairs of videos were then used to generate a neural RDM at each time point for each participant. To generate the RDM, the dissimilarity between each pair of videos was determined by their decoding accuracy (increased accuracy representing increased dissimilarity at that time point).

Next, we evaluated the correlations between each participant’s time-resolved neural RDM and the feature RDMs found to correlate with behavioral similarity (Experiment 1). To investigate the link between behavioral and neural representations, we also correlated neural RDMs with the average behavioral RDM obtained from the multiple arrangement task in Experiment 1. This analysis was performed using 10 ms sliding windows with an overlap of 6 ms. The resulting Kendall’s τA values were tested against chance using one-tailed sign permutation testing (5000 iterations, cluster-corrected for multiple comparisons across time using the maximum cluster sum, α = 0.05, cluster setting α = 0.05). A noise ceiling was calculated using the same procedure as in the behavioral RSA (see ‘Representational similarity analysis’). Effect latencies were assessed by bootstrapping the individual correlations 1000 times with replacement to calculate 90% confidence intervals around effect onsets.
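
One plausible implementation of the sliding-window correlation (5 samples per window and a 2-sample step at 500 Hz, giving 10 ms windows with 6 ms overlap) is sketched below; neuralRDM and behaviorRDM are assumed variable names, and the built-in Kendall correlation again stands in for τA.

```matlab
% Minimal sketch (assumed variables): sliding-window RSA between neural and
% behavioral RDMs. neuralRDM: nPairs x nTimepoints; behaviorRDM: nPairs x 1.
winLen = 5;                                   % 10 ms at 500 Hz (2 ms per sample)
step   = 2;                                   % 4 ms step, i.e., 6 ms overlap
nTime  = size(neuralRDM, 2);
starts = 1:step:(nTime - winLen + 1);
rsaTimecourse = zeros(numel(starts), 1);
for w = 1:numel(starts)
    winRDM = mean(neuralRDM(:, starts(w):starts(w)+winLen-1), 2);  % average within window
    rsaTimecourse(w) = corr(winRDM, behaviorRDM, 'type', 'Kendall');
end
```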

To quantify the contributions of visual, social-affective, and action features to the neural RDMs, a time-resolved cross-validated variance partitioning procedure was performed. Using 100 split-half cross-validation iterations, the neural RDM was entered as a response variable in a hierarchical regression with three groups of feature RDMs (visual, social-affective, and action) as predictors. This analysis employed the same 10 feature RDMs used in the behavioral variance partitioning (see section ‘Variance partitioning’), with the addition of activations from the first convolutional layer of AlexNet (Conv1). As Conv1 best captures early visual responses (Figure 5—figure supplement 1), its inclusion ensured that we did not underestimate the role of visual features in explaining neural variance. We did not use frame-wise RDMs to model these visual features; however, our approach of averaging features across video frames was justified by the short duration of our videos and the high correlation of CNN features across frames (Conv1: Pearson’s ρ=0.89±0.09; FC8: ρ=0.98±0.03).

The analysis was carried out using 10 ms sliding windows with an overlap of 6 ms. The resulting predicted Kendall’s τA values were tested against chance using one-tailed sign permutation testing (5000 iterations, cluster-corrected for multiple comparisons using the maximum cluster sum across time windows and regressions performed, α = 0.05, cluster-setting α = 0.05). The distributions of effect onsets across the 100 split-half iterations were compared using two-sided Wilcoxon signed-rank tests.

Data availability

Behavioral and EEG data and results have been archived as an Open Science Framework repository (https://osf.io/hrmxn/). Analysis code is available on GitHub (https://github.com/dianadima/mot_action; Dima, 2021).

Acknowledgements

This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. The authors wish to thank Tara Ghazi, Seah Chang, Alyssandra Valenzuela, Melody Lee, Cora Mentor Roy, Haemy Lee Masson, and Lucy Chang for their help with the EEG data collection, Dimitrios Pantazis for pairwise decoding code, and Emalie McMahon for comments on the manuscript.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Diana C Dima, Email: ddima@jhu.edu.

Chris I Baker, National Institute of Mental Health, National Institutes of Health, United States.

Funding Information

This paper was supported by the following grant:

  • National Science Foundation CCF-1231216 to Leyla Isik.

Additional information

Competing interests

No competing interests declared.

Author contributions

Diana C Dima, Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review and editing.

Tyler M Tomita, Conceptualization, Data curation, Methodology, Writing – review and editing.

Christopher J Honey, Conceptualization, Methodology, Writing – review and editing.

Leyla Isik, Conceptualization, Funding acquisition, Methodology, Resources, Supervision, Writing – review and editing.

Ethics

Human subjects: All procedures for data collection were approved by the Johns Hopkins University Institutional Review Board, with protocol numbers HIRB00009730 for the behavioral experiments and HIRB00009835 for the EEG experiment. Informed consent was obtained from all participants.

Additional files

Supplementary file 1. Additional information about the stimulus sets and features used in analysis.

(a) Breakdown of scene setting and number of agents across the two final stimulus sets. (b) Features quantified in both stimulus sets and used to generate feature representational dissimilarity matrices (RDMs) in the representational similarity analysis.

elife-75027-supp1.docx (15.5KB, docx)
Transparent reporting form

Data availability

Behavioral and EEG data and results have been archived as an Open Science Framework repository (https://osf.io/hrmxn/). Analysis code is available on GitHub (https://github.com/dianadima/mot_action, copy archived at swh:1:rev:af9eede56f27215ca38ddd32564017f1f90417d0).

The following dataset was generated:

Dima DC, Tomita TM, Honey CJ, Isik L. 2021. Social-affective features drive human representations of observed actions. Open Science Framework. hrmxn

References

  1. Adelson EH, Bergen JR. Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America. A, Optics and Image Science. 1985;2:284–299. doi: 10.1364/josaa.2.000284. [DOI] [PubMed] [Google Scholar]
  2. Allen M, Poggiali D, Whitaker K, Marshall TR, Kievit RA. Raincloud plots: A multi-platform tool for robust data visualization. Wellcome Open Research. 2019;4:63. doi: 10.12688/wellcomeopenres.15191.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. ATUS . American Time Use Survey, United States Department of Labor. Bureau of Labor Statistics; 2019. [Google Scholar]
  4. Bar M. Visual objects in context. Nature Reviews. Neuroscience. 2004;5:617–629. doi: 10.1038/nrn1476. [DOI] [PubMed] [Google Scholar]
  5. Bedny M, Caramazza A. Perception, action, and word meanings in the human brain: the case from action verbs. Annals of the New York Academy of Sciences. 2011;1224:81–95. doi: 10.1111/j.1749-6632.2011.06013.x. [DOI] [PubMed] [Google Scholar]
  6. Bellot E, Abassi E, Papeo L. Moving Toward versus Away from Another: How Body Motion Direction Changes the Representation of Bodies and Actions in the Visual Cortex. Cerebral Cortex. 2021;31:2670–2685. doi: 10.1093/cercor/bhaa382. [DOI] [PubMed] [Google Scholar]
  7. Brainard DH. The Psychophysics Toolbox. Spatial Vision. 1997;10:433–436. [PubMed] [Google Scholar]
  8. Carlson T, Tovar DA, Alink A, Kriegeskorte N. Representational dynamics of object vision: the first 1000 ms. Journal of Vision. 2013;13:1–19. doi: 10.1167/13.10.1. [DOI] [PubMed] [Google Scholar]
  9. Chang CC, Lin CJ. LIBSVM: A Library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011;2:1961199. doi: 10.1145/1961189.1961199. [DOI] [Google Scholar]
  10. Cichy RM, Pantazis D, Oliva A. Resolving human object recognition in space and time. Nature Neuroscience. 2014;17:455–462. doi: 10.1038/nn.3635. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. de Leeuw JR. jsPsych: A JavaScript library for creating behavioral experiments in A Web browser. Behavior Research Methods. 2015;47:1–12. doi: 10.3758/s13428-014-0458-y. [DOI] [PubMed] [Google Scholar]
  12. Dijkstra N, Ambrogioni L, Vidaurre D, van Gerven M. Neural dynamics of perceptual inference and its reversal during imagery. eLife. 2020;9:e53588. doi: 10.7554/eLife.53588. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Dima DC, Perry G, Messaritaki E, Zhang J, Singh KD. Spatiotemporal dynamics in human visual cortex rapidly encode the emotional content of faces. Human Brain Mapping. 2018;39:3993–4006. doi: 10.1002/hbm.24226. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Dima DC. mot_action. GitHub. 2021. https://github.com/dianadima/mot_action (copy archived at swh:1:rev:af9eede56f27215ca38ddd32564017f1f90417d0)
  15. Giese MA, Poggio T. Neural mechanisms for the recognition of biological movements. Nature Reviews. Neuroscience. 2003;4:179–192. doi: 10.1038/nrn1057. [DOI] [PubMed] [Google Scholar]
  16. Greene MR, Hansen BC. Shared spatiotemporal category representations in biological and artificial deep neural networks. PLOS Computational Biology. 2018;14:e1006327. doi: 10.1371/journal.pcbi.1006327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Groen II, Greene MR, Baldassano C, Fei-Fei L, Beck DM, Baker CI. Distinct contributions of functional and deep neural network features to representational similarity of scenes in human brain and behavior. eLife. 2018;7:e32962. doi: 10.7554/eLife.32962. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Guggenmos M, Sterzer P, Cichy RM. Multivariate pattern analysis for MEG: A comparison of dissimilarity measures. NeuroImage. 2018;173:434–447. doi: 10.1016/j.neuroimage.2018.02.044. [DOI] [PubMed] [Google Scholar]
  19. Hafri A, Trueswell JC, Epstein RA. Neural Representations of Observed Actions Generalize across Static and Dynamic Visual Input. The Journal of Neuroscience. 2017;37:3056–3071. doi: 10.1523/JNEUROSCI.2496-16.2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Haxby JV, Gobbini MI, Nastase SA. Naturalistic stimuli reveal a dominant role for agentic action in visual representation. NeuroImage. 2020;216:116561. doi: 10.1016/j.neuroimage.2020.116561. [DOI] [PubMed] [Google Scholar]
  21. Hirai M, Fukushima H, Hiraki K. An event-related potentials study of biological motion perception in humans. Neuroscience Letters. 2003;344:41–44. doi: 10.1016/s0304-3940(03)00413-0. [DOI] [PubMed] [Google Scholar]
  22. Hirai M, Hiraki K. The relative importance of spatial versus temporal structure in the perception of biological motion: an event-related potential study. Cognition. 2006;99:B15–B29. doi: 10.1016/j.cognition.2005.05.003. [DOI] [PubMed] [Google Scholar]
  23. Hochstein S, Ahissar M. View from the top: hierarchies and reverse hierarchies in the visual system. Neuron. 2002;36:791–804. doi: 10.1016/s0896-6273(02)01091-7. [DOI] [PubMed] [Google Scholar]
  24. Humphreys GF, Newling K, Jennings C, Gennari SP. Motion and actions in language: semantic representations in occipito-temporal cortex. Brain and Language. 2013;125:94–105. doi: 10.1016/j.bandl.2013.01.008. [DOI] [PubMed] [Google Scholar]
  25. Isik L, Meyers EM, Leibo JZ, Poggio T. The dynamics of invariant object recognition in the human visual system. Journal of Neurophysiology. 2014;111:91–102. doi: 10.1152/jn.00394.2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Isik L, Koldewyn K, Beeler D, Kanwisher N. Perceiving social interactions in the posterior superior temporal sulcus. PNAS. 2017;114:E9145–E9152. doi: 10.1073/pnas.1714471114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Isik L, Tacchetti A, Poggio T. A fast, invariant representation for human action in the visual system. Journal of Neurophysiology. 2018;119:631–640. doi: 10.1152/jn.00642.2017. [DOI] [PubMed] [Google Scholar]
  28. Jamali M, Grannan BL, Fedorenko E, Saxe R, Báez-Mendoza R, Williams ZM. Single-neuronal predictions of others’ beliefs in humans. Nature. 2021;591:610–614. doi: 10.1038/s41586-021-03184-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Johansson G. Visual perception of biological motion and a model for its analysis. Perception & Psychophysics. 1973;14:201–211. doi: 10.3758/BF03212378. [DOI] [Google Scholar]
  30. Jokisch D, Daum I, Suchan B, Troje NF. Structural encoding and recognition of biological motion: evidence from event-related potentials and source analysis. Behavioural Brain Research. 2005;157:195–204. doi: 10.1016/j.bbr.2004.06.025. [DOI] [PubMed] [Google Scholar]
  31. Kleiner M, Brainard DH, Pelli DG, Broussard C, Wolf T, Niehorster D. What’s new in Psychtoolbox-3? Perception. 2007;36:1–16. [Google Scholar]
  32. Kriegeskorte N, Mur M. Inverse MDS: Inferring Dissimilarity Structure from Multiple Item Arrangements. Frontiers in Psychology. 2012;3:1–13. doi: 10.3389/fpsyg.2012.00245. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Krizhevsky A, Sutskever I, Hinton GE. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems; 2012. pp. 1–9. [DOI] [Google Scholar]
  34. Lescroart MD, Stansbury DE, Gallant JL. Fourier power, subjective distance, and object categories all provide plausible models of BOLD responses in scene-selective visual areas. Frontiers in Computational Neuroscience. 2015;9:1–20. doi: 10.3389/fncom.2015.00135. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Lingnau A, Downing PE. The lateral occipitotemporal cortex in action. Trends in Cognitive Sciences. 2015;19:268–277. doi: 10.1016/j.tics.2015.03.006. [DOI] [PubMed] [Google Scholar]
  36. Monfort M, Andonian A, Zhou B, Ramakrishnan K, Bargal SA, Yan T, Brown L, Fan Q, Gutfreund D, Vondrick C, Oliva A. Moments in Time Dataset: One Million Videos for Event Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2020;42:502–508. doi: 10.1109/TPAMI.2019.2901464. [DOI] [PubMed] [Google Scholar]
  37. Nastase SA, Goldstein A, Hasson U. Keep it real: rethinking the primacy of experimental control in cognitive neuroscience. NeuroImage. 2020;222:117254. doi: 10.1016/j.neuroimage.2020.117254. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Nichols TE, Holmes AP. Nonparametric permutation tests for functional neuroimaging: A primer with examples. Human Brain Mapping. 2002;15:1–25. doi: 10.1002/hbm.1058. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Nili H, Wingfield C, Walther A, Su L, Marslen-Wilson W, Kriegeskorte N. A toolbox for representational similarity analysis. PLOS Computational Biology. 2014;10:e1003553. doi: 10.1371/journal.pcbi.1003553. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Nishimoto S, Vu A, Naselaris T, Benjamini Y, Yu B, Gallant JL. Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology. 2011;21:1641–1646. doi: 10.1016/j.cub.2011.08.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Nunez-Elizalde A, Deniz F, Dupré la Tour T, Visconti di Oleggio Castello M. pymoten: scientific python package for computing motion energy features from video. v0.0.4. Zenodo. 2021. doi: 10.5281/zenodo.6349625. [DOI]
  42. Oliva A, Torralba A. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision. 2001;42:145–175. doi: 10.1023/A:1011139631724. [DOI] [Google Scholar]
  43. Oostenveld R, Fries P, Maris E, Schoffelen JM. FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Computational Intelligence and Neuroscience. 2011;2011:156869. doi: 10.1155/2011/156869. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Papeo L. Twos in human visual perception. Cortex; a Journal Devoted to the Study of the Nervous System and Behavior. 2020;132:473–478. doi: 10.1016/j.cortex.2020.06.005. [DOI] [PubMed] [Google Scholar]
  45. Pelli DG. The VideoToolbox software for visual psychophysics: transforming numbers into movies. Spatial Vision. 1997;10:437–442. doi: 10.1163/156856897X00366. [DOI] [PubMed] [Google Scholar]
  46. Pitcher D, Ungerleider LG. Evidence for a Third Visual Pathway Specialized for Social Perception. Trends in Cognitive Sciences. 2021;25:100–110. doi: 10.1016/j.tics.2020.11.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Quadflieg S, Koldewyn K. The neuroscience of people watching: how the human brain makes sense of other people’s encounters. Annals of the New York Academy of Sciences. 2017;1396:166–182. doi: 10.1111/nyas.13331. [DOI] [PubMed] [Google Scholar]
  48. Redcay E, Moraczewski D. Social cognition in context: A naturalistic imaging approach. NeuroImage. 2020;216:116392. doi: 10.1016/j.neuroimage.2019.116392. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Russ BE, Leopold DA. Functional MRI mapping of dynamic visual features during natural viewing in the macaque. NeuroImage. 2015;109:84–94. doi: 10.1016/j.neuroimage.2015.01.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Spunt RP, Satpute AB, Lieberman MD. Identifying the What, Why, and How of an Observed Action: An fMRI Study of Mentalizing and Mechanizing during Action Observation. Journal of Cognitive Neuroscience. 2011;23:63–74. doi: 10.1162/jocn.2010.21446. [DOI] [PubMed] [Google Scholar]
  51. Tarhan L, Konkle T. Sociality and interaction envelope organize visual action representations. Nature Communications. 2020;11:1–11. doi: 10.1038/s41467-020-16846-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Tarhan L, De Freitas J, Konkle T. Behavioral and neural representations en route to intuitive action understanding. Neuropsychologia. 2021;163:108048. doi: 10.1016/j.neuropsychologia.2021.108048. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Thornton MA, Weaverdyck ME, Tamir DI. The brain represents people as the mental states they habitually experience. Nature Communications. 2019;10:1–10. doi: 10.1038/s41467-019-10309-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Thornton MA, Tamir DI. People accurately predict the transition probabilities between actions. Science Advances. 2021;7:eabd4995. doi: 10.1126/sciadv.abd4995. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Tomita TM, Barense MD, Honey CJ. The Similarity Structure of Real-World Memories. bioRxiv. 2021 doi: 10.1101/2021.01.28.428278. [DOI]
  56. Tucciarelli R, Turella L, Oosterhof NN, Weisz N, Lingnau A. MEG Multivariate Analysis Reveals Early Abstract Action Representations in the Lateral Occipitotemporal Cortex. The Journal of Neuroscience. 2015;35:16034–16045. doi: 10.1523/JNEUROSCI.1422-15.2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Tucciarelli R, Wurm M, Baccolo E, Lingnau A. The representational space of observed actions. eLife. 2019;8:e47686. doi: 10.7554/eLife.47686. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Vangeneugden J, Peelen MV, Tadin D, Battelli L. Distinct neural mechanisms for body form and body motion discriminations. The Journal of Neuroscience. 2014;34:574–585. doi: 10.1523/JNEUROSCI.4032-13.2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Wamain Y, Pluciennicka E, Kalénine S. Temporal dynamics of action perception: differences on ERP evoked by object-related and non-object-related actions. Neuropsychologia. 2014;63:249–258. doi: 10.1016/j.neuropsychologia.2014.08.034. [DOI] [PubMed] [Google Scholar]
  60. Watson AB, Ahumada AJ. Model of human visual-motion sensing. Journal of the Optical Society of America. A, Optics and Image Science. 1985;2:322–341. doi: 10.1364/josaa.2.000322. [DOI] [PubMed] [Google Scholar]
  61. Weaverdyck ME, Thornton MA, Tamir DI. The representational structure of mental states generalizes across target people and stimulus modalities. NeuroImage. 2021;238:118258. doi: 10.1016/j.neuroimage.2021.118258. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Willems RM, Peelen MV. How context changes the neural basis of perception and language. IScience. 2021;24:102392. doi: 10.1016/j.isci.2021.102392. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Wurm MF, Cramon DY, Schubotz RI. The context-object-manipulation triad: cross talk during action perception revealed by fMRI. Journal of Cognitive Neuroscience. 2012;24:1548–1559. doi: 10.1162/jocn_a_00232. [DOI] [PubMed] [Google Scholar]
  64. Wurm MF, Lingnau A. Decoding actions at different levels of abstraction. The Journal of Neuroscience. 2015;35:7727–7735. doi: 10.1523/JNEUROSCI.0188-15.2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Wurm MF, Caramazza A, Lingnau A. Action Categories in Lateral Occipitotemporal Cortex Are Organized Along Sociality and Transitivity. The Journal of Neuroscience. 2017;37:562–575. doi: 10.1523/JNEUROSCI.1717-16.2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Wurm MF, Caramazza A. Lateral occipitotemporal cortex encodes perceptual components of social actions rather than abstract representations of sociality. NeuroImage. 2019;202:116153. doi: 10.1016/j.neuroimage.2019.116153. [DOI] [PubMed] [Google Scholar]

Editor's evaluation

Chris I Baker 1

This study investigates and characterizes the representations of visual actions in video stimuli. The combination of the analytical techniques and stimulus domain makes the article likely to be of broad interest to scientists interested in action representation amidst complex sequences. This article enhances our understanding of visual action representation and the extraction of such information in natural settings.

Decision letter

Editor: Chris I Baker1
Reviewed by: Angelika Lingnau2, Mark Lescroart3

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Social-affective features drive human representations of observed actions" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Chris Baker as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Angelika Lingnau (Reviewer #1); Mark Lescroart (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) Reviewer 1 (Angelika Lingnau) notes and I agree that there are some needed qualifications/clarifications needed about the strength of the distinctions between visual, action, and social features. Given the correlations between these features, some steps need to be taken analytically to strengthen the dissociation or the claims should be attenuated. Note also her thorough smaller comments that should all be directly addressed to strengthen the paper.

2) Relatedly, there are a number of clarifications requested by Reviewer 2 (Mark Lescroart) about the construction of the features and the procedure used in the information analyses. These concerns should be carefully addressed as they will strengthen the paper.

Reviewer #1 (Recommendations for the authors):

Below please find a number of comments that might help strengthen the science and the presentation of this manuscript.

(1) Assignment of features to domains (e.g. page 4, line 92ff): As mentioned above, in the view of this reviewer some of the assignments between features to domains were less clear. As an example why is the setting in which the action takes place considered to be a visual feature (in contrast to, say, a semantic feature, such as indoors or outdoors)? Likewise, why are activations from the final layer of a pretrained neural network considered a high-level visual feature rather than a conceptual/semantic feature? Why is 'activity' (the amount of activity) in a video considered an action-related feature instead of a visual feature? Why is the number of agents considered a social-affective feature (instead of a visual feature; see also my next comment)? Regarding the latter, the authors admit that this feature might be considered either visual or social-affective.

(2) I liked the attempts to try to minimize the correlations between features, which is a known problem when working with naturalistic stimuli. However, in light of the significant correlations between some of the features (in particular between sociality and the number of agents), I am concerned about biases in the estimation of the β weights, which in turn may have impacted the results of the variance partitioning analysis. Moreover, if the features sociality and number of agents are correlated, and if we accept that the number of agents might be a visual rather than a social feature, how do we know that the results obtained for the feature sociality really are based on a social-affective feature?

(3) Methods (page 4, line 89): What was the reason to choose 8 additional videos with no agents?

(4) Figure 1: It might be helpful for the reader to indicate in the figure which of the features are considered to be visual, action-related or socio-affective.

(5) Figure 1B: Please provide some more details on how these plots were generated, and what they show.

(6) Multi-arrangement task (page 7, line 130f): At several points in the manuscript, the authors write that they collected behavioural similarity ratings when referring to the multi-arrangement task. This wording might be confusing since the multi-arrangement task did not require an explicit rating (in contrast to the ratings the authors collected from a separate set of participants for the features sociality, valence, arousal, activity, and transitivity). The authors may want to use a different wording when referring to the results of the multi-arrangement task.

(7) Related to the previous point, I'm somewhat concerned regarding the fact that participants were not provided with specific instructions as to the criteria they should follow when performing the multi-arrangement task. Whereas the authors argue that this was done to emphasize natural behaviour, in the view of this reviewer this approach comes with the risk of not knowing what the participants did in the end. To give concrete examples, it is possible that some participants focused on the context of the actions, while other participants focused on the valence, while yet another set of participants focused on some low level visual features. Moreover, participants might have changed their strategies throughout the experiment if they figured out that a criterion they first considered useful no longer worked for some of the configurations. Whereas the reported leave-one-subject-out correlations were above chance in both experiments, they were overall quite low, which might be related to the issue described above. How can the authors rule out such behaviour (and if not, how could this impact the interpretation of the results)?

(8) Figure 4: The overall differences between the three feature types appear to be quite small. What is the contribution of the number of agents for the social features (see also my corresponding comment on Figure 1a/ correlations between these two features)?

(9) On page 10, 2nd paragraph, the authors report the variance explained by the ten predictors in Experiment 2. What about the corresponding value for Experiment 1?

(10) Task EEG experiment (e.g. page 11, line 212f): participants had to perform a one-back action task in which they detected repetitions of the action category. Did they know what the action categories were (see also related question on the contents of Table 1)?

(11) Figure 5B: maybe add 'noise ceiling' to the legend?

(12) Discussion: the authors write that sociality did not correlate with the EEG patterns. Could it be that different participants used different strategies when performing this rating? (see corresponding comment on instructions for rating of sociality, page 22).

(12) Discussion: page 18: The authors argue for a hierarchy from visual to action-related to socially relevant computations. How might the one-back task (focusing on repetitions of the same action category) have contributed to this result?

(13): Table 1 (page 20): The mapping between the verb labels provided in Table 1 and the 8 exemplars used for each activity did not become quite clear to me. It seems from Table 1 that some of the categories consisted of a range of different actions (e.g. crying, cuddling, feeding, giggling, socializing for the activity 'childcare'), while some of them consisted of one single action only (e.g. reading), which makes the categories hard to compare, both in terms of the underlying perceptual variability and the task difficulty associated with judging a repetition of that category. Please comment.

(14) Table 1 (page 20): Unless I misunderstood, the action 'socializing' occurred within four of the 18 activities (in addition to being one of the 18 activities). What was the reason for this? Could this have impacted the results, showing a stronger contribution of socio-affective features in comparison to action-related and visual features? Likewise, could the results obtained in this study be affected by the curation of the two stimulus datasets (adjusted to include both social and non-social activities)?

(15) Page 20: What was the reason to use these specific common activities? Was there a specific cutoff?

(16) Page 21, line 404f: What was the purpose of the control videos depicting natural and indoors scenes?

(17) Page 21, line 415f: Please explain what 'repetitive' videos mean in this context.

(18) Page 22, behavioural ratings: What were the exclusion criteria for the behavioural ratings and the multiple arrangement task?

(19) Page 22: Participants were instructed to rate 'how social the events were'. In the view of this reviewer, this leaves a lot of room for interpretation (and also differs from the instructions given in previous studies). Please comment.

(20) Page 22: How consistent were the behavioural ratings across participants? Was variability across participants higher for some features than for others?

(21) Page 23 line 454f: How was the subset of videos presented in trial 1 selected?

(22) Page 28 line 541f: According to which criteria did the authors center the 0.5 seconds window around the action?

(23) Page 28 line 557f: What do the authors mean by 'shuffled pairs'?

(24) Page 29, 2nd paragraph: Please provide details on the projector.

Reviewer #2 (Recommendations for the authors):

If I have read the paper (and code) correctly, it seems that a single RDM was used to model all timepoints in the EEG responses, and that single RDM was correlated with the RDMs generated by the pairwise decoding analysis for each timepoint. For purposes of modeling behavioral data – in which the subjects are making a judgment based on the entire video clip – a single RDM per video clip seems reasonable. However, for purposes of modeling time-varying neural responses, it may be less than ideal to use a single RDM to summarize each video. The usefulness of a single temporal summary RDM will depend on how homogenous the features are across the video clip. Having looked over a few of the video clips in Moments in Time, the clips appear to be generally slow-moving – i.e., no large changes in background elements or actors across the 3 seconds. This seems like a reasonable justification for the use of a single temporal summary RDM; I would encourage the authors to clarify their rationale for using one RDM for the whole timecourse instead of framewise RDMs. Quantitative arguments would be useful.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Social-affective features drive human representations of observed actions" for further consideration by eLife. Your revised article has been evaluated by Chris Baker (Senior Editor) in consultation with the original reviewers.

We all agree the manuscript has been improved, the issues raised by the reviewers have been addressed well, and the manuscript will make a nice addition to the literature. But one of the reviewers highlighted a remaining concern that I wanted to draw your attention to. This concern could be addressed with additional analyses or additional discussion and motivation of the approach you used (I will leave it up to you which avenue you want to pursue). I'll quote from the reviewer:

"Among the visual features the authors investigate is motion. To me, this is a visual feature that would seem to have a strong potential to explain the EEG data in particular, since motion contrast strongly drives many visual areas. I am not completely satisfied with the way the authors have quantified motion, and I have a medium concern that their motion model is a bit of a straw man. First, they refer to the model as a "motion energy" model, which is not technically correct. Motion energy (i.e. Adelson and Bergen, 1985; Nishimoto et al., 2011) reflects image contrast, whereas optic flow does not (at least to the same degree). For example, a gray square moving 2 degrees on a slightly less gray background will generate the same magnitudes of optic flow vectors as a starkly black square on a white background moving the same distance. Motion energy will be substantially greater for higher-contrast stimuli. As such, it's likely that motion energy would be a better model for visual responses (at least in early visual areas) than optic flow. They also choose to compute optic flow densely, with one value per pixel. Thus, if actors in the various videos are in slightly different locations, the optic flow metric could be quite different, and the RDM may characterize arguably similar stimuli as distinct. I think that a multi-scale pyramid of optic flow (or better still, motion energy) would have a better chance of capturing motion selectivity in the way that the brain does.

There is one more choice that the authors make that I'm not entirely comfortable with, which relates to this same issue. They choose to use only the models that can be related to behavioral data to model the EEG data. This choice is motivated by the desire to relate brain activity to behavior, which is a worthwhile endeavor. However, I think there is an assumption underlying this choice: the authors assume that the signal that they measure with EEG must reflect neural representations that guide behavior and not low-level perceptual processes. I do not think that this is guaranteed to be the case. EEG measures a spatially diffuse signal, which may well be dominated by activity from quite low-level areas (e.g. V1). I think the question of whether the EEG signal reflects relatively high-level cognitive processing or relatively low-level perceptual processing – for any given experiment – is still open. For most of the models they exclude, I don't think this is a big issue. For example, I think it's reasonable to test luminance basically as a control (to show that there aren't huge differences in luminance that can explain behavior) and then to exclude it from the EEG modeling. However, I'm less happy with exclusion of motion, based on a somewhat sub-optimal motion model not explaining behavior.

The combination of these two issues has left me wanting a better quantification of motion as a model for the EEG data. My bet would be that motion would be a decent model for the EEG signal at an earlier stage than they see the effects of action class and social factors in the signal, so I don't necessarily think that modeling motion would be likely to eliminate the effects they do see; to my mind, it would just present a fairer picture of what's going on with the EEG signal."

eLife. 2022 May 24;11:e75027. doi: 10.7554/eLife.75027.sa2

Author response


Reviewer #1 (Recommendations for the authors):

Below please find a number of comments that might help strengthen the science and the presentation of this manuscript.

(1) Assignment of features to domains (e.g. page 4, line 92ff): As mentioned above, in the view of this reviewer some of the assignments between features to domains were less clear. As an example why is the setting in which the action takes place considered to be a visual feature (in contrast to, say, a semantic feature, such as indoors or outdoors)? Likewise, why are activations from the final layer of a pretrained neural network considered a high-level visual feature rather than a conceptual/semantic feature? Why is 'activity' (the amount of activity) in a video considered an action-related feature instead of a visual feature? Why is the number of agents considered a social-affective feature (instead of a visual feature; see also my next comment)? Regarding the latter, the authors admit that this feature might be considered either visual or social-affective.

This is an important point, and we agree the distinction between feature domains is not always clear cut. To answer the question about why the setting and final DNN layer are considered visual: since this work focused on actions, we considered action-related features to pertain to actions and their semantics. Thus, we considered the setting of an action to be visual and not semantic, since it only pertains to the action’s context. Similarly, the neural network we used was pretrained on ImageNet, and as such the features extracted by it would tend to describe the environment and objects involved rather than the action.

The word ‘visual’, particularly when contrasted with ‘action’ and ‘social-affective’, may indeed suggest only low-level information; however, we believe the distinction between context, actions, and agents to be relevant as it captures the variability of naturalistic actions. For example, an action like ‘eating’ would be characterized by action features (action category, transitivity, effectors involved and amount of activity) that would stay largely consistent across exemplars. However, exemplars may vary in terms of context (eating in the kitchen vs at a park), object (eating an apple vs a sandwich), and agents (eating alone vs together). We clarified these points in the paper, when introducing the visual features:

“Naturalistic videos of actions can vary along numerous axes, including visual features (e.g. the setting in which the action takes place or objects in the scene), action-specific features (e.g. semantic action category), and social-affective features (e.g. the number of agents involved or perceived arousal). For example, an action like ‘eating’ may vary in terms of context (in the kitchen vs at a park), object (eating an apple vs a sandwich), and number of agents (eating alone vs together). Drawing these distinctions is crucial to disambiguate between context, actions, and agents in natural events. […] Visual features ranged from low-level (e.g. pixel values) to high-level features related to scenes and objects (e.g. activations from the final layer of a pretrained neural network).”

Combining these scene-related features with action-related features does, indeed, reach an explanatory power almost as high as that of social-affective features in Experiment 1 (but not in Experiment 2). This can also be inferred from our original Figure 4 by looking at the effect sizes.

In our view, this strengthens our conclusions. Considering that this analysis discounts any variance shared with high-level visual, semantic, and action features, the significantly greater unique contribution of social-affective features is striking. However, the difference in effect size between the two experiments becomes more apparent than in our previous analysis (Figure 4), and we now discuss this in the manuscript:

“Although the assignment of features to domains was not always straightforward, our results were robust to alternative assignment schemes. For example, high-level visual features can be seen as bordering the semantic domain, while features like the number of agents or the amount of activity can be seen as visual. However, feature assignment was not the main factor driving our results, which stayed the same even when the activity feature was assigned to the visual group. More strikingly, the social-affective feature group explained significantly more variance than all other features grouped together in both experiments (Figure 4 —figure supplement 2). This is a particularly stringent test as it pits the unique and shared contributions of all visual, semantic, and action features against the four social-affective features. In Experiment 1, the combined contribution of visual and action features approached that of social-affective features, while in Experiment 2 the difference was larger. Together with the larger contribution of the number of agents in Experiment 2, this suggests that Experiment 2 may have captured more social information, potentially thanks to the exhaustive sampling of the stimuli which allowed each participant to arrange the videos according to different criteria.”

We also agree that the number of agents straddles the boundary between visual and social features. However, our control analysis showed that the number of agents did not explain the unique contribution of social-affective features to behavior (Figure 4—figure supplement 3).

Finally, we took “activity” to pertain to actions (is it a static action like reading, or a high-movement action like running?). We agree that it could also be considered a visual feature along the lines of motion energy. However, the assignment of this feature to the visual group does not impact our conclusions, which we now note:

“However, feature assignment was not the main factor driving our results, which stayed the same even when the activity feature was assigned to the visual group.”

Author response image 1. Unique contributions of visual features (environment, FC8, activity), action features (action category, effectors, transitivity) and social-affective features (number of agents, sociality, valence and arousal).

Author response image 2. Social-affective features (sociality, valence, and arousal) still contribute more variance than visual features (environment, FC8, activity, and number of agents) and action features (action category, transitivity, and effectors).

(2) I liked the attempts to try to minimize the correlations between features, which is a known problem when working with naturalistic stimuli. However, in light of the significant correlations between some of the features (in particular between sociality and the number of agents), I am concerned about biases in the estimation of the β weights, which in turn may have impacted the results of the variance partitioning analysis. Moreover, if the features sociality and number of agents are correlated, and if we accept that the number of agents might be a visual rather than a social feature, how do we know that the results obtained for the feature sociality really are based on a social-affective feature?

Indeed, despite our efforts, there were still moderate correlations between certain related features. However, the variance inflation factors were low (1.34 in Exp 1 and 1.37 in Experiment 2), suggesting that the regressions did not suffer from collinearity. Furthermore, different assignments of features to groups (see above) did not impact the overall pattern of results.

It is true that these analyses do not allow us to draw conclusions about individual features, as variance partitioning with more than three sets of features becomes intractable. Specifically, the role of sociality and number of agents is unclear.

We performed an additional analysis looking at the unique contributions of the number of agents, sociality, and the grouped affective features (valence and arousal): Figure 4 —figure supplement 4. This has the advantage of removing any shared variance between the number of agents and sociality. Based on this analysis, it appears that valence and arousal best accounted for behavioral similarity estimates, with the number of agents also contributing a significant amount, particularly in Experiment 2. However, this analysis does not partial out the contributions of other features (visual or action-related). Grouping the two affective features together may have also advantaged them over the single social features. Given these limitations and the post-hoc, exploratory nature of this new analysis, we prefer to focus our interpretation on feature domains and their contributions. Nonetheless, we have adjusted our claims in light of this new finding, as outlined below.

We discuss this additional analysis in the manuscript:

“Furthermore, an additional analysis looking at the separate contributions of the number of agents, sociality, and affective features (valence and arousal) found that the affective features contributed the greatest variance in both experiments (Figure 4—figure supplement 4). For technical reasons, this analysis compared the joint contribution of both affective features to each single social feature and did not discount the impact of variance shared with visual or action-related features. Despite these limitations, the results suggest that the contribution of the social-affective feature group is not driven by the number of agents or the variance it shares with sociality, and highlight the role of affective features (valence and arousal) in explaining behavior.”

We also discuss these findings in the Discussion:

“An exploratory follow-up analysis showed that this effect was primarily driven by affective features (valence and arousal), with the number of agents as a secondary contributor. Recent work found that affective features drive the perceived similarity of memories of real-life events (Tomita et al., 2021), suggesting that these features bridge the action, event and memory domains in organizing mental representations.”

Interestingly, this new finding is in line with our EEG results, where the number of agents, valence, and arousal correlated with the neural responses, while sociality did not. Although this is an exploratory finding, we briefly discuss it:

“Note that this finding is mirrored in our behavioral results, where we observed larger unique contributions from valence, arousal and the number of agents than sociality (Figure 4—figure supplement 4).”

(3) Methods (page 4, line 89): What was the reason to choose 8 additional videos with no agents?

We added this explanation to page 4, as well as to the Methods:

“and 8 additional videos with no agents, included to add variation in the dataset, see Materials and methods, section (Behavior: Stimuli)”

“The videos with no agents were included to provide variation in the dataset along visual properties that did not pertain to actions or agents, as well as variation in the overall dataset in terms of number of agents per video.”

(4) Figure 1: It might be helpful for the reader to indicate in the figure which of the features are considered to be visual, action-related or socio-affective.

We added a color-coded legend to the figure to make it clearer at a glance how features were assigned to groups.

(5) Figure 1B: Please provide some more details on how these plots were generated, and what they show.

We added the following caption to this figure panel:

“Behavioral rating distributions in the two stimulus sets. The z-scored ratings were visualized as raincloud plots showing the individual data points, as well as probability density estimates computed using Matlab’s ksdensity function (Allen et al., 2019).”

This panel shows the distribution of labeled features across our two stimulus sets, in particular that the ratings were consistent and stimuli spanned a wide range along each feature.

(6) Multi-arrangement task (page 7, line 130f): At several points in the manuscript, the authors write that they collected behavioural similarity ratings when referring to the multi-arrangement task. This wording might be confusing since the multi-arrangement task did not require an explicit rating (in contrast to the ratings the authors collected from a separate set of participants for the features sociality, valence, arousal, activity, and transitivity). The authors may want to use a different wording when referring to the results of the multi-arrangement task.

Thank you for bringing this to our attention. Indeed, these were not ratings in the sense that the other features were behaviorally rated. We replaced such wording throughout the manuscript with either ‘similarity estimates’ or ‘similarity judgments’. (This latter wording is more accurate in our view, since participants were instructed to judge the similarity of the videos in order to place them on screen.) See pages 7, 8, 27, 31, 43 etc.

(7) Related to the previous point, I'm somewhat concerned regarding the fact that participants were not provided with specific instructions as to the criteria they should follow when performing the multi-arrangement task. Whereas the authors argue that this was done to emphasize natural behaviour, in the view of this reviewer this approach comes with the risk of not knowing what the participants did in the end. To give concrete examples, it is possible that some participants focused on the context of the actions, while other participants focused on the valence, while yet another set of participants focused on some low level visual features. Moreover, participants might have changed their strategies throughout the experiment if they figured out that a criterion they first considered useful no longer worked for some of the configurations. Whereas the reported leave-one-subject-out correlations were above chance in both experiments, they were overall quite low, which might be related to the issue described above. How can the authors rule out such behaviour (and if not, how could this impact the interpretation of the results)?

In this work, we specifically focused on intuitive (and thus unconstrained) similarity. Indeed, this means that participants could have used different criteria or changed their strategies over time; however, we argue that this is a strength of our approach. We used the iterative multiple arrangement task as implemented in Meadows specifically because, by varying the groups of stimuli shown in each trial, it is able to recover a multidimensional representation, with many potential features coming into play. Although this method is not as exhaustive as, for example, an odd-one-out task (e.g. Hebart et al., 2021), it was a good compromise in terms of handling a large stimulus set and recovering a multidimensional structure. While the leave-one-subject-out correlations were only moderate, they reached a similar level as seen in related prior studies (e.g., Tarhan et al., Neuropsychologia, 2021) with smaller, more homogeneous datasets. Finally, we note that all reported measures in the paper are in terms of Kendall’s Tau-A, which is a more conservative metric than the more commonly reported Spearman’s correlation.
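
For concreteness, the sketch below shows one way to compute a leave-one-subject-out reliability with Kendall’s τA (which counts tied pairs in the denominator, unlike τB, making it more conservative); this is illustrative Python rather than the original analysis code, and `subject_rdms` is a hypothetical array name.

```python
import numpy as np

def kendall_tau_a(x, y):
    """Kendall's tau-A: (concordant - discordant) / (n * (n - 1) / 2).
    Tied pairs count toward the denominator, making tau-A conservative."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    concordant = discordant = 0
    for i in range(n - 1):
        s = np.sign(x[i + 1:] - x[i]) * np.sign(y[i + 1:] - y[i])
        concordant += np.sum(s > 0)
        discordant += np.sum(s < 0)
    return (concordant - discordant) / (n * (n - 1) / 2)

def leave_one_subject_out(subject_rdms):
    """subject_rdms: (n_subjects, n_pairs) array of vectorized similarity RDMs.
    Each subject's RDM is correlated with the mean RDM of the remaining subjects."""
    subject_rdms = np.asarray(subject_rdms, dtype=float)
    n_subj = subject_rdms.shape[0]
    return np.array([
        kendall_tau_a(subject_rdms[s],
                      np.delete(subject_rdms, s, axis=0).mean(axis=0))
        for s in range(n_subj)
    ])
```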

We also argue that our ability to find and replicate the main organizing axes of this space, despite the many sources of potential variability you outline above, suggests that there is a common element to participants’ multiple arrangement strategies. Recent work on intuitive action understanding supports this by suggesting that intuitive arrangements may be guided by judgments of action goals (Tarhan et al., Neuropsychologia, 2021). We have added the following lines to the Results:

“The multiple arrangement task was unconstrained, which meant that participants could use different criteria. Although this may have introduced some variability, the adaptive algorithm used in the multiple arrangement task enabled us to capture a multidimensional representation of how actions are intuitively organized in the mind, while at the same time ensuring sufficient data quality.”

(8) Figure 4: The overall differences between the three feature types appear to be quite small. What is the contribution of the number of agents for the social features (see also my corresponding comment on Figure 1a/ correlations between these two features)?

Although the differences are small, they are significant when the feature groups are compared to each other, and substantial when expressed as a percentage of the true correlation. In Experiment 1, social-affective features explain more than twice the unique variance explained by visual and action features (although note that the lower reliability of these data makes the true correlation more difficult to trust as a measure of ‘best possible fit’). In Experiment 2, social-affective features uniquely contribute 58% of the true correlation, with only 9% contributed by action features and approximately 0.5% contributed by visual features. We added more information on this to the Results:

“Although the effect sizes were relatively low, social-affective features explained more than twice as much unique variance as either the visual or action features in Experiment 1, and six times as much in Experiment 2. Furthermore, given the limits placed on predictivity by the reliability of the behavioral data, affective features predicted a large portion of the explainable variance in both experiments.”

We had not analyzed the unique contributions of each feature within each group, as this is difficult to assess in the context of a variance partitioning analysis given the large number of features (please see our response to point 2). Our above analyses (see Figure 4—figure supplements 3 and 4) suggest that the social-affective dominance is not driven by the number of agents.
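
To make the group-level variance partitioning logic concrete, here is a simplified sketch (using cross-validated R² rather than the squared Kendall’s τA reported in the paper, and with hypothetical array names): the unique contribution of a feature group is the drop in predictive performance when that group is removed from the full model.

```python
import numpy as np

def cross_validated_r2(X, y, n_folds=5, seed=0):
    """Cross-validated R^2 of an ordinary least-squares model predicting the
    vectorized behavioral RDM y from the stimulus-pair-by-feature matrix X."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.empty(len(y))
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        design = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(design, y[train], rcond=None)
        preds[test] = np.column_stack([np.ones(len(test)), X[test]]) @ beta
    return 1 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)

def unique_group_contributions(X, y, groups):
    """groups: dict mapping a group name (e.g. 'social-affective') to the
    column indices of its predictors in X. The unique contribution of each
    group is the full-model fit minus the fit of a model without that group."""
    full_fit = cross_validated_r2(X, y)
    return {name: full_fit - cross_validated_r2(np.delete(X, cols, axis=1), y)
            for name, cols in groups.items()}
```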

(9) On page 10, 2nd paragraph, the authors report the variance explained by the ten predictors in Experiment 2. What about the corresponding value for Experiment 1?

We have edited the text to make this clearer. The section now reads as follows:

“In Experiment 1, the predicted squared Kendall’s τA of the full model (τA² = 0.06 ± 0.001) was higher on average than the true split-half squared correlation (τA² = 0.04 ± 0.002). This is likely to be due to the lower reliability of the behavioral similarity data in this experiment, and suggests that the ten predictors are able to explain the data well despite the overall lower prediction accuracy.”

Due to the lower reliability of this dataset, we did not want to over-interpret the prediction accuracy obtained here.

(10) Task EEG experiment (e.g. page 11, line 212f): participants had to perform a one-back action task in which they detected repetitions of the action category. Did they know what the action categories were (see also related question on the contents of Table 1)?

The task was not constrained by the action categories used in our dataset curation. Participants were simply instructed to press a button when they identified the same action in consecutive videos, and given “eating” as a simple example. However, the catch video pairs contained more easily identifiable repeated actions (e.g. people doing ballet rather than people doing different sports, which might have been more difficult to label as a repeat). This ensured that participants could perform the task without biasing their semantic categorization. We have clarified this in the text:

“Participants were not instructed on what constituted an action, beyond being given “eating” as a simple example.”

(11) Figure 5B: maybe add 'noise ceiling' to the legend?

Thank you. We have added this.

(12) Discussion: the authors write that sociality did not correlate with the EEG patterns. Could it be that different participants used different strategies when performing this rating? (see corresponding comment on instructions for rating of sociality, page 22).

This is an interesting point. We assessed the reliability of the ratings using leave-one-subject-out correlations, and did not find sociality to be less consistent than other ratings (Figure 1—figure supplement 1). In Experiment 1, sociality was more reliable than all other ratings, and significantly more reliable than valence, arousal and activity ratings. In Experiment 2, there was no significant difference between sociality and the other ratings, except transitivity, which was more reliable. The fact that we do see EEG correlations with other features like valence and arousal seems to suggest that the lack of an effect is not due to differences in reliability. We now mention this point:

“This effect was not likely to be driven by a lower reliability in the measurement of this feature, as sociality was more reliable than all other behaviorally rated features in Experiment 1 (Figure 1—figure supplement 1).”

We also now discuss the reliability of the behavioral ratings (see also below comment 20).

(12) Discussion: page 18: The authors argue for a hierarchy from visual to action-related to socially relevant computations. How might the one-back task (focusing on repetitions of the same action category) have contributed to this result?

We thank the Reviewer for this interesting point. Although the task was not related to most of the features, it required a degree of explicit, semantic processing that may have contributed to the neural dynamics. While we cannot fully answer this question with the current data, we added the following paragraph to the discussion:

“Importantly, these features emerged spontaneously, as the one-back task performed during the EEG recordings only related to action category. However, the semantic processing required to perform the task may have contributed to these computations. The emergence of features irrelevant to the task at hand (action category is not correlated with any other features in our dataset) suggests that this temporal hierarchy would also emerge in the absence of a task; however, future work can more directly test the impact of implicit and explicit (e.g. social-affective) processing on these neural dynamics.”

(13) Table 1 (page 20): The mapping between the verb labels provided in Table 1 and the 8 exemplars used for each activity did not become quite clear to me. It seems from Table 1 that some of the categories consisted of a range of different actions (e.g. crying, cuddling, feeding, giggling, socializing for the activity 'childcare'), while some of them consisted of one single action only (e.g. reading), which makes the categories hard to compare, both in terms of the underlying perceptual variability and the task difficulty associated with judging a repetition of that category. Please comment.

The reviewer raises an important point. This partially stems from the imperfect mapping between our broader, ATUS-based categories and the verb labels in the Moments in Time (MiT) dataset. This is related to broader issues with category labels in computer vision datasets, which are often criticized for not being meaningful distinctions for humans (e.g., the dozens of different dog breeds in Imagenet). We sought to divide the MiT videos into more meaningful categories as defined by ATUS, but due to the somewhat arbitrary nature of the verb labels in MiT, there is not a perfect one-to-one mapping. Some of the categories mapped directly onto the labels, while other categories were broader and represented by several labels. Furthermore, many videos could have been represented by several other labels in addition to the ones they were listed under (which is why some ‘driving’ videos were found under ‘socializing’, for example).

Our original curation procedure involved going through the MiT verb labels and selecting videos from any label that seemed relevant for each ATUS activity category. This led to an initial selection of approximately 500 videos, from which the two datasets were selected solely based on the ATUS activity categories without regard to the MiT verb labels (as described in the Methods). Although the verb labels may seem heterogeneous, we attempted to enhance and match perceptual variability across categories by selecting videos for each ATUS category that varied along different criteria such as environment and number of agents. For example, the ‘reading’ category, though represented by a single label in the Moments in Time dataset, contains videos depicting one or two agents of different ages, genders and ethnicities reading books or magazines both indoors (in a living room, in bed, in a library, on the subway etc.) and outdoors (in a park, by a river, in a hammock etc.). We have added this to the manuscript:

“Videos were chosen that were horizontally oriented (landscape) and of reasonable image quality, that clearly represented the activity in question, that clearly depicted an agent performing the activity, and that varied in terms of the number of agents, the gender and ethnicity of the agents, and the scene setting. Although some categories were represented by fewer verb labels than others in the final set, our curation procedure aimed to balance important features within and across action categories.”

Finally, any remaining heterogeneity across action categories was addressed through our feature-driven analysis approach, which was intended to capture any perceptual components that may affect how participants arrange the videos.

(14) Table 1 (page 20): Unless I misunderstood, the action 'socializing' occurred within four of the 18 activities (in addition to being one of the 18 activities). What was the reason for this? Could this have impacted the results, showing a stronger contribution of socio-affective features in comparison to action-related and visual features? Likewise, could the results obtained in this study be affected by the curation of the two stimulus datasets (adjusted to include both social and non-social activities)?

Our aim was to sample actions that involved both single and multiple agents in a balanced way, as this has been rarely done in previous work using controlled videos of actions, usually performed by one agent. Thus, some of the actions were found under the label ‘socializing’ and involved multiple agents performing the action (whether interacting or not). This is also due to the way the Moments in Time labels were assigned, as described above. The presence of a socializing category, however, was driven by the prominence assigned to this category in the 2019 American Time Use Survey under the broader ‘leisure’ umbrella. We acknowledge that the distinctions from ATUS are not perfect, but they provided the best objective measure with which to curate our dataset. We have added a brief description of these shortcomings.

“We note that the ATUS distinctions are based on performed rather than observed actions. While imperfect, they provide an ecologically relevant and objective criterion with which to define our action space.”

To your main concern, we do not see the presence of both social and non-social actions as a source of bias. First, most action categories were represented by both single-agent and multiple-agent videos. Although some categories inherently involved multiple people (socializing, instructing), others tended to involve single agents and no social aspect (sleeping, housework). Second, all features (including sociality and number of agents) were rated independently of action category, and feature distributions suggest that the stimuli varied along many axes (Figure 1b). Finally, each feature’s contribution to behavior was assessed independently of category labels.

Ultimately, we believe that an action space that includes both social and non-social activities is more representative of human actions in the real world, since many actions are directed towards others. On the other hand, no stimulus set is definitive, and it would be valuable to replicate these results with different types of stimuli. The question of how action categories should be defined remains an open one that we did not attempt to address here, although we argue that our feature-based approach mitigates this concern.

We have clarified our reasoning in text:

“Our final dataset included 18 social and non-social activities that lend themselves to visual representation (Table 1), to ensure a diverse and balanced stimulus set representative of the human everyday action space.”

(15) Page 20: What was the reason to use these specific common activities? Was there a specific cutoff?

We have clarified our selection of ATUS categories in the manuscript. The selected activities were second-level ATUS activities with a cutoff of at least 0.14 hours/day, with a few exceptions (additions and omissions) that are now explicitly described in text:

“Action categories were selected from the second-level activities identified in the ATUS. We used a minimum cutoff of 0.14 hours/day to select common actions (Table 1). To diversify our dataset, we added a “hiking” category (to increase variability in scene setting) and a “fighting” category (for variability along affective dimensions). In addition, “driving” was selected as a more specific instance of the “travel” categories in ATUS, as it is the most common form of transportation in the US. Some adjustments were also made to the “relaxing and leisure” category, by selecting two specific activities that were easy to represent and distinguish visually, as well as above threshold (“reading” and “playing games”). In addition, our “reading” category included both “reading for personal interest” and “homework and research”, as this distinction is difficult to convey visually. We omitted three leisure categories that were difficult to represent in brief videos (“watching TV”, “relaxing and thinking”, and “computer use for leisure, excluding games”), as well as the “consumer goods purchases” category.”

We have also added the number of hours per day spent on each activity according to ATUS to Table 1.

(16) Page 21, line 404f: What was the purpose of the control videos depicting natural and indoors scenes?

We added this explanation to the Methods. Please see response to point 3 above.

(17) Page 21, line 415f: Please explain what 'repetitive' videos mean in this context.

After running our randomization procedure, there were a few videos within the same action category that shared a high degree of visual similarity (e.g. involving a similar number of agents in similar postures). The issue of visually similar stimuli was not apparent in the larger Experiment 1 set. We clarified our wording in text:

“From the remaining videos, a second set of 76 videos was sampled and manually adjusted to remove videos without agents (in the interest of experimental time) and any videos that were too similar visually to other videos in the same category (e.g., involving a similar number of agents in similar postures).”

(18) Page 22, behavioural ratings: What were the exclusion criteria for the behavioural ratings and the multiple arrangement task?

We thank the reviewer for spotting this oversight. We added this information in text:

“Participants were excluded if they responded incorrectly to catch trials (approximately 10% of trials) requiring them to label the action shown in the prior video, or if they provided overly repetitive ratings (e.g. using only two unique values or fewer out of five possible ratings throughout the experiment).”

(19) Page 22: Participants were instructed to rate 'how social the events were'. In the view of this reviewer, this leaves a lot of room for interpretation (and also differs from the instructions given in previous studies). Please comment.

We agree that our instructions did not strongly constrain participants’ judgments of sociality. Given the diversity of our natural videos (which involved interactions as well as people simply acting alongside each other), we wanted to capture the full spectrum of what sociality means to people, rather than focusing on a yes/no detection of social interaction. Previous studies are also quite heterogeneous in their measurement of sociality, with some focusing on interactions (e.g. Tucciarelli et al., 2019) and others defining it as person-directedness (Tarhan and Konkle, 2020) or the relevance of one agent’s actions to another (Wurm and Caramazza 2019, Wurm, Caramazza and Lingnau 2017). In our view, the validity of our instructions is supported by the fact that sociality ratings were not less reliable than the other ratings and were among the most reliable of our social ratings. On the other hand, the different operationalization means that this feature may have captured different information than in previous studies, and may explain our lack of correlation between sociality and EEG data. We now mention this in the discussion:

“Alternatively, our operationalization of this feature (which was broader than in some previous studies, e.g. Tucciarelli et al., 2019; Wurm et al., 2017) may have led to differences in the information captured.”

(20) Page 22: How consistent were the behavioural ratings across participants? Was variability across participants higher for some features than for others?

This information is summarized in Figure 3—figure supplement 1 and its caption. We now also discuss this in text:

“Behaviorally rated features differed in reliability in Experiment 1 (F(4,819) = 22.35, P<0.001), with sociality being the most reliable and arousal the least reliable (Figure 3—figure supplement 1). In Experiment 2, however, there was no difference in reliability (F(4,619) = 0.76, P=0.55). Differences in reliability were mitigated by our use of feature averages to generate feature RDMs.”
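
As a sketch of how such a reliability comparison can be run (this is not the original code, and the variable names are hypothetical), the per-feature leave-one-subject-out correlations can be compared with a one-way ANOVA:

```python
from scipy import stats

def compare_rating_reliability(reliability_by_feature):
    """reliability_by_feature: dict mapping a feature name (e.g. 'sociality',
    'valence', 'arousal', 'activity', 'transitivity') to a 1-D array of
    per-rater leave-one-subject-out correlations. Returns the F statistic
    and p value of a one-way ANOVA across features."""
    return stats.f_oneway(*reliability_by_feature.values())
```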

(21) Page 23 line 454f: How was the subset of videos presented in trial 1 selected?

We added some more information on this procedure in text:

“The experiments included a training trial in which participants arranged the same seven videos (in Experiment 1) or eight videos (in Experiment 2) before beginning the main task. In both experiments, these videos were hand selected to represent clear examples from four categories.”

(22) Page 28 line 541f: According to which criteria did the authors center the 0.5 seconds window around the action?

The procedure was based on visual inspection of the videos. In most cases, the first 0.5 seconds were selected; however, a different segment was selected in cases where this made the action clearer (e.g. a video that began with someone catching a ball to then throw it was cropped around the throwing action, which could then be shown in its entirety). This was especially important given the brief nature of the 0.5 s videos. We have clarified this in text:

“The three-second stimuli were trimmed to a duration of 0.5 seconds centered around the action as determined by visual inspection, to ensure that the shorter videos were easily understandable. This helped improve time-locking to the EEG signals and allowed for a condition-rich experimental design.”

(23) Page 28 line 557f: What do the authors mean by ‘shuffled pairs’?

We clarified what we meant:

“The video pairs were presented in different orders to minimize learning effects, so that for each video pair (V1, V2), half of the presentations were in the order V1-V2 and half of them were in the order V2-V1.”

(24) Page 29, 2nd paragraph: Please provide details on the projector.

We added this information:

“The stimuli were presented using an Epson PowerLite Home Cinema 3000 projector with a 60 Hz refresh rate.”

Reviewer #2 (Recommendations for the authors):

If I have read the paper (and code) correctly, it seems that a single RDM was used to model all timepoints in the EEG responses, and that single RDM was correlated with the RDMs generated by the pairwise decoding analysis for each timepoint. For purposes of modelling behavioral data – in which the subjects are making a judgment based on the entire video clip – a single RDM per video clip seems reasonable. However, for purposes of modelling time-varying neural responses, it may be less than ideal to use a single RDM to summarize each video. The usefulness of a single temporal summary RDM will depend on how homogenous the features are across the video clip. Having looked over a few of the video clips in Moments in Time, the clips appear to be generally slow-moving – i.e., no large changes in background elements or actors across the 3 seconds. This seems like a reasonable justification for the use of a single temporal summary RDM; I would encourage the authors to clarify their rationale for using one RDM for the whole timecourse instead of framewise RDMs. Quantitative arguments would be useful.

We agree that a time-resolved approach to modelling the video features could offer more information. Here, however, the videos used in the EEG paradigm were very brief – only 0.5 seconds long. Furthermore, we made efforts to ensure we selected videos where actions were clear and continuous, without sudden camera or actor changes. The shorter 0.5s duration of EEG stimuli likely increased visual homogeneity throughout each clip.

Finally, this would only apply to the visual features (specifically, the layer activations from AlexNet), as the other features we used for our EEG analysis did not vary across the frames (environment, number of agents, and behavioral judgments based on the entire clips). Still, to ensure that we maximized the amount of information present in our RDMs, our procedure for generating AlexNet features entailed extracting these features from each frame and averaging them across the 12 frames of each 0.5 s clip. Importantly, we did not see significant changes in AlexNet features across these frames.

In Author response image 3, we plotted the average correlation across frames between Conv1 and FC8 features extracted from AlexNet. For Conv1, the correlation was on average r = 0.89, SD = 0.09; for FC8, it was r = 0.98, SD = 0.03. Although there is more variability in correlations for Conv1, they are still high enough to justify our decision to select the average layer activation across frames. Furthermore, this variability diminishes in the FC8 features, which are expected to capture more high-level information, thus supporting our idea that the videos are sufficiently consistent over time to justify our approach:

Author response image 3. Average correlation of CNN features across frames.


Each dot in the scatterplots is a video, with the distribution of the correlations shown above.

We added the following to the manuscript:

“We did not use frame-wise RDMs to model these visual features; however, our approach of averaging features across video frames was justified by the short duration of our videos and the high correlation of CNN features across frames (Conv1: Pearson’s ρ = 0.89 ± 0.09; FC8: ρ = 0.98 ± 0.03).”
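
For readers who want to reproduce this kind of check, the sketch below extracts Conv1 and FC8 activations per frame with torchvision’s pretrained AlexNet, averages them across frames, and quantifies cross-frame homogeneity. The layer indices, preprocessing, and function names follow standard torchvision conventions and are illustrative choices, not necessarily identical to the original pipeline.

```python
import numpy as np
import torch
from torchvision import models, transforms

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Capture Conv1 (features[0]) and FC8 (classifier[6], the final layer) outputs
activations = {}
alexnet.features[0].register_forward_hook(
    lambda module, inputs, output: activations.update(conv1=output.flatten(1)))
alexnet.classifier[6].register_forward_hook(
    lambda module, inputs, output: activations.update(fc8=output.flatten(1)))

def video_cnn_features(frames):
    """frames: list of PIL images (all frames of one short clip).
    Returns frame-averaged Conv1 and FC8 features, plus the mean pairwise
    correlation of FC8 features across frames as an index of homogeneity."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        alexnet(batch)
    conv1 = activations['conv1'].numpy()
    fc8 = activations['fc8'].numpy()
    corr = np.corrcoef(fc8)
    homogeneity = corr[np.triu_indices_from(corr, k=1)].mean()
    return conv1.mean(axis=0), fc8.mean(axis=0), homogeneity
```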

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

We all agree the manuscript has been improved, the issues raised by the reviewers have been addressed well, and the manuscript will make a nice addition to the literature. But one of the reviewers highlighted a remaining concern that I wanted to draw your attention to. This concern could be addressed with additional analyses or additional discussion and motivation of the approach you used (I will leave it up to you which avenue you want to pursue). I’ll quote from the reviewer:

“Among the visual features the authors investigate is motion. To me, this is a visual feature that would seem to have a strong potential to explain the EEG data in particular, since motion contrast strongly drives many visual areas. I am not completely satisfied with the way the authors have quantified motion, and I have a medium concern that their motion model is a bit of a straw man. First, they refer to the model as a “motion energy” model, which is not technically correct. Motion energy (i.e. Adelson and Bergen, 1985; Nishimoto et al., 2011) reflects image contrast, whereas optic flow does not (at least to the same degree). For example, a gray square moving 2 degrees on a slightly less gray background will generate the same magnitudes of optic flow vectors as a starkly black square on a white background moving the same distance. Motion energy will be substantially greater for higher-contrast stimuli. As such, it's likely that motion energy would be a better model for visual responses (at least in early visual areas) than optic flow. They also choose to compute optic flow densely, with one value per pixel. Thus, if actors in the various videos are in slightly different locations, the optic flow metric could be quite different, and the RDM may characterize arguably similar stimuli as distinct. I think that a multi-scale pyramid of optic flow (or better still, motion energy) would have a better chance of capturing motion selectivity in the way that the brain does.

We have removed our reference to the optic flow model as a motion energy model, and added a new motion energy model:

“To assess whether a motion energy model (Adelson and Bergen, 1985; Nishimoto et al., 2011; Watson and Ahumada, 1985) would better capture the impact of motion, we performed control analyses by computing motion energy features for each video using a pyramid of spatio-temporal Gabor filters with the pymoten package (Nunez-Elizalde et al., 2021).”
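
A minimal sketch of this computation with pymoten is shown below, following the basic workflow in the package documentation; the file path, frame rate, and the decision to average across frames are illustrative assumptions rather than the exact parameters used.

```python
import moten

video_file = 'example_clip.mp4'  # hypothetical path to one short stimulus video

# Convert the video to a grayscale (luminance) image sequence
luminance_images = moten.io.video2luminance(video_file)
n_frames, vdim, hdim = luminance_images.shape

# Build the default pyramid of spatio-temporal Gabor filters for this image
# size and frame rate, then project the video onto the filters
pyramid = moten.get_default_pyramid(vhsize=(vdim, hdim), fps=24)
motion_energy = pyramid.project_stimulus(luminance_images)  # (n_frames, n_filters)

# One feature vector per video (e.g., by averaging over frames) can then be
# used to build the motion energy RDM entered into the analyses
video_motion_energy = motion_energy.mean(axis=0)
```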

As the Reviewer predicted, the motion energy model correlated better with behavior than our optic flow model. This correlation was significant in Experiment 1, but not in Experiment 2, in line with our previous results regarding the role of visual features across the two experiments. In the variance partitioning analysis, the addition of the motion energy model to the visual feature group did not change the pattern of results in either experiment. We describe these findings in text:

“Similarly, our decision to quantify motion and image properties separately by using an optic flow model may have reduced the explanatory power of motion features in our data. Indeed, a motion energy model (Adelson and Bergen, 1985; Nunez-Elizalde et al., 2021) significantly correlated with behavior in Experiment 1, but not in Experiment 2. However, the addition of this model did not change the pattern of unique feature contributions (Figure 4—figure supplement 2).”

There is one more choice that the authors make that I'm not entirely comfortable with, which relates to this same issue. They choose to use only the models that can be related to behavioral data to model the EEG data. This choice is motivated by the desire to relate brain activity to behavior, which is a worthwhile endeavor. However, I think there is an assumption underlying this choice: the authors assume that the signal that they measure with EEG must reflect neural representations that guide behavior and not low-level perceptual processes. I do not think that this is guaranteed to be the case. EEG measures a spatially diffuse signal, which may well be dominated by activity from quite low-level areas (e.g. V1). I think the question of whether the EEG signal reflects relatively high-level cognitive processing or relatively low-level perceptual processing – for any given experiment – is still open. For most of the models they exclude, I don't think this is a big issue. For example, I think it's reasonable to test luminance basically as a control (to show that there aren't huge differences in luminance that can explain behavior) and then to exclude it from the EEG modeling. However, I'm less happy with exclusion of motion, based on a somewhat sub-optimal motion model not explaining behavior.

The combination of these two issues has left me wanting a better quantification of motion as a model for the EEG data. My bet would be that motion would be a decent model for the EEG signal at an earlier stage than they see the effects of action class and social factors in the signal, so I don't necessarily think that modeling motion would be likely to eliminate the effects they do see; to my mind, it would just present a fairer picture of what's going on with the EEG signal."

In selecting only models that explained behavior for our EEG analysis, we did not intend to assume that EEG activity only reflects behaviorally relevant features. By adding Conv1 to the behaviorally relevant visual models (FC8 and environment), we hoped we would capture low-level visual processing while at the same time keeping the number of features tractable and linking brain and behavior (as noted by the Reviewer). However, we agree that this approach neglects the role of motion, and we have performed new analyses with the motion energy model.

Indeed, the motion energy model correlated with our EEG data during a sustained time window, with a peak during early time windows. The addition of motion energy to the group of visual features in the variance partitioning analysis increased the unique contribution of visual features and decreased the contribution of action features, suggesting that the ‘action’ portion of the signal we originally detected included some shared variance with motion. However, the temporal hierarchy was not changed in the fixed-effects analysis, which leads us to believe that the three feature groups explain distinct portions of the signal. Most importantly, the portion of variance uniquely explained by social-affective features was unchanged by the addition of the motion energy model.

We describe these results in text:

“Motion has been shown to drive the response of visual areas to naturalistic stimuli (Russ and Leopold, 2015). To better assess the effect of motion on EEG responses, we performed an additional analysis including the motion energy model. There was a sustained correlation between motion energy and EEG patterns beginning at 62 ms (Figure 6—figure supplement 3). In the variance partitioning analysis, the addition of motion energy increased the unique contribution of visual features and decreased that of action features, indicating that the action features share variance with motion energy. However, the three stages of temporal processing were preserved in the fixed-effects analysis even with the addition of motion energy, suggesting that the three feature groups made distinct contributions to the neural patterns. Importantly, the unique contribution of social-affective features was unchanged in both analyses by the addition of the motion energy model.”
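
As a schematic of the time-resolved analysis described here (not the original code, and with hypothetical array names), each feature RDM, such as the motion energy RDM, can be correlated with the EEG decoding RDM at every timepoint, and the variance partitioning sketched earlier can be repeated per timepoint.

```python
import numpy as np
from scipy.stats import spearmanr  # or the Kendall tau-A function sketched above

def rsa_timecourse(eeg_rdms, model_rdm):
    """eeg_rdms: (n_timepoints, n_pairs) array of vectorized pairwise decoding
    accuracies; model_rdm: (n_pairs,) vectorized feature RDM (e.g., motion energy).
    Returns one model-EEG correlation per timepoint."""
    correlations = np.empty(len(eeg_rdms))
    for t, neural_rdm in enumerate(eeg_rdms):
        rho, _ = spearmanr(neural_rdm, model_rdm)
        correlations[t] = rho
    return correlations

# The unique contribution of each feature group over time can then be obtained
# by running the cross-validated variance partitioning at every timepoint, with
# the EEG RDM at that timepoint as the dependent variable.
```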

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Citations

    1. Dima DC, Tomita TM, Honey CJ, Isik L. 2021. Social-affective features drive human representations of observed actions. Open Science Framework. hrmxn

    Supplementary Materials

    Supplementary file 1. Additional information about the stimulus sets and features used in analysis.

    (a) Breakdown of scene setting and number of agents across the two final stimulus sets. (b) Features quantified in both stimulus sets and used to generate feature representational dissimilarity matrices (RDMs) in the representational similarity analysis.

    elife-75027-supp1.docx (15.5KB, docx)
    Transparent reporting form

    Data Availability Statement

    Behavioral and EEG data and results have been archived as an Open Science Framework repository (https://osf.io/hrmxn/). Analysis code is available on GitHub (https://github.com/dianadima/mot_action; Dima, 2021; copy archived at swh:1:rev:af9eede56f27215ca38ddd32564017f1f90417d0).

    The following dataset was generated:

    Dima DC, Tomita TM, Honey CJ, Isik L. 2021. Social-affective features drive human representations of observed actions. Open Science Framework. hrmxn


