Action Recognition in a Crowded Environment

Laura Fademrecht; Judith Nieuwenhuis; Isabelle Bülthoff; Nick Barraclough; Stephan de la Rosa

doi:10.1177/2041669517743521

2017 Dec 21;8(6):2041669517743521. doi: 10.1177/2041669517743521


PMCID: PMC5751920 PMID: 29308177

Abstract

So far, action recognition has been mainly examined with small point-light human stimuli presented alone within a narrow central area of the observer’s visual field. Yet, we need to recognize the actions of life-size humans viewed alone or surrounded by bystanders, whether they are seen in central or peripheral vision. Here, we examined the mechanisms in central vision and far periphery (40° eccentricity) involved in the recognition of the actions of a life-size actor (target) and their sensitivity to the presence of a crowd surrounding the target. In Experiment 1, we used an action adaptation paradigm to probe whether static or idly moving crowds might interfere with the recognition of a target’s action (hug or clap). We found that crowds of this type, whose movements were dissimilar to the target action, hardly affected action recognition in central and peripheral vision. In Experiment 2, we examined whether crowd actions that were more similar to the target actions affected action recognition. Indeed, the presence of that crowd diminished adaptation aftereffects in central vision as well as in the periphery. We replicated Experiment 2 using a recognition task instead of an adaptation paradigm. With this task, we found evidence of decreased action recognition accuracy, but this was significant in peripheral vision only. Our results suggest that the presence of a crowd carrying out actions similar to that of the target affects its recognition. We outline how these results can be understood in terms of high-level crowding effects that operate on action-sensitive perceptual channels.

Keywords: action recognition, adaptation, crowding, eccentricity, peripheral vision

Introduction

In recent years, an increasing number of researchers have argued that a full understanding of human social behaviour requires probing social cognitive processes, such as action recognition, under more natural experimental conditions (De Jaegher, Di Paolo, & Gallagher, 2010; Schilbach et al., 2013). Yet, surprisingly little work has been done in this regard. For example, nonverbal social interactions in real life often require humans to recognize different actions that appear in the visual periphery. Take for example a situation where you are out with a group of friends and everyone is chatting with each other. Whilst you are talking and directing your gaze to one of your friends, your peripheral vision might help you to notice another friend who is frequently checking his watch. Here, actions viewed in the periphery might serve to direct your attention towards socially relevant gaze locations. Yet, whether we are able to identify actions in the visual periphery under such conditions is so far poorly understood.

There is some evidence that probing action recognition under more naturalistic conditions provides results that differ from those obtained with standard psychophysical setups. For example, Keefe, Wincenciak, Jellema, Ward, and Barraclough (2016) used life-size photo-realistic actors presented three-dimensionally, and their results indicate that complex judgments about the actors (the actor’s expectation about the weight of a box to be lifted) differ depending on whether participants view the stimuli on large-scale or small-scale screens. Moreover, action recognition performance in the visual periphery as probed by life-size human stick figures (see example in Figure 3) is different from action recognition performance probed by small point-light humans. For example, we have previously shown that action recognition of life-size stick figures (∼32° visual angle [VA]) is excellent up to 75° eccentricity (Fademrecht, Bülthoff, & de la Rosa, 2016, 2017). In contrast, the detection and discrimination of small point-light walkers (11° VA) was significantly decreased already at 12° eccentricity (Ikeda, Blake, & Watanabe, 2005; Ikeda, Watanabe, & Cavanagh, 2013; Thompson, Hansen, Hess, & Troje, 2007). These discrepant findings indicate that one cannot necessarily generalize the findings from experiments using desktop computers to more natural viewing conditions. For this reason, the current study examined action recognition using life-size action stimuli.

Figure 3. — Adaptation paradigm: Timeline of the adaptation phase followed by an experimental trial of the experimental phase.

Size is not the only important factor to make viewing conditions appear natural. Humans are social beings that often gather together. As a result, actions are rarely viewed in isolation in real life. The presence of other people sufficiently close to a target actor could in principle induce well-known crowding effects. Previous research indicates that the deleterious effect of crowding on visual recognition of objects and actions is particularly pronounced in the visual periphery (Levi, 2008; Whitney & Levi, 2011). This is typically explained by the decline of visual acuity towards the periphery (Levi, 2008). Although crowding effects have also been reported for action recognition (Ikeda et al., 2013; Ikeda & Watanabe, 2016), little is known about whether crowding affects foveal and peripheral action recognition alike. There is some evidence that crowding already occurs for direction discrimination of point-light walkers in the fovea. Thornton and Vuong (2004) demonstrated that flankers’ walking direction influenced the perception of the walking direction of a target. Specifically, they found longer reaction times for reporting the walking direction of a target walker when target and flanking walkers faced different directions compared to when they faced the same direction. This effect of crowding on walking direction discrimination has also been found in the near periphery. Ikeda et al. (2013) showed that crowding occurred only with walking flankers, but not with scrambled walker flankers. These results suggest that crowding of biological motion is not due to low-level motion crowding effects but rather occurs on some higher level of visual processing, and that the presence of bystanders takes a toll on action recognition both in the fovea and near periphery.

Despite the clear demonstrations of crowding that these studies have provided, the degree to which these results apply to the ability to recognize actions when life-size stick figures are used is not known. Note that the recognition of certain types of actions (e.g., judging whether an action is a slap or a wave) is particularly relevant for social interactions in everyday life. For example, the ability to discriminate whether a person is waving or preparing to slap allows the observer to choose an appropriate action. There is some evidence that the discrimination of an action and the discrimination of its direction are partly dissociable, suggesting that they are not mediated by the same mechanism (de la Rosa, Ekramnia, & Bülthoff, 2016; Ikeda & Watanabe, 2016). Hence, the degree to which findings on action direction discrimination generalize to action discrimination is not known, and understanding the effect of crowding on action discrimination therefore requires further investigation.

To this end, we conducted three experiments to examine action discrimination in the presence of other people under naturalistic conditions. In particular, we investigated whether the effect of crowding depends on target eccentricity and whether different types of crowd influence action discrimination differently. We used a setup in which actions were carried out by life-size human stick figures to provide both form and motion information of an action. Furthermore, our display allowed the assessment of action discrimination across the entire horizontal visual field (for more details, see Fademrecht et al., 2016, 2017).

Experiment 1

We used a visual adaptation paradigm to examine action recognition. Visual adaptation refers to the transient change in the percept of a stimulus after prolonged exposure to an adapting stimulus. For example, in colour adaptation, one perceives a white square to have a greenish tint after adapting to a red square. In an action adaptation paradigm, one of two actions (e.g., hug and clap) is typically used as an adaptor. Its presentation is followed by an ambiguous action which is a weighted average of the hug and clap actions. Participants frequently report that this perceptually ambiguous stimulus looks more like a clap after they have adapted to a hug and vice versa (e.g., de la Rosa, Streuber, Giese, Bülthoff, & Curio, 2014; de la Rosa et al., 2016). Action adaptation effects can be well described in terms of neural populations (Webster, 2011) tuned to different actions (visual action channels). Adaptation causes the response of the visual action channel that is sensitive to the adapted action (e.g., hug) to be reduced. Hence, during the subsequent presentation of the ambiguous stimulus, which originally contains an equal perceptual amount of hug and clap action, the hug channel’s response is smaller than that of the clap channel. Therefore, the observer perceives the ambiguous stimulus to look more like a clap. Interestingly, action adaptation and, more generally, visual adaptation effects agree well with physiological and brain imaging observations (Barraclough, Keith, Xiao, Oram, & Perrett, 2009; Barraclough & Jellema, 2011; Grill-Spector & Malach, 2001; Kourtzi & Kanwisher, 2001). Because adaptation allows the selective targeting of neural populations, visual adaptation has also been termed the psychologist’s microelectrode (Frisby, 1979). Here, we used this method to selectively target neural processes underlying action recognition and to investigate their sensitivity to visual eccentricity and crowding.
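The channel account described above can be illustrated with a toy calculation. The following Python sketch is not the authors' model: the function name, the linear channel responses and the gain value of 0.7 are illustrative assumptions chosen only to show how a reduced response in one channel shifts the percept of an ambiguous stimulus.

```python
def perceived_action(weight_hug, gain_hug=1.0, gain_clap=1.0):
    """Toy two-channel read-out (illustrative assumption, not the authors' model).

    weight_hug is the proportion of hug in the morphed test stimulus.
    Each channel responds in proportion to its share of the stimulus,
    scaled by a gain that adaptation is assumed to reduce.
    """
    weight_clap = 1.0 - weight_hug
    response_hug = gain_hug * weight_hug
    response_clap = gain_clap * weight_clap
    if response_hug > response_clap:
        return "hug"
    if response_clap > response_hug:
        return "clap"
    return "ambiguous"


# Unadapted observer: a 50/50 morph drives both channels equally.
print(perceived_action(0.5))                 # ambiguous
# After hug adaptation (hypothetical gain of 0.7), the clap channel wins.
print(perceived_action(0.5, gain_hug=0.7))   # clap
# After clap adaptation, the same stimulus looks more like a hug.
print(perceived_action(0.5, gain_clap=0.7))  # hug
```

In this sketch the aftereffect is repulsive by construction: lowering the gain of the adapted channel biases the read-out towards the other action.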

In Experiment 1, we investigated the impact of two factors on the recognition of actions (hug and clap) performed by a life-size moving figure. These factors were the presence or absence of a nearby crowd and foveal or peripheral viewing conditions. We used an action adaptation paradigm and presented the test actions in central vision and at 40° eccentricity in three ‘crowd’ conditions: the moving figure was presented (a) alone, (b) in a crowd of static actors and (c) in a crowd of actors performing idle movements.

Methods

Participants

We recruited 14 participants (8 females) from the local community of Tübingen. Participants received monetary compensation for their participation in the experiment. Their age ranged from 21 to 36 years (M = 25.5; SD = 4.1). Participants’ visual acuity was normal or corrected to normal with contact lenses (glasses could obliterate parts of the visual periphery). Participants in all experiments provided written informed consent prior to the experiment. The study was conducted in accordance with the Declaration of Helsinki and under the guidelines of the ethics board of the University of Tübingen.

Stimuli

We created two kinds of action stimuli: adaptor stimuli and test stimuli. In addition, we also generated crowd stimuli. All stimuli depicted moving human stick figures. Adaptor and test actions were carried out by one stick figure (target) presented at two positions on the screen (at fixation or at 40° eccentricity to the right). All stick figures were oriented so that the action was executed towards the participant.

Adaptor stimuli

In the current study, we chose two actions (clap and hug) to extend the adaptation paradigm designed by de la Rosa et al. (2014) to other social actions. The clap and hug actions were recorded from one actor via motion capture using an MVN suit (XSens, Netherlands) containing 17 inertial and magnetic sensor modules. The sampling rate of the sensors was 120 Hz. Both actions started with a neutral body position and lasted 1,385 ms. Each action sequence ended at the point in time just before the actor started moving back to the neutral position. To display the actions, we mapped the recorded motion capture data onto a life-size avatar built as a grey stick figure (height: 170 cm, 24° VA) using the Unity 3D (Unity Technologies, USA) game engine (see Figure 3). We used a stick figure instead of a more realistic avatar to prevent other visual cues like appearance or gender from influencing participants’ decisions about the action. Its position was defined by the position of a point midway between both hips.

Test stimuli

We used the same stick figures to create our test stimuli. These stimuli performed actions obtained by morphing between both adaptor stimuli (hug and clap) as described by de la Rosa et al. (2016) and Ferstl, Bülthoff, and de la Rosa (2017). First, we calculated weighted averages (weight_hug = 1 − weight_clap) of the hug and the clap actions for each rotation of each body joint (e.g., knee, elbow) and for each time-normalized frame (time normalization ensures that the two action movies being morphed have the same length). The morph weights ranged from 0.0 to 1.0 in steps of 0.1 to create nine morphed actions and two nonmorphed actions. All sequences lasted 1,385 ms. We presented these 11 actions at fixation to all participants and asked each of them to indicate verbally which action looked most ambiguous in terms of hug or clap. The morph step perceived as most ambiguous differed between participants and was used thereafter to create test stimuli specifically for each participant. We determined the test stimuli for each participant separately for the following reason. Measuring an adaptation aftereffect involves measuring the shift in perception of a stimulus after adaptation. This shift in perception is usually the largest around the point of subjective equality (ambiguity). By tailoring the ambiguous stimuli to each participant’s individual perception, we ensured that the shift in perception after adaptation was measured at the point of subjective equality and was therefore maximal for each participant.
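The morphing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `morph_actions` is hypothetical, each frame is assumed to be a flat list of joint rotation values, and linear averaging of those values is a simplification of whatever rotation representation was actually used. The numbers in the example are placeholders, not motion-capture data.

```python
def morph_actions(hug, clap, weight_hug):
    """Blend two time-normalized actions frame by frame.

    hug, clap: lists of frames; each frame is a flat list of joint
    rotation values (assumed representation). Both actions must have
    the same number of frames and values per frame.
    weight_hug: morph weight of the hug; the clap gets 1 - weight_hug.
    """
    if len(hug) != len(clap):
        raise ValueError("actions must be time-normalized to the same length")
    weight_clap = 1.0 - weight_hug
    return [
        [weight_hug * h + weight_clap * c for h, c in zip(hug_frame, clap_frame)]
        for hug_frame, clap_frame in zip(hug, clap)
    ]


# Eleven morph weights from 0.0 to 1.0 in steps of 0.1:
# two nonmorphed actions (0.0 and 1.0) plus nine morphs.
morph_weights = [round(i * 0.1, 1) for i in range(11)]

# Placeholder data: two frames with two rotation values each.
hug_action = [[0.0, 10.0], [20.0, 30.0]]
clap_action = [[10.0, 30.0], [40.0, 50.0]]

morphs = [morph_actions(hug_action, clap_action, w) for w in morph_weights]
print(morphs[5])  # the 0.5 morph: [[5.0, 20.0], [30.0, 40.0]]
```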

Once the most ambiguous action for each participant had been determined, the second step consisted of creating four additional morphed actions whose morph levels were equally spaced and symmetrical (step size: 0.025) around the chosen ambiguous action morph level, for a total of five ambiguous test actions. Across all participants, the weights we used varied between a minimum of 0.33 and a maximum of 0.63. We used five test actions that were perceptually discriminable by the participants so that they were not confronted with the exact same single test stimulus in all trials in the experiments reported later in this article.
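As a worked example of this second step, the sketch below derives the five test morph levels from a participant's most ambiguous level. The function name and the example level of 0.5 are hypothetical; only the step size (0.025) and the number of levels (five) come from the text.

```python
def ambiguous_test_levels(ambiguous_level, step=0.025, n_levels=5):
    """Return morph levels equally spaced and symmetrical around ambiguous_level."""
    half = n_levels // 2
    return [round(ambiguous_level + k * step, 3) for k in range(-half, half + 1)]


# Hypothetical participant whose most ambiguous morph was the 0.5 step.
print(ambiguous_test_levels(0.5))  # [0.45, 0.475, 0.5, 0.525, 0.55]
```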

Crowd stimuli

The crowd consisted of 16 stick figures. First, the stick figures were spatially distributed evenly (separated by 9.33° VA) along an arc of a circle positioned at 6 m virtual distance away from the participant (i.e., 2 m behind the target, who was acting out the adaptor and test actions). Thereby, the crowd spanned 140° VA. We then applied a random positional jitter along the x (left-right) and z (forward-backward) dimensions, which was maximally one eighth of the angular separation between adjacent crowd members (i.e., jitter range ±1.17° VA). This was meant to ensure the crowd’s spatial distribution appeared more natural (Figure 1). We created two different crowds: an idly moving crowd and a static crowd. All avatars of the idle crowd performed distinct small movements like stepping from one foot to the other or shaking one leg. They were animated for the same duration as the adaptor and test action stimuli. In the static crowd condition, we presented the first frame of each avatar animation for the same duration. The idle movements were selected from Rocketbox Libraries (Microsoft, Ireland) and applied to each figure of the crowd randomly. Selection criteria for the idle animations were that the stick figures never lifted their arms above the chest and that the animation was calm and moderately paced. We applied these criteria to ensure a clear distinction between the actions of the target stick figures and of those of the crowd members.

Figure 1. — **Illustration of the spatial distribution of a crowd**. Participant (black) seated at a table in the middle of the panoramic screen arena. Red circle: Adaptor or test stimulus in front of the crowd. Dotted line (grey) illustrates the arc of a circle positioned at 6 m virtual distance away from the participant. The small blue axes indicate the jitter that was applied to each crowd member along the x- and z-axis along the arc of a circle. Displayed distances and angles are approximate and not to scale.

The most central crowd members were positioned at a horizontal distance of 4.7° VA ±1.17° from the target stick figure. The shoulder width of the target stick figure was 6° VA, whereas that of the crowd stick figures varied between 3.6° VA and 5° VA (due to spatial jitter along the z dimension). When the jitter in x- and z-coordinates was maximal, the distance between the shoulders of the crowd avatars and the target stick figure was 0.1° VA. When stick figures executed actions or idle movements, however, their arms moved, which could lead to slight transient overlaps between them.
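The crowd geometry described above can be reproduced approximately with the following Python sketch. It is an illustration, not the authors' Unity code: the function name is hypothetical, and the conversion of the maximal angular jitter (one eighth of 9.33°) into metres along x and z, taken here as the arc length it subtends at 6 m, is an assumption because the text gives the jitter range only in degrees.

```python
import math
import random

N_FIGURES = 16
SPACING_DEG = 9.33                # angular separation between neighbours
RADIUS_M = 6.0                    # virtual distance of the arc from the participant
MAX_JITTER_DEG = SPACING_DEG / 8  # about 1.17 deg


def crowd_positions(seed=None):
    """Return (nominal_angle_deg, x_m, z_m) for each crowd member."""
    rng = random.Random(seed)
    # Assumption: metric jitter equals the arc length subtended by the
    # maximal angular jitter at the arc radius.
    max_jitter_m = RADIUS_M * math.radians(MAX_JITTER_DEG)
    positions = []
    for i in range(N_FIGURES):
        # Centre the 16 figures on straight ahead (0 deg).
        angle_deg = (i - (N_FIGURES - 1) / 2) * SPACING_DEG
        angle = math.radians(angle_deg)
        x = RADIUS_M * math.sin(angle) + rng.uniform(-max_jitter_m, max_jitter_m)
        z = RADIUS_M * math.cos(angle) + rng.uniform(-max_jitter_m, max_jitter_m)
        positions.append((angle_deg, x, z))
    return positions


crowd = crowd_positions(seed=1)
angles = [angle for angle, _, _ in crowd]
print(round(max(angles) - min(angles), 2))      # total span, close to 140 deg
print(round(min(abs(a) for a in angles), 2))    # most central members, close to 4.7 deg
```

With 16 figures there are 15 gaps of 9.33°, which gives the span of roughly 140°, and the two most central figures sit half a gap (about 4.7°) to either side of a target presented straight ahead.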

Target and crowd stick figures were clearly distinguishable, as the target stick figure was always presented ca. 2 m in front of the crowd members.

Apparatus

We used a large panoramic screen with a semicylindrical two-dimensional (2D) projection system for the presentation of the stimuli (Figure 2; for more detailed information see Fademrecht et al., 2016). The screen was 3.2 m high and 7 m long. Its main vertical portion used for presenting the stimuli was 2.5 m away from the participants and covered 230° horizontally and 125° vertically of their visual field. It extended onto the floor towards the location where participants were seated (see Figure 2). The screen was evenly lit with a mid-grey light. The Unity 3D (Unity Technologies) game engine in combination with a custom-written control script was used to control the presentation of the stimuli on the screen and to collect keypress responses given by the participants on a keyboard. The use of the game engine in combination with the large screen gave the impression that the avatars were standing in a virtual room extending the real room in which the participants were sitting. The game engine ensured that all depth cues, including correct lighting, occlusion and size scaling, were present except for stereo cues. In this fashion, participants perceived the stick figures as being life size and located 4 m (target) or 6 m (crowd) away from them. During the experiment, participants were required to focus their gaze on a white fixation cross presented on the screen straight ahead of them.
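As a quick check on the reported stimulus size, the visual angle subtended by a 170-cm figure at the virtual target distance of 4 m can be computed with the standard formula. The sketch below assumes the figure is centred on the line of sight, which simplifies the actual viewing geometry; the function name is illustrative.

```python
import math


def visual_angle_deg(size_m, distance_m):
    """Visual angle (degrees) of an object centred on the line of sight."""
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))


# Target figure: 1.70 m tall at a virtual distance of 4 m.
target_height_deg = visual_angle_deg(1.70, 4.0)
print(round(target_height_deg, 1))  # about 24 deg, matching the reported height
```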

Procedure and Design

Participants were seated in the middle of the screen arena and their heads were stabilized with a chin and forehead rest placed on a desk in front of them (see Figure 2). All tests were performed with the target presented at fixation and at 40° eccentricity to the right. All participants, before starting the tests described later in this article, had performed the verbal rating task mentioned earlier to design their personal ambiguous stimulus set.

Baseline Condition

First, in a baseline condition, we probed each participant for his or her perception of the (ambiguous) test stimuli obtained earlier without the presentation of an adaptor. Each trial began with the presentation of a fixation cross, followed after 500 ms by the test stimulus. The fixation cross was continuously present during stimulus presentation. After a blank of 500 ms, the question ‘What did it look like?’ and the response options ‘hug’ and ‘clap’ appeared on the screen. Participants were asked to respond by pressing the appropriate key on a keyboard (keys 0 and 1 were used). Responses were not timed and there was no time restriction. Each of the five ambiguous test stimuli was presented three times in pseudorandom order. Presentation location (0° vs. 40° eccentricity) was also randomized.

Adaptor Conditions

After the baseline measurements, we used an adaptation paradigm to test how sensitive action recognition is to our experimental manipulations. Our adaptation paradigm consisted of the presentation of an adaptor followed by a test stimulus. The original hug and clap actions served as adaptor stimuli whilst the five ambiguous morphs were used as test stimuli. We used this adaptation paradigm with all possible combinations of our experimental conditions: three crowd conditions (no crowd, static crowd and idly moving crowd), two different eccentricities (0° and 40° eccentricity) and two adaptor stimuli conditions (clap and hug). In the 0° eccentricity condition, adaptor and test stimuli were presented at 0°, whilst in the 40° eccentricity condition, adaptor and test were both presented at 40° eccentricity. The position of the crowd remained the same in both eccentricity conditions. In the no crowd condition, adaptor and test stimuli were presented alone on the screen. In the other two conditions, the crowd was visible during both the adaptation and test phases. We completely crossed all levels of the factors adaptor, eccentricity and crowd (2 × 2 × 3) and all participants were tested on every combination. Each factor combination was presented in a blocked fashion.
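The fully crossed 2 × 2 × 3 design can be enumerated as follows. This is a small illustrative sketch; the condition labels are shorthand chosen here, and the order in which blocks were actually run is not specified in the text.

```python
from itertools import product

ADAPTORS = ("hug", "clap")
ECCENTRICITIES_DEG = (0, 40)
CROWDS = ("no crowd", "static crowd", "idle crowd")

# One block per combination of adaptor, eccentricity and crowd.
blocks = [
    {"adaptor": adaptor, "eccentricity_deg": ecc, "crowd": crowd}
    for adaptor, ecc, crowd in product(ADAPTORS, ECCENTRICITIES_DEG, CROWDS)
]

print(len(blocks))  # 12
```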

A schematic outline of an experimental block is shown in Figure 3. Each block started with an initial adaptation phase followed by an experimental phase. During the initial adaptation phase, an adaptor stimulus (hug or clap) was shown 26 times (interstimulus interval [ISI] = 500 ms). This phase was included to maximize any adaptation aftereffect. Thereafter, the experimental phase consisted of several experimental trials. Each experimental trial started with four presentations of the adaptor (ISI = 500 ms) to ‘top-up’ the initial adaptation. Immediately after that, a 500-ms blank screen was presented together with a 1000 Hz beep warning about the imminent display of the test stimulus. Subsequently, one of the five ambiguous test stimuli appeared, followed by the answer screen (Figure 3). Participants had unlimited time to respond. The next experimental trial started immediately after the participants gave their response via keypress. Participants were asked to report their subjective feeling regarding the action category (hug or clap) of the test stimulus. They were explicitly instructed to judge the test stimuli (not the adaptor) as either hug or clap. Within each block, each of the five ambiguous stimuli was presented three times in randomized order, for a total of 15 trials per block.
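The structure of one block can be summarized as an event list. The sketch below is a schematic under stated assumptions, not the authors' control script: the function name, the event labels and the example morph levels are hypothetical, and the 500-ms ISIs between adaptor presentations are omitted for brevity.

```python
import random


def build_block(adaptor, test_levels, repetitions=3, seed=None):
    """Return the ordered events of one adaptation block."""
    rng = random.Random(seed)
    trials = list(test_levels) * repetitions  # 5 levels x 3 repetitions = 15 trials
    rng.shuffle(trials)

    events = [("adapt", adaptor)] * 26        # initial adaptation phase
    for level in trials:
        events += [("adapt", adaptor)] * 4    # top-up adaptation
        events.append(("blank_with_beep_ms", 500))
        events.append(("test", level))
        events.append(("response", None))     # untimed hug/clap keypress
    return events


# Hypothetical participant-specific test levels.
block = build_block("hug", [0.45, 0.475, 0.5, 0.525, 0.55], seed=0)
n_tests = sum(1 for kind, _ in block if kind == "test")
n_adaptors = sum(1 for kind, _ in block if kind == "adapt")
print(n_tests, n_adaptors)  # 15 86
```

The adaptor count follows from 26 initial presentations plus four top-ups on each of the 15 trials.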

An Eyelink II eye tracker mounted on the chin rest recorded participants’ eye movements. Participants were asked to fixate on the fixation cross during each trial. We had planned to remove from the analysis those trials in which participants moved their gaze away from the fixation cross by more than 2° during the stimulus presentation. Due to a technical error, however, the eye-tracking data could not be used. Nevertheless, previous research using the same testing environment had shown that participants could reliably fixate (proportion of invalid trials was less than 0.8%) even during stimulation of the visual periphery (Fademrecht et al., 2016).

Results

To analyse the data, we calculated the proportion of clap responses for each of the experimental conditions. Note that the results would be identical if we had chosen to calculate the proportion of hug responses instead of clap responses.

Our main goal in this experiment was to assess the adaptation aftereffect (defined here as the difference in the proportion of clap responses between the hug and clap adaptor conditions) on action perception (see Figure 4). A two-way repeated-measures analysis of variance (ANOVA) with crowd and eccentricity as within-subject factors demonstrated that the main effects of eccentricity, F(1, 13) = 1.57; η²_partial = 0.11; p = .232, and crowd, F(2, 26) = 0.16; η²_partial = 0.01; p = .853, as well as their interaction, F(2, 26) = 0.81; η²_partial = 0.06; p = .457, were all nonsignificant. Hence, there was no difference in adaptation aftereffects between central and peripheral vision, and the static crowd as well as the idly moving crowd had little influence on the adaptation aftereffect in comparison to the no crowd condition. Moreover, all adaptation aftereffects were significantly different from zero (Holm corrected) in all crowd conditions at 0° and 40° eccentricity (all p < .001).
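The aftereffect measure defined above can be written out explicitly. The sketch below assumes the sign convention "hug adaptor minus clap adaptor" (so that a repulsive aftereffect is positive), which the text does not state explicitly; the response lists are made-up examples, not data from the study.

```python
def proportion_clap(responses):
    """Proportion of 'clap' responses in a list of 'hug'/'clap' responses."""
    return sum(r == "clap" for r in responses) / len(responses)


def adaptation_aftereffect(responses_after_hug, responses_after_clap):
    """Difference in the proportion of clap responses (hug minus clap adaptor)."""
    return proportion_clap(responses_after_hug) - proportion_clap(responses_after_clap)


# Made-up responses for one participant in one crowd-by-eccentricity cell
# (15 trials per adaptor block).
after_hug = ["clap"] * 12 + ["hug"] * 3
after_clap = ["clap"] * 4 + ["hug"] * 11

effect = adaptation_aftereffect(after_hug, after_clap)
print(round(effect, 3))  # 0.533
```

Because the proportions of hug and clap responses sum to one, computing the measure from hug responses would only flip its sign.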

Figure 4. — Overall adaptation aftereffects for each crowd condition (no crowd, static crowd and idly moving crowd) at 0° and 40° eccentricity. Colours represent the three crowd conditions. Error bars represent standard errors of the mean.

Discussion

Experiment 1 revealed that neither the presence of a crowd nor peripheral presentation significantly affected action adaptation. Similar adaptation aftereffects in central vision and at 40° eccentricity indicate that action recognition mechanisms that are susceptible to adaptation can discriminate actions even in the far periphery. These results are in line with other research that suggests that participants have little difficulty in recognizing moving actions in the visual periphery up to 45° eccentricity (Fademrecht et al., 2016).

The presence of a static or an idly moving crowd did not influence the effects of action adaptation. This finding is in accordance with previous research which showed that low-level adaptation aftereffects (e.g., orientation adaptation aftereffects) are affected little or not at all by crowding. For example, Blake, Tadin, Sobel, Raissian, and Chong (2006) demonstrated that crowding does not reduce the orientation adaptation aftereffect, at least when high contrast stimuli are presented. Similarly, Pelli and Tillman (2008) reported that crowding affects the discrimination of target orientation but has little effect on the occurrence of an orientation adaptation aftereffect. Hence, one explanation might be that adaptation aftereffects, in general, are little affected by crowding.

An alternative explanation could be that neural populations (action channels) are sensitive to a specific action (akin to the action-sensitive units in the Giese and Poggio model [2003]). According to this view, in order for the crowd to induce crowding effects, the crowd actions need to activate at least one of the two action channels (clap or hug channel) involved in the perception of the test stimulus. Yet, neither the static nor the idle crowd showed actions that could activate those channels. Hence, adaptation aftereffects should be unaffected by these crowds. According to this explanation, a crowd might only modify the adaptation aftereffect if its members display clap and hug actions. In Experiment 2, we tested this hypothesis.

Experiment 2

Experiment 2 replicated Experiment 1 with the only difference that we used a crowd whose members were carrying out the adaptor actions (active crowd). We examined action adaptation aftereffects with the same adaptation paradigm whilst manipulating eccentricity (0° vs. 40°) and crowd (no crowd vs. active crowd).