Perception in dynamic scenes: What is your Heider capacity?

Farahnaz A Wick; Abla Alaoui Soce; Sahaj Garg; River Grace; Jeremy M Wolfe

doi:10.1037/xge0000557

Author manuscript; available in PMC: 2020 Feb 1.

Published in final edited form as: J Exp Psychol Gen. 2019 Feb;148(2):252–271. doi: 10.1037/xge0000557

PMCID: PMC6396302 NIHMSID: NIHMS1002458 PMID: 30667269

Abstract

The classic animation experiment by Heider and Simmel (1944) revealed that humans have a strong tendency to impose narrative even on displays showing interactions between simple geometric shapes. In their most famous animation with three simple shapes, observers almost inevitably interpreted them as rational agents with intentions, desires and beliefs (“That nasty big triangle!”). Much work on dynamic scenes has identified basic visual properties that can make shapes seem animate. Here, we investigate the limits on the ability to use narrative to share information about animated scenes. We created 30-second Heider-style cartoons with 3–9 items. Item trajectories were generated automatically by a simple set of rules, but without a script. In Experiments 1 and 2, ten observers wrote short narratives for each cartoon. Next, new observers were shown a cartoon and then presented with a narrative generated for that specific cartoon or one generated for a different cartoon having the same items. Observers rated the fit of the narrative to the cartoon on a scale from 1 (clearly does not fit) to 5 (clearly fits). Performance declined markedly when the number of items was larger than three. Experiment 3 had observers determine if a short clip of a cartoon came from a longer clip. Experiment 4 had observers determine which of two narratives fit a cartoon. Finally, in Experiment 5, narratives always mentioned every item in a display. In all cases of matching narrative to cartoon, performance drops most dramatically between 3 and 4 items.

Keywords: visual working memory, dynamic scene understanding, tracking capacity, social cognition

We have the compelling impression that we can perceive the entire environment surrounding us in rich detail. However, it is well known that we cannot actually fully process all the local information in our environment at once. As a consequence, we selectively attend to some objects or regions and, as a result, those regions are more fully processed than other aspects of the current scene. This selection can be based on our current goals and/or lower level features (Wolfe & Horowitz, 2017; Theeuwes, 2018). One factor that can shape deployment of attention is known as the animate monitoring bias (New, Cosmides & Tooby, 2007; Simion, Regolin & Bulf, 2008). Attention is preferentially deployed to animate agents such as people or animals compared to inanimate objects such as plants and vehicles. Motion also influences the prioritization of inanimate items in our field of view (Buren, Uddenburg & Scholl, 2016). Thus, if a ball rolls towards you, it is more likely to capture your attention than if it is moving away or if it is stationary.

We deploy attention to motion because, in the real world, objects in motion are more likely to be important. All the more so, if the motion is intentional. One evolutionary key to successful social relationships is the ability to understand the mental states or intentions of other people. Motion is one clue to those intentions. For instance, if you see someone shaking their fists at another person, you automatically think that the person moving their hands might be “angry”. Using visual input to understand and recognize that people have beliefs and desires different from our own is an aspect of theory of mind (Premack & Woodruff, 1978).

Inferring intentions of other people is so important for our survival that it has been suggested that neural mechanisms to detect biological motion might have evolved for this very reason (Frith & Frith, 1999). The rules for inferring animacy and intentions on the basis of motion have been the subject of considerable psychophysical research (Pantelis et al., 2014; Gao, McCarthy & Scholl, 2010; Barrett, Todd, Miller & Blythe, 2005; Tremoulet & Feldman, 2000, 2006; Leslie, Friedman & German, 2004; Blythe, Todd & Miller, 1999; Heider & Simmel, 1944) and computational modeling (Baker, Jara-Ettinger, Saxe & Tenenbaum, 2017; Pantelis et al., 2016). Others have considered the role of these phenomena in the context of mindreading (Kuhlmeier, Wynn & Bloom, 2003; Gallese & Goldman, 1998).

In this paper, our interests are focused on the limits of the ability to infer the goals and intentions of others from their motion. It seems intuitively clear that there must be a limit. For instance, it might be difficult to simultaneously consider the separate intentions of 100 agents. To bring this problem to a controlled lab setting we used animations with simple geometric shapes. The tendency to ‘anthropomorphize’ simple shapes was famously demonstrated by Heider and Simmel (1944). In their study, observers were asked to view a short, simple animation of two triangles, a circle and a rectangular frame with a “door” (see Figure 1). When observers were asked to describe the animation, they produced narratives involving the three shapes instead of a literal description of the movement. For present purposes, a narrative can be defined as an organized interpretation of a sequence of events. Given the classic Heider and Simmel cartoon, most narratives involved a romantic relationship between the small triangle and the circle who were trying to escape the aggressive triangle. In producing such stories, observers are adopting what Dennett (1989) called the “intentional stance”, imposing intentions on the shapes rather than simply offering a report of the visual features and motions. This mechanism of remembering events in the form of a story seems to be reflexive and automatic (Scholl & Tremoulet, 2000). It could be thought of as a way to ‘parse’ a dynamic stimulus into meaningful units (Tse, Cavanagh & Nakayama, 1998; Zacks, 2004). Turning motion into a story is reminiscent of ‘chunking’ processes in memory (Miller, 1956), where the imposition of meaning onto a stimulus allows more of that stimulus to be coded into memory. Thus, the string “14921776” becomes two dates rather than eight digits. In a similar way, a string of motions might be recoded as the square avoiding the circle. 
This narrative recoding differs from chunking of digits because, while it could be seen as a more compact representation of the input, narrative recoding in this manner would not allow the observer to recover the precise details of the unchunked stimulus in the way that those two dates can be restored to a string of eight digits. In this paper, we examine the limits on the use of narrative in perceiving and remembering motion. Specifically, suppose that an observer generates a narrative based on a cartoon with N moving shapes each with its own motion. Will a second observer be able to look at a cartoon and determine if that cartoon inspired that narrative? The motions we used could interact (e.g. object A ‘chases’ object B) but we did not use group motion (e.g. a school of fish) because we assume that, for present purposes, the group would become a single object.

Figure 1. — A single frame, redrawn from the animation used to study perception of intention in Heider and Simmel’s (1944) experiment.

Our question about processing limits or capacity is different than the questions that have been the usual focus of interest in studies of the perceived animacy of simple shapes. Most work has focused on the stimulus properties that produce a perception of animacy. Historically, Michotte (1963) did the classic work describing when motions and interactions produce a causal interpretation (Did this object cause the motion of that object?). More recently, Scholl and Gao (2013) produced a taxonomy of motions that look animate (Did that spot ‘decide’ to move or did it simply move when hit by some other item?). Similarly, Tremoulet and Feldman (2000, 2006) asked observers to assess the animacy of a small moving dot that changed direction and speed of movement. Dots that underwent larger direction or speed changes were perceived to be more animate than dots that underwent smaller changes. More complex states like “chasing” can be modulated by subtle cues such as whether an object is oriented toward another object. You do not tend to chase things that you are not looking at (Gao, McCarthy & Scholl, 2009, see review: Scholl & Tremoulet, 2000; Gao, Newman & Scholl, 2010). Pantelis and Feldman performed a series of studies exploring how human observers attribute mental states to autonomous virtual agents and how they categorize those mental states in dynamic displays where agents forage for food and each agent can explore, gather, attack or flee from other agents (Pantelis et al., 2016; Pantelis et al, 2012; Pantelis & Feldman, 2012).

We are asking about the stories that arise when people apply these rules to moving objects and attempt to tell others about what they have seen. Assuming that different observers follow similar rules of inference, observers seem likely to agree about the interpretation of simple events or behaviors. For example, if one object makes contact with another and the second, previously stationary object moves, an observer might say that object A hit object B and caused it to move. A second observer is likely to agree. Multiple objects can generate similarly simple stories if objects can be grouped. For instance, a single “wolf” object can be perceived as “chasing” a collection of “sheep” objects (Gao, McCarthy & Scholl, 2009). But what happens when the number of agents or groups of agents gets larger? If we view 3, 4, or more items, each performing its own rule-governed behavior, what do we perceive? If one observer creates an account of the activity in a multi-item display, would another observer recognize that account as clearly referring to that display? What is the limit on the number of agents that we can integrate into a coherent, generally agreed upon narrative? The purpose of this paper is to make an estimate of that limit – a limit we will call our ‘Heider capacity’ in honor of the classic Heider and Simmel work.

We define ‘Heider capacity’ as the ability to infer intentionality of agents in a dynamic scene and communicate that intentionality to others in the form of a narrative. It can be thought of as “shared capacity” based on communication and agreement between observers. The term “capacity” should not be overstressed here. We are attempting to characterize a limit on the ability to perceive and/or communicate the contents of simple, artificial scenes. To anticipate, our data will show that this shared capacity appears to be quite limited. Heider capacity varies, to some degree, as a function of how it is probed. However, over a series of experiments, performance falls as the number of agents rises. The general trend is that when the number of items in a display exceeds three, observers are markedly less likely to agree about what they have seen.

General Method

We generated cartoons composed of simple geometric shapes in motion, as in the original Heider animation described above. However, rather than being scripted, the movements of objects in our cartoons were governed by a set of stochastic rules, described below. Therefore, the story line was much less predetermined in our cartoons than it was in Heider’s. In Part 1 of the experiment, we asked a group of observers to watch the cartoons and produce narratives. In Part 2, a separate group of observers saw a cartoon and read a narrative that was derived from that cartoon or from another cartoon with the same ‘characters’ but with different motions. We measured the ability of these new observers to determine if the narrative matched the cartoon. From the change in performance as a function of the number of characters on screen, we derive an estimate of what we are calling the “Heider capacity”.

Experiment 1: Measuring Heider capacity with simple shapes

Method

Participants

Part 1: Collect narratives

Twenty observers were recruited from Amazon Mechanical Turk to view cartoons and produce a narrative for each cartoon viewed. All observers were from the United States, gave informed consent and were paid $8.00 for approximately 45–60 minutes of their time. The informed consent procedures were approved by Brigham and Women’s Hospital IRB.

Part 2: Measure Heider capacity

Ninety-six observers were recruited through Amazon Mechanical Turk to measure Heider capacity from the narratives and cartoons. All observers were from the United States, gave informed consent and were paid $2.00 for approximately 10–15 minutes of their time. The informed consent procedures were approved by Brigham and Women’s Hospital IRB. For a novel experiment where we did not have an estimate of the effect size, we tested a large sample in the hope of achieving a robust effect (see power discussion below). We used the same sample size in most of the subsequent experiments.

Stimuli

Heider-style cartoons were created using Matlab with Psychtoolbox (Brainard, 1997; Pelli, 1997; Kleiner et al, 2007). Each cartoon contained moving circles, squares and triangles of uniform size and a larger, unmoving black rectangle representing a wall or obstacle. We generated a total of 20 cartoons, populated by 3, 4, 5, 7 or 9 moving shapes or ‘characters’. Four cartoons were created for each set size, each populated by the same characters, but with movements dictated by a unique combination of rules. Each cartoon was 30 seconds in length. The length was chosen based on pilot experiments: shorter length cartoons (15–20 seconds) produced impoverished/uninteresting narratives whereas narratives from longer cartoons (45–60 seconds) showed quite marked primacy-recency effects. Systematic manipulation of the length could be an interesting follow-up study. The cartoons were generated and stored offline. The cartoons are available at https://osf.io/atc9x/.

The numbers of circles, squares and triangles were distributed as evenly as possible; for example, if the set size was seven, there would be two of each shape, plus an extra instance of one shape. Each shape was randomly assigned a unique color and a behavior from the following list: chasing, repulsion, attraction, moving to a specific location on the screen (e.g. one corner), jittering, and avoiding the stationary rectangle. Behaviors were selected with replacement. Thus, two items might be ‘chasing’ in the same cartoon. The cartoons selected for the experiments were visually inspected to ensure that shapes exhibited different mixes of behaviors. For instance, two five-element cartoons were not permitted to both have two chasers and an item moving to a corner, even if the motion paths and the shapes assigned to these motions were not identical.
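The assignment procedure just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Matlab code used to generate the stimuli: the function name `make_cast`, the behavior labels, and the placeholder palette names are hypothetical, and the pairing of shapes needed by two-item rules such as chasing is left out.

```python
import random

SHAPES = ["circle", "square", "triangle"]
# Labels are hypothetical stand-ins for the behaviors listed in the text.
BEHAVIORS = ["chasing", "repulsion", "attraction",
             "move_to_location", "jittering", "avoid_rectangle"]


def make_cast(set_size, palette, rng):
    """Return one (shape, color, behavior) tuple per character in a cartoon."""
    base, extra = divmod(set_size, len(SHAPES))
    # Every shape appears `base` times; `extra` randomly chosen shapes get one more,
    # so the counts are as even as possible.
    bonus = set(rng.sample(SHAPES, extra))
    shapes = [s for s in SHAPES for _ in range(base + (s in bonus))]
    colors = rng.sample(palette, set_size)          # without replacement: unique colors
    behaviors = rng.choices(BEHAVIORS, k=set_size)  # with replacement: repeats allowed
    return list(zip(shapes, colors, behaviors))


# Placeholder names standing in for the 28 visually distinct colors.
palette = [f"color_{i:02d}" for i in range(28)]
cast = make_cast(7, palette, random.Random(0))
```

With a set size of seven, the cast contains two instances of two shapes and three of the third, each with its own color. The visual inspection step that rejects cartoons with the same mix of behaviors is not modeled here.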

Colors.

Twenty-eight visually distinguishable colors were generated and selected from the following website: http://phrogz.net/css/distinct-colors.html using the default settings. Therefore, every shape or character in the cartoons had a ‘visually distinct’ color. Different colors were assigned to shapes of the same type (for instance, there were no two ‘blue’ triangles in a cartoon). Colors will have some range of variation because the stimuli were viewed on many different screens. However, if a narrative declares that the red square was following the blue circle, we have no reason to believe that normal variation in color across platforms would interfere in understanding of the narrative.

Movement rules.

Each cartoon consisted of 1800 static frames presented one after the other for approximately 17 ms each. Depending on the motion rule, the items could move a minimum of 1 and a maximum of 6 pixels per frame. If we assume a display subtending approximately 53° by 35° (Dell 28-inch monitor with a resolution of 1920 × 1280 pixels, viewing distance of 50 cm), individual items were 2° in diameter. The black rectangle would subtend a visual angle of 2° × 10° (see Figure 2). The resulting item velocities were in the average range from 1° to 4°/s. The velocities of the items were not constant and could increase or decrease by 1 pixel per frame on average. Of course, viewing conditions for Amazon Turk observers will vary. However, over a reasonable range, it seems unlikely that this variation will be a critical variable in this task. Consider, for example, that the “narrative” unfolding in a movie is not radically different if the observer is in the first row of the theater or viewing the film on the screen on the back of the airline seat in row 32.

Figure 2. — Screenshots from cartoons of different set sizes (indicated by the number on the top left corner) in (a) Experiment 1 and (b) Experiment 2.

Initial positions and directions for each shape were generated at random. There were many possibilities for collisions between items, in which case, the shapes bounced off each other conserving momentum. Objects also bounced off the edges of the display and the stationary rectangle. The specific motion rules are described below:

Chasing:

Two shapes in the cartoon were selected whenever this rule was used. The chaser would move towards the direction of the chased shape and the chased item would pick a random direction and move in that direction away from the chaser when that chaser approached within a ~200 pixel radius.

Attraction:

Again, two shapes were selected whenever this rule was used. The two shapes would move towards each other in a straight line. When they were within 50 pixels of each other, they would move along a random but conjoint trajectory, circling and bumping into each other. Note that during collisions with other shapes, obstacles or edges of the display, these shapes could separate and subsequently rejoin each other.

Repulsion:

Two shapes were assigned to this rule. The shapes would move toward each other on the screen but if they were within ~100 pixels of each other, they would exert a force (calculated from the distance between the shapes with an added constant) on each other so it would seem like they are ‘pushing’ each other back. After repelling each other, there was a 30% probability that the shapes would wander away from each other and a 70% chance that they would, again, move toward each other for another round of repulsion.

Moving to a corner of the screen or to a random location:

A shape moved to a corner or other specific location from its current location. Since the velocities never dropped to zero, the shape would bounce around that corner of the display or move around randomly in the vicinity of its specific goal location.

Jittering:

The shape would change its heading in the range of +/− 20° per frame. This would give the impression of jittering or ‘shaking’ along a trajectory. This property could be assigned to any item on the screen with a probability of 30%. Therefore, there could be multiple items, involved in other behaviors, that could have this property.

Avoiding the stationary rectangle:

These items moved on straight-line trajectories. They would slow down and change direction whenever the item was within ~100 pixels of the stationary rectangle.
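To make the rule descriptions concrete, here is a minimal Python sketch of the per-frame updates for the chasing and jittering rules. It is not the authors' Matlab implementation. The speed limits, the ~200 pixel radius and the ±20° jitter come from the text; the function names, the choice to draw the fleeing direction from the half-plane facing away from the chaser, and the re-drawing of that direction on every frame inside the radius are assumptions made for illustration.

```python
import math
import random

MIN_SPEED, MAX_SPEED = 1.0, 6.0  # pixels per frame (from the text)
FLEE_RADIUS = 200.0              # approximate chase trigger radius in pixels (from the text)
JITTER_DEG = 20.0                # maximum heading change per frame in degrees (from the text)


def clamp_speed(speed):
    """Keep a speed within the stated per-frame limits."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))


def step(pos, heading, speed):
    """Advance one frame along `heading` (radians) at `speed` pixels per frame."""
    return (pos[0] + speed * math.cos(heading),
            pos[1] + speed * math.sin(heading))


def chaser_heading(chaser_pos, target_pos):
    """The chaser heads straight toward the chased shape."""
    return math.atan2(target_pos[1] - chaser_pos[1],
                      target_pos[0] - chaser_pos[0])


def chased_heading(chased_pos, chaser_pos, heading, rng):
    """Keep the current heading unless the chaser is within FLEE_RADIUS.

    Assumption: when fleeing, the new direction is drawn uniformly from the
    half-plane that points away from the chaser.
    """
    if math.dist(chased_pos, chaser_pos) > FLEE_RADIUS:
        return heading
    away = math.atan2(chased_pos[1] - chaser_pos[1],
                      chased_pos[0] - chaser_pos[0])
    return away + rng.uniform(-math.pi / 2, math.pi / 2)


def jitter(heading, rng):
    """Perturb the heading by up to +/- JITTER_DEG degrees for one frame."""
    return heading + math.radians(rng.uniform(-JITTER_DEG, JITTER_DEG))


# Roughly one second of animation at ~17 ms per frame (positions are arbitrary).
rng = random.Random(1)
chaser, chased = (100.0, 100.0), (400.0, 300.0)
chased_dir = 0.0
for _ in range(60):
    chaser = step(chaser, chaser_heading(chaser, chased), clamp_speed(4.0))
    chased_dir = chased_heading(chased, chaser, chased_dir, rng)
    chased = step(chased, jitter(chased_dir, rng), clamp_speed(3.0))
```

Collisions, bouncing off the display edges, and the remaining rules (attraction, repulsion, goal locations, rectangle avoidance) are omitted; they would be added as further heading and speed updates in the same loop.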

Procedure

Part 1: Collecting narratives

Each observer completed 12 trials, including 2 practice trials. They viewed two cartoon versions for each of the five set sizes. On each trial, observers pressed a key to start the cartoon. They were encouraged to take notes as they viewed each cartoon, but could not pause or replay it. After the cartoon, a textbox appeared and observers were asked to write a story about the animation that was at least 25 words long. If the textbox contained fewer than 25 words, a pop-up display would alert the observer to the minimum word count and prevent them from proceeding to the next trial. We did not give any specific guidelines for the stories, except to state that it was not useful to give purely physically descriptive accounts (e.g., it would not be useful to say “There were 5 shapes. They were red, green, blue, yellow, and purple. They moved around.”).

There were five set sizes: 3, 4, 5, 7 and 9. Each experiment consisted of 12 cartoons: 2 practice trials and 2 cartoons for each of five set sizes. The practice trials consisted of cartoons from set sizes 3 and 9, and the cartoons used in the practice trials were not repeated during the experiment. Narratives were collected from ten observers for each cartoon in the stimulus set. These narratives were rated independently by three lab assistants (all naïve to the purpose of the study) for their accuracy and fit with the corresponding cartoon. The guidelines for assessing fit were to assign one point for each of the following six criteria that was met: (1) shapes could be identified based on the physical description; (2) behaviors could be identified; (3) the narrative was “entertaining” (to avoid purely physical descriptive accounts); (4) the narrative did not contain specialized references (e.g. to pop culture); (5) the narrative attributed emotions to shapes; and (6) the spatial structure and locations of shapes were described accurately. The five highest rated narratives per cartoon were selected for Part 2 of the study. Narratives from practice trials were discarded. The narratives were corrected for spelling and grammatical mistakes.

Part 2: Measuring Heider capacity

On each trial in Part 2, an observer viewed a cartoon, read a narrative, and then rated how well the narrative matched the cartoon. No observers from Part 1 participated in Part 2. There were two conditions (described below) with 48 observers in each condition. If we use a chi-square test to compare correct responses between set sizes, 117 observations are adequate to detect an effect of Cohen’s medium size (.3) with a power of .9 at a significance level of .01. Our 480 observations in each condition (48 observers, 10 observations per observer) should be more than adequate to see any interesting effects of set size. We use similar sample sizes in all our experiments.

Observers completed 12 trials including 2 practice trials, viewing two cartoon versions from each set size. The two practice trials consisted of cartoons with set sizes 3 and 9 and these cartoons along with the corresponding narratives were not repeated during the experiment. In each trial, observers pressed a key to start the cartoon. They were encouraged to take notes but could not pause or replay the cartoon. After the cartoon ended, a narrative was shown that either matched or mismatched the cartoon just viewed (cartoon-first). In a separate condition with 48 new observers, the narrative was presented first followed by the cartoon (narrative-first). After the observers had been exposed to both a cartoon and a narrative, they rated whether the narrative fit the cartoon on a scale from 1 (clearly does not fit) to 5 (clearly fits). An equal number of matched or mismatched narratives were shown for each set size. Data from practice trials were not included in the analysis. Each cartoon was shown only once during the experiment to avoid repetition or learning effects.

Results

Part 1

On average, the selected narratives produced in Part 1 were 43.6 words long (SD = 20.5). When the set size shown in the cartoon was less than 5, the narratives usually described some behaviors of all items in the cartoons (see Fig 3, left). As set size increased, however, narratives tended to focus on a subset of the items (approximately 3 items), generally the shapes that moved in and out of the two spaces separated by the black rectangle. Here are two sample narratives:

Figure 3. — *Left:* Average number of shapes mentioned in narratives used in Experiments 1 and 2. Recall there were four cartoons for each set size involving the same actors. The number of items was averaged across 20 narratives per set size. *Right:* Average number of actions mentioned in the narratives used in Experiments 1–5. The error bars represent standard error.

Set size 3

The triangle and circle played together, chasing each other all over the field and having a great time. The poor square was left out, and was timidly trying to join in and get them to notice but to no avail. The square was left out and ignored by the other two.

Set size 9

The triangles had been friends for a while now. One of them convinced the other to go to the party. The green one was super hyper, but generally stayed with his triangle friends. The circles, meanwhile, were very extroverted and talked to everyone at the party.

As shown in Figure 3, we counted the number of shapes mentioned in narratives from Experiment 1. Only shapes that were explicitly mentioned (‘that blue triangle’) were counted. Some narratives contained shapes (such as ‘the nervous square’) entering or leaving a group of other items and in these cases, only the shapes that were described performing some action were counted. The group would not be counted in this case. In narratives where group behavior was clearly described, rather than the behavior of individual shapes, the total numbers of items in the groups were used. Thus, if a narrative stated ‘the triangles attacked the circles’, all the triangles and circles would be considered to have been mentioned, though one could argue that multiple objects had been reduced to two groups. As can be seen in Figure 3, the average number of items described in narratives is around three regardless of the number of shapes in the scene. As the set sizes in the cartoon increased, observers did not or could not increase the numbers of shapes in the narrative. Note that they did not, for example, report on shapes A, B, and C for the first 10 seconds of the cartoon and C, D, and E for some later portion. Stories tended to be about three shapes. An important implication here is that two observers watching a cartoon with 9 shapes, for example, might not recognize each other’s narratives because each might be paying attention to a different three-item subset. This three-item limit is suggestive of the limits on visual working memory (Luck & Vogel, 1997; Oksama & Hyona, 2004; Wolfe, Reinecke & Brawn, 2006) and on working memory more generally (Cowan, 2017) though the similarity could be coincidental.

We counted the number of action words used in the narratives used in Experiment 1. We counted words that described interactions between two items or behaviors of single items. Compound descriptions like ‘pushing and kicking’ were considered to be a single action word. Behaviors describing emotion such as ‘jittery square’, ‘moved angrily’ and position of items such as ‘close to red square’ were also considered action words. If the action of an item was repeated in the story, the repeats were counted separately. As can be seen in Figure 3 (right), the average number of action words (~ 4 words) used to describe behavior in narratives is similar across set sizes.

Part 2

As shown in Figure 4, the rating scale data from Part 2 can be used to generate Receiver Operating Characteristic (ROC) curves for each set size for the entire group of observers. Points on the ROC are determined by shifting a decision criterion. Thus, all ratings above 3 might be taken as “match” responses and all other responses as “mismatch”. The match responses give rise to true positive and false positive proportions and, thus, to a point on the ROC. Moving the criterion to a rating of 2 gives a different set of proportions and a different point, and so on. Area under the curve (AUC) or d′ values can be derived from these curves (Macmillan & Creelman, 1996). If observers were guessing, the points of the ROC should fall along the diagonal (dotted lines in Figure 4). If observers can discriminate between matched or mismatched narratives, then the ROC curve will lie above the chance diagonal, curved towards the top left-hand corner.

Figure 4 shows a straightforward result. In both conditions (cartoon-first, narrative-first), when there are three items in the cartoon, the second observer can recognize whether the cartoon she is seeing fits with the story that another observer is telling. When the set size is greater than 3, performance deteriorates markedly. This can be quantified by calculating d′ values from the area under the ROC curves (see Table 1). Thus, we would say that the ‘Heider capacity’, as measured in Experiment 1, appears to be about 3.
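The construction of an ROC from rating data, and the conversion from area to d′, can be sketched as follows. This is one plausible reading rather than the authors' analysis code: the trapezoidal estimate of the area and the equal-variance Gaussian conversion d′ = √2 · z(AUC) are assumptions, the function names are hypothetical, and the ratings in the example are made up purely to exercise the functions.

```python
from statistics import NormalDist


def roc_points(match_ratings, mismatch_ratings, scale=(1, 2, 3, 4, 5)):
    """Return (false-positive rate, hit rate) pairs, one per decision criterion.

    A rating at or above the criterion counts as a "match" response. The
    criterion sweeps from strict (5) to lax (1), so the points run from
    (0, 0) to (1, 1).
    """
    points = [(0.0, 0.0)]
    for criterion in sorted(scale, reverse=True):
        hits = sum(r >= criterion for r in match_ratings) / len(match_ratings)
        false_alarms = sum(r >= criterion for r in mismatch_ratings) / len(mismatch_ratings)
        points.append((false_alarms, hits))
    return points


def auc(points):
    """Trapezoidal area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def d_prime_from_auc(area):
    """Equal-variance Gaussian conversion; undefined for an area of exactly 0 or 1."""
    return 2 ** 0.5 * NormalDist().inv_cdf(area)


# Made-up ratings for illustration only (not data from the experiments).
match = [5, 4, 4, 3, 5, 2, 4, 5]     # narrative matched the cartoon
mismatch = [1, 2, 3, 2, 4, 1, 2, 3]  # narrative came from another cartoon
points = roc_points(match, mismatch)
area = auc(points)
sensitivity = d_prime_from_auc(area)
```

If the two rating distributions are identical, every point falls on the chance diagonal, the area is 0.5, and the derived d′ is 0.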

Table 1: d′ values for each set size in each experiment.

Set size	Exp. 1: Cartoon first	Exp. 1: Narrative first	Exp. 2: Cartoon first	Exp. 2: Narrative first	Exp. 3: Cartoon first	Exp. 3: Clip first
3	1.73	1.61	1.27	1.67	1.11	0.93
4	0.57	0.59	0.55	0.93	0.97	0.22
5	0.38	0.67	0.29	1.14	0.64	0.70
7	0.37	0.89	0.02	0.92	0.46	0.32
9	0.43	0.65	0.40	0.91	0.77	0.59

To understand the performance differences between set sizes within and across the two conditions, we counted the number of ‘correct’, ‘neutral’ and ‘incorrect’ responses. When the narrative matched the cartoon, a response was coded as correct if the observer’s rating was greater than 3 and as incorrect if it was less than 3 (recall that a rating of 5 meant that the observer agreed that the narrative clearly fits the cartoon). A rating of 3 was coded as ‘neutral’. We compared these coded ratings for successive set sizes and found that performance on set size 3 is significantly different from set size 4 for both cartoon-first (χ² (2) = 13.48, p < 0.005, ϕ = 0.27) and narrative-first (χ² (2) = 10.52, p < 0.005, ϕ = 0.23) conditions. No other pairwise comparisons between set sizes greater than 3 were significant in either condition (all χ² (2) <1.5, p > 0.4, after alpha correction of p < 0.025 for multiple comparisons). We performed a two-way repeated measures ANOVA with Set size and Condition as the independent variables and used the average coded responses for each set size as the dependent variable. As would be expected, this shows a main effect of Set size, F(4, 376) = 10.31, p < 0.001, partial η² = 0.09. There was no effect of Condition (cartoon-first vs narrative-first: F(1, 94) = 3.43, p = 0.07). The interaction was not significant, F(4, 376) = 0.88, p = 0.470 after Greenhouse-Geisser correction. We used a repeated-measures ANOVA as it is equivalent to, and theoretically more powerful than, the non-parametric Friedman’s test for comparisons across two-classifiers when the ANOVA’s assumptions are met (Demšar, 2006).
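The coding of ratings and the comparison of two set sizes can be sketched in Python as follows. This is a hedged illustration, not the analysis script used for the paper. The function names are hypothetical, the mirror-image rule for mismatched narratives (a rating below 3 counts as correct) is an assumption because the text spells out only the matched case, and the counts in the example are invented. For a 2 × 3 table the test has 2 degrees of freedom, where the chi-square tail probability has the closed form exp(−χ²/2).

```python
import math


def code_response(rating, narrative_matches):
    """Code a 1-5 rating as 'correct', 'neutral' or 'incorrect'.

    Assumption: for mismatched narratives the rule is mirrored, so a rating
    below 3 is treated as correct.
    """
    if rating == 3:
        return "neutral"
    says_match = rating > 3
    return "correct" if says_match == narrative_matches else "incorrect"


def chi_square(table):
    """Pearson chi-square statistic, degrees of freedom and N for a table of counts.

    `table` is a list of rows; every row and column total must be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = sum((obs - r * c / n) ** 2 / (r * c / n)
               for row, r in zip(table, row_totals)
               for obs, c in zip(row, col_totals))
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df, n


# Invented counts of [correct, neutral, incorrect] responses for two set sizes.
counts = [[70, 10, 16],   # hypothetical set size 3
          [45, 20, 31]]   # hypothetical set size 4
stat, df, n = chi_square(counts)
p_value = math.exp(-stat / 2)  # exact tail probability, valid only when df == 2
phi = math.sqrt(stat / n)      # effect size computed as sqrt(chi-square / N)
```

The closed-form p-value applies only to the 2-degrees-of-freedom case used here; tables of other sizes would need a general chi-square survival function.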

Since these experiments were conducted on Amazon Turk, response time is not an interesting measure. Observers finished the experiments at their leisure within the allotted one hour. Average completion time was 21 minutes.

Discussion

Why is the Heider limit approximately 3 items?

Though our intuition might suggest that two observers should be able to agree about what happens in a scene, even when it involves more than three actors, this appears not to be the case in Experiment 1. A limit of about 3 items is similar to and might be related to limits on working memory (Cowan, 2001) and/or motion tracking, though this experiment does not prove that connection. In multiple object tracking (MOT), 3–4 items is a typical limit on the number of items that can be tracked. Some MOT studies have shown that we can track anywhere from 4 up to 8 identical objects at once under the right conditions (Alvarez & Franconeri, 2007). Tracking performance depends on crowding of items in the display, within or across visual hemifields, and the speed at which objects travel (see Scimeca & Franconeri, 2015, for a review). Our displays use parameters similar to those that produce MOT capacities around 4 to 6 items.

Even if 4+ items were tracked, the apparent capacity could be depressed if the basic features of color and shape are not firmly tied to their items. Feature binding failures certainly occur when observers are asked about stimuli that are defined by conjunctions of two or more features (Treisman & Schmidt, 1982). For instance, a cartoon containing a blue square chasing a red circle could yield a narrative describing a red square and a blue circle (“binding errors” or “illusory conjunctions”). Scholl, Pylyshyn and Franconeri (1999) found that in multiple object tracking (MOT) displays, observers may successfully report the targets’ location and motion direction, while failing to report accurate shape or color. Even when all objects have unique shapes and/or colors, these unique identities are not remembered well in tracking tasks. In a multiple identity tracking (MIT) study, observers were asked to track the locations of unique moving objects. At the end of the trial, they were asked to report the identity of a probed target and it was found that limits were at least as severe as those seen in MOT tasks (Oksama & Hyona, 2008). In a related MIT task, Horowitz et al. (2007) used unique cartoon animals as stimuli. Their observers watched these animals move about the screen. At test time, all animals were occluded and the observers were asked to locate a specific animal. Observers could typically locate only 1 or 2 such animals in a display. If these limits are relevant to performance in our Heider task, it is fairly easy to see how it would be difficult to agree on a story, once the number of actors got much beyond 3 or 4. One might have expected somewhat better performance, given that observers were allowed to take notes, but apparently this did not help a great deal. Unfortunately, we did not collect these notes.

Beyond tracking limits, performance in these displays might also be limited by visual crowding (Whitney & Levi, 2011). As set size increases, crowding will increase in our displays. Crowding could exacerbate the binding errors, mentioned above (Treisman, 1996; Cave & Wolfe, 1999). Features like color, shape, and motion might be transposed between objects. Crowding and binding problems would be lessened if the items were more distinctive. Accordingly, in Experiment 2, we replicate Experiment 1 with a set of shapes that are intended to be harder to confuse with each other.

Experiment 2: Measuring Heider capacity with distinct shapes

Since distinct shapes are known to reduce the effects of crowding, because of a more efficient representation of their features and locations (Whitney & Levi, 2011), we repeated Experiment 1 with the more distinctive set of stimuli shown in Figure 4.