The visual system discounts emotional deviants when extracting average expression

Jason Haberman; David Whitney

doi:10.3758/APP.72.7.1825

Author manuscript; available in PMC: 2011 Jun 25.

Published in final edited form as: Atten Percept Psychophys. 2010 Oct;72(7):1825–1838. doi: 10.3758/APP.72.7.1825

PMCID: PMC3123539 NIHMSID: NIHMS282513 PMID: 20952781

Abstract

There has been a recent surge in the study of ensemble coding, the idea that the visual system represents a set of similar items using summary statistics (Alvarez & Oliva, 2008; Ariely, 2001; Chong & Treisman, 2003; Parkes, Lund, Angelucci, Solomon, & Morgan, 2001). We previously demonstrated that this ability extends to faces and thus requires a high level of object processing (Haberman & Whitney, 2007, 2009). Recent debate has centered on the nature of the summary representation of size (e.g., Myczek & Simons, 2008) and whether the perceived average simply reflects the sampling of a very small subset of the items in a set. In the present study, we explored this further in the context of faces, asking observers to judge the average expressions of sets of faces containing emotional outliers. Our results suggest that the visual system implicitly and unintentionally discounts the emotional outliers, thereby computing a summary representation that encompasses the vast majority of the information present. Additional computational modeling and behavioral results reveal that an intentional, cognitive sampling strategy does not accurately capture observer performance. Observers derive precise ensemble information given a 250-msec exposure, suggesting a rapid and flexible system not bound by the limits of serial attention.

At any given moment, we are actively identifying one or very few items (Luck & Vogel, 1997; Rensink, O'Regan, & Clark, 1997; Simons & Levin, 1998; Simons, Nevarez, & Boot, 2005). However, this counters our rich perceptual experience. Either our intuition is an illusion (often referred to as the grand illusion; Noë, Pessoa, & Thompson, 2000) or we perceive more information than has been revealed by previous studies. There has been a recent surge in the study of ensemble coding, the visual system's ability and natural tendency to represent sets of similar items using summary statistics (Ariely, 2001; Chong & Treisman, 2003; Haberman & Whitney, 2007, 2009; Parkes, Lund, Angelucci, Solomon, & Morgan, 2001). For example, averaging has been established for low-level features and textures, such as orientation (Dakin & Watt, 1997; Parkes et al., 2001), direction and speed of motion (Watamaniuk, 1993; Watamaniuk & Duchon, 1992), position (Alvarez & Oliva, 2008; Morgan & Glennerster, 1991), shadow orientation (Koenderink, van Doorn, & Pont, 2004), size information (Ariely, 2001; Chong & Treisman, 2003), and even facial expression/identity (de Fockert & Wolfenstein, 2009; Haberman & Whitney, 2007, 2009; Sweeny, Grabowecky, Paller, & Suzuki, 2009). Although it is generally agreed that the visual system extracts summary statistical properties about most low-level features (e.g., position, orientation, motion, etc.), it is unclear whether this is the case for information about average size (Myczek & Simons, 2008).

In a series of elegant simulations, Myczek and Simons (2008) found that many of the results purporting to show average size perception could be explained by our current understanding of working memory (i.e., sampling just a few items from the set), negating the necessity of a putative mechanism dedicated to average size representation (although this is debated; Ariely, 2008; Chong, Joo, Emmanouil, & Treisman, 2008; Simons & Myczek, 2008). Although Myczek and Simons's work was limited strictly to average size perception, it is important to consider that such a cognitive sampling strategy could extend to other visual domains (e.g., faces); perhaps summary statistical representations are automatically computed, prior to a stage of selection, only for very low-level visual features, such as position, motion, color, lightness, and orientation.

We have previously shown that observers perceive the average expression in a crowd of faces with great precision and reliability (Haberman & Whitney, 2007, 2009). Observers derive this mean representation despite lacking information about the set's constituents. However, an important question is whether the perception of average facial expression is actually the result of a dedicated, automatic, summary statistical process (e.g., linear pooling, as in the orientation domain; Parkes et al., 2001). It may be the case that the perception of average facial expression results from the cognitive sampling of one or two face(s) from the entire set, rather than from an explicit averaging mechanism.

It is important to note that, in all visual domains—whether texture perception, global motion perception, or average expression perception—subsampling some number or percentage of stimuli over space or time will eventually adequately explain performance (e.g., Morgan, Chubb, & Solomon, 2008). Thus, we distinguish between automatic, implicit, unintentional subsampling, which is ubiquitous even in “low-level” texture perception, and cognitive subsampling, in which intentionally collecting only one or two items is sufficient to match averaging performance. Our goal was to test whether ensemble perception occurs automatically and unintentionally and is not bound by the limits of a cognitive sampling strategy.

Given the complexity of emotional face processing, one might expect that perceiving average facial expression should rely on a serial sampling of very few face images (i.e., cognitive subsampling). Indeed, emotion recognition is processed with relative sluggishness; faces typically do not exhibit popout effects (Brown, Huey, & Findlay, 1997; Kuehn & Jolicœur, 1994; Nothdurft, 1993), even from among other faces that possess highly salient differences. Searching for an emotional face tends to be a deliberate, serial process (although see Hansen & Hansen, 1988). Ensemble coding of faces—the perceptual averaging of emotional expression (Haberman & Whitney, 2007, 2009)—might therefore be expected to be a serial, deliberate process as well.

Here, we show that, on the contrary, summary information about groups of faces is derived very quickly, is sensitive to overall statistics of crowds of faces, and is not driven by cognitive subsampling of 1–3 items. We measured sensitivity to summary statistics (i.e., average expression in groups of faces) using a method-of-adjustment (MOA) technique. Observers adjusted a test face to match the perceived mean expression of a preceding set of faces that contained emotional outliers. The emotional outliers introduced additional variance into the set, which made summary representation more difficult (Morgan et al., 2008). We examined whether observers would compensate for the increased variance by preferentially representing the local mean (the mean of the set, excluding the outliers) over the global mean (the mean of all of the items in the set). Through three behavioral experiments, along with Monte Carlo simulations, we show that observers more precisely represented the local mean expression of a 12-item set after only a 250-msec exposure. A cognitive subsampling strategy cannot adequately account for the speed and precision of this effect. These experiments provide evidence that, under some circumstances, the visual system coarsely codes summary statistics about the expressions of faces in crowds, all while discounting deviant information.

EXPERIMENT 1A

In the first experiment, we used an MOA technique to assess the precision with which the observers would represent the mean emotion of a set of faces. Observers adjusted the expression of a test face to match the perceived mean of the previously displayed set of faces. Unlike in previous research, however, these sets contained emotional outliers (i.e., faces that substantially deviated from the overall set mean). The introduction of emotional outliers addresses a particular issue of interest: Do observers incorporate the outliers in their assessment of the overall set mean, or do they disregard the outliers? Outlier facial expressions would tend to disrupt a serial sampling strategy (Myczek & Simons, 2008) more than a global, parallel averaging process (e.g., linear pooling, as in the orientation domain; Parkes et al., 2001) because including outliers in one's sample would heavily distort or shift the mean representation (particularly if observers sampled only a couple of items).
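The sampling argument above can be made concrete with a little counting. The sketch below computes how often a random subsample of n faces from a 12-face display with 2 outliers would contain at least one outlier, and how far a single 60-unit outlier would pull the mean of that subsample. It assumes uniform random sampling without replacement, which is an illustrative simplification and not the simulation reported in the paper; the function names are hypothetical.

```python
from math import comb

SET_SIZE = 12        # faces per display (from the text)
N_OUTLIERS = 2       # emotional deviants per display (from the text)
OUTLIER_OFFSET = 60  # emotional units between outlier and set mean (from the text)


def p_sample_hits_outlier(n):
    """Probability that n faces drawn without replacement include >= 1 outlier.

    Assumption: every face is equally likely to be sampled.
    """
    return 1 - comb(SET_SIZE - N_OUTLIERS, n) / comb(SET_SIZE, n)


def shift_from_one_outlier(n):
    """Shift (in emotional units) of an n-face sample mean when exactly one
    sampled face is an outlier and the remaining faces average to the local mean."""
    return OUTLIER_OFFSET / n


for n in (1, 2, 3):
    print(n, round(p_sample_hits_outlier(n), 3), shift_from_one_outlier(n))
```

Under these assumptions, an observer who samples only one to three faces would often include an outlier, and each such trial would move the estimate by tens of emotional units, which is why outliers are a useful probe of sampling strategies.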

The MOA technique offers an alternative measure of summary statistical precision, as described below. It is important to note that we pushed the limits of statistical face representation abilities by presenting sets of 12 faces for 250 msec.

Method

Participants

Five individuals (3 female, mean age = 23.6 years) affiliated with the University of California at Davis participated. Informed consent was obtained for all of the volunteers, who were compensated for their time and had normal or corrected-to-normal vision. All of the research was approved by the university's Institutional Review Board.

Stimuli

We created three sets of 50 faces by linearly interpolating (Morph 2.5, 1998) between 2 emotionally extreme faces of the same person, taken from the Ekman gallery (Ekman & Friesen, 1976). To create the range of morphs, multiple facial features (e.g., corners of the mouth, bridge of the nose, center of the eye, etc.) were matched between the emotionally extreme faces. The software then linearly morphed between the start- and end-points specified and outputted 50 image files. The stimulus sets ranged from happy to sad, sad to angry, and angry to happy. The amalgam of 150 faces formed the stimulus set, a virtual circle of emotions that was functionally infinite (Figure 1).

Figure 1. (A) Sample trial from Experiment 1. Observers viewed sets of 12 faces for 250 msec. Each set contained 2 emotional deviants. (B) Circle of emotions used in the experiments. A random face along this continuum was displayed during the test phase, and observers used the mouse to adjust the test face to match the emotional mean of the previously displayed set. The solid circle represents the set mean (local), and the dotted circle indicates the location of the deviants. Note that this is a sparse representation of the stimulus set.

Morphed faces were nominally separated by emotional units (e.g., Face 2 was one emotional unit sadder than Face 1). The label emotional unit is arbitrary, and we do not mean to imply that every emotional unit corresponds to a categorically distinct emotion. Although emotion representation is thought to unfold nonlinearly in emotion space (Russell, 1980), previous testing revealed that our stimulus set is psychophysically linear (i.e., all morphs were equally discriminable; Haberman, Harp, & Whitney, 2009). Face images were grayscale (98% maximum Michelson contrast) and occupied 3.04 × 4.34 degrees of visual angle. The set of 12 faces on the screen occupied 12.16 × 13.02 degrees of visual angle. Faces were presented in a 3 × 4 grid (Figure 1A). The background relative to the average face had a maximum Michelson contrast of 29%.

In previous studies, ensemble face perception was investigated using sets of faces that had a uniform distribution of emotional valences (Haberman & Whitney, 2007, 2009). Similar to those in previous experiments, the sets in this experiment initially contained three instances of four emotional expressions. Faces were ±3 and ±9 emotional units around a randomly selected set mean, and all set members were distinguishable from each other. The difference in this experiment was that 2 of the members, selected randomly, were replaced with emotional outliers; sets therefore contained 10 faces at ±3 and ±9 emotional units around a randomly selected mean and 2 identical outlier faces whose expression was 60 units away from the initial set mean, either both above it or both below it (Figure 1). This skewed the distribution of expressions, which was no longer uniform, shifted the mean expression, and increased overall set variance. Note that there was a local mean expression corresponding to the average expression of the 10 faces that were within ±9 emotional units of each other (excluding the outliers). There was also a global mean corresponding to the average expression of all 12 faces on the screen (i.e., incorporating the outlier faces that were ±60 emotional units away from the local mean).
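As a rough illustration of the set construction just described, the following sketch builds one 12-face set on the 150-step morph circle and compares the local and global means. It is a minimal sketch, not the authors' stimulus code: the function name `make_set`, the example set mean of 40, and the choice to average signed offsets arithmetically (reasonable here because every offset is well under half the circle) are all assumptions.

```python
import random

N_MORPHS = 150  # faces on the circular morph continuum (from the text)


def make_set(set_mean, outlier_sign, rng):
    """Return morph indices and offsets for 10 regular faces plus 2 identical outliers."""
    offsets = [-9, -3, 3, 9] * 3            # three instances of four expressions
    rng.shuffle(offsets)
    regular = offsets[:10]                  # 2 randomly chosen members are replaced
    outliers = [outlier_sign * 60] * 2      # identical outliers, 60 units away
    all_offsets = regular + outliers
    faces = [(set_mean + o) % N_MORPHS for o in all_offsets]  # wrap around the circle
    return faces, regular, all_offsets


def mean(values):
    return sum(values) / len(values)


rng = random.Random(0)                      # fixed seed for a repeatable example
faces, regular, all_offsets = make_set(set_mean=40, outlier_sign=+1, rng=rng)

local_offset = mean(regular)                # local mean, relative to the initial set mean
global_offset = mean(all_offsets)           # global mean, relative to the initial set mean
print(local_offset, global_offset, global_offset - local_offset)
```

Whichever two members are replaced, the global mean under these assumptions lies roughly 10 emotional units from the local mean, on the side of the outliers (2 outliers × 60 units ÷ 12 faces).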

Procedure

Observers saw the set of 12 items for 250 msec, followed by a single test face. The initial expression of the test face was random. Using the mouse, observers adjusted the test face to match the perceived average expression of the preceding set. The adjustment task allowed observers to cycle through the morph circle (Figure 1) and choose any one face from the set of 150. Observers pressed the left mouse button to indicate their choice, and the next trial began 500 msec after the buttonpress.

Each run had 200 trials, and observers performed four to six runs (800–1,200 total trials).

Most previous experiments exploring summary statistics (Ariely, 2001; Chong & Treisman, 2003; Haberman & Whitney, 2007) incorporated some form of a two-alternative forced choice (2AFC) paradigm in which chance performance was 50% correct. On an MOA task, by contrast, chance performance is defined as 1 divided by the number of stimuli (1/150 in our experiment). Moreover, rather than categorizing each response as correct or incorrect, MOA allows us to derive how far observers were from the actual set mean on every trial. In other words, we can plot observers' complete error distributions.
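Because the 150 morphs form a circle, the distance between a response and the set mean has to wrap around. The sketch below shows one way such a per-trial error could be computed and converted to radians for circular statistics; the helper names are hypothetical and the paper does not describe its own implementation.

```python
import math

N_MORPHS = 150  # faces on the morph circle (from the text)


def signed_error(response, target, n=N_MORPHS):
    """Shortest signed distance from target to response on a circle of n steps."""
    half = n // 2
    return (response - target + half) % n - half


def units_to_radians(units, n=N_MORPHS):
    """Convert emotional units to radians (one full circle = n units)."""
    return units * 2 * math.pi / n


chance = 1 / N_MORPHS                      # chance level for an exact match
print(chance)
print(signed_error(5, 145))                # crosses the wrap-around point
print(signed_error(145, 5))
print(units_to_radians(signed_error(5, 145)))
```

Collecting `signed_error` over all trials gives the error distribution that the Results section goes on to fit.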

Results and Discussion

On each trial, there was a local mean expression and a global mean expression. The local mean was the average emotion of all set members, excluding the two outliers. The global mean was the average emotion of all set members. Figure 2 shows the error distribution around the local and global means (i.e., the difference between the observer's selected test face on each trial and the local and global means, respectively). A Von Mises distribution (a circular normal distribution) was fit to the error distribution. Like a Gaussian probability density, a Von Mises density integrates to 1, but it is defined over a circle rather than over the real line. Since our stimuli formed an emotional circle, the Von Mises was the appropriate distribution to use. It was formalized as

$$f(x) = \frac{\exp[k \cos(x - a)]}{2\pi I_0(k)}$$

where a was the location (i.e., where along the circle the points cluster) and k was the concentration (i.e., inversely related to SD, so the larger the number, the more concentrated the distribution). We assessed the precision of mean representation by measuring the SD (converted from k) of the Von Mises distribution; the smaller the SD of the curve, the more precise the mean representation. Figure 2 shows that all of the observers had a smaller SD for local mean than for global mean, suggesting that the observers were more sensitive to the local mean of the set. Monte Carlo resampling revealed that the error distributions for the 5 individual observers were narrower for the local mean than for the global mean, and statistical tests reached significance for 4 observers (only P.L.'s was not significant). A paired t test across observers also revealed a significantly smaller SD (i.e., better precision) for local mean than for global mean [t(4) = 4.26, p = .013]. The greater sensitivity to the local mean suggests that observers either filtered or suppressed the outlier information—that is, that which was impossible to integrate with the rest of the set.
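The Von Mises density and a k-to-SD conversion can be sketched in a few lines. The paper does not say which conversion was used; the code below assumes the standard circular standard deviation, sqrt(-2 ln(I1(k)/I0(k))), and evaluates the modified Bessel functions with a truncated power series that is adequate only for moderate k. All function names are hypothetical.

```python
import math

N_MORPHS = 150  # emotional units per full circle (from the text)


def bessel_i(order, k, terms=60):
    """Modified Bessel function of the first kind (power series; moderate k only)."""
    return sum((k / 2) ** (2 * m + order)
               / (math.factorial(m) * math.factorial(m + order))
               for m in range(terms))


def von_mises_pdf(x, a, k):
    """Von Mises density at angle x (radians), location a, concentration k."""
    return math.exp(k * math.cos(x - a)) / (2 * math.pi * bessel_i(0, k))


def circular_sd(k):
    """Circular SD in radians for concentration k (assumed conversion)."""
    return math.sqrt(-2 * math.log(bessel_i(1, k) / bessel_i(0, k)))


def sd_in_emotional_units(k):
    """Circular SD rescaled from radians to emotional units."""
    return circular_sd(k) * N_MORPHS / (2 * math.pi)


def integrate_over_circle(k, a=0.0, steps=2000):
    """Riemann sum of the density over one full circle."""
    dx = 2 * math.pi / steps
    return sum(von_mises_pdf(-math.pi + i * dx, a, k) for i in range(steps)) * dx


for k in (2.0, 8.0):
    print(k, integrate_over_circle(k), sd_in_emotional_units(k))
```

A larger k yields a smaller SD, which is the sense in which a narrower fitted curve indicates a more precise mean representation.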

Figure 2. Experiment 1 results: method-of-adjustment error distributions, showing the proportion of responses at each possible separation between the observer-selected test face and the actual set mean. For each observer, we plotted distance from the global mean (including outliers) and distance from the local mean (excluding outliers) and fit a Von Mises distribution to the data. Note the narrower curve in the local mean condition relative to the global mean (also reflected in smaller SDs), consistent for each observer. The axes have been converted from radians to emotional units for readability.

The analysis described above is collapsed across the sign of the emotional outliers. That is, it treats outliers occurring 60 units above the mean the same as outliers occurring 60 units below the mean. The sign of the outlier may be informative, however, so we reanalyzed the data, fitting curves to trials in which observers saw only high or only low outliers (see Figure 3A). The results provide additional evidence that observers filter, at least to some extent, outlier information in the set. Figure 3A shows that, when the outliers are negative, adjustment to the global mean is offset in the positive direction. The peak of the adjustment curve corresponds closely to where the local mean occurs (i.e., the mean when outliers are excluded), approximately 10 emotional units above the global mean. The converse is also true (Figure 3A). This trend is consistent among all observers, and paired t tests confirm that the absolute offset (the a parameter from the Von Mises equation; how far away from 0 the curve peaks) is larger for outliers in the global mean adjustment curve than for those in the local mean adjustment curve [t(4) = 4.20, p = .014].

Figure 3. Analysis when the signs of the outliers are taken into account. (A) Regardless of the sign of the outliers, observers tended to ignore them when adjusting to the mean of the set, indicated by the offset of the curve from 0. (B) When flipping and averaging the data from panel A, there is no significant difference in the proportion of responses that occurred at the outlier location as compared with those that occurred at the antioutlier location (i.e., the areas surrounding 60 units from the mean).

One might expect that, if the analysis were broken down by the sign of the outlier, there would be a disproportionate number of observer responses at the location of the outlier, relative to the region where there was no outlier. However, a paired t test comparing the proportion of responses that occurred in the outlier regions with the proportion in the antioutlier regions revealed no difference [t(10) = 0.76, p = .31]¹ (Figure 3B). This indicates that, after viewing a set with an outlier, observers were not more likely to pick a face near the outlier than they were to pick a face far from the outlier. This argues against the possibility that observers subsampled only one item from the entire set, since, had this been the case, there would have been a greater number of responses in the outlier region than in the antioutlier region. Subsampling is further explored in Experiment 2.

This experiment demonstrated that observers have greater sensitivity to the local mean of the set than to the global mean, and therefore seem to discount the emotional outliers. A summary statistic, such as average texture, orientation, motion, or facial expression, is useful only insofar as it captures ensemble information in the stimulus. Outliers disrupt ensemble information by increasing set variance; therefore, discounting the deviant information might be advantageous. Outlier discounting is a computationally simple way to mitigate variance and increase the reliability of a summary statistic, such as average expression.

EXPERIMENT 1B

Is it possible that, instead of implicitly filtering deviant information, observers were aware of the presence of the outliers on every trial and consciously ignored them in their estimate of the average expression? Such a strategy would indicate that the emotional outliers can be detected and may even pop out relative to the rest of the faces, which could undermine the computational efficiency with which ensemble coding is thought to operate (Alvarez & Oliva, 2009). Although some work suggests that emotionally deviant information is not available preattentively (Nothdurft, 1993), there is at least some (controversial) evidence to suggest that it is (Hansen & Hansen, 1988). Even if observers did not detect the emotional deviance in a parallel fashion, other low-level features may have distinguished the outlier faces. In a separate control experiment, we explicitly tested whether the outlier faces popped out and how well observers represented those faces. If the representation of the outliers was poor and they were no more detectable than other faces in the set, a strategy in which observers consciously or deliberately discount the outliers would be improbable.