Abstract
It is easy to visually distinguish a ceramic knife from one made of steel, a leather jacket from one made of denim, and a plush toy from one made of plastic. Most studies of material appearance have focused on the estimation of specific material properties such as albedo or surface gloss, and as a consequence, almost nothing is known about how we recognize material categories like leather or plastic. We have studied judgments of high-level material categories with a diverse set of real-world photographs, and we have shown (Sharan, 2009) that observers can categorize materials reliably and quickly. Performance on our tasks cannot be explained by simple differences in color, surface shape, or texture. Nor can the results be explained by observers merely performing shape-based object recognition. Rather, we argue that fast and accurate material categorization is a distinct, basic ability of the visual system.
Keywords: material perception, material categories, material properties, real-world stimuli
Introduction
Our world consists of surfaces and objects, and often just by looking we can tell what they are made of. Consider Figure 1. The objects are easy to identify: a stuffed toy, a cushion, and curtains. It is also clear that these objects are composed of fabric. We may not be able to name all the fabric types in Figure 1, but we know that the surfaces pictured were made from fabric and not plastic or glass. This ability to identify materials is critical for understanding and interacting with our world (Adelson, 2001). By recognizing what a surface is made of, we can predict how hard, rough, heavy, hot, or slippery it will be and act accordingly. We avoid edges of knives and broken glass but not hemlines of garments. We exert more effort to lift a ceramic plate than a plastic plate. We act more quickly when spills occur on absorbent surfaces like paper or fabric. Material categorization, i.e., being able to tell what things are made of, is a significant aspect of human vision, and as far as we are aware, we were the first to systematically study its basic properties (Sharan, 2009).
Most studies of material appearance have focused on the human ability to estimate specific reflectance properties such as albedo, color, and surface gloss. In order to measure the precise relationship between stimulus properties and perceived reflectance, researchers often use synthetic or highly restricted stimuli so that stimulus appearance can be varied easily. By using such controlled stimuli, a number of facts about reflectance perception have been established. It is known that the perceived reflectance of a surface depends not only on its physical reflectance properties (Gilchrist & Jacobsen, 1984; Pellacini, Ferwerda, & Greenberg, 2000; Xiao & Brainard, 2008), but also on its surface geometry (Bloj, Kersten, & Hurlbert, 1999; Boyaci, Maloney, & Hersh, 2003; VanGorp, Laurijssen, & Dutre, 2007; Ho, Landy, & Maloney, 2008), the illumination conditions (Fleming, Dror, & Adelson, 2003; Maloney & Yang, 2003; Gerhard & Maloney, 2010; Olkkonen & Brainard, 2010; Brainard & Maloney, 2011; Motoyoshi & Matoba, 2011), the surrounding surfaces (Gilchrist et al., 1999; Doerschner, Maloney, & Boyaci, 2010; Radonjić, Todorović, & Gilchrist, 2010), the presence of specular highlights (Beck & Prazdny, 1981; Todd, Norman, & Mingolla, 2004; Berzhanskaya, Swaminathan, Beck, & Mingolla, 2005; Kim, Marlow, & Anderson, 2011; Marlow, Kim, & Anderson, 2011) and specular lowlights (Kim, Marlow, & Anderson, 2012), the presence of binocular disparity and surface motion (Hartung & Kersten, 2002; Sakano & Ando, 2010; Wendt, Faul, Ekroll, & Mausfeld, 2010; Doerschner et al., 2011; Kerrigan & Adams, 2013), image-based statistics (Nishida & Shinya, 1998; Motoyoshi, Nishida, Sharan, & Adelson, 2007; Sharan, Li, Motoyoshi, Nishida, & Adelson, 2008), and object identity (Olkkonen, Hansen, & Gegenfurtner, 2008). Recent work has extended this understanding of reflectance perception to include translucent materials (Fleming & Buelthoff, 2005; Motoyoshi, 2010; Fleming, Jakel, & Maloney, 2011; Nagai et al., 2013; Xiao et al., 2014) and real-world surfaces (Obein, Knoblauch, & Viénot, 2004; Robilotto & Zaidi, 2004, 2006; Ged, Obein, Silvestri, Rohellec, & Viénot, 2010; Giesel & Gegenfurtner, 2010; Vurro, Ling, & Hurlbert, 2013).
Despite the tremendous progress that has been made on the question of how we estimate reflectance properties, little is known about how we recognize material categories. How do we know that the surfaces in Figure 1 are made of fabric? What cues do we use to distinguish fabric surfaces from nonfabric surfaces, or for that matter, any given material category from the rest? When we first studied these questions (Sharan, 2009), no one had examined the broad range of visual imagery encountered in real-world materials, of the sort shown in Figure 1. Nothing was known about the accuracy of material categorization, or its speed. Unlike the case of objects and scenes (Thorpe, Fize, & Marlot, 1996; Everingham et al., 2005; Grill-Spector & Kanwisher, 2005; Fei-Fei, Fergus, & Perona, 2006; Russell, Torralba, Murphy, & Freeman, 2008; Deng et al., 2009; Greene & Oliva, 2009), there were few suitable datasets (Dana, van Ginneken, Koenderink, & Nayar, 1999; Matusik, Pfister, Brand, & McMillan, 2003) to study material categorization.
In the work described here, we started by collecting a diverse set of real-world photographs in nine common material categories, some of which are shown in Figures 2 and 3. We presented these photographs to human observers in a variety of presentation conditions—unlimited exposures, brief exposures, image-based degradations, etc.—to establish the accuracy and speed of material categorization. We found that observers could identify high-level material categories reliably and quickly. In addition, we examined the role of surface properties like color and texture, and of object properties like surface shape and object identity. Simple strategies based on color, texture, or surface shape could not account for our results. Nor could our results be explained by observers merely performing shape-based object recognition. We will describe these findings in greater detail in the sections that follow.
Since we first presented these findings (Sharan, Rosenholtz, & Adelson, 2009), others have validated our results and gone on to demonstrate that while material categorization is fast and accurate, it is less accurate than basic-level object categorization (Wiebel, Valsecchi, & Gegenfurtner, 2013) and that visual search for material categories is inefficient (Wolfe & Myers, 2010). It has been shown that correlations exist between material categories and perceived material qualities such as glossiness, transparency, roughness, hardness, coldness, etc. (Hiramatsu, Goda, & Komatsu, 2011; Fleming, Wiebel, & Gegenfurtner, 2013). Newer databases have been developed that capture the appearance of real-world materials beyond high-level category labels (fabric synset of Deng et al., 2009; Bell, Upchurch, Snavely, & Bala, 2013). In our own subsequent work, we built a computer vision system to recognize material categories in real-world images and showed that even the best-performing computer vision systems lag human performance by a large margin (Liu, Sharan, Rosenholtz, & Adelson, 2010; Hu, Bo, & Ren, 2011; Sharan, Liu, Rosenholtz, & Adelson, 2013). We will return to the implications of these recent developments in the Discussion section.
Are material categorization tasks easy or hard?
The first question we set out to answer was: Are material categorization tasks easy or hard? Should we expect observers to be good at them or be surprised that they can do the tasks at all? Consider the images in Figure 2. Three of these images contain surfaces made from plastic while the rest contain surfaces made from nonplastic materials. Distinguishing the images that contain plastic surfaces from those that do not is not straightforward. The surfaces in Figure 2 differ not only in their reflectance properties, but also in their three-dimensional (3-D) shapes, physical scale, object associations, and even the ways in which they are illuminated and imaged. The plastic surfaces of the bag handles, the sippy bottle, and the toy car look quite different from each other. They also bear many similarities in appearance to the nonplastic surfaces. The glasses are multi-colored like the bag handles and translucent like the sippy bottle. The plastic and nonplastic surfaces belong to similar object categories—bags, containers for liquids, and toy vehicles. Given the variations in appearance within material categories and the similarities across categories, we should expect material categorization to be challenging.
We evaluated performance at material categorization in two ways. In the Material RT experiment, we measured the reaction time (RT) required to make a categorization response. In the Material RapidCat experiment, we measured categorization accuracy under conditions of rapid exposure, in an effort to compare the time course of material category judgments to those for objects and scenes. In both experiments, observers were presented photographs from our database of material categories, one at a time, and were asked to report if surfaces belonging to a target material category were present. Performance was averaged over several choices of target material categories and all photographs within a category.
Stimuli
Color photographs of surfaces belonging to nine material categories were acquired from the photo sharing website, Flickr.com, under various forms of Creative Commons Licensing. The nine categories are shown in Figure 2. Appendix A describes an additional experiment that validates this specific choice of material categories. For each category, we collected 100 images in total, 50 close-ups of surfaces and 50 regular views of objects. Each image contained surfaces belonging to a single material category in the foreground and was selected manually from approximately 50 candidates to ensure a range of illumination conditions, viewpoints, surface geometries, backgrounds, object associations, and material subcategories within each category. All images were cropped to 512 × 384 pixel resolution and normalized to equate mean luminance.
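As an illustration, this style of preprocessing could be scripted in MATLAB along the following lines; the folder layout, the target mean luminance, and the use of resizing in place of manual cropping are assumptions for the sketch, not reported values.

```matlab
% Minimal sketch of the preprocessing described above: bring each image to
% 512 x 384 and equate mean luminance. The target mean (0.5 on a 0-1 scale)
% and folder names are assumptions; 'normalized' must already exist.
files = dir(fullfile('images', '*.jpg'));        % hypothetical stimulus folder
targetMean = 0.5;                                % assumed target mean luminance
for k = 1:numel(files)
    im  = im2double(imread(fullfile('images', files(k).name)));
    im  = imresize(im, [384 512]);               % 512 x 384 (width x height); stands in for cropping
    lum = 0.2989*im(:,:,1) + 0.5870*im(:,:,2) + 0.1140*im(:,:,3);
    im  = im * (targetMean / mean(lum(:)));      % scale to equate mean luminance
    imwrite(min(max(im, 0), 1), fullfile('normalized', files(k).name));
end
```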
In studies of visual recognition, the accuracy and speed of category judgments are often intimately connected to the choice of stimuli (Johnson & Olshausen, 2003). To ensure that our observers were judging the material category, and not simply low-level image characteristics such as color (brown surfaces are usually wooden) or power spectrum (energy in higher spatial frequencies denotes fabric), we aimed to have our stimuli capture the natural range of material appearances. Consider the fabric selection in Figure 3. The images of the satin ribbon, the crocheted nylon cap, the woven textiles, and the flannel bedding look very different from each other. The four fabric surfaces have different material properties, are of different colors and sizes, and have distinct uses as objects. And yet, it is clear that these surfaces belong to the fabric category and not any of the other eight categories in Figure 3.
Observers
Thirteen observers with normal or corrected-to-normal vision participated, eight in the Material RT experiment and five in the Material RapidCat experiment. All of them gave informed consent and were compensated monetarily.
Procedure
General methods
For all experiments unless noted otherwise, stimuli were displayed centrally on an LCD monitor (1024 × 768 pixels, 75 Hz) against a midgray background using the Psychophysics Toolbox for MATLAB (Brainard, 1997).
Material RT experiment
Observers were asked to make a material discrimination (e.g., plastic vs. nonplastic), as quickly and as accurately as possible. As illustrated in Figure 4a, each trial started with the fixation symbol (+), and after observers initiated the trial with a key press, a photograph from our database appeared. Observers indicated the presence or absence of a target material category with key presses. Reaction times greater than 1 s, which accounted for 1% of the trials, were discarded. Auditory feedback, in the form of beeps, signaled an incorrect or slow response. To account for the minimum time taken to make a decision and execute the motor response, observers were asked to complete two easy categorization tasks that served as baselines: discriminating a red versus blue disc and a line tilted at 45° versus −45°. In all three tasks, the target (e.g., red disc or an image containing paper) was present in 50% of the trials.
Each observer completed 400 trials: 50 trials of red versus blue, 50 trials of 45° versus −45°, and 300 trials of material categorization divided equally between three target categories. Trials were blocked by task. Material categorization trials were further blocked by target category. For each target material category, the distracters were selected uniformly from the other material categories in our database. The order of tasks and target material categories were counterbalanced between observers. The stimuli for the material discrimination task subtended 15° × 12°. The stimuli for the baseline tasks were smaller, of size 4° × 4°. Before starting the experiment, observers were shown examples of the judgments they were expected to make and were given a brief practice session.
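A minimal Psychophysics Toolbox sketch of one trial in the style of Figure 4a (fixation, key press to initiate, stimulus, speeded yes/no response) is given below; the window setup, key mapping, and stimulus file are hypothetical, and only the 1-s response cutoff comes from the procedure above.

```matlab
% One trial: fixation, observer-initiated onset, stimulus, timed response.
win = Screen('OpenWindow', max(Screen('Screens')), 128);   % midgray background
KbName('UnifyKeyNames');
presentKey = KbName('j');  absentKey = KbName('f');        % assumed key mapping
DrawFormattedText(win, '+', 'center', 'center', 0);        % fixation symbol
Screen('Flip', win);
KbStrokeWait;                                              % observer initiates the trial
tex = Screen('MakeTexture', win, imread('stimulus.jpg'));  % hypothetical stimulus image
Screen('DrawTexture', win, tex);
tOnset = Screen('Flip', win);
rt = NaN;  respondedPresent = false;
while GetSecs - tOnset < 1.0                               % RTs > 1 s were discarded
    [down, t, keys] = KbCheck;
    if down && (keys(presentKey) || keys(absentKey))
        rt = t - tOnset;
        respondedPresent = logical(keys(presentKey));
        break;
    end
end
Screen('Close', tex);  sca;
```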
Material RapidCat experiment
Photographs from our database were presented for 40, 80, or 120 ms and were immediately followed by perceptual masks, as illustrated in Figure 5a. On each trial, the task was to report whether the photograph, of size 15° × 12°, belonged to a target material category (e.g., fabric). Observers pressed keys to indicate target presence. The target category was present in half the trials; distracters were drawn randomly from the other eight material categories. Each observer completed 900 trials, 100 for each of the nine target material categories. Trials were blocked by stimulus presentation time and target material category. Images in each material category were divided as evenly as possible amongst targets and distracters as well as across the three presentation times. The presentation order, the split of the database into target and distracter images, and the presentation times associated with each target category were counterbalanced between observers. Like Greene and Oliva (2009), we created our masks using the Portilla-Simoncelli texture synthesis method (Portilla & Simoncelli, 2000). The Portilla-Simoncelli method matches the statistics of the mask images to the statistics of the stimulus images at multiple scales and orientations, which allows for more effective masking than the commonly used pink noise masks.
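For readers who wish to generate similar masks, Portilla and Simoncelli's publicly available textureSynth toolbox (which depends on matlabPyrTools) can be used along these lines; the pyramid settings and iteration count here are assumptions, not the values we used.

```matlab
% Sketch of mask generation with the textureSynth toolbox. The pyramid
% depth (4 scales, 4 orientations, 7 x 7 neighborhoods) and 25 iterations
% are assumed settings.
im     = double(rgb2gray(imread('stimulus.jpg')));  % hypothetical stimulus image
params = textureAnalysis(im, 4, 4, 7);              % measure multiscale statistics
mask   = textureSynthesis(params, size(im), 25);    % synthesize a statistics-matched mask
imshow(mask, []);
```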
Results
In all tasks, chance performance corresponded to 50% accuracy.
Material RT experiment
Figure 4b plots error rates versus median RTs for all tasks. From Figure 4b, it is clear that observers were able to complete our tasks. The accuracy averaged across observers was 90.5% on the material categorization task, 95.2% on the color discrimination task, and 93% on the orientation discrimination task. When we considered only the correct trials, median RT taken over all material categories was 532 ms whereas in the baseline conditions, the median RT was 434 ms for the color task and 426 ms for the orientation task. Figure 4c shows the distribution of RTs averaged across observers for correct trials. There was a significant difference in RT between the conditions shown in Figure 4c, χ2(3) = 21.75, p < 0.001. Post hoc analysis with Wilcoxon signed-rank tests and a Bonferroni correction for alpha revealed significant differences between the material categorization task and the baseline tasks (color: Z = −2.52, p = 0.012; orientation: Z = −2.52, p = 0.012). The condition indicated in blue in Figure 4b and c will be described later.
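One plausible reconstruction of this nonparametric analysis in MATLAB is sketched below; rt stands for a hypothetical observers × conditions matrix of median correct-trial RTs, and the Friedman test is our inference from the chi-square statistic reported above.

```matlab
% Omnibus test across conditions, then pairwise Wilcoxon signed-rank tests
% with a Bonferroni-corrected alpha. 'rt' is hypothetical data.
[pOmni, tbl, stats] = friedman(rt, 1, 'off');   % chi-square statistic appears in tbl
pColor  = signrank(rt(:, 1), rt(:, 2));         % material vs. color baseline
pOrient = signrank(rt(:, 1), rt(:, 3));         % material vs. orientation baseline
alphaCorrected = 0.05 / 3;                      % Bonferroni correction for three comparisons
```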
These results demonstrate that observers can make material categorization judgments accurately, even when the images are drawn from a diverse database. However, they take longer to make material category judgments than simpler judgments of color or orientation discrimination. This result is not surprising: unlike the material discrimination task, the baseline tasks involve a single, simple feature judgment. Compared to the baseline tasks, the additional time taken to process and respond in the material task is approximately 100 ms. Given our initial expectations about the difficulty of material categorization, we find this difference to be small.
Material RapidCat experiment
Our observers achieved 80.2% accuracy even in 40-ms exposures, as shown in Figure 5b. A repeated measures ANOVA determined that accuracy was significantly affected by the image exposure time, F(2, 8) = 24.92, p < 0.001. Post hoc comparisons using the Tukey HSD test revealed a significant increase in accuracy from 40 ms to 80 ms (88.8%) as well as from 40 ms to 120 ms (92.4%). The accuracy at 120 ms is similar to the accuracy recorded for the material tasks in the Material RT experiment, where image exposures were considerably longer and were determined by observers, t(11) = −0.86, p = 0.411.
These results establish that material categorization can be accomplished with brief exposures. The performance at 40 ms (80.2%) is similar to that reported for 2-AFC tasks of animal detection, 85.6% at 44 ms (Bacon-Mace, Mace, Fabre-Thorpe, & Thorpe, 2005), and basic-level scene categorization, 75% at 30 ms (Greene & Oliva, 2009), which suggests that the time course of material category judgments is comparable to those for objects and scenes.
It is useful to speculate about observer strategies in the Material RT and Material RapidCat tasks. In deciding whether an image contained a target material, observers may have employed heuristics based on color, reflectance, or shape. For example, wooden surfaces tend to be brown, metal surfaces tend to be shiny, and plastic surfaces tend to be smooth. Alternatively, observers may have employed heuristics based on object knowledge. For example, bottles tend to be made of plastic, handbags of leather, and clothes of fabric. The diversity of our stimuli makes it unlikely that observers can get away with such strategies. However, the possibility that observers were merely recognizing a diagnostic surface property, such as color, or inferring the material category from object knowledge cannot be ruled out without further experiments. To understand the role of surface properties and object knowledge in material categorization, we conducted additional experiments that are described next.
What is the role of surface properties in material categorization?
The material that a surface is made of determines its reflectance properties and to some extent, its geometric structure (e.g., wax surfaces tend to be translucent and to have rounded edges). Numerous studies have examined how the visual system estimates reflectance and geometric shape properties of surfaces in the world. These studies have shown that human observers can reliably estimate certain aspects of surface reflectance such as color, albedo, and gloss (for reviews, see Gilchrist, 2006; Adelson, 2008; Brainard, 2009; Shevell & Kingdom, 2010; Anderson, 2011; Maloney, Gerhard, Boyaci, & Doerschner, 2011; Fleming, 2012). Observers can also estimate surface shape up to certain ambiguities (for a review, see Todd, 2004). Based on these findings, one might ask: Are judgments of material categories merely judgments of surface properties such as reflectance and shape?
We addressed this question by measuring the contributions of four surface properties—color, gloss, texture, and shape—to material category judgments. We modified the photographs in our database, as described in Stimuli, to either emphasize or deemphasize these surface properties. We then presented the modified images to observers and compared the material categorization performance on the modified images to that on the original photographs. The differences in performance reveal the role of each surface property in high-level material categorization. Similar analyses have been used to understand the cues underlying rapid animal detection in natural scenes (Nandakumar & Malik, 2009; Velisavljevic & Elder, 2009).
In the Material Degradation I experiment, we deemphasized information about color, texture, and gloss, as shown in Figure 6 (first row). If observers are unable to identify material categories in the presence of these significant degradations, it would imply that material categorization is based on simple feature judgments of color, texture, and gloss. In the Material Degradation II experiment, we emphasized information about shape, texture, and color by removing information about all other aspects of surface appearance, as shown in Figure 6 (second and third rows). If observers are able to identify material categories in such images, it would imply that cues based on shape, texture, and color, in isolation, are sufficient for material category recognition.
While designing the manipulations presented in Figure 6, we had to balance many constraints such as the importance of a particular surface property for material categorization and the ease with which it could be manipulated in a photograph. For a diverse database such as ours and in the absence of any knowledge of imaging conditions, it is difficult to isolate the contribution of a given surface property to surface appearance. Indeed, separating the contributions of surface reflectance, 3-D shape, and illumination in a single image, even in controlled conditions, is an active research topic in computer vision (Tominaga & Tanaka, 2000; Boivin & Gagalowicz, 2001; Dror, Adelson, & Willsky, 2001; Tappen, Freeman, & Adelson, 2005; Grosse, Johnson, Adelson, & Freeman, 2009; Romeiro & Zickler, 2010; Barron & Malik, 2011, 2013).
Stimuli
Photographs from our database were used to generate stimuli for all conditions of the Material Degradation experiments.
Material Degradation I experiment
There were three viewing conditions: Grayscale, Grayscale Blurred, and Grayscale Negative. Color photographs were converted to grayscale for the Grayscale condition, which tested the necessity of color information. The grayscale photographs were then either low-pass filtered to remove high spatial frequencies (Grayscale Blurred) or contrast reversed (Grayscale Negative). We hoped that removing high spatial frequencies would impair texture recognition and that reversing the contrast would impair gloss estimation (Beck & Prazdny, 1981; Todd et al., 2004; Berzhanskaya et al., 2005; Motoyoshi et al., 2007; Sharan et al., 2008; Gerhard & Maloney, 2010; Kim et al., 2011, 2012). The pixel resolution was set to 512 × 384 in all three conditions.
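In MATLAB, the three manipulations amount to the following sketch; the Gaussian blur sigma is an assumption, as we specify only that high spatial frequencies were removed.

```matlab
% Minimal sketch of the three Degradation I manipulations.
rgb      = imread('stimulus.jpg');       % hypothetical input photograph
gray     = rgb2gray(rgb);                % Grayscale condition
blurred  = imgaussfilt(gray, 4);         % Grayscale Blurred (assumed sigma of 4 pixels)
negative = imcomplement(gray);           % Grayscale Negative (contrast reversal)
```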
Material Degradation II experiment
There were six viewing conditions, two for each type of surface property, one richer than the other: Shape I and II, Texture I and II, and Color I and II. For each surface property, we aimed to preserve information about that property while suppressing information about the other two (with exceptions for Texture I and II because texture, by definition, includes color). The pixel resolution varied from 384 × 384 (Texture I, Texture II, Color II) to 512 × 384 (Shape I, Shape II) to 750 × 150 (Color I). We will now describe each condition in greater detail.
Shape
We preserved two aspects of surface shape in our stimuli: silhouette and shading. In the Shape I condition, we presented the binary silhouette associated with each image. For certain images, these silhouettes were quite informative because they conveyed the object identity (e.g., a wine glass), and thereby, the material identity (e.g., glass). For other images, such as those shown in Figure 6 (Shape I), they were much less informative. To remedy this, we presented graded, grayscale shading information in addition to silhouettes in the Shape II condition. The richer Shape II stimuli provided more cues to the surface shape without revealing other aspects of appearance such as color and texture.
Stimuli for the Shape I condition were created manually using the Quick Selection tool in Adobe Photoshop CS3 (Adobe Systems, Inc., San Jose, CA). For each image, regions containing the material of interest were selected and set to white. The remaining regions, which contained the background, were set to black. Silhouettes were generated in this manner for all images in our database.
Stimuli for the Shape II condition were derived as follows: (a) each color image was converted to grayscale and filtered using a median filter, of support 20 × 20 pixels, to remove high frequency texture information; (b) the Stamp Filter in Adobe Photoshop CS3 was used to mark strong edges in each original image; and (c) the results of (a) and (b) were combined with the silhouettes from Shape I to create images such as the ones shown in Figure 6 (Shape II). Note that steps (a) and (b) destroy the original appearance of surface edges, including occluding contours; in doing so, our intention was to suppress color and texture cues in our Shape stimuli.
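A rough MATLAB approximation of steps (a) through (c) is sketched below; the Canny edge detector and its thresholds stand in for Photoshop's Stamp Filter and are substitutions, not the exact operations we used.

```matlab
% Hedged approximation of the Shape II pipeline: (a) grayscale plus a
% 20 x 20 median filter, (b) strong edges standing in for the Stamp Filter,
% (c) masking with the Shape I silhouette. File names are hypothetical.
rgb    = imread('stimulus.jpg');               % hypothetical input photograph
sil    = imread('silhouette.png') > 0;         % Shape I silhouette (binary, same size)
gray   = rgb2gray(rgb);
smooth = medfilt2(gray, [20 20]);              % (a) remove high-frequency texture
edges  = edge(gray, 'Canny', [0.2 0.5]);       % (b) assumed stand-in for the Stamp Filter
shapeII        = smooth;
shapeII(edges) = 0;                            % overlay strong edges in black
shapeII(~sil)  = 0;                            % (c) black out the background
```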
Texture
We use the term texture to describe two-dimensional (2-D) or wallpaper textures as well as 3-D textures (Dana et al., 1999; Pont & Koenderink, 2005). Both 2-D and 3-D textures are a significant aspect of surface appearance. Unfortunately, for most images in our database, it is difficult to manipulate texture appearance, either 2-D or 3-D, while discounting the effects of illumination, viewpoint, and surface shape. Our solution has been to present texture information locally, in the form of image patches derived from the original photographs. We hoped that presenting these local patches in a globally scrambled manner, as shown in Figure 6 (Texture I and II), would interfere with other cues based on surface gloss and surface shape.
Stimuli for the Texture conditions were created using a patch-based texture synthesis method from the field of computer graphics (Efros & Freeman, 2001). For each image, regions containing the material, identified using the silhouettes of Shape I, were divided into either 16 × 16 (Texture I) or 32 × 32 (Texture II) pixel patches. These patches were then scrambled and recombined in a way that minimized differences in pixel intensity values across seams. The resulting quilted image contained nearly the same information, at the pixel level, as the original surface. However, the scrambling step removed much of the large-scale geometric structure within the material. In Texture I, the scrambling was completely randomized. In Texture II, the scrambling was partially randomized; the first patch was selected at random, and then each subsequent patch was selected so as to best fit its existing neighbors. These differences between Texture I and Texture II stimuli are demonstrated in Figure 6. Texture II stimuli look less fragmented than Texture I, and therefore, they allow observers to glean texture information over larger regions.
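A minimal sketch of the fully randomized Texture I scrambling is given below; Efros and Freeman's seam-minimizing patch placement, used for Texture II, is omitted for brevity, and the file names are hypothetical.

```matlab
% Texture I: 16 x 16 patches sampled from within the material region and
% re-tiled at random on a 384 x 384 canvas.
rgb = im2double(imread('stimulus.jpg'));       % hypothetical input photograph
sil = imread('silhouette.png') > 0;            % material region from Shape I
p   = 16;                                      % Texture I patch size
out = zeros(384, 384, 3);                      % output canvas
[rows, cols] = find(sil);                      % pixel locations inside the region
for r = 1:p:384 - p + 1
    for c = 1:p:384 - p + 1
        k = randi(numel(rows));                % random location inside the region
        y = min(rows(k), size(rgb, 1) - p + 1);
        x = min(cols(k), size(rgb, 2) - p + 1);
        out(r:r+p-1, c:c+p-1, :) = rgb(y:y+p-1, x:x+p-1, :);
    end
end
imshow(out);
```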
It is worth noting that patch-based synthesis methods work best for 2-D, wallpaper-like textures (e.g., head-on view of pattern paper). For more complex textures (e.g., garments close-up in Figure 6), synthesis results do not preserve the full textural appearance. Furthermore, patch-based synthesis methods hallucinate extended edges that do not exist in the original images; this effect is more obvious in the fully randomized Texture I stimuli than in the partially randomized Texture II stimuli.
Color
We wanted to convey two aspects of surface color appearance: first, the amount of each color (e.g., how much pink), and second, the spatial relationships for each color (e.g., pink occurs next to yellow). We accomplished this by taking inspiration from work on summarization in information visualization (Hearst, 1995; Color I) and by reusing our existing stimuli (Color II). In the Color I condition, we created abstract visualizations that depicted the dominant colors in an image. In the Color II condition, we added noise to our Texture II stimuli to suppress texture information. The resulting stimuli allowed observers to learn not only the dominant colors in the original image, as in the Color I condition, but also the spatial distribution of colors.
Stimuli for the Color I condition were created as follows. For each image, regions containing the material were identified using the silhouettes from Shape I. Next, the R, G, and B values for pixels in these regions were clustered using the k-means algorithm in MATLAB to identify between three to five dominant colors. Finally, a visualization was created where each dominant color was represented by a circle of that color, with radius proportional to the size of its cluster in RGB space.
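In MATLAB, the Color I visualization can be sketched as follows; k = 4 is an assumed value within the three-to-five range described above, and the file names are hypothetical (kmeans requires the Statistics and Machine Learning Toolbox).

```matlab
% k-means on RGB values inside the material region, then one circle per
% dominant color with size proportional to its cluster size.
rgb = im2double(imread('stimulus.jpg'));       % hypothetical input photograph
sil = imread('silhouette.png') > 0;            % Shape I silhouette
pix = reshape(rgb, [], 3);
pix = pix(sil(:), :);                          % keep material pixels only
k   = 4;                                       % assumed number of dominant colors
[idx, centers] = kmeans(pix, k);               % cluster in RGB space
counts = accumarray(idx, 1);
figure; hold on;
for i = 1:k
    r = counts(i) / sum(counts);               % circle size proportional to cluster size
    rectangle('Position', [2.5*(i-1), 0, r, r], 'Curvature', [1 1], ...
              'FaceColor', centers(i, :), 'EdgeColor', 'none');
end
axis equal off;
```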
Stimuli for the Color II condition were created in a manner identical to that for Texture II, except that pink noise was added as a final step.
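A sketch of the pink (1/f) noise generation is given below; the noise contrast is an assumption, as we do not report the amplitude used.

```matlab
% Generate 1/f noise in the Fourier domain and add it to a Texture II image.
sz = [384 384];
fx = ifftshift(-sz(2)/2 : sz(2)/2 - 1);
fy = ifftshift(-sz(1)/2 : sz(1)/2 - 1);
[FX, FY] = meshgrid(fx, fy);
f = sqrt(FX.^2 + FY.^2);  f(1, 1) = 1;          % avoid division by zero at DC
spec  = (randn(sz) + 1i * randn(sz)) ./ f;      % 1/f amplitude falloff
noise = real(ifft2(spec));
noise = 0.1 * noise / std(noise(:));            % assumed noise contrast
colorII = im2double(imread('textureII.png'));   % hypothetical Texture II stimulus
colorII = min(max(colorII + repmat(noise, 1, 1, 3), 0), 1);
imshow(colorII);
```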
Observers
Fourteen observers with normal or corrected-to-normal vision participated in the Material Degradation experiments: five in Degradation I, five in Degradation II, and four in a task that measured baseline performance on the original, nondegraded images. All observers gave informed consent and were compensated monetarily.
Procedure
In all conditions, observers were asked to identify the material category for a given image. They were provided the list of nine categories in our database, and they indicated their selections with key presses. In each condition, observers were informed about the stimulus generation process, and they were encouraged to reason about what they saw. For example, a lot of brown and orange in the Color or Texture conditions could denote wood, whereas a lot of blue could denote water.
Baseline
Observers were asked to categorize the original photographs in our database. The entire database was divided into three nonoverlapping sets of 300 photographs each. Two observers categorized one set each and the rest categorized two sets each. All photographs were displayed until observers made a response, allowing them as much time as they needed to examine the photographs. Presentation order was counterbalanced between observers.
Material Degradation I experiment
Observers were shown the Grayscale, Grayscale-Blurred, or Grayscale-Negative images, such as the ones in the top row of Figure 6, and they were asked to perform 9-AFC material categorization. All images were presented for 1000 ms, after which observers indicated their responses with a key press. We used the same split of the database into three nonoverlapping sets as in the baseline task. Two observers categorized two sets each in the Grayscale condition. Three different observers categorized one set each in the Grayscale-Blurred condition. The same three observers categorized images in the Grayscale-Negative condition; two of these observers completed one set each while the third observer completed two sets. The order of viewing conditions and images associated with each condition were counterbalanced between observers.
Material Degradation II experiment
Observers were shown images from all six conditions shown in the lower rows of Figure 6, and they were asked to perform 9-AFC material categorization. Because the manipulations in this experiment were more severe than those in Degradation I, exposure durations were not limited; on each trial, observers viewed the stimulus image for as long as they needed before indicating a nine-way response with a key press. The original database was divided into six nonoverlapping sets of 150 photographs each. Each observer completed six sets in total, one set per viewing condition. The order of viewing conditions and the images associated with each condition were counterbalanced between observers.
Results
In all conditions, chance performance corresponded to 11% accuracy.
Baseline
The accuracy at categorizing the original photographs with unlimited viewing time was 91%. The performance in this condition is the highest that we can expect observers to achieve, and it serves as a baseline for comparison.
Material Degradation I experiment
As shown in Figure 7a, observer performance was robust to the loss of color (Grayscale: 86.6%) and to additional losses of high spatial frequencies (Grayscale Blurred: 75.5%) and contrast polarity (Grayscale Negative: 73.8%). On comparing the accuracy in each condition to baseline performance, we found that removing color did not significantly affect accuracy, Grayscale: t(4) = 1.007, p = 0.371. However, removing high spatial frequencies or inverting contrast polarity in the grayscale photographs led to a significant decrease in accuracy, Grayscale Blurred: t(5) = 3.688, p = 0.014; Grayscale Negative: t(5) = 4.268, p = 0.008; a Bonferroni correction yields a significance threshold of 0.017. These results indicate that the simple cues we manipulated have only a minor influence on material categorization. There is sufficient information in the blurred and negative photographs for observers to make categorization judgments successfully.
Material Degradation II experiment
As shown in Figure 7b, observers performed poorly in all conditions (e.g., 32.9% for Shape II). A repeated measures ANOVA determined that there was a significant effect of the viewing condition on accuracy, F(5) = 9.772, p < 0.001. Post hoc tests using the Bonferroni correction revealed a significant increase in accuracy from 23.3% in the Texture I condition to 31.7% in the Texture II condition (p = 0.018). While there is an overall influence of the viewing condition, the main result in Figure 7b is that, unlike rapid scene and object recognition, material category judgments cannot be explained by the fast discrimination of any single, simple feature that we tested (e.g., color, Oliva & Schyns, 2000; spatial frequency content, Schyns & Oliva, 1994; Johnson & Olshausen, 2003; texture and outline shape, Velisavljevic & Elder, 2009). This is not to say that observers do not use cues based on color, texture, and shape for material categorization, but that they require multiple cues at a minimum.
A closer look at Shape I and Shape II results revealed that there was no effect of the type of view (close-up vs. regular) on categorization accuracy, Shape I, t(4) = 0.716, p = 0.513; Shape II, t(4) = 0.211, p = 0.843. Even when objects were clearly visible in regular views (e.g., moon-shaped ornament in Figure 6, Shape I & II), object-based cues did not seem to assist the material category judgment. This result suggests one of two possibilities: (a) object knowledge is not useful for material categorization; or (b) object knowledge is useful for material categorization but observers were unable to utilize it in the Shape conditions. To account for the second possibility, we conducted another experiment with a new set of stimuli.
What is the role of object knowledge in material categorization?
The material(s) that an object is made of are usually not arbitrary. Keys are made of metallic alloys, candles of wax, and tires of rubber. It is reasonable to believe that we form associations between objects and the materials they are made of. Given the object identity (e.g., book), we can easily infer the material identity (e.g., paper) on the basis of such learned associations. Is it possible then that material category judgments are simply derived from object knowledge? Perhaps the speeds we have measured for material categorization are merely a consequence of fast object recognition (Biederman, Rabinowitz, Glass, & Stacy, 1974; Potter, 1975, 1976; Intraub, 1981; Thorpe et al., 1996) followed by inferences based on object-material relationships?
There are a few problems with this line of reasoning. First, for most man-made objects, the object identity does not uniquely determine the material identity. For example, chairs can be made of wood, metal, or plastic. Second, for most natural objects, the object identity is confounded with the material identity. For example, bananas are made of “banana-stuff,” and nearly all objects made of “banana-stuff” are bananas. Third, there is evidence from patients with visual form agnosia that the ability to identify materials can be independent of the ability to identify objects (Humphrey, Goodale, Jakobson, & Servos, 1994). Finally, half the stimuli used in our experiments were close-up shots, where the object identity was difficult to discern, making it unlikely that object knowledge could be used to identify the material category.
In general, object identities are based on 3-D shape properties as well as material properties. Although shape properties usually dominate the definition of an object, material properties are an important aspect of object identity. For example, a marble toy, a ball bearing element, and an olive that have roughly the same shape and size are easy to distinguish on the basis of their material appearance. It is conceivable that, instead of material category judgments being derived from object knowledge, object category judgments are derived from material knowledge. To understand the relationship between object categorization and material categorization better, we conducted an experiment where the object identity could be dissociated from material properties.
There is a vast industry dedicated to creating fake objects like fake leather garments, fake flowers, or fake food for restaurant displays. These fake items and their genuine versions are useful for our purposes as they differ mainly in their material composition, but not their shape-based object category. Consider the examples shown in Figure 8. Recognizing the objects in these images as fruits or flowers is not sufficient. One has to assess whether the material appearance of these objects is consistent with standard expectations for that object category. The real versus fake discrimination can be viewed as a material category judgment, even if it is a subtler judgment than, say, plastic versus nonplastic. In the experiments that follow, we presented photographs of real and fake objects to observers and asked them to identify the object category and the real versus fake category. By comparing observers' performance on these two tasks, we can determine whether material categorization is simply derived from shape-based object recognition.
In the Real-Fake RapidCat experiment, stimuli were presented for different exposure durations, and the time course for object categorization was compared to that for the real versus fake discrimination. In the Real-Fake RT experiment, RTs were measured for the real versus fake discrimination, and the results thus obtained were compared to those of the Material RT experiment.
Stimuli
We collected real and fake examples for three familiar object categories: desserts, flowers, and fruits. As shown in Figure 8, the fake examples were made from materials like fabric, plastic, clay, etc. We collected 300 color photographs from the photo sharing website, Flickr.com, under various forms of Creative Commons licensing. There were 100 examples in each object category—50 real and 50 fake. We attempted to balance lighting, background, and color cues as well as the content of these real and fake images. For example, fake flowers, like real flowers, could appear next to real leaves outdoors, whereas fake fruits, like real fruits, could appear in fruit baskets indoors. All photographs were cropped down to 1024 × 768 pixel resolution to ensure uniformity, and they were normalized to equate mean luminance.
As the real versus fake discrimination can be subtle, we screened this set of 300 photographs further to eliminate difficult cases. We asked four naive observers to rate all photographs on a 5-point scale, from “definitely fake” to “definitely real,” in unlimited presentation time conditions. Photographs that received unambiguous ratings of realness and fakeness (125 in total, 67 real and 58 fake, divided roughly evenly amongst object categories) were then used for the Real-Fake RapidCat and Real-Fake RT experiments. The four observers whose ratings were used to screen the original set of photographs did not participate in these experiments.
Observers
Fifteen observers with normal or corrected-to-normal vision participated in the Real-Fake experiments: seven in Real-Fake RapidCat and eight in Real-Fake RT. The observers who participated in the Real-Fake RT experiment were the same as the ones who participated in the Material RT experiment. All observers gave informed consent, and they were compensated monetarily.
Procedure
Real-Fake RapidCat experiment
Photographs of real and fake objects were presented in a rapid presentation paradigm similar to the Material RapidCat experiment. Each photograph was presented for a brief duration (40, 320, or 2560 ms) and was followed by a colored pink noise mask for 1000 ms. Observers were asked to identify the object category (dessert, fruit, or flower) and to judge the authenticity (real or fake) of the object(s) in the photograph. To accommodate the harder material discrimination task, presentation times were longer than in the Material RapidCat experiment. The set of 125 unambiguously real and fake photographs was split into three nonoverlapping subsets. Each block consisted of a six-way categorization (dessert/fruit/flower × real/fake) of one of these subsets for one setting of stimulus presentation time. All observers viewed each image only once and completed three blocks each. The order of presentation times and images associated with each presentation time were counterbalanced between observers. All stimuli were presented centrally against a midgray background and subtended 26° × 19.5°. For this experiment, unlike the rest, stimuli were displayed on a different LCD monitor (1280 × 1024 pixels, 75 Hz).
Real-Fake RT experiment
The observers and procedures of the Material RT experiment were employed to measure the speed of the real versus fake discrimination (see Figure 4a). Fake objects served as targets and the genuine objects from the same category served as distracters. Trials were blocked by the object category, and target prevalence was 50%. Four observers viewed all 125 images in the unambiguously real and fake set, while the rest (TU, RD, JJ, and KV in Figure 4b) viewed half of them. All observers viewed each image only once. Presentation order of object categories was counterbalanced between observers. All stimuli in this experiment were of size 15° × 12°.
Results
In all cases, chance performance for the real versus fake discrimination corresponded to 50% accuracy.
Real-Fake RapidCat experiment
Figure 9 plots the performance on both tasks, object categorization and real versus fake. The accuracy of object categorization was very high (94.6% at 40 ms), and it was not affected by exposure duration (40 ms vs. 2560 ms, t[6] = −0.636, p = 0.548). The accuracy of real versus fake discrimination, on the other hand, was much lower (74.9% at 40 ms), and a repeated measures ANOVA determined that there was a significant effect of exposure duration, F(2, 12) = 18.38, p < 0.001. Post hoc comparisons using the Tukey HSD test revealed a significant increase in accuracy from 40 ms to 320 ms (88.9%) as well as from 40 ms to 2560 ms (90.4%). Clearly, shape-based object information, while no doubt useful and available to observers, is insufficient for the real versus fake discrimination. These results tell us that (a) material recognition can be fast, even when its rapidity is dissociated from that of shape-based object recognition, and (b) observers can make fine material discriminations even in brief presentations.
Real-Fake RT experiment
The results indicated in blue in Figure 4b and c correspond to this experiment. The accuracy of the real versus fake discrimination was 83.6%, and the median RT on correct trials was 703 ms. The real versus fake discrimination was faster than we had expected, but it was significantly slower than the material categorization task in the Material RT experiment (Z = −2.521, p = 0.012). This outcome is not surprising because the real versus fake task is inherently a harder discrimination task than the one used in the Material RT experiment.
Discussion
Our world contains both objects and materials, but somehow objects dominate the study of visual recognition (Adelson, 2001). Perhaps this is because there is a strong bias towards referencing entities defined by shape (Landau, Smith, & Jones, 1988). We have names for nearly 1,500 basic-level object categories (Biederman, 1987) but nowhere near as many names for material categories. In an additional experiment, reported in Appendix A, we asked naive observers to name materials in 1,000 photographs of daily scenes. They were unable to name more than 50 materials, even when the granularity of their responses was ignored (e.g., fabric and silk were counted separately). In informal pilot experiments, we found that experienced observers (e.g., artists) were unable to name many more materials, even though they were able to describe precisely how a surface might look, feel, or react to forces. Patterson and Hays (2012) have obtained a similar count for material names (38) by studying a much larger collection of scenes.
Our goal in this paper has been to establish some of the basic properties of material recognition. We built a database with photographs from nine material categories, selecting images covering a wide range of material appearances. We found that material categories can be identified quickly and accurately, and that this rapid identification cannot be accomplished with simple cues, requiring that observers engage mechanisms of some sophistication. We argue that material categorization should be considered as a distinct task, different from the recognition of shapes, textures, objects, scenes, or faces.
Material categorization can be fast
To measure the speed of material categorization, we used several tasks, paralleling tasks that have been used to study the rapid recognition of objects and scenes. We found that observers can categorize materials in exposures as brief as 40 ms, in the presence of backwards masking. This was true for categorizing one material out of many (Material RapidCat experiment), as well as for distinguishing real versus fake materials of a given object category (Real-Fake RapidCat experiment). The 40-ms value does not directly correspond to the duration of processing, but it is comparable to the brief exposures needed for object and scene categorization (Bacon-Mace et al., 2005; Greene & Oliva, 2009). We also measured RTs for material categorization and compared them to those for a pair of simple baseline tasks (Material RT experiment). Median RT for material categorization was approximately 100 ms slower than that for the baseline tasks of color and orientation discrimination, suggesting that roughly 100 ms of extra processing is needed for material recognition to occur. Median RT for real versus fake was 150 ms slower still (Real-Fake RT experiment), indicating that this task is more challenging.
Since we first presented these findings (Sharan et al., 2009), others have further probed the rapidity of material categorization. Wiebel et al. (2013) compared the time course of material categorization directly to that of object categorization. They found that material categorization can be as fast as basic-level object categorization, and when low-level image properties (e.g., mean luminance, contrast) are normalized, it can even be as fast as superordinate-level object categorization. Wiebel et al. used close-ups of relatively flat surfaces from four material categories (wood, metal, fabric, and stone), and their observers performed four-way categorization tasks at exposures briefer than ours (e.g., 8 ms). While it is difficult to compare estimates of speed across studies (ours: 80.2% accuracy in 40-ms exposures, chance = 50%; theirs: 62.5% in 25-ms exposures, chance = 25%), it is safe to conclude that material categorization can be rapid for a range of experimental conditions.
We have also examined the relative speeds of object and material categorization in subsequent work (Xiao, Sharan, Rosenholtz, & Adelson, 2011). However, unlike Wiebel et al. (2013), we used the same set of images for the object and material tasks, which ensured that low-level image properties stayed the same in both conditions. We found that object categorization (glove vs. handbag) was faster for close-up views and material categorization (leather vs. fabric) was faster for regular views. These results do not contradict Wiebel et al.'s findings. There are many differences between our stimuli and theirs. It is fair to say that there is no clear answer yet; object and material categorization can both be fast, and depending on the stimuli (Johnson & Olshausen, 2003), one can be faster than the other.
Material categorization is not merely based on simple feature judgments
Speed estimates are necessarily confounded with task difficulty, i.e., choice of image database (Johnson & Olshausen, 2003; Wichmann, Drewes, Rosas, & Gegenfurtner, 2010). One could make an easy database in which all the metal pictures had sharp angles, all the glass pictures had swooping curves, and all the wood was brown. Observers would be quite fast at telling the material categories apart, possibly because certain low-level features based on primitive color or outline shape would suffice for material recognition. We constructed an intentionally diverse database to avoid such confounds, and further, we tested if simple feature judgments could explain the speeds we have measured.
We modified the images in our database to deemphasize simple features such as color and high spatial frequencies (Material Degradation I). By deemphasizing a particular feature, we could determine how critical that feature was for material categorization. We found that observers did not critically depend on any one feature that we tested for, consistent with Wiebel et al.'s (2013) findings about the role of color in material categorization. It is worth noting that while simple features may not be critical for material categorization, they can play an important role in other material judgments (e.g., material change identification, Zaidi, 2011; fabric classification, Giesel & Zaidi, 2013).
Wolfe and Myers (2010) have asked whether material categories, as an attribute, can guide attention as efficiently as color, size, orientation, etc. They used close-up images from our database as stimuli and asked observers to perform visual search tasks such as finding a plastic target in a display of stone distracters. A variety of set sizes and target-distracter combinations were tested. The result was the same in each case: Search was inefficient. Wolfe and Myers concluded that material categories do not guide attention as efficiently as simpler attributes. We interpret this result to mean that material categorization is a distinct task, and especially for our stimuli, it cannot be explained in terms of simple feature judgments.
Material categorization is not merely a surface property judgment
To allow judgments of material categories, our visual system has to process surface appearance to extract material-relevant information. The nature of this processing remains largely unknown (Gilchrist, 2006; Adelson, 2008; Brainard, 2009; Shevell & Kingdom, 2010; Anderson, 2011; Maloney et al., 2011; Fleming, 2012, 2014). In our experiments, we examined the role of four surface properties in material categorization: color, gloss, texture, and shape. Observers had no trouble judging the material category when one or more surface properties were deemphasized (Material Degradation I); however, they struggled when only one surface property was emphasized (Material Degradation II). These results tell us that the knowledge of color, gloss, texture, or shape, in itself, is not sufficient for material categorization. This is not to suggest that these properties are not important for the task, but that they are not useful individually.
It is possible that instead of relying on any one surface property, the visual system combines estimates of different surface properties to decide the material category. For example, a plastic versus nonplastic judgment is more easily made when color (uniform, saturated colors), gloss (shiny), surface shape (smooth, rounded), and texture (lack of strong textures) are considered together. In order to study the joint influence of different surface properties, one requires the ability to independently manipulate those properties. The images in our database were acquired in unknown conditions, which makes such manipulations impossible. Even if our images had been acquired in carefully controlled conditions, varying each surface property independently and smoothly is beyond the capabilities of current image processing algorithms (Barron & Malik, 2013).
An alternative would have been to use synthetic stimuli, generated using computer graphics. Synthetic stimuli are useful because they allow smooth variations along perceptual dimensions of interest such as albedo (Nishida & Shinya, 1998), color (Yang & Maloney, 2001; Boyaci et al., 2003; Delahunt & Brainard, 2004), gloss (Pellacini et al., 2000; Fleming et al., 2003), or shape (Fleming, Torralba, & Adelson, 2004). Unfortunately, current computer graphics tools cannot easily represent the range of material categories that we have studied, or capture the richness and the diversity of appearance within each category (Jensen, 2001; Ward et al., 2006; Dorsey, Rushmeier, & Sillion, 2007; Igarashi, Nishino, & Nayar, 2007; Schröder, Zhao, & Zinke, 2012).
Fleming et al. (2013) have examined the relationship between surface appearance and material categories in another way. They used a subset of our database and asked observers to rate nine surface properties (e.g., glossiness, roughness, fragility) for each image. They found that the ratings of different properties form a distinctive signature for each material category (e.g., glass images were rated to be glossy, transparent, hard, cold, and fragile). In theory, such signatures could be used to infer the material category. However, as Fleming et al. point out, correlation does not imply causation. Despite the tight coupling between surface properties and material categories, it is unclear if material categorization requires the estimation of diagnostic surface properties.
In companion papers (Liu et al., 2010; Sharan et al., 2013), we have studied material categorization from a computational perspective. We developed a state-of-the-art computational model that takes as input an image and produces as output a material category label. This model computes local image features that loosely correlate with different surface properties: color (RGB pixel patches), texture (SIFT features, Lowe, 2004), shape (curvature of strong edges), and reflectance (region analysis of strong edges). The model then combines these features using supervised learning methods. When it is trained on half of our database, it is able to categorize images in the other half with 55.6% accuracy. This performance is well above chance (10%; a tenth category, Foliage, was added to the database) and higher than that of competing models (23.8%, Varma & Zisserman, 2008; 54%, Hu et al., 2011).
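As a toy illustration of the final supervised step, precomputed features can be combined with an off-the-shelf multiclass classifier as follows; the feature matrices and labels are hypothetical, and fitcecoc's error-correcting SVMs stand in for, rather than reproduce, the learning method of Liu et al. (2010).

```matlab
% Combine precomputed image features to predict one of ten material
% categories. 'featsTrain'/'featsTest' are hypothetical n x d feature
% matrices; 'labelsTrain'/'labelsTest' are categorical category labels.
% Requires the Statistics and Machine Learning Toolbox.
mdl  = fitcecoc(featsTrain, labelsTrain);       % multiclass model from binary SVMs
pred = predict(mdl, featsTest);                 % predicted material categories
acc  = mean(pred == labelsTest);                % compare to the 55.6% reported above
```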
Although our modeling effort is nowhere close to explaining, or even matching, human performance (84.9% at 10-way categorization, unlimited exposures), it supports the conclusions of the present study in two ways. One, material categorization is a distinct task. Models that succeed at texture recognition (Varma & Zisserman, 2008) fail at material categorization. Success on our task requires strategies different from those used in texture or object recognition. Two, simple low-level features, of the type used in our model, are not sufficient for material categorization. This is not to suggest that these features are not useful (accuracy for individual features: 25%–40%), but that they offer limited clues to the high-level material category (combined accuracy: 55.6%).
Material categorization is not merely derived from high-level object recognition
The relationship between objects and materials is not arbitrary. Most objects (e.g., spoons) are made of certain materials (e.g., plastic, porcelain, metal, wood) and not others (e.g., paper, fabric). To understand the role of object knowledge in material categorization, we used a new set of images designed to dissociate object and material identities. For each image, observers made an object judgment (dessert, flowers, or fruit) and a material judgment (real vs. fake). Object judgments were faster and more accurate than real versus fake judgments, which suggests that object categorization engages different mechanisms from material categorization. This conclusion is borne out in subsequent work (Xiao et al., 2011; Wiebel et al., 2013) for different choices of object and material judgments.
Further support for the distinction between object and material categorization comes from an early study of a visual form agnosic (Humphrey et al., 1994) and more recent fMRI investigations (Newman, Klatzky, Lederman, & Just, 2005; Cant & Goodale, 2007; Cavina-Pratesi, Kentridge, Heywood, & Milner, 2010; Cant & Goodale, 2011; Hiramatsu et al., 2011). Humphrey et al. reported that patient DF, who suffers from profound visual form agnosia, was able to identify materials (e.g., brass) even though she was unable to identify objects (e.g., an ashtray). In subsequent fMRI studies with neurologically normal participants, Cant and Goodale (2007, 2011) have shown that 3-D shape properties, which determine object identity, are processed in a different brain region from material properties. Behavioral reports support these findings (Cant, Large, McCall, & Goodale, 2008), although there is clear evidence of an interaction: Material properties influence the perception of 3-D shape (Fleming et al., 2004; Ho et al., 2008; Fleming, Holtmann-Rice, & Buelthoff, 2011; Wijntjes, Doerschner, Kucukoglu, & Pont, 2012) and vice versa (Bloj et al., 1999; Boyaci et al., 2003; VanGorp et al., 2007; Ho et al., 2008; Olkkonen & Brainard, 2011; Marlow, Kim, & Anderson, 2012; Vurro et al., 2013).
Conclusions
We studied judgments of high-level material categories using real-world images. The images we used capture the natural range of material appearances. We found that material categories can be identified quickly, requiring only 100 ms more than simple baseline tasks, and accurately, achieving 80.2% accuracy even in 40-ms exposures (chance = 50%). This performance cannot be explained in terms of simple feature judgments of color, gloss, texture, or shape. Nor can it be explained as a consequence of shape-based object recognition. We argue that fast and accurate material categorization is a distinct, basic ability of the visual system.
Put together, recent work (Sharan, 2009; Liu et al., 2010; Wolfe & Myers, 2010; Hu et al., 2011; Xiao et al., 2011; Fleming et al., 2013; Sharan et al., 2013; Wiebel et al., 2013) has taught us a great deal about material categorization that was not known before. Despite this progress, there remain many unanswered questions. Surface appearance clearly determines the material category, but how? Does the visual system combine an assortment of surface properties to categorize materials (Fleming et al., 2013)? Do the mechanisms underlying material categorization differ from those underlying object categorization? Or for that matter, scene categorization? The relationships between materials and objects, and even materials and scenes, exhibit certain regularities (Steiner, 1998), many of which are yet to be understood.
Acknowledgments
We thank Aseema Mohanty for help with database creation; Alvin Raj for the image quilting code used to generate the Texture manipulations; Ce Liu for help with the study described in Appendix A; Aude Oliva for the use of eye-tracking equipment in preliminary experimentation; and Aude Oliva, Michelle R. Greene, Barbara Hidalgo-Sotelo, Molly Potter, Nancy Kanwisher, Jeremy Wolfe, Roland Fleming, Shin'ya Nishida, Isamu Motoyoshi, Micah K. Johnson, Alvin Raj, Ce Liu, James Hays, and Bei Xiao for helpful discussions. This work was supported by NIH grants R01-EY019262 and R21-EY019741 and a grant from NTT Basic Research Laboratories. The photographs in Figures 1 through 6 and 8 were acquired from the website Flickr.com courtesy of the following users:
Figure 1. Darren Johnson, David Martyn Hunt, and Michael Fraley
Figure 2. Fabian Bromann, tanakawho, Randy Robertson, Crimson & Clover Vintage, Robert Brook, and millicent_bystander
Figure 3. Siti Saad, Breibeest, Ute, liz west, Bart Everson, Luz, La Petit Poulailler, sozoooo, Nelson Alexandre Rocha, Janine, Damien du Toit, Crimson & Clover Vintage, Audrey, Ctd 2005, Kevin Dooley, apintogsphotos, Mikko Lautamäki, Nanimo, lotyloty, Jenni Douglas, Randy Robertson, tanakawho, liz west, Carolyn Williams, Michel Filion, sallypics, Stefan Andrej Shambora, Ryan McDonald, Paulio Geordio, tico_24, Ishmael Orendain, mrphishphotography, spDuchamp, spDuchamp, Fabian Bromann, and Leonardo Aguiar
Figure 4a, Figure 5a. Ryan McDonald
Figure 6. Bart Everson and Muhammad Ghouri
Figure 8. The Shopping Sherpa, Kerry Breyette, Jubei Kibagami, Wendy Copley, Mykl Roventine, and edvvc
All photographs were acquired under the Creative Commons (CC) BY 2.0 License with the following exceptions: (i) cut-glass bowl in Figure 3b (CC BY-NC-ND 2.0); and (ii) fake cupcake (CC ND 2.0), real cupcake (CC BY-NC-SA 2.0), and real flowers (CC BY-NC-SA 2.0) in Figure 8. The image of the fake cupcake was magnified for clarity.
Commercial relationships: none.
Corresponding author: Lavanya Sharan.
Email: sharan@alum.mit.edu.
Address: Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge MA, USA.
Appendix A
We chose nine common materials, as shown in Figure 3, to study material categorization in real-world images. It is reasonable to ask if this specific choice of materials was justified. Are these nine materials the most common in daily life? If not, which materials should we have chosen? Unlike for objects (Rosch, 1978), there are no studies on the familiarity of various material categories. Therefore, we conducted an exploratory study to determine the most common materials in daily life.
We collected a set of 1,000 photographs of everyday scenes from the photo-sharing website Flickr.com. To construct a set of images representative of the visual experience of an average person, we searched for images that conveyed the following settings: street, buildings, office, kitchen, bedroom, bathroom, people, shopping, outdoors, and transportation. In addition to these keywords, we searched for food, drinks, kitchen equipment, clothes, computers, skin, store inside, trees, sky, and seat to include close-up images of the materials in each setting. We collected 50 color images for each of these 20 keywords. The resolution of the images ranged from 300 × 450 to 1280 × 1280 pixels.
Five naive observers were asked to annotate the materials in each image, for as many images as they could complete. Observers were asked to focus first on the materials that occupied the largest image regions. They were also asked to provide a category label if they were not sure of the precise material name (e.g., metal when unsure whether a surface was tin or aluminum). Three observers annotated nonoverlapping sets of 300, 299, and 221 images, respectively. Responses from these observers were used to create a list of suggestions for the remaining two observers, who annotated all 1,000 images. Observers were given as much time as needed per image, and the presentation order of images was randomized for each observer.
The annotations thus obtained were interpreted by the experimenters so that misspellings (e.g., mettal) and subordinate category labels (e.g., aluminum) were counted towards the intended category (e.g., metal). The twenty most frequently occurring categories, in decreasing order of occurrence, were: metal, fabric, wood, glass, plastic, stone, ceramic, greenery, skin, paper, food, hair, air, water, clouds, leather, rubber, mirror, sand, and snow. These results confirm that the nine categories used in our Material experiments were a reasonable choice (ranks 1–6, 10, 14, and 16 out of 20).
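To illustrate the tallying step, the short sketch below assumes the raw annotations arrive as free-text strings and uses a hand-built mapping from misspellings and subordinate labels to intended categories. The mapping entries and annotation strings are hypothetical examples; the actual interpretation was performed by the experimenters.

```python
# Sketch of normalizing free-text material annotations and counting them.
# The CANONICAL map below is hypothetical; the real mapping was hand-made.
from collections import Counter

CANONICAL = {
    "mettal": "metal",     # misspelling -> intended category
    "aluminum": "metal",   # subordinate label -> category
    "tin": "metal",
    "denim": "fabric",
    "oak": "wood",
}

def canonicalize(label):
    label = label.strip().lower()
    return CANONICAL.get(label, label)

# Toy annotations; the real input would be one list of labels per image.
annotations = ["mettal", "Aluminum", "fabric", "denim", "oak", "glass"]
counts = Counter(canonicalize(a) for a in annotations)

# Report categories in decreasing order of occurrence, as in the appendix.
for category, n in counts.most_common():
    print(category, n)
```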
While designing this study, we were motivated by an average person's daily visual experience. It is likely that a different selection of scenes would yield a different ordering of material categories. The results presented here are merely a preliminary step towards studying the prevalence and hierarchy of material categories in the real world.
Appendix B
In the main paper, we focused on establishing some of the basic properties of material categorization. To avoid effects driven by a particular material category (e.g., wood) or a particular view type (e.g., close-up), we averaged across material categories and view types in the Material experiments (RT, RapidCat, Degradations I and II). For the interested reader, we present here a more complete view of the data. In Figure 10, we replot the data from Figures 4c, 5b, and 7 for each material category and view type. The reader should note, however, that these data may be insufficient for answering questions that our experiments were not designed to address; the trends observed in Figure 10 should therefore be interpreted with caution.
Influence of material category
Although we used an intentionally diverse selection of images, there are clear hints of category-specific effects in Figure 10. For example, stone was easier to identify than metal in 40-ms exposures (Figure 10b), whereas fabric was easier to identify than paper in Texture II stimuli (Figure 10d). When we designed our Material experiments, we were not interested in measuring the influence of individual material categories, so we recruited a small number of observers (5–8 per experiment) and tested each on only a few categories (3–9 categories per experimental condition). As a result, these experiments lack the statistical power needed for differences between material categories to reach significance.
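To give a rough sense of this limitation, the sketch below computes the smallest standardized effect size (Cohen's d) that a paired, within-observer comparison could detect at conventional settings (two-sided alpha = 0.05, 80% power) for several sample sizes. The settings and sample sizes are illustrative assumptions, not an analysis we performed.

```python
# Sketch of a sensitivity (power) calculation for paired comparisons.
# Assumption: a one-sample/paired t test, alpha = 0.05 two-sided, 80% power.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
for n in (5, 8, 30):
    # Solve for the smallest detectable effect size at sample size n.
    d = analysis.solve_power(effect_size=None, nobs=n, alpha=0.05, power=0.8)
    print(f"n = {n:2d} observers -> minimum detectable d = {d:.2f}")
```

With 5–8 observers, only very large effects would reach significance, which is why the category-level trends in Figure 10 should be read as hints rather than established differences.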
Influence of view type
Figure 10 suggests that view type had no influence in any of our Material experiments. Statistical analyses confirmed this observation. For the Material RT experiment, a Wilcoxon signed-rank test found no significant difference in median RTs for close-up versus regular views (Z = 15, p = 0.7422). For the Material RapidCat experiment, a two-way repeated measures ANOVA revealed a significant effect of exposure duration, F(2, 8) = 23.23, p < 0.001, but no significant effect of view type, F(1, 4) = 2.59, p = 0.1828, and no significant interaction between exposure duration and view type, F(2, 8) = 0.64, p = 0.5523. For the Material Degradations I experiment, there were no significant differences between close-up and regular views in any of the conditions (Baseline: t(3) = −2, p = 0.1098; Grayscale: t(1) = −29, p = 0.0222; Grayscale Blurred: t(2) = 1, p = 0.5582; Grayscale Negative: t(2) = −1, p = 0.4544; Bonferroni correction yields a 0.0125 significance level). For the Material Degradations II experiment, a two-way repeated measures ANOVA revealed a significant effect of manipulation, F(2, 8) = 4.61, p = 0.0465, but no significant effect of view type, F(1, 4) = 0.03, p = 0.8662, and no significant interaction between image manipulation and view type, F(2, 8) = 0.17, p = 0.8448.
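For readers who wish to run analogous analyses on their own data, the sketch below shows how each family of tests reported above (Wilcoxon signed-rank, paired t tests with Bonferroni correction, and two-way repeated measures ANOVA) could be computed in Python. All numbers, condition names, and duration labels are simulated placeholders; our original analyses were not produced with this code.

```python
# Sketch of the three kinds of view-type analyses on simulated data.
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel, wilcoxon
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(1)
n_obs = 5  # observers

# Wilcoxon signed-rank test on median RTs, close-up vs. regular views.
rt_close = rng.normal(900, 80, n_obs)
rt_regular = rng.normal(910, 80, n_obs)
print(wilcoxon(rt_close, rt_regular))

# Paired t test per degradation condition, Bonferroni-corrected for the
# four conditions of Degradations I.
alpha_corrected = 0.05 / 4  # = 0.0125
acc_close = rng.normal(0.80, 0.05, n_obs)
acc_regular = rng.normal(0.78, 0.05, n_obs)
t, p = ttest_rel(acc_close, acc_regular)
print(f"t = {t:.2f}, p = {p:.4f}, significant: {p < alpha_corrected}")

# Two-way repeated measures ANOVA: exposure duration x view type.
# The three duration labels are placeholders.
rows = [(s, dur, view, rng.normal(0.8, 0.05))
        for s in range(n_obs)
        for dur in ("short", "medium", "long")
        for view in ("close-up", "regular")]
df = pd.DataFrame(rows, columns=["subject", "duration", "view", "accuracy"])
res = AnovaRM(df, depvar="accuracy", subject="subject",
              within=["duration", "view"]).fit()
print(res.anova_table)
```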
References
- Adelson E. H. (2001). On seeing stuff: The perception of materials by humans and machines. In B. E. Rogowitz & T. N. Pappas (Eds.), SPIE: Vol. 4299. Human vision and electronic imaging VI (pp. 1–12). doi:10.1117/12.784132
- Adelson E. H. (2008). Image statistics and surface perception. In B. E. Rogowitz & T. N. Pappas (Eds.), SPIE: Vol. 6809. Human vision and electronic imaging XIII (pp. 1–9). doi:10.1117/12.784132
- Anderson B. (2011). Visual perception of materials and surfaces. Current Biology, 21 (24), R978–R983.
- Bacon-Macé N., Macé M. J., Fabre-Thorpe M., Thorpe S. J. (2005). The time course of visual processing: Backward masking and natural scene categorization. Vision Research, 45 (11), 1459–1469.
- Barron J. T., Malik J. (2011). High-frequency shape and albedo from shading using natural image statistics. In IEEE conference on computer vision and pattern recognition (pp. 2521–2528). doi:10.1109/cvpr.2011.5995392
- Barron J. T., Malik J. (2013, May). Shape, illumination, and reflectance from shading (Technical Report No. UCB/EECS-2013-117). Retrieved from the University of California at Berkeley, Electrical Engineering and Computer Science: http://www.cs.berkeley.edu/~barron/BarronMalikTR2013.pdf
- Beck J., Prazdny S. (1981). Highlights and the perception of glossiness. Perception & Psychophysics, 30 (4), 401–410.
- Bell S., Upchurch P., Snavely N., Bala K. (2013). OpenSurfaces: A richly annotated catalog of surface appearance. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2013), 32 (4), 1–11. doi:10.1145/2461912.2462002
- Berzhanskaya J., Swaminathan G., Beck J., Mingolla E. (2005). Remote effects of highlights on gloss perception. Perception, 34 (5), 565–575.
- Biederman I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94 (2), 115–147.
- Biederman I., Rabinowitz J. C., Glass A. L., Stacy E. W. (1974). On information extracted from a glance at a scene. Journal of Experimental Psychology, 103 (3), 597–600.
- Bloj M., Kersten D., Hurlbert A. C. (1999). Perception of three-dimensional shape influences color perception through mutual illumination. Nature, 402, 877–879.
- Boivin S., Gagalowicz A. (2001). Image-based rendering of diffuse, specular and glossy surfaces from a single image. In Proceedings of the 28th annual conference on computer graphics and interactive techniques (pp. 107–116).
- Boyaci H., Maloney L. T., Hersh S. (2003). The effect of perceived surface orientation on perceived surface albedo in binocularly viewed scenes. Journal of Vision, 3 (8): 2, 541–553, http://www.journalofvision.org/content/3/8/2, doi:10.1167/3.8.2.
- Brainard D. H. (1997). The Psychophysics Toolbox. Spatial Vision, 10, 433–436.
- Brainard D. H. (2009). Bayesian approaches to color vision. In Gazzaniga M. S. (Ed.), The cognitive neurosciences (4th ed., pp. 395–408). Cambridge, MA: MIT Press.
- Brainard D. H., Maloney L. T. (2011). Surface color perception and equivalent illumination models. Journal of Vision, 11 (5): 1, 1–18, http://www.journalofvision.org/content/11/5/1, doi:10.1167/11.5.1.
- Cant J. S., Goodale M. A. (2007). Attention to form or surface properties modulates different regions of human occipitotemporal cortex. Cerebral Cortex, 17, 713–731.
- Cant J. S., Goodale M. A. (2011). Scratching beneath the surface: New insights into the functional properties of the lateral occipital area and parahippocampal place area. Journal of Neuroscience, 31 (22), 8248–8258.
- Cant J. S., Large M. E., McCall L., Goodale M. A. (2008). Independent processing of form, color, and texture in object perception. Perception, 37 (1), 57–78.
- Cavina-Pratesi C., Kentridge R. W., Heywood C. A., Milner A. D. (2010). Separate processing of texture and form in the ventral stream: Evidence from fMRI and visual agnosia. Cerebral Cortex, 20 (2), 433–446.
- Dana K. J., van Ginneken B., Koenderink J. J., Nayar S. (1999). Reflectance and texture of real-world surfaces. ACM Transactions on Graphics, 18, 1–34.
- Delahunt P. B., Brainard D. H. (2004). Color constancy under changes in reflected illumination. Journal of Vision, 4 (9): 8, 764–778, http://www.journalofvision.org/content/4/9/8, doi:10.1167/4.9.8.
- Deng J., Dong W., Socher R., Li L. J., Li K., Fei-Fei L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition (pp. 248–255). doi:10.1109/cvpr.2009.5206848
- Doerschner K., Fleming R. W., Yilmaz O., Schrater P. R., Hartung B., Kersten D. (2011). Visual motion and the perception of surface material. Current Biology, 21 (23), 2010–2016.
- Doerschner K., Maloney L. T., Boyaci H. (2010). Perceived glossiness in high dynamic range scenes. Journal of Vision, 10 (9): 11, 1–11, http://www.journalofvision.org/content/10/9/11, doi:10.1167/10.9.11.
- Dorsey J., Rushmeier H., Sillion F. (2007). Digital modeling of material appearance. Burlington, MA: Morgan Kaufmann.
- Dror R. O., Adelson E. H., Willsky A. S. (2001). Recognition of surface reflectance properties from a single image under unknown illumination. In Proceedings of the IEEE workshop on identifying objects across variations in lighting: Psychophysics and computation. Kauai, HI.
- Efros A., Freeman W. T. (2001). Image quilting for texture synthesis and transfer. In Proceedings of ACM SIGGRAPH 2001 (pp. 341–346).
- Everingham M., Zisserman A., Williams C., Van Gool L., Allan M., Bishop C., … Zhang J. (2005). The 2005 PASCAL visual object classes challenge. In J. Quiñonero-Candela, I. Dagan, B. Magnini, & F. d'Alché-Buc (Eds.), Machine learning challenges: Evaluating predictive uncertainty, visual object classification and recognizing textual entailment (Lecture notes in artificial intelligence, pp. 117–176).
- Fei-Fei L., Fergus R., Perona P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28 (4), 594–611.
- Fleming R. W. (2012). Visual heuristics in the perception of glossiness. Current Biology, 22 (20), R865–R866.
- Fleming R. W. (2014). Visual perception of materials and their properties. Vision Research, 94, 62–75.
- Fleming R. W., Buelthoff H. (2005). Low-level image cues in the perception of translucent materials. ACM Transactions on Applied Perception, 2 (3), 346–382.
- Fleming R. W., Dror R. O., Adelson E. H. (2003). Real-world illumination and the perception of surface reflectance properties. Journal of Vision, 3 (5): 3, 347–368, http://www.journalofvision.org/content/3/5/3, doi:10.1167/3.5.3.
- Fleming R. W., Holtmann-Rice D., Buelthoff H. H. (2011). Estimation of 3-D shape from image orientations. Proceedings of the National Academy of Sciences, 108 (51), 20438–20443.
- Fleming R. W., Jakel F., Maloney L. T. (2011). Visual perception of thick transparent materials. Psychological Science, 22 (6), 812–820.
- Fleming R. W., Torralba A., Adelson E. H. (2004). Specular reflections and the perception of shape. Journal of Vision, 4 (9): 10, 798–820, doi:10.1167/4.9.10.
- Fleming R. W., Wiebel C., Gegenfurtner K. R. (2013). Perceptual qualities and material classes. Journal of Vision, 13 (8): 9, 1–20, http://www.journalofvision.org/content/13/8/9, doi:10.1167/13.8.9.
- Ged G., Obein G., Silvestri Z., Rohellec J. L., Viénot F. (2010). Recognizing real materials from their glossy appearance. Journal of Vision, 10 (9): 18, 1–17, http://www.journalofvision.org/content/10/9/18, doi:10.1167/10.9.18.
- Gerhard H. E., Maloney L. T. (2010). Detection of light transformations and concomitant changes in surface albedo. Journal of Vision, 10 (9): 1, 1–14, http://www.journalofvision.org/content/10/9/1, doi:10.1167/10.9.1.
- Giesel M., Gegenfurtner K. R. (2010). Color appearance of real objects varying in material, hue and shape. Journal of Vision, 10 (9): 10, 1–21, http://www.journalofvision.org/content/10/9/10, doi:10.1167/10.9.10.
- Giesel M., Zaidi Q. (2013). Frequency-based heuristics for material perception. Journal of Vision, 13 (14): 7, 1–19, http://www.journalofvision.org/content/13/14/7, doi:10.1167/13.14.7.
- Gilchrist A. L. (2006). Seeing black and white. New York: Oxford University Press.
- Gilchrist A. L., Jacobsen A. (1984). Perception of lightness and illumination in a world of one reflectance. Perception, 13, 5–19.
- Gilchrist A. L., Kossyfidis C., Bonato F., Agostini T., Cataliotti J., Li X., Economou E. (1999). An anchoring theory of lightness perception. Psychological Review, 106, 795–834.
- Greene M. R., Oliva A. (2009). The briefest of glances: The time course of natural scene understanding. Psychological Science, 20 (4), 464–472.
- Grill-Spector K., Kanwisher N. (2005). Visual recognition: As soon as you know it is there, you know what it is. Psychological Science, 16 (2), 152–160.
- Grosse R., Johnson M. K., Adelson E. H., Freeman W. T. (2009). Ground-truth dataset and baseline evaluations for intrinsic image algorithms. In IEEE international conference on computer vision (pp. 2335–2342). doi:10.1109/ICCV.2009.5459428
- Hartung B., Kersten D. (2002). Distinguishing shiny from matte [Abstract]. Journal of Vision, 2 (7): 551, http://www.journalofvision.org/content/2/7/551, doi:10.1167/2.7.551.
- Hearst M. (1995). TileBars: Visualization of term distribution information in full text information access. In I. R. Katz, R. Mack, L. Marks, M. B. Rosson, & J. Nielsen (Eds.), Proceedings of the SIGCHI conference on human factors in computing systems (pp. 59–66). doi:10.1145/223904.223912
- Hiramatsu C., Goda N., Komatsu H. (2011). Transformation from image-based to perceptual representation of materials along the human ventral visual pathway. NeuroImage, 57 (2), 482–494.
- Ho Y. X., Landy M. S., Maloney L. T. (2008). Conjoint measurement of gloss and surface texture. Psychological Science, 19 (2), 196–204.
- Hu D., Bo L., Ren X. (2011). Toward robust material recognition for everyday objects. In J. Hoey, S. McKenna, & E. Trucco (Eds.), Proceedings of the British machine vision conference (pp. 48.1–48.11).
- Humphrey G. K., Goodale M. A., Jakobson L. S., Servos P. (1994). The role of surface information in object recognition: Studies of a visual form agnosic and normal subjects. Perception, 23, 1457–1481.
- Igarashi T., Nishino K., Nayar S. K. (2007). The appearance of human skin: A survey. Foundations and Trends in Computer Graphics and Vision, 3, 1–95.
- Intraub H. (1981). Rapid conceptual identification of sequentially presented pictures. Journal of Experimental Psychology: Human Perception & Performance, 7, 604–610.
- Jensen H. W. (2001). Realistic image synthesis using photon mapping. Wellesley, MA: AK Peters.
- Johnson J. S., Olshausen B. A. (2003). Timecourse of neural signatures of object recognition. Journal of Vision, 3 (7): 4, 499–512, http://www.journalofvision.org/content/3/7/4, doi:10.1167/3.7.4.
- Kerrigan I. S., Adams W. J. (2013). Highlights, disparity, and perceived gloss with convex and concave surfaces. Journal of Vision, 13 (1): 9, 1–10, http://www.journalofvision.org/content/13/1/9, doi:10.1167/13.1.9.
- Kim J., Marlow P. J., Anderson B. (2011). The perception of gloss depends on highlight congruence with surface shading. Journal of Vision, 11 (9): 4, 1–19, http://www.journalofvision.org/content/11/9/4, doi:10.1167/11.9.4.
- Kim J., Marlow P. J., Anderson B. L. (2012). The dark side of gloss. Nature Neuroscience, 15, 1590–1595.
- Landau B., Smith L. B., Jones S. S. (1988). The importance of shape in early lexical learning. Cognitive Development, 3, 299–321.
- Liu C., Sharan L., Rosenholtz R., Adelson E. H. (2010). Exploring features in a Bayesian framework for material recognition. In IEEE conference on computer vision and pattern recognition (pp. 239–246). doi:10.1109/CVPR.2010.5540207
- Lowe D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60 (2), 91–110.
- Maloney L. T., Gerhard H. E., Boyaci H., Doerschner K. (2011). Surface color perception and light field estimation in 3-D scenes. In Harris L. R., Jenkin M. R. M. (Eds.), Vision in 3D environments (pp. 65–88). Cambridge, UK: Cambridge University Press.
- Maloney L. T., Yang J. N. (2003). The illumination estimation hypothesis and surface color perception. In Mausfeld R., Heyer D. (Eds.), Colour: Connecting the mind to the physical world (pp. 335–358). Oxford, UK: Oxford University Press.
- Marlow P. J., Kim J., Anderson B. (2012). The perception and misperception of specular surface reflectance. Current Biology, 22 (20), 1909–1913.
- Marlow P. J., Kim J., Anderson B. L. (2011). The role of brightness and orientation congruence in the perception of surface gloss. Journal of Vision, 11 (9): 16, 1–12, http://www.journalofvision.org/content/11/9/16, doi:10.1167/11.9.16.
- Matusik W., Pfister H. P., Brand M., McMillan L. (2003). A data-driven reflectance model. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2003), 22 (3), 759–769.
- Motoyoshi I. (2010). Highlight-shading relationship as a cue for the perception of translucent and transparent materials. Journal of Vision, 10 (9): 6, 1–11, http://www.journalofvision.org/content/10/9/6, doi:10.1167/10.9.6.
- Motoyoshi I., Matoba H. (2011). Variability in constancy of the perceived surface reflectance across different illumination statistics. Vision Research, 53, 30–39.
- Motoyoshi I., Nishida S., Sharan L., Adelson E. H. (2007). Image statistics and the perception of surface qualities. Nature, 447, 206–209.
- Nagai T., Ono Y., Tani Y., Koida K., Kitazaki M., Nakauchi S. (2013). Image regions contributing to perceptual translucency: A psychophysical reverse-correlation study. i-Perception, 4 (6), 407–428. doi:10.1068/i0576
- Nandakumar C., Malik J. (2009). Understanding rapid category detection via multiply degraded images. Journal of Vision, 9 (6): 19, 1–8, http://www.journalofvision.org/content/9/6/19, doi:10.1167/9.6.19.
- Newman S. D., Klatzky R. L., Lederman S. J., Just M. A. (2005). Imagining material versus geometric properties of objects: An fMRI study. Cognitive Brain Research, 23, 235–246.
- Nishida S., Shinya M. (1998). Use of image-based information in judgments of surface-reflectance properties. Journal of the Optical Society of America A, 15, 2951–2965.
- Obein G., Knoblauch K., Viénot F. (2004). Difference scaling of gloss: Nonlinearity, binocularity, and constancy. Journal of Vision, 4 (9): 4, 711–720, http://www.journalofvision.org/content/4/9/4, doi:10.1167/4.9.4.
- Oliva A., Schyns P. G. (2000). Diagnostic colors mediate scene recognition. Cognitive Psychology, 41, 176–210.
- Olkkonen M., Hansen T., Gegenfurtner K. R. (2008). Color appearance of familiar objects: Effects of object shape, texture, and illumination changes. Journal of Vision, 8 (5): 13, 1–16, http://www.journalofvision.org/content/8/5/13, doi:10.1167/8.5.13.
- Olkkonen M., Brainard D. H. (2010). Perceived glossiness and lightness under real-world illumination. Journal of Vision, 10 (9): 5, 1–19, http://www.journalofvision.org/content/10/9/5, doi:10.1167/10.9.5.
- Olkkonen M., Brainard D. H. (2011). Joint effects of illumination geometry and object shape in the perception of surface reflectance. i-Perception, 2, 1014–1034.
- Patterson G., Hays J. (2012). SUN attribute database: Discovering, annotating, and recognizing scene attributes. In IEEE conference on computer vision and pattern recognition (pp. 2751–2758). doi:10.1109/CVPR.2012.6247998
- Pellacini F., Ferwerda J. A., Greenberg D. P. (2000). Towards a psychophysically-based light reflection model for image synthesis. In Proceedings of ACM SIGGRAPH 2000 (pp. 55–64).
- Pont S. C., Koenderink J. J. (2005). Bidirectional texture contrast function. International Journal of Computer Vision, 62 (1/2), 17–34.
- Portilla J., Simoncelli E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.
- Potter M. (1975). Meaning in visual search. Science, 187, 965–966.
- Potter M. (1976). Short-term conceptual memory for pictures. Journal of Experimental Psychology: Human Learning & Memory, 2, 509–522.
- Radonjić A., Todorović D., Gilchrist A. (2010). Adjacency and surroundedness in the depth effect on lightness. Journal of Vision, 10 (9): 12, 1–16, http://www.journalofvision.org/content/10/9/12, doi:10.1167/10.9.12.
- Robilotto R., Zaidi Q. (2004). Limits of lightness identification of real objects under natural viewing conditions. Journal of Vision, 4 (9): 9, 779–797, http://www.journalofvision.org/content/4/9/9, doi:10.1167/4.9.9.
- Robilotto R., Zaidi Q. (2006). Lightness identification of patterned three-dimensional, real objects. Journal of Vision, 6 (1): 3, 18–36, http://www.journalofvision.org/content/6/1/3, doi:10.1167/6.1.3.
- Romeiro F., Zickler T. E. (2010). Blind reflectometry. In K. Daniilidis, P. Maragos, & N. Paragios (Eds.), 11th European conference on computer vision, proceedings, part I, September 5–11, 2010, Heraklion, Crete, Greece.
- Rosch E. (1978). Principles of categorization. In E. Rosch & B. B. Lloyd (Eds.), Cognition and categorization. Hillsdale, NJ: Lawrence Erlbaum.
- Russell B., Torralba A., Murphy K. P., Freeman W. T. (2008). LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77 (1–3), 157–173.
- Sakano Y., Ando H. (2010). Effects of head motion and stereo viewing on perceived glossiness. Journal of Vision, 10 (9): 15, 1–14, http://www.journalofvision.org/content/10/9/15, doi:10.1167/10.9.15.
- Schröder K., Zhao S., Zinke A. (2012). Recent advances in physically-based appearance modeling of cloth. In SIGGRAPH Asia 2012 courses (pp. 1–52). New York: ACM.
- Schyns P. G., Oliva A. (1994). From blobs to boundary edges: Evidence for time- and spatial-scale-dependent scene recognition. Psychological Science, 5, 195–200.
- Sharan L. (2009). The perception of material qualities in real-world images (Unpublished doctoral dissertation). Massachusetts Institute of Technology, Cambridge, MA.
- Sharan L., Li Y., Motoyoshi I., Nishida S., Adelson E. H. (2008). Image statistics for surface reflectance perception. Journal of the Optical Society of America A, 25 (4), 846–865.
- Sharan L., Liu C., Rosenholtz R., Adelson E. H. (2013). Recognizing materials using perceptually inspired features. International Journal of Computer Vision, 103 (3), 348–371.
- Sharan L., Rosenholtz R., Adelson E. H. (2009). Material perception: What can you see in a brief glance? [Abstract]. Journal of Vision, 9 (8): 784, http://www.journalofvision.org/content/9/8/784, doi:10.1167/9.8.784.
- Shevell S. K., Kingdom F. A. A. (2010). Color in complex scenes. Annual Review of Psychology, 59, 143–166.
- Steiner J. (1998). Look-alikes (1st ed.). New York: Megan Tingley.
- Tappen M., Freeman W. T., Adelson E. H. (2005). Recovering intrinsic images from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (9), 1459–1472.
- Thorpe S., Fize D., Marlot C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
- Todd J. T. (2004). The visual perception of 3-D shape. Trends in Cognitive Sciences, 8 (3), 115–121.
- Todd J. T., Norman J. F., Mingolla E. (2004). Lightness constancy in the presence of specular highlights. Psychological Science, 15, 33–39.
- Tominaga S., Tanaka N. (2000). Estimating reflection parameters from a single color image. IEEE Computer Graphics and Applications, 20, 58–66.
- VanGorp P., Laurijssen J., Dutre P. (2007). The influence of shape on the perception of material reflectance. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2007), 26 (3), 1–9.
- Varma M., Zisserman A. (2008). A statistical approach to material classification using image patch exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99 (1), 2032–2047.
- Velisavljevic L., Elder J. H. (2009). Cue dynamics underlying rapid detection of animals in natural scenes. Journal of Vision, 9 (7): 7, 1–20, http://www.journalofvision.org/content/9/7/7, doi:10.1167/9.7.7.
- Vurro M., Ling Y., Hurlbert A. C. (2013). Memory color of natural familiar objects: Effects of surface texture and 3-D shape. Journal of Vision, 13 (7): 20, 1–20, http://www.journalofvision.org/content/13/7/20, doi:10.1167/13.7.20.
- Ward K., Bertails F., Kim T.-Y., Marschner S. R., Cani M.-P., Lin M. C. (2006). A survey on hair modeling: Styling, simulation, and rendering. IEEE Transactions on Visualization and Computer Graphics, 13 (2), 213–234.
- Wendt G., Faul F., Ekroll V., Mausfeld R. (2010). Disparity, motion, and color information improve gloss constancy performance. Journal of Vision, 10 (9): 7, 1–17, http://www.journalofvision.org/content/10/9/7, doi:10.1167/10.9.7.
- Wichmann F. A., Drewes J., Rosas P., Gegenfurtner K. R. (2010). Animal detection in natural scenes: Critical features revisited. Journal of Vision, 10 (4): 6, 1–27, http://www.journalofvision.org/content/10/4/6, doi:10.1167/10.4.6.
- Wiebel C., Valsecchi M., Gegenfurtner K. R. (2013). The speed and accuracy of material recognition in natural images. Attention, Perception, & Psychophysics, 75 (5), 954–966.
- Wijntjes M., Doerschner K., Kucukoglu G., Pont S. (2012). Relative flattening between velvet and matte 3-D shapes: Evidence for similar shape-from-shading computations. Journal of Vision, 12 (1): 2, 1–11, http://www.journalofvision.org/content/12/1/2, doi:10.1167/12.1.2.
- Wolfe J. M., Myers L. (2010). Fur in the midst of the waters: Visual search for material type is inefficient. Journal of Vision, 10 (9): 8, 1–9, http://www.journalofvision.org/content/10/9/8, doi:10.1167/10.9.8.
- Xiao B., Brainard D. H. (2008). Surface gloss and color perception of 3-D objects. Visual Neuroscience, 25, 371–385.
- Xiao B., Sharan L., Rosenholtz R., Adelson E. H. (2011). Speed of material vs. object recognition depends upon viewing conditions [Abstract]. Journal of Vision, 11 (15): 19, http://www.journalofvision.org/content/11/15/19, doi:10.1167/11.15.19.
- Xiao B., Walter B., Gkioulekas I., Zickler T., Adelson E. H., Bala K. (2014). Looking against the light: How perception of translucency depends on lighting direction. Journal of Vision, 14 (3): 17, 1–22, http://www.journalofvision.org/content/14/3/17, doi:10.1167/14.3.17.
- Yang J. N., Maloney L. T. (2001). Illuminant cues in surface color perception: Tests of three candidate cues. Vision Research, 41 (20), 2581–2600.
- Zaidi Q. (2011). Visual inferences of material changes: Color as clue and distraction. Wiley Interdisciplinary Reviews: Cognitive Science, 2 (6), 686–700.