The Advantage of a Ground Surface in the Representation of Visual Scenes

Zheng Bian; George J Andersen

doi:10.1167/10.8.16

Author manuscript; available in PMC: 2013 Aug 27.

Published in final edited form as: J Vis. 2010 Jul 1;10(8):16. doi: 10.1167/10.8.16

PMCID: PMC3752837 NIHMSID: NIHMS505856 PMID: 20884591

Abstract

The present study used change detection tasks to examine whether there is an advantage of a ground surface in representing visual scenes. In 6 experiments a flicker paradigm (Experiments 1 through 4) or a one-shot paradigm (Experiments 5 and 6) was used to examine whether changes on a ground surface were easier to detect than changes on a ceiling surface. Overall, we found that: (1) there was an advantage in detecting changes on a ground surface or changes to objects on a ground surface; (2) this advantage was dependent on the presence of a coherent ground surface; (3) this advantage could propagate to objects connected to the ground surface through “nested” contact relations; (4) this advantage was mainly due to improved encoding rather than improved retrieval and comparison of the ground surface; and (5) this advantage was dependent on the presentation duration of the scene but not the number of objects presented in the scene. Together, these results suggest a unique role of the ground surface in organizing visual scenes.

An important goal of vision is to recover a description of the world from visual images that can be used to guide behavior (Marr, 1982). Our phenomenal experience in perceiving the visual world is that we recover a richly detailed description of the environment. However, recent studies have shown considerable limitations in this description. Studies have found that observers have difficulty detecting significant changes in a scene if a change occurs during a saccade (Henderson & Hollingworth, 1999, 2003), during a blink (O’Regan et al, 2000), during a blank interval inserted between an original and modified scene (Rensink, O’Regan & Clark, 1997; Simons, 1996; Hollingworth, Schrock, & Henderson, 2001), during a film cut (Levin & Simons, 1997; Simons, 1996), or during a “mudsplash” between an original and altered scene (O’Regan, Rensink, & Clark, 1999). This phenomenon, referred to as “change blindness”, has also been demonstrated in real world interactions when the observer’s view of a real scene was temporarily blocked (Simons & Levin, 1998).

The results of change blindness studies demonstrate that observers do not recover a coherent and detailed representation of the visual world. Instead, limited information is encoded and available for further processing. Given this limitation, are there any principles used by the visual system to organize the description of the visual world? Previous research has suggested that the ground surface is used by the visual system as a common reference frame to encode the distance of objects on the surface (Gibson, 1950; He & Ooi, 2000). In the present study we assessed this hypothesis in detail by examining whether the ground surface is used as the foundation for organizing a description of the visual world, with relational information of objects, object parts, location and distance encoded relative to this foundation.

Background surfaces, including the ground and ceiling surfaces, may be used as the foundation for 3D scene representations because they provide layout information of objects within scenes. For example, Bian, Braunstein and Andersen (2005) found that the perceived depth order of two objects could be altered by optical contact with either a ground or ceiling surface. Studies have also found that many visual tasks are performed in accordance with background surface information. These studies include tasks such as visual search (He & Nakayama, 1992), detection of the direction of apparent motion (He & Nakayama, 1994a), texture segregation (He & Nakayama, 1994b), depth from binocular disparity (He & Ooi, 2000), and the perception of subjective contours (Gillam & Nakayama, 2002). Boundary extension, a phenomenon in which observers tend to report seeing more of the background scene than was originally presented in a picture, was found in pictures with scene layout information but not in pictures with a blank background (Gottesman & Intraub, 2002, 2003). Prior experience with a background scene can have a priming effect on judging the layout information in the scene (Sanocki & Epstein, 1997; Sanocki, 2003). Improved encoding has also been found for information related to the layout of a scene (e.g. the position or the presence/absence of objects in a scene) as compared to information less related to the layout of a scene (e.g. the color of the objects) (Aginsky & Tarr, 2000). In addition, imaging studies using fMRI have found that an area in parahippocampal cortex, referred to as the parahippocampal place area or PPA, responded strongly to layout of 3-D scenes but only weakly to arrays of objects without a coherent background surface (Epstein & Kanwisher, 1998). It was suggested that PPA encodes the spatial layout of the local environment (Epstein, 2005). The results of these studies, considered together, suggest that background surfaces are important for the perception and organization of visual scenes.

However, not all background surfaces have the same ecological importance. Many studies have suggested that the ground surface is the most important background surface among all environmental surfaces. The importance of the ground surface in perceiving 3-D space was discussed as early as 1000 years ago in Alhazen’s writings (translation: 1989). In addition, Gibson (1950) argued that the ground surface, compared to other environmental surfaces such as ceilings and side walls, serves a unique role in organizing the visual world. The ground surface supports almost all objects and the locomotion of most land-dwelling animals, including human beings (Gibson, 1950). Objects not in direct contact with the ground are usually supported by the ground surface through a series of “nested contact relations” (Meng & Sedgwick, 2001). In addition, the ground surface is universal, whereas other surfaces typically accompany it in artificial environments such as buildings.

The importance of the ground surface, as compared to other environmental surfaces, is evident when considering the information available when viewing the 3D world. An important characteristic of the optical projection of light to the eye is that perspective information (changes in the projected angles of a rigid object as a function of distance) is present when viewing a 3D scene. Perspective information is present for any single object that is visible in a scene. Throughout the scene, variations in perspective for objects located at different distances define a structure. This perspective structure can be used to define important properties of the visual world, such as the horizon or the slant of surfaces, from gradients. Perspective structure can also be used to define the layout of the scene, including the relative distances of objects, and is present for any surface receding in depth such as ground and ceiling surfaces. Although the perspective structure can be identical for ground and ceiling surfaces, the usefulness of this information to the observer can vary. Consider, for example, a ground surface extended in depth and visible up to a fixed distance. If the image is rotated 180 deg to produce a ceiling surface, the perspective structure for the ground and ceiling surfaces is identical. However, the utility of the perspective structure for determining information important to the observer, such as layout, will not be the same. When the perspective structure defines a ground surface, one can derive egocentric distances of objects in the scene by using eyeheight (the distance from the eye of the observer to the ground). Specifically, absolute distance d can be determined from two alternative calculations. Absolute distance can be specified as

$$d = \frac{H}{\sin \eta},$$

where d is the absolute distance, H is the eyeheight of the observer and η is the slant of the ground surface (Ooi, Wu and He, 2001; see also Sedgwick, 1986). Alternatively, absolute distance along a textured ground surface can be determined by

$$d = H \times \frac{\cos \alpha_{1}}{\sin \alpha_{2}} \times \tan\left(\frac{\beta_{1}}{\beta_{2}}\right),$$

where H is the eyeheight of the observer, α₁ and α₂ are the projected angles from the observer to two texture elements on the ground surface and β₁ and β₂ are the projected extents of the texture elements (i.e. the calculation of tan(β₁/β₂) is the texture gradient of the surface).
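The first of the two relations above (d = H / sin η) can be illustrated with a short numerical sketch. The eye height of 120 cm is the value reported later for the stimuli of Experiment 1; the slant angles and the function name `distance_from_eye_height` are illustrative assumptions, not values or code from the study.

```python
import math


def distance_from_eye_height(eye_height_cm, slant_deg):
    """Absolute distance d = H / sin(eta) for eye height H and slant eta.

    Hypothetical helper; the slant is given in degrees and must lie in (0, 90].
    """
    if not 0.0 < slant_deg <= 90.0:
        raise ValueError("slant must be in (0, 90] degrees")
    return eye_height_cm / math.sin(math.radians(slant_deg))


H = 120.0  # eye height in cm (value used for the Experiment 1 scenes)
for slant in (30.0, 10.0, 5.0):  # illustrative slants, not from the study
    d = distance_from_eye_height(H, slant)
    print(f"slant {slant:4.1f} deg -> d = {d:7.1f} cm")
```

Smaller slants yield larger distances, which is the scaling that a ceiling surface cannot provide because no eye height is defined relative to it.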

However, when the perspective structure defines a ceiling surface egocentric distance cannot be determined by the scaling of eye height because the eye height relative to a ceiling surface is undefined. Although slant information can be determined from texture gradients present in ground and ceiling surfaces (since it does not require eyeheight information, see Howard & Rogers, 2002), absolute distance can only be determined for a ground surface (see also Thompson, Dilda & Creem-Regehr, 2007). We are not ruling out other possibilities that absolute distance can be recovered without the use of eye height scaling. For instance, He, Wu, Ooi, Yarbrough, and Wu (2004) proposed a sequential surface integration process (SSIP) that the visual system uses to judge egocentric distance. According to this process, near distance could be accurately recovered by the visual system through near depth cues such as binocular disparity and vergence. This information is then used as an anchor to perceive distance further away. However, the availability of eye height scaling to recover egocentric distance gives the ground surface a unique advantage over other environmental surfaces, which suggests that the ground surface may serve as the foundation for the perceptual organization of scenes.

Recent studies have shown that perceived egocentric distance is mediated by ground surface information by manipulating optical contact (the contact of images in the 2-D projection) between an object and the ground surface (Meng & Sedgwick, 2001, 2002; Ni, Braunstein & Andersen, 2004, 2005, 2007), by the presence of a discontinuity on the ground (Sinai, Ooi & He, 1998; Feria, Braunstein & Andersen, 2003; He et al, 2004; B. Wu, He & Ooi, 2007), by varying the way the ground surface was scanned (B. Wu, Ooi & He, 2004), or by manipulating the area of the ground surface that was the focus of attention (J. Wu, He & Ooi, 2008).

The importance of the ground surface in the perceptual organization of scenes has also been demonstrated by directly comparing the ground surface with other environmental surfaces. For example, McCarley and He (2000, 2001) used a search task in which objects were arranged to form an implicit ground or ceiling surface. They found faster response times when searching implicit ground versus implicit ceiling surface displays. Bian, Braunstein and Andersen (2005) found that when the ground surface and the ceiling surface provided conflicting information about the relative distance of objects in a scene, observers used information from the ground surface to determine the layout of the scene. When the two surfaces were sidewalls, observers did not show a preference for either surface. They referred to this result as the ground dominance effect. In a follow-up study, Bian, Braunstein, and Andersen (2006) varied the relative location of the ground surface and the ceiling surface in the visual field, and found that the ground dominance effect was mainly due to the differences in the projections of ground and ceiling surfaces, with visual field location having a minor effect. Recent research has also found a ground dominance effect for older observers, although the magnitude of the effect was smaller than that found for younger observers (Bian & Andersen, 2008).

One possible reason for the unique role of the ground surface in the perceptual organization of 3-D scenes is evolution (He & Nakayama, 1992, 1994a, 1994b, 1995). An important function of the visual system is to encode features separated in space into representations which serve as the basis for higher-level processing. Patterns that occur repeatedly and that are more relevant to human behavior (e.g., locomotion in the world) may be encoded at a faster speed and in greater detail (McCarley & He, 2000). This would enable the visual system to process information in the environment more efficiently. Since the ground surface provides support, either directly or indirectly, to almost all objects and land-dwelling animals, it is possible that the ground surface serves as a common reference frame against which the locations of objects resting on the ground surface are coded (Gibson, 1950; see also Gibson 1979). He and Ooi (2000) showed that a common surface mediated the judged distance between objects on or close to that surface. More recent research has also found that change detection performance improved when the slant of a receding surface was increased (resulting in a slanted surface more similar to a ground surface; Ozkan & Braunstein, 2009). The results of this research, considered together, suggest that scenes may be organized in a hierarchical fashion with the ground surface used as the foundation for organizing a description of the scene, with objects, object parts, locations and distance information encoded relative to this global description. This hypothesis is consistent with recent research suggesting that the spatial representation of scenes is organized in a hierarchical manner (Rolls, Tromans & Stringer, 2008).

In the current study we used change detection tasks to examine in detail the hypothesis that the ground surface serves as the foundation for organizing a description of the scene with objects, object parts, locations and distance information encoded relative to this global description. If the ground surface is used as an organizing principle for the perception of 3-D scenes, then the ground surface, compared to other environmental surfaces, should be encoded more efficiently and in greater detail. For example, consider two different approaches to organizing a representation of the scene: a hierarchical structure and a structure based on locally coded information. Furthermore, assume that the unit of information is distance between two points in space in a scene-centered coordinate system. The hierarchical representation could be organized with three levels consisting of a background surface as the top level in the structure, followed by the relative distance or spatial layout of objects, and then followed by the distance of object parts relative to the object. The locally coded representation would be organized as a single level with the distance between all object parts in the scene encoded. The hierarchical representation is more efficient than the locally coded representation because less information is required to encode the scene. For example, consider a scene consisting of 5 objects with each object containing 3 parts. In the hierarchical representation the first level would consist of one unit of information (overall depth of the scene), the second level (spatial layout of objects) would consist of 10 units of information (distance between all pairwise sets of objects) and the third level (object parts relative to object) would consist of 15 units of information (3 distances for each object). In contrast, the locally coded representation, based on distance of object parts, would consist of 15 units of information (3 units for each object) and 90 units of information (9 units for all 10 pairwise sets of objects). Thus the hierarchical representation would require that 26 units of information be encoded to describe the scene whereas the locally coded representation would require 105 units of information to describe the scene. If the observer has limited viewing time, then a greater proportion of the scene can be encoded in a hierarchical framework than in a locally coded framework. The purpose of this example is not to argue that scenes are encoded precisely in this manner. Rather, the purpose is to demonstrate how a ground surface may facilitate the encoding of the scene by using a hierarchical representation. A similar account was proposed by He and Ooi (2000), who argued that the visual system may encode the location of objects on a common visual surface using a quasi 2-D coordinate system (X, Y) instead of a 3-D Cartesian coordinate system (X, Y, Z). The advantage of using this strategy is that the visual system could encode the relative distances among objects more efficiently with reduced demand for computation. Our proposal complements their theory by suggesting a hierarchical structure in the representation of both objects and object parts against a ground surface.
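The unit counts in the worked example above can be reproduced with a small sketch. The function names are hypothetical, and the generalization beyond 5 objects with 3 parts each is an assumption: within-object distances are counted as one per part in the hierarchical scheme and as all part pairs in the locally coded scheme, which coincide at 3 when each object has 3 parts.

```python
from math import comb


def hierarchical_units(n_objects, parts_per_object):
    """Units of information in the three-level hierarchical scheme."""
    scene_level = 1                              # overall depth of the scene
    layout_level = comb(n_objects, 2)            # distances between object pairs
    part_level = n_objects * parts_per_object    # part-to-object distances (assumed one per part)
    return scene_level + layout_level + part_level


def local_units(n_objects, parts_per_object):
    """Units of information when every part-to-part distance is encoded."""
    within = n_objects * comb(parts_per_object, 2)           # part pairs inside each object
    between = comb(n_objects, 2) * parts_per_object ** 2     # part pairs across object pairs
    return within + between


print(hierarchical_units(5, 3))  # 1 + 10 + 15 = 26
print(local_units(5, 3))         # 15 + 90 = 105
```

Under these assumptions the locally coded total equals the number of pairs among all 15 parts, so it grows quadratically with the number of parts, whereas the hierarchical total is dominated by the much smaller number of object pairs.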

To examine this issue we used change detection tasks in which observers compared a current representation of a scene to a stored representation of a previously presented scene. In 6 experiments a flicker paradigm (Rensink, O’Regan & Clark, 1997) or a one-shot paradigm (Rensink, 2002) was used to compare change detection performance for ground and ceiling surfaces. In Experiments 1 and 2, we examined whether changes to a ground surface were easier to detect than changes to a ceiling surface and whether this effect was due to a preference to focus attention on the ground surface. In Experiment 3, we examined whether changes to objects on a ground surface were easier to detect than changes to objects on a ceiling surface and examined whether disrupting the coherent perspective structure of the background surface could affect the ground surface advantage in change detection. In Experiment 4, we examined whether the ground surface advantage in detecting a change would propagate to objects not in direct contact with the ground surface. In Experiment 5, we examined whether the ground surface advantage in change detection was due to improved encoding of the ground surface, or whether improved performance could be due to improved retrieval and comparison of the ground surface. Finally, in Experiment 6 we examined the effect of varying the presentation duration and set size on the ground surface advantage.

Experiment 1

The purpose of the first experiment was to examine whether a change on a ground surface was easier to detect than a change on a ceiling surface. The displays simulated a ground and ceiling surface defined by a random checkerboard pattern. A flicker paradigm (Rensink, O’Regan & Clark, 1997) was used to present the original scene (A) and a modified scene (A’) in a sequence of A, A, A’, A’. The modified scene was produced by changing the luminance of one square in the original scene. If there is a ground surface advantage in detecting a change in the scene then detection performance should be faster and more accurate when the change is on a ground as compared to a ceiling surface.

Method

Observers

The observers were 9 undergraduate students (4 male and 5 female) from the University of California, Riverside. All observers were paid for their participation, were naive regarding the purpose of the experiment, and had normal or corrected-to-normal visual acuity.

Stimuli

The stimuli were computer-generated 3-D scenes composed of a ground surface and a ceiling surface, each with a 6 × 6 random black-white square texture. Each square measured 238 cm × 238 cm. The average luminance of the stimulus was 60.8 cd/m². Examples of the stimuli are shown in Figure 1. The simulated distances from the observer to the near and far ends of the plane were 571 cm and 2000 cm, respectively. (The calculation of the scene dimensions was based on an eye-height of 120 cm.)

Design

Two independent variables were manipulated: (1) the surface in which a change occurred (ground or ceiling), and (2) the inter-stimulus-interval (ISI) between two consecutive scenes (80 ms, 160 ms, or 240 ms). For “change” trials, 18 squares in the center of each surface were randomly selected to be the candidate targets that changed luminance across consecutive scenes (on each change trial only 1 of the 18 squares changed luminance). This manipulation produced 108 change trials. We also included 36 no-change trials (6 trials for each of 6 combinations) to ensure that observers followed instructions. A total of 144 trials were evenly divided into two blocks. Eight practice trials (6 change trials and 2 no-change trials) were inserted at the beginning of each block. The two experimental blocks were preceded by a 48-trial practice block composed of 36 change trials and 12 no-change trials. The order of the trials for each observer in each block was randomized.
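The trial counts in this design follow directly from the factorial structure, as the following sketch shows. The variable names are illustrative; all numbers are taken from the paragraph above.

```python
surfaces = ("ground", "ceiling")
isis_ms = (80, 160, 240)
candidate_squares = 18   # candidate target squares per surface
no_change_per_cell = 6   # no-change trials per surface x ISI combination

cells = len(surfaces) * len(isis_ms)            # 6 combinations
change_trials = candidate_squares * cells       # 108
no_change_trials = no_change_per_cell * cells   # 36
total_trials = change_trials + no_change_trials # 144
trials_per_block = total_trials // 2            # 72
no_change_fraction = no_change_trials / total_trials  # 0.25

print(change_trials, no_change_trials, total_trials, trials_per_block, no_change_fraction)
```

The resulting proportion of no-change trials, 25%, matches the proportion stated in the Procedure.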

Apparatus

The displays were presented on a 21-inch (53 cm) flat screen CRT monitor with a pixel resolution of 1280 by 1024, controlled by a Windows XP Professional Operating System on a Dell Dimension XPS workstation. The dimensions of the display on the monitor were 40.0 cm (W) × 30.0 cm (H), subtending a visual angle of 31.3° × 23.7°. A black viewing hood was placed in front of the monitor to cover the edges of the screen. A 19-cm diameter glass collimating lens, which magnified the images by approximately 19%, was located between the observer and the monitor. The purpose of the collimating lens was to remove accommodation as a flatness cue and thus increase the perceived depth of the 3-D scenes. The distance between the eyes and the collimating lens was approximately 10 cm and the distance from the eyes to the monitor was 85 cm. A chin rest was mounted at a position appropriate to this viewing distance. An optical mouse was used by the observers to initiate each trial and to respond.
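The reported visual angles can be checked with the standard formula for the angle subtended by an extent centered on the line of sight. In this sketch the function name is hypothetical, and applying the approximately 19% magnification of the collimating lens to the display extent is an inference from the reported numbers rather than something the text states explicitly.

```python
import math


def visual_angle_deg(extent_cm, distance_cm, magnification=1.0):
    """Visual angle (degrees) of an extent centered on the line of sight."""
    half_extent = extent_cm * magnification / 2.0
    return math.degrees(2.0 * math.atan(half_extent / distance_cm))


VIEW_DISTANCE_CM = 85.0
LENS_MAGNIFICATION = 1.19  # "approximately 19%" magnification

width = visual_angle_deg(40.0, VIEW_DISTANCE_CM, LENS_MAGNIFICATION)
height = visual_angle_deg(30.0, VIEW_DISTANCE_CM, LENS_MAGNIFICATION)
print(f"{width:.1f} x {height:.1f} deg")  # 31.3 x 23.7 deg
```

Without the magnification factor the same display would subtend a noticeably smaller angle (about 26.5° horizontally), which is why the lens is assumed to be included here.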

Procedure

The experiment was run in a dark room. The observers viewed the display binocularly through the collimating lens with their head position fixed by a chin rest. On each trial, a white cross first appeared in the center of the screen. The observers were instructed to fixate the cross and press the left button of the mouse to initiate the trial. The cross then disappeared and a scene composed of two surfaces was presented. The initial scene (A) and the modified scene (A’) were presented for 250 ms each in the sequence of A, A, A’, A’, with a gray screen presented for various ISI (80 ms, 160 ms, or 240 ms) after each scene (see Figure 2). The purpose of presenting each scene twice was to create temporal uncertainty about when the change occurred (Rensink, O’Regan & Clark, 1997). The initial and modified scenes were thus alternated every 660 ms, 820 ms, or 980 ms, depending on the ISI. The task of the observers was to observe the scenes carefully and continuously and detect the square that changed luminance between two successive scenes. The observers were informed that the target square was likely to appear with equal probability on the ground or ceiling surface. Observers were shown examples of a change on a ground and ceiling surface. They were allowed to move their eyes once a trial began. They were instructed to respond as soon as they found a change by pressing the left button of the mouse, although they were not instructed that the response time was recorded. Twenty-five percent of the trials contained no change in the scene. If observers did not find the target square, they were instructed to continue viewing the scenes. On each trial, the sequence was repeated 10 times or until the observer responded. The average number of alterations that an observer needed to detect a change was recorded. Feedback was not provided during the practice trials or the experiment.
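The timing of the flicker sequence can be summarized in a few lines. This sketch assumes, as described above, that every scene is shown for 250 ms and followed by one gray-screen ISI, and that the four-scene sequence A, A, A’, A’ repeats at most 10 times; the function names are illustrative.

```python
SCENE_MS = 250          # presentation duration of each scene
SCENES_PER_CYCLE = 4    # A, A, A', A'


def alternation_period_ms(isi_ms):
    """Time between successive switches from one scene version to the other."""
    return 2 * (SCENE_MS + isi_ms)


def max_trial_duration_ms(isi_ms, repetitions=10):
    """Longest possible trial if the observer never responds."""
    return repetitions * SCENES_PER_CYCLE * (SCENE_MS + isi_ms)


for isi in (80, 160, 240):
    print(isi, alternation_period_ms(isi), max_trial_duration_ms(isi))
```

The alternation periods come out to 660 ms, 820 ms, and 980 ms, matching the values given in the text, and ten repetitions of the four-scene sequence give the 40 scenes mentioned in the Figure 2 caption.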

Figure 2. Example of the sequence of each trial used in Experiment 1. A trial continues for 20 alterations (40 scenes) or until the subject responds.

Results and discussion

Due to the relatively small number of no-change trials (6 trials for each of the 6 combinations), we did not use the false alarm rate and the hit rate to calculate the sensitivity score (d’). Instead, the hit rate and the number of alterations needed to detect a change were measured as dependent variables. No-change trials served as “catch trials” to ensure that observers followed instructions. We established an exclusion criterion of a false-alarm rate of 10% or greater. All observers showed a false alarm rate less than 10%.
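To illustrate why so few no-change trials make d’ impractical, the sketch below uses the standard signal detection definition d’ = z(hit rate) − z(false alarm rate). It is not an analysis performed in the study; the function name and the example rates are hypothetical.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Standard sensitivity index: z(hit rate) - z(false alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


# With 6 no-change trials per condition, the false alarm rate can only
# take the values 0/6, 1/6, ..., 6/6.
possible_fa_rates = [k / 6 for k in range(7)]
print(possible_fa_rates)

# A single false alarm moves the rate from 0 to about .17, and a rate of
# exactly 0 has no finite z-score, so inv_cdf raises an error.
print(d_prime(0.95, 1 / 6))   # hypothetical rates, for illustration only
try:
    d_prime(0.95, 0.0)
except ValueError as err:
    print("undefined:", err)
```

Because most observers would produce zero or one false alarm per condition, d’ would be either undefined or determined almost entirely by a single response.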

The hit rate (proportion of change trials detected) was calculated for each subject in each condition and analyzed in a 2 (surface in which a change occurred) by 3 (ISI) analysis of variance (ANOVA). The main effect of surface type was significant (F(1, 8) = 6.99, p < .05). The hit rate was 95.1% when the change occurred on a ground surface and 88.5% when the change occurred on a ceiling surface. The main effect of ISI was also significant (F(2, 16) = 3.86, p < .05). Post hoc comparisons (Tukey HSD Test) indicated a significant difference (p < .05) between the 80 ms (93.8%) and 240 ms (88.9%) ISI conditions. No other pairwise comparisons were significant (p > .05). Although the interaction between surface type and ISI was not significant (F(2, 16) = 2.10, p = .15; see Figure 3), there was a trend for the hit rate to decline with greater ISIs for the ceiling surface.

Figure 3. Hit rate as a function of ISI and surface type from Experiment 1. Error bars represent ± 1 standard error.

A two-way repeated-measures ANOVA was also conducted on the mean number of alterations required to detect a change. Data from change trials in which no response occurred (“miss” trials) were not included in the analysis. The main effect of surface type was significant (F(1, 8) = 10.18, p < .05; see Figure 4). The mean number of alterations for the ground and ceiling surface was 5.07 and 6.18, respectively. The main effect of ISI (F(2, 16) = 0.12) and the interaction between ISI and surface type (F(2, 16) = 0.74) were not significant (p > .05).

Figure 4. The number of alterations needed for change detection as a function of ISI and surface type from Experiment 1. Error bars represent ± 1 standard error.

Overall the results indicate better performance in detecting a change on a ground surface than on a ceiling surface. Although the accuracy rate was similar between the two surfaces when the ISI was 80 ms, observers required an average of 1.5 more alterations, or 32% more time, to detect a change on a ceiling as compared to ground surface.

Experiment 2

Overall, the results of Experiment 1 were consistent with the hypothesis that a ground surface, compared to other environmental surfaces, serves a special role in the perceptual organization of scenes. We believe that this effect is due to the unique projection of the ground surface. That is, the ground surface recedes in depth from bottom to top of its projected image, whereas the ceiling surface recedes in depth from top to bottom of its projected image. Since a ground surface exists almost everywhere and is utilized more frequently in everyday behavior, our visual system should be more adapted to the projection of a ground surface than to the projection of a ceiling surface. However, this effect may be due to the location of the ground surface in the visual field. Previous studies have found improved performance for processing visual information in the lower, as compared to upper, visual field. These studies include tasks such as visual search (He, Cavanagh, & Intriligator, 1996; Ellison & Walsh, 2000), figure-ground segregation (Rubin, Nakayama & Shapley, 1996; Vecera, Vogel & Woodman, 2002), and visually-guided actions (Danckert & Goodale, 2001). Although we did not restrict eye movements, and thus the location of the ground surface in the visual field may have varied during a trial, the ground surface was always located in the lower part of the display whereas the ceiling surface was always located in the upper part of the display.

Another possible explanation for the results of Experiment 1 is that observers were focusing attention more on the ground surface than on the ceiling surface. Previous research on change detection has found results suggesting that attention is an important factor in detecting changes in a scene. For example, Rensink, O’Regan & Clark (1997) found that changes to objects rated with higher interest were easier to detect than changes to objects rated with marginal interest.

In Experiment 2 we modified the displays in order to investigate the importance of location in the visual field and attention. The first modification was that one surface (a ground or ceiling) was presented on each trial. This modification ensured that observers were not preferentially attending to one surface over another surface within a trial. If the ground surface advantage observed in Experiment 1 was due to preferential attention to the ground surface, then performance should be similar between the two surfaces. However, if the results of Experiment 1 were due to a ground surface advantage then greater change detection should occur for displays with ground as compared to ceiling surfaces. The second modification was that the surface was presented in the bottom, middle, or top of the display, similar to the stimuli examined in Bian, Braunstein and Andersen (2006). If the ground surface advantage was due to the location of the ground surface in the lower visual field, then performance should be the same for ground and ceiling surfaces. However, if the ground surface advantage obtained in Experiment 1 was due to more efficient encoding of the visual representation, then greater accuracy and faster responses in detecting a change should occur for scenes with a ground surface regardless of the location in the display.