Focus cues affect perceived depth

Simon J Watt; Kurt Akeley; Marc O Ernst; Martin S Banks

doi:10.1167/5.10.7

Author manuscript; available in PMC: 2009 Apr 9.

Published in final edited form as: J Vis. 2005 Dec 15;5(10):834–862. doi: 10.1167/5.10.7

PMCID: PMC2667386 NIHMSID: NIHMS27508 PMID: 16441189

Abstract

Depth information from focus cues—accommodation and the gradient of retinal blur—is typically incorrect in three-dimensional (3-D) displays because the light comes from a planar display surface. If the visual system incorporates information from focus cues into its calculation of 3-D scene parameters, this could cause distortions in perceived depth even when the 2-D retinal images are geometrically correct. In Experiment 1 we measured the direct contribution of focus cues to perceived slant by varying independently the physical slant of the display surface and the slant of a simulated surface specified by binocular disparity (binocular viewing) or perspective/texture (monocular viewing). In the binocular condition, slant estimates were unaffected by display slant. In the monocular condition, display slant had a systematic effect on slant estimates. Estimates were consistent with a weighted average of slant from focus cues and slant from disparity/texture, where the cue weights are determined by the reliability of each cue. In Experiment 2, we examined whether focus cues also have an indirect effect on perceived slant via the distance estimate used in disparity scaling. We varied independently the simulated distance and the focal distance to a disparity-defined 3-D stimulus. Perceived slant was systematically affected by changes in focal distance. Accordingly, depth constancy (with respect to simulated distance) was significantly reduced when focal distance was held constant compared to when it varied appropriately with the simulated distance to the stimulus. The results of both experiments show that focus cues can contribute to estimates of 3-D scene parameters. Inappropriate focus cues in typical 3-D displays may therefore contribute to distortions in perceived space.

Keywords: accommodation, blur, cue combination, depth perception, stereoscopic displays, virtual reality

Introduction

Overview

Consider two viewing conditions: a complex real scene viewed binocularly and a computer display of the same scene. The computer display is carefully constructed so that all the traditional depth cues (binocular disparity, texture gradients, occlusion, shading, etc.) are geometrically correct. Thus, the geometric patterns of stimulation striking the two eyes are the same in the two cases. Despite the fact that the stimulation patterns are the same, psychophysical research (e.g., Buckley & Frisby, 1993; Ellis, Smith, Grunwald, & McGreevy, 1991; Frisby, Buckley, & Duke, 1996; Frisby, Buckley, & Horsman, 1995; van Ee, Banks, & Backus, 1999) and experience with virtual reality displays (Thompson et al., 2004) lead one to expect that the perceived 3-D structure will differ in the two cases: the depth in the computer display will appear flattened relative to the real scene from which it is derived.

A plausible cause for depth flattening is the fact that computer displays present images on one surface: the phosphor grid for cathode-ray displays (CRTs), the pixel grid for liquid crystal displays (LCDs), and the projection screen for projectors. This means that depth information from focus cues—accommodation and the retinal blur gradient—is inconsistent with the depicted scene. Instead the information specifies the depth of the display surface. We examined whether such inappropriate focus cues contribute to distortions in perceived depth when viewing 3-D computer displays.

Combining information from multiple depth cues

The 3-D structure of a visual scene is inferred from the 2-D retinal images. The visual system does not rely arbitrarily on one depth cue or another but combines information from multiple available cues to estimate the 3-D parameters of the scene. Consider the case of recovering the slant of a plane. The visual system’s estimate of slant from a given cue can be represented by

{\hat{S}}_{i} = f_{i} (S),

where S is the slant being estimated and f_i is the operation by which the visual system does the estimation; the subscript i denotes the cue. Estimates of slant from each cue (Ŝ_i) are subject to error. When multiple cues are available, the most likely slant can be calculated from a weighted linear combination of the slants indicated by the individual cues (provided that the noises associated with cue measurement are independent and Gaussian distributed, and that all slants are equally likely):

\hat{S} = \sum_{i} w_{i} {\hat{S}}_{i}, \qquad (1)

where

w_{i} = \frac{1 / \sigma_{i}^{2}}{\sum_{j} 1 / \sigma_{j}^{2}} . \qquad (2)

The weights (w_i) are the normalized inverse variances of the cue distributions (Ŝ_i, with variances σ_i²), so greater weight is assigned to less variable (i.e., more reliable) cues (Backus & Banks, 1999; Ernst & Banks, 2002; Ghahramani, Wolpert, & Jordan, 1997; Jacobs, 1999; Oruç, Maloney, & Landy, 2003). The variance of the combined estimate is lower than the variance of any single-cue estimate, so by combining information from several depth cues, the visual system can in principle estimate slant (or any other 3-D property) with greater precision than it can by relying on one cue alone. There are now many empirical studies showing that cue reliability is taken into account when combining sensory signals (e.g., Backus & Banks, 1999; Buckley & Frisby, 1993; Jacobs, 1999; Körding & Wolpert, 2004; van Beers, Sittig, & Denier van der Gon, 1998; van Beers, Wolpert, & Haggard, 2002). Furthermore, several studies have tested the quantitative predictions of this model by measuring the reliability of the underlying estimators when only one cue is informative and using these to predict performance when multiple cues are available (Alais & Burr, 2004; Ernst & Banks, 2002; Gepshtein & Banks, 2003; Hillis, Watt, Landy, & Banks, 2004; Knill & Saunders, 2003; Landy & Kojima, 2001). These studies show that performance is often close to that predicted by the statistically optimal model (in the sense of being the minimum variance unbiased estimate; Ghahramani et al., 1997).
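The weighting scheme in Equations 1 and 2 can be illustrated with a short sketch. The function name and the numerical values below are hypothetical and chosen only for illustration; the combined standard deviation uses the standard result for independent Gaussian cues, which is an assumption stated here rather than a formula given in the text.

```python
def combine_cues(estimates, sigmas):
    """Reliability-weighted average of single-cue estimates (Equations 1 and 2)."""
    # Reliability of each cue is its inverse variance, 1 / sigma_i^2.
    reliabilities = [1.0 / s ** 2 for s in sigmas]
    total = sum(reliabilities)
    # Equation 2: each weight is a reliability normalized by the sum over all cues.
    weights = [r / total for r in reliabilities]
    # Equation 1: the combined estimate is the weighted sum of the cue estimates.
    combined = sum(w * e for w, e in zip(weights, estimates))
    # Assumed standard result for independent Gaussian noise:
    # combined variance = 1 / (sum of reliabilities).
    combined_sigma = (1.0 / total) ** 0.5
    return combined, weights, combined_sigma


# Hypothetical example: disparity signals 30 deg (sigma 4 deg),
# focus cues signal 0 deg (sigma 12 deg).
slant, weights, sigma = combine_cues([30.0, 0.0], [4.0, 12.0])
print(slant, weights, sigma)
```

With these illustrative numbers the more reliable cue receives nine times the weight of the less reliable one, and the combined standard deviation is smaller than either single-cue value.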

Inappropriate focus cues in 3-D displays

The abovementioned research suggests that the visual system uses all available sources of information to compute 3-D scene parameters. This has important implications for 3-D computer displays because unmodeled depth cues could affect the percept, causing it to differ from the depicted scene. In almost all computer displays, the focal distance of the light from the display is fixed because the images are presented on one surface (for counter-examples, see Akeley, Watt, Girshick, & Banks, 2004; McQuaide, Seibel, Burstein, & Furness, 2002). This provides inappropriate depth information in two ways.

First, the variation in blur in the retinal image is consistent with the fixed distance of the display surface and not with the distances in the simulated scene. With real scenes, the amount of retinal blur varies because the distance of points in the scene varies with respect to the eye’s focal distance: the retinal image is sharpest for objects at the focal distance and blurred for points nearer and farther away. In computer displays, the variation in blur specifies the constant distance of the display surface and is thus a cue to flatness.

Second, accommodation provides an extra-retinal cue signaling the constant distance of the display surface. As the eye looks around a real scene, commands are sent to the ciliary muscles to change the refractive power of the crystalline lens and thereby minimize blur for the fixated part of the scene. As the eye looks around the simulated scene in a computer display, the focal distance of the light does not vary appropriately, so this again signals flatness rather than the simulated depth variation.

If blur and accommodation provide inputs to the calculation of depth, their erroneous values can in principle adversely affect percepts of 3-D scene structure.

Inappropriate motion parallax in 3-D displays

In many settings (including psychophysical experiments), the observer’s head position is not strictly constrained. For a viewing distance of 28.5 cm (used in our first experiment), head movements of a few millimeters could result in a detectable signal to depth from motion parallax (Rogers & Graham, 1982). As with focus cues, residual motion parallax specifies the distance to the display rather than distances in the simulated scene. If parallax is figured into the brain’s calculation of depth, its erroneous value will adversely affect 3-D percepts.

Unlike the problem of inappropriate focus cues, there are straightforward solutions to this problem: one can track head position and update the image accordingly (Welch et al., 1999) or one can immobilize the head position. Therefore, we did not explicitly examine whether residual motion parallax contributes to distortions in perceived depth when viewing 3-D displays (but see the Isolating information from accommodation and blur section).

Implications for psychophysics

Powerful 3-D computer graphics has revolutionized research on depth perception. Psychophysicists no longer have to rely on shadow casters (Gibson, Gibson, Smith, & Flock, 1959), glass plates (Ogle, 1950), or other mechanical means to create stimuli. Using modern computer graphics, they can now create realistic 3-D images and independently manipulate depth cues. As a result, great advances have occurred in the last three decades. However, if focus cues affect perceived depth from conventional computer displays, many observations in the depth perception literature may not be representative of vision in the natural environment. Here we describe two illustrative examples from the literature: (1) the perceived depth of computer-displayed versus real ridges, and (2) the slant-contrast illusion.

Buckley and Frisby (1993) examined the perceived depth of CRT-displayed and real ridges. The stimuli depicted vertical or horizontal parabolic ridges. The authors independently manipulated the disparity- and texture-specified depths of the ridges. With CRT stimuli, they did this in conventional fashion by programming different disparity and texture signals. With the real ridges, they did it by distorting the texture on the card covering wooden forms to create the desired texture gradient viewed from the observer’s eye. The data from the CRT stimuli (vertical ridges) revealed clear effects of disparity and texture: Disparity dominated when the texture-specified depth was large and texture dominated when the texture depth was small. In the framework of the cue-weight model (Equations 1 and 2), the disparity and texture weights changed depending on the texture-specified depth. The data from the real-ridge stimuli were quite different: The disparity-specified depth now dominated the percept. The important point for our purposes is that the CRT-based and real-ridge results differed dramatically.

Buckley and Frisby (1993) speculated that focus cues played an important role in the striking difference between the CRT and real results. In Appendix C, we quantify and generalize their argument by translating it into the framework of the weight model. The fact that more depth was perceived in real than in CRT-displayed ridges suggests that focus cues contributed to the depth calculation in their experiments (see also Frisby et al., 1995).

We cannot tell from the Buckley and Frisby (1993) experiments whether depth percepts were veridical once focus cues were consistent with the depth specified by disparity and texture. The reason is that responses were judged depth in cm and we cannot know whether the mapping between perceived depth and depth responses is veridical. For our purposes, the important point is that observers reported and therefore presumably saw more depth when focus cues were consistent with the depth specified by other cues.

Now consider the second example: the slant-contrast illusion (Sato & Howard, 2001; van Ee & Erkelens, 1996; Werner, 1937). In this illusion, a central object is presented that has the disparity and texture gradients of a frontoparallel plane. It is surrounded by a surface that typically has the texture gradient of a frontoparallel plane but the disparity gradient of a slanted plane. The presence of the surrounding plane causes the central object to appear slanted in a direction opposite to the disparity-specified slant of the surround. Interesting psychophysical effects draw researchers’ attention, so several theories have been developed to explain the illusory slant. Most share the idea that disparity-encoding mechanisms have antagonistic, center-surround receptive fields for disparity (in analogy to the center-surround organization of receptive fields in the luminance domain). Such mechanisms are allegedly less responsive to zero- and first-order disparities (absolute disparity and the relative disparity associated with a slanted plane, respectively) than to second- and higher-order disparities (the disparity associated with curvature or discontinuities in depth) (Anstis, Howard, & Rogers, 1978; Brookes & Stevens, 1989; Gillam, Chambers, & Russo, 1988; Mitchison, 1993; Rogers & Graham, 1983; van Ee & Erkelens, 1996; Westheimer, 1986).

van Ee et al. (1999) measured the magnitude of the slant-contrast illusion when the stimulus was presented as a conventional computer display and as real surfaces. They observed a typically large illusion with the computer display, but no illusion at all with the real surfaces. The computer-displayed and real-surface stimuli had the same dimensions and were viewed from the same distance, so the disparity- and texture-gradient signals created by the two stimuli were identical. The fact that one produced the illusion and the other did not means that the encoding of disparity (and the texture gradient) per se cannot be the cause of the illusion. van Ee et al. argued that cue conflicts between geometric cues (disparity and texture) and inappropriate focus cues caused the illusion in the computer-displayed stimuli. The conflicts were eliminated in the real-surface stimulus and so the illusion was eliminated. Sato and Howard (2001) also showed that manipulating the magnitude of cue conflicts has a large effect on the slant-contrast illusion when the disparity signals are held constant. Our point is that cue conflicts between disparity, texture, and the previously unmodeled cues of blur and accommodation affect or may even cause the slant-contrast illusion. Thus, previous theories of the illusion are attempting to explain an illusion that may not occur in the natural environment, when all cues signal the same depth structure.

The potential importance of inappropriate focus cues is not restricted to stereoscopic vision. We argue in the General discussion section that investigations of any aspect of visual space perception should take the potentially confounding effects of those cues into account.

Recovering depth from blur

We define blur in the retinal image as the spread of the optical point-spread function (Westheimer, 1986). For a fixed accommodative state, the amount of blur in the image of an object is roughly proportional to the focus error in diopters (Green & Campbell, 1965; Mather & Smith, 2000; Smith, Jacobs, & Chan, 1989). Objects at different distances are blurred by different amounts, signaling depth variations in the scene. Interpreting this signal is complicated by two factors. First, the sign of depth variation is undetermined because the retinal images of objects nearer or farther than fixation can be equally blurred. Second, the magnitude of the depth signaled is ambiguous because for a given accommodative state, blur depends not only on the distance of an object from fixation, but also on the visual system’s depth of focus, which in turn depends on pupil size and the spatial frequency content of the input, neither of which is known independently (Green, Powers, & Banks, 1980). For these reasons, it seems unlikely that metric depth can be recovered directly from retinal blur. However, the continuous microfluctuations that occur in accommodation (Campbell & Westheimer, 1959) and chromatic aberration could be used to disambiguate the blur signal (Nguyen, Howard, & Allison, 2005; Pentland, 1987). Additionally, eye movements could be used to sample changes in blur dynamically as the observer focuses on different parts of the scene. The sign of depth variations could also be disambiguated by other depth cues including binocular disparity and occlusion.
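The two ambiguities described above can be made concrete with a small sketch. It assumes a simple geometric-optics approximation that is not taken from the text: the angular diameter of the blur circle (in radians) is roughly the pupil diameter (in meters) times the absolute focus error (in diopters). The 4-mm pupil and the ±0.5 D offsets are hypothetical values.

```python
import math


def focus_error_diopters(object_distance_m, focal_distance_m):
    """Signed focus error: positive for objects nearer than the focal distance."""
    return 1.0 / object_distance_m - 1.0 / focal_distance_m


def blur_diameter_arcmin(object_distance_m, focal_distance_m, pupil_diameter_m):
    """Approximate blur-circle diameter (assumed geometric-optics model)."""
    defocus = abs(focus_error_diopters(object_distance_m, focal_distance_m))
    return math.degrees(pupil_diameter_m * defocus) * 60.0


FOCAL_DISTANCE_M = 0.285   # eye focused at the viewing distance of Experiment 1
PUPIL_M = 0.004            # hypothetical pupil diameter

focal_power = 1.0 / FOCAL_DISTANCE_M
near_m = 1.0 / (focal_power + 0.5)   # object 0.5 D nearer than fixation
far_m = 1.0 / (focal_power - 0.5)    # object 0.5 D farther than fixation

err_near = focus_error_diopters(near_m, FOCAL_DISTANCE_M)
err_far = focus_error_diopters(far_m, FOCAL_DISTANCE_M)
blur_near = blur_diameter_arcmin(near_m, FOCAL_DISTANCE_M, PUPIL_M)
blur_far = blur_diameter_arcmin(far_m, FOCAL_DISTANCE_M, PUPIL_M)
# Doubling the pupil doubles the blur for the same object distance.
blur_far_big_pupil = blur_diameter_arcmin(far_m, FOCAL_DISTANCE_M, 2 * PUPIL_M)

print(err_near, err_far, blur_near, blur_far, blur_far_big_pupil)
```

Under this approximation the nearer and farther objects produce the same blur magnitude despite opposite signs of focus error, and the blur for a given object changes with pupil size, which is why blur alone leaves both the sign and the magnitude of depth undetermined.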

Some psychophysical studies have reported a modest effect of the blur gradient on judgments of perceived depth (Marshall, Burbeck, Ariely, Rolland, & Martin, 1996; Mather, 1996, 1997; Mather & Smith, 2000, 2002; O’Shea, Govan, & Sekuler, 1997). In these studies, the blur gradient was varied artificially by blurring the displayed object in selected regions to simulate the effects of defocus, and most used brief presentations. This means that the abovementioned strategies for disambiguating the depth signaled by blur could not have been used. It is thus possible that the blur gradient is a more useful depth cue in natural viewing than previously realized.

Recovering depth from accommodation

The efferent signal to the muscles controlling the crystalline lens could be a depth cue because the magnitude of the response required to focus the retinal image depends directly on the distance from the eye to the fixated object. To be a useful depth cue, the accommodative system must respond reliably to changes in focal distance and the visual system must be able to monitor the muscle commands. Accommodation to isolated, high-contrast targets is reliably related to changes in a target’s focal distance (Campbell & Westheimer, 1959; Charman & Tucker, 1977; Heath, 1956). Indeed, accommodation can occur to changes in retinal blur that are below perceptual threshold (Kotulak & Schor, 1986).

In contrast to the blur gradient (and most other depth cues), accommodation can in principle be used to recover the absolute distance to fixation. Several studies have examined distance estimates with verbal or pointing responses based on the accommodative response to single targets and have shown that observers’ estimates are correlated with target distance, but that accuracy is poor and variability is high (Baird, 1903; Biersdorf, 1966; Dixon, 1895; Fisher & Ciuffreda, 1988; Foley, 1977; Hillebrand, 1894; Künnapas, 1968; Mon-Williams & Tresilian, 2000; Peter, 1915; Swenson, 1932; Wundt, 1862). In principle, accommodation can also provide information about surface structure if estimates of relative distance are compared over successive fixations. Accommodation, like blur, could therefore be a more useful depth cue in complex scenes than the existing psychophysical data suggest.

Direct versus indirect influence of focus cues on perceived depth

The above discussion examined how blur and accommodation could be used directly in estimating depth. Accommodation could also have an indirect effect on perceived depth by interacting with stereopsis. Binocular disparity is an important and reliable depth cue. But horizontal disparities are inherently ambiguous because 3-D layout cannot be determined from them without scaling by an estimate of viewing distance (Gårding, Porrill, Mayhew, & Frisby, 1995). To perform the scaling, the brain uses the eyes’ vergence and the horizontal gradient of vertical disparity (Rogers & Bradshaw, 1995). In principle, accommodation can also provide an estimate of fixation distance, which may in turn influence disparity scaling. In computer displays, the accommodative stimulus is the distance to the display screen and not the simulated distance. This erroneous information may affect depth percepts indirectly via disparity scaling.
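A minimal sketch of why disparity must be scaled by distance, using the standard small-angle approximation that relative disparity is roughly I·Δd/d² (I is the interocular distance, Δd the depth interval, d the viewing distance). This approximation, the function names, and all numerical values are assumptions introduced for illustration; the sketch uses a depth interval rather than slant for simplicity.

```python
def disparity_from_depth(depth_m, distance_m, iod_m):
    """Small-angle approximation of relative disparity (radians)."""
    return iod_m * depth_m / distance_m ** 2


def depth_from_disparity(disparity_rad, assumed_distance_m, iod_m):
    """Invert the approximation using whatever distance estimate is available."""
    return disparity_rad * assumed_distance_m ** 2 / iod_m


IOD_M = 0.062               # hypothetical interocular distance
SIMULATED_DISTANCE_M = 0.5  # hypothetical simulated distance
TRUE_DEPTH_M = 0.02         # hypothetical depth interval

disparity = disparity_from_depth(TRUE_DEPTH_M, SIMULATED_DISTANCE_M, IOD_M)

# Scaling with the correct distance recovers the simulated depth.
depth_correct = depth_from_disparity(disparity, SIMULATED_DISTANCE_M, IOD_M)

# If the distance estimate is pulled toward a nearer focal distance
# (hypothetically 0.4 m), the same disparity is read as less depth.
depth_biased = depth_from_disparity(disparity, 0.4, IOD_M)

print(disparity, depth_correct, depth_biased)
```

Because the recovered depth grows with the square of the assumed distance, even a modest bias in the distance estimate changes the interpreted depth appreciably.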

There is a small literature on indirect effects of accommodation on perception. Fisher and Ebenholtz (1986), Mon-Williams and Tresilian (2000), and Wallach and Norris (1963) observed an influence of accommodation on depth interpretation (for a negative result, see Ritter, 1977). Heinemann, Tulving, and Nachmias (1959) and von Holst (1973) observed an influence of accommodation on perceived size.

Direct and indirect effects of focus cues were examined in Experiments 1 and 2, respectively.

Experiment 1

In the first experiment, slant specified by geometric cues (texture and binocular disparity) was varied independently from slant specified by focus cues.

Because so many reliable depth cues are available in natural viewing, focus cues should have only a small influence on the recovery of 3-D scene properties in natural conditions. The simulated scenes used in psychophysical experiments are often impoverished in order to study individual cues and their interactions. An example is the sparse random-dot stereogram, which allows researchers to isolate binocular disparity while making all other cues uninformative or unreliable. Focus cues may have more influence under these circumstances. To examine this possibility, we measured the effect of varying focus cues on slant estimates when the stimulus was defined by only binocular disparity or by only the texture gradient.

Methods

Observers

Three observers participated, aged 24–29 years. All had normal vision and stereoacuity. All were experienced psychophysical observers. One (AJW) was naïve to the experimental purpose. The other two knew the general purpose but not the specifics.

Apparatus

The layout of the apparatus is schematized in Figure 1. The stimuli were displayed on a conventional 21-in. CRT (KDS VS21e) with 1600 × 1024 resolution. Each pixel subtended 2.9 × 2.9 arcmin. To manipulate the information from focus cues, the monitor was rotated about the vertical axis passing through the center of its front surface.

Focus cues issuing from the phosphor grid specified a surface that was not exactly a plane for two reasons: (1) the surface containing the phosphor grid was slightly curved, and (2) the grid’s virtual distance was affected by refraction due to the front glass plate. (We could not use a flat-panel LCD because the luminance of such displays depends strongly on viewing angle.)

Dichoptic presentation of the left- and right-eye images was achieved using CrystalEyes™ liquid crystal shutter glasses. The monitor refresh rate was 100 Hz, so each eye’s image was redrawn at 50 Hz. It was crucial to have no artifactual cues to the monitor’s slant, so we were careful to eliminate cross-talk through the glasses (aided by drawing the images with the red phosphor only) and to eliminate the observer’s ability to see the monitor casing (accomplished by masking the casing and by periodically light-adapting the observer). We checked that observers could not determine the monitor’s slant in a pilot experiment. In the monocular-viewing conditions of the main experiment, observers wore a patch over their left eye.

We used anti-aliasing to specify the position of stimulus elements to subpixel accuracy. Stimuli were rendered using OpenGL (Segal & Akeley, 2002) and the associated utility library, GLUT (Kilgard, 1996). Precise reproduction of visual directions was achieved using a spatial calibration technique similar to the one described by Backus, Banks, van Ee, and Crowell (1999). A wire-filament loom was placed in a known position in front of the monitor and the experimenter aligned individual dots with the loom intersections. During calibration, the experimenter’s head was carefully positioned using a bite bar, which was adjusted so as to position the eyes’ centers of rotation in known positions relative to the display. Two-dimensional polynomial functions were used to fit the x and y values from the loom calibration to pixel space in which the stimuli were rendered. These equations provided a continuous look-up table relating pixel space and physical screen space. When the stimulus was drawn, each stimulus element (squares or lines) was subdivided into a series of smaller polygons and the position of each vertex of these was corrected using the look-up table. This procedure corrected overall dot positions and line endpoints, and it also closely approximated the correct calibration for the outlines of the stimulus elements. Because of the calibration procedure, the geometric properties of the stimulus were matched for all monitor slants. The spatial calibration procedure was carried out separately for the left and right eyes at each monitor slant used in the experiment.

During the main experiment, the observer’s head position was stabilized using a conventional chin rest. A sighting technique (Hillis & Banks, 2001) was used to position the chin rest precisely. We chose this method for head constraint to mimic the most common practice in the psychophysical literature. As discussed previously, it is possible that motion parallax resulting from small head movements may have provided an additional cue to the physical slant of the monitor. Possible implications of this, and additional control conditions in which the head was immobilized with a bite bar, are described in the Isolating information from accommodation and blur section. A response figure was presented on a second CRT. It was viewed via a mirror so that observers could respond without making head movements (Figure 1).

Stimuli

The stimuli were planes rotated about the vertical axis (tilt = 0°). We independently manipulated two cues to slant: (1) focus cues, which were manipulated by varying monitor slant, and (2) the simulated slant of the surface, which was specified by geometric information from disparity and texture cues. We refer to monitor slant as S_m and simulated slant as S_s, respectively (Figure 2).

Figure 2. Plan view of the stimulus configuration for Experiment 1. The slants S_m and S_s were defined relative to the cyclopean line of sight. Slant in both cases is the angle between the line of sight to the middle of the display monitor (dotted line) and the surface normal for each cue (red and blue lines). Positive slant (shown here) is "right side back".

S_s was specified either by binocular disparity (disparity condition) or by the perspective projection of a textured pattern (texture condition). For all viewing conditions and values of S_s and S_m, the stimulus width was 35° with respect to the cyclopean eye.

Figure 3 shows how the stimuli were created. The stimulus generation method was used in both the disparity and texture conditions; only the right-eye’s image was displayed in the latter case. The stimulus width was matched with respect to the cyclopean eye (midway between the two eyes). Therefore, its angular extent in the right eye (its width in the texture condition) varied slightly as a function of S_s. The angular extent of the stimulus in either eye (and all other geometric properties) was unaffected by variations in S_m. Stimulus height at the axis of rotation on average was 28°. Due to random aspects of the stimulus generation method, there were small variations in stimulus height and width from trial to trial. The distance from the cyclopean eye to the rotation axis of the stimulus was always 28.5 cm. We chose this distance because it was short enough to create discriminable changes in focal distance while being long enough to allow accurate accommodation.
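To give a sense of the focal-distance variation that monitor slant produces, the following sketch computes the distance from the cyclopean eye to an idealized flat monitor along different azimuths. It assumes a perfectly planar display surface (the text notes the real surface was slightly curved and viewed through glass), and the sign conventions are chosen here for illustration: positive azimuth is rightward and positive slant is "right side back".

```python
import math

AXIS_DISTANCE_M = 0.285   # cyclopean eye to rotation axis (from the text)
HALF_WIDTH_DEG = 17.5     # half of the 35-deg stimulus width


def distance_to_monitor(azimuth_deg, slant_deg, axis_distance_m=AXIS_DISTANCE_M):
    """Distance along a line of sight to a plane rotated about a vertical axis."""
    a = math.radians(azimuth_deg)
    s = math.radians(slant_deg)
    # Ray (sin a, cos a) meets the plane through (0, d) with normal (-sin s, cos s).
    return axis_distance_m * math.cos(s) / math.cos(a + s)


def focal_demand_diopters(azimuth_deg, slant_deg):
    """Focal distance expressed in diopters (1 / meters)."""
    return 1.0 / distance_to_monitor(azimuth_deg, slant_deg)


left_0 = focal_demand_diopters(-HALF_WIDTH_DEG, 0.0)
right_0 = focal_demand_diopters(HALF_WIDTH_DEG, 0.0)
left_30 = focal_demand_diopters(-HALF_WIDTH_DEG, 30.0)
right_30 = focal_demand_diopters(HALF_WIDTH_DEG, 30.0)

print(left_0, right_0, left_30, right_30)
```

Under these assumptions a frontoparallel monitor gives equal focal demand at the two stimulus edges, whereas a 30° monitor slant gives a difference of about 1.2 D between the edges.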

Figure 3. The method of stimulus generation for Experiment 1. Step 1: Coordinates were defined for a homogeneous, frontoparallel pattern (randomly positioned squares or a Voronoi texture) 35° wide, measured at the cyclopean eye (CE, midway between the two eyes). Step 2: This pattern was scaled and translated in x such that after rotation by the angle S_s, it remained 35° wide, measured at the cyclopean eye. Step 3: The left- and right-eye’s images were determined by projecting the pattern onto the monitor plane using each eye’s position as the center of projection. The screen space was spatially calibrated (see text) so that the visual direction of each point on the stimulus was appropriate, and the retinal images at each value of S_s were geometrically equivalent at each monitor slant, S_m.

In the disparity condition, S_s was specified by the difference between left- and right-eye projections (calculated for each observer’s inter-ocular distance) of a pattern of randomly positioned square elements. We used squares instead of the more typical Gaussian blobs to provide a better stimulus to accommodation. The initial 2-D pattern (Figure 3, Step 1) was generated by drawing x and y square positions from a uniform random distribution. The average size of each square was 1.7 × 1.7 mm (1.7 mm ≈0.34° at the center of the stimulus). We minimized the informativeness of the texture cue by presenting few squares—roughly 0.2 square/deg²—in random positions. We also clipped the stimulus with an elliptical window (whose size and orientation varied randomly within a small range) so that the outline of the stimulus pattern did not provide a cue to S_s. The scaling process (Figure 3, Step 2) stretched the entire pattern, including the squares, so that when the stimulus was rotated, the angular width of the squares was on average constant across values of S_s. Each eye’s view was calculated by finding the intersection with the monitor plane of rays through the stimulus pattern and each eye’s center of rotation (Figure 3, Step 3). Using this procedure, the outline of each square was correctly projected in each eye’s view. This meant that the simulated slant of each square was consistent with S_s, and the monocular texture cue in each eye (including square density) was consistent with the disparity-specified slant. We could have used the conventional method, in which stimulus elements are shifted by equal and opposite amounts in the two eyes, thereby creating a texture gradient that is consistent with a frontoparallel plane. It was preferable, however, to use correct perspective projection because the conventional method yields a texture-specified slant of zero, which would have complicated the data interpretation.
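The projection step (Figure 3, Step 3) can be sketched in plan view as a ray-plane intersection. This is a simplified illustration, not the authors' rendering code: it works in two dimensions (x rightward, z away from the observer), omits the scaling of Step 2 and the spatial calibration, and uses a hypothetical interocular distance and stimulus point.

```python
import math


def surface_point(u_m, slant_deg, axis_distance_m):
    """Point at lateral offset u on a plane slanted about the vertical axis."""
    s = math.radians(slant_deg)
    return (u_m * math.cos(s), axis_distance_m + u_m * math.sin(s))


def project_to_monitor(point, eye_x_m, monitor_slant_deg, axis_distance_m):
    """Intersect the ray from the eye through `point` with the monitor plane.

    Returns the coordinate along the monitor surface and the (x, z) hit point.
    """
    m = math.radians(monitor_slant_deg)
    nx, nz = -math.sin(m), math.cos(m)        # monitor surface normal
    dx, dz = point[0] - eye_x_m, point[1]     # ray direction from the eye
    k = ((0.0 - eye_x_m) * nx + axis_distance_m * nz) / (dx * nx + dz * nz)
    hit = (eye_x_m + k * dx, k * dz)
    t = hit[0] * math.cos(m) + (hit[1] - axis_distance_m) * math.sin(m)
    return t, hit


def azimuth_deg(point, eye_x_m):
    """Visual direction of a point as seen from an eye at (eye_x_m, 0)."""
    return math.degrees(math.atan2(point[0] - eye_x_m, point[1]))


D = 0.285          # distance to the rotation axis (from the text)
RIGHT_EYE_X = 0.031  # hypothetical: half of a 62-mm interocular distance
p = surface_point(0.05, 20.0, D)   # hypothetical point on a 20-deg simulated slant

t_flat, hit_flat = project_to_monitor(p, RIGHT_EYE_X, 0.0, D)
t_rot, hit_rot = project_to_monitor(p, RIGHT_EYE_X, 30.0, D)
t_same, _ = project_to_monitor(p, RIGHT_EYE_X, 20.0, D)

print(t_flat, t_rot, azimuth_deg(hit_flat, RIGHT_EYE_X), azimuth_deg(hit_rot, RIGHT_EYE_X))
```

The screen coordinate changes with monitor slant, but the visual direction of the projected point from the eye does not, which is the property the design relies on: the geometric cues stay matched while the focus cues vary.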

In the texture condition, S_s was defined by the perspective projection of a Voronoi pattern (de Berg, van Kreveld, Overmars, & Schwarzkopf, 2000; see also Knill, 1998) viewed with the right eye. The stimuli consisted of 320 Voronoi cells on average. Each Voronoi pattern was created from an initial grid of 20 × 16 regularly spaced points. The x and y coordinates of each point were then perturbed by a random amount in the range ±0.2 times the inter-point spacing (equivalent to 0.36° in the center), and the Voronoi pattern defined by these points was calculated. The resulting pattern had ~0.33 Voronoi cells/deg². The stimulus was then scaled, rotated, and perspective projected into the monitor plane following the procedure in Figure 3. As with the random-dot stimulus, each line segment was correctly projected for the slant angle, S_s.
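The seed points for such a pattern can be sketched as a jittered grid. The text gives the grid size and the jitter range but not the jitter distribution, so a uniform distribution is assumed here; the function name and the unit spacing are also illustrative. The Voronoi tessellation itself is not computed in this sketch and would normally be delegated to a computational-geometry library.

```python
import random


def jittered_grid(n_cols=20, n_rows=16, spacing=1.0, jitter_fraction=0.2, rng=None):
    """Regular grid of seed points, each displaced independently in x and y."""
    rng = rng or random.Random()
    max_shift = jitter_fraction * spacing
    points = []
    for row in range(n_rows):
        for col in range(n_cols):
            # Assumed uniform jitter within +/- jitter_fraction * spacing.
            x = col * spacing + rng.uniform(-max_shift, max_shift)
            y = row * spacing + rng.uniform(-max_shift, max_shift)
            points.append((x, y))
    return points


seeds = jittered_grid(rng=random.Random(0))
print(len(seeds))
```

A 20 × 16 grid yields 320 seed points, matching the average number of Voronoi cells reported for the stimuli.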

The average luminance of a square or line seen through the shutter glasses was 0.9 cd/m², and the background luminance was 0.01 cd/m².

A new stimulus was drawn on each trial in both the disparity and texture conditions. In our experimental design, it was critical that the geometric information at a given value of S_s was equivalent for all values of S_m. We checked this empirically by viewing a simulated frontoparallel plane (S_s = 0°) through the calibration loom. The stimulus was identical at a range of values of S_m.

Our stimuli should have been good stimuli to accommodation because they were spatially complex and therefore contained a wide range of spatial frequencies (Charman & Tucker, 1977).

Procedure

Observers reported the amount of perceived slant for each combination of monitor slant (S_m) and simulated slant (S_s): 0°, ±10°, ±20°, and ±30°. They did so by setting the angle between two line segments to be equal to the perceived slant of the stimulus. The response figure consisted of a fixed horizontal line and a rotatable oblique line, the former representing the frontoparallel plane and the latter the perceived slant of the experimental stimulus. The oblique line started at a random orientation on each trial and could be adjusted by key presses in either direction in increments as small as 0.5°. This figure was viewed by the right eye on a second monitor via a mirror by making a small eye movement (Figure 1).

Before each trial, a small fixation square (0.35° × 0.35°) was presented in the center of the screen. The square was constructed and calibrated using the same methods as the stimulus. Its simulated slant was always frontoparallel, so it did not provide a cue to monitor slant. Each trial followed the same sequence. The fixation square first appeared for 1 s, then the stimulus for 2 s. Following stimulus offset, the response figure appeared on the second monitor and observers indicated the amount of slant they had seen. The response figure then disappeared followed by a 1-s blank display before the fixation square appeared on the main monitor for the next trial. The fixation square was not present during the stimulus presentation and observers were given no specific instructions about where to look. The observers completed six trials for each S_m × S_s combination in both the disparity and texture conditions: a total of 588 trials. Trials were blocked by monitor slant and viewing condition, both randomly ordered.
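The trial count follows from the factorial design, as this short sketch shows. It only enumerates the design described above; the variable names are illustrative, and the random ordering of blocks and of trials within a block is not modeled.

```python
from itertools import product

SLANTS_DEG = [-30, -20, -10, 0, 10, 20, 30]   # values used for both S_m and S_s
CONDITIONS = ["disparity", "texture"]
REPETITIONS = 6

# One block per (viewing condition, monitor slant); each block contains
# every simulated slant repeated six times.
blocks = {
    (condition, s_m): [s_s for s_s in SLANTS_DEG for _ in range(REPETITIONS)]
    for condition, s_m in product(CONDITIONS, SLANTS_DEG)
}
total_trials = sum(len(trials) for trials in blocks.values())
print(len(blocks), total_trials)  # prints: 14 588
```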

The apparatus was concealed behind a curtain when observers entered the room, and the experiment was conducted in complete darkness. Observers were always unaware of the monitor’s slant (the naïve observer was not aware that the monitor ever rotated). Between experimental blocks, observers were exposed to normal light levels to prevent dark adaptation. Before the main experiment, the observers completed two blocks of practice trials. All three observers reported a clear percept of depth in binocular and monocular conditions, and they were all readily able to do the task.

Results

Normalization of slant estimates

We cannot know the mapping between perceived slant and response setting, so we used the settings with the cues-consistent stimuli (S_m = S_s) to normalize the other data. We did this by transforming the raw data as follows. For each observer and condition, a response-mapping function was derived by least-squares fitting of a line (y = mx + c) to the mean slant estimates from the subset of the data for which S_m = S_s. If it is assumed that perceived slant was veridical for these cues-consistent stimuli, the settings can then be used as a yardstick to transform the data in the other conditions. The fitted function was used to scale each response into a normalized slant estimate. These values were then used to calculate the points in the data figures. Because the data were merely scaled to make effect sizes equivalent across conditions and observers, the relative effects within each condition were unaffected. The data were in every case well fitted by a line.
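A minimal sketch of this normalization is given below. It assumes that "scaling" a response means inverting the fitted line, so that a fitted cues-consistent setting at S_m = S_s = 30° maps to a normalized value of 30°. The function names and the example settings are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + c; returns (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x

def normalize(response, m, c):
    """Map a raw setting to a normalized slant by inverting the fitted line
    (assumed form of the response-mapping function)."""
    return (response - c) / m

# Hypothetical mean settings for the cues-consistent subset (S_m = S_s).
slants = [-30, -20, -10, 0, 10, 20, 30]
settings = [0.8 * s + 1.0 for s in slants]   # an observer who under-sets by 20%

m, c = fit_line(slants, settings)
normalized = [normalize(r, m, c) for r in settings]
```

The same m and c are then applied to the settings from the cues-inconsistent conditions for that observer and viewing condition, which rescales the data without changing the relative effects within a condition.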

The slopes of the normalization functions for the disparity (blue-gray bars) and texture conditions (red bars) are shown in Figure 4. The observers’ settings in the cues-consistent conditions were reasonably consistent across the disparity and texture conditions with the exception of observer JDB, whose settings in the texture condition were considerably smaller than in the corresponding disparity condition. Despite possible differences in the use of the response measure, it seems likely that the texture-defined planes looked less slanted than the disparity-defined planes to this observer.

Effect of slant on slant settings for the cues-consistent stimuli for each observer and condition in Experiment 1. The abscissa values are different observers. Blue-gray and red bars represent the disparity and texture conditions, respectively. The dark-blue and green bars represent two additional monocular conditions, described in the Isolating information from accommodation and blur section. The ordinate values are the slopes of the best-fitting lines relating slant to observer responses in the cues-consistent (*S_m* = *S_s*) subset of the data in each viewing condition. These values were used to normalize the raw responses for each observer (see the Normalization of slant estimates section).

Effects of monitor orientation on perceived slant

Figure 5 plots each observer’s average normalized slant estimates as a function of S_m in the disparity and texture conditions. Different colors represent different values of S_s. The solid lines are the best-fitting lines for each value of S_s. The data are plotted as a function of S_m, so effects of this variable are indicated by deviations from a horizontal line. (The data were fitted with lines for simplicity, but one would expect departures from linearity because the effect of focus cues is likely to vary with S_m; see the Evidence for reliability-based cue weighting section.) The normalization of the data is indicated here by the diamonds on the right side of each panel (see caption for explanation).

Average normalized slant estimates for each value of *S_s* as a function of *S_m* in Experiment 1. The upper row shows the data for the disparity condition and the lower row the data for the texture condition. The columns show the data for different observers. The horizontal dashed lines represent veridical estimates for each *S_s*. The colored symbols represent the data, different colors denoting different values of *S_s*. The circled points are the data for the cues-consistent (*S_m* = *S_s*) conditions. The colored lines are the best fits to the data for each *S_s*. The error bars in the upper left corner of each panel are ± the average *SEM*. The diamonds on the right side of each panel indicate the actual response settings for the cues-consistent stimuli at *S_m* = *S_s* = ±30°. The data were normalized such that the fitted settings at those points plotted at ordinate values of ±30°.

Consider the data for observer PRM. In the disparity condition, the data are clearly separated according to S_s, indicating that disparity was an effective slant cue. Monitor slant did not affect his judgments in this condition. For example, his slant estimates in the cues-inconsistent conditions (S_m ≠ S_s) did not differ noticeably from estimates in cues-consistent conditions (S_m = S_s; circled data points). In contrast, PRM’s slant estimates in the texture condition reveal a clear effect of monitor slant. Again, the data are separated according to S_s, indicating that the texture cue was effective. However, for most values of S_s, increasing or decreasing monitor slant had a systematic effect on his estimates, suggesting that focus cues affected perceived slant.

The results for observer AJW are similar. In the disparity condition, her slant estimates were less consistent than those of PRM, but there was no systematic effect of monitor slant. Her slant estimates in the texture condition varied systematically with S_m.

The results for observer JDB are more variable, but reasonably consistent with those of the other two observers. He showed no effect of monitor slant in the disparity condition and a somewhat inconsistent effect in the texture condition; in his data, the effect of monitor slant in the texture condition is most evident when one compares perceived slant when S_m = S_s to perceived slant when S_m = 0 (see Figure 6).

Average normalized slant estimates for *S_m* = *S_s* and *S_m* = 0 in Experiment 1. Each panel plots the normalized estimates as a function of *S_s*. Each column shows the data for a different observer. The upper and lower rows show the data from the disparity and texture conditions, respectively. The black circles are the data for *S_m* = *S_s* and the blue squares are the data for *S_m* = 0. The lines are best fits to the data. The slopes of the fitted lines to the cues-consistent data are constrained to be 1 as a result of the normalization process. Error bars are ±1 *SEM*.

Implications of the direct effect of focus cues for 3-D displays

Figure 6 illustrates implications for viewing simulated scenes as opposed to real scenes. The figure re-plots two subsets of the data: (i) the cues-consistent conditions (S_m = S_s), and (ii) the S_m = 0° condition. Normalized slant estimates are now plotted as a function of S_s instead of S_m. The cues-consistent condition is essentially equivalent to real-world viewing in that all cues specify the same depth structure. The S_m = 0 condition is the typical viewing situation in psychophysics in which the display surface is frontoparallel. The lines in Figure 6 are the best fits for each data subset. The slopes of the fitted lines to the cues-consistent data (black lines) are constrained to be 1 as a result of the normalization process. There was no systematic difference in the disparity condition between the cues-consistent and cues-inconsistent conditions. For all three observers, we calculated the difference between slant estimates in the cues-consistent and cues-inconsistent conditions at each value of S_s (except for S_s = 0, where the data in the two conditions are the same). The signs of the differences were adjusted so that a negative difference always indicated less estimated slant (stimulus appeared closer to frontoparallel) in the S_m = 0 condition irrespective of the sign of S_s. A one-sample t test showed that these difference scores were not significantly different from zero, indicating that slant estimates in the disparity condition were not reliably different in the cues-consistent and cues-inconsistent conditions, t(17) = 0.19, p = 0.85. This shows again that focus cues had no direct influence on slant percepts under binocular viewing. In the texture condition, all three observers reported seeing less slant when the monitor was frontoparallel (S_m = 0) compared to when all cues were consistent (S_m = S_s). The difference score analysis described above showed that this effect was statistically significant, t(17) = 4.18, p < 0.001. This suggests again that focus cues affected slant percepts directly under monocular viewing.
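The difference-score analysis can be sketched as follows. The sign adjustment and the one-sample t statistic follow the description above; the function names and the example estimates are hypothetical, and the p value (which needs the t distribution) is not computed here.

```python
import math

def signed_differences(consistent, frontoparallel):
    """Difference scores (S_m = 0 minus cues-consistent) for each nonzero S_s,
    with signs flipped for negative S_s so that a negative score always means
    less estimated slant in the S_m = 0 condition."""
    diffs = []
    for s_s, estimate in consistent.items():
        if s_s == 0:
            continue  # the two conditions coincide at S_s = 0
        sign = 1.0 if s_s > 0 else -1.0
        diffs.append(sign * (frontoparallel[s_s] - estimate))
    return diffs

def one_sample_t(values):
    """Return (t, degrees of freedom) for a test of the mean against zero."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean / math.sqrt(variance / n), n - 1

# Hypothetical normalized estimates for one observer.
slants = [-30, -20, -10, 0, 10, 20, 30]
consistent = {s: float(s) for s in slants}
frontoparallel = {s: 0.9 * s for s in slants}   # 10% less slant when S_m = 0

scores = signed_differences(consistent, frontoparallel)
t_value, dof = one_sample_t(scores)
```

With six nonzero values of S_s and three observers, pooling the scores gives 18 values and therefore the 17 degrees of freedom reported above.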

Isolating information from accommodation and blur

To determine if residual motion parallax contributed to the monitor-slant effect, we re-ran the monocular condition with the observers’ heads completely stabilized with a bite bar. To determine whether the blur gradient or accommodation made a greater contribution to the monitor-slant effect, we compared performance in two conditions: (1) the eye movement condition, in which observers made two horizontal eye movements from one edge of the stimulus to the other and back during the 2-s presentation, and (2) the fixation condition, in which observers maintained fixation on a small cross (0.75° × 0.75°) in the center of the screen before and during stimulus presentation. Accommodation should have varied much less in the fixation than in the eye movement condition, so by comparing the data in the eye movement and fixation conditions, we could assess the contribution of accommodation. By comparing the data in these two conditions to the original data in Figure 5 and Figure 6, we could determine the contribution of residual motion parallax.

The data were normalized using the abovementioned procedure. Figure 4 shows the slopes of the normalization functions for the eye movement (dark-blue bars) and fixation conditions (green bars). Once again, the observers’ settings were consistent across conditions with the exception of observer JDB, who made very small settings in the fixation condition (similar to those he made in the original texture condition). Again, this may be because for this observer the surfaces looked less slanted in this condition, although it is unclear why this should have been the case.

Figure 7 shows the results of the eye movement and fixation conditions in the same format as Figure 5. Results for the eye movement and the fixation conditions were quite similar to the results in the original texture condition (Figure 5, see also Figure 9). The similarity between the results in Figure 5 and Figure 7 implies that the monitor-slant effect in the texture condition was not caused by residual motion parallax or by differential accommodation accompanying eye movements. We conclude that retinal blur was the primary cause of the effect of monitor slant under monocular viewing.

Average normalized slant estimates for *S_s* as a function of *S_m* in the eye movement and fixation conditions in Experiment 1. The upper and lower rows show the data from the eye movement and fixation conditions, respectively. Each column shows data from a different observer. The horizontal dashed lines represent veridical estimates for each *S_s*. The colored symbols represent the data, different colors denoting different values of *S_s*. The circled points are the data for the cues-consistent (*S_m* = *S_s*) conditions. The colored lines are the best fits to the data for each *S_s*. The error bar in the upper left corner of each panel represents ± the average *SEM*. The diamonds on the right side of each panel indicate the actual response settings for the cues-consistent stimuli at *S_m* = *S_s* = ±30°. The data were normalized such that the fitted settings at those points plotted at ordinate values of ±30°.

Regression weights for *S_m* in Experiment 1. The abscissa values are the three observers and an overall summary. Different colors represent different viewing conditions. The ordinate values are the multiple regression weights for *S_m*, obtained by entering the slant estimates in each case into a multiple regression analysis with *S_m* and *S_s* as factors. The overall weights were calculated by entering the data from all three observers into a single analysis. The regression weights are equivalent to the weights given to *S_m* in each condition, averaged across all values of *S_m* and *S_s*. Error bars are +95% confidence intervals for the regression weights.

Figure 8 re-plots two subsets of the data: the cues-consistent conditions (S_m = S_s), and the S_m = 0° condition, in the same format as Figure 6. The abscissa is S_s. The lines are the best fits for each data subset. The slopes of the fitted lines to the cues-consistent data (black lines) are constrained to be 1 as a result of the normalization process. All three observers in both conditions reported less slant when S_m = 0 than when S_m = S_s (cues consistent), with the exception of AJW in the eye movement condition (she, however, showed a consistent effect of monitor slant overall; Figure 7). One-sample t tests were carried out on the differences between slant estimates for S_m = S_s and S_m = 0°. The difference in reported slant was statistically significant: observers reported less slant when S_m = 0 than when S_m = S_s in the eye movement condition, t(17) = 2.63, p < 0.05, and fixation condition, t(17) = 3.46, p < 0.01. This suggests again that focus cues affected slant percepts directly under monocular viewing.

Average normalized slant estimates for *S_m* = *S_s* and *S_m* = 0 in the eye movement and fixation conditions. Each panel plots the normalized estimates as a function of *S_s*. Each column shows the data for a different observer. The upper and lower rows show the data from the eye movement and fixation conditions, respectively. Both of those conditions were conducted with monocular viewing. The black circles are the data for *S_m* = *S_s* and the blue squares are the data for *S_m* = 0. The lines are best fits to the data. The slopes of the fitted lines to the cues-consistent data are constrained to be 1 as a result of the normalization process. Error bars are ±1 *SEM*.

The effects of monitor slant are summarized in Figure 9. The normalized data from each observer in each condition were entered into a multiple regression analysis with S_m and S_s as factors. The figure plots the regression weight for S_m separately for each condition and observer, as well as an average weight for each condition. The regression weights are the average weights given to monitor slant across all values of S_m and S_s. Regression weights greater than 0 indicate an effect of monitor slant. No effect was observed in the disparity condition. A consistent effect was observed in the texture condition and it persisted in the eye movement and fixation conditions where head position was fixed. Thus, residual motion parallax with chin rest constraint had no discernible effect, perhaps because the head movements were small. The fact that the effect persisted in the fixation condition, where observers held fixation on one point in the stimulus, suggests also that accommodation accompanying 3-D eye movements had no effect.
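A regression of this kind can be sketched with ordinary least squares on the model estimate = b0 + w_m·S_m + w_s·S_s. The inclusion of an intercept, the function names, and the synthetic data are assumptions made for illustration; confidence intervals are not computed.

```python
def solve_linear(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def regression_weights(s_m, s_s, estimates):
    """Return [intercept, weight for S_m, weight for S_s] from the normal
    equations (X'X) beta = X'y."""
    rows = [(1.0, a, b) for a, b in zip(s_m, s_s)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, estimates)) for i in range(3)]
    return solve_linear(xtx, xty)

# Synthetic data: a hypothetical observer who gives monitor slant a weight of 0.12.
slants = [-30, -20, -10, 0, 10, 20, 30]
s_m = [m for m in slants for _ in slants]
s_s = [s for _ in slants for s in slants]
estimates = [0.12 * m + 0.88 * s for m, s in zip(s_m, s_s)]

intercept, w_m, w_s = regression_weights(s_m, s_s, estimates)
```

In a full factorial design such as this one, S_m and S_s are uncorrelated, so the weight recovered for S_m does not depend on the weight given to S_s.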

Discussion

Summary of results

With monocular viewing, observers’ slant estimates were systematically affected by the orientation of the monitor surface (S_m). Observers reported seeing more slant when S_m = S_s (as occurs with real stimuli) than when S_m = 0 (as usually occurs in psychophysical experiments and with most 3-D displays). The effect for the conditions of our experiment was small but quite consistent. These results show that information from focus cues (specifically, retinal blur) can, under monocular viewing, contribute directly to the visual system’s estimate of 3-D surface orientation.

Evidence for reliability-based cue weighting

We next asked whether our findings are consistent with reliability-based cue weighting (Equations 1 and 2). To answer this, we first estimated the reliabilities of focus cues as well as texture and disparity cues for our stimuli. We then used those reliabilities to predict perceived slant for each combination of S_m and S_s in our experiment. Although we did not determine the single-cue reliabilities by our own experimental measurements, the exercise is useful for understanding the data.

According to Equation 2, the reliability of each cue is the normalized reciprocal variance of the underlying estimator for that cue. To estimate this variance for the disparity and texture cues, we used previous slant discrimination measurements for each cue in isolation. To estimate this variance for focus cues, we simulated slant discrimination from blur using previous measurements of the visual system’s depth of focus. Figure 10 plots estimates of the JNDs for slant from disparity, texture, and focus cues as a function of surface slant (tilt = 0) and distance. Details of the calculations are provided in Appendix A.

JND estimates for slant from disparity, texture, and focus as a function of slant and viewing distance. The different colored surfaces represent JNDs based on the individual cues. The disparity and texture JNDs were estimated from the data of Hillis et al. (2004) (see Appendix A). The focus JNDs were estimated by calculations described in Appendix A. The calculations determined how much slant would be required for the difference in defocus at the nearest and farthest points in the stimulus plane to exceed the visual system’s depth of focus. The estimated JNDs from focus cues become very large at far distances and small slants, so the top portion of the focus surface has been clipped at 40°.

The texture JNDs were estimated from measurements made by Hillis et al. (2004) for monocularly viewed Voronoi patterns (see Appendix A). They are represented by the orange surface in Figure 10. Texture JNDs decrease with increasing slant because the image changes associated with a given change in slant increase (Blake, Bülthoff, & Steinberg, 1993; Knill, 1998). Texture JNDs do not change with distance because doubling the size of a given textured surface and viewing it from twice the distance leaves the retinal image unchanged.

The disparity JNDs were derived from discrimination thresholds for slant from disparity alone (Hillis et al., 2004), measured using sparse random-dot stereograms (see Appendix A). They are represented by the blue surface in Figure 10. Disparity JNDs increase with viewing distance because the magnitude of binocular disparities for a given depth difference decreases with increasing viewing distance (Howard & Rogers, 2002; Ogle, 1950). Disparity JNDs also vary with slant, which is expected from the viewing geometry (Hillis et al., 2004). The variation is distance dependent: JNDs decrease with slant at long viewing distances and increase with slant at short ones (see also Banks, Hooge, & Backus, 2001; Knill & Saunders, 2003). The steep rise at large slant and short viewing distance probably reflects the influence of the disparity-gradient limit. In that situation, the horizontal disparity gradient increases significantly, and the two retinal images are difficult to fuse (Banks, Gepshtein, & Landy, 2004; Burt & Julesz, 1980; Hillis et al., 2004).

We could not measure thresholds for slant from blur independent of other slant cues, but we can make a rough estimate of JNDs by considering how much slant would be required for the difference in defocus at the nearest and farthest points in the stimulus plane to exceed the visual system’s depth of focus. We did this calculation for each combination of slant and distance in Figure 10, using the same stimulus viewing frustum as Experiment 1 (see the Methods section). Details are provided in Appendix A. For our dim viewing conditions, pupil size was 5–7 mm (Wyszecki & Stiles, 1982), so depth of focus was approximately ±0.33 diopters (Campbell, 1957; Charman & Whitefoot, 1977; Green & Campbell, 1965; Green et al., 1980). The red surface in Figure 10 represents the estimated focus JNDs as a function of slant and distance.
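The logic of this rough estimate can be illustrated with a simplified calculation. The sketch below is not the Appendix A procedure: it considers only slants away from frontoparallel, approximates defocus by reciprocal depth along the line of sight, and uses a hypothetical stimulus half-angle of 10°. The 0.33 diopter criterion is taken from the depth-of-focus value above, and treating it as the threshold for the edge-to-edge defocus difference is itself an assumption.

```python
import math

def focus_slant_threshold(distance_m, half_angle_deg=10.0, depth_of_focus_d=0.33):
    """Smallest slant (deg) from frontoparallel at which the defocus difference
    between the near and far edges of the stimulus reaches depth_of_focus_d.
    Returns None if no slant below 90 deg is enough."""
    # Keep the angular size fixed, so the half-width grows with distance.
    half_width = distance_m * math.tan(math.radians(half_angle_deg))
    t = depth_of_focus_d
    # The edges lie at depths D - x and D + x with x = half_width * sin(slant),
    # so the defocus difference is 1/(D - x) - 1/(D + x) = 2x / (D^2 - x^2).
    # Setting this equal to t and solving the quadratic for x gives:
    x = (math.sqrt(1.0 + (t * distance_m) ** 2) - 1.0) / t
    if x > half_width:
        return None
    return math.degrees(math.asin(x / half_width))

near = focus_slant_threshold(0.285)   # viewing distance used in Experiment 1
farther = focus_slant_threshold(0.40)
far = focus_slant_threshold(2.0)      # None: no slant produces enough defocus
```

Even in this simplified form, the required slant grows with viewing distance and eventually cannot be reached at all, which is the qualitative behavior of the focus surface in Figure 10.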

The focus JNDs are generally larger than the disparity and texture JNDs, but the differences depend on slant and viewing distance. Specifically, focus-cue JNDs increase with increasing distance and decrease with increasing slant. The optimal cue-combination scheme (Equations 1 and 2) predicts therefore that focus cues should have little effect on 3-D percepts for many viewing situations. At short distances and large slants, however, focus JNDs can be equal to or less than those for disparity and texture. In these cases, optimal combination predicts a noticeable effect of focus cues.

We can use the estimated JNDs to derive predictions of the effect of focus cues in the conditions of our experiment. The left panel in Figure 11 plots the estimated JNDs for slant from disparity, texture, and focus cues for the range of slants (±30°) and the viewing distance (28.5 cm) used in Experiment 1. From those JNDs, we estimated the standard deviations of the estimators associated with disparity, texture, and focus cues. Then using Equations 1 and 2, we calculated the slants an observer would perceive if he weighted the three cues optimally. The middle and right panels show those predicted perceived slants, plotted in the same format as Figure 5 and Figure 7. In the disparity condition, the optimal cue combination predicts a small effect of monitor slant because the standard deviation of the disparity estimator is generally small relative to that of focus cues. In the texture condition, the model predicts a more systematic effect of monitor slant because in many cases the standard deviation associated with the competing cue—the texture gradient—does not differ very much from the standard deviation associated with focus cues.
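A minimal sketch of the prediction step follows, assuming that Equations 1 and 2 take the standard form in which each cue's weight is its reciprocal variance divided by the sum of reciprocal variances. The standard deviations used here are hypothetical, and the conversion from JND to standard deviation is not shown because it is not specified in this section.

```python
def cue_weights(sigmas):
    """Normalized reciprocal-variance weights, one per cue."""
    reliabilities = [1.0 / s ** 2 for s in sigmas]
    total = sum(reliabilities)
    return [r / total for r in reliabilities]

def combined_slant(slants, sigmas):
    """Reliability-weighted average of the slants signaled by each cue."""
    return sum(w * s for w, s in zip(cue_weights(sigmas), slants))

# Texture signals the simulated slant S_s; focus cues signal the monitor slant S_m.
s_s, s_m = 20.0, 0.0
sigma_texture, sigma_focus = 5.0, 12.0   # hypothetical standard deviations (deg)

w_texture, w_focus = cue_weights([sigma_texture, sigma_focus])
predicted = combined_slant([s_s, s_m], [sigma_texture, sigma_focus])
```

With these illustrative values the focus cue receives a weight of 25/169 (about 0.15), so a 20° simulated slant on a frontoparallel monitor is predicted to look about 17° slanted. Repeating the calculation with slant-dependent standard deviations for every S_m × S_s pair gives curves of the kind shown in the middle and right panels.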

Estimated slant JNDs and predicted results for Experiment 1. Left: Estimated JNDs for slant from disparity, texture, and focus cues, plotted as a function of slant at the 28.5 cm viewing distance used in Experiment 1. The curves are a slice through the contours of Figure 10. Middle: Predicted perceived slant for the disparity-defined stimulus. Right: Predicted perceived slant for the texture-defined stimulus. The format of the middle and right panels is the same as Figure 5 and Figure 7. The curves are plotted as a function of *S_m*; each color represents a different value of *S_s*. The variance of each cue’s slant estimate was calculated from the estimated JNDs in the left panel. The predicted perceived slants were calculated using those variances and the cue-combination scheme described by Equations 1 and 2.

Our empirical findings (Figure 5 and Figure 7) are generally quite similar to these predictions. The data exhibit a small but consistent effect of monitor slant in the texture condition; that effect is similar in magnitude to the predicted effect. The data reveal no effect of monitor slant in the disparity condition, while a very small effect is predicted. From a multiple regression analysis of the predictions and data, we find that the average predicted weights given to focus cues in the disparity and texture conditions were 0.07 and 0.15, respectively, and that empirical weights were 0.01 and 0.12 (Figure 9).

Despite this general similarity, the model does not capture the details of the empirical findings. In particular, in the monocular viewing conditions we found a significant difference between slant estimates in the cues-consistent (S_m = S_s) and the S_m = 0° conditions. The model predicts only small differences between these conditions because focus-cue JNDs are large when S_m = 0. It is important to note that our predictions are based on a simple and untested model of how the visual system discriminates changes in slant from focus cues (Appendix A). We do not know how the brain actually computes slant from those cues. Therefore, it is quite possible that the discrepancy between the predictions and observed effects of focus cues resulted from inadequacies in our model. Furthermore, the reliability of slant estimates from focus cues surely depends on several factors including the spatial frequency, luminance, and contrast of the stimulus, as well as on fixation patterns and pupil size. Thus, a proper analysis would require empirical measurement of slant from focus for the stimuli used in the main experiment. Nonetheless, our analysis yields insight into the informativeness of focus cues as a function of slant and viewing distance, and the relationship to the informativeness of texture and disparity. Under reasonable assumptions, the pattern of effects across conditions in our empirical data was generally consistent with reliability-based cue weighting.

We examined only two conventional depth cues, disparity and texture, so it remains to be determined whether inappropriate focus cues also contribute to perceived depth for stimuli defined by other conventional cues.

Experiment 2

Overview and background

Experiment 1 revealed that focus cues can have a direct effect on 3-D percepts. Accommodation could also affect perceived depth indirectly through the process of disparity scaling. The disparity (δ) created by two points in space is related to viewing distance as follows:

\delta \approx \frac{I \, \Delta D}{D^{2}}, \qquad (3)

where ΔD is depth, D is viewing distance, and I is interpupillary distance (Howard & Rogers, 2002). To recover ΔD from δ, D must be estimated. We know that viewing distance is estimated from the eyes’ vergence and the horizontal gradient of vertical disparity (Rogers & Bradshaw, 1993, 1995). In principle, it could also be estimated from accommodation. In computer displays, the focal distance to the display surface is fixed and often quite different from the simulated distances in the virtual scene. If the stimulus to accommodation (the focal distance of the display surface) affects the estimate of viewing distance, the distance to simulated points nearer than the display surface will be overestimated and the distance to points farther than the display surface will be underestimated. Such estimation errors might affect disparity scaling and hence the depth interpretation.
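The consequence of a misestimated viewing distance can be made concrete by inverting Equation 3. The sketch below uses the small-angle approximation of Equation 3 only; the function names and numerical values are hypothetical.

```python
def disparity_rad(depth_m, distance_m, interpupillary_m):
    """Equation 3: disparity (radians) produced by a depth interval."""
    return interpupillary_m * depth_m / distance_m ** 2

def depth_from_disparity(delta_rad, estimated_distance_m, interpupillary_m):
    """Equation 3 solved for depth, using the observer's distance estimate."""
    return delta_rad * estimated_distance_m ** 2 / interpupillary_m

I = 0.062          # hypothetical interpupillary distance (m)
true_distance = 0.30
true_depth = 0.02

delta = disparity_rad(true_depth, true_distance, I)

# If the focal distance of a farther display pulls the distance estimate to 0.40 m,
# the same disparity is interpreted as a larger depth interval.
veridical = depth_from_disparity(delta, 0.30, I)
overestimated = depth_from_disparity(delta, 0.40, I)
```

Because depth scales with the square of the estimated distance, the perceived depth changes by the factor (estimated distance / true distance)², here (0.40/0.30)² ≈ 1.78.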

There have been many studies of disparity scaling (e.g., Bradshaw, Glennerster, & Rogers, 1996; Glennerster, Rogers, & Bradshaw, 1996; Johnston, Cumming, & Parker, 1993; O’Leary & Wallach, 1980; Rogers & Bradshaw, 1993, 1995; van Damme & Brenner, 1997), but only one (Ritter, 1977) examined the contribution of focal distance directly, and he observed no effect.

Frisby et al. (1996) observed veridical disparity scaling with real stimuli. The general consensus is that disparity scaling is most accurate when multiple cues are available and consistent with one another (e.g., vergence, vertical disparity, familiar size), but that scaling is usually nonveridical. At near viewing distances, the visual system behaves as if distance is overestimated, and at far distances, as if distance is underestimated (Collett, Schwarz, & Sobel, 1991; Foley, 1980; Glennerster et al., 1996; Johnston, 1991; Johnston et al., 1993; Rogers & Bradshaw, 1995; Wallach & Zuckerman, 1963). Although many of these studies varied display distance and simulated distance concordantly, this pattern of results is also generally what one would expect if focal distance (of a fixed display) affects the distance used for disparity scaling.

In Experiment 2 we examined the contribution of accommodation to the estimate of the distance used to scale horizontal disparities. In particular, we examined the indirect influence of focal distance on disparity scaling by independently manipulating vergence distance (by varying absolute disparity) and focal distance, referred to hereafter as accommodative distance (by varying the distance to the display).