Author manuscript; available in PMC 2010 Sep 16.
Published in final edited form as: J Vis. 2009 Jan 12;9(1):8.1–8.18. doi: 10.1167/9.1.8

Limits of stereopsis explained by local cross-correlation

Heather R Filippini 1, Martin S Banks 2

Abstract

Human stereopsis has two well-known constraints: the disparity-gradient limit, which is the inability to perceive depth when the change in disparity within a region is too large, and the limit of stereoresolution, which is the inability to perceive spatial variations in disparity that occur at too fine a spatial scale. We propose that both limitations can be understood as byproducts of estimating disparity by cross-correlating the two eyes’ images, the fundamental computation underlying the disparity-energy model. To test this proposal, we constructed a local cross-correlation model with biologically motivated properties. We then compared model and human behaviors in the same psychophysical tasks. The model and humans behaved quite similarly: they both exhibited a disparity-gradient limit and had similar stereoresolution thresholds. Performance was affected similarly by changes in a variety of stimulus parameters. By modeling the effects of stimulus blur and of using different sizes of image patches, we found evidence that the smallest neural mechanism humans use to estimate disparity is 3–6 arcmin in diameter. We conclude that the disparity-gradient limit and stereoresolution are indeed byproducts of using local cross-correlation to estimate disparity.

Keywords: binocular vision, computational modeling, depth

Introduction

Stereopsis, the ability to perceive depth from binocular disparity, is limited by a number of factors. The variation in disparity from one part of the stimulus to another must be large enough to produce a discernible variation in perceived depth. This just-noticeable variation is the disparity threshold. But the magnitude of disparity variation must not be too great; otherwise, the two eyes’ images cannot be fused and the depth percept collapses. This maximum disparity is the fusion limit or Dmax. Finally, the spatial variation in disparity from one part of the stimulus to another must not occur at too fine a scale. The finest perceptible variation is the stereoresolution limit. These limits to stereopsis are summarized in Figure 1. The upper panel is a stereogram of sinusoidal corrugations in which disparity amplitude increases from left to right and spatial frequency increases from bottom to top. View the stereogram at a distance of 40 cm and cross-fuse or divergently fuse to see the corrugations. One can perceive the sinusoidal depth variation in the middle of the stereogram but not elsewhere. The lower panel is a graph, replotted from Tyler (1975), showing the combinations of disparity amplitude and spatial frequency for which the corrugation in depth is perceived and the combinations for which it is not. Our purpose is to better understand the determinants of the boundary conditions for stereopsis.

Figure 1.

Combinations of disparity and spatial frequency that yield stereoscopic depth percepts. The upper panels are a stereogram that specifies sinusoidal depth corrugations. Cross fuse or divergently fuse to see the corrugations. Disparity amplitude increases from left to right and spatial frequency from bottom to top. The lower panel is replotted from Tyler (1975). The shaded region represents the combinations of disparity amplitude and spatial frequency that produce stereoscopic depth percepts: specifically, the ability to see the depth corrugation. The unshaded region represents combinations that do not yield such percepts. Along the dashed line, the product of spatial frequency and disparity is constant. The percept from the stereogram in the upper panel corresponds roughly to the graph in the lower panel if you view the stereogram from 40 cm.

To estimate disparity, the visual system must determine which parts of the two retinal images correspond. Doing this by cross-correlating the two eyes’ images has been used successfully in computer vision (Clerc & Mallat, 2002; Kanade & Okutomi, 1994), in modeling human vision (Banks, Gepshtein, & Landy, 2004; Cormack, Stevenson, & Schor, 1991; Fleet, Wagner, & Heeger, 1996; Harris, McKee, & Smallman, 1997), and in modeling binocular interaction in visual cortex. The prevailing model of binocular integration in visual cortex is the disparity-energy model (Cumming & DeAngelis, 2001; Ohzawa, 1998; Ohzawa, DeAngelis, & Freeman, 1990). In this model, the output of binocular complex cells can be expressed as

$$C(X_L, X_R) = (S_L^{\mathrm{even}})^2 + (S_L^{\mathrm{odd}})^2 + (S_R^{\mathrm{even}})^2 + (S_R^{\mathrm{odd}})^2 + 2\,S_L^{\mathrm{even}} S_R^{\mathrm{even}} + 2\,S_L^{\mathrm{odd}} S_R^{\mathrm{odd}}, \tag{1}$$

where SL and SR are the responses of simple cells in the left and right eyes, and even and odd refer to the symmetry of the simple-cell receptive fields (Prince & Eagle, 2000). In the last two terms, the left eye’s response is multiplied by the right eye’s response. A bank of such cells, each tuned to a different disparity and performing this computation, carries out the equivalent of a windowed, or local, cross-correlation.
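To make Equation 1 concrete, here is a minimal numerical sketch. It treats the even- and odd-symmetric simple cells as quadrature Gabor filters applied to one image patch from each eye and then combines the four responses as in Equation 1; the filter parameters and function name are illustrative assumptions of ours, not values taken from the disparity-energy literature.

```python
import numpy as np

def complex_cell_response(patch_L, patch_R, wavelength=8.0, sigma=4.0):
    """Disparity-energy response (Equation 1) for one pair of square image
    patches (n x n arrays of luminance).  Simple-cell responses are modeled
    as inner products with quadrature-pair (even/odd) Gabor receptive fields;
    wavelength and sigma are in pixels and purely illustrative."""
    n = patch_L.shape[0]
    x, y = np.meshgrid(np.arange(n) - n / 2, np.arange(n) - n / 2)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    even_rf = envelope * np.cos(2 * np.pi * x / wavelength)  # even-symmetric RF
    odd_rf = envelope * np.sin(2 * np.pi * x / wavelength)   # odd-symmetric RF

    s_l_even, s_l_odd = np.sum(even_rf * patch_L), np.sum(odd_rf * patch_L)
    s_r_even, s_r_odd = np.sum(even_rf * patch_R), np.sum(odd_rf * patch_R)

    # Equation 1: squared responses plus the left-right product terms
    return (s_l_even ** 2 + s_l_odd ** 2 + s_r_even ** 2 + s_r_odd ** 2
            + 2 * s_l_even * s_r_even + 2 * s_l_odd * s_r_odd)
```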

In this paper, we investigate whether the limits of stereopsis revealed in Figure 1 are byproducts of using local cross-correlation to estimate disparity from the two eyes’ images. To do so, we construct a local cross-correlator with biologically motivated properties and compare its behavior to that of human observers. In the first section, we consider the cause of the limit on the upper right side of the shaded area in Figure 1; we point out, as have Burt and Julesz (1980) and Tyler (1974, 1975), that the transition with increasing disparity from perceptible to imperceptible is well described by the disparity-gradient limit. By comparing human and model performances, we show that this limit is a byproduct of estimating disparity by correlation. In the second section, we consider the cause of the limit on the upper part of the shaded region in Figure 1. We again compare human and model performances and show that this stereoresolution limit is also a byproduct of estimating disparity by correlation.

The disparity-gradient limit

Sinusoidal corrugations in random-element stereograms cannot be discriminated when the product of corrugation spatial frequency and disparity amplitude exceeds a critical value (Tyler, 1974, 1975; Ziegler, Hess, & Kingdom, 2000). A similar phenomenon was observed by Burt and Julesz (1980) who reported that two-element stereograms cannot be fused when the angular separation between the elements is less than the disparity. Burt and Julesz argued that the limit to fusion is not disparity per se, as suggested by the notion of Panum’s fusional area (Panum, 1858). Rather the limit is a ratio: the separation divided by the disparity. This critical ratio is the disparity-gradient limit.

The disparity gradient is clearly defined for two-element stimuli. For elements P and Q in a stereogram, the coordinates in the left eye’s image are (xPL, yPL) and (xQL, yQL), and the coordinates in the right eye are (xPR, yPR) and (xQR, yQR). The separation is the vector S from the average position of P to the average position of Q. Its magnitude is

$$|S| = \sqrt{\left[\frac{x_{PL} - x_{QL} + x_{PR} - x_{QR}}{2}\right]^2 + \left[\frac{y_{PL} - y_{QL} + y_{PR} - y_{QR}}{2}\right]^2}, \tag{2}$$

and its direction is

$$\tan^{-1}\left[\frac{y_{PL} - y_{QL} + y_{PR} - y_{QR}}{x_{PL} - x_{QL} + x_{PR} - x_{QR}}\right]. \tag{3}$$

The disparity is the vector D; its magnitude is

$$|D| = \sqrt{\left[x_{PL} - x_{QL} + x_{QR} - x_{PR}\right]^2 + \left[y_{PL} - y_{QL} + y_{QR} - y_{PR}\right]^2}, \tag{4}$$

and its direction is

$$\tan^{-1}\left[\frac{y_{PL} - y_{QL} + y_{QR} - y_{PR}}{x_{PL} - x_{QL} + x_{QR} - x_{PR}}\right]. \tag{5}$$

The disparity gradient is |D|/|S|. In Burt and Julesz (1980), the direction of S was varied, but D was always horizontal. They found that two-element stereograms could not be fused when the disparity gradient exceeded 1 regardless of the direction of S. In other words, they found that the disparity-gradient limit was unaffected by the tilt of the stimulus (Stevens, 1979).
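For concreteness, the sketch below evaluates Equations 2–5: given the left- and right-eye image coordinates of P and Q, it computes the separation S between their average (cyclopean) positions, the relative disparity D, and the disparity gradient |D|/|S|. The function name and the units in the example (degrees) are our own choices.

```python
import numpy as np

def disparity_gradient(pL, pR, qL, qR):
    """Disparity gradient |D|/|S| for points P and Q (Equations 2-5).
    pL, pR, qL, qR are (x, y) image coordinates of P and Q in the left and
    right eyes, all in the same angular units (e.g., degrees)."""
    pL, pR, qL, qR = map(np.asarray, (pL, pR, qL, qR))
    S = (pL + pR) / 2 - (qL + qR) / 2   # separation of average positions (Eq. 2)
    D = (pL - pR) - (qL - qR)           # difference of the points' disparities (Eq. 4)
    return np.linalg.norm(D) / np.linalg.norm(S)

# Example: P has 0.25 deg of horizontal disparity, Q has none, and their
# cyclopean positions are separated vertically by 0.5 deg -> gradient of 0.5.
print(disparity_gradient((-0.125, 0.0), (0.125, 0.0), (0.0, 0.5), (0.0, 0.5)))
```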

On a surface, the disparity gradient is not clearly defined because the gradient can in principle be measured in any direction. The gradient is, however, largest in the direction in which depth increases most rapidly (i.e., parallel to surface tilt). For this reason, we will define the disparity gradient in the direction of most rapidly increasing depth: i.e., S will be parallel to tilt. The definition of the disparity gradient for a horizontally oriented sawtooth corrugation is schematized in Figure 2.

Figure 2.

Definition of the disparity gradient. The upper part of the figure is a stereogram in which disparity specifies a sawtooth corrugation in depth. The orange and blue points, P and Q, lie on one of the slats of the sawtooth, near the trough and peak of the wave, respectively. They are positioned such that their separation S is aligned with the direction of most rapidly increasing depth. The lower part of the figure shows how the disparity gradient is defined.

The observation that the product of spatial frequency and amplitude must not exceed a critical value is also a manifestation of the disparity-gradient limit. As shown in Figure 2, the disparity gradient for sawtooth slats can be expressed as

$$\mathrm{DG} = A \cdot \mathrm{SF}, \tag{6}$$

where A is the amplitude of the sawtooth wave and SF is the spatial frequency. The disparity gradient for the discontinuities between the slats is infinite. While sine waves do not have a constant disparity gradient, the disparity gradient of the steepest part of the waveform will have a similar relationship to the amplitude and spatial frequency. The corrugations in Figure 1 are horizontal (as they were in Tyler, 1973, 1974, 1975) and are defined by horizontal disparity only; thus, S is vertical and D is horizontal.

Why does the disparity-gradient limit exist? There are two general hypotheses. First, the limit might be a manifestation of constraints built into the visual system to help minimize false matches between the two eyes’ images; we will refer to this as the constraint hypothesis. Second, the disparity-gradient limit might be a byproduct of the manner in which binocular correspondence is solved; we will refer to this as the correlation hypothesis.

The constraint hypothesis states that the disparity-gradient limit is a topology constraint. Consider a small frontoparallel surface with a vertical rotation axis (tilt = 0°). As we rotate the surface, the slant increases. At some point, one eye will see the surface “edge on” such that points in the direction of most rapidly increasing depth will be superimposed in one eye’s retinal image and not in the other eye’s image. When this occurs, the disparity gradient is 2 (Trivedi & Lloyd, 1985). If we rotate the surface further, the gradient exceeds 2, and the order of the image points is reversed in the two eyes (specifically, the points occur in opposite orders along an epipolar line in the two eyes’ images). This observation is, however, only correct for surfaces with a tilt of 0° (i.e., when the directions of S and D are both horizontal). Trivedi and Lloyd (1985) argued that the visual system avoids matching elements with different orderings in the two eyes by invoking the disparity-gradient limit. Specifically, matches consistent with a small gradient should be favored over matches with a large one. Invoking a disparity-gradient constraint in solving the correspondence problem is consistent with other constraints that have been proposed such as the uniqueness, ordering, and smoothness constraints (Li & Hu, 1996; Marr & Poggio, 1976; Pollard, Mayhew, & Frisby, 1985; Prazdny, 1985).

Although Trivedi and Lloyd (1985) showed that ordering is preserved when the disparity gradient is less than 2, they also noted that the converse is not true: correctly ordered points do not necessarily have a disparity gradient less than 2. To expand on this observation, we calculated the disparity gradient at which occlusion occurs in one eye for different surface tilts. Figure 3 shows the results. The critical gradient is indeed 2 at tilt = 0°, but it increases with increasing tilt until it becomes infinite at tilt = 90°. Thus, the value of the disparity gradient at which the order of image points in the two eyes’ images reverses is strongly dependent on surface tilt. If the constraint serves the function of avoiding matches of one element with two in the other eye (uniqueness constraint) or of matching elements in different orders in the two eyes (ordering constraint), one would expect the disparity-gradient limit to increase with increasing surface tilt. The fact that it does not (Burt & Julesz, 1980) suggests that the limit is caused by something else.

Figure 3.

Critical disparity gradient vs. tilt. As the slant of a surface increases, the disparity gradient increases. The critical gradient is the disparity gradient at which one eye’s image is first occluded. The disparity gradient at which this occurs is tilt-dependent. This plot was generated with the eyes in forward gaze; it is unaffected by the eyes’ vergence.

The correlation hypothesis states that the disparity-gradient limit is a byproduct of the fundamental calculation involved in solving the correspondence problem. The normalized cross-correlation (Equation 9) between two images reaches its greatest value of 1 when the two eyes’ images are identical. Spatial variations in disparity cause differences in the two images and thereby cause a decrease in correlation. For this reason, one expects the correlation to decrease as the disparity gradient increases. The decrease in correlation occurs whether the disparity gradient is greatest horizontally (surface with tilt = 0°), vertically (tilt = 90°), or anything in-between. The fact that the disparity-gradient limit is not dependent on tilt is consistent with the correlation hypothesis and not with the constraint hypothesis.

We tested the idea that the disparity-gradient limit is a byproduct of estimating disparity by correlation by examining how the disparity gradient affects human stereopsis, how it affects the performance of a correlation model, and then comparing the two on the same task with the same stimuli.

Human methods

Observers

The observers were the two authors and two other adults; the latter were unaware of the experimental hypotheses. All had normal visual acuity and stereopsis.

Apparatus

Stimuli were displayed on a haploscope consisting of two monochrome CRT displays (58 cm on the diagonal) each seen in a mirror by one eye (Backus, Banks, van Ee, & Crowell, 1999; Hillis & Banks, 2001). The lines of sight from the eyes to the centers of the displays were perpendicular to the display surfaces. The displays were 39 cm from the eyes. Despite the short distance, the visual locations of the elements in our stimuli were specified to within ~30 arcsec. We fixed eye position relative to the displays by using custom bite bars.

Stimuli

The stimuli were random-dot stereograms with a dot density of 10 dots/deg2 and an extent of 15° horizontally and vertically. The average luminous intensity of a dot was 1.72 × 10−6 cd and the size was 0.53 arcmin. The dots were randomly distributed in the half-images. Two methods were used to create disparity. In the first, we shifted the dots horizontally in screen coordinates (which correspond to horizontal in Helmholtz coordinates). This is the most common method for creating stereograms, but such stimuli presented in a haploscope do not have the vertical disparities that are produced by real-world stimuli at finite distances (Held & Banks, 2008). In the second method, we used “back projection” to create the appropriate horizontal and vertical disparities (Backus et al., 1999).

The disparities specified a horizontally oriented sawtooth corrugation (Figure 2). Each slat in the sawtooth had a constant disparity gradient that was proportional to both the spatial frequency and amplitude of the corrugation.

Procedure

Before each trial, a dichoptic nonius fixation target was visible. Observers made sure the nonius lines were aligned before initiating a stimulus presentation, which assured that their vergence eye position was appropriate. With a key press, the observer initiated a 600-ms presentation of the sawtooth stimulus. It was presented in one of two parities: with the slats slanted top-back or top-forward. The observer then indicated which parity had been presented. The absolute phase of the sawtooth was varied randomly from trial to trial, so that the task could not be performed by determining the depth of a single dot. Because the corrugations were horizontal (i.e., tilt = 90°), there were no monocular artifacts; performing the task required perceiving the cyclopean waveform.

We wanted to keep the corrugation waveform fixed for each experimental condition, so we measured parity-discrimination thresholds by adding uninformative noise to the stimulus rather than by manipulating disparity amplitude, which would have changed the disparity gradient of the waveform. The uninformative noise was dots randomly positioned in 3D; the depths of the noise dots were drawn from a uniform random distribution with a fixed range that was greater than the depth range of the corrugation waveform. Coherence is the number of signal dots (those specifying the corrugation) divided by the total number of dots (signal dots plus noise dots). A coherence of 1 means that all dots are signal dots, while a coherence of 0 means that all dots are noise dots. The sum of signal and noise dots was always the same. We varied coherence using the method of constant stimuli in order to determine the threshold value. We fit the psychometric data with a cumulative Gaussian using a maximum-likelihood criterion and used the coherence at 75% correct as the threshold estimate (Wichmann & Hill, 2001).
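As an illustration of the threshold-estimation step, the sketch below fits a cumulative Gaussian to percent-correct data by maximum likelihood and reads off the coherence at 75% correct. It assumes a 0.5 guess rate for the two-alternative parity judgment and ignores lapse rates; the actual analysis followed Wichmann and Hill (2001), so this is only a simplified stand-in.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def coherence_threshold(coherences, n_correct, n_trials, criterion=0.75):
    """Fit a cumulative Gaussian psychometric function by maximum likelihood
    and return the coherence at the criterion performance level."""
    coherences = np.asarray(coherences, dtype=float)
    n_correct = np.asarray(n_correct, dtype=float)
    n_trials = np.asarray(n_trials, dtype=float)

    def neg_log_likelihood(params):
        mu, log_sigma = params
        # Two-alternative task: performance runs from 0.5 (guessing) to 1.0
        p = 0.5 + 0.5 * norm.cdf(coherences, mu, np.exp(log_sigma))
        p = np.clip(p, 1e-6, 1 - 1e-6)
        return -np.sum(n_correct * np.log(p) + (n_trials - n_correct) * np.log(1 - p))

    fit = minimize(neg_log_likelihood, x0=[np.median(coherences), np.log(0.1)],
                   method="Nelder-Mead")
    mu, sigma = fit.x[0], np.exp(fit.x[1])
    # Coherence at which the fitted function reaches the criterion (75% -> mu)
    return mu + sigma * norm.ppf((criterion - 0.5) / 0.5)
```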

Our cyclopean discrimination task required that observers perceive at least part of the disparity-defined corrugation. We assume that the two eyes’ images had to be fused to perform this task.

Human results

The first measurements were made with stereograms in which the disparities were created by horizontal shifting. The haploscope arms were rotated such that the vergence distance was 39 cm, which matched the physical distance to the CRTs.

Figures 4 and 5 plot the results. In Figure 4, coherence threshold is plotted as a function of disparity amplitude. The three sets of data points correspond to the thresholds for three different spatial frequencies: 0.15, 0.3, and 0.6 cpd. Coherence threshold generally rose with increasing disparity amplitude, but the amplitude at which the rise occurred differed across spatial frequencies. Figure 5 plots the same data as a function of disparity gradient. With increasing gradient, threshold rose in similar fashion for all three spatial frequencies. Thus, perception of the cyclopean waveform began to collapse as the disparity gradient reached a value of approximately 1, which is consistent with earlier work (Burt & Julesz, 1980; Tyler, 1973, 1974, 1975; Ziegler et al., 2000). The worsening of performance was not precipitous: observers could still discriminate the cyclopean waveform when the gradient was as high as 1.6. This is consistent with previous work showing that humans can perceive surfaces when the disparity gradient exceeds 1 (McKee & Verghese, 2002).

Figure 4.

Coherence threshold vs. disparity amplitude for human observers. Each panel plots coherence threshold, the proportion of signal dots in the stimulus, as a function of the peak-to-trough disparity amplitude. Different panels show data from different observers. Different symbols represent data for different spatial frequencies of the horizontal sawtooth corrugation: circles, squares, and triangles for 0.15, 0.3, and 0.6 cpd, respectively. Error bars are standard errors of the means calculated by bootstrapping.

Figure 5.

Coherence threshold vs. disparity gradient for human observers. Each panel is replotted from Figure 4. The panels plot coherence threshold, the proportion of signal dots in the stimulus, as a function of the disparity gradient of the slats of the horizontal sawtooth stimulus. Different panels show data from different observers. Different symbols represent data for different spatial frequencies: circles, squares, and triangles for 0.15, 0.3, and 0.6 cpd, respectively. Error bars are standard errors of the means calculated by bootstrapping.

We also measured coherence thresholds for vertically oriented waveforms (tilt = 0°). The task was generally more difficult because vertical corrugations produced severe monocular artifacts (regions in which no dots appeared and regions of high dot density), but we were able to obtain reasonable data from two observers. As with horizontal corrugations, the rise in coherence threshold was determined by the disparity gradient. Thus, the disparity gradient is a key determinant of the ability to perceive cyclopean stimuli whether the variations in depth are vertical (tilt = 90°) or horizontal (tilt = 0°).

A given disparity gradient specifies different slants at different distances. The first experiment was conducted at one simulated viewing distance, so the relationship between disparity gradient and slant was fixed. It is therefore possible that slant rather than the disparity gradient was the primary limit to performance. To investigate this possibility, we tested two observers at different viewing distances. To vary the simulated distance, we manipulated the vergence stimulus by rotating the haploscope arms to the appropriate values. It was also important to present the appropriate vertical-disparity gradient because this gradient is an effective distance cue (Rogers & Bradshaw, 1993, 1995). We therefore used the back-projection method to create disparities in the stereograms. The vertical-disparity gradient and the horizontal vergence of the back-projected stimuli specified viewing distances of 17 or 39 cm. At 17 cm, a gradient of 1 specifies a slant of 71°; at 39 cm, the same gradient specifies a slant of 80°.
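The slants quoted for the two distances can be roughly checked with the small-angle approximation DG ≈ (I/d)·tan(slant) for a surface rotated about a horizontal axis, where I is the interocular distance and d the viewing distance. This approximation, and the interocular distance of 6.2 cm, are our own assumptions, not values from the text.

```python
import numpy as np

# Rough check of the slants quoted for a disparity gradient of 1, using the
# small-angle approximation DG ~ (I / d) * tan(slant).  The interocular
# distance I = 6.2 cm is an assumed value, so the numbers are approximate.
I = 6.2  # cm, assumed interocular distance
for d in (17.0, 39.0):  # viewing distances in cm
    slant = np.degrees(np.arctan(d / I))  # slant at which the gradient equals 1
    print(f"distance {d:.0f} cm: gradient of 1 -> slant of ~{slant:.0f} deg")
# Prints roughly 70 deg and 81 deg, close to the 71 deg and 80 deg quoted above.
```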

The results are shown in Figures 6 and 7. The data from the two viewing distances nearly superimposed when plotted as a function of disparity gradient (Figure 7) and did not when plotted as a function of slant (Figure 6). Thus, the disparity gradient rather than slant was the primary limit to performance.

Figure 6.

Coherence threshold vs. slant for human observers. Coherence threshold, the proportion of signal dots in the stimulus, is plotted as a function of the slant of the sawtooth slats. Different panels show data from different observers. Different symbol shapes represent data for different spatial frequencies: circles, squares, and triangles for 0.15, 0.3, and 0.6 cpd, respectively. Different colors represent data from the two viewing distances: red and blue for 39 cm and 17 cm, respectively. Error bars are standard errors of the means calculated by bootstrapping.

Figure 7.

Coherence threshold vs. disparity gradient for human observers. Coherence threshold, the proportion of signal dots in the stimulus, is plotted as a function of the disparity gradient of the sawtooth slats. Different panels show data from different observers. Different symbol shapes represent data for different spatial frequencies: circles, squares, and triangles for 0.15, 0.3, and 0.6 cpd, respectively. Different colors represent data from the two viewing distances: red and blue for 39 cm and 17 cm, respectively. Error bars are standard errors of the means calculated by bootstrapping.

The human results show, as others have (Tyler, 1974, 1975; Ziegler et al., 2000), that the disparity gradient is a key determinant of the ability to construct stereoscopic percepts. As the gradient increases, percept construction becomes progressively, but not precipitously, more difficult. The constraints imposed by high disparity gradients do not depend on stimulus tilt. We next examined the cause of the degradation in performance with increasing disparity gradient.

Modeling methods

If the constraints imposed by large disparity gradients are a byproduct of estimating disparity via correlation, we should see similar behavior in humans and a local cross-correlator like the one described by Banks et al. (2004). As noted earlier, such a correlation algorithm has the same fundamental properties as the disparity-energy calculation that characterizes binocular neurons in visual cortex (Anzai, Ohzawa, & Freeman, 1999; Ohzawa et al., 1990).

Stimuli and task

We presented the same stimuli and task to the model that were presented to human observers. The stimuli were again random-dot stereograms specifying sawtooth corrugations and the task was again to determine the parity of the sawtooth.

The stereo half-images were blurred according to the optics of the human eye. Specifically, we convolved the half-images with the point-spread function of the well-focused eye with a 3-mm pupil (h(x, y)):

$$h(x, y) = a\,h_1(x, y) + (1 - a)\,h_2(x, y), \tag{7}$$

where

$$h_i(x, y) = \left[2\pi s_i\right]^{-2} e^{-0.5\,(x^2 + y^2)/s_i^2} \tag{8}$$

and a = 0.583, s1 = 0.443 arcmin, and s2 = 2.04 arcmin (Geisler & Davila, 1985). The resulting images were scaled such that the spacing between rows and columns was 0.6 arcmin, corresponding roughly to the spacing between foveal cones (Geisler & Davila, 1985). These values were chosen to best approximate the analogous viewing situation for the human observers.
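A sketch of this blurring stage is given below; the values of a, s1, and s2 follow Equations 7 and 8, while the kernel extent, sampling grid, and unit-sum normalization are simplifying assumptions of ours.

```python
import numpy as np
from scipy.signal import fftconvolve

def eye_psf(sample_arcmin=0.6, radius_arcmin=8.0, a=0.583, s1=0.443, s2=2.04):
    """Two-Gaussian point-spread function of the well-focused eye
    (Equations 7 and 8; Geisler & Davila, 1985), sampled on a grid whose
    spacing matches the 0.6-arcmin pixel spacing used in the simulations.
    The kernel is normalized to unit sum here, which is our own choice."""
    r = np.arange(-radius_arcmin, radius_arcmin + sample_arcmin, sample_arcmin)
    x, y = np.meshgrid(r, r)
    gauss = lambda s: np.exp(-0.5 * (x ** 2 + y ** 2) / s ** 2)
    h = a * gauss(s1) + (1 - a) * gauss(s2)
    return h / h.sum()  # unit sum so the kernel preserves mean luminance

def blur_half_image(image):
    """Apply the eye's optical blur to one stereo half-image (2-D array)."""
    return fftconvolve(image, eye_psf(), mode="same")
```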

Cross-correlator

The half-images were then sent to the binocular cross-correlator, which computed the correlation between samples of the left and right half-images:

$$c(\delta x) = \frac{\displaystyle\sum_{(x,y) \in W_L} \left[(L(x,y) - \mu_L)\,(R(x - \delta x, y) - \mu_R)\right]}{\sqrt{\displaystyle\sum_{(x,y) \in W_L} (L(x,y) - \mu_L)^2 \;\sum_{(x,y) \in W_R} (R(x - \delta x, y) - \mu_R)^2}}, \tag{9}$$

where L(x, y) and R(x, y) are the image intensities in the left and right half-images, WL and WR are the windows applied to the half-images, µL and µR are the mean intensities within the two windows, and δx is the displacement of WR relative to WL (where the displacement is disparity). The normalization by mean intensity assures that the correlation is always between −1 and 1; the correlation for identical images is 1. Without the normalization, the resultant would depend on the contrast and average intensity of the half-images. WL and WR were identical two-dimensional Gaussian weighting functions:

$$W_L = W_R = e^{-\left(\frac{x^2}{2\sigma_x^2} + \frac{y^2}{2\sigma_y^2}\right)}. \tag{10}$$

We used isotropic functions, so σx and σy had the same values (for an example of the use of anisotropic functions, see Kanade & Okutomi, 1994). These weighting functions were used to select patches of the left and right half-images, which were then cross-correlated. Throughout the manuscript, we refer to the size of these weighting functions as “window size”; the window size we report is the diameter of the part of the Gaussian containing ±1σ. The actual windows used in the simulations extended to ±3σ, at which point they were truncated. Our weighting functions mimic the envelopes associated with cortical receptive fields, but not the even- and odd-symmetric weighting functions of the disparity-energy model (Ohzawa et al., 1990).

To estimate disparity across the stimulus, we shifted WL along a vertical line perpendicular to the sawtooth corrugations in the middle of the left eye’s half-image. For each position of WL, we then computed the correlation for different horizontal positions of WR relative to WL (Equation 9; horizontal defined in Helmholtz coordinates). The restriction of shifting WL along one vertical line greatly reduced computation time but did not affect the main results.
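A minimal sketch of this computation at a single window position is shown below. Equation 9 does not pin down every implementation detail (e.g., whether the Gaussian acts as a hard patch selection or a smooth weight, and whether the means are window-weighted); here the Gaussian of Equation 10 weights the mean-subtracted patches, and all names and pixel-based units are our own.

```python
import numpy as np

def windowed_cross_correlation(L, R, y0, x0, sigma_px, max_shift_px):
    """Normalized cross-correlation (Equation 9) between Gaussian-windowed
    patches of the left (L) and right (R) half-images, for a range of
    horizontal shifts of the right eye's window.  L and R are 2-D luminance
    arrays; (y0, x0) is the window center in pixels; the caller must keep the
    shifted window inside both images."""
    half = int(np.ceil(3 * sigma_px))          # truncate the window at +/- 3 sigma
    ys = np.arange(y0 - half, y0 + half + 1)
    xs = np.arange(x0 - half, x0 + half + 1)
    yy, xx = np.meshgrid(ys - y0, xs - x0, indexing="ij")
    W = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma_px ** 2))   # Equation 10, isotropic

    patch_L = L[np.ix_(ys, xs)]
    corr = np.zeros(2 * max_shift_px + 1)
    for i, dx in enumerate(range(-max_shift_px, max_shift_px + 1)):
        patch_R = R[np.ix_(ys, xs - dx)]       # right window displaced by dx pixels
        a = W * (patch_L - patch_L.mean())
        b = W * (patch_R - patch_R.mean())
        corr[i] = np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    return corr  # one correlation value per candidate disparity
```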

The lower part of Figure 8 provides an example of the output of the cross-correlator. The abscissa represents the position of WL along the vertical search line and the ordinate represents the horizontal position of WR relative to WL; thus, the ordinate is the horizontal disparity. Red corresponds to high correlations, green to correlations near 0, and blue to negative correlations.

Figure 8.

Local cross-correlation model for disparity estimation. The upper two panels are the half-images presented to the model. The reader can see the sawtooth corrugation by divergent or cross fusing. A Gaussian correlation window WL is placed in the left eye’s half-image. That window was moved along a vertical line as indicated by the arrow. For each position of WL, an identical Gaussian window WR was placed in the right eye’s image, and the cross-correlation (Equation 9) was computed between the two windowed images. Throughout the manuscript, we refer to the size of WL and WR as the “window size.” By this we mean two standard deviations of the Gaussian. The lower panel shows an example of the output of the cross-correlator. The abscissa is the position of WL along the vertical line in the left eye’s image, and the ordinate is the relative horizontal position of WR; this corresponds to the horizontal disparity. Correlation is represented by color, red for correlations approaching 1, green for correlations near 0, and blue for correlations between −1 and 0. In this example, the sawtooth is revealed by ridges of high correlation.

Decision rule

To compare model and human behaviors for the same task, we needed a decision rule for the model. It would have been best to use an ideal decision rule because such a rule is information preserving and is therefore best at revealing constraints imposed by earlier processing stages (Watson, 1985). However, to construct an ideal rule, we would need to know the means and standard deviations for each pixel in the correlation image for all the relevant parameters. The required computation was too complex and time consuming, so we chose a simpler rule: template matching (Watson, Barlow, & Robson, 1983). We first constructed templates of the post-optics stimuli in disparity-estimation space. Specifically, we constructed a bank of templates with the same dimensions as the cross-correlator output. Each template in this bank had the same spatial frequency as the corrugation waveform. All relevant amplitudes were included as were the two parities. To minimize computation time, phase was varied in steps of 10° (the phase of the stimulus was also limited to 10° steps). We next found the similarity of each template to the output of the cross-correlator by using an abbreviated form of cross-correlation: each template was multiplied element by element by the cross-correlator output and the result was summed across both dimensions. The model then picked the stimulus whose template had the highest correlation with the cross-correlator output. The model therefore knew the spatial frequency of the stimulus but was uncertain about everything else. By recording the model’s responses, we were able to construct psychometric functions like the ones generated in the human experiments. These functions were fit by cumulative Gaussians using a maximum-likelihood criterion; the threshold estimates were the means of the best-fitting functions (Wichmann & Hill, 2001).
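The sketch below shows the decision rule in simplified form: each candidate corrugation (parity, amplitude, phase) is converted into a template over the (window position, disparity) space of the correlator output, multiplied element by element with that output, summed, and the best-scoring candidate wins. The Gaussian-ridge templates and all names are illustrative stand-ins; the actual templates were built from the post-optics stimuli themselves.

```python
import numpy as np
from scipy.signal import sawtooth

def template_decision(corr_image, positions, disparities, frequency,
                      amplitudes, phases_deg, ridge_sigma=2.0):
    """Pick the sawtooth (parity, amplitude, phase) whose template best matches
    the correlator output.  corr_image[d, p] is the correlation at disparity
    disparities[d] (arcmin) and window position positions[p] (deg) along the
    search line; frequency is in cpd.  ridge_sigma sets the template ridge
    width in arcmin and is an illustrative value."""
    positions = np.asarray(positions, dtype=float)
    disparities = np.asarray(disparities, dtype=float)
    best, best_score = None, -np.inf
    for parity in (+1, -1):
        for amp in amplitudes:
            for phase in np.radians(phases_deg):
                # Expected disparity profile of this candidate corrugation
                profile = parity * amp * sawtooth(2 * np.pi * frequency * positions + phase)
                # Template: Gaussian ridge centered on the expected disparity
                template = np.exp(-(disparities[:, None] - profile[None, :]) ** 2
                                  / (2 * ridge_sigma ** 2))
                score = np.sum(template * corr_image)   # element-wise product, summed
                if score > best_score:
                    best, best_score = (parity, amp, np.degrees(phase)), score
    return best
```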

Modeling results

The results are shown in Figures 9 and 10. In Figure 9, the model’s coherence threshold is plotted against disparity amplitude. Threshold rose with increasing amplitude, but at different rates depending on the spatial frequency of the horizontal corrugation stimulus. In Figure 10, the same data are plotted as a function of disparity gradient. Now the curves superimpose, so the disparity gradient, not the disparity amplitude, was the primary constraint for the local cross-correlator.

Figure 9.

Coherence threshold vs. disparity amplitude for the cross-correlation model. Coherence threshold, the proportion of signal dots in the stimulus, is plotted as a function of the disparity amplitude of the horizontal sawtooth stimulus. The size of the correlation window used in this simulation was 18 arcmin (recall that this refers to ±1 standard deviation of the Gaussian window). Different symbols represent different spatial frequencies: squares and triangles for 0.3 and 0.6 cpd, respectively. Error bars are standard errors of the means calculated by bootstrapping.

Figure 10.

Coherence threshold vs. disparity gradient for the cross-correlation model. Coherence threshold is plotted as a function of the disparity gradient of the sawtooth slats. The size of the correlation window was 18 arcmin. Different symbols represent different spatial frequencies: squares and triangles for 0.3 and 0.6 cpd, respectively. Error bars are standard errors of the means calculated by bootstrapping.

We also conducted simulations with vertically oriented sawtooth corrugations and observed quite similar behavior (not shown).

The size of the window used to correlate the two eyes’ images is an important aspect of disparity estimation (Banks et al., 2004; Harris et al., 1997; Kanade & Okutomi, 1994). Larger windows contain more variation in luminance and thus generally allow better discrimination between correct and false matches. However, when disparity varies at a finer scale than the window, the ability to estimate the variation is reduced. To investigate the influence of window size on disparity estimation, we ran the local cross-correlator with window sizes of 6, 18, and 30 arcmin. Figure 11 shows the results. For every window size, threshold grew systematically with an increase in disparity gradient, and the growth was the same for different spatial frequencies. Thus, the fall-off in performance with increasing disparity gradient is similar regardless of the size of the correlation window. Overall performance was somewhat reduced with the smallest window of 6 arcmin because the dot density of 10 dots/deg2 was too low to provide sufficient luminance variation over that small a region. We conclude that disparity estimation via correlation suffers from a disparity-gradient limit, and that the limit cannot be avoided by choosing smaller window sizes.

Figure 11.

Coherence threshold vs. disparity gradient for the cross-correlation model with different window sizes. Different colors represent thresholds obtained for different sizes of the correlation window: red for 6 arcmin, blue for 18 arcmin, and green for 30 arcmin. Different symbol types represent thresholds for different spatial frequencies: squares and triangles for 0.3 and 0.6 cpd, respectively. Error bars are standard errors of the means calculated by bootstrapping.

We next compared the performance of the local cross-correlator model with human performance. Figure 12 plots coherence thresholds for the model and a representative human observer for the same stimulus and task. In both cases, the rise in threshold was determined by the disparity gradient, not the disparity amplitude. Thus, the human visual system is constrained in much the same way as a local cross-correlator when it is presented with two images that differ greatly.

Figure 12.

Comparison of model and human performances. Coherence thresholds are plotted as a function of disparity gradient for the local cross-correlator and a representative human observer (HRF). Filled symbols are human data and unfilled symbols are model data. The size of the correlation window used in this simulation was 18 arcmin. Different symbols represent different spatial frequencies: squares and triangles for 0.3 and 0.6 cpd, respectively. Error bars are standard errors of the means calculated by bootstrapping.

There was one systematic difference between human and model behaviors: The rise in threshold as a function of disparity gradient was steeper in humans. This difference is probably caused by differences in the way large absolute disparities are treated by humans and the model. In humans, disparity-discrimination thresholds increase with the value of absolute disparity because disparity estimation is not as precise off the horopter as it is near the horopter (Blakemore, 1970; Ogle, 1953). Furthermore, the detection of correlation in dichoptic stimuli worsens systematically with increases in absolute disparity (Stevenson, Cormack, Schor, & Tyler, 1992). As the disparity gradient increased in our experiment, more of the stimulus fell farther from the horopter, which would have made it more difficult for humans to perceive the signal corrugation. Our model had no provision for penalizing solutions involving large disparities, so it did not suffer in the same way as humans when parts of the stimulus created large absolute disparities.

In summary, the disparity gradient of a stimulus is a critical determinant of humans’ ability to construct stereoscopic percepts. Its influence seems to be unaffected by the orientation or tilt of the depth variations. These same properties are exhibited by a local cross-correlation model. Thus, the disparity-gradient limit appears to be a byproduct of using this method to estimate disparities.

Stereoresolution

Although humans can detect very small disparities, stereoresolution—the finest spatial variation in disparity that can be reliably perceived—is relatively poor (Bradshaw & Rogers, 1999; Tyler, 1973, 1974, 1975). The highest detectable spatial frequency for disparity-defined corrugations is only 2–3 cpd (Figure 1); in contrast, the highest detectable frequency for luminance-defined waveforms is about 50 cpd (Campbell & Robson, 1968). Banks et al. (2004) proposed that relatively coarse stereoresolution is a byproduct of estimating disparity by correlating the two eyes’ images. To further investigate this possibility, we compared the performance of humans and the local cross-correlator in the same stereoresolution task.

Human methods

The human data are from Banks et al. (2004); here we describe the most important aspects of the methods; for details, refer to the original paper.

Stimuli

The stimuli were random-dot stereograms specifying sinusoidal corrugations in depth. The orientations of the corrugations were ±20° from horizontal. There were two viewing distances: 39 and 154 cm. At the shorter viewing distance, the stimuli were viewed using the haploscope described in the disparity-gradient experiment. At the longer distance, the CRTs were detached from the haploscope arms and the stimuli were free-fused. The stimuli subtended 35.5 × 35.5° and 9.3 × 9.3° at the short and long distances, respectively. The apparent positions of the dots were accurate to better than 30 and 8 arcsec at the two distances. By using nearly horizontal corrugations, we minimized monocular artifacts (signal-dependent changes in dot density in the monocular half-images). We varied dot density in the half-images from 0.15 to 145 dots/deg2. The disparity amplitude was 4.8 and 16 arcmin.

Procedure

Stimuli were presented for 600 ms at the shorter viewing distance and 1500 ms at the longer one. After each presentation, observers indicated which of the two corrugation orientations had been presented. The spatial frequency of the corrugation was varied by an adaptive staircase procedure to find the just-discriminable value.

Human results

Figure 13 plots human stereoresolution as a function of dot density. The three curves represent the data when different amounts of spatial blur were applied to the half-images.

Figure 13.

Stereoresolution vs. dot density for human observers. The just-discriminable spatial frequency of the sinusoidal corrugation waveform is plotted as a function of dot density. The data are the averages for the two tested observers, JMA and MSB. Circles, squares, and diamonds represent the results, respectively, for stimuli that were not blurred, stimuli blurred by a Gaussian kernel whose standard deviation was 3 arcmin, and stimuli blurred by a Gaussian whose standard deviation was 6 arcmin. The black diagonal line represents the Nyquist frequency.

The red circles represent the data when no blur was applied. For this condition, resolution increased systematically with increasing density until reaching a plateau at 3–4 cpd. The initial rise in resolution is understandable from sampling considerations. In random-dot stereograms, the discrete dot sampling in the half-images limits the highest frequency one can reconstruct to the Nyquist sampling frequency:

$$f_N = \frac{1}{2}\sqrt{\frac{N}{A}}, \tag{11}$$

where N is the number of dots and A is the stimulus area. The diagonal line in Figure 13 is fN for the various dot densities. Human stereoresolution followed this sampling limit up to a density of ~30 dots/deg2, so the highest resolvable spatial frequency was determined by dot sampling in the half-images at low to medium densities. At higher densities, it was restricted by something else. Banks et al. (2004) showed that the restriction was not the disparity-gradient limit; they showed this by reducing disparity and thereby reducing the disparity gradient, and observing that the asymptotic spatial frequency was unaffected.
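Equation 11 is straightforward to evaluate; the snippet below computes the sampling limit for a couple of example dot densities (a sketch; the particular densities are just illustrations).

```python
import numpy as np

def nyquist_frequency(dots_per_deg2):
    """Highest reconstructible corrugation frequency (Equation 11) for a
    random-dot stereogram with the given dot density:
    f_N = 0.5 * sqrt(N / A), here expressed in cycles per degree."""
    return 0.5 * np.sqrt(dots_per_deg2)

# For example, 10 dots/deg^2 gives ~1.6 cpd, and ~30 dots/deg^2 gives ~2.7 cpd,
# roughly where the unblurred human data in Figure 13 leave the sampling limit.
print(nyquist_frequency(10.0), nyquist_frequency(30.0))
```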

A logical candidate for the cause of the asymptote is low-pass spatial filtering at the front end of the visual system because such filtering limits performance in many resolution tasks including luminance contrast sensitivity at high spatial frequency (Banks, Geisler, & Bennett, 1987; Banks, Sekuler, & Anderson, 1991; MacLeod, Williams, & Makous, 1992) and various types of visual acuity (Levi & Klein, 1985). Banks et al. (2004) examined the contribution of low-pass spatial filtering to stereoresolution by blurring the stimulus. Such filtering reduces fine-scale luminance variation in the half-images. The three sets of curves in Figure 13 represent the data when different amounts of blur were applied. Stereoresolution always followed the Nyquist frequency at low densities and always asymptoted at higher densities. The asymptotic resolution was ~4 cpd when the standard deviation of the stimulus blur was 0, ~2 cpd when the standard deviation was 3 arcmin, and ~1 cpd when it was 6 arcmin. Thus, stereoresolution in humans is limited by the sampling properties of the stimulus at low dot densities and seems to be limited by the luminance spatial-frequency content of the half-images at high densities.

We examined these determinants further by comparing human behavior to the behavior of the cross-correlator. We did so for the same stimuli and task in order to determine whether the relatively coarse stereoresolution of the human visual system is a byproduct of using correlation to estimate disparity.

Modeling methods

The stimuli and task were nearly identical to those in Banks et al. (2004). Sine wave corrugations were presented with orientations that were ±10° from horizontal. The model judged on each presentation which orientation had been presented. The spatial frequency of the corrugations was varied to find the highest value for which orientation discrimination was 75% correct. We varied dot density from 0.6 to 300 dots/deg2. Peak-to-trough disparity amplitude was fixed at 10 arcmin, a value midway between the values of 4.8 and 16 arcmin used in Banks et al. (2004). For some modeling runs, we blurred the half-images with isotropic Gaussians before presenting them to the rest of the model.

The properties of the model were identical to those in the analysis of the disparity-gradient limit with the following exceptions.

  1. Window size was varied over a different range. As before, portions of the half-images were selected with isotropic Gaussian windows, but window sizes were now varied from 1.5 to 60 arcmin.

  2. The window in the left eye was translated in two directions, +10° and −10° from horizontal, corresponding to directions parallel to the two stimulus orientations.

  3. The decision templates were also oriented +10° and −10° from horizontal.

  4. The templates included both the tested spatial frequencies and half those values.

When the stimulus frequency exceeded the Nyquist frequency, aliases were created. Because of this, a lower frequency template fit the stimulus as well as a template at the stimulus frequency. By including alias templates, we did not unfairly bias the model toward picking the correct template. As in the human experiments, the model did not know the spatial frequency of the stimulus but did know the amplitude.

Modeling results

Banks et al. (2004) argued that there are two primary limiting factors in estimating disparity via correlation: the rate at which disparity changes spatially relative to the correlation window and the amount of luminance variation within the correlation window. With respect to the first factor, the cross-correlator should fail to detect disparity variation when the variation occurs at too fine a spatial scale relative to the size of the window because the correlation signal is adversely affected and because the now noisier disparity estimate at each position will smear rather than follow the spatial variation. As a consequence, the highest corrugation frequency the cross-correlator can resolve should level off at a value that is inversely proportional to window size: fA ∝ 1/w, where fA is the asymptotic spatial frequency and w is the window size. The model should be able to detect disparity variation at a finer scale simply by using smaller windows. However, for each window size there should also be a dot density below which estimation fails because the number of dots falling within the correlation window becomes on average too small, and the correlation for false matches rises, leading to failures in disparity estimation. From this argument, the limiting dot density would be inversely proportional to window area: d ∝ 1/w². Thus, for each combination of corrugation frequency and dot density, there should be an optimal size for the correlation window, a size constrained by the two limiting factors.

We investigated whether the stereoresolution of the cross-correlation model follows these expectations. Figure 14 shows the model’s stereoresolution as a function of dot density; different symbols represent resolution for correlation windows of different sizes. In each case, the model’s stereoresolution followed the Nyquist frequency up to a particular dot density and then leveled off.

Figure 14.

Stereoresolution vs. dot density for the local cross-correlation model. The just-discriminable spatial frequency of the sinusoidal corrugation waveform is plotted as a function of dot density. Different symbol types represent the results for different sizes of the correlation window: circles for a standard deviation of the Gaussian window of 3 arcmin, squares for a standard deviation of 9 arcmin, diamonds for a standard deviation of 15 arcmin, and triangles for a standard deviation of 30 arcmin. The legend shows ±1 standard deviation of the correlation window. Error bars are standard errors of the means calculated by bootstrapping. The diagonal line represents the Nyquist frequency.

We next investigated the asymptotes where performance fell short of the Nyquist frequency. To examine the possibility that low-pass spatial filtering at the front end of the visual system limits stereoresolution, we manipulated the blur of the images delivered to the cross-correlation stage (Equation 8) and re-measured the model’s stereoresolution. We manipulated blur by varying the blur of the stimulus and the optical point-spread function of the pre-correlation stages. Figure 15 plots the asymptotic spatial frequency against window size for different amounts of blur. The squares, diamonds, and triangles represent the results when the optical point-spread function was the standard for the well-focused eye (Geisler & Davila, 1985) and before entering the eyes the stereo half-images were convolved with isotropic Gaussians with standard deviations of 0, 2.4, and 4.8 arcmin, respectively. The circles represent the results when the spread of the optical function was half the standard value and no blur was applied to the stimulus before entering the eyes. For large windows, the data are well fit by fA = 45/w (diagonal line; fA in cpd and w in arcmin). (The absolute values of the spatial frequency plateaus can depend on other stimulus parameters, such as disparity amplitude, but a description of the effects of those parameters is beyond the scope of this paper.) For smaller window sizes, however, the asymptotic frequency leveled off at values lower than predicted. Those asymptotic frequencies depended strongly on blur magnitude: lower asymptotes for greater blur. Thus, there are two things that limit the highest discriminable corrugation frequency: the size of the correlation window (summarized by fA = 45/w), and the blur associated with the images sent to the cross-correlator (with greater blur, the luminance variation is insufficient to yield robust correlations between the two eyes’ images).

Figure 15.

Spatial frequency asymptote vs. window size. Different symbols represent different amounts of blur: Circles for half the normal optical blur, squares for regular optical blur, diamonds for stimuli blurred by a 2.4-arcmin Gaussian kernel (plus normal optical blur), and triangles for stimuli blurred by a 4.8-arcmin Gaussian kernel (plus normal optical blur). The dashed black line represents the equation fA = 45/w.

The most important issue is whether the model’s stereoresolution is similar to that of humans. Figure 16 plots model and human stereoresolutions for similar conditions. The similarity is striking. In both cases, resolution grew with increasing dot density in a fashion limited by the Nyquist frequency until it leveled off at a particular spatial frequency. The asymptotic spatial frequency of the model and of humans was inversely related to the magnitude of spatial blur. To examine whether the effect of blur was similar in machine and man, we plotted in Figure 17 the humans’ and the model’s asymptotic frequencies as a function of blur magnitude. We could manipulate the size of the model’s correlation window, but we obviously could not directly manipulate the size of the mechanism humans use. It is notable, however, that the model’s stereoresolution is most similar to human resolution when the model’s window size was set to 6 arcmin: the slopes of the two lines are nearly the same.

Figure 16.

Stereoresolution vs. dot density for the local cross-correlation model and human observers. Filled symbols and solid lines represent the human data and unfilled symbols and dashed lines represent model data. Different symbol types represent different blur magnitudes as indicated by the legend. The values in the legend are the standard deviations of the Gaussian blur kernel applied to the stimuli. Error bars are standard errors of the means calculated by bootstrapping. The diagonal line represents the Nyquist frequency.

Figure 17.

Spatial frequency asymptote vs. blur for model and human results. Filled symbols and solid lines represent the human data and unfilled symbols and dashed lines represent model data. Different symbols represent different window sizes: circles, squares, diamonds, triangles, and hexagons for 6, 12, 18, 30, and 60 arcmin, respectively. Blur values are the standard deviation of the Gaussian kernel applied to the stimulus. Error bars are standard errors of the means calculated by bootstrapping.

We also found, as expected, that for each window size the dot density had to be greater than ~150/w² (where dot density is in dots/deg2 and w is in arcmin) for the model to estimate disparity reliably. Thus, we were able to confirm that there is indeed an optimal window size for each combination of corrugation frequency and element density (Banks et al., 2004; Kanade & Okutomi, 1994).
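To make the trade-off concrete, the short helper below combines the two empirical fits from our simulations (fA = 45/w and the ~150/w² density floor). It is only a summary of those fits; in particular, fA = 45/w holds only for window sizes large enough that blur is not the limiting factor.

```python
def window_limits(window_arcmin):
    """Approximate limits for a correlation window of the given size, using
    the empirical relations reported above: asymptotic corrugation frequency
    fA ~ 45 / w (cpd, w in arcmin; valid where blur is not limiting) and
    minimum usable dot density d ~ 150 / w**2 (dots/deg^2)."""
    return 45.0 / window_arcmin, 150.0 / window_arcmin ** 2

for w in (6, 18, 30):
    fa, d_min = window_limits(w)
    print(f"window {w:2d} arcmin: resolves up to ~{fa:.1f} cpd, "
          f"needs >~{d_min:.2f} dots/deg^2")
```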

Discussion

Disparity estimation by local cross-correlation has been a useful tool in computer vision (Clerc & Mallat, 2002; Kanade & Okutomi, 1994) and physiology (Cumming & DeAngelis, 2001; Ohzawa, 1998; Ohzawa et al., 1990). Now we see that it can explain some important limitations in human vision as well: namely, our inability to perceive large disparity gradients and our comparatively low stereoresolution. The primary difficulty in using local cross-correlation to estimate disparities is choosing an appropriate size for the image patches sampled in each eye (Banks et al., 2004; Kanade & Okutomi, 1994). Using patches that are too large leads to lower stereoresolution, while patches that are too small may not contain enough luminance variation to enable a reliable disparity estimate (Figure 14). This trade-off between the consequences of using too large or too small a correlation window is fundamental to understanding the disparity-gradient limit and stereoresolution.

What determines the size of the smallest correlation window?

As shown in Figure 14, smaller windows generally permit higher stereoresolution, but windows smaller than ~6 arcmin do not yield significant improvements (Figure 15): with optical blur equivalent to the well-focused eye (Geisler & Davila, 1985), the finest discernible corrugation frequency levels off at ~4 cpd no matter what the window size is. Why does decreasing the window size below ~6 arcmin no longer improve the model’s stereoresolution? The answer becomes clear by considering the blur of the inputs to the cross-correlation stage. When we added blur to the stimulus, thereby increasing the input blur, stereoresolution worsened. When we subtracted blur from the inputs by improving the eyes’ optics beyond normal, stereoresolution improved, particularly for window sizes smaller than 6 arcmin. The unfilled symbols in Figure 17 represent the model’s asymptotic corrugation frequency as a function of stimulus blur for different window sizes; the filled symbols represent human asymptotic frequencies as a function of stimulus blur. Interestingly, model and human behaviors are most similar when the model’s window size was 6 arcmin; the model’s threshold frequencies were slightly higher (perhaps because it uses a more efficient decision rule than humans), but the effect of blur was the same when window diameter was 6 arcmin. This suggests that the smallest mechanism in humans has a diameter of roughly 3–6 arcmin, which is the smallest useful size given the optics of the human eye.

Harris et al. (1997) also attempted to determine the smallest spatial mechanism used in disparity estimation. They measured the ability to judge whether a central dot was farther or nearer than surrounding dots as a function of the disparity of the central dot and its separation from the surrounding dots. They compared human performance in that task with that of a binocular-matching algorithm that correlated the two eyes’ images. Human and model performances were most similar when the width of the model’s square correlation window was 4–6 arcmin. In our analysis, we concluded that the smallest mechanism used by human observers has a diameter of 3–6 arcmin (twice the standard deviation of the Gaussian envelope). Thus, despite differences in the psychophysical tasks, we and Harris et al. (1997) both conclude that the smallest useful mechanism in disparity estimation has a diameter of 3–6 arcmin.

How constraining is the disparity-gradient limit in everyday vision?

As Burt and Julesz (1980) showed, the disparity-gradient limit constrains the fusibility of multiple objects. Because of this limit, a visible point creates zones in front of and behind itself within which a second point cannot be fused. The limit is independent of the direction in which the gradient is measured (i.e., independent of the direction of S in Equation 3), so the forbidden zones are cones with their apices at each visible point.

How constraining is the disparity-gradient limit for everyday vision? To examine this question, we calculated the slants that correspond to different disparity gradients for a variety of distances; the geometric relationship between slant and disparity gradient is tilt independent, so the calculations are valid for horizontal-disparity gradients in any direction. Figure 18 shows the results. For a typical reading distance of 40 cm, a disparity-gradient limit of 1 means that opaque surfaces with slants greater than ±80° cannot be readily perceived stereoscopically. Moreover, slants of 80° or more are relatively unlikely to stimulate a given part of the retina. If slants (S) are uniformly distributed in the world, the distribution of slants that stimulate a region in the retina is proportional to cos(S) defined from −90° to +90° (Arnold & Binford, 1980; Hillis, Watt, Landy, & Banks, 2004). As a consequence, it is quite uncommon for a given part of the retina to be stimulated by slants sufficiently large to exceed the disparity-gradient limit at distances of 40 cm and beyond. We conclude that the gradient limit is generally not problematic for everyday viewing of opaque surfaces. Gradients exceeding the disparity-gradient limit are much more likely when viewing transparent surfaces (Akerstrom & Todd, 1988).

Figure 18.

Slant as a function of the disparity gradient for different viewing distances. To generate this plot, we created small opaque surface patches that were rotated about a horizontal axis (tilt = 90°). The disparity gradient was defined as the horizontal disparity divided by the separation in the direction of most rapidly increasing depth. The shaded region represents the gradients that exceed the nominal disparity-gradient limit of 1 (Burt & Julesz, 1980). This plot would be quite similar for other surface tilts.

The disparity-gradient limit and disparity direction

In Equations 2–5, we defined the disparity gradient between two points on a surface as |D|/|S|, where D is the vector representing the binocular disparity between the points and S is the vector representing their separation. As far as we know, all previous investigations of the disparity-gradient limit have considered only horizontal disparities. We have argued here that the disparity-gradient limit is caused by the decrease in local cross-correlation that occurs when the two eyes’ images become too different, as happens with large disparity gradients. From this point of view, it should make little difference whether the images differ horizontally or vertically. We wondered, therefore, whether a similar disparity-gradient limit applies for vertical disparity. Figure 19 demonstrates that such a limit does exist for vertical disparity and that the critical value is about the same as it is for horizontal disparity.

Figure 19.

Demonstration of disparity-gradient limits for horizontal and vertical disparities. The left panels demonstrate the disparity-gradient limit when the disparity D between two dots is horizontal and the separation S is vertical. Thus, the dot pairs contain a vertical gradient of horizontal disparity. D is constant for all of the dot pairs, but the magnitude of S decreases from top to bottom, and therefore the disparity gradient |D|/|S| increases from top to bottom. Cross fuse or divergently fuse the half-images and observe the cyclopean image. Most viewers find that fusion breaks and diplopia ensues for the fifth or sixth dot pair from the top, where the disparity gradient is ~1. The right panels demonstrate the disparity-gradient limit when D is vertical and S is horizontal (the magnitudes of D and S are unchanged). Thus, the dot pairs contain a horizontal gradient of vertical disparity. Again D is constant for all dot pairs and |S| decreases from top to bottom, so the disparity gradient increases from top to bottom. Cross fuse or divergently fuse the half-images. Most viewers again find that diplopia ensues for the fifth or sixth dot pair from the top, where the disparity gradient is ~1.

It is interesting to note that the causes of vertical disparity are quite different from the causes of horizontal disparity. Non-zero vertical disparities arise because of differences in the distances of surface patches from the two eyes. Large vertical disparities therefore occur only at extreme azimuths (measured from straight ahead) and near distances; they are, in other words, a product of position relative to the head, not of surface slant. Non-zero horizontal disparities arise because of differences in distance from the two eyes but also because of surface slant. Thus, the smoothness and ordering constraints proposed as the motivation for a disparity-gradient limit with horizontal disparity (Pollard, Mayhew, & Frisby, 1985; Trivedi & Lloyd, 1985) do not apply to vertical disparity. As a consequence, the presence of a disparity-gradient limit with vertical disparities is further evidence that the limit is a byproduct of estimating disparity by correlating the two eyes’ images.

Disparity estimation by cross-correlation is tuned to a zero disparity gradient

When using local cross-correlation to estimate disparity, the correlation is highest when the two retinal images, once shifted into registration, are very similar. The highest correlations are therefore observed when the disparity gradient is zero and the images are identical, which occurs only when the stimulus surface is straight ahead and frontoparallel (tangent to the Vieth–Müller Circle). Any deviation from those conditions causes dissimilarities in the retinal images and therefore reduces correlation. One could modify the cross-correlation algorithm to overcome this difficulty by warping the images in the two eyes, before the cross-correlation stage, to account for the expected magnitude and direction of the gradient in each region of the image (Panton, 1978). Constructing such an algorithm would, however, be computationally expensive. For each visual direction, the algorithm would require several units, each designed to warp the images in the fashion appropriate for the expected surface tilt. To encompass 360° of tilt, the algorithm would require at least four units tuned to tilts differing by 90° from one another. The fact that the local cross-correlator presented here mimics human behavior reasonably well suggests that such image warping does not occur before the site of binocular interaction.
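
The point can be illustrated with a toy one-dimensional version of the computation. In the sketch below, one eye’s samples of a random texture are horizontally magnified to mimic a disparity gradient; the magnification convention, window size, and texture are illustrative assumptions, so only the qualitative trend matters.

```python
import numpy as np

rng = np.random.default_rng(0)
fine = rng.standard_normal(4096)          # finely sampled 1-D random texture
t = np.arange(fine.size)

def peak_windowed_corr(gradient, n=64, sigma=8.0):
    """Best Gaussian-windowed correlation, over candidate shifts, between
    two eyes' samples of the same texture when one eye's samples are
    horizontally magnified to mimic a disparity gradient (toy 1-D case)."""
    x = np.arange(n) - n / 2
    win = np.exp(-x ** 2 / (2 * sigma ** 2))
    left = np.interp(4 * x + 2048, t, fine)
    best = -1.0
    for d in np.arange(-12, 12.25, 0.25):                 # candidate disparities
        right = np.interp(4 * x * (1 + gradient) + 2048 + d, t, fine)
        a = win * (left - left.mean())
        b = win * (right - right.mean())
        best = max(best, float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b))))
    return best

for g in (0.0, 0.25, 0.5, 1.0, 2.0):
    print(f"gradient {g:4.2f} -> peak correlation {peak_windowed_corr(g):.2f}")
# Peak correlation is 1 at a gradient of 0 and falls as the gradient grows,
# which is the basis of the argument above.
```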

Nienborg, Bridge, Parker, and Cumming (2004) came to a similar conclusion. They measured the sensitivity of disparity-selective V1 neurons to sinusoidal disparity waveforms and found that such neurons are low-pass: their sensitivity to low corrugation frequencies is not significantly lower than their peak sensitivity. From this, they concluded that V1 neurons are not tuned for non-zero disparity gradients and that they therefore limit the ability of the visual system to resolve fine variations in disparity. If V1 neurons are indeed not selective for slant and tilt, as Nienborg and colleagues propose, how are the slant- and tilt-selective neurons in extrastriate cortex (Nguyenkim & DeAngelis, 2003; Shikata, Tanaka, Nakamura, Taira, & Sakata, 1996) created? They could be constructed by combining the outputs of V1 neurons tuned for different depths: with the right combination, a higher-order neuron selective for a particular magnitude and direction of disparity gradient would result. As Nienborg et al. point out, such higher-order neurons could not have greater stereoresolution than is observed among V1 neurons. We believe our psychophysical results manifest this: although the human visual system is exquisitely sensitive to small disparities, stereoresolution is relatively poor because of the manner in which disparity is estimated.
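
A hedged sketch of the sort of combination alluded to above follows. The Gaussian disparity tuning of the subunits, the linear progression of preferred disparity across the receptive field, and the simple averaging rule are all our illustrative assumptions, not a claim about cortical circuitry.

```python
import numpy as np

def gradient_selective_response(disparity_profile, preferred_gradient, sigma_tune=0.05):
    """Response of a hypothetical higher-order unit that pools V1-like
    subunits whose preferred disparity increases linearly with position,
    so the pooled response is largest when the stimulus disparity
    gradient matches preferred_gradient."""
    x = np.linspace(-0.5, 0.5, disparity_profile.size)    # position across the RF (deg)
    preferred = preferred_gradient * x                    # each subunit's preferred disparity (deg)
    subunit = np.exp(-(disparity_profile - preferred) ** 2 / (2 * sigma_tune ** 2))
    return subunit.mean()

x = np.linspace(-0.5, 0.5, 101)
stimulus = 0.3 * x                                        # surface with a disparity gradient of 0.3
for pg in (0.0, 0.15, 0.3, 0.45):
    print(f"preferred gradient {pg:.2f}: response {gradient_selective_response(stimulus, pg):.2f}")
# The pooled unit responds best when its preferred gradient matches the
# stimulus gradient, even though each subunit is tuned only to a disparity,
# and its spatial resolution can be no finer than that of its subunits.
```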

Conclusion

The boundary at the upper right of the stereo depth region in Figure 1 represents the disparity-gradient limit, and the boundary at the top represents the stereoresolution limit. We have shown here that the local cross-correlation model exhibits both phenomena. The model should therefore exhibit constraints on the combinations of disparity amplitude and spatial frequency like those in Figure 1. To examine this, we constructed an orientation-discrimination task similar to the one used in the Stereoresolution section of this paper and presented it to the model. A random-dot stereogram defined a disparity-modulated sine wave rotated ±10° from horizontal. To map the right-hand boundary of the depth region, spatial frequency was fixed and the disparity-amplitude threshold was measured; to map the top boundary, amplitude was fixed and the spatial-frequency threshold was found. Figure 20 shows the modeling results (red points) along with Tyler’s data from human observers. As we expected, the model’s fusion limit in the upper right falls on a line with a slope of −1; this is very similar to Tyler’s results except that the model can handle larger disparities, in part because it is not penalized for off-horopter disparity estimates. The model’s stereoresolution levels off at ~6 cpd, a bit higher than the human asymptote. The human and model data are therefore quite similar in shape except for the model’s generally higher sensitivity.
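
For concreteness, here is a hedged sketch of the kind of stimulus just described: a random-dot stereogram whose disparity is modulated sinusoidally along an axis rotated ±10° from horizontal, at the 300 dots/deg² used for Figure 20. The field size and the even split of the disparity between the two half-images are our assumptions.

```python
import numpy as np

def corrugated_rds(freq_cpd, amp_arcmin, theta_deg, size_deg=2.0, density=300, rng=None):
    """Dot positions (in deg) for the two half-images of a random-dot
    stereogram carrying a sinusoidal disparity corrugation rotated
    theta_deg from horizontal."""
    rng = rng if rng is not None else np.random.default_rng()
    n_dots = int(density * size_deg ** 2)
    x = rng.uniform(-size_deg / 2, size_deg / 2, n_dots)
    y = rng.uniform(-size_deg / 2, size_deg / 2, n_dots)
    theta = np.radians(theta_deg)
    # Corrugation phase varies along the axis perpendicular to its crests.
    phase = 2 * np.pi * freq_cpd * (y * np.cos(theta) + x * np.sin(theta))
    disparity = (amp_arcmin / 60.0) * np.sin(phase)       # deg
    left = np.column_stack([x - disparity / 2, y])        # half the disparity to each eye
    right = np.column_stack([x + disparity / 2, y])
    return left, right

# One trial's stimulus: a 2-cpd corrugation of 4-arcmin amplitude, tilted +10 deg.
left_dots, right_dots = corrugated_rds(freq_cpd=2.0, amp_arcmin=4.0, theta_deg=+10)
```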

Figure 20.

Combinations of disparity amplitude and spatial frequency that yield depth percepts. The filled points are data replotted from Tyler (1975). The shaded region represents the combinations of disparity amplitude and spatial frequency that produce stereoscopic depth percepts. The unshaded region represents combinations that do not yield such percepts. The red points are the modeling results. The points in the upper right were generated by running an orientation-discrimination task (±10°) with disparity-defined sine waves. Spatial frequency was fixed, and the maximum disparity amplitude yielding 75% correct performance was determined. The points at the top were generated in the same way except that disparity amplitude was fixed and the spatial-frequency threshold was determined. Dot density was 300 dots/deg2 and window size was 6 arcmin. Along the black dashed line, the product of spatial frequency and disparity is constant.

We have shown that the disparity-gradient limit and the stereoresolution limit can both be understood as consequences of using local cross-correlation to estimate disparity. These limitations of human stereopsis are thus byproducts of the method used for disparity estimation.

Acknowledgments

The authors thank Sergei Gepshtein for helpful discussions at the beginning of the project and Björn Vlaskamp for helpful comments on an earlier draft. This research was supported by NIH Research Grant EY-R01-08266 to MSB.

Footnotes

Commercial relationships: none.

Contributor Information

Heather R. Filippini, UCSF and UCB Joint Graduate Group in Bioengineering, University of California at Berkeley, Berkeley, CA, USA heather.filippini@gmail.com

Martin S. Banks, School of Optometry, Department of Psychology, and Helen Wills Neuroscience Institute, University of California at Berkeley, Berkeley, CA, USA martybanks@berkeley.edu

References

  1. Akerstrom RA, Todd JT. The perception of stereoscopic transparency. Perception & Psychophysics. 1988;44:421–432. doi:10.3758/bf03210426.
  2. Anzai A, Ohzawa I, Freeman RD. Neural mechanisms for processing binocular information. I. Simple cells. Journal of Neurophysiology. 1999;82:891–908. doi:10.1152/jn.1999.82.2.891.
  3. Arnold R, Binford T. Geometric constraints on stereo vision. Proceedings of SPIE. 1980;238:281–292.
  4. Backus BT, Banks MS, van Ee R, Crowell JA. Horizontal and vertical disparity, eye position, and stereoscopic slant perception. Vision Research. 1999;39:1143–1170. doi:10.1016/s0042-6989(98)00139-4.
  5. Banks MS, Geisler WS, Bennett PJ. The physical limits of grating visibility. Vision Research. 1987;27:1915–1924. doi:10.1016/0042-6989(87)90057-5.
  6. Banks MS, Gepshtein S, Landy MS. Why is spatial stereoresolution so low? Journal of Neuroscience. 2004;24:2077–2089. doi:10.1523/JNEUROSCI.3852-02.2004.
  7. Banks MS, Sekuler AB, Anderson SJ. Peripheral spatial vision: Limits imposed by optics, photoreceptors, and receptor pooling. Journal of the Optical Society of America A, Optics and Image Science. 1991;8:1775–1787. doi:10.1364/josaa.8.001775.
  8. Blakemore C. The range and scope of binocular depth discrimination in man. The Journal of Physiology. 1970;211:599–622. doi:10.1113/jphysiol.1970.sp009296.
  9. Bradshaw MF, Rogers BJ. Sensitivity to horizontal and vertical corrugations defined by binocular disparity. Vision Research. 1999;39:3049–3056. doi:10.1016/s0042-6989(99)00015-2.
  10. Burt P, Julesz B. A disparity gradient limit for binocular fusion. Science. 1980;208:615–617. doi:10.1126/science.7367885.
  11. Campbell FW, Robson JG. Application of Fourier analysis to the visibility of gratings. The Journal of Physiology. 1968;197:551–566. doi:10.1113/jphysiol.1968.sp008574.
  12. Clerc M, Mallat S. The texture gradient equation for recovering shape from texture. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002;24:536–549.
  13. Cormack LK, Stevenson SB, Schor CM. Interocular correlation, luminance contrast and cyclopean processing. Vision Research. 1991;31:2195–2207. doi:10.1016/0042-6989(91)90172-2.
  14. Cumming BG, DeAngelis GC. The physiology of stereopsis. Annual Review of Neuroscience. 2001;24:203–238. doi:10.1146/annurev.neuro.24.1.203.
  15. Fleet DJ, Wagner H, Heeger DJ. Neural encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Research. 1996;36:1839–1857. doi:10.1016/0042-6989(95)00313-4.
  16. Geisler WS, Davila KD. Ideal discriminators in spatial vision: Two-point stimuli. Journal of the Optical Society of America A, Optics and Image Science. 1985;2:1483–1497. doi:10.1364/josaa.2.001483.
  17. Harris JM, McKee SP, Smallman HS. Fine-scale processing in human binocular stereopsis. Journal of the Optical Society of America A, Optics, Image Science, and Vision. 1997;14:1673–1683. doi:10.1364/josaa.14.001673.
  18. Held R, Banks MS. Misperceptions in stereoscopic displays: A vision science perspective. Proceedings of the ACM Symposium on Applied Perception in Graphics and Visualization (APGV ’08). 2008:23–31. doi:10.1145/1394281.1394285.
  19. Hillis JM, Banks MS. Are corresponding points fixed? Vision Research. 2001;41:2457–2473. doi:10.1016/s0042-6989(01)00137-7.
  20. Hillis JM, Watt SJ, Landy MS, Banks MS. Slant from texture and disparity cues: Optimal cue combination. Journal of Vision. 2004;4(12):967–992. http://journalofvision.org/4/12/1/, doi:10.1167/4.12.1.
  21. Kanade T, Okutomi M. A stereo matching algorithm with an adaptive window: Theory and experiment. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1994;16:920–932.
  22. Levi DM, Klein SA. Vernier acuity, crowding and amblyopia. Vision Research. 1985;25:979–991. doi:10.1016/0042-6989(85)90208-1.
  23. Li Z, Hu G. Analysis of disparity gradient based cooperative stereo. IEEE Transactions on Image Processing. 1996;5:1493–1506. doi:10.1109/83.541420.
  24. MacLeod DI, Williams DR, Makous W. A visual nonlinearity fed by single cones. Vision Research. 1992;32:347–363. doi:10.1016/0042-6989(92)90144-8.
  25. Marr D, Poggio T. Cooperative computation of stereo disparity. Science. 1976;194:283–287. doi:10.1126/science.968482.
  26. McKee SP, Verghese P. Stereo transparency and the disparity gradient limit. Vision Research. 2002;42:1963–1977. doi:10.1016/s0042-6989(02)00073-1.
  27. Nguyenkim JD, DeAngelis GC. Disparity-based coding of three-dimensional surface orientation by macaque middle temporal neurons. Journal of Neuroscience. 2003;23:7117–7128. doi:10.1523/JNEUROSCI.23-18-07117.2003.
  28. Nienborg H, Bridge H, Parker AJ, Cumming BG. Receptive field size in V1 neurons limits acuity for perceiving disparity modulation. Journal of Neuroscience. 2004;24:2065–2076. doi:10.1523/JNEUROSCI.3887-03.2004.
  29. Ogle KN. Precision and validity of stereoscopic depth perception from double images. Journal of the Optical Society of America. 1953;43:907–913. doi:10.1364/josa.43.000906.
  30. Ohzawa I. Mechanisms of stereoscopic vision: The disparity energy model. Current Opinion in Neurobiology. 1998;8:509–515. doi:10.1016/s0959-4388(98)80039-1.
  31. Ohzawa I, DeAngelis GC, Freeman RD. Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science. 1990;249:1037–1041. doi:10.1126/science.2396096.
  32. Panton D. A flexible approach to digital stereo mapping. Photogrammetric Engineering & Remote Sensing. 1978;44:1499–1512.
  33. Panum PL. Physiologische Untersuchungen über das Sehen mit zwei Augen. Kiel, Germany: Schwers; 1858.
  34. Pollard SB, Mayhew JE, Frisby JP. PMF: A stereo correspondence algorithm using a disparity gradient limit. Perception. 1985;14:449–470. doi:10.1068/p140449.
  35. Porrill J, Mayhew J, Frisby J. Cyclotorsion, conformal invariance, and induced effects in stereoscopic vision. In: Frontiers of visual science: Proceedings of the 1985 Symposium. Washington, DC: National Academy Press; 1985. pp. 90–108.
  36. Prazdny K. Detection of binocular disparities. Biological Cybernetics. 1985;52:93–99. doi:10.1007/BF00363999.
  37. Prince SJ, Eagle RA. Stereo correspondence in one-dimensional Gabor stimuli. Vision Research. 2000;40:913–924. doi:10.1016/s0042-6989(99)00242-4.
  38. Rogers BJ, Bradshaw MF. Vertical disparities, differential perspective and binocular stereopsis. Nature. 1993;361:253–255. doi:10.1038/361253a0.
  39. Rogers BJ, Bradshaw MF. Disparity scaling and the perception of frontoparallel surfaces. Perception. 1995;24:155–179. doi:10.1068/p240155.
  40. Shikata E, Tanaka Y, Nakamura H, Taira M, Sakata H. Selectivity of the parietal visual neurones in 3D orientation of surface of stereoscopic stimuli. Neuroreport. 1996;7:2389–2394. doi:10.1097/00001756-199610020-00022.
  41. Stevens KA. Representing and analyzing surface orientation. In: Winston PH, Brown RH, editors. Artificial intelligence: An MIT perspective. Cambridge, MA: MIT Press; 1979. pp. 104–125.
  42. Stevenson SB, Cormack LK, Schor CM, Tyler CW. Disparity tuning in mechanisms of human stereopsis. Vision Research. 1992;32:1685–1694. doi:10.1016/0042-6989(92)90161-b.
  43. Trivedi HP, Lloyd SA. The role of disparity gradient in stereo vision. Perception. 1985;14:685–690. doi:10.1068/p140685.
  44. Tyler CW. Stereoscopic vision: Cortical limitations and a disparity scaling effect. Science. 1973;181:276–278. doi:10.1126/science.181.4096.276.
  45. Tyler CW. Depth perception in disparity gratings. Nature. 1974;251:140–142. doi:10.1038/251140a0.
  46. Tyler CW. Spatial organization of binocular disparity sensitivity. Vision Research. 1975;15:583–590. doi:10.1016/0042-6989(75)90306-5.
  47. Watson AB. The ideal observer concept as a modeling tool. In: Frontiers of visual science: Proceedings of the 1985 Symposium. Washington, DC: National Academy Press; 1985.
  48. Watson AB, Barlow HB, Robson JG. What does the eye see best? Nature. 1983;302:419–422. doi:10.1038/302419a0.
  49. Wichmann FA, Hill NJ. The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics. 2001;63:1293–1313. doi:10.3758/bf03194544.
  50. Ziegler LR, Hess RF, Kingdom FA. Global factors that determine the maximum disparity for seeing cyclopean surface shape. Vision Research. 2000;40:493–502. doi:10.1016/s0042-6989(99)00206-0.
