Abstract
A fundamental natural visual task is the identification of specific target objects in the environments that surround us. It has long been known that some properties of the background have strong effects on target visibility. The most well-known properties are the luminance, contrast, and similarity of the background to the target. In previous studies, we found that these properties have highly lawful effects on detection in natural backgrounds. However, there is another important factor affecting detection in natural backgrounds that has received little or no attention in the masking literature, which has been concerned with detection in simpler backgrounds. Namely, in natural backgrounds the properties of the background often vary under the target, and hence some parts of the target are masked more than others. We began studying this factor, which we call the “partial masking factor,” by measuring detection thresholds in backgrounds of contrast-modulated white noise that was constructed so that the standard template-matching (TM) observer performs equally well whether or not the noise contrast modulates in the target region. If noise contrast is uniform in the target region, then this TM observer is the Bayesian optimal observer. However, when the noise contrast modulates then the Bayesian optimal observer weights the template at each pixel location by the estimated reliability at that location. We find that human performance for modulated noise backgrounds is predicted by this reliability-weighted TM (RTM) observer. More surprisingly, we find that human performance for natural backgrounds is also predicted by the RTM observer.
Keywords: natural scene statistics, detection, masking, normalization, reliability weighting
The human visual system has a remarkable ability to identify objects in natural scenes. The complex structure of natural objects and environments makes this a very difficult task. Target objects are randomly positioned and oriented in the three-dimensional (3D) environment, are viewed under complex and varying lighting conditions, and are viewed in a complex context of other objects that may partially occlude or mask the target object. We are still a long way from fully understanding how our visual and cognitive systems perform object identification in the face of all of the dimensions of stimulus variation that occur under natural conditions.
However, what has become clear is that the retinal images produced from natural scenes contain many statistical regularities and that the human brain has learned, through evolution and through experience over the lifespan, to exploit those statistical regularities in order to make accurate inferences about whether or not target objects are present within a scene. Identification of targets in natural scenes is inherently a statistical inference problem, and hence the natural theoretical framework is Bayesian statistical decision theory. In this framework, the statistical regularities of natural scenes are represented as prior probability distributions. Here we consider a statistical property of natural-image backgrounds, the “partial masking factor,” and show that it is exploited by the human visual system in a near-optimal way that is consistent with Bayesian inference.
A long history of behavioral studies has shown that three other background properties have a substantial effect on target identification performance: the luminance of the background, the contrast of the background, and the spatial similarity of the background to the target object (chromatic similarity is also important but here we are only concerned with achromatic stimuli). For simple backgrounds, the target amplitude required for identification increases with the background luminance (1, 2), background contrast (3–6), and background similarity to the target (7–10). It is important to note that most of these studies used deterministic backgrounds that did not vary from trial to trial (except for photon noise). Nonetheless, the psychophysical laws observed in these studies are likely to be a consequence of peripheral and central neural mechanisms that evolved specifically to support identification performance under natural conditions where the stimulus properties are almost always varying randomly.
In a recent study (11), we used a constrained-sampling method to examine how these properties affect identification in natural images and to measure the statistics of these properties in natural images. For each target object considered, millions of patches of natural-image background were sorted into a 3D histogram where each bin contains patches having a particular luminance, contrast, and similarity to the target. We then measured human identification thresholds for a sparse subset of the bins tiling the space, where on each trial a different background from the bin was randomly selected. We found that 1) threshold increases approximately linearly with local mean luminance, in agreement with the classic finding of Weber’s law reported for detection in uniform backgrounds (1, 2), 2) threshold amplitude power increases approximately linearly with background contrast power in agreement with the classic finding for detection in white noise (5) and more recent findings for detection in correlated noise (12, 13), and 3) threshold increases approximately linearly with the similarity of the background to the target, a result not previously reported. (Similarity was defined as the cosine of the vector angle between the Fourier amplitude spectra of the target and background; see ref. 11 for details.)
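As a rough illustration of the similarity measure just described, the following sketch computes the cosine of the angle between two Fourier amplitude spectra. The function name, the mean subtraction, and the absence of any windowing are assumptions made for illustration; ref. 11 gives the exact definition used in the study.

```python
import numpy as np

def amplitude_spectrum_similarity(target, background):
    """Cosine of the angle between the Fourier amplitude spectra of two
    equal-size patches (illustrative; see ref. 11 for the exact definition)."""
    # Remove the mean so the DC term does not dominate (an assumption).
    t = np.abs(np.fft.fft2(target - target.mean())).ravel()
    b = np.abs(np.fft.fft2(background - background.mean())).ravel()
    return float(t @ b / (np.linalg.norm(t) * np.linalg.norm(b)))

# Toy patches: a horizontal and a vertical grating with 4 cycles per patch.
n = 32
y, x = np.mgrid[0:n, 0:n]
horizontal = np.sin(2 * np.pi * 4 * y / n)
vertical = np.sin(2 * np.pi * 4 * x / n)

print(amplitude_spectrum_similarity(horizontal, horizontal))  # identical spectra
print(amplitude_spectrum_similarity(horizontal, vertical))    # nearly disjoint spectra
```

Because the measure uses amplitude spectra only, it ignores phase: a background can score as highly similar to the target without containing anything that looks like the target at the target location.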
These factors each had a large effect, and their effects combine multiplicatively, consistent with separable multidimensional Weber’s law: Amplitude threshold is proportional to the product of the background luminance, contrast, and similarity: $a_t \propto L \cdot C \cdot S$. Within a typical natural image, these factors cause the threshold for identification to vary by more than a factor of 30 (11). Importantly, we found that this multidimensional Weber’s law is predicted directly from the statistical properties of the natural backgrounds, and we showed that the luminance, contrast, and similarity normalization known to occur along the visual pathway may be the mechanism by which the brain incorporates this statistical knowledge (prior probabilities) into its target-identification circuitry (Discussion).
Nonetheless, there is, subjectively, at least one other important factor. When measuring the threshold for a particular bin in our experiment, the subjects viewed hundreds of trials, where each trial had a different background of the same luminance, contrast, and similarity. Subjects reported that on a subset of trials the target was relatively easy to detect. These easy trials seemed to occur when the properties of the background under the target were inhomogeneous. For example, if there was a lower contrast subregion, then the part of the target in the low-contrast subregion was much easier to detect, and often this visible part of the target was sufficient for identification. Others have also noted this phenomenon when subjects are trying to detect image-compression artifacts in natural images (14, 15), but it has never been studied systematically. Here we measure, analyze, and model this partial masking factor, in both simple backgrounds and natural backgrounds. Identification of partially masked objects is closely related to identification of partially occluded objects (Discussion).
Partial masking in white noise backgrounds is illustrated in Fig. 1A. The same target image of a lion’s head is added to the background noise images on the Left and Right. On the Left, the background noise is uniform in contrast, whereas on the Right the background noise is modulated in contrast. However, the total contrast energy of the background noise covering the target is the same on the Left and on the Right. Models that ignore the variation in background properties under the target (e.g., the standard template-matching (TM) observer which is optimal in uniform noise) predict equal detectability on the Left and Right. However, the target is more visible on the Right because the properties of the background are changing under the target—the target is only partially masked.
Fig. 1.
Demonstration of partial masking. (A) The same lion’s head target (identical pixel gray levels) is added to two noise texture backgrounds. For the uniform contrast texture on the left, the standard TM observer is optimal. The TM observer applies a template (receptive field) having the shape of the target (Left image in B). For modulated contrast textures like the image on the right, the TM observer is not optimal. In fact, here it performs the same for the Left and Right images because the total noise power in the target region is the same for both images. For the general case where the background noise can modulate (as in the Right image) the RTM observer is optimal. The RTM observer applies a template that is normalized at each pixel location by the estimated noise variance at that location. The RTM observer performs better because it suppresses information from pixel locations that are more corrupted by noise. (B) Optimal reliability-weighted templates for Left and Right images in A.
In what follows we first consider the effect of partial masking on the identification of targets in contrast-modulated white noise. We describe an exact mathematical formula for Bayes optimal performance in this task and show that it predicts the effect of partial masking on human performance for modulated white noise backgrounds. We then show, surprisingly, that this same formula predicts the effects of partial masking on human detection performance in natural backgrounds. Finally, we discuss possible neural mechanisms and the general implications of the results for object identification in natural images and in other kinds of images (e.g., medical images).
Results
Partial Masking in Contrast-Modulated White Noise.
We first consider partial masking for spatially modulated Gaussian white noise. In this case, it is possible to determine the optimal computations (the Bayesian ideal observer) for detection when there are arbitrary background subregions within the target region. The ideal observer is useful because it reveals the fundamental computational principles of a perceptual task and because it provides an appropriate baseline against which to compare human performance (16, 17).
Consider a single-interval identification task. At random, on half the trials, the stimulus is a background of spatially modulated white noise, $N(\mathbf{x})$, and on the other half an arbitrary known target pattern of amplitude $a$ is added to the noise background (e.g., see Fig. 2):
$$I(\mathbf{x}) = a\,T(\mathbf{x}) + N(\mathbf{x}) \tag{1}$$
where $\mathbf{x}$ denotes pixel location. The noise background is statistically independent Gaussian noise with a mean luminance of $\bar{L}$ and an arbitrary SD map $\sigma(\mathbf{x})$:
$$N(\mathbf{x}) \sim \mathcal{N}\!\left(\bar{L},\, \sigma^{2}(\mathbf{x})\right) \tag{2}$$
The target shape is defined so that its Euclidean norm is one, $\|\mathbf{T}\| = 1$, and hence the amplitude is the square root of the target energy (the dot product of the target with itself). (Here, bold symbols represent vectors.)
Fig. 2.
Spatially modulated Gaussian noise experiment (experiment 1). (A) The target was a horizontal 7.5-cpd square-wave grating. The backgrounds were varied parametrically in the ratio of the SDs between the high- and low-contrast regions and the spatial frequency of the noise contrast modulation. In all cases the average noise contrast power was held fixed at 0.061. (B) Trial sequence in the single-interval identification task.
In SI Appendix we show that the ideal observer applies a reliability-weighted template (receptive field) that has the target shape multiplied at each pixel by the inverse of the local noise variance (the reliability), $1/\sigma^{2}(\mathbf{x})$, or equivalently, the image and template are both normalized at each pixel location by the local SD. The optimal templates for the uniformly and partially masked lion’s head are shown in Fig. 1B. If the template response exceeds a criterion, then the observer reports that the target is present; otherwise, it reports that the target is absent.
The detectability of this reliability-weighted TM (RTM) observer is proportional to the product of the target amplitude and the “partial masking factor” $\|\mathbf{T}/\mathbf{u}\|$:
$$d' = \frac{a\,\|\mathbf{T}/\mathbf{u}\|}{\bar{\sigma}} \tag{3}$$
The partial masking factor is the Euclidean norm of the locally normalized template shape,
$$\|\mathbf{T}/\mathbf{u}\| = \sqrt{\sum_{\mathbf{x}} \frac{T^{2}(\mathbf{x})}{u^{2}(\mathbf{x})}} \tag{4}$$
where $\mathbf{u} = \boldsymbol{\sigma}/\bar{\sigma}$ is the relative SD map: the SD map $\boldsymbol{\sigma}$ divided by the mean value of the SD map $\bar{\sigma}$. The partial masking factor is defined this way so that it depends only on the shape of the SD map; that is, it is unaffected by arbitrary scaling of the local variances.
The detection threshold of the RTM observer is proportional to one over the partial masking factor:
$$a_t \propto \frac{1}{\|\mathbf{T}/\mathbf{u}\|} \tag{5}$$
In the special case of constant noise variance, the RTM observer reduces to the well-known formula for targets in white noise: Detectability is the square root of the target energy divided by the noise SD (17): $d' = a/\sigma$.
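The relationship between the two observers can be checked numerically. The sketch below is a minimal illustration, assuming independent Gaussian noise with a known SD at each pixel; the function names, the one-dimensional toy target, and the specific numbers are chosen for illustration and are not taken from the experiments. It evaluates the closed-form detectability of the standard TM observer and of the reliability-weighted observer for a uniform SD map and for a modulated SD map with the same mean noise power.

```python
import numpy as np

def tm_dprime(a, T, sigma):
    """d' of the standard template-matching observer (template = target shape)."""
    return a * np.sum(T**2) / np.sqrt(np.sum(T**2 * sigma**2))

def partial_masking_factor(T, sigma):
    """Euclidean norm of the target shape divided by the relative SD map."""
    u = sigma / sigma.mean()
    return np.sqrt(np.sum((T / u) ** 2))

def rtm_dprime(a, T, sigma):
    """d' of the reliability-weighted observer: a * ||T/u|| / mean(sigma)."""
    return a * partial_masking_factor(T, sigma) / sigma.mean()

# Toy 1-D example: flat unit-norm target, equal mean noise power in both maps.
n, power, ratio = 64, 0.061, 4.0
T = np.ones(n) / np.sqrt(n)
sigma_uniform = np.full(n, np.sqrt(power))
sigma_low = np.sqrt(2 * power / (1 + ratio**2))
sigma_modulated = np.where(np.arange(n) < n // 2, sigma_low, ratio * sigma_low)

for name, s in [("uniform", sigma_uniform), ("modulated", sigma_modulated)]:
    print(name, tm_dprime(1.0, T, s), rtm_dprime(1.0, T, s))
```

In this toy case the TM observer has the same detectability for both maps, whereas the reliability-weighted observer gains a factor of (1 + r²)/(2r) for an SD ratio r, which is about 2.1 for r = 4.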
The theoretical analysis above assumes that the reliability at each location within the target region is known. However, when trying to identify a target under natural conditions the structure of the background typically varies on every trial, and hence the local reliabilities need to be estimated on every trial. The direct estimate is to compute the gray-level variance in the local neighborhood around each image pixel. However, the optimal neighborhood size depends on the statistics of the noise modulation. If we assume the human visual system evolved low-level mechanisms that take into account the local SDs that occur in natural scenes, then it is plausible that the visual system uses some fixed, but unknown, neighborhood size. Thus, for the purpose of comparison with human performance, we assume that the estimated local SD, $\hat{\sigma}(\mathbf{x})$, is the square root of the sample variance under a raised-cosine window with a diameter $w$ at half height (SI Appendix). Note that by definition the local SD is the product of the local luminance and local rms contrast [i.e., $\sigma(\mathbf{x}) = L(\mathbf{x})\,C(\mathbf{x})$]. Thus, normalizing by the local SD is equivalent to normalizing by the local luminance and the local contrast. Such normalization is known to occur in the early stages of the visual system. When we assume that the RTM observer is using the estimated local SDs rather than the true local SDs we represent the partial masking factor as $\|\mathbf{T}/\hat{\mathbf{u}}\|$.
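A minimal sketch of such a local SD estimate is given below. It assumes a radially symmetric raised-cosine window that falls to zero at a radius equal to the half-height diameter (so that the full width at half height is w), a window size specified in pixels rather than degrees, and circular (wrap-around) boundaries via the FFT. These choices, and the function names, are illustrative assumptions; the window actually used is specified in SI Appendix.

```python
import numpy as np

def raised_cosine_window(half_height_diameter_px):
    """Normalized 2-D raised-cosine window whose full width at half height
    equals the given diameter (the window reaches zero at that radius)."""
    R = float(half_height_diameter_px)
    k = int(np.ceil(R))
    y, x = np.mgrid[-k:k + 1, -k:k + 1]
    r = np.hypot(x, y)
    w = np.where(r < R, 0.5 * (1.0 + np.cos(np.pi * r / R)), 0.0)
    return w / w.sum()

def local_sd(image, half_height_diameter_px):
    """Square root of the window-weighted sample variance around each pixel."""
    image = np.asarray(image, dtype=float)
    w = raised_cosine_window(half_height_diameter_px)
    k = w.shape[0] // 2
    kernel = np.zeros_like(image)
    kernel[:w.shape[0], :w.shape[1]] = w
    kernel = np.roll(kernel, (-k, -k), axis=(0, 1))  # centre the window on pixel (0, 0)
    K = np.fft.rfft2(kernel)

    def smooth(z):
        # Circular convolution with the window (image must be larger than the window).
        return np.fft.irfft2(np.fft.rfft2(z) * K, s=image.shape)

    mean = smooth(image)
    var = smooth(image**2) - mean**2
    return np.sqrt(np.clip(var, 0.0, None))

# Toy check: noise whose SD is 1 in the left half and 4 in the right half.
rng = np.random.default_rng(0)
noise = rng.standard_normal((128, 128))
noise[:, 64:] *= 4.0
sd_map = local_sd(noise, 8)
print(sd_map[:, 16:48].mean(), sd_map[:, 80:112].mean())
```

Dividing the resulting map by its mean over the target region gives an estimate of the relative SD map, from which the estimated partial masking factor follows.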
Psychophysical experiments can be used to test whether humans behave like the RTM observer and to estimate the local normalization diameter. In experiment 1 we measured detection thresholds for horizontal grating targets in uniform and modulated Gaussian noise backgrounds (Fig. 2A). To measure the spatial scale over which the visual system estimates local reliability we varied the spatial frequency of the square-wave noise modulation from 0.25 to 2 cycles per degree (cpd). We also varied the modulation ratio of the noise SDs, $\sigma_{\mathrm{high}}/\sigma_{\mathrm{low}}$, while holding the average noise power of the background fixed at 0.061 (rms contrast 0.246) across all conditions. The mean luminance was also held fixed at 48 cd/m² across all conditions.
The trial sequence for a contrast modulation frequency of 1 cpd and a modulation ratio of 4 is illustrated in Fig. 2B. The stimulus duration was set to be similar to the typical duration of human fixations (250 ms). For each condition we measured a psychometric function based on at least 300 trials in the single-interval identification task. The amplitude threshold was defined as the target amplitude giving 69% correct ($d' = 1$).
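The correspondence between 69% correct and a detectability of 1 follows from standard signal detection theory for a single-interval task with equal-variance Gaussian responses, equal priors, and an unbiased criterion, where the proportion correct is Φ(d′/2). The short sketch below evaluates that relationship; the function name is illustrative, and the unbiased-criterion assumption is ours.

```python
from math import erf, sqrt

def percent_correct(dprime):
    """Proportion correct in a single-interval task with equal priors and an
    unbiased criterion: Phi(d'/2), with Phi the standard normal CDF."""
    return 0.5 * (1.0 + erf(dprime / (2.0 * sqrt(2.0))))

for d in (0.0, 0.5, 1.0, 2.0):
    print(d, round(percent_correct(d), 3))
```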
The symbols in Fig. 3A show the thresholds in decibels measured for three observers. Note that when the modulation frequency is 0 cpd the background is uniform noise (a modulation ratio of 1:1). Threshold is a U-shaped function of modulation frequency that reaches a minimum in the range of 0.25 to 0.5 cpd. As modulation ratio increases, the minimum threshold declines. The maximum partial masking effect occurs for a modulation ratio of 8 and is ∼12 dB for two observers and 8 dB for the third (a drop in the threshold contrast by a factor of 3 on average). The symbols in Fig. 3B show the average thresholds for the three observers. The standard TM observer (which is optimal for uniform noise backgrounds) predicts that the thresholds for all conditions should be identical. Obviously, the human visual system takes into account the variation in local reliability within the target region.
Fig. 3.
Thresholds in the partial masking experiment for spatially modulated white noise (experiment 1). (A) Thresholds for three observers as a function of the amplitude and frequency of noise modulation. The thresholds are expressed in units of Michelson contrast (amplitude threshold divided by mean luminance) on a logarithmic (decibel) scale. (B) Average threshold for the three observers. The solid curves show predictions of an RTM observer that estimates local reliability (inverse of noise contrast energy) from a fixed-size neighborhood around each image pixel location. A simple TM observer predicts exactly the same threshold for all conditions. Half-height width w: subject 1 = 0.25°, subject 2 = 0.25°, subject 3 = 0.32°, average = 0.25°.
To estimate the spatial scale over which the visual system estimates reliability, we fit the data with an RTM observer having the half-height diameter and overall scale factor as free parameters (the scale parameter simply translates all of the predicted thresholds vertically on the decibel scale). The curves in Fig. 3B show the least squares fit. The RTM observer predicts the pattern of results. The percentage of variance accounted for (r²) is 96%. The estimated half-height diameter is 15 arc min (0.25°) of visual angle, and the estimated scale parameter is 0.18 (efficiency = 0.032). We also fit the subjects’ data individually. The percentages of variance accounted for are given in the figure and the half-height diameters in the caption. The RTM observer’s performance increases with noise ratio because the quality of the information in the low-noise region improves. The RTM observer’s performance declines with the frequency of noise modulation above 0.25 cpd because the summation width w for the local noise estimation increasingly overlaps with the multiple contrast regions in the modulated noise.
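The reported scale parameter and efficiency are numerically consistent with one conventional reading, offered here as an assumption rather than as the stated definition: if the scale parameter $k$ is the ratio of the RTM observer's threshold amplitude to the human threshold amplitude, and efficiency $\eta$ is the square of that ratio, then

```latex
\eta = \left(\frac{a_{t,\mathrm{RTM}}}{a_{t,\mathrm{human}}}\right)^{2} = k^{2} = 0.18^{2} \approx 0.032
```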
Partial Masking in Natural Images.
We now consider how the theory and experimental findings for partial masking in white noise translate to detection in natural images. As mentioned in the Introduction, Sebastian et al. (11) used a constrained sampling approach to examine the effect of background luminance (L), contrast (C), and similarity (S) on detection thresholds for targets added to natural backgrounds. They began by determining the performance of the TM observer. Specifically, they measured the SD of the template responses (dot product of target shape and natural image) to millions of natural image backgrounds and found that the SD is approximately proportional to the product of the background luminance, contrast, and similarity,
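$$\sigma_{\mathrm{TM}} \propto L \cdot C \cdot S \tag{6}$$
and hence the detectability of the TM observer is given by
$$d' \propto \frac{a}{L \cdot C \cdot S} \tag{7}$$
Eq. 6 implies that the thresholds of the TM observer in natural backgrounds conform to separable multidimensional Weber’s law:
$$a_t \propto L \cdot C \cdot S \tag{8}$$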
Sebastian et al. (11) then showed that Eq. 8 predicts human thresholds for background patches randomly sampled from a subset of the bins covering the luminance–contrast–similarity space, and Dorronsoro et al. (18) subsequently found that the equation holds for a wider range of conditions.
Strictly speaking, Eq. 8 cannot correctly predict thresholds when background luminance and contrast are near zero, where neural noise (or other internal factors) dominates human thresholds. However, here we are concerned only with relatively high-luminance and high-contrast backgrounds where the equation holds to good approximation.
Eqs. 7 and 8 do not include the effects of partial masking. However, a plausible hypothesis is that the visual system uses mechanisms that evolved to process natural backgrounds, and that those same mechanisms were being tapped in the noise-masking experiment described above. Thus, it is plausible that Eqs. 3 and 5 (with the same neighborhood size for computing local reliability) apply to natural backgrounds. If so, then for natural backgrounds we have
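$$d' \propto \frac{a\,\|\mathbf{T}/\hat{\mathbf{u}}\|}{L \cdot C \cdot S} \tag{9}$$
and
$$a_t \propto \frac{L \cdot C \cdot S}{\|\mathbf{T}/\hat{\mathbf{u}}\|} \tag{10}$$
where $\|\mathbf{T}/\hat{\mathbf{u}}\|$ is the partial masking factor.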
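The separable form of these predictions can be written as a pair of small functions. This is only a sketch: the proportionality constant k, the function names, and the example values of contrast, similarity, and partial masking factor are hypothetical, and only relative thresholds are meaningful.

```python
def tm_threshold(L, C, S, k=1.0):
    """Relative amplitude threshold of the TM observer: proportional to L*C*S."""
    return k * L * C * S

def rtm_threshold(L, C, S, pmf, k=1.0):
    """Relative amplitude threshold of the RTM observer: L*C*S divided by the
    partial masking factor (pmf)."""
    return k * L * C * S / pmf

# Hypothetical backgrounds with the same luminance but different C, S, and pmf.
L = 33.0
for C, S, pmf in [(0.1, 0.2, 1.0), (0.1, 0.2, 2.0), (0.2, 0.2, 2.0)]:
    print(C, S, pmf, tm_threshold(L, C, S), rtm_threshold(L, C, S, pmf))
```

Because the factors are multiplicative, doubling the partial masking factor halves the predicted threshold regardless of the values of L, C, and S, which on a decibel axis corresponds to a fixed vertical shift.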
We assume here that the partial masking factor is computed in exactly the same way that it is in experiment 1, from the local contrast and luminance in a neighborhood size w of 0.25°. Recall that the partial masking factor is independent of any scale factor on the local variances and hence should be independent and separable from the scale factors L, C, and S.
We tested this prediction in experiment 2. Specifically, we picked four bins in luminance, contrast, and similarity (LCS) space that contain large numbers of image patches (Methods). (The luminance in the four bins was the same.) For each bin we computed the value of the partial masking factor for every image patch in the bin and then binned those patches into 20 sub-bins. For five of these sub-bins (the 5th, 25th, 50th, 75th, and 95th percentile) we measured psychometric functions in a single-interval forced choice task, where the background on each trial was randomly sampled (without replacement) from the sub-bin.
Fig. 4 shows a randomly sampled image from each of the five tested sub-bins within each of the four tested bins. The same windowed sinewave target (Fig. 4, Upper Left) of the same amplitude is added to all of the images. Subjectively, target detectability tends to increase with the value of the partial masking factor and tends to decrease with the level of similarity and contrast.
Fig. 4.
Example stimuli from each of the four bins tested in experiment 2. All of the images in a row have approximately the same background luminance, contrast, and similarity in the target region. The luminance L was the same (33 cd/m²) for all four bins. The contrast C and similarity S of each bin are given on the left. A 4-cpd windowed sinewave target (Upper Left, but with a lower fixed amplitude) is added to the center of each image. In agreement with the measured thresholds, target visibility tends to decrease subjectively from top to bottom and from right to left.
Fig. 5A shows the average amplitude thresholds of three subjects (two of the subjects were different from those in the white-noise experiment) for detecting the windowed 4-cpd sinewave target (Fig. 4, Upper Left) as a function of the partial masking factor (in percentile) for each of the four bins. Thresholds and psychometric functions for the individual subjects are given in SI Appendix, Figs. S2 and S3. As expected from Sebastian et al. (11), thresholds are highest when the background contrast and similarity are high and lowest when the background contrast and similarity are low. Thresholds are intermediate when the background contrast is high and the similarity is low or when the background contrast is low and the similarity is high. Indeed, the vertical spacing between the curves is correctly predicted from Eq. 8 (Fig. 5B). As expected from the partial masking experiment with white noise (experiment 1) the thresholds decrease substantially as the partial masking factor increases. This effect is not predicted by the standard TM observer (Eq. 8), which predicts essentially no effect (Fig. 5B). However, as shown in Fig. 5C, when the partial masking factor is included (Eq. 10) the effect is accurately predicted (98% of the variance accounted for). Importantly, the partial masking factor of the RTM observer was computed in the same way, and with the same diameter of the spatial neighborhood w used for the white-noise predictions in Fig. 3. The accuracy of these (essentially parameter-free) predictions is rather remarkable given the differences between white-noise and natural backgrounds and between the targets in the two experiments.
Fig. 5.
Threshold for the sinewave target in Fig. 4 as a function of the partial masking factor in percentile, in natural backgrounds from four different bins in LCS space (experiment 2). (A) Average measured thresholds for three observers. Error bars are bootstrapped 68% confidence intervals. (B) Thresholds of the TM observer (Eq. 8). These predictions are based on computing the luminance (L), contrast (C), and similarity (S) for the specific stimulus in each trial. (C) Thresholds of the RTM observer (Eq. 10). The percentile binning of the partial masking factor was done separately for each of the four bins in LCS space. However, the model predictions are based on the values of $\|\mathbf{T}/\hat{\mathbf{u}}\|$, L, C, and S measured from the specific stimulus on each trial.
Discussion
Natural backgrounds contain regions where the properties of the background (e.g., its luminance, contrast, and texture) are relatively uniform and other regions where the properties are rapidly changing. If a target stimulus happens to fall on a region where the background properties are changing, then the target will not be uniformly occluded by the masking properties of the background. Under these circumstances the target may be identifiable from its less-occluded parts. This phenomenon, which we call partial masking, had been noted in the past but not studied systematically. We began by describing the formal (ideal observer) theory of partial masking in contrast-modulated white noise. The ideal observer computes the response of a template (i.e., a receptive field) that has the shape of the target divided at each pixel location by the noise power at that location. We found that this RTM observer accurately predicts human detection thresholds as a function of the amplitude and frequency of the noise modulation, under the assumption that the visual system computes the local noise power (reliability) from a small neighborhood centered on each pixel location (Fig. 3B). The size of this neighborhood was estimated from the average subject data to be 0.25° at half height.
We then asked whether this RTM observer predicts detection of targets in natural backgrounds. To do this, we built upon a constrained-sampling approach (11, 18), where natural backgrounds are binned into a 3D histogram along the dimensions of luminance, contrast, and similarity. In these previous studies, we found that both the SD of template responses and human detection thresholds are the separable product of the luminance, contrast, and similarity of the natural background (Eqs. 6 and 8). The direct combination of Eqs. 5 and 8 yields the prediction of the RTM observer for natural backgrounds (Eq. 10). To test this prediction, we picked four bins in LCS space that contained a large number of samples and then binned those along the partial masking factor dimension (with neighborhood size still 0.25°) to obtain five sub-bins. We then measured thresholds for each of the five sub-bins in each bin and found that human thresholds were accurately predicted by Eq. 10 (Fig. 5). The results suggest that the partial masking effects observed in white noise generalize to natural backgrounds and that foveal detection in natural backgrounds is predicted by four multiplicative factors: the background’s luminance, contrast, and similarity and the inverse of the partial masking factor.
The results are consistent with the hypothesis that the brain implements target identification in a fashion consistent with Bayesian inference from the statistical properties of natural scenes. In effect, the brain performs target identification by modeling the statistical properties of the environment.
Other Factors.
There are other factors that must affect detection in natural backgrounds. For example, as mentioned earlier, under low-luminance and -contrast conditions neural noise and/or neural threshold nonlinearities tend to dominate human behavioral thresholds, causing deviations from Weber’s law (19). There are also other intrinsic factors responsible for the overall lower efficiency of human observers relative to the RTM observer. These factors appear to be largely background-independent and include target uncertainty (20, 21), late decision noise (22), and certain forms of pooling inefficiency. Simulations (and intuition) suggest that these intrinsic factors tend to scale overall performance without affecting the pattern of thresholds predicted by the four factors considered here (23). Thus, the four factors described here may be the dominant factors responsible for the variation in thresholds (for a given fixed target) across different natural backgrounds, at least in the fovea. Other intrinsic factors contribute substantially to detection thresholds in the periphery. These include reduced resolution due to retinal spatial pooling and downsampling (13), substantially increased target uncertainty (24), and other central losses of spatial information (25). A next step will be to include these largely known intrinsic factors to obtain, and then test, principled models for identification in natural images across the visual field.
Another potentially important factor may be a spatial-frequency “whitening” operation. It is well known that the ideal observer for identifying target objects in filtered (e.g., blurred) Gaussian noise applies a template that is altered (whitened) so that it responds less to those spatial frequencies in the target that have high amplitude in the noise (26). This whitening operation can be regarded as reliability weighting in the spatial-frequency domain, rather than in the space domain like the current RTM observer. However, unlike reliability weighting in space, whitening in spatial frequency assumes that properties of the background are statistically stationary in the region of the target. In other words, whitening in spatial frequency is not sensitive to the variations of background properties within the target region. Whitening in spatial frequency has been considered as a potentially important factor affecting target identification, especially in the medical-imaging literature (refs. 27, 28, 29, and 30; see ref. 31 for a review). The evidence suggests that in the human visual system whitening has a weak effect under some conditions (27) and a stronger effect under others (28–30). For the sinewave (narrowband) targets used here in the natural-background experiment, the whitening operation cannot have more than a very small effect, because even without whitening the narrowband template eliminates most background spatial frequencies.
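For comparison with spatial reliability weighting, the whitening operation can be sketched in its common textbook form, in which the Fourier transform of the target is divided by the noise power spectrum. This form, the function name, and the low-pass noise spectrum used below are assumptions for illustration and are not taken from the cited studies.

```python
import numpy as np

def prewhitened_template(target, noise_power_spectrum):
    """Template whose spectrum is the target spectrum divided by the noise
    power spectrum (both in NumPy's unshifted FFT layout)."""
    return np.real(np.fft.ifft2(np.fft.fft2(target) / noise_power_spectrum))

n = 32
fy = np.fft.fftfreq(n)[:, None]
fx = np.fft.fftfreq(n)[None, :]
f = np.hypot(fx, fy)
lowpass_power = 1.0 / (0.01 + f**2)   # hypothetical noise with more power at low frequencies

rng = np.random.default_rng(0)
target = rng.standard_normal((n, n))  # arbitrary broadband toy target
white = prewhitened_template(target, np.ones((n, n)))
whitened = prewhitened_template(target, lowpass_power)

gain = np.abs(np.fft.fft2(whitened)) / np.abs(np.fft.fft2(target))
print(gain[0, 1], gain[0, n // 2])    # low-frequency gain vs. high-frequency gain
```

For white noise the operation leaves the template unchanged, and a single power spectrum is applied to the whole patch, which is why this kind of weighting cannot respond to variations of the background within the target region.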
As mentioned in the Introduction, under real-world conditions target objects generally do not produce a fixed target pattern at a known retinal location but a target pattern that depends on the 3D position and orientation of the object in the 3D scene. Thus, in addition to the factors described here, models for real-world conditions will need to take into account the effects of pose and lighting in a 3D environment. In the Bayesian framework, this would involve finding the maximum posterior probability over various pose and lighting variables.
Neural Mechanisms.
The accuracy of Eq. 10 strongly suggests that human detection thresholds in natural backgrounds (and in contrast-modulated white noise) are based on fundamental computational principles that follow directly from the statistical properties of natural signals. An important question is how these computational principles are implemented in the nervous system. For the dimensions of luminance, contrast, and similarity (at the scale of the target) Sebastian et al. (11) show that a biologically plausible and efficient computation is to normalize template responses (or more generally target feature responses) by the product of the luminance, contrast, and similarity. This normalization makes it possible to achieve near-optimal identification performance with a single fixed decision criterion, even when the background properties and the amplitude of the target are randomly varying from one occasion to the next, as they do under natural conditions. Without normalization, near-optimal performance would require dynamically adjusting the decision criterion on each occasion based on estimates of the luminance, contrast, and similarity. The advantage of normalization is particularly clear when the prior of the target being present is low, as is also typical under natural conditions (11, 32).
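The benefit of normalization for a fixed decision criterion can be illustrated with a small simulation. The sketch below assumes Gaussian target-absent template responses whose SD is proportional to the product of luminance, contrast, and similarity; the two SD values, the criterion, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000
criterion = 1.0

def false_alarm_rate(responses, criterion):
    """Fraction of target-absent responses that exceed the criterion."""
    return float(np.mean(responses > criterion))

# Two hypothetical backgrounds whose template-response SD (proportional to
# L*C*S) differs by a factor of four.
lcs = {"low LCS": 1.0, "high LCS": 4.0}
raw, normalized = {}, {}
for name, sd in lcs.items():
    responses = rng.normal(0.0, sd, n_trials)   # target-absent template responses
    raw[name] = false_alarm_rate(responses, criterion)
    normalized[name] = false_alarm_rate(responses / sd, criterion)

print(raw)          # false-alarm rate changes with the background
print(normalized)   # false-alarm rate is roughly constant after normalization
```

With raw responses the same criterion yields very different false-alarm rates in the two backgrounds, whereas after dividing by the background-dependent SD a single criterion gives nearly the same rate in both.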
There is strong neurophysiological evidence for luminance (19, 33), contrast (33–35), and perhaps similarity (36, 37) normalization (gain control) in the visual system, as well as for other forms of normalization throughout the nervous system (38). These normalization effects operate rapidly enough to easily keep up with the changes in the retinal image that occur every 200 to 400 ms during saccadic inspection or visual search. Furthermore, the fact that the correct normalization for natural backgrounds is the separable product is convenient for natural selection, because it implies that the order in which the normalization is performed is arbitrary, and that for each factor the normalization could be arbitrarily distributed along the visual pathway and could involve feed-forward or feed-back pathways.
It is possible that the local reliability weighting needed to explain the partial masking effects is also implemented through normalization, but at a smaller spatial scale. In SI Appendix and Results we show that the RTM observer for modulated white noise backgrounds could be implemented by local luminance and contrast normalization of the input image together with local luminance and contrast normalization of the template. SI Appendix, Fig. S1 illustrates how the RTM observer could be implemented for arbitrary backgrounds using local and global normalization. Thus, it seems likely that some plausible neural models, having normalization at multiple scales, will be consistent with Eq. 10. Such models should be testable in neurophysiology experiments.
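The equivalence between reliability weighting and local normalization can be checked numerically. The sketch below is a simplified illustration rather than the model in SI Appendix: it treats only contrast (not luminance) normalization, assumes independent Gaussian noise whose SD at each pixel is known rather than estimated, and uses a hypothetical template and contrast-modulation pattern.

```python
import numpy as np

def dprime(weights, template, sigma, amplitude=1.0):
    """Detectability of the linear observer sum(weights * image) when the
    image is amplitude * template plus independent Gaussian noise with
    standard deviation sigma at each pixel."""
    mean_shift = amplitude * np.sum(weights * template)
    response_sd = np.sqrt(np.sum(weights**2 * sigma**2))
    return mean_shift / response_sd

def rtm_response(image, template, sigma):
    """Reliability-weighted response: the weights are template / sigma^2."""
    return np.sum(template * image / sigma**2)

def normalized_response(image, template, sigma):
    """Same quantity, computed by normalizing image and template separately."""
    return np.sum((image / sigma) * (template / sigma))

n = 32                                              # hypothetical patch size
y, x = np.mgrid[:n, :n]
template = np.sin(2 * np.pi * y / 8.0)              # hypothetical grating target
sigma = 1.0 + 0.8 * np.sin(2 * np.pi * x / 16.0)    # noise SD, always >= 0.2

rng = np.random.default_rng(0)
image = 0.5 * template + sigma * rng.standard_normal((n, n))

d_tm = dprime(template, template, sigma)               # standard TM observer
d_rtm = dprime(template / sigma**2, template, sigma)   # RTM observer
print(f"TM d' = {d_tm:.2f}, RTM d' = {d_rtm:.2f}")
print(rtm_response(image, template, sigma),
      normalized_response(image, template, sigma))
```

The two response functions return the same value for any image, which is the sense in which dividing both the image and the template by the local noise SD implements the reliability-weighted template. By the Cauchy-Schwarz inequality the RTM detectability is never lower than the TM detectability in this setting.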
It is important to emphasize that there are many possible neural implementations of the RTM observer that are more biologically plausible than a single reliability-weighted template. For example, target objects and templates could also be represented as the responses of a set of multiscale basis functions (simple cells) that are each weighted by local reliability. It is also possible that when detecting extended targets, like the grating target in experiment 1, the visual system does not integrate over the whole target region but instead computes the maximum response over a set of smaller templates that correspond to parts (features) of the extended target. We chose to focus here on single templates to simplify the mathematics and intuitions.
Other Types of Targets and Backgrounds.
Historically, the study of visual detection and identification has been directed almost exclusively at situations where the light from the target is physically added to (or subtracted from) the light from the background. This is convenient for experiments because one can measure psychometric functions at any location in the visual field for any background by varying the amplitude of the target. Additive targets are also convenient for theory because mathematical models tend to be simpler and more tractable for additive targets.
Although a great deal has been learned about the visual system from experiments with additive targets, it is more common in the natural environment for a foreground target to be opaque and hence to occlude rather than add to the background. Template matching is still an appropriate computation in this case (39), and Eq. 6 still holds when the target is absent, but there is no significant partial masking except in the peripheral retina where spatial pooling and downsampling blend target and background light in the neighborhood of the target boundary. Partial masking occurs primarily when the target is not in the foreground and is partially occluded by opaque elements of the scene. In this case, the visual system still down-weights the template/features in the occluded region, and the downweighting could still be implemented neurally by local normalization; however, other mechanisms (e.g., perceptual grouping) must be used to control the normalization factors. It is possible that these other mechanisms also contribute to the partial masking effects observed with additive targets.
It should be noted that additive targets are also the most common targets in laboratory studies of visual search, visual recognition, and crowding. Additivity holds whenever the “opaque” target and distractor objects are arrayed without overlap against a uniform background. Additivity is only violated mathematically when the opaque target blocks unknown features of the background, or when the target is blocked by opaque elements of the background.
There are real-world cases where additive-target models are reasonable approximations. Two cases are when the target is a cast shadow and when a target object is viewed through a partially transparent atmosphere (e.g., fog). Moderately transparent backgrounds and targets also occur in medical images and airport-security images. Like natural images, medical and security images are typically more structured than white noise, and hence partial masking (as well as luminance, contrast, and similarity masking) is common. In medical images, anatomical structures frequently produce local background contrast that masks part of the targets of interest (e.g., tumors). Thus, another possible next step would be to incorporate Eq. 10 (or its underlying principles) into models of detection in medical and security images, as well as in detection of compression and noise artifacts in digital imagery.
Methods
Experiment 1.
Psychometric functions were measured in a single-interval forced-choice task for 13 different conditions: 4 contrast-modulation frequencies × 3 contrast-modulation amplitudes, plus a uniform-noise background (Fig. 2). The stimuli were 256 × 256 pixels at 120 pixels/° (2.13° × 2.13°) and were presented for 250 ms. Each noise pixel was 4 × 4 image pixels, and its value was randomly sampled on each trial from a Gaussian probability distribution (Eq. 2). To reduce local contrast-adaptation effects, the phase of the contrast modulation was shifted 180° (phase reversed) on each trial. The subject’s head was stabilized with a chin and head rest. The target was a horizontal 7.5-cpd square-wave grating that was windowed with a raised-cosine falloff having a half-height width of 128 pixels. A square-wave grating was used so the luminance profile of the target would remain largely invariant even at very low target amplitudes (this was not necessary in experiment 2, where thresholds were sufficiently high). Each psychometric function was based on at least 300 trials spread over two sessions (5 amplitudes × 30 trials × 2 sessions). Amplitude thresholds were calculated by fitting a generalized cumulative Gaussian function to hits and false alarms using a maximum likelihood procedure. Threshold was defined as the target amplitude giving d′ = 1 (percent correct of 69% given optimal criterion). The threshold parameter was bootstrapped to estimate the 68% confidence interval.
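The threshold criterion can be related to hit and false-alarm rates with a short calculation. This is a sketch under the standard equal-variance Gaussian signal-detection model with equal prior probabilities of target present and absent; under those assumptions, 69% correct at the optimal criterion corresponds to a detectability of d′ = 1. The function names are illustrative, and the sketch does not reproduce the generalized cumulative Gaussian fit or the bootstrap.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero mean, unit variance

def dprime(hit_rate, false_alarm_rate):
    """Detectability from hit and false-alarm rates (equal-variance model)."""
    return _STD_NORMAL.inv_cdf(hit_rate) - _STD_NORMAL.inv_cdf(false_alarm_rate)

def percent_correct_at_optimal_criterion(d):
    """Proportion correct for detectability d when target-present and
    target-absent trials are equally likely and the criterion sits midway
    between the two response distributions."""
    return _STD_NORMAL.cdf(d / 2.0)

print(round(percent_correct_at_optimal_criterion(1.0), 3))
print(round(dprime(0.6915, 0.3085), 2))
```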
Experiment 2.
Background image patches were selected based on the constrained-sampling approach described elsewhere (11). Briefly, a large database of calibrated gray-scale natural images (4,284 × 2,844 pixels) was divided into patches the size of the target stimulus (101 × 101 pixels). (The images and calibration procedure are available at http://natural-scenes.cps.utexas.edu/.) For each patch, the luminance (L), rms contrast (C), and cosine similarity of the amplitude spectrum of the patch to that of the target (S) were calculated. Based on these three values, each patch was then sorted into a 3D histogram. The histogram was a cube consisting of 10 bins along each dimension. Thus, each bin contained a set of image patches that were approximately equal in luminance, contrast, and similarity.
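The patch statistics and the binning step can be sketched as follows. This is an illustration under stated assumptions, not the procedure of ref. 11: luminance is taken as the patch mean, rms contrast as the SD divided by the mean, and similarity as the cosine between the amplitude spectrum of the mean-subtracted patch and that of the target. The bin edges, the example target, and the random example patch are hypothetical.

```python
import numpy as np

def patch_statistics(patch, target):
    """Return (luminance, rms contrast, amplitude-spectrum similarity)."""
    luminance = patch.mean()
    contrast = patch.std() / luminance
    amp_patch = np.abs(np.fft.fft2(patch - luminance))   # mean removed
    amp_target = np.abs(np.fft.fft2(target))
    similarity = np.sum(amp_patch * amp_target) / (
        np.linalg.norm(amp_patch) * np.linalg.norm(amp_target))
    return luminance, contrast, similarity

def bin_indices(stats, edges):
    """Map an (L, C, S) triple to indices into a 10 x 10 x 10 histogram.
    Each entry of `edges` holds 11 increasing bin edges for one dimension;
    values outside the edges are clipped to the first or last bin."""
    return tuple(
        int(np.clip(np.digitize(value, e) - 1, 0, len(e) - 2))
        for value, e in zip(stats, edges))

# Hypothetical bin edges for luminance, contrast, and similarity.
edges = (np.linspace(0, 100, 11), np.linspace(0, 1, 11), np.linspace(0, 1, 11))

size = 101
half = (size - 1) / 2
y, x = np.mgrid[:size, :size] - half
r = np.hypot(x, y)
window = np.where(r < half, 0.5 * (1 + np.cos(np.pi * r / half)), 0.0)
target = window * np.sin(2 * np.pi * y / 30.0)        # windowed sinewave

rng = np.random.default_rng(1)
patch = rng.uniform(10.0, 60.0, size=(size, size))    # stand-in for an image patch
stats = patch_statistics(patch, target)
print(stats, bin_indices(stats, edges))
```

Because amplitude spectra are nonnegative, the similarity measure defined this way lies between 0 and 1, and it equals 1 when the mean-subtracted patch is a scaled copy of the target.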
For the experiment, we selected four bins from this histogram that had a large number of patches (>15,000). Each bin had approximately the same luminance value (33 cd/m²). One of these bins had high contrast (0.29 rms) and high cosine similarity (0.34), one had high contrast (0.29 rms) and low similarity (0.22), one had low contrast (0.2 rms) and high similarity (0.34), and one had low contrast (0.2 rms) and low similarity (0.22). Within each of these four bins, patches were further subdivided into 20 sub-bins (percentile ranges) based on the partial masking factor. Psychometric functions were measured for five of these sub-bins (<5%, 22.5 to 27.5%, 47.5 to 52.5%, 72.5 to 77.5%, and >95%). From each of these sub-bins, 300 image patches were randomly chosen to be background image patches in the psychophysical experiment, for a total of 1,500 image patches in each of the four bins.
Psychometric functions were measured using a single-interval forced-choice procedure like that used in the white-noise experiment (Fig. 2), where the stimulus was presented for 250 ms and subjects were asked to report the presence or absence of the target. The target was like one of the targets in Sebastian et al. (11): a horizontal 4-cpd sinewave that was windowed by a radially symmetric raised cosine with half-height width of 0.42°. The target was the same size as the background patch, but as illustrated in Fig. 4 the background was expanded to include the surrounding background pixels for a total size of 256 × 256 pixels (2.13° × 2.13°). Each of the 20 psychometric functions measured for each subject was based on 300 trials spread over two sessions (5 amplitudes × 30 trials × 2 sessions).
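The target and its addition to a background can be sketched as follows. This is an illustrative construction, not the MATLAB code used in the experiment. It assumes that "half-height width" means the full width of the window at half height, that the grating is in sine phase relative to the patch center, and that the target is added at the center of the 256 × 256 background; the function names are hypothetical.

```python
import numpy as np

PIXELS_PER_DEG = 120.0

def make_target(size=101, cpd=4.0, half_height_width_deg=0.42):
    """Horizontal sinewave windowed by a radially symmetric raised cosine."""
    half = (size - 1) / 2.0
    y, x = (np.mgrid[:size, :size] - half) / PIXELS_PER_DEG   # degrees
    r = np.hypot(x, y)
    # A raised cosine that reaches zero at r = radius is at half height at
    # r = radius / 2, so its full width at half height equals radius.
    radius = half_height_width_deg
    window = np.where(r < radius, 0.5 * (1.0 + np.cos(np.pi * r / radius)), 0.0)
    grating = np.sin(2 * np.pi * cpd * y)   # horizontal bars: varies along y
    return window * grating

def add_target(background, target, amplitude):
    """Add the scaled target to the center of a larger background."""
    stimulus = background.astype(float)     # astype returns a new array
    top = (background.shape[0] - target.shape[0]) // 2
    left = (background.shape[1] - target.shape[1]) // 2
    stimulus[top:top + target.shape[0],
             left:left + target.shape[1]] += amplitude * target
    return stimulus

target = make_target()
background = np.full((256, 256), 33.0)      # uniform stand-in background
stimulus = add_target(background, target, amplitude=2.0)
print(target.shape, stimulus.shape)
```

Because the sine-phase grating is odd about the patch center and the window is even, the target sums to zero, so adding it leaves the mean luminance of the stimulus unchanged under these assumptions.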
The stimulus presentation and experimental procedure for experiments 1 and 2 were programmed in MATLAB using Psychtoolbox (40, 41).
The experimental protocols for this study were approved by the University of Texas Institutional Review Board, and informed consent forms were obtained from all participants.
All data associated with this paper are accessible at http://natural-scenes.cps.utexas.edu/data.shtml#rv1.
Acknowledgments
This work was supported by National Institutes of Health Grants EY024662 and EY11747.
Footnotes
The authors declare no competing interest.
This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “Brain Produces Mind by Modeling,” held May 1–3, 2019, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA. NAS colloquia began in 1991 and have been published in PNAS since 1995. From February 2001 through May 2019, colloquia were supported by a generous gift from The Dame Jillian and Dr. Arthur M. Sackler Foundation for the Arts, Sciences, & Humanities, in memory of Dame Sackler’s husband, Arthur M. Sackler. The complete program and video recordings of most presentations are available on the NAS website at http://www.nasonline.org/brain-produces-mind-by.
This article is a PNAS Direct Submission.
Data deposition: All data associated with this paper are accessible from the Center for Perceptual Systems, University of Texas at Austin (http://natural-scenes.cps.utexas.edu/data.shtml#rv1).
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1912331117/-/DCSupplemental.
References
- 1. König A., Brodhun E., Experimentelle Untersuchungen über die psycho-physische Fundamentalformel in Bezug auf den Gesichtssinn, Zweite Mittlg S. B., Ed. (Preuss. Akad. Wiss, 1889), p. 641.
- 2. Mueller C. G., Frequency of seeing functions for intensity discrimination of various levels of adapting intensity. J. Gen. Physiol. 34, 463–474 (1951).
- 3. Nachmias J., Sansbury R. V., Letter: Grating contrast: Discrimination may be better than detection. Vision Res. 14, 1039–1042 (1974).
- 4. Legge G. E., Foley J. M., Contrast masking in human vision. J. Opt. Soc. Am. 70, 1458–1471 (1980).
- 5. Burgess A. E., Wagner R. F., Jennings R. J., Barlow H. B., Efficiency of human visual signal discrimination. Science 214, 93–94 (1981).
- 6. Legge G. E., Kersten D., Burgess A. E., Contrast discrimination in noise. J. Opt. Soc. Am. A 4, 391–404 (1987).
- 7. Campbell F. W., Kulikowski J. J., Orientational selectivity of the human visual system. J. Physiol. 187, 437–445 (1966).
- 8. Stromeyer C. F. 3rd, Julesz B., Spatial-frequency masking in vision: Critical bands and spread of masking. J. Opt. Soc. Am. 62, 1221–1232 (1972).
- 9. Wilson H. R., McFarlane D. K., Phillips G. C., Spatial frequency tuning of orientation selective units estimated by oblique masking. Vision Res. 23, 873–882 (1983).
- 10. Watson A. B., Solomon J. A., Model of visual contrast gain control and pattern masking. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 14, 2379–2391 (1997).
- 11. Sebastian S., Abrams J., Geisler W. S., Constrained sampling experiments reveal principles of detection in natural scenes. Proc. Natl. Acad. Sci. U.S.A. 114, E5731–E5740 (2017).
- 12. Najemnik J., Geisler W. S., Optimal eye movement strategies in visual search. Nature 434, 387–391 (2005).
- 13. Bradley C., Abrams J., Geisler W. S., Retina-V1 model of detectability across the visual field. J. Vision 14, 22 (2014).
- 14. Chandler D. M., Gaubatz M. D., Hemami S. S., A patch-based structural masking model with an application to compression. J. Image Video Processing 5, 1–22 (2009).
- 15. Alam M. M., Vilankar K. P., Field D. J., Chandler D. M., Local masking in natural images: A database and analysis. J. Vision 14, 22 (2014).
- 16. Green D. M., Swets J. A., Signal Detection Theory and Psychophysics (Wiley, New York, 1966).
- 17. Geisler W. S., Contributions of ideal observer theory to vision research. Vision Res. 51, 771–781 (2011).
- 18. Dorronsoro C., Walshe R. C., Sebastian S., Geisler W. S., Separable effects of similarity and contrast on detection in natural backgrounds. J. Vis. 18, 747 (2018).
- 19. Hood D. C., Lower-level visual processing and models of light adaptation. Annu. Rev. Psychol. 49, 503–535 (1998).
- 20. Swensson R. G., Judy P. F., Detection of noisy visual targets: Models for the effects of spatial uncertainty and signal-to-noise ratio. Percept. Psychophys. 29, 521–534 (1981).
- 21. Pelli D. G., Uncertainty explains many aspects of visual contrast detection and discrimination. J. Opt. Soc. Am. A 2, 1508–1532 (1985).
- 22. Lu Z. L., Dosher B. A., Characterizing observers using external noise and observer models: Assessing internal representations with external noise. Psychol. Rev. 115, 44–82 (2008).
- 23. Sebastian S., Geisler W. S., Decision-variable correlation. J. Vision 18, 3 (2018).
- 24. Michel M. M., Geisler W. S., Intrinsic position uncertainty explains detection and localization performance in peripheral vision. J. Vision 11, 18 (2011).
- 25. Levi D. M., Crowding–An essential bottleneck for object recognition: A mini-review. Vision Res. 48, 635–654 (2008).
- 26. Papoulis A., Probability, Random Variables, and Stochastic Processes (McGraw-Hill, New York, 1991).
- 27. Myers K. J., Barrett H. H., Borgstrom M. C., Patton D. D., Seeley G. W., Effect of noise correlation on detectability of disk signals in medical imaging. J. Opt. Soc. Am. A 2, 1752–1759 (1985).
- 28. Rolland J. P., Barrett H. H., Effect of random background inhomogeneity on observer detection performance. J. Opt. Soc. Am. A 9, 649–658 (1992).
- 29. Burgess A. E., Li X., Abbey C. K., Visual signal detectability with two noise components: Anomalous masking effects. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 14, 2420–2442 (1997).
- 30. Zhang Y., Abbey C. K., Eckstein M. P., Adaptive detection mechanisms in globally statistically nonstationary-oriented noise. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 23, 1549–1558 (2006).
- 31. Burgess A. E., “Signal detection: A brief history” in The Handbook of Medical Image Perception and Techniques, Samei E., Krupinski E., Eds. (Cambridge University Press, Cambridge, ed. 2, 2018), pp. 28–48.
- 32. Oluk C., Geisler W. S., Effects of target amplitude uncertainty, background contrast uncertainty, and prior probability are predicted by the normalized template-matching observer. J. Vis. 19, 79c (2019).
- 33. Mante V., Bonin V., Frazor R. A., Geisler W. S., Carandini M., Independence of gain control mechanisms in early visual system matches the statistics of natural images. Nat. Neurosci. 8, 1690–1697 (2005).
- 34. Albrecht D. G., Geisler W. S., Motion selectivity and the contrast-response function of simple cells in the visual cortex. Vis. Neurosci. 7, 531–546 (1991).
- 35. Heeger D. J., Normalization of cell responses in cat striate cortex. Vis. Neurosci. 9, 181–197 (1992).
- 36. Cavanaugh J. R., Bair W., Movshon J. A., Selectivity and spatial distribution of signals from the receptive field surround in macaque V1 neurons. J. Neurophysiol. 88, 2547–2556 (2002).
- 37. Coen-Cagli R., Kohn A., Schwartz O., Flexible gating of contextual influences in natural vision. Nat. Neurosci. 18, 1648–1655 (2015).
- 38. Carandini M., Heeger D. J., Normalization as a canonical neural computation. Nat. Rev. Neurosci. 13, 51–62 (2011).
- 39. Walshe R. C., Sebastian S., Geisler W. S., Ideal observer for detection of occluding targets in natural scenes in the fovea and periphery. J. Vis. 18, 629 (2018).
- 40. Brainard D. H., The psychophysics toolbox. Spat. Vis. 10, 433–436 (1997).
- 41. Pelli D. G., The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spat. Vis. 10, 437–442 (1997).