Abstract
We investigate a series of two-alternative forced-choice (2AFC) discrimination tasks based on malignant features of abnormalities in low-dose lung CT scans. Three tasks are evaluated: a size-discrimination task, a boundary-sharpness task, and an irregular-interior task. Target and alternative signal profiles for these tasks are modulated by one of two system transfer functions and embedded in ramp-spectrum noise that has been apodized for noise control in one of four different ways. This gives the resulting images statistical properties that are related to weak ground-glass lesions in axial slices of low-dose lung CT images. We investigate observer performance in these tasks using a combination of statistical efficiency and classification images.
We report results of 24 2AFC experiments involving the three tasks. A staircase procedure is used to find the approximate 80% correct discrimination threshold in each task, with a subsequent set of 2,000 trials at this threshold. These data are used to estimate statistical efficiency with respect to the ideal observer for each task, and to estimate the observer template using the classification-image methodology.
We find efficiency varies between the different tasks with lowest efficiency in the boundary-sharpness task, and highest efficiency in the non-uniform interior task. All three tasks produce clearly visible patterns of positive and negative weighting in the classification images. The spatial frequency plots of classification images show how apodization results in larger weights at higher spatial frequencies.
Keywords: Classification Images, observer performance, low-dose CT, discrimination tasks
1. INTRODUCTION
Our general interest is to better understand how noise and image structure influence observer performance in visual tasks. In this work we consider several discrimination tasks related to lung-cancer screening with low-dose x-ray CT imaging. Discrimination tasks become relevant (compared to detection tasks) when an object is clearly detectable, but the categorization of the object depends on features that may be subtle. For example, an important clinical task in some imaging exams is to assess whether a suspicious abnormality in a prior image has grown in an intervening period. In this case, the abnormality may be clearly visible in both the prior and current images, but the critical feature – a change in size – may be considerably more difficult to determine from the images.
We are investigating discrimination tasks as part of an effort to develop models for assessing the effects of reduced dose in x-ray CT imaging using psychophysical techniques. The methodology combines statistical efficiency [1-3], a measure of how much diagnostic information is being accessed by image readers, with classification images [4-7], which measure spatial weighting used by observers in performing the task. Discrimination tasks have the potential to play an important role in optimizing low-dose CT for lung-cancer screening, where the fundamental task is discriminating malignant features of visible lung tissue from non-malignant features. We are interested in understanding how subtle features of malignancy are accessed in the presence of image noise and limited spatial resolution.
2. METHODS
Our study consists of a set of 24 psychophysical experiments utilizing the two-alternative forced-choice (2AFC) discrimination paradigm. The studies are meant to coarsely simulate the assessment of small and relatively weak ground-glass lesions in the lung.
2.1. Simulated imaging systems
We simulate two imaging systems designed to reconstruct a 35 cm × 35 cm field of view (FOV) with a 512 × 512 array (0.68 mm pixel size). The simulation focuses on a 128 × 128 subregion of the FOV. The two imaging systems, referred to as System 1 and System 2, are defined by different spatial resolution properties. System 1 is modeled with a lower resolution than System 2, which is implemented by a system transfer function that falls off more quickly. The transfer functions of both systems are modeled as a cosine rolloff in radial frequency, with System 1 rolling off to zero at a frequency of 0.59 cyc/mm, and System 2 rolling off to zero at 0.73 cyc/mm (the Nyquist frequency). In addition, both systems employ frequency apodization as a means to control noise. Apodization is implemented by sinc-weighted frequency rolloffs, often called Shepp-Logan filters [8], which multiply the intrinsic system transfer functions. The apodization conditions consist of no apodization (A1), a sinc rolloff with the first zero at 2 times the rolloff frequency of the intrinsic transfer function (A2), a sinc rolloff with the first zero at the rolloff frequency of the intrinsic transfer function (A3), and a sinc rolloff with the first zero at 0.8 times the rolloff frequency of the intrinsic transfer function (A4). Figure 1 (A and B) shows the resulting combined transfer function (intrinsic and apodization) of the systems.
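The combined transfer functions described above can be sketched numerically. This is a minimal illustration, not the authors' implementation: it assumes the cosine rolloff has the raised-cosine form 0.5(1 + cos(πf/f_c)) and that the apodization is a normalized sinc, sin(πx)/(πx), with its first zero at the stated frequency. The function names `combined_mtf` and `freq_at_fraction` are hypothetical.

```python
import numpy as np

def combined_mtf(f, f_cut, sinc_zero=None):
    """Intrinsic cosine rolloff times an optional sinc apodization.

    f         : radial frequency in cyc/mm (scalar or array)
    f_cut     : frequency where the intrinsic rolloff reaches zero
    sinc_zero : first zero of the sinc apodization (None = condition A1)
    The raised-cosine form is an assumption, not taken from the paper.
    """
    f = np.asarray(f, dtype=float)
    intrinsic = np.where(f < f_cut, 0.5 * (1.0 + np.cos(np.pi * f / f_cut)), 0.0)
    if sinc_zero is None:
        return intrinsic
    # np.sinc(x) is the normalized sinc, sin(pi x) / (pi x)
    return intrinsic * np.sinc(f / sinc_zero)

def freq_at_fraction(f_cut, sinc_zero=None, fraction=0.1):
    """Lowest frequency at which the combined MTF drops below `fraction`."""
    f = np.linspace(0.0, f_cut, 20001)
    mtf = combined_mtf(f, f_cut, sinc_zero)
    return f[np.argmax(mtf < fraction)]

# Apodization conditions for System 1 (f_cut = 0.59 cyc/mm)
for name, zero in [("A1", None), ("A2", 2 * 0.59), ("A3", 0.59), ("A4", 0.8 * 0.59)]:
    print(name, round(float(freq_at_fraction(0.59, zero)), 3))
```

Under this assumed form, the unapodized 10% frequencies come out near 0.47 and 0.58 cyc/mm for the two systems, in line with the A1 row of Table 1.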
Figure 1. System properties.

The simulated MTFs for System 1 (A) and System 2 (B) are shown, along with plots of the noise power spectrum (C and D). The legend applies to all plots.
We simulate noise in the systems as a ramp-spectrum Gaussian process out to the Nyquist frequency of the images (0.73 cyc/mm). The ramp spectrum component is identical for both System 1 and System 2, and thus System 2 represents an improved higher-resolution imaging system relative to System 1. In the apodization conditions, noise is attenuated by the frequency rolloff of the various sinc filters. Figure 1 (C and D) shows the combined NPS (ramp and apodization) of the systems in Hounsfield units (HU) times mm². Figure 2 shows sample noise textures for each system and apodization level. Table 1 gives generic measures of resolution and noise for the two systems and the 4 apodization conditions.
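One common way to draw samples of such a process is to filter white Gaussian noise in the Fourier domain. The sketch below assumes that approach; the overall scale is arbitrary (it does not reproduce the HU values in Table 1), and the function name `ramp_noise` and its parameters are hypothetical.

```python
import numpy as np

def ramp_noise(n=128, pixel_mm=0.68, sinc_zero=None, rng=None):
    """Draw one n x n sample of ramp-spectrum Gaussian noise (arbitrary scale).

    sinc_zero : first zero of an assumed sinc apodization in cyc/mm,
                or None for the unapodized condition.
    """
    rng = np.random.default_rng() if rng is None else rng
    fx = np.fft.fftfreq(n, d=pixel_mm)            # cyc/mm along one axis
    fr = np.hypot(fx[:, None], fx[None, :])       # radial frequency grid
    nyquist = 0.5 / pixel_mm
    # Amplitude filter = sqrt(NPS); the ramp NPS is proportional to fr,
    # truncated at the Nyquist frequency (small tolerance for rounding).
    amp = np.where(fr <= nyquist * (1 + 1e-12), np.sqrt(fr), 0.0)
    if sinc_zero is not None:
        amp = amp * np.abs(np.sinc(fr / sinc_zero))
    white = rng.standard_normal((n, n))
    # The filter is real and symmetric, so the filtered field is real.
    return np.real(np.fft.ifft2(np.fft.fft2(white) * amp))

unapodized = ramp_noise(rng=np.random.default_rng(1))
apodized = ramp_noise(sinc_zero=0.59, rng=np.random.default_rng(1))
print(unapodized.std(), apodized.std())
```

Because the apodization only attenuates the amplitude filter, the apodized sample has a smaller pixel standard deviation than the unapodized one drawn from the same white-noise field.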
Figure 2. Noise Textures.

The different levels of simulated apodization lead to different noise textures in the simulated imaging systems. Higher levels of apodization lead to a smoother and less grainy texture.
Table 1. Apodization, resolution, and noise.
The resolution measure is the frequency at which the MTF falls to 10% of its maximum. The noise measure is the pixel standard deviation.
| Apod. | Resolution, Sys. 1 (cyc/mm) | Resolution, Sys. 2 (cyc/mm) | Noise, Sys. 1 (HU) | Noise, Sys. 2 (HU) |
|---|---|---|---|---|
| A1 | 0.47 | 0.58 | 166.5 | 166.5 |
| A2 | 0.45 | 0.56 | 112.9 | 129.9 |
| A3 | 0.39 | 0.49 | 46.5 | 65.0 |
| A4 | 0.34 | 0.43 | 30.2 | 42.2 |
2.2. Discrimination Tasks
This work considers three tasks that are related to the detection of lung cancer with low-dose CT. Task 1 is a size-discrimination task, in which a slightly larger lesion is discriminated from a baseline lesion of 3 mm diameter (FWHM). The task parameter is ΔR, the difference in size between the larger and smaller lesions. Task 2 can be described as an indistinct-boundary discrimination task, in which a 5 mm lesion with a more slowly decaying edge is discriminated from a lesion of similar size with a sharper edge. In this case the task parameter controls the rate of decay at the edge of the lesion. Task 3 can be described as a non-uniform lesion interior discrimination task, in which a 5 mm lesion with variable interior attenuation is discriminated from a lesion with a uniform interior. Target and alternative signal profiles for each task are plotted in Figure 3.
Figure 3. Task Profiles.

Radial plots of the “Malignant” and “Benign” profiles are shown for each of the three tasks considered. In Task 1 (A), the feature of interest is the lesion size. In Task 2 (B), the feature of interest is an indistinct or unsharp boundary. In Task 3 (C), the feature of interest is a nonuniform lesion interior.
One notable feature of these discrimination tasks is that they tend to emphasize higher spatial frequencies than traditional low-contrast detection tasks. Figure 4 shows plots of the Fourier transform of the difference signals (i.e. the difference between “malignant” and “benign” profiles) from Figure 3. The frequency range of the task (i.e. before filtering by a system transfer function) is seen to extend well beyond the Nyquist frequency of the image. Furthermore, since the relevant difference between the two profiles may be somewhat displaced from the origin in the spatial domain, the frequency profiles can be somewhat oscillatory, as seen in the figure. Note that since the task parameters affect the shape of the target, the frequency profile of the difference signal will change somewhat as task parameters increase or decrease.
Figure 4. Task object spectra.

Radial plots of the difference signal spectrum for each task are shown. The Nyquist frequency of the final image is shown for reference.
To make the image stimuli, target and alternative profiles are generated at high resolution, passed through one of the transfer functions plotted in Figure 1, and then downsampled. The result is embedded in a noise field (similar to Figure 2) to produce an image stimulus. The contrast is normalized so that the background has an intensity of −1000 HU, the intensity of air, and the lesions have an intensity of −800 HU, which is consistent with a weak ground-glass lesion [9]. These are scaled for display based on the assumed use of a lung window of 1500 HU and level of −650 HU. Figure 5 shows examples of the “signal” (malignant) and “alternative” (benign) profiles after filtering the object with the system transfer function and downsampling. The figure also shows the difference signal between these two profiles, and the appearance of the profiles when they are embedded in noise.
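The display scaling can be illustrated with the standard window/level mapping, in which HU values are mapped linearly to a gray level between 0 and 1 and clipped outside the window. This is a sketch of that conventional mapping; the function name `window_level` is hypothetical and the paper does not state the exact display code used.

```python
import numpy as np

def window_level(hu, window=1500.0, level=-650.0):
    """Map Hounsfield units to a display gray level in [0, 1]."""
    low = level - window / 2.0                 # darkest displayed HU value
    gray = (np.asarray(hu, dtype=float) - low) / window
    return np.clip(gray, 0.0, 1.0)

# Background (air) and weak ground-glass lesion intensities from the text
print(window_level([-1000.0, -800.0]))
```

With a window of 1500 HU and level of −650 HU, the displayed range runs from −1400 HU to 100 HU, so the −1000 HU background and the −800 HU lesion map to gray levels of roughly 0.27 and 0.40.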
Figure 5. Stimulus Profiles and Images.

The (noiseless) signal and alternative profiles for each of the three tasks are shown, along with the difference signal and sample image patches from each class (System 1, Apodization Level 3). Each task parameter has been exaggerated for the purpose of display in this figure. The image patches represent 21.2mm of a 350mm simulated field of view. All the images have a window of 1500 HU and level of −650 HU except for the difference images which are scaled to the max difference value.
2.3. Psychophysical procedure
The image simulation procedure is used to generate stimuli for two-alternative forced-choice experiments, where a stimulus from the malignant class and an independent stimulus from the benign class are displayed side-by-side on each trial. The position of the two images (right or left) is randomized on each trial, and the reader is asked to indicate the alternative corresponding to a malignant image using a mouse click.
Each experiment begins with 6 runs through a staircase training procedure that familiarizes the readers with the task and allows us to estimate (from the last 5 runs) the task-parameter value needed to achieve approximately 80% correct responses. Subsequently, 2,000 trials (40 sessions of 50 trials) are run at this parameter value and used to estimate the primary endpoints of the study.
2.4. Task performance analysis
The proportion of correct responses (PC) is the natural measure of task performance for 2AFC experiments. However, in this case the task difficulty is adjusted based on the training data to achieve a PC of approximately 80%. As a result, we use both PC and an image signal-to-noise ratio (SNR) measure to characterize observer performance. These measures are also used to estimate the efficiency of readers with respect to the ideal observer.
Let $\mathbf{s}_0$ be a column vector that represents the benign lesion profile array (2nd column in Figure 5), and let $\mathbf{s}(\theta)$ represent the malignant lesion array (1st column in Figure 5). The malignant lesion array is a function of the task parameter, $\theta$. The malignant and benign signal profiles are embedded in a noise field represented by the column vector, $\mathbf{n}$. The noise is generated as a zero-mean stationary Gaussian random process with one of the power spectra shown in Figure 1, which we will represent as a noise covariance matrix, $\boldsymbol{\Sigma}$. We can define the SNR in terms of the image statistics [10] as

$$\mathrm{SNR}(\theta) = \sqrt{\left(\mathbf{s}(\theta) - \mathbf{s}_0\right)^{T} \boldsymbol{\Sigma}^{-1} \left(\mathbf{s}(\theta) - \mathbf{s}_0\right)}, \tag{2.1}$$
which can also be recognized as the detectability of the Hotelling observer (the Ideal Observer in this case) or as the Mahalanobis distance between the malignant and benign profiles.
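As a concrete illustration of this quantity, the sketch below computes the Mahalanobis distance between two signal profiles for a given noise covariance matrix. The function name `hotelling_snr` and the toy numbers are hypothetical; for image-sized problems with stationary noise, the same quantity would normally be evaluated in the Fourier domain rather than with an explicit matrix.

```python
import numpy as np

def hotelling_snr(s_mal, s_ben, cov):
    """Mahalanobis distance between malignant and benign profiles."""
    ds = np.asarray(s_mal, dtype=float).ravel() - np.asarray(s_ben, dtype=float).ravel()
    # Solve cov @ x = ds instead of forming the inverse explicitly
    return float(np.sqrt(ds @ np.linalg.solve(cov, ds)))

# Toy example: two pixels, white noise with variance 4
cov = 4.0 * np.eye(2)
print(hotelling_snr([3.0, 4.0], [0.0, 0.0], cov))  # sqrt(25 / 4) = 2.5
```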
Ideally, we would like to know the threshold SNR, which we define as the image SNR when the reader achieves 80% correct in the task. As we will see, the actual observed values in the task are generally different from 80% correct, although the training procedure generally gets the values within ±10%. As a result, we need a way to adjust the SNR for a PC that is modestly greater or less than 80%. We do this using the standard relation [11] between detectability and PC in 2AFC experiments,

$$d_{PC} = \sqrt{2}\,\Phi^{-1}(PC), \tag{2.2}$$

where $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function.
The resulting “threshold” SNR is given by
$$\mathrm{SNR}_{\mathrm{Thresh}} = \mathrm{SNR}(\theta)\,\frac{d_{80\%}}{d_{PC}}. \tag{2.3}$$
Note that if PC is higher than 80%, then the ratio d80% / dPC will be less than 1, and the SNR value will be adjusted downward. For example, if the observed PC is 85%, then the SNR is multiplied by 0.812. The opposite occurs if PC < 80%. The threshold allows us to evaluate overall performance in the task, and to make comparisons between tasks.
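The adjustment can be checked numerically. The sketch below assumes the usual 2AFC relation between detectability and proportion correct, d = √2 Φ⁻¹(PC), and rescales an image SNR to the 80% threshold; the function names `d_from_pc` and `threshold_snr` are hypothetical.

```python
from statistics import NormalDist

def d_from_pc(pc):
    """Detectability implied by a 2AFC proportion correct: sqrt(2) * Phi^{-1}(PC)."""
    return 2 ** 0.5 * NormalDist().inv_cdf(pc)

def threshold_snr(image_snr, observed_pc, target_pc=0.80):
    """Rescale the image SNR to the level expected to give target_pc."""
    return image_snr * d_from_pc(target_pc) / d_from_pc(observed_pc)

# Multiplier applied to the SNR when the observed PC is 85%
print(round(threshold_snr(1.0, 0.85), 3))  # 0.812
```

An observed PC below 80% gives a multiplier above 1, so the threshold SNR is adjusted upward in that case.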
We also compute task efficiency with respect to the Ideal Observer, which is considered a measure of the fraction of task-relevant information that is being accessed by an observer. This is defined as
$$\eta = \left(\frac{d_{80\%}}{\mathrm{SNR}_{\mathrm{Thresh}}}\right)^{2}. \tag{2.4}$$
For all performance measures (PC, SNRThresh, and η), bootstrapping across the 40 sessions in each condition is used to get error bars.
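A session-level bootstrap of this kind can be sketched as follows. The example assumes that efficiency is the squared ratio of the reader's detectability (from the 2AFC proportion correct) to the image SNR, which is the usual definition, and that sessions are resampled with replacement. The function names, the simulated responses, and the image SNR of 4.0 are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def d_from_pc(pc):
    """2AFC detectability implied by a proportion correct."""
    return 2 ** 0.5 * NormalDist().inv_cdf(pc)

def efficiency(pc, image_snr):
    """Squared ratio of reader detectability to ideal-observer SNR."""
    return (d_from_pc(pc) / image_snr) ** 2

def bootstrap_efficiency(session_correct, image_snr, n_boot=1000, seed=0):
    """Resample whole sessions with replacement; return mean and std of efficiency.

    session_correct : array of 0/1 scores with shape (sessions, trials)
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(session_correct, dtype=float)
    n_sessions = scores.shape[0]
    etas = np.empty(n_boot)
    for b in range(n_boot):
        pick = rng.integers(0, n_sessions, size=n_sessions)
        etas[b] = efficiency(float(scores[pick].mean()), image_snr)
    return etas.mean(), etas.std(ddof=1)

# Simulated reader: 40 sessions of 50 trials at about 80% correct
sim = np.random.default_rng(42).random((40, 50)) < 0.8
eta_mean, eta_err = bootstrap_efficiency(sim, image_snr=4.0)
print(eta_mean, eta_err)
```

Resampling whole sessions rather than individual trials keeps any within-session dependence intact in each bootstrap replicate.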
2.5. Classification-image analysis
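The classification-image methodology is based on published methodology for 2AFC studies [5,7,12]. For each condition ($c = 1,\dots,24$), reader ($r = 1,\dots,3$), and trial ($j = 1,\dots,2000$) an outcome (or score) variable is defined, $O_{c,r,j}$, that is zero if the reader responds incorrectly and 1 if the reader responds correctly in the trial. The score is used to generate a weighted noise field,

$$\mathbf{q}_{c,r,j} = \left(O_{c,r,j} - PC_{c,r}\right)\left(\boldsymbol{\Sigma}_c + \varepsilon^{2}\mathbf{I}\right)^{-1}\left(\mathbf{n}^{M}_{c,r,j} - \mathbf{n}^{B}_{c,r,j}\right), \tag{2.5}$$

where $PC_{c,r}$ is the estimated proportion correct for reader $r$ in condition $c$, $\boldsymbol{\Sigma}_c + \varepsilon^{2}\mathbf{I}$ is the noise covariance matrix for condition $c$ with a small term added to the diagonal to control for instability in taking the inverse ($\varepsilon^{2} = 0.001$), and $\mathbf{n}^{M}_{c,r,j}$ and $\mathbf{n}^{B}_{c,r,j}$ are the malignant and benign noise fields for trial $j$, respectively. The raw classification image for experimental condition $c$ is estimated as

$$\hat{\mathbf{w}}_c = \frac{1}{3 \cdot 2000}\sum_{r=1}^{3}\sum_{j=1}^{2000}\mathbf{q}_{c,r,j}. \tag{2.6}$$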
Even though the classification image estimate involves an average over 2000 trials, the result can still be quite noisy, particularly for conditions with more apodization. As a result, we apply a 4th-order Butterworth spatial window of 10mm radius to the classification images followed by another Butterworth filter with a radius of 0.4 cyc/mm in the frequency domain. We also employ radial averages of the classification image in the frequency domain. In this case only a spatial window is applied.
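The estimation step can be illustrated on a small simulated problem. The sketch below assumes a score-weighted average of covariance-whitened noise-field differences, in the spirit of the 2AFC estimators cited above, and omits the Butterworth windowing. It uses a hypothetical 16-pixel one-dimensional "image", a hypothetical linear observer, and a made-up covariance, so it is an illustration of the technique rather than the authors' code.

```python
import numpy as np

def classification_image(scores, noise_mal, noise_ben, cov, eps2=1e-3):
    """Score-weighted average of whitened noise differences.

    scores    : 0/1 outcome per trial, shape (trials,)
    noise_mal : noise fields shown with the malignant profile, (trials, pixels)
    noise_ben : noise fields shown with the benign profile, (trials, pixels)
    cov       : noise covariance matrix, (pixels, pixels)
    eps2      : small diagonal term that stabilizes the inverse
    """
    scores = np.asarray(scores, dtype=float)
    pc = scores.mean()
    reg_cov = cov + eps2 * np.eye(cov.shape[0])
    delta_n = noise_mal - noise_ben
    whitened = np.linalg.solve(reg_cov, delta_n.T).T
    return ((scores - pc)[:, None] * whitened).mean(axis=0)

# Hypothetical simulation: a linear observer with a center-surround template
rng = np.random.default_rng(0)
n_pix, n_trials = 16, 2000
idx = np.arange(n_pix)
cov = 0.5 ** np.abs(idx[:, None] - idx[None, :])       # made-up correlated noise
template = (np.exp(-0.5 * ((idx - 7.5) / 2.0) ** 2)
            - 0.5 * np.exp(-0.5 * ((idx - 7.5) / 4.0) ** 2))
delta_s = 1.5 * template                                # made-up difference signal
chol = np.linalg.cholesky(cov)
noise_mal = rng.standard_normal((n_trials, n_pix)) @ chol.T
noise_ben = rng.standard_normal((n_trials, n_pix)) @ chol.T
# The observer is correct when its template response favors the malignant image
decision = (delta_s + noise_mal - noise_ben) @ template
scores = (decision > 0).astype(float)

cimg = classification_image(scores, noise_mal, noise_ben, cov)
corr = np.corrcoef(cimg, template)[0, 1]
print(scores.mean(), corr)
```

For a linear observer in Gaussian noise, the estimate is expected to be proportional to the observer's template, so the correlation between `cimg` and `template` gives a quick sanity check on the procedure.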
3. RESULTS AND DISCUSSION
At the time of this writing, the discrimination experiments are ongoing, with complete data from three readers reported in this proceedings paper. In this section we describe the performance of readers in the tasks, show classification images derived from average reader responses, and present plots of the spatial frequency content of the classification images.
3.1. Task performance
Measures of task performance are plotted for each reader in Figure 6. The observed proportion correct (PC) for each condition (top plot) shows that the targeted value of 80% was not always achieved, with some bias towards higher values across the conditions. This likely reflects some additional learning by readers over the course of the 2,000 experimental trials that followed the initial threshold estimate. The values are all within ±10% of the targeted level, although it should be noted that reader S1 repeated 3 conditions due to high performance (> 90%). In all three cases, repeating the condition, including the training runs, resulted in an acceptable observed PC.
Figure 6. Characterization of performance.

The plots show performance for each of the three tasks, both imaging systems, and the four levels of apodization (A1 – A4). The PC plot (top) shows that the training process does not necessarily result in an observed PC of 80%. The thresholds and efficiency plots show variability between the tasks and some evidence of better performance with increasing apodization.
The middle panel of Figure 6 shows the threshold SNR for each condition, corrected from the observed PC to 80% correct. This can be thought of as the primary estimate of task performance, with lower values indicating better performance (i.e. 80% correct achieved with lower image SNR implies better performance). Threshold SNR values range from just under 2.5 (best performance) to over 6 (worst). It is clear that Task 2 has generally higher threshold SNR, suggesting that readers are having relatively more difficulty performing the boundary distinctness discrimination task. Within a given task and imaging system, there is some evidence of a downward trend in SNR with increasing apodization (from A1 to A4). This indicates that performance seems to be improving with more apodization.
The bottom panel of Figure 6 shows the readers’ efficiency in each condition. Average efficiency in each task is 12.7%, 7.4%, and 16.2% for Tasks 1-3 respectively. This is relatively low in comparison to other tasks such as low-contrast detection [1], where efficiency has often been found to be around 50%, or localization tasks [13], where efficiency has been found in some conditions to be more than 70%. However, low efficiency in discrimination tasks is consistent with some previous findings [12,14]. Low overall efficiency is also consistent with the use of aids in performing visual discrimination tasks, such as digital calipers for measuring lesion size, or radiomics features for evaluating lesion texture.
There also appears to be a mild increase in reader efficiency with increasing levels of apodization. This indicates that the improvement in threshold SNR is due to more effective use of the images by the readers, and not to any improvement in the information content of the images (ideal observer performance is insensitive to apodization).
3.2. Classification images
Average classification images for each task are shown in Figure 7 after spatial windowing and smoothing with 4th-order Butterworth filters. The spatial window has a half-max of 10mm, and this is followed by smoothing with a frequency window that has a half-max of 0.4 cyc/mm. The frequency filtration procedure also zeros the imaginary component of the FT, thereby enforcing symmetry about the midpoint of the classification image.
Figure 7. Classification Images.

The average classification image across readers is shown for each of the 24 experimental conditions (3 Tasks, 2 Systems, and 4 Apodizations). These display patches (21.2 mm) have been spatially windowed to a radius of 10 mm (HWHM), and frequency windowed to 0.4 cyc/mm.
The classification images generally show clear regions of facilitation (positive weighting) and inhibition (negative weighting). The conditions with greater levels of apodization (right side of Figure 7) appear to have more variability than the others. This is a consequence of instability due to the inverse covariance matrix applied in the classification estimation procedure. There are substantial differences between the classification images for the different tasks. In Task 1, the classification images are generally positively weighted near the lesion boundary, with a mild negative region outside of this area. The central region of the classification image appears to be suppressed at higher levels of apodization. In Task 2, there is a pronounced negatively weighted central region with a positive surround, and this pattern persists in Task 3.
3.3. Classification image spectra
The radial frequency plots in Figure 8 allow us to focus on subtler differences in the classification images. The plots show the radial average of the real component of the spatially-windowed classification image Fourier Transform (the imaginary part integrates to zero) using a frequency bin with a width of 1/(NΔPix). The x-axis of these plots is the average radial frequency in each bin, and the y-axis is the average classification weight in the bins.
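A radial average of this kind can be sketched as below. The example assumes that N is the image width in pixels and ΔPix is the pixel size, that bins are centered on integer multiples of 1/(NΔPix), and that the classification image is centered in the array; these choices and the function name `radial_spectrum` are assumptions, since the text does not specify them.

```python
import numpy as np

def radial_spectrum(cimg, pixel_mm=0.68):
    """Radially averaged real part of the 2-D Fourier transform.

    Returns the mean radial frequency (cyc/mm) and the mean real-valued
    weight in each bin of width 1 / (N * pixel_mm).
    """
    n = cimg.shape[0]
    # Move the image center to the array origin before transforming
    ft = np.fft.fft2(np.fft.ifftshift(cimg))
    fx = np.fft.fftfreq(n, d=pixel_mm)
    fr = np.hypot(fx[:, None], fx[None, :])
    bin_width = 1.0 / (n * pixel_mm)
    bins = np.rint(fr / bin_width).astype(int).ravel()
    counts = np.bincount(bins)
    valid = counts > 0                          # skip any empty bins
    mean_f = np.bincount(bins, weights=fr.ravel())[valid] / counts[valid]
    mean_w = np.bincount(bins, weights=ft.real.ravel())[valid] / counts[valid]
    return mean_f, mean_w

# Sanity check: a centered impulse has a flat spectrum equal to 1
impulse = np.zeros((32, 32))
impulse[16, 16] = 1.0
freqs, weights = radial_spectrum(impulse)
print(freqs[:3], weights[:3])
```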
Figure 8. Classification Image Spectra.

Average spatial-frequency weights of the classification images are plotted as a function of the average radial frequency. Each plot shows the average radial weights associated with the 4 levels of apodization (A1-A4) in the simulated imaging systems. The legend in the upper left applies to all plots.
Each graph in Figure 8 shows all 4 apodization levels, which allows for an assessment of how apodization impacts the way observers access information in noisy images. To a greater or lesser extent, all of these plots show that as the amount of apodization is increased, spatial-frequency weights increase at higher spatial frequencies (> 0.1 cyc/mm). This finding means that readers are effectively up-weighting higher spatial frequencies, which suggests that to some extent they are adapting to counteract the apodization process. Furthermore, this process results in more efficient reading according to the efficiency results in Figure 6. Further analysis will be needed to understand the nature of these changes in spatial weighting.
4. SUMMARY AND CONCLUSIONS
This proceedings paper describes our initial results using efficiency and classification images in discrimination tasks related to the use of CT in lung-cancer screening. We report results of psychophysical studies on simulated images generated as Gaussian stochastic processes with a variety of resolution and noise properties. We propose 3 tasks (size discrimination, margin distinctness, lesion uniformity) that are based on features related to lung-cancer screening, and we evaluate them in conditions that simulate different intrinsic system resolution and different degrees of apodization for noise control.
The advantage of the simulation/Gaussian-texture approach is that we can compute task efficiency [2,3,15], and the classification-image technique is known to be an unbiased estimate of a linear template profile [5,16,17]. Task efficiency allows us to evaluate how much diagnostic information in the images is being accessed by the readers, and the classification images tell us how the image is being used to access the information.
The preliminary results of the experiments, using a total of three readers, suggest that overall efficiency is relatively low and dependent on the task. The efficiency results show that the readers tend to improve when an image has been more heavily apodized. Averaging results across readers produces a readily observable classification image in all conditions, although there are some visible effects of noise instability in the most apodized conditions. It is clear that there are very different patterns of spatial weighting being used for the different tasks, and some visible differences across apodization levels. Further analysis is needed to understand how these differences in spatial weighting relate to mechanisms of discrimination in low-dose CT of the lungs.
ACKNOWLEDGEMENTS
This work was supported by the NIH through research grants (R01-EB025829, R01 EB018958, and R01 EB026427). The content of this proceedings paper is solely the responsibility of the authors and does not represent the institutional views of any funding agency.
REFERENCES
- 1. Burgess AE, Wagner RF, Jennings RJ & Barlow HB Efficiency of human visual signal discrimination. Science 214, 93–94 (1981).
- 2. Burgess A Image quality, the ideal observer, and human performance of radiologic decision tasks. Acad Radiol 2, 522–526 (1995).
- 3. Barrett HH, Abbey CK & Clarkson E Objective assessment of image quality. III. ROC metrics, ideal observers, and likelihood-generating functions. J Opt Soc Am A Opt Image Sci Vis 15, 1520–1535 (1998).
- 4. Ahumada AJ Perceptual classification images from Vernier acuity masked by noise. Perception 26, 18 (1996).
- 5. Abbey CK & Eckstein MP Classification image analysis: estimation and statistical inference for two-alternative forced-choice experiments. J Vis 2, 66–78 (2002).
- 6. Eckstein MP & Ahumada AJ Classification images: a tool to analyze visual strategies. J Vis 2, 1x (2002).
- 7. Murray RF Classification images: a review. J Vis 11, 1–25 (2011).
- 8. Shepp LA & Logan BF The Fourier reconstruction of a head section. IEEE Trans Nucl Sci 21, 21–43 (1974).
- 9. Hiramatsu M, et al. Pulmonary ground-glass opacity (GGO) lesions—large size and a history of lung cancer are risk factors for growth. Journal of Thoracic Oncology 3, 1245–1250 (2008).
- 10. Myers KJ, Rolland JP, Barrett HH & Wagner RF Aperture optimization for emission imaging: effect of a spatially varying background. J Opt Soc Am A 7, 1279–1293 (1990).
- 11. Abbey CK & Bochud FO Modeling Visual Detection Tasks in Correlated Image Noise with Linear Model Observers in Handbook of Medical Imaging, Vol. 1 (eds. Beutel J, Kundel HL & VanMetter RL) 630–651 (2000).
- 12. Abbey CK & Eckstein MP Classification images for detection, contrast discrimination, and identification tasks with a common ideal observer. J Vis 6, 335–355 (2006).
- 13. Abbey CK, et al. Classification images for localization performance in ramp-spectrum noise. Medical Physics 45, 1970–1984 (2018).
- 14. Legge GE, Kersten D & Burgess AE Contrast discrimination in noise. J Opt Soc Am A 4, 391–404 (1987).
- 15. Abbey CK & Barrett HH Human- and model-observer performance in ramp-spectrum noise: effects of regularization and object variability. J Opt Soc Am A Opt Image Sci Vis 18, 473–488 (2001).
- 16. Ahumada AJ Classification image weights and internal noise level estimation. J Vis 2, 121–131 (2002).
- 17. Abbey CK & Eckstein MP Optimal shifted estimates of human-observer templates in two-alternative forced-choice experiments. IEEE Transactions on Medical Imaging 21, 429–440 (2002).
