Abstract
Purpose
Conventional metrics used for assessing digital mammography (DM) and digital breast tomosynthesis (DBT) image quality, including noise, spatial resolution, and detective quantum efficiency, do not necessarily predict how well the system will perform in a clinical task. A number of existing phantom-based methods have their own limitations, such as unrealistic uniform backgrounds, subjective scoring using humans, and regular signal patterns unrepresentative of common clinical findings. We attempted to address this problem with a realistic breast phantom with random hydroxyapatite microcalcifications and semi-automated deep learning-based image scoring. Our goal was to develop a methodology for objective task-based assessment of image quality for tomosynthesis and DM systems, which includes an anthropomorphic phantom, a detection task (microcalcification clusters), and automated performance evaluation using a convolutional neural network.
Approach
Experimental 2D and pseudo-3D mammograms of an anthropomorphic inkjet-printed breast phantom with inserted microcalcification clusters were collected on clinical mammography systems to train a signal-present/signal-absent image classifier based on the Resnet-18 architecture. In a separate validation study using simulations, this Resnet-18 classifier was shown to approach the performance of an ideal observer. Microcalcification detection performance was evaluated as a function of dose at four exposure levels using receiver operating characteristic (ROC) analysis [i.e., area under the ROC curve (AUC)]. To demonstrate the use of this evaluation approach for assessing different technologies, the method was applied to two different mammography systems, as well as to mammograms with re-binned pixels emulating a lower-resolution X-ray detector.
Results
Microcalcification detectability, as assessed by the deep learning classifier, was observed to vary with the exposure incident on the breast phantom for both DM and tomosynthesis. At full dose, experimental AUC was 0.96 (for DM) and 0.95 (for DBT), whereas at half dose, it dropped to 0.85 and 0.71, respectively. AUC performance on DM was significantly decreased with a larger effective pixel size obtained by re-binning. The task-based assessment approach also showed the superiority of a newer mammography system compared with an older system.
Conclusions
An objective task-based methodology for assessing the image quality of mammography and tomosynthesis systems is proposed. Possible uses for this tool could be quality control, acceptance, and constancy testing, assessing the safety and effectiveness of new technology for regulatory submissions, and system optimization. The results from this study showed that the proposed evaluation method using a deep learning model observer can track differences in microcalcification signal detectability with varied exposure conditions.
Keywords: digital mammography, breast tomosynthesis, anthropomorphic breast phantom, image quality evaluation
1. Introduction
Over the past 30 years, digital mammography (DM) and digital breast tomosynthesis (DBT) have been tremendously successful in improving the detection and diagnosis of breast cancer. Advancement in these technologies is continuing with improvements in both hardware and software, with the hope of achieving further enhancements in performance. To assure that these technologies are performing safely and effectively, it is important to have testing and evaluation techniques that can objectively assess the image quality of these breast imaging devices using physical phantom studies. There are many possible uses for these evaluation techniques, including (1) to optimize design parameters during the system development phase, (2) to assure safety and effectiveness in regulatory submissions, (3) to optimize system acquisition operational parameters, and (4) for quality control and acceptance testing of clinical systems to assure that breast imaging systems are operating as designed.
Historically, conventional metrics of spatial resolution, noise, and detective quantum efficiency1 have been used to evaluate image quality. However, these metrics alone, or even in combination, do not necessarily inform the tester on how well the imaging system will perform in clinical-like tasks. One problem in particular with these metrics is that they assume a linear, shift-invariant imaging system. Many breast imaging systems are non-linear, for example, modern DBT systems that use non-linear reconstruction methods and sophisticated image processing techniques in “FOR PRESENTATION” data. In addition, some of the latest DM/DBT systems incorporate deep-learning algorithms, further contributing to this non-linearity.
One approach for assessing the image quality of breast imaging systems that is commonly used throughout the United States is the American College of Radiology, Mammography Accreditation Phantom (ACR-MAP).2 The ACR-MAP phantom consists of three types of signals (masses, specks, and fibers) of various sizes, embedded into a wax insert, which itself is inserted into a uniform polymethyl methacrylate (PMMA) block. The phantom is imaged, and the observer records how many signals can be visualized. Although this approach is simple, it offers only a subjective assessment of image quality. Furthermore, the signals are inserted into a homogeneous background. Although breast phantoms with a homogeneous uniform background might be adequate for quality control (QC) constancy checks to monitor system performance, it is questionable whether these phantoms can adequately predict the diagnostic performance of (pseudo)tomographic breast imaging systems such as DBT and breast CT.3,4
A number of approaches for task-based assessment of breast imaging systems using phantoms with anthropomorphic or structured backgrounds have been previously proposed. Cockmartin et al.5 described one unique approach using a structured PMMA phantom consisting of various size PMMA spheres suspended in water. Three-dimensional printed spiculated and non-spiculated masses, as well as microcalcifications, were inserted into the phantom as signals to be detected. Analysis of this phantom was conducted using the four-alternative forced-choice (4-AFC) paradigm with human observers. Later work by this group used channelized Hotelling model observers trained with deep learning neural networks to analyze detectability with this phantom.6 Balta et al.7 used an evaluation framework consisting of a 3D-printed anthropomorphic breast phantom modeled after real patient breast CT data with inserted gold disks to mimic microcalcifications. Detectability was evaluated using a non-pre-whitening model observer with an eye model, as well as human observers using a 2-AFC study. Another task-based assessment approach that has been described uses the CDMAM phantom to evaluate image quality based on threshold contrast readings of small microcalcification-like objects. Although the CDMAM phantom was designed for mammography and has a uniform background, it has been sandwiched between slabs of material with a structured background emulating adipose and fibroglandular tissue for testing DBT systems using 4-AFC analysis.8
Although these previous approaches for task-based assessment of breast imaging systems are useful, they have some limitations. The approach used by Cockmartin et al. uses a phantom with a structured background, but it is not anthropomorphic. The approach described by Balta et al. uses an anthropomorphic phantom modeled from patient breast CT data; however, the phantom does not portray finer structures in the breast due to the limited spatial resolution of breast CT. Other limitations with previous approaches include the use of phantom materials that might not accurately represent X-ray attenuation through breast tissue.
This paper discusses another approach for task-based, objective assessment of image quality, where the task is the detection of small calcium hydroxyapatite microcalcification clusters embedded into an inkjet-printed anthropomorphic breast phantom. We explore the feasibility of automating the analysis of phantom images with the use of a Resnet-18 deep learning model observer (DLMO) that is trained to assess the detectability of small microcalcification clusters. Examples of how the phantom and accompanying analysis can be used to assess the image quality of a commercial mammography/DBT system are given.
Model observers have been employed for objectively assessing image quality and are typically designed to predict either human observer performance or the ideal observer (IO) performance that employs complete task-relevant information. Both of these approaches have challenges. It is well known that human performance can vary greatly between observers, resulting in a relatively large uncertainty in technology performance evaluation.9 The Bayesian IO is a desirable approach for evaluating new imaging technology because it provides an upper bound on the performance that can be achieved with the particular system of interest. Unfortunately, the IO for a clinical-like task performed with breast images is a non-linear function of the image data and, except in simplified cases, is mathematically intractable. However, Zhou et al.10 recently provided a tractable approach to approximating the IO performance by maximizing the posterior function using a convolutional neural network (CNN). The drawback of this approach is that the IO CNN requires a very large amount of training data. In this work, we have used a Monte Carlo simulated dataset of microcalcifications in an anthropomorphic breast background to validate and compare the performance of the Resnet-18 model observer to the approximate IO-CNN of Zhou et al.10
2. Methods
2.1. Anthropomorphic Breast Phantom with Microcalcification Cluster Signals
A previously described inkjet-printed anthropomorphic breast phantom11 was used in the experiments. The phantom consists of a stack of 571 parchment paper sheets sandwiched between two 6-mm-thick PMMA plates. Each sheet represents a single slice of a digital breast model, as described by Graff,12 with the entire stack modeling a 40-mm-thick compressed breast. As was found with an experimental prototype, parchment paper approximates X-ray attenuation of breast adipose tissue reasonably well. Fibroglandular regions of the breast were realized by printing with an iodine-doped ink, with an appropriate concentration of iodine. The breast model used in our phantom mimics a heterogeneously dense breast with 30% fibroglandular tissue composition.
For microcalcification signals, we manufactured four X-ray transparent plastic envelopes (hereafter “inserts”), each containing an array of 60 microcalcification clusters and a set of fiducial markers for the region of interest (ROI) extraction. Each cluster occupies an area and contains 5 to 15 hydroxyapatite calcifications of random shapes and sizes varying between 150 and 180 μm. This size range was selected to provide a reasonably challenging detection task in observer experiments. It is worth noting that calcifications of sizes less than are typically not visualized in clinical mammograms. On the other hand, calcifications larger than are fairly easy to detect by the human eye. Further details on the phantom and microcalcification cluster design and implementation can be found in Ikejimba et al.11 and Ghammraoui et al.13
A signal insert was placed in the middle of the phantom paper stack for measurements. The calcification insert was positioned at a single (fixed) vertical location within the phantom. Figure 1 illustrates the placement of the phantom on the breast support and an insert with microcalcification clusters.
Fig. 1.
Paper breast phantom (a) and plastic envelope (b) with hydroxyapatite microcalcification clusters.
2.2. Image Acquisition
Microcalcification conspicuity in screening mammography varies with X-ray dose. Thus, signal detectability is expected to change across the range of exposure settings. Accordingly, our experimental plan was devised as follows. DM and DBT “combo” acquisitions were carried out on a Hologic Selenia Dimensions system in our laboratory. First, the optimal imaging techniques using the system’s automatic exposure control (AEC Auto-kV) were determined for both modalities. After that, phantom images were acquired at four exposure levels, with the filter and tube voltage fixed at the settings found optimal by AEC and the tube current-exposure time product varied manually. A summary of the techniques used is listed in Table 1.
Table 1.
Parameters used in the Hologic Selenia Dimensions system.
| Modality | Spectrum | kVp | mAs (four exposure levels) | Pixel/voxel size (μm) |
|---|---|---|---|---|
| DM | W-Rh | 31 | 60, 100, 125, 150 | 70 |
| DBT | W-Al | 33 | 25, 35, 50, 65 | 117 |
A single breast phantom and four signal inserts (240 unique microcalcification clusters), as described above, were used in the experiments. To produce sufficient training data for the DLMO, each of the four signal inserts was scanned multiple times, using different dose levels, shifted positions with respect to the phantom background, and vertical and horizontal flips, to obtain more cluster/background combinations. All in all, 144 combo scans of the phantom were performed for each modality. Ideally, this would have produced regions of interest with calcifications. However, due to the circumstances discussed below, the useful yield was lower, resulting in unique (DM) and 5600 (DBT) signal-present ROI images.
The data acquisition protocol for the comparison experiment (older 2008 Lorad Selenia versus 2014 Hologic Selenia Dimensions) was similar to what was used for all other measurements, i.e., determine the optimal exposure technique by scanning the phantom in automatic mode (AEC Auto-Filter), fix the target/filter combination and tube voltage to what AEC found optimal, and collect mammograms at four dose settings by varying the tube current-exposure time product manually. Table 2 lists these parameters. Both systems reported approximately equal estimated organ dose (1.44 mGy [Lorad] and 1.42 mGy [Hologic]) and entrance dose (7.85 and 7.15 mGy) using AEC settings.
Table 2.
Parameters used in the Lorad Selenia system.
| Modality | Spectrum | kVp | mAs (four exposure levels) |
|---|---|---|---|
| DM | Mo-Rh | 32 | 35, 50, 70, 90 |
2.3. Data Preparation
Signal-present (SP) and signal-absent (SA) ROIs or volumes of interest (VOIs) were extracted from mammograms (central slices of DBT reconstruction volumes) using fiducial markers. Figure 2(a) illustrates the process. ROI sizes were and for DM and DBT, respectively, with an in-plane size equivalent to and a z-extent of 3 mm for DBT. Extracted image crops (three-slice stacks for DBT) were saved as 16-bit grayscale TIFF files for use with the deep learning–based observer.
Fig. 2.
X-ray image of the phantom with microcalcifications insert and examples of signal-present ROIs. (a) Fiducial markers around the perimeter were used to determine cluster centers (indicated with square bright dots) and background area centers (small dots). (b) Acceptable (top) and unusable (bottom) ROIs with prominent spurious signals.
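For concreteness, below is a minimal sketch of this extraction step in Python (NumPy + tifffile). The marker-derived cluster coordinates, crop size, and file names are hypothetical placeholders, not the values or tooling used in the study.

```python
# Sketch: crop square ROIs around known cluster/background centers and save
# them as 16-bit grayscale TIFFs, as described in Sec. 2.3. Coordinates and
# crop size below are illustrative placeholders.
import numpy as np
import tifffile

def extract_rois(image: np.ndarray, centers, half_size: int, out_prefix: str):
    """Crop (2*half_size)-square ROIs centered on (row, col) points; save as 16-bit TIFF."""
    rois = []
    for i, (r, c) in enumerate(centers):
        roi = image[r - half_size:r + half_size, c - half_size:c + half_size]
        if roi.shape != (2 * half_size, 2 * half_size):
            continue  # skip crops that fall outside the image
        roi = roi.astype(np.uint16)
        tifffile.imwrite(f"{out_prefix}_{i:03d}.tif", roi)
        rois.append(roi)
    return rois

# Hypothetical usage: cluster centers interpolated from the fiducial markers
# along the insert perimeter (signal-present), plus offset background centers
# (signal-absent) from the same image.
dm_image = tifffile.imread("phantom_dm_for_processing.tif")
sp_centers = [(812, 640), (812, 910), (1080, 640)]        # placeholder values
sp_rois = extract_rois(dm_image, sp_centers, half_size=128, out_prefix="SP")
```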
Due to imperfect hand manufacturing of our microcalcification inserts (described in Sec. 4), a non-negligible number of signal-present ROIs were inadvertently contaminated with large spurious signals, unacceptable for deep learning model training and testing. Examples of good and bad DM ROIs are shown in Fig. 2(b). For that reason, all images used for the DM and DBT observers were inspected and screened by a human. In addition, ROIs fully or partially containing outside-of-the-breast regions were discarded. Altogether, and signal-present ROIs, acquired using different dose settings, were collected to train the deep learning model for DM. Testing datasets comprised ROIs, acquired at four dose levels. Fewer data were used for DBT, due to more images being rejected at ROI “quality screening,” with ROIs used for training and datasets used for testing. When partitioning data, it was ensured that the same ROI/VOI image was not used in training and testing datasets.
2.4. Deep Learning Model Observer
A Resnet-18 network14 from the PyTorch deep learning library was used to implement the model observer image classifier. Other popular CNN architectures (VGG-16/19,15 DenseNet,16 and deeper Resnet variants) were also explored; their classification performance was similar to that obtained with Resnet-18, but they required more training time, and some were less stable and needed more fine-tuning. Resnet-18 appeared to be the most suitable existing architecture for our purpose. Both DM and DBT models were based on the ImageNet17 pre-trained Resnet-18. Horizontal and vertical flipping were used to augment the training data and reduce overfitting. A stochastic gradient descent optimizer (momentum = 0.95, with a tuned initial learning rate and weight-decay regularization parameter), combined with a learning rate scheduler applying a 1/2 step decay every five epochs, produced optimal convergence during validation. The validation dataset comprised data acquired at all four X-ray dose levels, equally represented. The cross-entropy loss was used as the cost function for the two-class problem. The models were trained with an early stopping criterion in the iteration loop, a common regularization technique to reduce overfitting. The patience parameter in early stopping was set to 20 epochs.
To account for the stochastic nature of training deep networks and associated variance in performance, each network was trained five times, and the best-performing final model was chosen.
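The corresponding training setup can be sketched as follows, assuming PyTorch/torchvision (a recent release exposing the `weights` API). The initial learning rate and weight-decay values are placeholders, since the exact values are not reproduced here; the momentum, step-decay schedule, early-stopping patience, and best-of-five selection follow the description above.

```python
# Sketch of the Resnet-18 DLMO training loop described in Sec. 2.4.
# lr and weight_decay below are placeholders; momentum = 0.95, the 1/2 step
# decay every 5 epochs, and early-stopping patience of 20 follow the text.
import copy
import torch
import torch.nn as nn
from torchvision import models

def build_dlmo(num_classes: int = 2) -> nn.Module:
    # ImageNet-pretrained Resnet-18 with a 2-class (SA/SP) output head.
    # Grayscale DM ROIs can be replicated to three channels; the DBT
    # three-slice stacks map naturally onto the three input channels.
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def train_once(model, train_loader, val_loader, device, max_epochs=200, patience=20):
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,        # placeholder
                                momentum=0.95, weight_decay=1e-4)   # placeholder
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()
        model.eval()                      # early stopping on validation loss
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                val_loss += criterion(model(x), y).item() * x.size(0)
                n += x.size(0)
        val_loss /= n
        if val_loss < best_loss:
            best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    model.load_state_dict(best_state)
    return model, best_loss

# Per the text, training is repeated five times and the best-performing
# final model is retained.
```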
To establish a reference to which our proposed DLMO can be compared, we validated it against a published implementation of the CNN-based ideal observer (IO-CNN).10 Details are provided in the Appendix. Using a large collection of simulated digital mammography ROIs with random microcalcifications in an anthropomorphic breast background, we demonstrate that the signal detection performance of the Resnet-18-based DLMO approaches that of the IO-CNN.
3. Results
3.1. Resnet-18 Based DLMO for Mammography and Tomosynthesis Applied to Physical Phantom Data
Classification probability distributions for SA and SP test images, as estimated by the deep learning model, were used to construct receiver operating characteristic (ROC) curves, from which the area under the curve (AUC) values were calculated. Hereafter, we use the AUC as a figure of merit, with uncertainties expressed as symmetric 95% confidence intervals (one-sided error bar = 1.96σ, where the standard deviation σ was obtained by bootstrapping the data).18
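A minimal sketch of this figure-of-merit calculation is shown below (scikit-learn for the AUC, with the 95% half-width taken as 1.96 times the bootstrap standard deviation, consistent with the symmetric intervals defined above); the number of bootstrap iterations is illustrative.

```python
# Sketch: AUC with a bootstrap-based symmetric 95% CI (half-width = 1.96*sigma).
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_ci(labels, scores, n_boot=2000, seed=0):
    labels, scores = np.asarray(labels), np.asarray(scores)
    rng = np.random.default_rng(seed)
    auc = roc_auc_score(labels, scores)
    boot, n = [], len(labels)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                # resample with replacement
        if len(np.unique(labels[idx])) < 2:
            continue                               # need both classes in a resample
        boot.append(roc_auc_score(labels[idx], scores[idx]))
    half_width = 1.96 * np.std(boot, ddof=1)
    return auc, (auc - half_width, auc + half_width)

# Usage: 'scores' are the model's SP-class probabilities on the test ROIs,
# 'labels' are 1 for signal-present and 0 for signal-absent.
# auc, ci95 = auc_with_ci(labels, scores)
```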
Examples of the ROC curves for DM and DBT acquired with different dose levels, using Resnet-18 model observers trained with all available data, are provided in Fig. 3. It is interesting to note that, for tomosynthesis, dose reduction results in a larger separation between the ROC curves (and corresponding AUC values) than for mammography, i.e., tomosynthesis may be more sensitive to changes in incident X-ray fluence than mammography.
Fig. 3.
Calcification detection ROC curves for DM/DBT for varying X-ray exposure and AUC values with 95% CIs (Hologic Selenia Dimensions system). (a) DM. (b) DBT.
Our main findings are presented as DLMO-estimated AUC values at the four exposure levels. Figures 4 and 5 show these results for DM and DBT. A few observations can be made from these measurements. First, mammography and tomosynthesis show similar calcification detectability for the three higher-dose settings. Second, the DLMO AUC tracks X-ray dose for both modalities. A stronger decline in AUC performance with decreased dose is seen with DBT, especially at the 0.5 AEC setting. Across the three highest dose settings, there is only a slight decline in performance; the differences among these data points are not statistically significant. This means that microcalcification detectability on the Hologic Selenia Dimensions varies little with X-ray exposure in this range. At the 0.5 AEC setting, however, we see a detectable drop in AUC for both mammography and tomosynthesis. Third, we explored DLMO behavior with a reduced number of training images. Requiring fewer X-ray scans to acquire enough training data would make this approach more appealing and practical. For DM, reducing the number of training signal-present ROIs from 6574 to 4000 leads to only a minor shift of the AUC points (red square markers) across the dose range (Fig. 4), whereas using 2000 signal-present ROIs (green triangle markers) still produced a usable range of AUC values. The DLMO for DBT exhibited a more pronounced drop in AUC performance with fewer training samples, as seen in Fig. 5. One consequence of having less training data is larger error bars on the AUC estimates. Even with only 2000 signal-present ROIs, our observer can still reliably distinguish between microcalcification detectability in full-dose and half-dose acquisitions for DM and DBT. It is interesting to note that when we imaged the standard ACR-MAP accreditation phantom at the 0.5 AEC exposure, the scores for the three largest speck groups, shown in Figs. 6(a) and 6(b) for DM and DBT, still cleared the passing criteria described in the ACR-MAP QC manual.2
Fig. 4.
DLMO performance versus dose for DM. Colors indicate the change in performance when less training data are used. Models were tested using an independent set of 180 signal-absent + 180 signal-present ROIs.
Fig. 5.
DLMO performance versus dose for DBT. Colors indicate the change in performance when less training data are used. Models were tested using an independent set of 150 signal-absent + 150 signal-present VOIs.
Fig. 6.
ACR-MAP phantom speck groups at 0.5AEC: DM (a), DBT (b).
With our existing prototype microcalcification insert, the 2000 signal-present ROIs needed to train the DLMO can be collected in a manageable number of scans.
3.2. Original Resolution Versus Binned Mammograms
It is possible to mimic a mammography system with reduced spatial resolution (larger pixel size) by applying pixel binning to the existing DM images. Such an averaging operation degrades the spatial resolution that is important for microcalcification detection but also reduces pixel noise. Binned training and testing datasets were used to train the DLMO, construct AUC-versus-dose plots, and compare calcification detectability with that of the original-resolution mammograms. The results of this mini-experiment are presented in Fig. 7. In this case, our method shows significantly lower AUC performance at all dose levels for the images with degraded spatial resolution, as expected.
Fig. 7.
DM calcification detectability with original and re-binned images versus dose level.
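As an illustration, the binning operation can be implemented as a simple block average; the 2 × 2 factor below is only an example and is not asserted to be the factor used in the study.

```python
# Sketch: emulate a coarser detector by averaging non-overlapping pixel blocks.
import numpy as np

def bin_image(img: np.ndarray, factor: int = 2) -> np.ndarray:
    """Average factor x factor blocks (any remainder rows/columns are trimmed)."""
    h = (img.shape[0] // factor) * factor
    w = (img.shape[1] // factor) * factor
    blocks = img[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# A 70-micron-pixel mammogram binned 2 x 2 behaves like a 140-micron-pixel
# image with reduced per-pixel noise.
binned = bin_image(np.random.rand(2048, 1664), factor=2)  # placeholder input
```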
3.3. Comparison of Older and Newer Mammography Systems
To demonstrate how imaging performance from different technologies can be compared using the proposed method, we conducted an experiment with two clinical DM systems in our laboratory, a 2008 Lorad Selenia and a 2014 Hologic Selenia Dimensions. Differences between the two machines include X-ray tube anode material (molybdenum versus tungsten), sets of available filters, AEC tables, different a-Se direct-conversion detectors, different scatter rejection grid designs, and post-processing software.
For training the DLMO on the Lorad system, 4000 SP and 4000 signal-absent ROIs were collected to allow a fair comparison with the Hologic Selenia Dimensions system. The testing dataset comprised signal-absent and signal-present ROIs for each of the four dose levels. The same deep-learning training and testing software was used (i.e., architecture and hyperparameters unchanged). Figure 8 shows the AUC-versus-dose comparison between the two systems. The results suggest that the newer system is superior with regard to the detectability of small microcalcification signals at the AEC and the two lower dose settings. The largest gain in AUC (28%) is seen at the lowest exposure. Curiously, the two systems resulted in identical AUC values at the highest dose. The promising conclusion from this experiment is that the assessment approach described herein seems to be able to detect a non-negligible task performance difference between the two technologies.
Fig. 8.
Lorad Selenia versus Hologic Selenia Dimensions performance versus dose level.
4. Discussion
The most pertinent assessment of the effectiveness of a breast imaging system is how well it can detect and classify diagnostically important features within the breast. The primary diagnostic features that radiologists search for when reading a DM/DBT image are extended masses and microcalcifications. This research presents a methodology for objective, task-based performance assessment of DM and DBT systems using an anthropomorphic physical breast phantom. By realistically modeling the task of microcalcification detection, we hypothesize that this approach can partially predict how well a system will perform in clinical use. A similar approach for assessing the detection of masses will be the subject of future work.
As previously mentioned, there have been other studies discussing methods for task-based performance assessment of breast imaging systems using structured phantoms. The approach presented herein is unique for a few reasons. First, unlike the structured background of the University of Leuven L1 phantom,5 or the CIRS structured background phantom,19 the physical phantom here is based on a model of breast anatomy. As previously shown,11 the models for adipose and fibroglandular tissue in the phantom used here have photon attenuation properties similar to those of real breast tissues. Second, by using inserts containing 60 microcalcification clusters (each with 5 to 15 randomly placed microcalcifications fabricated from calcium hydroxyapatite) and moving those inserts within the phantom before each acquisition, it is possible to obtain a large number of regions- or volumes-of-interest (ROIs/VOIs) for use in training and testing with a reasonable number of scans. Finally, the approach studied here uses a deep-learning Resnet-18 model observer trained for the binary decision of whether a microcalcification cluster is present or absent within the ROI/VOI. Other similar phantom studies use human observers to read ROIs/VOIs; however, as shown in published studies, model observers typically have lower intra- and inter-observer variance.9 Another limitation of using human observers to assess performance with phantoms is that it is time-consuming. Other studies have used analytical model observers such as non-pre-whitening (NPW) matched filters or channelized Hotelling observers (CHO), which strive to model either the IO or human observers.
In this study, we have chosen to use a Resnet-18 DLMO to assess task performance. This selection is based on two observations. First, it is likely that mammography and DBT images will be read, at least in part, by deep learning–based algorithms in the future; thus, assessing image quality from anthropomorphic breast phantom studies with a DLMO is a reasonable approach. Second, we have shown through simulation studies (as described in the Appendix) that for the task of microcalcification detection, the Resnet-18 model can approximate the performance of an IO model if enough training images are available. As the amount of training data for the experimental phantom study described herein is somewhat limited (the IO-CNN with simulated data required on the order of 122,500 SP and 122,500 signal-absent ROIs to saturate), it is likely that the Resnet-18 is not achieving maximum performance on the experimental phantom data. However, we believe that the sub-optimal Resnet-18 algorithm can still be useful in assessing and optimizing system performance. Future studies will investigate whether transfer learning can be used to bring performance closer to the IO with fewer training data.
The results from this study showed that the DLMO can track differences in DM and DBT system performance. By changing the exposure incident on the breast phantom, it was observed that the AUC varied minimally across the three higher exposure settings but was significantly reduced at half of the AEC exposure. This demonstrates that this objective task-based approach can track the decrease in performance with exposure and could be an important tool for the optimization of AEC settings. This result was different when subjectively analyzing scores of the ACR-MAP for the different exposure levels: the ACR-MAP was not sensitive enough to differentiate performance at the four exposure levels tested.
It was also observed that calcification detection performance on DM was significantly decreased with a larger effective pixel size, which was emulated by binning the DM images. Task-based assessment was also used to explore differences in performance between an older and a more recent DM system from the same manufacturer (Lorad/Hologic). Our measurements suggested that the newer DM system outperformed or matched the older technology across the range of X-ray exposure settings.
There are a few limitations with the described approach for assessing image quality, and refinements are needed. The phantom used in this study is not a commercial product and thus is not necessarily reproducible. This should be the subject of future work in this area.
The phantom described here is a stack of printed paper. To be more realistic, a method for removing paper outside of the breast support and attaching and aligning the sheets of paper should be employed.20 The study described herein used a phantom that modeled a 40-mm breast thickness with heterogeneously dense tissue. It is important that the task being tested is challenging for the system, and thus, testing on phantoms modeling less dense tissue is probably not necessary. For some evaluation studies, multiple phantoms modeling different breast thicknesses might be needed.
Inserts with microcalcification clusters were made manually, using miniature tweezers and a magnifying glass to place individual calcification particles in clusters centered at the tagged locations on a sticky plastic swatch. Our goal was to have clusters with randomly placed microcalcifications of sizes varying between 150 and 180 microns. However, as was learned later, on a number of occasions, smaller individual micro-particles of hydroxyapatite formed larger aggregations (most likely due to electrostatic attraction), which later appeared as bright spurious signals in mammographic and DBT images. Contaminated image patches are unsuitable for deep learning model training and testing, and therefore, all extracted signal-present ROIs were screened by a human. With this prototype insert design, we had to reject of all collected ROIs. Such a screening process is inconsistent and subjective and can potentially be a source of bias in the DLMO performance estimates.
As was pointed out above, our microcalcification inserts were handcrafted. Due to the labor involved in manufacturing them, only four such templates were made. Therefore, we used X-ray images of the same calcification clusters for model training and testing. Although quantum noise and the anthropomorphic phantom background were different in each acquisition, there might be a concern regarding possible bias in performance estimates when the same physical signal patterns are used in algorithm training and testing. This problem will be addressed in our next signal template design, in which separate inserts with microcalcifications, previously unseen by the neural network, will be reserved for testing.
A fixed vertical position of the microcalcification insert within the phantom meant that no variation in magnification and geometric unsharpness was included in the experiments. This variation will be incorporated in future applications of this method.
It is clear that in a practical scenario, when the Resnet-18-based DLMO is used together with breast phantom DM or DBT images to evaluate a clinical system, it would be unlikely to achieve IO performance because of the large training dataset required. However, as shown above in the experimental results, the DLMO can still produce meaningful AUC-versus-exposure data with fewer training samples. One potentially useful approach for fair comparison of technologies with such a DLMO, which deserves further investigation, would be to pre-train the Resnet-18 with a rich array of simulated data and then acquire only as many phantom images as needed to attain AUC saturation.
Another limitation was that the DLMO was applied only under conditions that were also represented in its training data. Although this approach may be adequate for many applications, it could pose limitations in the context of QC (which typically involves monitoring changes in system performance over time), as the behavior of the DLMO under conditions that deviate from the trained settings remains unknown.
We hypothesize that variations in the DLMO’s performance would be similar to those seen in human performance. However, the DLMO may be either more or less sensitive than humans to changes in acquisition conditions, which could result in underestimating or overestimating potential issues.
5. Conclusions
We propose a novel method of objective task-based assessment of DM/DBT technology with an anthropomorphic breast phantom, realistic microcalcifications, and a DL-based model observer. Although the present implementation is experimental and may not be ready for immediate use, our results show that the approach has promising potential. Possible applications include regulatory submissions, quality control testing, and for use in optimizing acquisition parameters or designing new systems.
6. Appendix: Validation of Resnet-18 Classifier for Detection of Microcalcifications
It is important to validate the proposed Resnet-18-based observer model against a traditional, well-understood numerical observer design. The highest achievable performance among all observers, including humans, is obtained by the Bayesian IO, which satisfies the optimality criteria and whose test statistic is given by the likelihood ratio.21 The main challenge with estimating the IO is that it requires knowledge of the full joint PDF of the data under each hypothesis and is mathematically tractable only for a few special cases (e.g., Gaussian measurement noise with objects and backgrounds known exactly, for which the IO is linear). To overcome this limitation, a number of approaches have been investigated recently, including ones using CNNs, supervised learning methods, and Markov chain Monte Carlo with generative adversarial networks (GANs), for detection and detection–localization tasks.10,22,23 We chose Zhou et al.’s22 implementation of the CNN-based model, which aims to approximate IO performance for binary signal detection with random signals and backgrounds (hereafter IO-CNN), to establish a reference benchmark for comparison.
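For reference, a standard textbook statement of the IO decision rule for this binary detection task (not specific to this study) is

\[
\Lambda(\mathbf{g}) \;=\; \frac{p(\mathbf{g}\mid H_1)}{p(\mathbf{g}\mid H_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau ,
\]

where \(\mathbf{g}\) denotes the image data, \(H_1\) and \(H_0\) are the signal-present and signal-absent hypotheses, and sweeping the decision threshold \(\tau\) traces out the IO's ROC curve.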
6.1. IO-CNN Implementation
A high-level IO-CNN architecture diagram is shown in Fig. 9. The network consists of several convolutional layers, each having 64 filters and a leaky ReLU activation function, followed by a max-pooling layer and then a fully connected output layer with a logistic (sigmoid) function for the binary classification output.
Fig. 9.
IO-CNN implementation.
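A minimal PyTorch sketch of this architecture is given below; the number of convolutional layers, kernel size, and ROI size are placeholders standing in for the hyperparameters swept in this section.

```python
# Sketch of the IO-CNN: stacked conv layers (64 filters, leaky ReLU), one
# max-pooling layer, and a fully connected sigmoid output for P(signal present).
import torch
import torch.nn as nn

class IOCNN(nn.Module):
    def __init__(self, n_conv: int = 5, kernel: int = 5, roi: int = 64):  # placeholders
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(n_conv):
            layers += [nn.Conv2d(in_ch, 64, kernel, padding=kernel // 2),
                       nn.LeakyReLU(inplace=True)]
            in_ch = 64
        layers.append(nn.MaxPool2d(2))
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * (roi // 2) ** 2, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Training (described next): binary cross-entropy on 100 SA + 100 SP mini-batches
# with the Adam optimizer and early stopping, e.g.
# model, loss_fn = IOCNN(), nn.BCELoss()
# optimizer = torch.optim.Adam(model.parameters())
```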
The algorithm was trained by minimizing the binary cross-entropy loss function on mini-batches of 100 SA and 100 SP images. The Adam optimizer,24 with a tuned learning rate, was used to find optimal model weights, with early stopping to control generalization error and avoid overfitting. The CNN accepted grayscale floating-point TIFF patches cropped from MC-GPU phantom “mammograms” as input. The output, in the form of a list of class probabilities, was used to create an ROC curve and compute its AUC value and standard deviation using the Metz-ROC software.25,26
The following neural network parameters were varied in a search to determine the optimally performing configuration for a given dataset: the number of convolutional layers, the convolution filter size, and the number of training epochs. In the end, the models that resulted in the smallest validation loss at each exposure level were selected to produce the IO-CNN performance metrics reported below. Examples of the parameter optimization search are illustrated in Fig. 10.
Fig. 10.
(a) Convolution kernel size versus number of layers. (b) Number of epochs versus kernel size. (c) Number of convolutional layers. (d) Resulting ROC for the two tested configurations.
6.2. Training and Testing Data
The accuracy of an IO estimate applicable to realistic images comes at the expense of the extensive training dataset required to fully learn the image statistics. Thus, to achieve convergence, a large amount of synthetic data was produced using the Monte Carlo X-ray tracing suite MC-GPU,27 by simulating a commercial DM system and imaging an ensemble of VICTRE28 anthropomorphic breast phantoms.
A total of 1000 compressed phantom realizations of the scattered breast density type were used to generate data for the IO-CNN training and testing. Half of these phantoms contained no signals, whereas the other half was populated with microcalcification clusters, 249 per phantom. Each cluster contained five randomly placed specks of hydroxyapatite calcification particles. The density of the particles was adjusted in a series of pilot experimental runs such that the resulting observer performance (AUC) was neither saturated at unity nor too small. As microcalcification conspicuity is largely dependent on quantum noise, the performance of the IO-CNN was evaluated at four X-ray dose settings, with the nominal “full dose” (FD) exposure estimated by the MC-GPU software; the other three dose points correspond to reduced fractions of FD. Figure 11 shows sample signal-present (calcifications embedded in the anatomical background) and signal-absent regions of interest.
Fig. 11.
Examples of “full dose” ROIs used for IO-CNN algorithm training, validation, and testing. (a) Microcalcification clusters (easier patches shown for clarity). (b) Background.
An important question with the neural network-based IO is how much training data it needs to reach its true performance limit. The answer depends on the difficulty of the task. To explore this, we first trained the model using 70k + 70k signal-absent and signal-present (SA + SP) images, with a 1k + 1k hold-out test set and a 1k + 1k hold-out validation set, from our “full dose” simulations and recorded the resulting ROC AUC. Then, the training set size was expanded to 122.5k + 122.5k, with which the new model produced a 2.3% improvement in AUC with 1.75× more training samples. This suggests that the model is asymptotically approaching its performance ceiling, where adding more training data results in only slightly improved AUC. In a third test, with a 490k + 490k SA + SP training set formed by pooling the data from all four dose levels, the resulting AUC on the same test set was only a 0.55% increase over that from the previous experiment. The two AUC values (from the models trained with the 122.5k and 490k datasets) were also within each other's measurement uncertainties. Analyses repeated for the remaining dose levels yielded AUC gains similar to those obtained with the full-dose data. Based on this reasoning, we estimate that with a 122.5k training set size, the IO-CNN performance is very close to its true upper bound. The results presented below were obtained with the IO-CNN trained with 122.5k + 122.5k SA and SP ROIs, with a 1k + 1k independent hold-out set used for validation and another 1k + 1k independent hold-out set used for testing.
6.3. Resnet-18 DLMO Versus IO-CNN
A 10-fold cross-validation was used to obtain Resnet-18’s mean AUC values and their standard deviations. A comparison between the two models (both were trained on the same images) is shown in Fig. 12.
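A compact sketch of this cross-validation procedure is shown below (scikit-learn folds around the training helpers sketched earlier); `train_fn` and `score_fn` are placeholders for fitting a DLMO on a fold and returning its SP-class probabilities.

```python
# Sketch: 10-fold cross-validated AUC (mean and standard deviation).
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cross_validated_auc(images, labels, train_fn, score_fn, n_splits=10, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in skf.split(images, labels):
        model = train_fn(images[train_idx], labels[train_idx])   # fit on 9 folds
        scores = score_fn(model, images[test_idx])               # P(SP) on held-out fold
        aucs.append(roc_auc_score(labels[test_idx], scores))
    return float(np.mean(aucs)), float(np.std(aucs, ddof=1))
```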
Fig. 12.
Resnet-18-based observer versus IO-CNN performance comparison using a large amount of synthetic DM images. Detection task is calcification detection in a realistic breast anatomy background.
As can be seen from the plot, for our task of detecting microcalcifications in an anthropomorphic background, the Resnet-18 DLMO performance almost perfectly matches that of the IO-CNN when trained with the same amount of data. Although the mean AUC values from the two observers are remarkably close to each other, the optimized IO-CNN consistently performs slightly better than Resnet-18. Naturally, the largest difference between the two models (1%) is at the lowest exposure, where the two AUC values still have overlapping error bars. It should be noted that the same fixed Resnet-18 DLMO hyper-parameters were used throughout, and it could be speculated that fine-tuning or pre-training Resnet-18 for each dose point individually could have further improved its performance.
Obviously, collecting hundreds of thousands of ROIs with a physical breast phantom to evaluate a clinical system for QA or constancy testing is impractical. We therefore investigated how the DLMO’s performance would be penalized by using, for instance, only 10k + 10k ROIs (a small fraction of what was used for the IO-CNN) for model training. These data are plotted with green markers in Fig. 12. As expected, we see a clear decrease in DLMO AUC values for all dose levels, with significantly larger error bars. The trend of the curve, although having the same pattern as that of the model trained on many more images, has a steeper fall-off toward lower exposures. It should be emphasized that the training set size required in an actual application will depend strongly on the difficulty of the task, e.g., the contrast and size of the microcalcification particles. The example dataset used to compare the performance of the two observers here was purposely made quite challenging.
Another way to analyze the proposed model is to look at the correlation of its output probabilities with those from the reference observer. Here, we used Spearman’s non-parametric rank correlation (no assumption about the distribution of probabilities is made), calculated from the lists of probabilities of a test ROI being SP. Correlation analysis results are shown in Fig. 13. The estimated correlation coefficient indicates a high degree of agreement between the two observers. From the scatter plot, we infer that the IO-CNN versus Resnet-18 predictions are localized rather tightly near one end of the probability range and are more scattered near the other, with a broader spread in between, particularly where both models are uncertain about whether a particular ROI is SA or SP.
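The correlation computation itself is a one-liner with SciPy; `p_iocnn` and `p_resnet` below are placeholder arrays of SP-class probabilities for the same test ROIs from the two observers.

```python
# Sketch: Spearman rank correlation between the two observers' output probabilities.
import numpy as np
from scipy.stats import spearmanr

p_iocnn = np.load("iocnn_test_probs.npy")    # placeholder file names
p_resnet = np.load("resnet_test_probs.npy")
rho, p_value = spearmanr(p_iocnn, p_resnet)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.2g})")
```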
Fig. 13.
IO-CNN versus Resnet-18 DLMO correlation analysis. (a) Scatter plot (1000 + 1000 SA + SP points). (b) Spearman’s correlation matrix.
Acknowledgments
The authors would like to acknowledge Dr. Frank Samuelson (FDA) for helping with statistical analysis and insightful discussions.
Biography
Biographies of the authors are not available.
Contributor Information
Andrey Makeev, Email: andrey.makeev@fda.hhs.gov.
Kaiyan Li, Email: kaiyanl2@illinois.edu.
Mark A. Anastasio, Email: maa@illinois.edu.
Arthur Emig, Email: arthur.e.emig@gmail.com.
Paul Jahnke, Email: paul.jahnke@phantomx.de.
Stephen J. Glick, Email: stephen.glick@fda.hhs.gov.
Disclosures
The mention of commercial products, their sources, or their use in connection with material reported herein is not to be construed as either an actual or implied endorsement of such products by the Department of Health and Human Services. This is a contribution of the US Food and Drug Administration and is not subject to copyright. The authors have no conflicts to disclose. Mark A. Anastasio was funded in part by the National Institutes of Health (Award No. EB034249). Kaiyan Li was funded by an Oak Ridge Institute for Science and Education (ORISE) fellowship. Paul Jahnke is a shareholder of PhantomX GmbH. No financial support from industry was received for this study. The data were controlled and analyzed by Andrey Makeev.
Code and Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
- 1.Metter R. L. V., Beutel J., Kundel H. L., Handbook of Medical Imaging, Volume 1. Physics and Psychophysics, SPIE Press Book, Bellingham, Washington: (2000). [Google Scholar]
- 2.American College of Radiology, “Phantom testing: mammography (Revised 12-12-19),” https://accreditationsupport.acr.org/support/solutions/articles/11000065938-phantom-testing-mammography-revised-12-12-19- (2024).
- 3.Kotre C. J., “The effect of background structure on the detection of low contrast objects in mammography,” Br. J. Radiol. 71(851), 1162–1167 (2014). 10.1259/bjr.71.851.10434911 [DOI] [PubMed] [Google Scholar]
- 4.Bochud F. O., et al. , “Estimation of the noisy component of anatomical backgrounds,” Med. Phys. 26(7), 1365–1370 (1999). 10.1118/1.598632 [DOI] [PubMed] [Google Scholar]
- 5.Cockmartin L., et al. , “Design and application of a structured phantom for detection performance comparison between breast tomosynthesis and digital mammography,” Phys. Med. Biol. 62, 758–780 (2017). 10.1088/1361-6560/aa5407 [DOI] [PubMed] [Google Scholar]
- 6.Petrov D., et al. , “Deep learning channelized Hotelling observer for multi-vendor DBT system image quality evaluation,” Proc. SPIE 11316, 113160X (2020). 10.1117/12.2548998 [DOI] [Google Scholar]
- 7.Balta C., et al. , “A model observer study using acquired mammographic images of an anthropomorphic breast phantom,” Med. Phys. 45(2), 655–665 (2018). 10.1002/mp.12703 [DOI] [PubMed] [Google Scholar]
- 8.Ravaglia V., et al. , “The small-size details detection performance of digital breast tomosynthesis, synthetic 2D, and conventional full-field digital mammography images for different mammography systems: a multicenter study,” Proc. SPIE 11513, 115131H (2020). 10.1117/12.2564279 [DOI] [Google Scholar]
- 9.Petrov D., et al. , “Model and human observer reproducibility for detection of microcalcification clusters in digital breast tomosynthesis images of three-dimensionally structured test object,” J. Med. Imaging 6(1), 015503 (2019). 10.1117/1.JMI.6.1.015503 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Zhou W., Li H., Anastasio M. A., “Approximating the ideal observer for joint signal detection and localization tasks by use of supervised learning methods,” IEEE Trans. Med. Imaging 39(12), 3992–4000 (2020). 10.1109/TMI.2020.3009022 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Ikejimba L. C., et al. , “A novel physical anthropomorphic breast phantom for 2D and 3D X-ray imaging,” Med. Phys. 44(2), 407–416 (2017). 10.1002/mp.12062 [DOI] [PubMed] [Google Scholar]
- 12.Graff C. G., “A new open-source multi-modality digital breast phantom,” Proc. SPIE 9783, 978309 (2016). 10.1117/12.2216312 [DOI] [Google Scholar]
- 13.Ghammraoui B., et al. , “Fabrication of microcalcifications for insertion into phantoms used to evaluate X-ray breast imaging systems,” Biomed. Phys. Eng. Express 7, 055021 (2021). 10.1088/2057-1976/ac1c64 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.He K., et al. , “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vision and Pattern Recognit. (2016). 10.1109/CVPR.2016.90 [DOI] [Google Scholar]
- 15.Simonyan K., Zisserman A., “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556 (2014).
- 16.Huang G., et al. , “Densely connected convolutional networks,” in IEEE Conf. Comput. Vision and Pattern Recognit., pp. 2261–2269 (2017). 10.1109/CVPR.2017.243 [DOI] [Google Scholar]
- 17.Deng J., et al. , “Imagenet: a large-scale hierarchical image database,” in IEEE Conf. Comp. Vision and Pattern Recognit., IEEE, pp. 248–255 (2009). 10.1109/CVPR.2009.5206848 [DOI] [Google Scholar]
- 18.Pezzullo J. C., Biostatistics for Dummies, John Wiley & Sons, Hoboken, New Jersey: (2013). [Google Scholar]
- 19.CIRS Tissue Simulation & Phantom Technology, “BR3D breast imaging phantom,” https://www.cirsinc.com/products/mammography/br3d-breast-imaging-phantom/ (2024).
- 20.Jahnke P., et al. , “Paper-based 3D printing of anthropomorphic CT phantoms: feasibility of two construction techniques,” Eur. Radiol. 29, 1384–1390 (2019). 10.1007/s00330-018-5654-1 [DOI] [PubMed] [Google Scholar]
- 21.Kupinski M. A., et al. , “Ideal-observer computation in medical imaging with use of Markov-chain Monte Carlo techniques,” J. Opt. Soc. Am. A. Opt. Image Sci. Vision 20(3), 430–438 (2003). 10.1364/JOSAA.20.000430 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Zhou W., Li H., Anastasio M. A., “Approximating the ideal observer and Hotelling observer for binary signal detection tasks by use of supervised learning methods,” IEEE Trans. Med. Imaging 38(10), 2456–2468 (2019). 10.1109/TMI.2019.2911211 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Li K., et al. , “Supervised learning-based ideal observer approximation for joint detection and estimation tasks,” Med. Phys. 47(6), E440 (2020). [Google Scholar]
- 24.Kingma D., Ba J., “Adam: a method for stochastic optimization,” in Int. Conf. Learn. Represent. (ICLR) (2015). [Google Scholar]
- 25.Metz C. E., “Free distribution ROC software,” https://radiology.uchicago.edu/research/metz-roc-software (2024).
- 26.Pesce L. L., Metz C. E., “Reliable and computationally efficient maximum-likelihood estimation of “proper” binormal ROC curves,” Acad. Radiol. 14(7), 814–829 (2007). 10.1016/j.acra.2007.03.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Badal A., Badano A., “Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit,” Med. Phys. 36(11), 4878–4880 (2009). 10.1118/1.3231824 [DOI] [PubMed] [Google Scholar]
- 28.Badano A., et al. , “Evaluation of digital breast tomosynthesis as replacement of full-field digital mammography using an in silico imaging trial,” JAMA Network Open. 1(7), e185474 (2018). 10.1001/jamanetworkopen.2018.5474 [DOI] [PMC free article] [PubMed] [Google Scholar]