Abstract
A 2% threshold has been traditionally used to recommend breast biopsy in mammography. We aim to characterize how the biopsy threshold varies to achieve the maximum expected utility (MEU) of tomosynthesis for breast cancer diagnosis. A cohort of 312 patients, imaged with standard full field digital mammography (FFDM) and digital breast tomosynthesis (DBT), was selected for a reader study. Fifteen readers interpreted each patient's images and estimated the probability of malignancy using two modes: FFDM versus FFDM+DBT. We generated receiver operating characteristic (ROC) curves with the probabilities for all readers combined. We found that FFDM+DBT provided improved accuracy and MEU compared with FFDM alone. When DBT was included in the diagnosis along with FFDM, the optimal biopsy threshold increased to 2.7% as compared with the 2% threshold for FFDM alone. While understanding the optimal threshold from a decision analytic standpoint will not help physicians improve their performance without additional guidance (e.g. decision support to reinforce this threshold), the discovery of this level does demonstrate the potential clinical improvements attainable with DBT. Specifically, DBT has the potential to lead to substantial improvements in breast cancer diagnosis since it could reduce the number of patients recommended for biopsy while preserving the maximal expected utility.
Keywords: Digital Breast Tomosynthesis, Mammography, Breast Biopsy, Expected Utility, ROC Analysis
1. INTRODUCTION
Image-guided core-needle breast biopsy has become an integral part of diagnostic workup for a suspicious mammographic finding. In the United States over 700,000 women undergo breast biopsies per year1, 2. However, most women who have a breast biopsy do not have breast cancer—in fact, approximately 80% of image-guided breast biopsies are benign. Reducing unnecessary biopsies is important for several reasons. Breast biopsy is the most costly per capita component of a breast cancer screening program3. Moreover, breast biopsy can cause side effects such as bleeding, bruising, and infection4, 5. Furthermore, for a patient, waiting days for the results of a breast biopsy appears to affect stress hormone levels just as much as the news of a breast cancer diagnosis6.
One strategy for reducing unnecessary biopsies is to improve breast cancer risk estimation. Mammography is currently the standard of care for early diagnosis of breast cancer. However, conventional mammography has nontrivial false-positive (recall for additional images or biopsy) and false-negative (missed cancer) rates7. Thus, substantial effort is being invested to improve breast cancer risk estimation through a variety of new imaging technologies and practice techniques. Recently, digital breast tomosynthesis, also referred to as “3-D mammography”, has been developed to reduce false-positive and false-negative findings. The feasibility of using tomosynthesis in breast imaging has been demonstrated8, 9, and the U.S. Food and Drug Administration recently approved the first commercial system for clinical use10, but many questions remain regarding its optimal use.
Attempts to improve breast cancer risk estimation, so as to more accurately target individuals who are most likely to benefit from early diagnosis and least likely to experience false positives, are being actively pursued11. Another promising strategy to accomplish this goal is to pursue the optimal breast cancer risk threshold that radiologists should use to recommend biopsy. Several methods have been proposed to determine optimal thresholds for medical diagnosis12–15. In current clinical practice, radiologists typically use a 2% probability of breast cancer as the threshold above which a breast biopsy is recommended in mammography16. The American College of Radiology (ACR) developed the Breast Imaging Reporting and Data System (BI-RADS) lexicon, which standardizes mammography reporting and provides a guide on mammography audits and performance measures17. BI-RADS reinforces this 2% threshold level below which biopsy need not be recommended. Furthermore, a recent Markov decision model, which considers clinically relevant variables, supports the 2% threshold currently used for biopsy as reasonable2.
This study evaluates the performance of digital breast tomosynthesis for breast cancer risk estimation. It aims to determine how the biopsy threshold should change to maximize the expected utility of breast cancer diagnosis when digital breast tomosynthesis is available in addition to mammography.
2. MATERIALS AND METHODS
2.1 Subjects
A cohort of 312 patients, imaged with standard full field digital mammography (FFDM) and digital breast tomosynthesis (DBT), was selected from a screening group and a biopsy group for a reader study18. These cases included 51 malignant and 261 benign/negative subjects. Fifteen readers interpreted each patient's images and provided the probability of malignancy in two modes: FFDM versus FFDM + DBT. The reference standard was based on biopsy results or the radiologist's final interpretation if no biopsy was performed.
2.2 Utility analysis in the ROC domain
Receiver Operating Characteristic (ROC) curve analysis is one of the most widely used statistical approaches to characterize the predictive ability of technologies and methods19–21. It illustrates the performance of a binary classifier system by plotting true positive rate (TPR) vs. false positive rate (FPR), at various threshold settings.
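As a minimal illustration of how such a curve is traced, the sketch below computes empirical (FPR, TPR) pairs by sweeping a threshold over probability estimates. The function name `roc_points`, the convention that a case is called positive when its score is at or above the threshold, and the sample data are all hypothetical; they are not taken from the reader study.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) triples, one per distinct score.

    A case is called positive when its score is >= the threshold
    (an assumed convention for this sketch).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / n_neg, tp / n_pos))
    return points


# Hypothetical probabilities of malignancy and reference labels (1 = malignant).
scores = [0.01, 0.02, 0.05, 0.10, 0.30, 0.60]
labels = [0, 0, 1, 0, 1, 1]
points = roc_points(scores, labels)
for threshold, fpr, tpr in points:
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the operating point up and to the right along the curve, trading more false positives for fewer missed cancers.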
In the ROC domain, after a utility is assigned to each category of outcome (true negative (TN), false positive (FP), false negative (FN), and true positive (TP)), the expected utility of a diagnostic technology or method f is defined as follows:

$$E[U(f)] = p\,\bigl[\mathrm{TPR}\cdot U_{TP} + (1-\mathrm{TPR})\cdot U_{FN}\bigr] + (1-p)\,\bigl[\mathrm{FPR}\cdot U_{FP} + (1-\mathrm{FPR})\cdot U_{TN}\bigr]$$

where E[U(f)] represents expected utility and p is the prevalence of breast cancer. The maximum expected utility (MEU) is defined as the expected utility at the operating point where the line with slope S is tangent to the ROC curve (Figure 1). The MEU occurs at the operating point where a rational radiologist should make a clinical decision14, 22. The slope S is determined by the utility values of the four outcomes and the prevalence of breast cancer, and is given by:

$$S = \frac{1-p}{p}\cdot\frac{U_{TN}-U_{FP}}{U_{TP}-U_{FN}}$$
Figure 1. Relationship between expected utility (EU) curve and ROC curve.
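The tangent-line idea can be sketched numerically. The Python below assumes the standard decision-analytic form of expected utility, E[U] = p[TPR·U_TP + (1−TPR)·U_FN] + (1−p)[FPR·U_FP + (1−FPR)·U_TN], and the corresponding slope S = ((1−p)/p)·(U_TN−U_FP)/(U_TP−U_FN). The function names, the ROC points, and the utility values are hypothetical and chosen only for illustration.

```python
def expected_utility(tpr, fpr, p, u_tp, u_fn, u_fp, u_tn):
    """Expected utility at one ROC operating point (standard form, assumed)."""
    diseased = p * (tpr * u_tp + (1 - tpr) * u_fn)
    healthy = (1 - p) * (fpr * u_fp + (1 - fpr) * u_tn)
    return diseased + healthy


def tangent_slope(p, u_tp, u_fn, u_fp, u_tn):
    """Slope of the iso-utility line in ROC space."""
    return ((1 - p) / p) * (u_tn - u_fp) / (u_tp - u_fn)


# Hypothetical ROC operating points as (FPR, TPR) and hypothetical utilities.
roc = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]
p, u_tp, u_fn, u_fp, u_tn = 0.2, -1.0, -10.0, -0.5, 0.0

s = tangent_slope(p, u_tp, u_fn, u_fp, u_tn)

# Maximizing expected utility directly ...
best_by_eu = max(
    roc, key=lambda pt: expected_utility(pt[1], pt[0], p, u_tp, u_fn, u_fp, u_tn)
)
# ... selects the same point as maximizing TPR - S * FPR.
best_by_slope = max(roc, key=lambda pt: pt[1] - s * pt[0])

print(s, best_by_eu, best_by_slope)
```

The two selections agree because, when U_TP > U_FN, expected utility is a positive multiple of TPR − S·FPR plus a constant, so the point touched by the highest line of slope S is the MEU operating point.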
2.3 Study design
We first generated receiver operating characteristic (ROC) curves for all readers combined using ROCKIT software20, 23 based on the probabilities of malignancy in two modes (FFDM, FFDM+DBT). We compared accuracy in terms of the area under ROC curve (AUC) of FFDM+DBT versus FFDM alone.
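For readers who want a quick check of an AUC comparison, the sketch below uses the empirical (Mann-Whitney) estimate of AUC. This is a deliberately simpler technique than the maximum-likelihood curve fitting performed by ROCKIT and is not a reimplementation of it; the function name and the two sets of scores are hypothetical.

```python
def empirical_auc(scores, labels):
    """Empirical AUC: the fraction of (malignant, benign) pairs in which the
    malignant case receives the higher score, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical probability estimates for the same five cases in two modes.
labels = [0, 0, 0, 1, 1]
scores_ffdm = [0.01, 0.05, 0.20, 0.03, 0.40]
scores_ffdm_dbt = [0.01, 0.02, 0.10, 0.15, 0.50]

auc_ffdm = empirical_auc(scores_ffdm, labels)
auc_ffdm_dbt = empirical_auc(scores_ffdm_dbt, labels)
print(auc_ffdm, auc_ffdm_dbt)
```

A significance test for the AUC difference in a multi-reader design would need to account for reader and case correlation, which this sketch does not attempt.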
Then, we assigned utility for each category of outcomes in breast cancer diagnosis as follows.
We chose TN outcomes as our baseline and assigned a utility of zero.
We assigned a loss of 4.7 days to the utility of FP based on the literature24, 25.
We used the University of Wisconsin Breast Cancer Simulation (UWBCS) model26 to estimate the utility of FN as a loss of 2.52 years. The UWBCS model, developed as part of the Cancer Intervention and Surveillance Modeling Network (CISNET), is a discrete-event, stochastic simulation model designed to replicate breast cancer epidemiology in the U.S. population. The UWBCS includes a detailed and flexible natural history model that explicitly models breast cancer subtypes. It has been cross-validated against Wisconsin state registry and Surveillance Epidemiology and End Results (SEER) data27. The UWBCS has previously been used to provide guidance about the risks and benefits of alternative approaches to breast cancer diagnosis.
For TP, we assumed that its utility was $U_{TP} = U_{FN} \times (1-\alpha)$, $0 \le \alpha \le 1$, where α is an unknown parameter representing the overall effectiveness of breast cancer treatment.
We compared MEU of FFDM+DBT with that of FFDM alone by using assigned utilities. We used 0.03 as the prevalence in MEU calculation, which is comparable to the incidence of breast cancer for the diagnostic population28.
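To make the role of α concrete, the sketch below plugs the assigned utilities into the standard slope formula S = ((1−p)/p)·(U_TN−U_FP)/(U_TP−U_FN), which is assumed here. Expressing all utilities in days with 365.25 days per year is also an assumption, as are the function name and the α values tried.

```python
DAYS_PER_YEAR = 365.25  # assumed conversion; the source does not state one

P = 0.03                       # prevalence used in the MEU calculation
U_TN = 0.0                     # baseline
U_FP = -4.7                    # loss of 4.7 days
U_FN = -2.52 * DAYS_PER_YEAR   # loss of 2.52 years, expressed in days


def slope_for_alpha(alpha):
    """Tangent slope S when U_TP = U_FN * (1 - alpha)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]; the slope is undefined at 0")
    u_tp = U_FN * (1.0 - alpha)
    return ((1.0 - P) / P) * (U_TN - U_FP) / (u_tp - U_FN)


# Illustrative alpha values only; the source treats alpha as unknown.
slopes = {alpha: slope_for_alpha(alpha) for alpha in (0.25, 0.5, 1.0)}
for alpha, s in slopes.items():
    print(f"alpha={alpha:.2f}  S={s:.3f}")
```

Under these assumptions the slope is inversely proportional to α: the less effective treatment is assumed to be, the steeper the tangent line and the further the optimal operating point moves toward fewer biopsies.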
Finally, using the baseline 2% biopsy threshold established in the literature for FFDM, we evaluated the threshold for FFDM+DBT at which its expected utility attained the maximum. That is, we looked for the threshold for FFDM+DBT at which the slope of its ROC curve was the same as that of the FFDM ROC curve at the 2% threshold. We calculated the biopsy rate and positive predictive value (PPV) for the different thresholds. Biopsy rate is the ratio of the number of recommended biopsies to the number of findings. PPV is the percentage of findings recommended for biopsy that result in a tissue diagnosis of cancer. We also estimated the effects of each threshold on the number of false-negative results, the number of false-positive results, and the number of biopsies avoided.
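The quantities compared across thresholds can be tallied as in the following sketch. The function name, the rule that biopsy is recommended when the estimated probability is at or above the threshold, and the eight sample findings are hypothetical.

```python
def biopsy_summary(scores, labels, threshold):
    """Tally outcomes when biopsy is recommended for scores >= threshold."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        recommend = s >= threshold
        if recommend and y == 1:
            tp += 1
        elif recommend:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    n_biopsies = tp + fp
    return {
        "biopsy_rate": n_biopsies / len(scores),
        "ppv": tp / n_biopsies if n_biopsies else float("nan"),
        "false_negatives": fn,
        "false_positives": fp,
    }


# Hypothetical probabilities of malignancy and reference labels (1 = malignant).
scores = [0.005, 0.01, 0.021, 0.025, 0.03, 0.10, 0.50, 0.80]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

at_2_0 = biopsy_summary(scores, labels, 0.020)
at_2_7 = biopsy_summary(scores, labels, 0.027)
print(at_2_0)
print(at_2_7)
```

In this toy data, raising the threshold lowers the biopsy rate and raises the PPV at the cost of one additional false negative, which is the trade-off the utility analysis is meant to weigh.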
In addition, we obtained sensitivity and specificity of FFDM at 2% threshold. We also found a threshold for FFDM+DBT when the same sensitivity or specificity was maintained.
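One simple way to search for a threshold that preserves a target sensitivity is sketched below. It scans a list of candidate thresholds and keeps the highest one whose empirical sensitivity does not fall below the target. The function names, the candidate grid, and the data are hypothetical, and the source does not say how its search was carried out.

```python
def sensitivity_at(scores, labels, threshold):
    """Fraction of malignant cases with score >= threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(1 for s in pos if s >= threshold) / len(pos)


def highest_threshold_with_sensitivity(scores, labels, target, candidates):
    """Highest candidate threshold whose sensitivity is at least `target`,
    or None if no candidate qualifies."""
    ok = [t for t in candidates if sensitivity_at(scores, labels, t) >= target]
    return max(ok) if ok else None


# Hypothetical probabilities of malignancy and reference labels (1 = malignant).
scores = [0.01, 0.015, 0.022, 0.03, 0.05, 0.40]
labels = [0, 0, 1, 1, 0, 1]
candidates = [0.017, 0.020, 0.024, 0.027]

t_full = highest_threshold_with_sensitivity(scores, labels, 1.0, candidates)
t_relaxed = highest_threshold_with_sensitivity(scores, labels, 0.6, candidates)
print(t_full, t_relaxed)
```

The same scan with specificity in place of sensitivity, and the lowest qualifying threshold in place of the highest, gives the matched-specificity counterpart.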
3. RESULTS
FFDM+DBT improved accuracy significantly compared with FFDM alone in terms of AUC (0.879 vs. 0.802, p < 0.001) (Figure 2). Overall, FFDM+DBT provided higher MEU compared with FFDM alone (Figure 3).
Figure 2. ROC curves comparing FFDM+DBT (dashed curve) with FFDM (solid curve).
Figure 3. Difference in MEU between FFDM+DBT and FFDM alone.
When a 2% threshold was used for FFDM, the sensitivity of FFDM was 0.906, the specificity was 0.360, the biopsy rate was 0.684, and the PPV was 0.218. The slope of the line that is tangent to the ROC curve at the 2% threshold was 0.324. When DBT was involved in the diagnosis in addition to FFDM and 2.0% was chosen as the biopsy threshold, the sensitivity was 0.924, the specificity was 0.474, the biopsy rate was 0.591, and the PPV was 0.256. To achieve the maximum expected utility for FFDM+DBT, the threshold was increased to 2.7%, at which the sensitivity of FFDM+DBT was 0.890 and the specificity was 0.610 (Figure 4). The biopsy rate was 0.414 and the PPV was 0.347 when 2.7% was used as the threshold for FFDM+DBT. FFDM+DBT with the 2.7% threshold avoided 1,263 biopsies but missed 25 more cancers as compared to FFDM with the 2% threshold.
Figure 4. MEU of FFDM with 2% threshold (solid curve) and FFDM+DBT with 2.7% threshold (dashed curve).
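The reported rates for FFDM at the 2% threshold can be related to one another with a short calculation. The sketch assumes that the case mix of the reader study (51 malignant of 312) acts as the prevalence and that the denominator is the pooled set of 312 × 15 readings; both are assumptions about how the figures were derived, not statements from the source.

```python
N_CASES, N_MALIGNANT, N_READERS = 312, 51, 15

# Assumed: prevalence equals the study case mix, readings are pooled.
prevalence = N_MALIGNANT / N_CASES
n_readings = N_CASES * N_READERS


def biopsy_rate(sensitivity, specificity, prev):
    """Share of findings recommended for biopsy (true plus false positives)."""
    return prev * sensitivity + (1 - prev) * (1 - specificity)


# FFDM at the 2% threshold: sensitivity 0.906, specificity 0.360.
rate_ffdm = biopsy_rate(0.906, 0.360, prevalence)

# Biopsies avoided, from the reported rates 0.684 (FFDM) and 0.414 (FFDM+DBT).
avoided = (0.684 - 0.414) * n_readings

print(f"biopsy rate from sensitivity/specificity: {rate_ffdm:.4f}")
print(f"pooled readings: {n_readings}, biopsies avoided: {avoided:.1f}")
```

Under these assumptions the computed biopsy rate lands within rounding of the reported 0.684, and the difference in reported rates over the pooled readings is close to the 1,263 avoided biopsies.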
To maintain the same sensitivity as when the 2% threshold was used for FFDM, the threshold was increased to 2.4% for FFDM+DBT and the corresponding specificity was 0.554. To maintain the same specificity, the threshold was reduced to 1.7% for FFDM+DBT and the corresponding sensitivity was 0.949 (Figure 5).
Figure 5. Sensitivity and specificity for different biopsy thresholds.
4. DISCUSSION
We investigate and describe optimal thresholds for recommending biopsies based on digital breast tomosynthesis for the first time. We find an optimal threshold of 2.7% in FFDM+DBT for recommending biopsy as compared with 2% in FFDM alone in order to achieve maximum expected utility. For FFDM+DBT, a biopsy threshold of 2.7% results in substantially fewer biopsies. Understanding the optimal threshold from a decision analytic standpoint creates the opportunity to develop decision support tools that reinforce this threshold in clinical practice and ultimately improve care.
Decreasing the biopsy rate with FFDM+DBT has substantial value. A decreased biopsy rate translates directly into cost reductions in breast cancer diagnosis programs, and it avoids the anxiety and side effects caused by unnecessary biopsies. This study reinforces two strategies for decreasing the biopsy rate. One is to improve breast cancer risk estimation, which is demonstrated by the reduction in biopsy rate from 0.684 to 0.591 when DBT was involved in the diagnosis of breast cancer in addition to FFDM. The other is to pursue the optimal breast cancer risk threshold, which is demonstrated by the further reduction in biopsy rate to 0.414 when the 2.7% threshold was used for recommending biopsy instead of the routinely used 2% threshold. Although the benefits of decreasing the biopsy rate are significant, sufficient attention has to be paid to the potential harms7, 29, 30. This study found that FFDM+DBT with the 2.7% threshold avoided 1,263 biopsies but missed 25 more cancers as compared to FFDM with the 2% threshold. These results suggest that our decision model based on maximum expected utility can be an effective tool for pursuing the optimal breast cancer risk threshold by balancing potential benefits and harms in breast cancer diagnosis.
In this study, PPV at 2.7% threshold for FFDM+DBT is superior to that at 2% threshold for FFDM. PPV is a critical measure of the performance of a diagnostic method, as it reflects the proportion of positive test results that are malignant cancers. A high PPV indicates an effective threshold for reducing the ratio of the number of false positive results to the number of true positive results, which may suggest a low biopsy rate for avoiding unnecessary biopsies, medical costs, and patients' emotional stress31, 32.
The recent emphasis on cost-effective medical practice has strengthened the need to seek an optimal threshold in ROC curve analysis for diagnostic tests. In this study, we align the optimal threshold suggested by the clinical community with that derived from utility analysis, and find that the slope at the optimal threshold is 0.324. This slope is smaller than the values reported in earlier studies13, 33, 34. A large-scale cohort study would be necessary to investigate the cause of the difference, since this study uses the outcomes from a small subset of the screening population to derive the ROC curve, which could significantly affect the characterization of the slope at the optimal threshold. Verifying our methodology in a large cohort study is the subject of ongoing investigation.
There are several limitations in this study. First, we have made an important assumption that, for mammography, the optimal threshold derived from maximum expected utility accords with the clinically used 2% threshold. We believe this assumption is sound since a rational radiologist would seek to maximize the expected utility for recommending breast biopsy14. However, theoretical and experimental support for this assumption is limited. In the future, we plan to derive the optimal threshold from utility analysis after quantifying utilities for the four outcomes in a large cohort study. We will compare the optimal threshold with the clinically used threshold to validate this assumption. Second, we have combined all outputs of breast cancer probabilities from a multi-reader multi-case reader study to derive the ROC curve. We did not consider the complicated correlation among outputs35, 36. Third, age is important in determining an optimal biopsy threshold2. In the future, we will use our methodology to determine optimal biopsy thresholds for different age groups.
5. CONCLUSIONS
We find an optimal threshold of 2.7% in FFDM+DBT for recommending biopsy as compared with 2% in FFDM alone. Tomosynthesis has the potential to lead to substantial improvements in breast cancer diagnosis since it could reduce the number of patients recommended for biopsy while preserving the maximal expected utility. In addition, our proposed expected utility model could be used as a general technique to characterize optimal thresholds and to quantify the potential diagnostic value of different modalities.
ACKNOWLEDGEMENTS
This work was supported by the National Institutes of Health (R01-CA127379).
REFERENCES
- [1] Ghosh K, Melton L, Suman V, et al. Breast biopsy utilization: a population-based study. Arch Intern Med. 2005;165(14):1593–1598. doi: 10.1001/archinte.165.14.1593.
- [2] Burnside ES, Chhatwal J, Alagoz O. What is the optimal threshold at which to recommend breast biopsy? PLoS ONE. 2012;7(11):1–9. doi: 10.1371/journal.pone.0048820.
- [3] Poplack S, Carney P, Weiss J, et al. Screening mammography: costs and use of screening-related services. Radiology. 2005;234(1):79–85. doi: 10.1148/radiol.2341040125.
- [4] Brewer N, Salz T, Lillie S. Systematic review: the long-term effects of false-positive mammograms. Ann Intern Med. 2007;146(7):502–510. doi: 10.7326/0003-4819-146-7-200704030-00006.
- [5] Zagouri F, Sergentanis T, Gounaris A, et al. Pain in different methods of breast biopsy: emphasis on vacuum-assisted breast biopsy. Breast. 2008;17(1):71–75. doi: 10.1016/j.breast.2007.07.039.
- [6] Lang E, Berbaum K, Lutgendorf S. Large-core breast biopsy: abnormal salivary cortisol profiles associated with uncertainty of diagnosis. Radiology. 2009;250(3):631–637. doi: 10.1148/radiol.2503081087.
- [7] Nelson H, Tyne K, Naik A, et al. Screening for breast cancer: an update for the U.S. Preventive Services Task Force. Ann Intern Med. 2009;151(10):727–737. doi: 10.1059/0003-4819-151-10-200911170-00009.
- [8] Haas BM, Kalra V, Geisel J, et al. Comparison of tomosynthesis plus digital mammography and digital mammography alone for breast cancer screening. Radiology. 2013;269(3):694–700. doi: 10.1148/radiol.13130307.
- [9] Skaane P, Bandos A, Gullien R, et al. Comparison of digital mammography alone and digital mammography plus tomosynthesis in a population-based screening program. Radiology. 2013;267(1):47–56. doi: 10.1148/radiol.12121373.
- [10] Food & Drug Administration (FDA). Selenia Dimensions 3D System - P080003. 2011.
- [11] Evans D, Howell A. Breast cancer risk-assessment models. Breast Cancer Res. 2007;9(5):213. doi: 10.1186/bcr1750.
- [12] Zou K, Yu C, Liu K, et al. Optimal thresholds by maximizing or minimizing various metrics via ROC-type analysis. Acad Radiol. 2013;20(7):807–815. doi: 10.1016/j.acra.2013.02.004.
- [13] Youden W. Index for rating diagnosis tests. Cancer. 1950;3(1):32–35. doi: 10.1002/1097-0142(1950)3:1<32::aid-cncr2820030106>3.0.co;2-3.
- [14] Halpern E, Albert M, Krieger A, et al. Comparison of receiver operating characteristic curves on the basis of optimal operating points. Acad Radiol. 1996;3(3):245–253. doi: 10.1016/s1076-6332(96)80451-x.
- [15] Cantor S, Sun C, Tortolero-Luna G, et al. A comparison of C/B ratios from studies using receiver operating characteristic curve analysis. J Clin Epidemiol. 1999;52(9):885–892. doi: 10.1016/s0895-4356(99)00075-x.
- [16] Varas X, Leborgne J, Leborgne F, et al. Revisiting the mammographic follow-up of BI-RADS category 3 lesions. AJR Am J Roentgenol. 2002;179(3):691–695. doi: 10.2214/ajr.179.3.1790691.
- [17] American College of Radiology. Breast Imaging Reporting and Data System (BI-RADS) atlas. Reston, Va: 2003.
- [18] Rafferty E, Park J, Philpotts L, et al. Assessing radiologist performance using combined digital mammography and breast tomosynthesis compared with digital mammography alone: results of a multicenter, multireader trial. Radiology. 2013;266(1):104–113. doi: 10.1148/radiol.12120674.
- [19] Eng J. Receiver operating characteristic analysis: utility, reality, covariates, and the future. Acad Radiol. 2013;20:795–797. doi: 10.1016/j.acra.2013.05.001.
- [20] Metz C. Basic principles of ROC analysis. Semin Nucl Med. 1978;8:283–298. doi: 10.1016/s0001-2998(78)80014-2.
- [21] Obuchowski N. ROC analysis. AJR Am J Roentgenol. 2005;184:364–372. doi: 10.2214/ajr.184.2.01840364.
- [22] Sox H, Blatt M, Higgins M, et al. Medical Decision Making. Butterworth-Heinemann; Philadelphia: 1988.
- [23] Metz C, Herman B, Shen J. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Stat Med. 1998;17:1033–1053. doi: 10.1002/(sici)1097-0258(19980515)17:9<1033::aid-sim784>3.0.co;2-z.
- [24] Schousboe J, Kerlikowske K, Loh A, et al. Personalizing mammography by breast density and other risk factors for breast cancer: analysis of health benefits and cost-effectiveness. Ann Intern Med. 2011;155(1):10–20. doi: 10.7326/0003-4819-155-1-201107050-00003.
- [25] Brett J, Bankhead C, Henderson B, et al. The psychological impact of mammographic screening. A systematic review. Psychooncology. 2005;14(11):917–938. doi: 10.1002/pon.904.
- [26] Fryback D, Stout N, Rosenberg M, et al. The Wisconsin breast cancer epidemiology simulation model. J Natl Cancer Inst Monogr. 2006;36:37–47. doi: 10.1093/jncimonographs/lgj007.
- [27] Gloeckler R, Reichman M, Lewis D, et al. Cancer survival and incidence from the Surveillance, Epidemiology, and End Results (SEER) program. Oncologist. 2003;8(6):541–552. doi: 10.1634/theoncologist.8-6-541.
- [28] Sickles E, Miglioretti D, Ballard-Barbash R, et al. Performance benchmarks for diagnostic mammography. Radiology. 2005;235(3):775–790. doi: 10.1148/radiol.2353040738.
- [29] Mandelblatt J, Cronin K, Bailey S, et al. Effects of mammography screening under different screening schedules: model estimates of potential benefits and harms. Ann Intern Med. 2009;151(10):738–747. doi: 10.1059/0003-4819-151-10-200911170-00010.
- [30] van Ravesteyn N, Miglioretti D, Stout N, et al. Tipping the balance of benefits and harms to favor screening mammography starting at age 40 years: a comparative modeling study of risk. Ann Intern Med. 2012;156(9):609–617. doi: 10.1059/0003-4819-156-9-201205010-00002.
- [31] Schell M, Yankaskas B, Ballard-Barbash R, et al. Evidence-based target recall rates for screening mammography. Radiology. 2007;243(3):681–689. doi: 10.1148/radiol.2433060372.
- [32] Yankaskas B, Cleveland R, Schell M, et al. Association of Recall Rates with Sensitivity and Positive Predictive Values of Screening Mammography. AJR Am J Roentgenol. 2001;177(3):543–549. doi: 10.2214/ajr.177.3.1770543.
- [33] Abbey C, Eckstein M, Boone J. Estimating the relative utility of screening mammography. Medical Decision Making. 2013;33:510–520. doi: 10.1177/0272989X12470756.
- [34] Abbey C, Samuelson F, Gallas B, et al. Statistical properties of a utility measure of observer performance compared to area under the ROC curve. Proc of SPIE. 2013;8673.
- [35] Obuchowski N, Beiden S, Berbaum K, et al. Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods. Acad Radiol. 2004;11(9):980–995. doi: 10.1016/j.acra.2004.04.014.
- [36] Wagner R, Beam C, Beiden S. Reader variability in mammography and its implications for expected utility over the population of readers and cases. Medical Decision Making. 2004;24:561–572. doi: 10.1177/0272989X04271043.


