Medical Physics. 2015 Jul 31;42(8):4987–4996. doi: 10.1118/1.4927260

External validation of a publicly available computer assisted diagnostic tool for mammographic mass lesions with two high prevalence research datasets

Matthias Benndorf 1,a), Elizabeth S Burnside 2, Christoph Herda 3, Mathias Langer 4, Elmar Kotter 4
PMCID: PMC4570488  PMID: 26233224

Abstract

Purpose:

Lesions detected at mammography are described with a highly standardized terminology: the breast imaging-reporting and data system (BI-RADS) lexicon. Up to now, no validated semantic computer assisted classification algorithm exists to interactively link combinations of morphological descriptors from the lexicon to a probabilistic risk estimate of malignancy. The authors therefore aim at the external validation of the mammographic mass diagnosis (MMassDx) algorithm. A classification algorithm like MMassDx must perform well in a variety of clinical circumstances and in datasets that were not used to generate the algorithm in order to ultimately become accepted in clinical routine.

Methods:

The MMassDx algorithm uses a naïve Bayes network and calculates post-test probabilities of malignancy based on two distinct sets of variables, (a) BI-RADS descriptors and age (“descriptor model”) and (b) BI-RADS descriptors, age, and BI-RADS assessment categories (“inclusive model”). The authors evaluate both the MMassDx (descriptor) and MMassDx (inclusive) models using two large publicly available datasets of mammographic mass lesions: the digital database for screening mammography (DDSM) dataset, which contains two subsets from the same examinations—a medio–lateral oblique (MLO) view and cranio–caudal (CC) view dataset—and the mammographic mass (MM) dataset. The DDSM contains 1220 mass lesions and the MM dataset contains 961 mass lesions. The authors evaluate discriminative performance using area under the receiver-operating-characteristic curve (AUC) and compare this to the BI-RADS assessment categories alone (i.e., the clinical performance) using the DeLong method. The authors also evaluate whether assigned probabilistic risk estimates reflect the lesions’ true risk of malignancy using calibration curves.

Results:

The authors demonstrate that the MMassDx algorithms show good discriminatory performance. AUC for the MMassDx (descriptor) model in the DDSM data is 0.876/0.895 (MLO/CC view) and AUC for the MMassDx (inclusive) model in the DDSM data is 0.891/0.900 (MLO/CC view). AUC for the MMassDx (descriptor) model in the MM data is 0.862 and AUC for the MMassDx (inclusive) model in the MM data is 0.900. In all scenarios, MMassDx performs significantly better than clinical performance, P < 0.05 each. The authors furthermore demonstrate that the MMassDx algorithm systematically underestimates the risk of malignancy in the DDSM and MM datasets, especially when low probabilities of malignancy are assigned.

Conclusions:

The authors’ results reveal that the MMassDx algorithms have good discriminatory performance but less accurate calibration when tested on two independent validation datasets. Improvement in calibration and testing in a prospective clinical population will be important steps in the pursuit of translation of these algorithms to the clinic.

Keywords: mammography, CADx, Bayesian statistics, BI-RADS

1. INTRODUCTION

Computer assisted diagnostic (CADx) algorithms for mammography interpretation have received substantial attention in the past. Generally, these algorithms can be divided into those that employ imaging features that are automatically extracted from the mammograms and those that employ descriptors that are assessed by radiologists; the latter can be referred to as semantic CADx. Elter and Horsch provide a comprehensive review of available mammography CADx algorithms.1

Many promising semantic CADx algorithms with good to very good diagnostic performance have been proposed; the statistical techniques employed include artificial neural networks,2,3 Bayesian networks,4–6 decision trees,7 and logistic regression.8,9 However, before it is acceptable to actually apply a semantic CADx algorithm in clinical routine, an external validation of the algorithm’s diagnostic performance is mandatory.10,11 External validation is defined as evaluation of the performance of a classification algorithm on data that were not used to generate the algorithm.11

Recently, a publicly available semantic CADx algorithm for mammographic masses (MMs) was published by Benndorf and colleagues;12 we refer to the algorithm as mammographic mass diagnosis (“MMassDx”). The MMassDx algorithm takes into account the qualitative breast imaging-reporting and data system (BI-RADS) mass lesion descriptors, patient age, and optionally BI-RADS assessment categories. The MMassDx algorithm can be accessed at www.ebm-radiology.com/nbmm/index.html; the algorithm implements a naïve Bayes network to derive diagnostic decisions. The MMassDx algorithm showed good diagnostic performance in the development study; the area under the curve was 0.935 with BI-RADS assessment categories included and 0.876 without assessment categories included. A validation with different cases from the same practice was performed by the authors in their study.

First, to demonstrate robustness of the MMassDx algorithm, external validation should also be performed on patients from different practices.11 Second, it is worthwhile to investigate how the MMassDx algorithm performs in a variety of clinical circumstances to identify potential areas of use and potential limitations of the approach. We therefore aim at the external validation of the MMassDx algorithm proposed by Benndorf and colleagues.12 We process two large, publicly available datasets of mammographic mass lesions7,13–15 and apply the MMassDx algorithm to them. The two datasets come from different practices and are thus independent from the data used to train the original MMassDx algorithm. Confirmation of high performance on these two datasets (i.e., external validation) would be a step toward demonstrating potential clinical efficacy of the algorithm.

2. MATERIALS AND METHODS

2.A. The BI-RADS lexicon

The BI-RADS offers a standardized terminology for the description of lesions detected on x-ray mammography.16 BI-RADS was established in 1992 to facilitate an unequivocal communication between radiologists, even at different practicing sites, and referring clinicians.17 The BI-RADS lexicon has been refined over more than 20 yr since then, and currently, it is in its fifth edition. In 1999, the Mammography Quality Standards Act (MQSA) made it mandatory by law that every mammographic report in the United States includes the language of the BI-RADS for final assessment.18

In case of mass lesions, the following descriptors are assessed according to BI-RADS: shape (round, oval, lobulated, or irregular), margin (circumscribed, microlobulated, obscured, ill-defined, or spiculated), and density (fat-dense, low, isodense, or high). Furthermore, each lesion is assigned an assessment category ranging from 0 to 6. This assessment category can be regarded as a probabilistic stratification of the risk of malignancy of the described lesion. BI-RADS 0 denotes a lesion for which further work-up is required and BI-RADS 1 denotes a mammography without a detected lesion. BI-RADS 2 assigns a risk of malignancy of 0%, BI-RADS 3 of <2%, BI-RADS 4 of 2%–95%, and BI-RADS 5 assigns a risk of >95% (almost certainly malignant). BI-RADS 6 denotes lesions which are histopathologically proven malignant. For pictorial examples of the BI-RADS descriptors, refer to the paper by Balleyguier and colleagues.19

Up to now, no guidelines exist to link certain combinations of BI-RADS descriptors to certain BI-RADS assessment categories. This is why multivariate prediction models, like the MMassDx algorithm investigated herein, could help to quantify the risk and thus help to more accurately make further management decisions.

2.B. The classifier at ebm-radiology.com

The development and testing of the MMassDx algorithm are detailed in the paper by Benndorf and colleagues.12 The MMassDx algorithm can be accessed at www.ebm-radiology.com/nbmm/index.html. At this homepage, an interface is presented that allows the radiologist to enter a combination of observed BI-RADS mass lesion descriptors, patient age, and optionally a BI-RADS assessment category. The following notation will be used throughout the paper:

  • The algorithm that takes into account age and BI-RADS descriptors is referred to as “descriptor model.”

  • The algorithm that takes into account age, BI-RADS descriptors, and BI-RADS assessment categories is referred to as “inclusive model.”

  • The diagnostic performance of the BI-RADS assessment categories alone is referred to as “clinical performance.”

The MMassDx algorithm is based upon a naïve Bayes network. For each mammographic mass lesion, the network calculates a post-test probability of malignancy based on the values of the entered predictive variables. The post-test probability is calculated with Bayes’ theorem and assumed conditional independence of the single predictive variables. The MMassDx algorithm automatically bins the calculated post-test probability into a risk group analogously to the BI-RADS assessment categories. For this step, the method proposed by Zadrozny and Elkan is used.20 Figure 1 illustrates the binning technique used. We highlight that the calibration technique is applied to overcome a shortcoming of the naïve Bayes in assignment of the actual probabilities of malignancy.21 For detailed information about Bayesian networks and especially naïve Bayes networks, refer to Refs. 22 and 23.
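The post-test probability calculation described above can be illustrated with a minimal Python sketch. It applies Bayes' theorem under the conditional independence assumption. The conditional probabilities in `LIKELIHOODS` and the function name `post_test_probability` are invented placeholders for illustration and are not the parameters or code of MMassDx; only the default pretest probability of 10.8% is the value reported for the training data of the development study.

```python
from math import prod

# Hypothetical P(descriptor value | class); NOT the MMassDx parameters.
LIKELIHOODS = {
    "shape": {
        "irregular": {"malignant": 0.60, "benign": 0.10},
        "oval": {"malignant": 0.10, "benign": 0.50},
    },
    "margin": {
        "spiculated": {"malignant": 0.50, "benign": 0.05},
        "circumscribed": {"malignant": 0.05, "benign": 0.60},
    },
}


def post_test_probability(findings, pretest=0.108):
    """Return P(malignant | findings) for a naive Bayes model.

    `findings` maps a variable name to its observed value. Variables are
    treated as conditionally independent given the class, so their
    likelihoods are simply multiplied.
    """
    p_mal = pretest * prod(
        LIKELIHOODS[var][val]["malignant"] for var, val in findings.items()
    )
    p_ben = (1 - pretest) * prod(
        LIKELIHOODS[var][val]["benign"] for var, val in findings.items()
    )
    # Normalize so that the two class probabilities sum to one.
    return p_mal / (p_mal + p_ben)


suspicious = post_test_probability({"shape": "irregular", "margin": "spiculated"})
reassuring = post_test_probability({"shape": "oval", "margin": "circumscribed"})
```

With no findings entered, the function returns the pretest probability unchanged, which is the expected behavior of Bayes' theorem without evidence.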

FIG. 1.


Binning method for naïve Bayes classifiers by Zadrozny and Elkan (Ref. 20). For example, 15 instances are regarded with classes Y (=yes) and N (=no). Step (1): the naïve Bayes classifier calculates a probability of malignancy for each instance. Step (2): the instances are ordered according to the calculated probability, starting with the lowest probability. Step (3): three equally sized bins are formed, with 5 members each. The probability of malignancy in each bin is the number of Y-instances/all instances in this bin. New instances classified by the naïve Bayes classifier can be sorted into one of the bins, and the probability of malignancy assigned is the probability from step 3.
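The three steps in the caption of Fig. 1 can be sketched in a few lines of Python. This is a simplified illustration of the binning idea, not the implementation used by MMassDx: the function names, the toy scores and labels, and the rule that a new instance falls into the first bin whose upper score bound covers it are all assumptions made for the example.

```python
def fit_bins(scores, labels, n_bins=3):
    """Sort instances by score, split them into equally sized bins, and
    record each bin's upper score bound and observed positive rate."""
    ranked = sorted(zip(scores, labels))
    size = len(ranked) // n_bins
    bins = []
    for b in range(n_bins):
        start = b * size
        # The last bin absorbs any remainder when the split is uneven.
        chunk = ranked[start:] if b == n_bins - 1 else ranked[start:start + size]
        rate = sum(label for _, label in chunk) / len(chunk)
        bins.append((chunk[-1][0], rate))
    return bins


def calibrated_probability(score, bins):
    """Return the observed rate of the first bin whose upper bound covers
    the score; scores above all bounds fall into the last bin."""
    for upper, rate in bins:
        if score <= upper:
            return rate
    return bins[-1][1]


# Toy data: 15 instances, label 1 = Y (malignant), 0 = N (benign).
toy_scores = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.35,
              0.40, 0.45, 0.60, 0.70, 0.80, 0.90, 0.95]
toy_labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1]
toy_bins = fit_bins(toy_scores, toy_labels, n_bins=3)
```

The calibrated output is no longer the raw naïve Bayes score but the empirical rate of Y-instances in the bin, which is why the resulting estimate can only take as many distinct values as there are bins.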

2.C. The external validation datasets

We first process the data from the digital database for screening mammography (DDSM) to be used as a validation dataset for the MMassDx algorithm.15 The DDSM is publicly available at http://marathon.csee.usf.edu/Mammography/Database.html. It contains information on 1220 mammographic mass lesions, described in either medio–lateral oblique (MLO) or cranio–caudal (CC) view, or described in both views. For each lesion, morphological BI-RADS descriptors (shape and margin) and patient age are given. Information regarding the DDSM can be found in Ref. 15, and we process the database as provided in Ref. 14. The data there are provided as a tab-delimited text file, the content is identical to the source homepage, and processing was performed with R 2.15.3. The dataset allows for different descriptors in the MLO and CC views. We thus generate a dataset for each view, referred to as “DDSM MLO” and “DDSM CC.”

Second, we process the mammographic mass dataset (referred to as “MM”) provided by Elter and colleagues7 at the University of California, Irvine, Machine Learning Repository,13 http://archive.ics.uci.edu/ml/datasets/Mammographic+Mass, which contains information on 961 mammographic mass lesions. For each lesion, morphological BI-RADS descriptors and patient age are given. Table I gives the characteristics of the two validation datasets and compares them with the dataset that was used in the original publication by Benndorf and colleagues.

TABLE I. Comparison of the original dataset for the classifier at www.ebm-radiology.com/nbmm/index.html with the datasets for the current external validation.

Characteristic | Original dataset, Benndorf et al. | Mammographic mass (MM) dataset, Elter et al. (Ref. 7) | DDSM MLO,a Heath et al. (Ref. 15) | DDSM CC,a Heath et al. (Ref. 15)
Data acquired between | 2005–2011 | 2003–2006 | 1988–1999 | 1988–1999
n proven mass lesions | 9629 | 961 | 1212 | 1159
Contains missing values | Yes | Yes | Yes | Yes
n mass lesions with complete descriptors | 2453b (25.5%) | 821 (85.4%) | 1108 (91.4%) | 1054 (90.9%)
n malignant (%) | 313 (12.8%) | 396 (48.2%) | 562 (50.7%) | 519 (49.2%)
Multiple descriptors allowedc | Yes | No | Yes | Yes
n with multiple descriptors for mass shape | 0 | 0 | 6 | 7
n with multiple descriptors for mass margin | 95 | 0 | 78 | 75
n with multiple descriptors for mass density | 0 | 0 | Not applicable, density not assessed | Not applicable, density not assessed

a In the DDSM database, separate information is given for MLO view and CC view mammographies. We process both views as separate datasets, since descriptors are allowed to differ between the two views for a given lesion.

b For classifier construction, these data are split into training data and validation data. Pretest probability in the training data was 10.8%.

c For cases with multiple descriptors, the descriptor most suspicious for malignancy is chosen for application of the MMassDx algorithm.
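The rule in footnote c of Table I (choose the most suspicious descriptor when several are given) can be sketched as follows. The ranking in `MARGIN_SUSPICION` is an assumption made for this illustration; the source does not state which ordering of suspicion was used, and the function name `most_suspicious` is hypothetical.

```python
# Assumed ordering from least to most suspicious for malignancy.
# The paper does not specify the ranking that was actually applied.
MARGIN_SUSPICION = [
    "circumscribed",
    "obscured",
    "microlobulated",
    "ill-defined",
    "spiculated",
]


def most_suspicious(descriptors, ranking=MARGIN_SUSPICION):
    """Return the descriptor that ranks highest in the given suspicion order."""
    return max(descriptors, key=ranking.index)


chosen = most_suspicious(["obscured", "spiculated"])
```

The same helper could be reused for mass shape by passing a different `ranking` list.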

2.D. Statistical analysis

We apply the MMassDx algorithm to the DDSM MLO, DDSM CC, and MM datasets. For each lesion, the MMassDx algorithm reports (a) a post-test probability and (b) a diagnostic risk group as explained in Ref. 12. We apply both the descriptor model and the inclusive model to the datasets. For probability calculations, the pretest probability was set to 10.8%, i.e., the pretest probability in the training data in the work by Benndorf and colleagues.12

2.D.1. Analysis of classifier discrimination

We perform receiver-operating-characteristic (ROC) analysis with the calculated post-test probabilities and the obtained diagnostic risk groups. We employ the area under the ROC curve (AUC) as a measure for discriminative performance.24 We compare the AUC of the ROC curves from the MMassDx algorithm with the AUC of the BI-RADS assessment categories (i.e., the clinical performance) in the respective validation dataset using the method developed by DeLong.25 We consider a P-value <0.05 to denote a statistically significant difference.
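For readers who want to reproduce this kind of comparison, the following pure-Python sketch computes the empirical AUC and a two-sided DeLong test for two correlated ROC curves measured on the same cases. It is not the code used in the study; the function names and the toy scores are invented for illustration, and the quadratic-time formulation is chosen for readability rather than speed.

```python
from math import erfc, sqrt


def _psi(x, y):
    """Kernel of the Mann-Whitney statistic: 1 if x > y, 0.5 on ties, else 0."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _components(scores, labels):
    """Return the empirical AUC and DeLong's structural components."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    v10 = [sum(_psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(_psi(x, y) for x in pos) / len(pos) for y in neg]
    return sum(v10) / len(v10), v10, v01


def auc(scores, labels):
    """Empirical area under the ROC curve (label 1 = malignant)."""
    return _components(scores, labels)[0]


def _cov(a, mean_a, b, mean_b):
    """Sample covariance of two equally long component vectors."""
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(scores_a, scores_b, labels):
    """Return (AUC_a, AUC_b, two-sided P-value) for paired ROC curves."""
    auc_a, v10_a, v01_a = _components(scores_a, labels)
    auc_b, v10_b, v01_b = _components(scores_b, labels)
    m, n = len(v10_a), len(v01_a)  # numbers of malignant and benign cases
    s10 = (_cov(v10_a, auc_a, v10_a, auc_a) + _cov(v10_b, auc_b, v10_b, auc_b)
           - 2 * _cov(v10_a, auc_a, v10_b, auc_b))
    s01 = (_cov(v01_a, auc_a, v01_a, auc_a) + _cov(v01_b, auc_b, v01_b, auc_b)
           - 2 * _cov(v01_a, auc_a, v01_b, auc_b))
    var = s10 / m + s01 / n  # variance of the AUC difference
    if var <= 0:  # degenerate case: no estimated variability of the difference
        return auc_a, auc_b, (1.0 if auc_a == auc_b else 0.0)
    z = (auc_a - auc_b) / sqrt(var)
    return auc_a, auc_b, erfc(abs(z) / sqrt(2))


# Invented toy data: model probabilities versus ordinal clinical ratings.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
clinical_scores = [0, 1, 1, 3, 1, 3, 3, 4]
auc_model, auc_clinical, p_value = delong_test(model_scores, clinical_scores, labels)
```

Because both sets of scores come from the same lesions, the covariance terms matter; treating the two AUC values as independent would generally misstate the variance of their difference.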

Figure 2 provides a diagram of the analyses performed with numbers given to the consecutive steps. These numbers will be used throughout the remaining paper for referral to the scenarios investigated. For example, scenario 2c is the application of the MMassDx (inclusive) algorithm to the DDSM CC dataset. Scenario 2d stratifies the resulting post-test probabilities from 2c into the diagnostic groups as described in Ref. 12, i.e., the results are calibrated.

FIG. 2.


External validation of the MMassDx algorithm at www.ebm-radiology.com/nbmm/index.html. DDSM: digital database for screening mammography (Ref. 15), MM: mammographic mass dataset (Ref. 7). (A) MMassDx algorithm without BI-RADS assessment categories used. (B) MMassDx algorithm with BI-RADS assessment categories used. In each numbered scenario, the MMassDx algorithm (descriptor, inclusive) is applied to the respective dataset. First, post-test probabilities of malignancy are derived (e.g., scenario 2c), which are then calibrated into diagnostic groups (scenario 2d). ROC analyses are performed for all scenarios. The numbers 1a–2f will be used throughout the remaining paper.

Discriminative performance of the BI-RADS assessment categories alone, i.e., the clinical performance, also was assessed with ROC analysis. As detailed in Sec. 2.A of the paper, the BI-RADS assessment categories can be regarded as a risk stratification tool themselves. For ROC analysis, the assessment categories were considered an ordinal variable with sequence 2, 3, 0, 4, and 5.
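The ordinal treatment of the assessment categories can be made explicit with a small Python sketch. The sequence 2, 3, 0, 4, 5 is the one stated above; the names `BIRADS_ORDER` and `birads_to_rank` are hypothetical and only serve the illustration.

```python
# Ordinal sequence used for ROC analysis: 2 < 3 < 0 < 4 < 5.
BIRADS_ORDER = [2, 3, 0, 4, 5]
BIRADS_RANK = {category: rank for rank, category in enumerate(BIRADS_ORDER)}


def birads_to_rank(categories):
    """Map BI-RADS assessment categories to ranks usable as ROC scores."""
    return [BIRADS_RANK[c] for c in categories]


ranks = birads_to_rank([0, 2, 5, 3, 4])
```

Note that category 0 is placed between 3 and 4 rather than at the bottom of the scale, so sorting by the raw category number would give a different ROC curve.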

2.D.2. Analysis of classifier calibration

The MMassDx algorithm at www.ebm-radiology.com/nbmm/index.html returns a probabilistic estimate of how likely the lesion is to be malignant. Since the management of mammographic mass lesions depends on this risk (i.e., benign lesions will undergo routine screening follow-up, probably benign lesions will undergo short-term follow-up, and suspicious lesions will undergo tissue sampling), we are also interested in how accurate the risk estimates of the classifier are compared to the actual risk of the lesions considered.

For this purpose, we employ the technique of calibration plots.10,26 For scenarios 1b, 1d, 1f, 2b, 2d, and 2f in Fig. 2, we plot on the x-axis the predicted probability of the pre-specified bin as defined in Table 3 in Ref. 12, and on the y-axis the percentage of lesions in the allocated bin that actually proved to be malignant. In an ideally calibrated classifier, the predicted and actual risks should match. If the classifier underestimates the risk of malignancy, the resulting points will be above the identity line. If the classifier overestimates the risk, the resulting points will be below the identity line (see Fig. 3 for an illustration of this concept).

FIG. 3.


Principle of a calibration plot: on the x-axis, the predicted probabilities for the pre-specified bins derived by classifier calibration (Ref. 20) are given. On the y-axis, the corresponding rate of outcomes among all cases that were assigned a certain bin is plotted. A perfectly calibrated classifier results in points that lie on the identity line. The areas for under- and overestimation of the probability for the outcome are highlighted in the plot.
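The points of such a calibration plot can be computed with a short Python sketch. The predicted bin probabilities and outcomes below are invented toy values, not data from the study, and the function names are hypothetical.

```python
def calibration_points(predicted, outcomes):
    """Group lesions by the predicted probability of their bin and return
    (predicted, observed malignancy rate, n) per bin, sorted by prediction."""
    groups = {}
    for p, y in zip(predicted, outcomes):
        groups.setdefault(p, []).append(y)
    return [(p, sum(ys) / len(ys), len(ys)) for p, ys in sorted(groups.items())]


def position(predicted, observed):
    """Classify a calibration point relative to the identity line."""
    if observed > predicted:
        return "underestimated"  # point lies above the identity line
    if observed < predicted:
        return "overestimated"  # point lies below the identity line
    return "calibrated"


# Invented example: bin probability assigned to each lesion, and outcome (1 = malignant).
toy_predicted = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.9, 0.9]
toy_outcomes = [0, 1, 0, 0, 1, 0, 1, 1]
points = calibration_points(toy_predicted, toy_outcomes)
verdicts = [position(p, obs) for p, obs, _ in points]
```

Plotting the first element of each tuple on the x-axis against the second on the y-axis, together with the identity line, gives the kind of plot described in Fig. 3.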

3. RESULTS

Table II lists the results of the ROC analyses for the different scenarios as defined in Fig. 2. With the exception of scenario 1f (MM dataset, calibrated classifier), all applications of the MMassDx algorithm perform better than the clinical performance alone (P < 0.05 for each comparison). In all investigated scenarios, the MMassDx algorithm shows good discriminative performance with AUC values >0.80.

TABLE II. Performance of the MMassDx algorithm at www.ebm-radiology.com/nbmm/index.html when applied to the DDSM MLO, DDSM CC, and MM datasets.

Model | External dataset | Analysis according to Fig. 2 | MMassDx area under the ROC curve (AUC) | P-value in comparison to clinical performance
Descriptor model | Benndorf 2015a | n.a. | 0.876 | 0.799
Descriptor model | DDSM MLO | 1a | 0.876b | <0.001
Descriptor model | DDSM MLO | 1b | 0.838b | <0.01
Descriptor model | DDSM CC | 1c | 0.895b | <0.001
Descriptor model | DDSM CC | 1d | 0.856b | <0.01
Descriptor model | MM | 1e | 0.862b | 0.019
Descriptor model | MM | 1f | 0.842 | 0.294
Inclusive model | Benndorf 2015a | n.a. | 0.935b | <0.001
Inclusive model | DDSM MLO | 2a | 0.891b | <0.001
Inclusive model | DDSM MLO | 2b | 0.819b | <0.01
Inclusive model | DDSM CC | 2c | 0.900b | <0.001
Inclusive model | DDSM CC | 2d | 0.832b | <0.01
Inclusive model | MM | 2e | 0.900b | <0.001
Inclusive model | MM | 2f | 0.868b | 0.001

a Classifier performance from Ref. 12 added for comparison.

b Denotes statistically significant difference from the performance of the BI-RADS assessment categories alone, i.e., the clinical performance. No statistically significant difference was found in the clinical performance between DDSM MLO, DDSM CC, and MM (P > 0.05).

Figure 4 exhaustively provides the diagnostic performance of the MMassDx algorithm in the datasets analyzed (ROC curves). Again, notation of the investigated scenarios follows Fig. 2. Figure 5 provides the information about classifier calibration in the scenarios where applicable. The calibration points do not follow the identity line, and the MMassDx algorithm systematically underestimates the true probability of malignancy in the diagnostic bins. Multiple points result at the 0% risk of malignancy value (x-axis) because multiple bins with an estimated risk of 0% are defined in Ref. 12. These bins proved to have a higher cancer yield in the validation datasets.

FIG. 4.


Receiver-operating-characteristic (ROC) plots for the classifier at www.ebm-radiology.com/nbmm/index.html when applied to the DDSM MLO, DDSM CC, and MM datasets. The numbers 1a–2f follow the notation given in Fig. 2. In each diagnostic scenario, the ROC curves for the descriptor model, the inclusive model, and the BI-RADS assessment category alone (clinical performance) are given. Binormal fitting is performed for plotting.

FIG. 5.


Calibration plots for the classifier at www.ebm-radiology.com/nbmm/index.html when applied to the DDSM MLO, DDSM CC, and MM datasets. The numbers follow the notation given in Fig. 2. In each plot, the identity line is given, and a perfectly calibrated CADx algorithm would result in points that lie on this line.

To further investigate calibration behavior of the MMassDx algorithm, we provide in Table III a comparison of how many lesions comprised the diagnostic bins in the study by Benndorf and colleagues12 and the number of lesions in the bins in the DDSM and MM data. Here, we observe that bins with a low probability of malignancy are only marginally populated in comparison to the development study in the DDSM datasets, and in the MM dataset, bins 3–7 are only marginally populated.

TABLE III. Comparison of how many lesions are assigned to the diagnostic bins in the DDSM MLO, DDSM CC, and MM datasets in comparison to the development study (Ref. 12). Results are for the validated descriptor model (i.e., scenarios 1b, 1d, and 1f from Fig. 2). For comparison, the distribution in the validation data from the development study (Ref. 12) is given.

Bin | DDSM MLO (scenario 1b) | DDSM CC (scenario 1d) | MM (scenario 1f) | Benndorf (validation data)
1 | 0 (0%) | 0 (0%) | 163 (19.9%) | 82 (7.0%)
2 | 40 (3.6%) | 32 (3.0%) | 106 (12.9%) | 95 (8.1%)
3 | 23 (2.1%) | 20 (1.9%) | 10 (1.2%) | 196 (16.6%)
4 | 45 (4.1%) | 43 (4.1%) | 25 (3.0%) | 104 (8.8%)
5 | 79 (7.1%) | 74 (7.0%) | 25 (3.0%) | 143 (12.1%)
6 | 108 (9.7%) | 107 (10.2%) | 45 (5.5%) | 108 (9.2%)
7 | 62 (5.6%) | 60 (5.7%) | 24 (2.9%) | 94 (8.0%)
8 | 124 (11.2%) | 116 (11.0%) | 99 (12.1%) | 105 (8.9%)
9 | 219 (19.8%) | 216 (20.5%) | 220 (26.8%) | 127 (10.8%)
10 | 408 (36.8%) | 386 (36.6%) | 104 (12.7%) | 123 (10.5%)

4. DISCUSSION

Diagnostic performance of prediction algorithms has two separate dimensions: first, discrimination measures the ability of the algorithm to differentiate between lesions with different outcomes. Discriminatory performance can be visualized with ROC curves. Second, calibration measures the agreement between predicted and observed risks. Calibration performance can be visualized with calibration plots. It has been demanded that both discrimination and calibration of prediction algorithms should be examined in studies that evaluate prediction algorithms.11,27

In our study, we demonstrate that the MMassDx algorithm proposed at www.ebm-radiology.com/nbmm/index.html shows good discriminative performance when applied to two research datasets of mammographic mass lesions. The AUC values, ranging between 0.84 and 0.90 for the descriptor model and between 0.82 and 0.90 for the inclusive model (Table II), are in accordance with AUC values reported before for semantic mammographic CADx algorithms based on machine learning algorithms. The neural network approach by Baker and colleagues yielded an AUC of 0.89,3 the Bayesian network by Fischer and colleagues yielded an AUC of 0.88,5 and the tree-augmented Bayesian network by Burnside and colleagues yielded an AUC of 0.96.4 Elter and colleagues reported AUC values of 0.87 and 0.89 for their decision tree model and case-based learning algorithm, respectively.7

Second, we demonstrate that the MMassDx algorithm is not well calibrated when applied to the two validation datasets, especially in lesions for which a low probability of malignancy is calculated (compare with Fig. 5). Thus, even though the MMassDx algorithm performed well in terms of discrimination, its actual application to the datasets would not have resulted in adequate management for a substantial proportion of patients.

We believe that the reason for the insufficient calibration lies in the nature of the datasets. Compared with the dataset the classifier was developed with, the pretest probability of malignancy in the two validation datasets is much higher (approximately 50% versus 13% in the development data, compare with Table I). Furthermore, the DDSM and MM datasets do not come with detailed descriptions of how the data were collected; i.e., there is no information whether cases were consecutively observed mass lesions, consecutively histopathologically verified mass lesions, or deliberately sampled such that a case-control ratio of 1:1 was obtained. The population examined with a diagnostic test may affect diagnostic (discriminatory) performance of prediction algorithms, a phenomenon which is known as spectrum effect;28 we reason that classifier calibration might likewise be affected.

If we regard the pretest probability of malignancy as a surrogate parameter for the characterization of the study population, it is worthwhile to derive an estimate of the pretest probability in the population the MMassDx algorithm is developed for. As a premise, we take that the algorithm is applied to all mass lesions observed in a given period of time at a given practice. Obviously, the prevalence of disease in the screening population [around 1% (Refs. 29 and 30)] is too low, since in the majority of patients, no abnormality is detected at all at mammography screening [i.e., rating is BI-RADS 1 (Ref. 31)]. On the other hand, taking into account only lesions with histopathological verification will overestimate the pretest probability [estimates range between 30% and 50% (Refs. 32 and 33)], since clearly benign findings will only undergo biopsy in rare occasions. A reliable estimate for the pretest probability among consecutively observed mammographic mass lesions may be derived as follows.

Zonderland and colleagues34 report on a consecutive hospital population of patients referred to mammography (35% diagnostic, 65% screening). The distribution of BI-RADS assessment categories among their study population was 1,542 BI-RADS 1, 935 BI-RADS 2, 154 BI-RADS 3, 74 BI-RADS 4, and 57 BI-RADS 5. An explanation of the BI-RADS categories is given in Subsection 2.A of this paper. BI-RADS 1 ratings do not describe a lesion per definition, and we are thus left with 1220 cases with a described lesion. Of these 1220 cases, 115 had a malignant outcome. Therefore, the pretest probability in the study by Zonderland and colleagues was, given a lesion had been observed, 115/1220 = 9.4%. This value is close to the 10.8% in the development study of the MMassDx algorithm,12 indicating that the algorithm was developed under sensible clinical premises.
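The arithmetic behind this estimate can be checked with a few lines of Python. All counts are the ones quoted above from Zonderland and colleagues; the variable names are chosen for this illustration only.

```python
# Distribution of BI-RADS assessment categories as quoted in the text.
birads_counts = {1: 1542, 2: 935, 3: 154, 4: 74, 5: 57}

# BI-RADS 1 describes no lesion, so it is excluded from the denominator.
lesions = sum(n for category, n in birads_counts.items() if category != 1)
malignant = 115

pretest_probability = malignant / lesions
pretest_percent = round(100 * pretest_probability, 1)
```

The result, 115 of 1220 lesions or 9.4%, is the pretest probability among observed lesions rather than among all examined patients.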

We hypothesize that differences in the study populations, as indicated by the differing pretest probabilities, at least partly explain the insufficient calibration of the MMassDx algorithm. The decision to perform histopathological sampling of a lesion detected at mammography also depends on variables not considered by the MMassDx algorithm, e.g., the presence of dense fibroglandular tissue35 or a genetic risk profile to actually develop breast cancer.36 Since we do not know about case selection in the validation datasets, we cannot exclude the possibility that the presence of such risk factors influenced the case distribution in the validation datasets. A lesion that we would call “probably benign” in an overall low-risk patient might be rendered “suspicious” if we come to know that the patient has an increased lifetime risk of breast cancer >50%. This is one possible explanation of the high cancer yield in the low probability bins in the calibration curves in Fig. 5.

The visualization of classifier calibration in Fig. 5 demonstrates a positive result, though. If the MMassDx algorithm assigns a high probability of malignancy, then the lesion under consideration will likely be malignant. Underestimation of the probability here does not play a clinically significant role: a threshold of 2% of suspicion for malignancy is generally regarded as the threshold at which to perform biopsy.37 The assignment of a 40% risk of malignancy thus has the same clinical consequence as the assignment of a risk of 50%, for example.

Semantic CADx algorithms in general depend on the assessment of the employed descriptors by a physician (a radiologist in our case). Although the BI-RADS lexicon offers a highly standardized diagnostic approach,16 interobserver disagreement may lead to statistical noise introduced into classification aids using these descriptors. Kappa values for the assignment of mass lesion descriptors in mammography reading range between 0.48 (Ref. 38) and 0.60 (Ref. 39) in general, although lower values have been reported as well.40 For BI-RADS assessment categories, fair to moderate agreement has been reported,41 with formal training in lexicon usage improving agreement rates. Neither the DDSM nor the MM dataset comes with detailed information about the training of the involved readers or an analysis of interobserver disagreement. How the performance of different readers affects semantic CADx algorithms is a promising direction for future research.

We have not performed a dedicated analysis of the optimal classification threshold of the Bayesian classifier taking into account expected utility of the classification,42 but regarded the entire AUC as the parameter for discriminatory performance. The optimal classification threshold is known to be prevalence dependent.42,43 The definition of actual utilities (that is, costs and benefits for true positive, false negative, false positive, and true negative classifications) and the evaluation of classifier performance in different clinical scenarios (with differing pretest probability) are non-trivial tasks. Taking these issues into account is another possible topic for future research.

To sum up, from the results of our study, we infer that the MMassDx algorithm might be of clinical use in the future, since it links combinations of BI-RADS descriptors to probabilistic risk estimates and achieves a high discriminatory performance. If the MMassDx algorithm reports a high probability of malignancy, the estimate will be clinically meaningful. However, the MMassDx algorithm shows insufficiencies regarding calibration when it reports a low probability of malignancy. We recommend (and plan) to perform the logical next step of external validation: the application of the MMassDx algorithm to a consecutive clinical cohort of mammographic mass lesions. This experiment will show whether the insufficiencies in calibration are due to spectrum effects or a true insufficiency of the algorithm. Additionally, it should be investigated whether incorporation of further risk factors as predictive variables into the MMassDx algorithm can improve on the discriminatory and calibration performance.

ACKNOWLEDGMENT

M. Benndorf received a grant from the DFG (Deutsche Forschungsgemeinschaft, BE5747/1-1) for conducting experiments in Madison, WI. E. Burnside acknowledges the support of the National Institutes of Health (grants: R01CA165229, R01LM011028).

There is no conflict of interest to declare.

REFERENCES

  • 1. Elter M. and Horsch A., “CADx of mammographic masses and clustered microcalcifications: A review,” Med. Phys. 36, 2052–2068 (2009). doi: 10.1118/1.3121511
  • 2. Wu Y., Giger M. L., Doi K., Vyborny C. J., Schmidt R. A., and Metz C. E., “Artificial neural networks in mammography: Application to decision making in the diagnosis of breast cancer,” Radiology 187, 81–87 (1993). doi: 10.1148/radiology.187.1.8451441
  • 3. Baker J. A., Kornguth P. J., Lo J. Y., Williford M. E., and Floyd C. E., “Breast cancer: Prediction with artificial neural network based on BI-RADS standardized lexicon,” Radiology 196, 817–822 (1995). doi: 10.1148/radiology.196.3.7644649
  • 4. Burnside E. S., Davis J., Chhatwal J., Alagoz O., Lindstrom M. J., Geller B. M., Littenberg B., Shaffer K. A., Kahn C. E., and Page C. D., “Probabilistic computer model developed from clinical data in national mammography database format to classify mammographic findings,” Radiology 251, 663–672 (2009). doi: 10.1148/radiol.2513081346
  • 5. Fischer E., Lo J., and Markey M., “Bayesian networks of BI-RADS descriptors for breast lesion classification,” Conf. Proc. IEEE Eng. Med. Biol. Soc. 4, 3031–3034 (2004). doi: 10.1109/IEMBS.2004.1403858
  • 6. Kahn C. E., Roberts L. M., Shaffer K. A., and Haddawy P., “Construction of a Bayesian network for mammographic diagnosis of breast cancer,” Comput. Biol. Med. 27, 19–29 (1997). doi: 10.1016/S0010-4825(96)00039-X
  • 7. Elter M., Schulz-Wendtland R., and Wittenberg T., “The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process,” Med. Phys. 34, 4164–4172 (2007). doi: 10.1118/1.2786864
  • 8. Timmers J., Verbeek A., IntHout J., Pijnappel R., Broeders M., and den Heeten G., “Breast cancer risk prediction model: A nomogram based on common mammographic screening findings,” Eur. Radiol. 23, 2413–2419 (2013). doi: 10.1007/s00330-013-2836-8
  • 9. Chhatwal J., Alagoz O., Lindstrom M. J., Kahn C. E., Shaffer K. A., and Burnside E. S., “A logistic regression model based on the national mammography database format to aid breast cancer diagnosis,” Am. J. Roentgenol. 192, 1117–1127 (2009). doi: 10.2214/AJR.07.3345
  • 10. Altman D. G., Vergouwe Y., Royston P., and Moons K. G., “Prognosis and prognostic research: Validating a prognostic model,” Br. Med. J. 338, b605 (2009). doi: 10.1136/bmj.b605
  • 11. Collins G. S., de Groot J. A., Dutton S., Omar O., Shanyinde M., Tajar A., Voysey M., Wharton R., Yu L.-M., Moons K. G., and Altman D. G., “External validation of multivariable prediction models: A systematic review of methodological conduct and reporting,” BMC Med. Res. Methodol. 14, 40 (2014). doi: 10.1186/1471-2288-14-40
  • 12. Benndorf M., Kotter E., Langer M., Herda C., Wu Y., and Burnside E., “Development of an online, publicly accessible naive Bayesian decision support tool for mammographic mass lesions based on the American College of Radiology (ACR) BI-RADS lexicon,” Eur. Radiol. 25, 1768–1775 (2015). doi: 10.1007/s00330-014-3570-6
  • 13. Bache K. and Lichman M., UCI Machine Learning Repository (University of California, School of Information and Computer Science, Irvine, CA, 2013), http://archive.ics.uci.edu/ml.
  • 14. Benndorf M., Herda C., Langer M., and Kotter E., “Provision of the DDSM mammography metadata in an accessible format,” Med. Phys. 41, 051902 (3pp.) (2014). doi: 10.1118/1.4870379
  • 15. Heath M., Bowyer K., Kopans D., Moore R., and Kegelmeyer P., “The digital database for screening mammography,” in Proceedings of the 5th International Workshop on Digital Mammography, edited by Yaffe M. J. (Medical Physics Publishing, 2001), pp. 212–218.
  • 16. Sickles E. A. et al., ACR BI-RADS® Mammography, in ACR BI-RADS® Atlas, Breast Imaging Reporting and Data System (American College of Radiology, Reston, VA, 2013).
  • 17. Burnside E. S., Sickles E. A., Bassett L. W., Rubin D. L., Lee C. H., Ikeda D. M., Mendelson E. B., Wilcox P. A., Butler P. F., and D’Orsi C. J., “The ACR BI-RADS experience: Learning from history,” J. Am. Coll. Radiol. 6, 851–860 (2009). doi: 10.1016/j.jacr.2009.07.023
  • 18. “State certification of mammography facilities,” Fed. Regist. 67, 5446–5469 (2002), see http://www.gpo.gov/fdsys/granule/FR-2002-02-06/02-2750.
  • 19. Balleyguier C., Bidault F., Mathieu M. C., Ayadi S., Couanet D., and Sigal R., “BIRADS™ mammography: Exercises,” Eur. J. Radiol. 61, 195–201 (2007). doi: 10.1016/j.ejrad.2006.08.034
  • 20. Zadrozny B. and Elkan C., “Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers,” in International Conference on Machine Learning (ACM, New York, NY, 2001), Vol. 1, pp. 609–616.
  • 21. Domingos P. and Pazzani M., “Beyond independence: Conditions for the optimality of the simple Bayesian classifier,” in Proceedings of the 13th International Conference on Machine Learning (Morgan Kaufmann, Burlington, MA, 1996), pp. 105–112.
  • 22. Hand D. J. and Yu K., “Idiot’s Bayes-not so stupid after all?,” Int. Stat. Rev. 69, 385–398 (2001). doi: 10.1111/j.1751-5823.2001.tb00465.x
  • 23. Burnside E. S., “Bayesian networks: Computer-assisted diagnosis support in radiology,” Acad. Radiol. 12, 422–430 (2005). doi: 10.1016/j.acra.2004.11.030
  • 24. Zweig M. H. and Campbell G., “Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine,” Clin. Chem. 39, 561–577 (1993).
  • 25. DeLong E. R., DeLong D. M., and Clarke-Pearson D. L., “Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach,” Biometrics 44, 837–845 (1988). doi: 10.2307/2531595
  • 26. Steyerberg E. W., Vickers A. J., Cook N. R., Gerds T., Gonen M., Obuchowski N., Pencina M. J., and Kattan M. W., “Assessing the performance of prediction models: A framework for some traditional and novel measures,” Epidemiology (Cambridge, MA) 21, 128–138 (2010). doi: 10.1097/EDE.0b013e3181c30fb2
  • 27. Justice A. C., Covinsky K. E., and Berlin J. A., “Assessing the generalizability of prognostic information,” Ann. Intern. Med. 130, 515–524 (1999). doi: 10.7326/0003-4819-130-6-199903160-00016
  • 28. Ransohoff D. and Feinstein A., “Problems of spectrum and bias in evaluating the efficacy of diagnostic tests,” N. Engl. J. Med. 299, 926–930 (1978). doi: 10.1056/NEJM197810262991705
  • 29. Fenton J. J., Taplin S. H., Carney P. A., Abraham L., Sickles E. A., D’Orsi C., Berns E. A., Cutter G., Hendrick R. E., and Barlow W. E., “Influence of computer-aided detection on performance of screening mammography,” N. Engl. J. Med. 356, 1399–1409 (2007). doi: 10.1056/NEJMoa066099
  • 30. Gilbert F. J., Astley S. M., Gillan M. G., Agbaje O. F., Wallis M. G., James J., Boggis C. R., and Duffy S. W., “Single reading with computer-aided detection for screening mammography,” N. Engl. J. Med. 359, 1675–1684 (2008). doi: 10.1056/NEJMoa0803545
  • 31. Rosenberg R. D., Yankaskas B. C., Abraham L. A., Sickles E. A., Lehman C. D., Geller B. M., Carney P. A., Kerlikowske K., Buist D. S., and Weaver D. L., “Performance benchmarks for screening mammography,” Radiology 241, 55–66 (2006). doi: 10.1148/radiol.2411051504
  • 32. Liberman L., Abramson A., Squires F., Glassman J., Morris E., and Dershaw D., “The breast imaging reporting and data system: Positive predictive value of mammographic features and final assessment categories,” Am. J. Roentgenol. 171, 35–40 (1998). doi: 10.2214/ajr.171.1.9648759
  • 33. Orel S. G., Kay N., Reynolds C., and Sullivan D. C., “BI-RADS categorization as a predictor of malignancy,” Radiology 211, 845–850 (1999). doi: 10.1148/radiology.211.3.r99jn31845
  • 34. Zonderland H. M., Pope T. L., Jr., and Nieborg A. J., “The positive predictive value of the breast imaging reporting and data system (BI-RADS) as a method of quality assessment in breast imaging in a hospital population,” Eur. Radiol. 14, 1743–1750 (2004). doi: 10.1007/s00330-004-2373-6
  • 35. McCormack V. A. and dos Santos Silva I., “Breast density and parenchymal patterns as markers of breast cancer risk: A meta-analysis,” Cancer Epidemiol., Biomarkers Prev. 15, 1159–1169 (2006). doi: 10.1158/1055-9965.EPI-06-0034
  • 36. Antoniou A., Pharoah P., Narod S., Risch H. A., Eyfjord J. E., Hopper J., Loman N., Olsson H. K., Johannsson O., and Borg Å. K., “Average risks of breast and ovarian cancer associated with BRCA1 or BRCA2 mutations detected in case series unselected for family history: A combined analysis of 22 studies,” Am. J. Hum. Genet. 72, 1117–1130 (2003). doi: 10.1086/375033
  • 37. Burnside E. S., Chhatwal J., and Alagoz O., “What is the optimal threshold at which to recommend breast biopsy?,” PLoS One 7, e48820 (2012). doi: 10.1371/journal.pone.0048820
  • 38. Lazarus E., Mainiero M. B., Schepps B., Koelliker S. L., and Livingston L. S., “BI-RADS lexicon for US and mammography: Interobserver variability and positive predictive value,” Radiology 239, 385–391 (2006). doi: 10.1148/radiol.2392042127
  • 39. Baker J. A., Kornguth P. J., and Floyd C., Jr., “Breast imaging reporting and data system standardized mammography lexicon: Observer variability in lesion description,” Am. J. Roentgenol. 166, 773–778 (1996). doi: 10.2214/ajr.166.4.8610547
  • 40. Berg W. A., Campassi C., Langenberg P., and Sexton M. J., “Breast imaging reporting and data system: Inter- and intraobserver variability in feature analysis and final assessment,” Am. J. Roentgenol. 174, 1769–1777 (2000). doi: 10.2214/ajr.174.6.1741769
  • 41. Berg W. A., D’Orsi C. J., Jackson V. P., Bassett L. W., Beam C. A., Lewis R. S., and Crewson P. E., “Does training in the breast imaging reporting and data system (BI-RADS) improve biopsy recommendations or feature analysis agreement with experienced breast imagers at mammography?,” Radiology 224, 871–880 (2002). doi: 10.1148/radiol.2243011626
  • 42. Metz C. E., “Basic principles of ROC analysis,” Semin. Nucl. Med. 8, 283–298 (1978). doi: 10.1016/S0001-2998(78)80014-2
  • 43. Horsch K., Giger M. L., and Metz C. E., “Prevalence scaling: Applications to an intelligent workstation for the diagnosis of breast cancer,” Acad. Radiol. 15, 1446–1457 (2008). doi: 10.1016/j.acra.2008.04.022

Articles from Medical Physics are provided here courtesy of American Association of Physicists in Medicine
