Abstract
Purpose:
To compare machine learning methods for classifying mass lesions on mammography images that use predefined image features computed over lesion segmentations to those that leverage segmentation-free representation learning on a standard, public evaluation dataset.
Methods:
We apply several classification algorithms to the public Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM), in which each image contains a mass lesion. Segmentation-free representation learning techniques for classifying lesions as benign or malignant include both a Bag-of-Visual-Words (BoVW) method and a Convolutional Neural Network (CNN). We compare classification performance of these techniques to that obtained using two different segmentation-dependent approaches from the literature that rely on specific combinations of end classifiers (e.g. linear discriminant analysis, neural networks) and predefined features computed over the lesion segmentation (e.g. spiculation measure, morphological characteristics, intensity metrics).
Results:
We report area under the receiver operating characteristic curve (AZ) values for malignancy classification on CBIS-DDSM for each technique. We find average AZ values of 0.73 for a segmentation-free BoVW method, 0.86 for a segmentation-free CNN method, 0.75 for a segmentation-dependent linear discriminant analysis of Rubber-Band Straightening Transform features, and 0.58 for a hybrid rule-based neural network classification using a small number of hand-designed features.
Conclusions:
We find that malignancy classification performance on the CBIS-DDSM dataset using segmentation-free BoVW features is comparable to that of the best segmentation-dependent methods we study, but also observe that a common segmentation-free CNN model substantially and significantly outperforms each of these (p<0.05). These results reinforce recent findings suggesting that representation learning techniques such as BoVW and CNNs are advantageous for mammogram analysis because they do not require lesion segmentation, the quality and specific characteristics of which can vary substantially across datasets. We further observe that segmentation-dependent methods achieve performance levels on CBIS-DDSM inferior to those achieved on the original evaluation datasets reported in the literature. Each of these findings reinforces the need for standardization of datasets, segmentation techniques, and model implementations in performance assessments of automated classifiers for medical imaging.
Keywords: Mammography, Computer Assisted Diagnosis, Deep Learning, Segmentation
INTRODUCTION
Breast cancer is the most deadly cancer for women in developing countries and the second most deadly cancer for those in developed nations.1 Mammograms are an essential component in early detection of breast cancer, and interpretation sensitivity greatly affects patient survival rates.2 On the other hand, imperfect specificity of breast lesion diagnosis by mammography causes physical and psychological discomfort to false positive patients who are subjected to further, possibly invasive tests.3,4 As a result, a variety of Computer Aided Diagnosis (CADx) systems designed to provide quantitative, objective mass classification have been developed.5
Existing CADx techniques for mass classification generally fall into two categories: segmentation-dependent methods, which require a detailed outline of the lesion upon which to compute features for classification, and segmentation-free methods, which do not. Segmentation-dependent methods tend to rely on predefined sets of features, while segmentation-free approaches leverage representation learning techniques to learn explanatory features directly from the available data. Segmentation-free approaches could provide substantial value in clinical practice by obviating the need for detailed segmentations, but have only recently been able to achieve results similar to those of segmentation-dependent methods.6–9 Though recent studies of segmentation-free approaches have shown promising results, CADx systems for mass classification have rarely been evaluated on the same datasets, making the type of comparative performance analysis that should precede clinical deployment difficult to perform.8 Major limitations to the evaluation of different mammography CADx systems on common datasets have ranged from insufficient descriptions of model implementation in the original literature to challenges with data availability and provenance.
In this work, we leverage the recent Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) to compare segmentation-free and segmentation-dependent approaches to automated mass classification on a standard, public dataset. To perform this comparison, we implement four different techniques from the literature: a segmentation-free Bag-of-Visual-Words (BoVW) mass classification algorithm inspired by traditional computer vision,10 a segmentation-free Convolutional Neural Network (CNN) trained using commodity deep learning software,11 a segmentation-dependent algorithm based on the Rubber-Band Straightening Transform (RBST) of Sahiner et al.,6 and the segmentation-dependent approach of Huo et al. that combines predefined features with an artificial neural network.7 Evaluating the performance of each of these techniques on the CBIS-DDSM test set, after training and tuning on the standard CBIS-DDSM training set, yields several useful conclusions. First, we observe that malignancy classification performance on the CBIS-DDSM dataset obtained using the segmentation-free BoVW method is comparable to that of the best segmentation-dependent methods, but also find that the segmentation-free CNN model substantially and significantly (p<0.05) outperforms each of these by 11 points of area under the receiver operating characteristic curve (AZ). These results reinforce recent findings suggesting that representation learning techniques such as BoVW and CNN can obviate the need for precise or method-specific lesion segmentation while maintaining high levels of performance. Second, we find that our re-implementations of existing segmentation-dependent methods yield performance levels on CBIS-DDSM inferior to those achieved on the original evaluation datasets reported in the literature.
We propose that these discrepancies result from some combination of differences in the segmentation techniques used, parameter tuning on small datasets in the original work, and implementation choices.
BACKGROUND AND RELATED WORK
Due to the importance of characteristics of the lesion margin in differentiating benign and malignant tumors, many existing CADx methods have been based on obtaining mathematical descriptions of the tumor outline.7,12–20 Such segmentation-dependent techniques require accurate segmentation of the lesion margin in order to extract image features. Methods that require hand-drawn segmentation of lesions, a process not usually performed in clinical practice, can make resultant CADx systems inefficient for clinical use. CADx systems utilizing automated lesion segmentation have, however, been studied with promising results. For example, Mudigonda et al. obtained an AZ of 0.85 for binary classification of breast masses using hand-drawn Regions-of-Interest (ROIs) as a basis for their automated segmentation method.13 Likewise, Sahiner et al.14 and Huo et al.7 developed CADx systems using segmentation methods requiring only a general bounded region identified by the radiologist and achieved AZ results of 0.91 and 0.94, respectively. An extensive analysis of existing automated mass segmentation methods that support segmentation-dependent CADx can be found in the review of Oliver et al.21
More recent segmentation-free CADx approaches have attempted to attain high levels of diagnostic performance without any lesion segmentation requirements. For instance, multiple workers such as Jamieson et al.22 (AZ = 0.71), Liu and Jiang23 (AZ = 0.72), and Li et al.24 (AZ = 0.796) have demonstrated the promise of segmentation-free algorithms, most of which have been based on various types of learned features. A particularly large body of work has arisen that applies deep learning algorithms such as Convolutional Neural Networks (CNNs) to automated mass analysis. The recent review of deep learning techniques in mammography by Hamidinekoo et al. found that nine major mass classification studies performed between 1996 and 2017 using deep learning approaches yielded AZ values between 0.71 and 0.97.9 However, because each of these studies used different datasets, several of which are not public, it is difficult to fairly compare the performance of these different algorithms.
The focus of this work is on analyzing the performance of multiple different techniques on the relatively recent CBIS-DDSM dataset. We leverage this dataset because it (a) is provided with standard train/validation splits, (b) contains cases verified by pathology, (c) contains images curated by expert mammographers from multiple institutions, and (d) contains a large amount of data in addition to the raw images – e.g. lesion segmentation, Breast Imaging Reporting And Data System (BI-RADS) descriptors, BI-RADS abnormality ratings, and severity ratings – that make it amenable to analysis via a wide variety of machine learning approaches. Note that BI-RADS comprises both a lexicon for reporting breast imaging results, whose descriptors of imaging features were shown to correlate with high predictive values for either benign or malignant disease, and a classification system that describes the likelihood that the imaging findings represent malignancy.25
For the sake of completeness, we provide here additional background on recent work in mass classification and the datasets on which these studies were evaluated. Wang et al. extracted five manually-designed image features and applied logistic regression for classification, achieving AZ of 0.806 ± 0.025 on an internal dataset.26 Zhu et al. developed a deep multi-instance network and achieved AZ of 0.859 ± 0.03 on the public INbreast dataset.27 Arevalo et al. used a combination of CNN-extracted features and hand-crafted features and obtained AZ of 0.826 on the public BCDR-FM dataset.28 Kim et al. applied ResNet to a large-scale mammography dataset from 5 institutions, and achieved AZ of 0.906.29 Ribli et al. applied Faster R-CNN to the INbreast database, which achieved AZ of 0.95.30 Al-masni et al. applied a YOLO model for this task and achieved 5-fold cross-validation AZ of 0.965 on a subset of the DDSM dataset.31 Lotter et al. developed a curriculum training method with a multi-scale CNN and obtained 0.901 ± 0.031 on a subset of DDSM; these authors also note “as full image mammogram classification lacks standardized evaluation framework, it is somewhat difficult to directly compare our results to other work.”32 In other words, because most studies are evaluated on different, incompletely reported, or private datasets, it is difficult even for researchers in the field to understand how the performance of their methods compares to that reported by other work. Ting et al. used hierarchical features from a CNN and achieved AZ of 0.92 ± 0.02 on a subset of DDSM.33 Chougrad et al. applied various CNNs including VGG, ResNet-50 and Inception networks, and achieved AZ of 0.99 on the MIAS dataset and 0.98 on a subset of the DDSM dataset.34
Many studies have also used the CBIS-DDSM dataset we leverage here, albeit in slightly different ways. Ragab et al.35 and Li et al.36 applied convolutional neural networks on segmented ROIs, and achieved AZ of 0.94 and 0.85 on CBIS-DDSM respectively. Tsochatzidis et al. examined multiple popular CNNs, and achieved AZ of 0.859 and 0.804 on DDSM-400 and CBIS-DDSM respectively.37 Chougrad et al. proposed a multi-label classification setting and fine-tuned a pre-trained CNN, achieving mean AZ of 0.89 ± 0.08 for 5-fold cross-validation on CBIS-DDSM.38 Chen et al.39 and Falconi et al.40 fine-tuned CNNs and achieved AZ of 0.86 and 0.844 on CBIS-DDSM respectively. Alkhaleefah et al. fine-tuned VGG-19 and applied data augmentation techniques on their own data splits of CBIS-DDSM, and achieved AZ of 0.961.41 Shu et al. proposed different pooling structures for CNNs and obtained AZ of 0.838 ± 0.0001 on CBIS-DDSM.42 Samala et al. aimed to assess the generalization errors of CNNs, and found AZ of 0.83 ± 0.03 on internal and CBIS-DDSM combined data.43 Gossmann et al. investigated the performance deterioration of deep neural networks for lesion classification due to distribution shift, and achieved AZ of 0.833 on CBIS-DDSM.44 Beltran-Perez et al. developed a three-step pipeline to extract image features using a multiscale generalized radial basis function and the discrete cosine transform, and achieved 93.99% accuracy on CBIS-DDSM.45 Ansar et al. applied transfer learning of MobileNet and obtained 74.5% accuracy on CBIS-DDSM.46 De Vriendt et al. proposed an all-in-one graph-based deep semi-supervised learning framework, and obtained AZ of 0.811 with only 40% labeled data on CBIS-DDSM.47
A small number of existing studies compare multiple computer assisted detection or diagnosis techniques on the same datasets. The work of Oliver et al. implements multiple approaches for mass detection techniques on a single dataset, but does not address mass classification.21 The recent work of Kooi et al. compares a CNN-based approach to a single reference system based on extracted features, but does so on a non-public dataset.8 They find that the segmentation-free CNN approach (AZ = 0.93) slightly outperforms their segmentation-dependent, feature-based method (AZ = 0.91).
MATERIALS AND METHODS
In order to compare multiple segmentation-free representation learning approaches to several segmentation-dependent predefined feature methods using a standard, public dataset, we implemented four high-performing methods from the literature that are both trained and evaluated using standard splits from the CBIS-DDSM mass classification dataset.48 Segmentation-free techniques analyzed include both a BoVW approach adapted from traditional computer vision as well as a standard CNN-based approach from the field of deep learning.10,11 Segmentation-dependent techniques analyzed include two high-performing approaches from the literature: linear discriminant analysis of features computed on the lesion margin as described by Sahiner et al.6 and hybrid rule-based neural network classification of a small number of salient features as described by Huo et al.7 We obtained existing code for each technique where possible, and re-implemented the remainder as faithfully as possible following descriptions in the literature; specific implementation decisions are described in detail below. We note that it was not possible to obtain code for the exact methods for lesion segmentation implemented in previous studies, and we therefore use the segmentations provided as part of the public CBIS-DDSM dataset. The technique used by Lee et al.49 to compute these segmentations was based on the Chan-Vese local level set framework,50 where the coarse annotation from the original DDSM dataset was used for initialization. The process of computing and validating these segmentations is described in detail by Lee et al.49 We describe each method implemented for this study in detail below; implementations are built using MATLAB (v. R2011) and Python 3.6 unless otherwise noted. Note that each method uses only the mammogram image, and no other descriptors included in the CBIS-DDSM dataset.
Segmentation-Free Representation Learning Methods
Each of the segmentation-free methods described below relies on classification of a set of features learned directly from training data, and does not require lesion segmentation.
Bag-of-Visual-Words
Figure 1 outlines the procedure for the BoVW method. Each image was first preprocessed using the method developed by Chan et al., which involves filtering a bounded portion of the image around the ROI in order to smooth out structures in the background tissue that may obscure the mass margin.14,51 Figure 2 shows an example of an ROI before and after this preprocessing. We then utilize the Scale-Invariant Feature Transform (SIFT) to compute the primitive feature set on which our BoVW method is based using a bounding box around the entire ROI; all SIFT features were computed using the VLFeat open source library in MATLAB (v. R2011).52 Subsequent image classification is based on a feature histogram created by counting the number of image patches assigned to each individual visual word by a consensus clustering approach described in detail in the Supplementary Material. Finally, we train an L1-regularized logistic regression classifier, also known as Least Absolute Shrinkage and Selection Operator (LASSO), utilizing the glmnet (v. 2.0–16) software package.53 Hyperparameters, such as the number of clusterings upon which to evaluate consensus clustering and the regularization parameter for LASSO, were tuned using 20% of the training data as a held-out validation set.
Figure 1.
Flowchart of a Bag-of-Visual-Words (BoVW) method, which relies on sequential steps including image processing, feature extraction, formation of visual words dictionary, and ultimate classification.
Figure 2.
Example of a Region-of-Interest (ROI) before (a) and after (b) preprocessing for the Bag-of-Visual-Words (BoVW) classifier.
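The histogram step of the BoVW pipeline described above can be sketched as follows. This is only an illustration under simplifying assumptions: local descriptors (for example SIFT vectors) are taken as already extracted, the visual-word dictionary is given as a list of cluster centers, and each descriptor is assigned to its nearest center in place of the consensus clustering described in the Supplementary Material. The function names `assign_word` and `bovw_histogram` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_word(descriptor, dictionary):
    """Index of the nearest visual word (assumption: plain nearest-center
    assignment, not the consensus clustering used in the study)."""
    return min(range(len(dictionary)),
               key=lambda k: squared_distance(descriptor, dictionary[k]))


def bovw_histogram(descriptors, dictionary, normalize=True):
    """Count how many descriptors fall on each visual word.

    The resulting vector (one entry per word) is the image-level feature
    that a downstream classifier such as L1-regularized logistic
    regression would consume.
    """
    counts = [0] * len(dictionary)
    for descriptor in descriptors:
        counts[assign_word(descriptor, dictionary)] += 1
    if normalize and descriptors:
        total = float(len(descriptors))
        return [c / total for c in counts]
    return counts


# Toy usage: two visual words, three 2-D descriptors.
words = [(0.0, 0.0), (10.0, 10.0)]
patches = [(1.0, 1.0), (9.0, 9.0), (0.0, 2.0)]
print(bovw_histogram(patches, words, normalize=False))
```

Normalizing the counts makes histograms comparable across ROIs that yield different numbers of descriptors; whether the original pipeline normalizes is not stated in the text and is assumed here.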
Convolutional Neural Network
We evaluate the performance of a standard Convolutional Neural Network (CNN) architecture implemented using the Keras (v. 2.2.0) Python software package with a TensorFlow backend.54–56 Due to its superior performance on a variety of image classification tasks both inside and outside of medicine, we utilize a 121-layer Densely Connected CNN (DenseNet-121) of Huang et al. for this comparison.11,57,58 The DenseNet-121 model was trained on a random 90% sample of the CBIS-DDSM training set and validated on the remaining 10% for 100 epochs using a single Tesla P100 GPU with a batch size of 32 images, a dropout rate of 0.2, a learning rate of 0.001, and the Adam optimizer. Hyperparameter values were determined using coarse grid search in the vicinity of default parameters. The network was initialized using weights from a model pre-trained on the ImageNet dataset, all parameters were assumed learnable, and the learning rate was decreased by a factor of when validation accuracy had not increased for more than 10 epochs.57 All images were mean-standard-deviation-normalized, cropped to 750 × 750 pixels around the segmentation centroid, and further downsampled to 224 × 224 pixels before entering into the CNN. Data augmentation was achieved via random application of mild zooms (in the range of [0.8, 1.2]), horizontal flips, and random rotations. While such augmentations are important in most state-of-the-art image classification results, note that due to the physical particulars of mammography, augmentations such as contrast and brightness enhancements were specifically not applied.59
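The image preparation described above (normalization, a 750 × 750 crop around the segmentation centroid, and downsampling to 224 × 224) can be sketched in plain Python as follows. Several details are assumptions made for illustration because the text does not specify them: normalization is done per image, the crop window is shifted inward when the centroid lies near an image border, and downsampling uses nearest-neighbor sampling. All function names are hypothetical.

```python
from statistics import mean, pstdev


def normalize(pixels):
    """Subtract the mean and divide by the standard deviation
    (assumption: statistics are computed per image)."""
    flat = [p for row in pixels for p in row]
    mu = mean(flat)
    sigma = pstdev(flat) or 1.0  # guard against constant images
    return [[(p - mu) / sigma for p in row] for row in pixels]


def crop_window(centroid_row, centroid_col, height, width, size=750):
    """Top-left corner and extent of a crop centered on the centroid.

    Assumption: the window is shifted to stay inside the image rather
    than padded when the centroid is close to a border.
    """
    n_rows, n_cols = min(size, height), min(size, width)
    top = min(max(centroid_row - n_rows // 2, 0), height - n_rows)
    left = min(max(centroid_col - n_cols // 2, 0), width - n_cols)
    return top, left, n_rows, n_cols


def crop(pixels, top, left, n_rows, n_cols):
    """Extract the rectangular window from a list-of-rows image."""
    return [row[left:left + n_cols] for row in pixels[top:top + n_rows]]


def downsample_nearest(pixels, out_size=224):
    """Nearest-neighbor resize to out_size x out_size
    (assumption: the interpolation method is not given in the text)."""
    n_rows, n_cols = len(pixels), len(pixels[0])
    return [[pixels[r * n_rows // out_size][c * n_cols // out_size]
             for c in range(out_size)]
            for r in range(out_size)]
```

In practice these steps would be applied with an array library; the list-based version is meant only to make the order of operations and the border handling explicit.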
Segmentation-Dependent Predefined Feature Methods
Each of the segmentation-dependent methods described below is drawn from the CADx literature, and relies on classification of a set of features defined a priori over a provided lesion segmentation.
Linear Discriminant Analysis of Rubber-Band Straightening Transform Features (LDA-RBST)
The CADx system of Sahiner et al. relies on linear discriminant analysis performed on a variety of predefined features, including those derived using the Rubber-Band Straightening Transform (RBST) for which Sahiner has generously provided the code.6 The RBST converts the margin of the image into a straight line as shown in Figure 3, and is accomplished by determining the normal direction to each ROI edge pixel and taking 40 pixels along that direction to compose each line of the RBST. Several texture features were computed from the RBST image, including features from the gray-level co-occurrence matrix (GLCM) at ten pixel differences (d = 1, 2, 3, 4, 6, 8, 10, 12, 16, 20) and four directions (θ = 0°, 45°, 90°, 135°) as well as run-length statistics (RLS) features in four directions (θ = 0°, 45°, 90°, 135°). Additionally, various morphological features were obtained from the original ROI image. Each set of features is listed in Table 1.
Figure 3.
Example Region-of-Interest (ROI) and resulting Rubber-Band Straightening Transform (RBST) image.
Table 1.
List of features employed by method of Sahiner et al.
| Morphological | GLCM | RLS |
|---|---|---|
| Fourier descriptor | Difference average | Long run emphasis |
| Convexity | Difference entropy | Run percentage |
| Rectangularity | Inverse difference moment | Gray level non-uniformity |
| Perimeter | Difference variance | Run length non-uniformity |
| NRL mean | Inertia | Short run emphasis |
| Contrast | Correlation | |
| NRL entropy | Inf. Measure of correlation 1 | |
| Circularity | Inf. Measure of correlation 2 | |
| NRL area ratio | Energy | |
| NRL standard deviation | Entropy | |
| NRL zero-crossing count | Sum variance | |
| Perimeter-to-area ratio | Sum entropy | |
| Area | Sum average | |
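
To make the GLCM texture features in Table 1 concrete, the sketch below builds a normalized co-occurrence matrix for one pixel offset and computes three of the listed features (energy, entropy, and inertia). It is an illustrative, unoptimized version that assumes the image has already been quantized to integer gray levels in `range(levels)`; the (row, column) offsets for the four directions follow the convention visible in the offsets reported in Table 3. Function names are hypothetical and this is not the code provided by Sahiner.

```python
from math import log


def direction_offsets(distance):
    """(row, col) offsets for 0, 45, 90, and 135 degrees at one distance."""
    d = distance
    return {0: (0, d), 45: (-d, d), 90: (-d, 0), 135: (-d, -d)}


def glcm(image, offset, levels):
    """Normalized gray-level co-occurrence matrix P[i][j]: the fraction of
    in-bounds pixel pairs whose first pixel has level i and whose neighbor
    at the given offset has level j."""
    d_row, d_col = offset
    n_rows, n_cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        return [[0.0] * levels for _ in range(levels)]
    return [[v / total for v in row] for row in counts]


def energy(p):
    """Sum of squared matrix entries."""
    return sum(v * v for row in p for v in row)


def entropy(p):
    """Shannon entropy (natural log) of the matrix entries."""
    return -sum(v * log(v) for row in p for v in row if v > 0)


def inertia(p):
    """Sum of (i - j)^2 * P[i][j]."""
    return sum((i - j) ** 2 * v
               for i, row in enumerate(p) for j, v in enumerate(row))


# Toy usage on a 2 x 3 image with two gray levels, 0 degrees, distance 1.
toy = [[0, 0, 1],
       [0, 1, 1]]
p = glcm(toy, direction_offsets(1)[0], levels=2)
print(energy(p), inertia(p))
```

Repeating this for each of the ten distances and four directions yields one value per (feature, distance, direction) combination, which is consistent with the way features are labeled in Table 3.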
Neural Network Classification of Hand-Designed Features (NN-HDF)
Finally, we re-implemented the feature extraction method from Huo et al. that supports a hybrid rule-based neural network classifier.7 Five features were included in this method: spiculation measure, sharpness, average gray level, contrast, and texture. The spiculation measure was the key feature for the original work of Huo et al.7 It is found using radial edge gradient analysis over the four different neighborhoods shown in Figure 4, where Sobel filters were used to obtain the gradient magnitude and orientation of the ROI.60 The orientation was then normalized based on the radial direction, and a gradient-magnitude-weighted histogram of the normalized orientation was found. The spiculation measure is the Full-Width at Half Maximum (FWHM) of this histogram. As Huo et al. do not describe the exact method for determining the FWHM required for their method, we utilize the following straightforward computation. We first smooth the histogram with an averaging filter of length two. We then find the first and last bins at which the histogram exceeded the half maximum. Finally, we use linear interpolation to determine the exact bin position and convert to degrees. We use 24 bins, allowing for a bin size of 15° as in the original paper. We note here that Huo et al. report that only an approximately correct outline of the mass lesion was required for the purposes of this analysis.
Figure 4.
Neighborhoods used in NN-HDF method (excluded areas in black). Panel (a) represents the segmented mass, panel (b) represents the mass margin, panel (c) represents the mass plus the surrounding periphery, and panel (d) represents the surrounding periphery.
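The FWHM computation described above for the spiculation measure can be sketched as follows. The input is the gradient-magnitude-weighted orientation histogram (24 bins of 15° each). Two details are assumptions made for illustration, since the text does not fix them: the length-two averaging filter pairs each bin with its predecessor and leaves the first bin unchanged, and a crossing that falls outside the histogram is clamped to the first or last bin. Function names are hypothetical.

```python
def smooth_length_two(hist):
    """Averaging filter of length two (assumption: each bin is averaged
    with its predecessor; the first bin is left unchanged)."""
    return [hist[0]] + [(hist[i - 1] + hist[i]) / 2.0
                        for i in range(1, len(hist))]


def fwhm_degrees(hist, bin_width_deg=15.0):
    """Full width at half maximum of a histogram, in degrees."""
    smoothed = smooth_length_two(hist)
    half = max(smoothed) / 2.0
    above = [i for i, v in enumerate(smoothed) if v > half]
    if not above:  # flat, all-zero histogram
        return 0.0
    first, last = above[0], above[-1]

    # Left crossing: interpolate between the bin just below half maximum
    # and the first bin above it.
    if first > 0:
        lo, hi = smoothed[first - 1], smoothed[first]
        left = (first - 1) + (half - lo) / (hi - lo)
    else:
        left = 0.0

    # Right crossing: interpolate between the last bin above half maximum
    # and the bin just after it.
    if last < len(smoothed) - 1:
        hi, lo = smoothed[last], smoothed[last + 1]
        right = last + (hi - half) / (hi - lo)
    else:
        right = float(len(smoothed) - 1)

    return (right - left) * bin_width_deg


# Toy usage: after smoothing this becomes a triangle peaking at bin 3,
# which is two bins wide at half maximum.
print(fwhm_degrees([0, 0, 4, 4, 0, 0]))
```

A broad histogram (large FWHM) means gradient directions deviate widely from the radial direction, which is the behavior the spiculation measure is designed to capture.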
Evaluation
Each method was trained (where appropriate) and tested using the CBIS-DDSM data set with the provided train and test splits.48 The data set includes 691 training cases (355 benign, 336 malignant) and 200 test cases (117 benign, 83 malignant). We assess the performance of each method by analyzing AZ values and associated 95% confidence intervals.61
Statistical Techniques
Model performance was assessed using AZ on a held-out test set, computed using either the scikit-learn (v. 0.19) Python library or MATLAB (v. R2011). A statistical test of non-inferiority implemented in the rocNIT (v. 1.0) R library was used to compare different classifiers characterized by similar performance levels. The method of Hanley and McNeil was used to compute 95% confidence intervals on AZ values, and p-values less than 0.05 were considered statistically significant throughout the analysis.62 Statistical computations were performed by J.A.D., A.H., and R.S.L.
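For readers unfamiliar with the Hanley and McNeil approach, the sketch below computes AZ as the Mann-Whitney statistic and derives a confidence interval from the Hanley-McNeil standard error. It is a minimal illustration, not the code used in the study: the approximations Q1 = A/(2 − A) and Q2 = 2A²/(1 + A) and the normal quantile z = 1.96 are assumptions about the variant applied, and the function names are hypothetical.

```python
from math import sqrt


def auc_mann_whitney(scores_pos, scores_neg):
    """Area under the ROC curve as the probability that a malignant case
    scores higher than a benign case (ties count one half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of the AUC following Hanley and McNeil."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (auc * (1.0 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return sqrt(variance)


def auc_confidence_interval(auc, n_pos, n_neg, z=1.96):
    """Symmetric normal-approximation interval, clipped to [0, 1]."""
    se = hanley_mcneil_se(auc, n_pos, n_neg)
    return max(0.0, auc - z * se), min(1.0, auc + z * se)


# Toy usage with made-up classifier scores.
print(auc_mann_whitney([0.9, 0.8, 0.4], [0.5, 0.3]))
```

The standard error shrinks as the number of malignant and benign test cases grows, which is why interval widths depend on the class counts of the test split as well as on AZ itself.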
RESULTS
Table 2 contains the results for each method described above on CBIS-DDSM. Most noticeably, the CNN substantially and significantly outperforms all other approaches with an AZ of 0.86 [0.83, 0.89] (p<0.05); note that these results are on par with the best segmentation-free results of which we are aware on the standard CBIS-DDSM dataset. Amongst the remaining methods, the LDA-RBST technique of Sahiner et al. yielded the best classification performance with an AZ of 0.75 [0.69, 0.81], followed closely by the BoVW method with an AZ of 0.73 [0.66, 0.79]. A non-inferiority test performed on the results from these two techniques yields a p-value of 0.01 for a δ of 0.15 and an α of 0.05, demonstrating significant non-inferiority of the BoVW method with respect to LDA-RBST.63 The NN-HDF method performed significantly worse than all other methods on CBIS-DDSM, with an AZ value of 0.58 [0.51, 0.65] (p<0.05). We present additional results specific to each technique in detail below.
Table 2.
Results for different classification methods on CBIS-DDSM dataset.
| Method | AZ [95% Confidence Interval] |
|---|---|
| BoVW | 0.73 [0.66, 0.79] |
| CNN | 0.86 [0.83, 0.89] |
| LDA-RBST | 0.75 [0.69, 0.81] |
| NN-HDF | 0.58 [0.51, 0.65] |
Segmentation-Free Representation Learning Methods
Bag-of-Visual-Words
As described in the Supplementary Material, the BoVW method requires parameter optimization with respect to the number of clusterings used in consensus clustering and the regularization parameter in LASSO. Based on the empirical findings shown in Figure 5, we chose 10 clusterings and a regularization parameter value of 0.014. With these optimized parameters, we achieved AZ of 0.73 using the BoVW method.
Figure 5.
Results per parameter using Bag-of-Visual-Words (BoVW) method: (a) area under the receiver operating characteristic curve (AZ) versus number of clusterings using Scale Invariant Feature Transform (SIFT) features, (b) mean-squared error versus logarithm of regularization parameter λ.
Convolutional Neural Network
The fundamental difference between the DenseNet-121 deep learning approach and other methods in this paper is the fact that deep neural networks are able to learn their own feature maps in a manner that best explains the data available. Thus, the trained neural network is itself a feature extractor, and the combination of a fully connected linear layer and a softmax operator is responsible for classification. Performing the training procedure described above with ten different random seeds yielded best-case test set AZ of 0.88 [0.85, 0.90], worst-case test set AZ of 0.85 [0.82, 0.89], and median AZ of 0.86 [0.83, 0.89], which represents the best performance of any method described in this manuscript.
In order to ensure that this classification performance is not caused by anomalies within the data or training process, we compute class activation maps (CAMs) to assess whether the CNN classifications are leveraging appropriate spatial regions of the image.64 Pertinent visualizations can be found in Figure 6, where we observe that correct classifications result when the network activations are directly over the lesion, while errors in both directions occur when weights for the correct class are high in spatial areas outside of the mass itself.
Figure 6.
Images and class activation maps (CAMs) from the convolutional neural network (CNN) model for (a) true positive, (b) false positive, (c) false negative, and (d) true negative examples. CAMs presented depict areas most responsible for the correct classification in red and those least responsible in blue; i.e., for examples (a) and (c), weights are those for the malignant class, while in examples (b) and (d) these weights are for the benign class.
Segmentation-Dependent Predefined Feature Methods
Linear Discriminant Analysis of Rubber-Band Straightening Transform Features
Our analysis is similar to that of Sahiner et al.,6 where we choose the top ten features from morphological and texture features separately as well as together for use in linear discriminant analysis. Feature selection was accomplished using Wilks’ lambda. The resulting AZ from each of these scenarios was 0.62, 0.70, and 0.75, respectively. Table 3 lists the top ten features found for each category. Note that Sahiner et al.6 specifically mention that they expect the selected features to change based on the particular training dataset used, so our procedure represents an intended implementation of this technique.
Table 3.
List of top features from method of Sahiner et al.
| Top 10 Morphological | Top 10 Texture | Top 10 Combined |
|---|---|---|
| Perimeter | RLS Long Run Emphasis of RBST, Offset: 45° | Perimeter |
| Contrast | GLCM Entropy of RBST, Offset: −2, 2, Distance: 2 | Contrast |
| Mean of RDS a) | GLCM Entropy of RBST, Offset: −3, −3, Distance: 3 | GLCM Difference Entropy of RBST, Offset: −16, −16, Distance: 16 |
| Area Ratio of RDS a) | GLCM Difference Variance of RBST, Offset: 0, 12, Distance: 12 | GLCM Difference Average of RBST, Offset: −12, 12, Distance: 12 |
| Perimeter to Area Ratio | GLCM Difference Variance of RBST, Offset: −12,−12, Distance: 12 | GLCM Difference Variance of RBST, Offset: −4, 0, Distance: 4 |
| Convexity | GLCM Difference Variance of RBST, Offset: −6, −6, Distance: 6 | GLCM Difference Variance of RBST, Offset: 0, 8, Distance: 8 |
| Area | GLCM Entropy of RBST, Offset: −8, 0, Distance: 8 | GLCM Difference Entropy of RBST, Offset: 0, 16, Distance: 16 |
| Entropy of RDS a) | GLCM Energy of RBST, Offset: −16, −16, Distance: 16 | GLCM Entropy of RBST, Offset: −2, 0, Distance: 2 |
| Standard Deviation of RDS a) | GLCM Difference Average of RBST, Offset: −2, 2, Distance: 2 | GLCM Entropy of RBST, Offset: 0, 4, Distance: 4 |
| Zero-crossing Count of RDS a) | GLCM Inverse Difference Moment of RBST, Offset: −6, −6, Distance: 6 | GLCM Sum Entropy of RBST, Offset: 0, 1, Distance: 1 |
a) RDS: Radial Distance Signal
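
As a rough illustration of ranking features by Wilks' lambda, the sketch below scores each feature on its own as the ratio of within-class to total sum of squares (smaller values indicate better class separation) and keeps the k best. This univariate ranking is a deliberate simplification and an assumption: a stepwise procedure that accounts for correlations between features would generally select a different set. Function and variable names are hypothetical.

```python
from statistics import mean


def wilks_lambda(values, labels):
    """Univariate Wilks' lambda: within-class sum of squares divided by
    total sum of squares. 0 means perfect separation, 1 means none."""
    overall = mean(values)
    ss_total = sum((v - overall) ** 2 for v in values)
    if ss_total == 0:
        return 1.0  # constant feature carries no class information
    ss_within = 0.0
    for cls in set(labels):
        group = [v for v, l in zip(values, labels) if l == cls]
        center = mean(group)
        ss_within += sum((v - center) ** 2 for v in group)
    return ss_within / ss_total


def top_k_features(feature_columns, labels, k=10):
    """Names of the k features with the smallest Wilks' lambda.

    feature_columns maps a feature name to its values over all cases,
    in the same order as labels.
    """
    scores = {name: wilks_lambda(column, labels)
              for name, column in feature_columns.items()}
    return sorted(scores, key=scores.get)[:k]


# Toy usage: one informative and one uninformative feature.
labels = [0, 0, 1, 1]
columns = {"informative": [1.0, 1.0, 5.0, 5.0],
           "uninformative": [1.0, 5.0, 1.0, 5.0]}
print(top_k_features(columns, labels, k=1))
```

The selected features would then be passed to linear discriminant analysis, separately for the morphological, texture, and combined feature pools.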
Neural Network Classification of Hand-Designed Features
Huo et al. developed a hybrid rule-based neural network classifier in their work.7 The rule pertains to the spiculation measure: any mass with a spiculation measure higher than 160° is automatically classified as malignant, and the remaining features are used as input to a neural network that determines the malignancy of lesions with a lower spiculation measure. The results of our re-implementation of this method evaluated on CBIS-DDSM are shown per feature in Table 4 along with the results reported by Huo et al. on their original dataset. The resulting AZ for the hybrid classifier using the 160° threshold was 0.51. With a threshold of 320° optimized for our dataset, the AZ was 0.58.
Table 4.
Per-feature Results of NN-HDF Method on CBIS-DDSM and Literature Datasets
| Feature | Literature AZ | CBIS-DDSM AZ |
|---|---|---|
| Spiculation Measure | 0.88 | 0.53 |
| Sharpness | 0.53 | 0.53 |
| Average Gray Level | 0.65 | 0.52 |
| Contrast | 0.59 | 0.52 |
| Texture Measure | 0.54 | 0.51 |
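
The decision logic of the hybrid rule-based classifier described in this section can be sketched as follows. The neural network is represented by a hypothetical callable `network` that maps the remaining features to a malignancy score in [0, 1]; returning a score of exactly 1.0 when the rule fires is an assumption made so that the output can be fed into an ROC analysis.

```python
def hybrid_malignancy_score(spiculation_deg, other_features, network,
                            threshold_deg=160.0):
    """Rule first, network second.

    If the spiculation measure exceeds the threshold, the mass is called
    malignant outright; otherwise the remaining features are passed to
    the network (hypothetical callable returning a score in [0, 1]).
    """
    if spiculation_deg > threshold_deg:
        return 1.0
    return network(other_features)


def stub_network(features):
    """Stand-in for a trained network: always returns the same score."""
    return 0.2


# Toy usage with the two thresholds discussed in the text.
features = {"sharpness": 0.4, "average_gray_level": 0.6,
            "contrast": 0.1, "texture": 0.3}
print(hybrid_malignancy_score(200.0, features, stub_network))
print(hybrid_malignancy_score(200.0, features, stub_network,
                              threshold_deg=320.0))
```

Raising the threshold routes more cases to the network, so the choice of threshold controls how much weight the single spiculation feature carries in the final decision.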
DISCUSSION
Our analysis supports several distinct conclusions. First, we find that the two segmentation-free methods are able to classify benign vs. malignant masses as well as or better than segmentation-based methods that use predefined features. In particular, while BoVW performs similarly to the best predefined feature method, the deep learning method (AZ = 0.86) improves AZ by 11 points over the best competing segmentation-based method (AZ = 0.75). These results support the conclusion that the two segmentation-free mass classification methods that leverage representation learning, BoVW and CNN, can obviate the need for accurate segmentations while improving performance with respect to traditional segmentation-based CADx methods that use predefined features. While our observation that segmentation-free deep learning models can outperform segmentation-based models is consistent with expectations of previous work,8 our study is the first to demonstrate this across multiple different techniques on a standard, public mammography dataset. A robust finding that mammography CADx could safely move to segmentation-free methods would benefit busy clinical workflows, as providing precise lesion segmentation or region-of-interest outlines can be laborious and time consuming in practice. Thus, further evaluation studies in larger patient cohorts and more diverse image sets would be useful to confirm the performance improvement we have observed on CBIS-DDSM from segmentation-free representation learning techniques.
Our second important finding is that our re-implementation of existing segmentation-dependent methods yielded performance levels on CBIS-DDSM inferior to those reported on the original evaluation datasets in the literature. For instance, Huo et al. reported an AZ of 0.88 using only the spiculation measure feature, while our re-implementation achieved an AZ of only 0.53 on CBIS-DDSM using that same feature. Additionally, while Sahiner et al. reported an AZ of 0.91, we find an AZ of only 0.75 on CBIS-DDSM using our re-implementation of this method. There exist several possible explanations for these discrepancies. First, the technique used to provide mass lesion segmentations for DDSM was not the same as that originally used in any of the segmentation-dependent algorithms, which could affect their efficacy. We propose that further investigating this sensitivity would be a productive direction for future work. Second, as described in the Methods section, the literature does not always describe existing methods in sufficient detail to ensure an exact re-implementation, and the original code is rarely available, meaning that there likely exist differences in implementation between our study and the original work. Third, segmentation-dependent techniques in the literature are often tuned on small datasets, such as 95 images from 68 patients for Huo et al.7 and 168 mammograms from 72 patients for Sahiner et al.6 While we do tune salient parameters for these methods as described in Methods, other preprocessing choices made in the literature (e.g. neighborhood size for Huo et al.) could affect these models’ ability to transfer to new datasets. These results reinforce the importance of performing algorithm assessment on large public datasets and exerting consistent effort to publicly release new datasets for evaluation as acquisition hardware and software change.
Our study has several important limitations, some of which have been previously mentioned. First, the comparison we have performed between our CBIS-DDSM results and those reported in the literature is imperfect, as we were not able to acquire the complete code for every method. That said, we use standard implementations of CNN and BoVW methods,10,11 use code provided by Sahiner for LDA-RBST,6 and had two separate researchers implement NN-HDF to ensure repeatability; code for each method can be made available upon request. Furthermore, our results in Figure 4 indicate that our implementation for isolating the margin and periphery – two key parameters of NN-HDF – yields appropriate results (cf. the original paper7). Another limitation of our work is that we utilize the segmentations provided by CBIS-DDSM because code for the segmentation methods used in Sahiner et al. and Huo et al. was unavailable; while we believe this to be a reasonable approach, the difference in segmentation methods could affect the performance of these two techniques. A final limitation of our study is that the DDSM dataset is itself an old collection comprising scanned film mammography. Modern mammography is digital, and the results of the methods described in this paper could be different if we used a digital mammography collection. Note that this caveat is not confined to segmentation-dependent methods, as segmentation-free deep learning methods in particular carry the potential to focus on features that are semantically nonsensical as a result of data-driven feature learning, and rigorous evaluation procedures should be utilized to ensure that clinically reasonable features and spatial regions are being utilized. Given the scarcity of public, freely available collections of digital mammography images, and because many prior works have used the DDSM for evaluation, we have chosen to use the CBIS-DDSM collection as the basis of the present work.
In the future, it would be helpful to evaluate all CADx methods on digital mammography datasets should they become available, and to remain keenly attuned to the potential for confounding variables such as image quality, latent subsets in the data, and label inconsistency to result in flawed assessments of classifier performance.
CONCLUSIONS
In this work, we use the public CBIS-DDSM dataset to compare the performance of multiple segmentation-free and segmentation-dependent CADx algorithms using a common evaluation standard. We find that segmentation-free representation learning techniques such as BoVW and CNN are able to equal or outperform re-implementations of segmentation-dependent CADx algorithms on CBIS-DDSM. If verified on larger populations, the use of segmentation-free techniques could increase the positive impact of CADx systems on clinical workflows by minimizing the amount of clinician time and precision required to utilize them effectively. We also observe that segmentation-dependent CADx algorithms do not perform as well on CBIS-DDSM as on the original evaluation datasets in the literature, implying that some combination of differences in segmentation approach, variations in implementation, and an underlying lack of generalizability is affecting algorithm performance. It is our hope that this work provides motivation for further study of different mass classification algorithms using public datasets, which would greatly benefit both clinical and scientific communities.
Highlights.
Segmentation-based and segmentation-free breast mass analysis approaches are compared
Comparison on common, public datasets is crucial for comparative algorithm analysis
Modern representation learning techniques often outperform manual feature engineering
More public evaluation data -- e.g. for digital mammograms -- would be beneficial
Acknowledgments
Y.S., K.J.H. and C.H. acknowledge the financial support by the NIH Eunice Kennedy Shriver National Institute of Child Health and Human Development (grant R01HD086325). K.J.H. would like to acknowledge financial support from Nanyang Technological University (start-up grant M4082428.050). C.H. would also like to acknowledge financial support from Nanyang Technological University (start-up grant M4082352.050) and the Ministry of Education, Singapore, under its Academic Research Fund Tier 1 (M4012229.050). The computational work for this article was fully performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg).
Footnotes
CONFLICT OF INTEREST STATEMENT
The authors have no relevant conflicts of interest to disclose.
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Contributor Information
Rebecca Sawyer Lee, Stanford University Biomedical Informatics Training Program.
Jared A. Dunnmon, Stanford University Department of Computer Science.
Ann He, Stanford University Department of Computer Science.
Siyi Tang, Stanford University Department of Electrical Engineering.
Christopher Ré, Stanford University Department of Computer Science.
Daniel L. Rubin, Stanford University Departments of Radiology and Biomedical Data Science.
REFERENCES
- 1. International Agency for Research on Cancer, World Health Organization. Breast Cancer Estimated Incidence, Mortality and Prevalence Worldwide in 2012.
- 2. Berry DA, Cronin KA, Plevritis SK, et al. Effect of Screening and Adjuvant Therapy on Mortality from Breast Cancer. N Engl J Med. 2005;353(17):1784–1792. doi: 10.1056/NEJMoa050518
- 3. American Cancer Society. Breast Cancer: Facts and Figures 2015–2016.
- 4. Fuller MS, Lee CI, Elmore JG. Breast Cancer Screening: An Evidence-Based Update. Med Clin North Am. 2016;99(3):451–468. doi: 10.1016/j.mcna.2015.01.002
- 5. Henriksen EL, Carlsen JF, Mm Vejborg I, Nielsen MB, Lauridsen CA. The efficacy of using computer-aided detection (CAD) for detection of breast cancer in mammography screening: a systematic review. doi: 10.1177/0284185118770917
- 6. Sahiner B, Chan HP, Petrick N, Helvie MA, Goodsitt MM. Computerized characterization of masses on mammograms: the rubber band straightening transform and texture analysis. Med Phys. 1998;25(4):516–526. doi: 10.1118/1.598228
- 7. Huo Z, Giger ML, Vyborny CJ, Wolverton DE, Schmidt RA, Doi K. Automated Computerized Classification of Malignant and Benign Masses on Digitized Mammograms. Acad Radiol. 1998;5(3):155–168. doi: 10.1016/S1076-6332(98)80278-X
- 8. Kooi T, Litjens G, van Ginneken B, et al. Large scale deep learning for computer aided detection of mammographic lesions. Med Image Anal. 2017;35:303–312. doi: 10.1016/J.MEDIA.2016.07.007
- 9. Hamidinekoo A, Denton E, Rampun A, Honnor K, Zwiggelaar R. Deep learning in mammography and breast histology, an overview and future trends. Med Image Anal. 2018;47:45–67. doi: 10.1016/j.media.2018.03.006
- 10. Vedaldi A, Zisserman A. Efficient Additive Kernels via Explicit Feature Maps. IEEE Trans Pattern Anal Mach Intell. 2012;34(3):480–492. https://www.robots.ox.ac.uk/~vgg/publications/2011/Vedaldi11/vedaldi11.pdf. Accessed June 4, 2018.
- 11. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. IEEE CVPR. 2017;(2):2261–2269. doi: 10.1109/CVPR.2017.243
- 12. Rangayyan RM, Mudigonda NR, Desautels JE. Boundary modelling and shape analysis methods for classification of mammographic masses. Med Biol Eng Comput. 2000;38(5):487–496. doi: 10.1007/BF02345742
- 13. Mudigonda NR, Rangayyan RM, Desautels JE. Gradient and texture analysis for the classification of mammographic masses. IEEE Trans Med Imaging. 2000;19(10):1032–1043. doi: 10.1109/42.887618
- 14. Sahiner B, Chan HP, Petrick N, Helvie MA, Hadjiiski LM. Improvement of mammographic mass characterization using spiculation measures and morphological features. Med Phys. 2001;28(7):1455–1465. doi: 10.1118/1.1381548
- 15. Bozek J, Kallenberg M, Grgic M, Karssemeijer N. Use of volumetric features for temporal comparison of mass lesions in full field digital mammograms. Med Phys. 2014;41(2):021902. doi: 10.1118/1.4860956
- 16. Görgel P, Sertbas A, Uçan ON. Computer-aided classification of breast masses in mammogram images based on spherical wavelet transform and support vector machines. Expert Syst. 2015;32(1):155–164. doi: 10.1111/exsy.12073
- 17. Brzakovic D, Luo XM, Brzakovic U. An approach to automated detection of tumors in mammograms. IEEE Trans Med Imaging. 1990;9(3):233–241. doi: 10.1109/42.57760
- 18. Timp S, Varela C, Karssemeijer N. Temporal change analysis for characterization of mass lesions in mammography. IEEE Trans Med Imaging. 2007;26(7):945–953. doi: 10.1109/TMI.2007.897392
- 19. Ganesan K, Acharya UR, Chua CK, Min LC, Abraham TK. Automated Diagnosis of Mammogram Images of Breast Cancer Using Discrete Wavelet Transform and Spherical Wavelet Transform Features: A Comparative Study. Technol Cancer Res Treat. 2014;13(6):605–615. doi: 10.7785/tcrtexpress.2013.600262
- 20. Choi JY, Kim DH, Plataniotis KN, Ro YM. Classifier ensemble generation and selection with multiple feature representations for classification applications in computer-aided detection and diagnosis on mammography. Expert Syst Appl. 2016;46:106–121. doi: 10.1016/j.eswa.2015.10.014
- 21. Oliver A, Freixenet J, Martí J, et al. A review of automatic mass detection and segmentation in mammographic images. 2009. doi: 10.1016/j.media.2009.12.005
- 22. Jamieson AR, Drukker K, Giger ML. Breast image feature learning with adaptive deconvolutional networks. 2012:831506. doi: 10.1117/12.910710
- 23. Liu B, Jiang Y. A multitarget training method for artificial neural network with application to computer-aided diagnosis. Med Phys. 2013;40(1):011908. doi: 10.1118/1.4772021
- 24. Li XZ, Williams S, Lee G, Deng M. Computer-aided mammography classification of malignant mass regions and normal regions based on novel texton features. 2012 12th Int Conf Control Autom Robot Vision, ICARCV 2012. 2012;2012(December):1431–1436. doi: 10.1109/ICARCV.2012.6485399
- 25. Magny SJ, Shikhman R, Keppke AL. Breast Imaging Reporting and Data System. StatPearls Publishing; 2020. http://www.ncbi.nlm.nih.gov/pubmed/29083600. Accessed October 31, 2020.
- 26. Wang Y, Aghaei F, Zarafshani A, Qiu Y, Qian W, Zheng B. Computer-aided classification of mammographic masses using visually sensitive image features. J Xray Sci Technol. 2017;25(1):171–186. doi: 10.3233/XST-16212
- 27. Zhu W, Lou Q, Vang YS, Xie X. Deep multi-instance networks with sparse label assignment for whole mammogram classification. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol 10435 LNCS. Springer Verlag; 2017:603–611. doi: 10.1007/978-3-319-66179-7_69
- 28. Arevalo J, González FA, Ramos-Pollán R, Oliveira JL, Guevara Lopez MA. Representation learning for mammography mass lesion classification with convolutional neural networks. Comput Methods Programs Biomed. 2016;127:248–257. doi: 10.1016/j.cmpb.2015.12.014
- 29. Kim EK, Kim HE, Han K, et al. Applying Data-driven Imaging Biomarker in Mammography for Breast Cancer Screening: Preliminary Study. Sci Rep. 2018;8(1):1–8. doi: 10.1038/s41598-018-21215-1
- 30. Ribli D, Horváth A, Unger Z, Pollner P, Csabai I. Detecting and classifying lesions in mammograms with Deep Learning. Sci Rep. 2018;8(1):1–7. doi: 10.1038/s41598-018-22437-z
- 31. Al-masni MA, Al-antari MA, Park JM, et al. Simultaneous detection and classification of breast masses in digital mammograms via a deep learning YOLO-based CAD system. Comput Methods Programs Biomed. 2018;157:85–94. doi: 10.1016/j.cmpb.2018.01.017
- 32. Lotter W, Sorensen G, Cox D. A multi-scale CNN and curriculum learning strategy for mammogram classification. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol 10553 LNCS. Springer Verlag; 2017:169–177. doi: 10.1007/978-3-319-67558-9_20
- 33. Ting FF, Tan YJ, Sim KS. Convolutional neural network improvement for breast cancer classification. Expert Syst Appl. 2019;120:103–115. doi: 10.1016/j.eswa.2018.11.008
- 34. Chougrad H, Zouaki H, Alheyane O. Deep Convolutional Neural Networks for breast cancer screening. Comput Methods Programs Biomed. 2018;157:19–30. doi: 10.1016/j.cmpb.2018.01.011
- 35. Ragab DA, Sharkas M, Marshall S, Ren J. Breast cancer detection using deep convolutional neural networks and support vector machines. PeerJ. 2019;2019(1):e6201. doi: 10.7717/peerj.6201
- 36. Li H, Chen D, Nailon WH, Davies ME, Laurenson D. Dual Convolutional Neural Networks for Breast Mass Segmentation and Diagnosis in Mammography. August 2020. http://arxiv.org/abs/2008.02957. Accessed October 31, 2020.
- 37. Tsochatzidis L, Costaridou L, Pratikakis I. Deep Learning for Breast Cancer Diagnosis from Mammograms—A Comparative Study. J Imaging. 2019;5(3):37. doi: 10.3390/jimaging5030037
- 38. Chougrad H, Zouaki H, Alheyane O. Multi-label transfer learning for the early diagnosis of breast cancer. Neurocomputing. 2020;392:168–180. doi: 10.1016/j.neucom.2019.01.112
- 39. Chen Y, Zhang Q, Wu Y, Liu B, Wang M, Lin Y. Fine-tuning ResNet for breast cancer classification from mammography. In: Lecture Notes in Electrical Engineering. Vol 536. Springer Verlag; 2019:83–96. doi: 10.1007/978-981-13-6837-0_7
- 40. Falconi LG, Perez M, Aguilar WG, Conci A. Transfer learning and fine tuning in breast mammogram abnormalities classification on CBIS-DDSM database. Adv Sci Technol Eng Syst. 2020;5(2):154–165. doi: 10.25046/aj050220
- 41. Alkhaleefah M, Kumar Chittem P, Achhannagari VP, Ma SC, Chang YL. The Influence of Image Augmentation on Breast Lesion Classification Using Transfer Learning. In: 2020 International Conference on Artificial Intelligence and Signal Processing, AISP 2020. Institute of Electrical and Electronics Engineers Inc.; 2020. doi: 10.1109/AISP48273.2020.9073516
- 42. Shu X, Zhang L, Wang Z, Lv Q, Yi Z. Deep Neural Networks with Region-Based Pooling Structures for Mammographic Image Classification. IEEE Trans Med Imaging. 2020;39(6):2246–2255. doi: 10.1109/TMI.2020.2968397
- 43. Samala RK, Chan HP, Hadjiiski LM, Helvie MA, Richter CD. Generalization error analysis for deep convolutional neural network with transfer learning in breast cancer diagnosis. Phys Med Biol. 2020;65(10):105002. doi: 10.1088/1361-6560/ab82e8
- 44. Gossmann A, Cha KH, Sun X. Performance deterioration of deep neural networks for lesion classification in mammography due to distribution shift: an analysis based on artificially created distribution shift. In: Hahn HK, Mazurowski MA, eds. Medical Imaging 2020: Computer-Aided Diagnosis. Vol 11314. SPIE; 2020:3. doi: 10.1117/12.2551346
- 45. Beltran-Perez C, Wei HL, Rubio-Solis A. Generalized Multiscale RBF Networks and the DCT for Breast Cancer Detection. Int J Autom Comput. 2020;17(1):55–70. doi: 10.1007/s11633-019-1210-y
- 46. Ansar W, Shahid AR, Raza B, Dar AH. Breast cancer detection and localization using mobilenet based transfer learning for mammograms. In: Communications in Computer and Information Science. Vol 1187 CCIS. Springer; 2020:11–21. doi: 10.1007/978-3-030-43364-2_2
- 47. de Vriendt M, Sellars P, Aviles-Rivero AI. The GraphNet Zoo: An All-in-One Graph Based Deep Semi-supervised Framework for Medical Image Classification. In: LNCS. Vol 12443. Springer, Cham; 2020:187–197. doi: 10.1007/978-3-030-60365-6_18
- 48. Lee RS, Gimenez F, Hoogi A, Miyake KK, Gorovoy M, Rubin DL. The Curated Breast Imaging Subset of the Digital Database for Screening Mammography. 2015. doi: 10.7937/K9/TCIA.2016.7O02S9CY
- 49. Lee RS, Gimenez F, Hoogi A, Miyake KK, Gorovoy M, Rubin DL. A curated mammography data set for use in computer-aided detection and diagnosis research. Sci Data. 2017;4:170177. doi: 10.1038/sdata.2017.177
- 50. Newton-Cheh C, Johnson T, Gateva V, et al. Genome-wide association study identifies eight loci associated with blood pressure. Nat Genet. 2009;41(6):666–676. doi: 10.1038/ng.361
- 51. Chan H, Wei D, Helvie MA, et al. Computer-aided classification of mammographic masses and normal tissue: linear discriminant analysis in texture feature space. Phys Med Biol. 1995;40(5):857–876. doi: 10.1088/0031-9155/40/5/010
- 52. Vedaldi A, Fulkerson B. VLFeat: An Open and Portable Library of Computer Vision Algorithms. 2008.
- 53. Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw. 2010;33(1). doi: 10.18637/jss.v033.i01
- 54. Gulshan V, Peng L, Coram M, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA. 2016;316(22):2402. doi: 10.1001/jama.2016.17216
- 55. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks. 2017. doi: 10.1038/nature21056
- 56. Shin H-C, Roth HR, Gao M, et al. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Trans Med Imaging. 2016;35(5):1285–1298. doi: 10.1109/TMI.2016.2528162
- 57. Deng Jia, Dong Wei, Socher R, Li-Jia Li, Li Kai, Fei-Fei Li. ImageNet: A large-scale hierarchical image database. IEEE CVPR. June 2009:248–255. doi: 10.1109/CVPR.2009.5206848
- 58. Dunnmon JA, Yi D, Langlotz CP, Re C, Rubin DL, Lungren MP. Assessment of Convolutional Neural Networks for Automated Classification of Chest Radiographs. Radiology. 2019;290(2):537–544. doi: 10.1148/radiol.2018181422
- 59. Ratner AJ, Ehrenberg HR, Hussain Z, Dunnmon J, Re C. Learning to Compose Domain-Specific Transformations for Data Augmentation. September 2017. http://arxiv.org/abs/1709.01643. Accessed October 2, 2017.
- 60. Huo Z, Giger ML, Vyborny CJ, et al. Analysis of spiculation in the computerized classification of mammographic masses. Med Phys. 1995;22(10):1569–1579. doi: 10.1118/1.597626
- 61. Liu J, Ma M, Wu C, Tai J. Tests of equivalence and non-inferiority for diagnostic accuracy based on the paired areas under ROC curves. Stat Med. 2006;25(7):1219–1238. doi: 10.1002/sim.2358
- 62. Hanley JA, McNeil BJ. A Method of Comparing the Areas under Receiver Operating Characteristic Curves Derived from the Same Cases. Radiology. 1983;148(3):839–843. https://pubs.rsna.org/doi/pdf/10.1148/radiology.148.3.6878708. Accessed March 25, 2018.
- 63. Du Z, Hao Y. rocNIT: Non-Inferiority Test for Paired ROC Curves. 2016. doi: 10.1002/sim.2358
- 64. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning Deep Features for Discriminative Localization. In: IEEE CVPR. 2016:2921–2929. https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf. Accessed March 30, 2018.
- 65. Lowe DG. Object recognition from local scale-invariant features. Proc Seventh IEEE Int Conf Comput Vis. 1999;2. doi: 10.1109/ICCV.1999.790410
- 66. Nguyen N, Caruana R. Consensus Clusterings. In: Seventh IEEE International Conference on Data Mining. 2007:607–612. doi: 10.1109/ICDM.2007.73
- 67. Strehl A, Ghosh J. Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions. J Mach Learn Res. 2002;3:583–617.
- 68. Giecold G. Cluster Ensembles.