Scientific Reports. 2020 Jun 29;10:10536. doi: 10.1038/s41598-020-67441-4

A deep learning methodology for improved breast cancer diagnosis using multiparametric MRI

Qiyuan Hu 1, Heather M Whitney 1,2, Maryellen L Giger 1
PMCID: PMC7324398  PMID: 32601367

Abstract

Multiparametric magnetic resonance imaging (mpMRI) has been shown to improve radiologists’ performance in the clinical diagnosis of breast cancer. This machine learning study develops a deep transfer learning computer-aided diagnosis (CADx) methodology to diagnose breast cancer using mpMRI. The retrospective study included clinical MR images of 927 unique lesions from 616 women. Each MR study included a dynamic contrast-enhanced (DCE)-MRI sequence and a T2-weighted (T2w) MRI sequence. A pretrained convolutional neural network (CNN) was used to extract features from the DCE and T2w sequences, and support vector machine classifiers were trained on the CNN features to distinguish between benign and malignant lesions. Three methods that integrate the sequences at different levels (image fusion, feature fusion, and classifier fusion) were investigated. Classification performance was evaluated using the receiver operating characteristic (ROC) curve and compared using the DeLong test. The single-sequence classifiers yielded areas under the ROC curves (AUCs) [95% confidence intervals] of AUCDCE = 0.85 [0.82, 0.88] and AUCT2w = 0.78 [0.75, 0.81]. The multiparametric schemes yielded AUCImageFusion = 0.85 [0.82, 0.88], AUCFeatureFusion = 0.87 [0.84, 0.89], and AUCClassifierFusion = 0.86 [0.83, 0.88]. The feature fusion method statistically significantly outperformed using DCE alone (P < 0.001). In conclusion, the proposed deep transfer learning CADx method for mpMRI may improve diagnostic performance by reducing the false positive rate and improving the positive predictive value in breast imaging interpretation.

Subject terms: Breast cancer, Cancer imaging, Computational science, Computer science, Information technology, Software, Statistics

Introduction

Breast magnetic resonance imaging (MRI) has been reported to be a highly sensitive imaging modality for breast cancer detection and characterization1. Dynamic contrast-enhanced (DCE)-MRI offers morphological and functional lesion information with excellent sensitivity and variable specificity for breast cancer diagnosis2. Moderate specificity may lead to unnecessary subsequent patient work-up and biopsies, which may contribute to anxiety in patients awaiting biopsy results that ultimately indicate benignity. To overcome this limitation and assess additional functional information, approaches that examine other MRI sequences alongside DCE-MRI images have been implemented in the routine clinical interpretation of breast MRI exams over recent decades2,3. This approach is referred to as multiparametric MRI (mpMRI), in which T2-weighted (T2w) MRI is a commonly used additional sequence. Studies have shown that incorporating the T2w sequence during interpretation is useful in the differential diagnosis of benign and malignant lesions4–6. For example, fibroadenomas, a type of benign lesion that can exhibit contrast agent enhancement similar to that of malignant lesions on T1-weighted DCE-MRI, usually have high signal intensity on T2w images compared with malignant lesions4.

In order to assist radiologists in the interpretation of diagnostic imaging, computer-aided diagnosis (CADx) systems continue to be developed with artificial intelligence (AI) techniques to potentially improve the accuracy of evaluating suspicious breast lesions7. Multiparametric CADx schemes using multiple MRI protocols have also started to be explored as MRI technology advances8–11. In this study, we propose and evaluate the performance of three AI-integrated multiparametric CADx methods that incorporate the complementary information provided in DCE and T2w MRI protocols in the task of distinguishing between benign and malignant breast lesions. In addition, we compare their performances with those of the two single-sequence methods, i.e., DCE alone and T2w alone.

We employ a deep transfer learning methodology that extracts and pools low- to mid-level features using a pretrained convolutional neural network (CNN) and performs classification using a support vector machine (SVM). We explore integrating the information from the DCE and T2w MRI sequences at three different levels of the classification framework, namely via (i) input of the multiparametric images directly to the CNN (image level), (ii) input of the CNN features extracted from DCE and T2w into a multiparametric classifier (feature level), and (iii) aggregation of the outputs of the DCE SVM and T2w SVM (classifier output level). We believe that this is the first comprehensive study of mpMRI schemes, and our methodologies demonstrate strong potential in utilizing information from mpMRI to estimate the probability of breast lesion malignancy without the need for preprocessing, image registration, large datasets, or long training times.

Methods

Study participants

The study was approved by the Institutional Review Board (IRB) of the University of Chicago and followed Health Insurance Portability and Accountability Act (HIPAA)-compliant protocols. The database was retrospectively collected under the above-mentioned protocols, and all procedures were conducted in accordance with relevant guidelines and regulations. The requirement for informed consent was waived because all clinical information and images in this study were de-identified to the investigators. The MRI exams in the database were consecutively acquired over the span of eight years, from 2005 to 2013, imaged at a single institution. Exclusion criteria included MRI studies that did not exhibit a visible lesion, lesions that did not have validation of the final diagnosis, or lesions that could not be clearly allocated to either the benign or malignant category. A total of 927 unique breast lesions from 616 women (mean age 55.0 ± 12.8 years; age range 23–89 years) were ultimately included in this study.

Of all lesions, 199 were benign (21%) and 728 were malignant (79%). For all lesions clinically categorized at MRI as Breast Imaging Reporting and Data System (BI-RADS) category 4, 5, or 6, malignant/benign status was confirmed by histopathology. For all lesions clinically categorized at MRI as BI-RADS category 2 or 3, benign diagnosis was confirmed by MRI follow-up of at least 24 months. Lesions were thus labeled as either benign or malignant based on pathology and radiology reports. Images in the database were acquired using either 1.5 T or 3 T Philips Achieva scanners with a T1-weighted spoiled gradient sequence and a T2-weighted turbo spin echo sequence without fat suppression. Therefore, each MR study contained a DCE-MRI sequence and a T2w MRI sequence acquired during the same exam. The temporal resolution for each dynamic acquisition in the DCE sequence was 60 s. Image slice thickness varied across the dataset and across the two sequences. The slice thickness was consistent across the two sequences (i.e., DCE and T2w) in 96% of the exams, while the in-plane resolution was consistent across the two sequences in 46% of the exams. Figure 1 shows the distribution of slice thickness and in-plane resolution of images in this dataset. Clinical characteristics of the dataset are detailed in Table 1.

Figure 1.

Figure 1

Distribution of slice thickness and in-plane resolution of the dynamic contrast-enhanced (DCE) sequences and T2-weighted (T2w) sequences in the multiparametric MRI database.

Table 1.

Clinical characteristics of the dataset.

Benign/malignant prevalence Benign: 199 (21.5)
Malignant: 728 (78.5)
Age (years): mean ± standard deviation 55.0 ± 12.8
Unknown: 97
Benign lesion characteristics
Lesion size (mm) Mean: 8.86
Median: 7.33
Range: 3.38–42.8
Lesion subtypes Fibroadenoma: 60 (30.2)
Columnar change: 15 (7.5)
Papilloma: 13 (6.5)
Parenchyma tissue: 12 (6.0)
Fibrotic tissue: 10 (5.0)
Hyperplasia: 8 (4.0)
Cystic change: 6 (3.0)
Fat necrosis: 5 (2.5)
Other: 27 (13.6)
Unknown: 43 (21.6)
Malignant lesion characteristics
Lesion size (mm) Mean: 17.9
Median: 14.9
Range: 3.37–73.7
Lesion subtypes IDC: 147 (20.2)
DCIS: 120 (16.5)
IDC + DCIS: 359 (49.3)
ILC: 31 (4.3)
ILC mixed: 26 (3.6)
Other: 33 (4.5)
Unknown: 12 (1.6)
Estrogen receptor status Positive: 410 (56.3)
Negative: 128 (17.6)
Unknown: 190 (26.1)
Progesterone receptor status Positive: 352 (48.4)
Negative: 184 (25.3)
Unknown: 192 (26.4)
HER-2 status Positive: 87 (12.0)
Negative: 404 (55.5)
Equivocal: 5 (0.7)
Unknown: 232 (31.9)

Numbers in parentheses are percentages. Patient age is summarized on a patient basis, and lesion information (malignancy status and subtypes) is summarized on a lesion basis.

For some subjects, only the decade of age was available (e.g., “60s”) as part of the patient information deidentification process. In these situations, the middle of the decade was used for the calculation of the mean subject age. Lesion size is measured by the effective diameter, i.e., the diameter of a sphere with the same volume as the lesion.

IDC invasive ductal carcinoma, DCIS Ductal carcinoma in situ, ILC invasive lobular carcinoma, HER-2 human epidermal growth factor receptor 2.

Single-sequence methods

Figure 2 schematically shows the machine learning classification and evaluation process for both single-sequence and mpMRI schemes.

Figure 2.

Figure 2

Lesion classification pipeline based on diagnostic images. Information from dynamic contrast-enhanced (DCE) and T2-weighted (T2w) MRI sequences is incorporated in three different ways: image fusion—fusing DCE and T2w images to create an RGB composite image; feature fusion—merging convolutional neural network features extracted from DCE and T2w as the support vector machine (SVM) classifier input; and classifier fusion—aggregating the probability of malignancy output from the DCE and T2w classifiers via soft voting. MIP maximum intensity projection, ROI region of interest, ROC receiver operating characteristic.

Lesions were segmented using a fuzzy C-means method requiring only the manual indication of a seed point12. Lesion segmentations were not directly used as input to the CNN, but enabled the automatic region of interest (ROI) construction described below. To capture the 4D (volumetric and temporal) characteristics of the lesions from the DCE sequence, maximum intensity projection (MIP) images of the second post-contrast subtraction DCE-MRI series were used as the input to a deep learning network13. The second post-contrast timepoint was chosen because the BI-RADS atlas defines the initial phase of enhancement as the first two minutes after contrast administration, which has diagnostic utility for distinguishing benign and malignant breast lesions14. From the T2w sequence of each lesion, the slice that contained the largest lesion area according to the automatic lesion segmentation was selected as the representative center slice, which was used as the input to a deep learning network. The T2w center slice was rescaled using bicubic interpolation to match the in-plane resolution of its corresponding DCE sequence. To avoid confounding contributions from distant voxels, an ROI around each lesion was cropped from the image for use in the subsequent classification process. The ROI size was chosen based on the maximum dimension of each lesion and was held constant across sequences. A small part of the parenchyma, three pixels wide around the lesion, was included in each ROI. Appropriate shifts in the coordinates were applied to ensure that the DCE and T2w ROIs were cropped from the same location relative to the lesion.
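As an illustration (not the study's exact implementation), the preprocessing described above could be sketched as follows; the function names and inputs (pre- and post-contrast volumes, a lesion mask, and pixel spacings) are hypothetical, and SciPy's cubic spline zoom stands in for the bicubic rescaling.

```python
import numpy as np
from scipy.ndimage import zoom

def dce_subtraction_mip(pre_contrast, second_post_contrast):
    """Maximum intensity projection (MIP) of the second post-contrast subtraction volume.
    Inputs are 3-D arrays (slices, rows, cols) from the same DCE exam (hypothetical names)."""
    subtraction = second_post_contrast.astype(np.float32) - pre_contrast.astype(np.float32)
    return subtraction.max(axis=0)  # project along the slice direction

def rescale_t2w(t2w_center_slice, t2w_spacing, dce_spacing):
    """Rescale the T2w center slice to the DCE in-plane resolution (cubic interpolation)."""
    factors = (t2w_spacing[0] / dce_spacing[0], t2w_spacing[1] / dce_spacing[1])
    return zoom(t2w_center_slice, factors, order=3)

def crop_roi(image, lesion_mask, margin_px=3):
    """Crop a square ROI sized by the lesion's maximum dimension plus a 3-pixel parenchyma margin."""
    rows, cols = np.nonzero(lesion_mask)
    half = max(rows.max() - rows.min(), cols.max() - cols.min()) // 2 + margin_px
    rc, cc = int(rows.mean()), int(cols.mean())
    r0, r1 = max(rc - half, 0), min(rc + half + 1, image.shape[0])
    c0, c1 = max(cc - half, 0), min(cc + half + 1, image.shape[1])
    return image[r0:r1, c0:c1]
```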

Through transfer learning, CNN features were extracted separately from the ROIs of the DCE subtraction MIPs and the ROIs of the T2w center slices using the publicly available VGG19 model15, pretrained on ImageNet16. The pretrained VGG19 network, which takes three-channel (red, green, and blue, or RGB) input images, has previously been shown to be useful in transfer learning for breast lesion analyses13,17,18. For the single-sequence DCE and T2w image datasets, the ROIs were grayscale and were duplicated across the three channels. Feature vectors were extracted at various network depths from the five max-pooling layers of the VGGNet. These features were then average-pooled along the spatial dimensions and normalized by their Euclidean (L2) norm. The pooled features were then concatenated to form a CNN feature vector of 1,472 features for a given lesion17,18.
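A minimal sketch of this feature extraction, assuming a TensorFlow/Keras implementation of the ImageNet-pretrained VGG19 (the study's exact framework is not restated here), is shown below; the 64 + 128 + 256 + 512 + 512 pooled channels give the 1,472-dimensional feature vector.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input

# Expose the five max-pooling layers of an ImageNet-pretrained VGG19 as model outputs.
base = VGG19(weights="imagenet", include_top=False)
pool_outputs = [base.get_layer(f"block{i}_pool").output for i in range(1, 6)]
extractor = tf.keras.Model(inputs=base.input, outputs=pool_outputs)

def cnn_feature_vector(roi_gray):
    """roi_gray: 2-D grayscale ROI (assumed name), duplicated across the three RGB channels."""
    rgb = np.repeat(roi_gray[..., np.newaxis], 3, axis=-1).astype(np.float32)
    x = preprocess_input(rgb[np.newaxis])               # shape (1, H, W, 3)
    features = []
    for fmap in extractor(x):
        pooled = tf.reduce_mean(fmap, axis=[1, 2])      # average-pool over the spatial dims
        pooled = tf.math.l2_normalize(pooled, axis=-1)  # normalize by the Euclidean (L2) norm
        features.append(pooled.numpy().ravel())
    return np.concatenate(features)                     # 64+128+256+512+512 = 1,472 features
```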

Nonlinear SVM classifiers with a Gaussian radial basis function kernel were trained on the CNN features to differentiate between benign and malignant lesions (Python Version 3.4.2, Python Software Foundation)19. SVM was chosen over other classification methods due to its ability to handle sparse high-dimensional data, which is an attribute of the CNN features. To address the problem of class imbalance (i.e., due to the 79% cancer prevalence), a misclassification penalty for cases in the malignant (or benign) class was assigned to be inversely proportional to the malignant (or benign) class prevalence in the training data.
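In scikit-learn terms, such a classifier could be written as below (a sketch with illustrative settings, not the study's exact configuration); class_weight="balanced" assigns per-class penalties inversely proportional to class prevalence in the training data, and the feature/label names are assumptions.

```python
from sklearn.svm import SVC

def train_lesion_classifier(train_features, train_labels):
    """RBF-kernel SVM with prevalence-weighted misclassification penalties.
    train_features: CNN feature vectors; train_labels: 0 = benign, 1 = malignant (assumed)."""
    svm = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced", probability=True)
    svm.fit(train_features, train_labels)
    return svm
```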

Evaluation of single-sequence methods

Each SVM classifier was trained and evaluated using nested fivefold cross-validation, where the inner cross-validation was used for hyperparameter tuning and the outer cross-validation was used for training and testing, resulting in a 64%/16%/20% split into independent training, validation, and test sets, respectively. Class prevalence was held constant across the five cross-validation folds, and all lesions from the same patient were kept together in the same fold in order to eliminate the impact of using correlated lesions for both training and testing. The training set was standardized to zero mean and unit variance, and the test set was standardized using those statistics of the corresponding training set. Principal component analysis fit on the training set was applied to both training and test sets to reduce feature dimensionality20.
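The outer cross-validation loop could be sketched as follows, using scikit-learn's StratifiedGroupKFold (available in recent versions) to keep lesions from the same patient together while holding class prevalence roughly constant; the retained PCA variance and input names are assumptions, and the inner hyperparameter search is shown in the next sketch.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def outer_cv_splits(features, labels, patient_ids, n_splits=5, seed=0):
    """Yield standardized, PCA-reduced train/test folds; lesions from one patient stay in one fold."""
    cv = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in cv.split(features, labels, groups=patient_ids):
        scaler = StandardScaler().fit(features[train_idx])   # statistics from the training fold only
        x_train = scaler.transform(features[train_idx])
        x_test = scaler.transform(features[test_idx])
        pca = PCA(n_components=0.95).fit(x_train)            # retained variance is an assumption
        yield pca.transform(x_train), labels[train_idx], pca.transform(x_test), labels[test_idx]
```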

Within each training/validation fold in the outer cross-validation loop, two SVM hyperparameters, namely the scaling parameter γ and the regularization parameter C, were optimized on a grid search with an internal fivefold cross-validation21. Prediction scores were transformed to posterior probabilities of malignancy (PMs) assuming a scaled prevalence of 50%22. The predictions on the five test folds were aggregated for classification performance evaluation.
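Continuing the sketch above, the inner grid search and one common formulation of prevalence scaling (rescaling the posterior odds from the training prevalence to 50%) might look like this; the grid values are placeholders, not the ones used in the study.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_and_predict(x_train, y_train, x_test):
    """Inner fivefold grid search over C and gamma, then prevalence-scaled posteriors."""
    grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}  # placeholder grid
    search = GridSearchCV(
        SVC(kernel="rbf", class_weight="balanced", probability=True),
        grid, scoring="roc_auc", cv=5)
    search.fit(x_train, y_train)
    p = search.predict_proba(x_test)[:, 1]        # posterior at the training prevalence
    prev = y_train.mean()                         # training prevalence of malignancy
    odds = (p / (1 - p)) * ((1 - prev) / prev)    # rescale the odds to a 50% prevalence
    return odds / (1 + odds)                      # prevalence-scaled probability of malignancy
```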

Classifier performances were evaluated using receiver operating characteristic (ROC) curve analysis, with the area under the ROC curve (AUC) serving as the figure of merit23,24. Standard errors and 95% confidence intervals (CIs) of the AUCs were calculated by bootstrapping the posterior PMs (2,000 bootstrap samples)25. Other clinical metrics, including sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), were also reported for each classifier. These metrics were calculated at the optimal operating point on the ROC curve, determined by minimizing m = (1 − sensitivity)² + (1 − specificity)² (ref. 11).
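A sketch of this evaluation is given below; it uses a simple percentile bootstrap interval (the study used Efron's improved intervals25) and assumed input names.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate(y_true, pm, n_boot=2000, seed=0):
    """AUC, a percentile bootstrap 95% CI, and the operating point minimizing
    m = (1 - sensitivity)^2 + (1 - specificity)^2."""
    y_true, pm = np.asarray(y_true), np.asarray(pm)
    auc = roc_auc_score(y_true, pm)

    rng = np.random.default_rng(seed)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(pm), len(pm))
        if len(np.unique(y_true[idx])) == 2:       # resample must contain both classes
            boot.append(roc_auc_score(y_true[idx], pm[idx]))
    ci = np.percentile(boot, [2.5, 97.5])

    fpr, tpr, thresholds = roc_curve(y_true, pm)
    m = (1 - tpr) ** 2 + fpr ** 2                  # fpr = 1 - specificity
    best = np.argmin(m)
    return auc, ci, {"sensitivity": tpr[best],
                     "specificity": 1 - fpr[best],
                     "cutoff": thresholds[best]}
```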

Multiparametric methods

We explored integrating information from both the DCE and T2w MRI sequences at three different levels of the classification framework, as illustrated in Fig. 2. The three mpMRI schemes are referred to as image fusion, feature fusion, and classifier fusion.

For the input image fusion scheme, a three-channel RGB fusion image was constructed for each lesion by inputting the DCE MIP into the red channel, the T2w center slice into the green channel, and leaving the blue channel of the VGGNet blank. A composite ROI was cropped from the fusion image, which was then input into the pretrained VGG19 network for feature extraction. Figure 3 includes an example to illustrate the process of ROI extraction from MRI images and creating RGB fusion ROIs. The classifier training process then followed the single-sequence methods to predict PMs.
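A minimal sketch of the fusion-image construction is shown below; the array names are assumptions, and the two ROIs are assumed to be co-located, equal-sized, and intensity-scaled for the VGG19 input.

```python
import numpy as np

def make_rgb_fusion(dce_mip_roi, t2w_roi):
    """RGB composite: DCE subtraction MIP in red, T2w center slice in green, blue left blank."""
    fusion = np.zeros(dce_mip_roi.shape + (3,), dtype=np.float32)
    fusion[..., 0] = dce_mip_roi   # red   <- DCE subtraction MIP ROI
    fusion[..., 1] = t2w_roi       # green <- T2w center-slice ROI
    return fusion                  # blue channel stays zero
```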

Figure 3.

Figure 3

Example input images. A dynamic contrast-enhanced (DCE)-MRI transverse second post-contrast subtraction maximum intensity projection (MIP) and a T2-weighted (T2w)-MRI transverse center slice are shown with their corresponding regions of interest (ROIs) extracted. The RGB fusion ROI is created by inputting the DCE ROI into the red channel and the T2w ROI into the green channel.

For the feature fusion scheme, CNN features extracted from DCE and T2w separately were included into an ensemble of features, which was then input to an SVM classifier. The classifier training process then followed the single-sequence methods to yield PMs.

For the classifier fusion scheme, PM outputs from the DCE-based and from the T2w-based single-sequence SVM classifiers were aggregated via soft voting. That is, the DCE and T2w PM outputs were averaged to yield prediction scores.
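Both the feature fusion and classifier fusion steps reduce to simple array operations; a sketch with assumed variable names:

```python
import numpy as np

def fuse_features(dce_features, t2w_features):
    """Feature fusion: concatenate the DCE and T2w CNN feature vectors (1,472 each)
    into a single 2,944-dimensional input for one SVM classifier."""
    return np.concatenate([dce_features, t2w_features], axis=-1)

def fuse_classifiers(pm_dce, pm_t2w):
    """Classifier fusion: soft voting, i.e., averaging the probability-of-malignancy
    outputs of the two single-sequence classifiers."""
    return 0.5 * (np.asarray(pm_dce) + np.asarray(pm_t2w))
```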

In the evaluation of each mpMRI scheme, the same evaluation method was used as for the single-sequence classifiers.

Inter-sequence image registration

A preliminary study was performed to investigate whether image registration between DCE and T2w sequences would improve the performance of the proposed mpMRI classification schemes, especially the image fusion method. The T2w center slices were rescaled to match the in-plane resolution and then registered to the corresponding slice of the second post-contrast DCE image using a multi-modality rigid registration method that consists of translation and rotation26, 27. The same five classification mechanisms were evaluated after image registration.
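As a rough illustration (not the registration software used in the study), a rigid, mutual-information-based 2D registration of a T2w slice to its corresponding DCE slice could be set up with SimpleITK as follows; the inputs are assumed to be sitk.Image objects.

```python
import SimpleITK as sitk

def register_t2w_to_dce(dce_slice, t2w_slice):
    """Rigid 2-D registration (rotation + translation) of a T2w slice to the corresponding
    DCE slice, using a mutual-information metric suitable for multi-modality images."""
    fixed = sitk.Cast(dce_slice, sitk.sitkFloat32)
    moving = sitk.Cast(t2w_slice, sitk.sitkFloat32)

    reg = sitk.ImageRegistrationMethod()
    reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    reg.SetOptimizerAsRegularStepGradientDescent(
        learningRate=1.0, minStep=1e-4, numberOfIterations=200)
    reg.SetInterpolator(sitk.sitkLinear)
    reg.SetInitialTransform(
        sitk.CenteredTransformInitializer(
            fixed, moving, sitk.Euler2DTransform(),
            sitk.CenteredTransformInitializerFilter.GEOMETRY),
        inPlace=False)

    transform = reg.Execute(fixed, moving)
    return sitk.Resample(moving, fixed, transform, sitk.sitkLinear, 0.0)
```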

Statistical analysis

The AUCs from the three mpMRI classification schemes were statistically compared with those from the two single sequence classifiers using the DeLong test28,29. Bonferroni–Holm corrections were used to account for multiple comparisons30, and a corrected P < 0.05 was considered to indicate a statistically significant difference in performance. Equivalence testing was performed to assess if image registration had any effect on the classification performances31. An equivalence margin of difference in AUC = 0.05 was chosen prima facie.
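For the multiple-comparison adjustment, a Holm correction can be applied with statsmodels, for example (placeholder p-values shown for illustration, not the study's values):

```python
from statsmodels.stats.multitest import multipletests

# Holm correction across the six multiparametric-vs-single-sequence AUC comparisons.
raw_p = [0.04, 0.20, 0.003, 0.80, 0.01, 0.045]   # placeholder uncorrected p-values
reject, corrected_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
```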

Finally, to assess the performance reproducibility of the method, the highest performing classifier of the three mpMRI methods was trained and evaluated 100 times using different random seeds for the cross-validation split, and the mean and standard error of AUC was calculated from all the runs.

Results

Classification performance

Figure 4 presents the ROC curves for the five classification schemes without image registration, and Table 2 summarizes the classification performances as measured by AUC, sensitivity, specificity, PPV, and NPV. Note that the mpMRI classifiers achieved improvements in all of these classification performance metrics. Table 3 shows the P-values and the 95% CIs for the comparisons of AUCs between the mpMRI and single-sequence classifiers. All three mpMRI classification schemes yielded statistically significantly higher AUCs than using T2w alone; however, only the feature fusion method significantly outperformed using DCE alone in terms of AUC, whereas the other two methods, image fusion and classifier fusion, failed to demonstrate a statistically significant difference in AUC compared with using DCE alone.

Figure 4.

Figure 4

Fitted binormal receiver operating characteristic (ROC) curves for two single-sequence and three mpMRI classifiers using (i) convolutional neural network (CNN) features extracted from dynamic contrast-enhanced (DCE) subtraction maximum intensity projections (MIPs), (ii) CNN features extracted from T2-weighted (T2w) center slices, (iii) CNN features extracted from DCE and T2w fusion images, (iv) an ensemble of features extracted from DCE and T2w images, and (v) probability of malignancy outputs from the DCE MIP and T2w classifiers aggregated via soft voting. The legend gives the area under the ROC curve (AUC) with standard error (SE) for each classifier scheme. T2w images were rescaled to match the in-plane resolution of their corresponding DCE sequences, but image registration was not performed.

Table 2.

Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the receiver operating characteristic curve (AUC) along with the 95% confidence interval (CI) for AUC for each classifier.

Classifier DCE T2w Image fusion Feature fusion Classifier fusion
AUC [95% CI] 0.85 [0.82, 0.88] 0.78 [0.75, 0.81] 0.85 [0.82, 0.88] 0.87 [0.84, 0.89] 0.86 [0.83, 0.88]
Sensitivity (%) 75.9 69.8 76.5 77.9 77.6
Specificity (%) 76.5 72.7 77.1 78.5 77.1
PPV (%) 89.7 87.3 90.0 90.7 90.1
NPV (%) 54.2 47.3 55.0 56.9 56.2

Sensitivity and specificity are presented for the optimal operating point, i.e., the cut-off value that minimizes m = (1 − sensitivity)² + (1 − specificity)².

Table 3.

Performance comparison for the five classifiers.

Classifier          DCE MIP                                    T2w center slice
Image fusion        P = 0.73; 95% CI ∆AUC = [−0.01, 0.02]      P < 0.001*; 95% CI ∆AUC = [0.05, 0.09]
Feature fusion      P < 0.001*; 95% CI ∆AUC = [0.01, 0.03]     P < 0.001*; 95% CI ∆AUC = [0.06, 0.11]
Classifier fusion   P = 0.28; 95% CI ∆AUC = [−0.00, 0.02]      P < 0.001*; 95% CI ∆AUC = [0.06, 0.09]

The classifier names are shown in the first row (single-parametric) and first column (multiparametric). P-value and 95% confidence interval (CI) of the difference in area under the receiver operating characteristic curves (AUCs) for each comparison are presented in the table, where each multiparametric classifier was compared with each single-parametric classifier using the DeLong test. P-values were corrected for multiple comparisons using Bonferroni–Holm corrections. Asterisks denote significance (P < 0.05) after accounting for multiple comparisons.

In assessing performance reproducibility, the mean and standard error of AUCs from 100 runs for the feature fusion classifier was 0.864 ± 0.003, indicating that the classification performance was very stable regardless of the random seed chosen.

Figures 5 and 6 illustrate the comparison between the PMs predicted by the single-sequence classifiers using DCE and T2w. Figure 5 also shows example lesions on which these two classifiers agree or disagree. While the majority of benign and malignant lesions are separated from the other class, there appears to be moderate disagreement between the two classifiers, suggesting that a fusion technique could likely improve the predictive performance.

Figure 5.

Figure 5

A diagonal classifier agreement plot between the T2-weighted (T2w) and dynamic contrast-enhanced (DCE) single-sequence classifiers. The x-axis and y-axis denote the probability of malignancy (PM) scores predicted by the DCE classifier and the T2w classifier, respectively. Each point represents a lesion for which predictions were made. Points along or near the diagonal from bottom left to top right indicate high classifier agreement; points far from the diagonal indicate low agreement. Examples of lesions on which the two classifiers were in extreme agreement/disagreement are also included.

Figure 6.

Figure 6

Bland–Altman plot illustrating classifier agreement between the dynamic contrast-enhanced (DCE) maximum intensity projection and T2-weighted (T2w)-based single-sequence classifiers. The y-axis shows the difference between the support vector machine output scores (predicted posterior probabilities of malignancy) of the two classifiers; the x-axis shows the mean of the two classifiers’ outputs, which is also the probability of malignancy score calculated in the classifier fusion method.

Inter-sequence image registration

Performing inter-sequence rigid image registration did not have a significant effect on the classification performance of any classification scheme. Namely, the four classifiers affected by the registration (i.e., those that use information from T2w images) yielded AUC values of AUCT2w = 0.79 ± 0.02 (95% CI [0.76, 0.82]), AUCImageFusion = 0.84 ± 0.01 (95% CI [0.81, 0.87]), AUCFeatureFusion = 0.87 ± 0.01 (95% CI [0.84, 0.89]), and AUCClassifierFusion = 0.86 ± 0.01 (95% CI [0.83, 0.88]). Just as when T2w was not registered to DCE, all three mpMRI classification schemes significantly outperformed using T2w alone, while only feature fusion significantly outperformed using DCE alone. According to the 95% CIs of the difference in AUCs (∆AUCs) between performing inter-sequence image registration or not, image registration between T2w and DCE failed to show a statistically significant effect on the performance of any classifier examined. In addition, equivalence testing demonstrated that performance with and without image registration was equivalent within an equivalence margin of ∆AUC = 0.05, chosen prima facie. Thus, all findings held regardless of whether image registration was employed, indicating that registration did not lead to a change in the performance of the mpMRI schemes.

Discussion

The proposed convolutional neural network (CNN)-based multiparametric magnetic resonance imaging (mpMRI) computer-aided diagnosis (CADx) methods, which take advantage of the complementary information provided by dynamic contrast-enhanced (DCE) and T2-weighted (T2w) MRI protocols, demonstrated potential to improve the performance of current CADx schemes in the task of distinguishing between benign and malignant breast lesions. Among the three mpMRI methods examined, the feature fusion method, i.e., using CNN features extracted from both DCE and T2w as the classifier input, significantly outperformed using DCE-MRI alone as in currently available CADx systems. The image fusion method, i.e., fusing the DCE and T2w images into one RGB image prior to input to the VGGNet, and the classifier fusion method, i.e., aggregating the probability of malignancy output from the DCE and T2w classifiers via soft voting, failed to show a statistically significant difference in performance compared with using DCE alone. All three mpMRI schemes statistically significantly outperformed using T2w alone. Furthermore, we demonstrated that image registration of the DCE and T2w images did not affect the classification performances when applied in addition to image resolution matching.

Training CNNs from scratch typically relies on massive datasets and is thus often intractable for medical research due to data scarcity. It has been shown that standard transfer learning techniques, such as fine-tuning or feature extraction based on ImageNet-trained CNNs, can be used for CADx32,33. As a result, deep learning techniques have exhibited strong predictive performance on CADx tasks without requiring massive datasets17,18,34–36. Previous studies have investigated mpMRI CADx for distinguishing between malignant and benign lesions using human-engineered radiomic features8,9,37. However, to the best of our knowledge, few CADx studies have explored mpMRI analysis using deep learning. Dalmis et al. reported an approach of training a 3D CNN from scratch using three MRI protocols and patient information, which yielded an AUC of 0.831 [0.791–0.867]10, while our study explored transfer learning and achieved better performance. Truhn et al. fine-tuned a pretrained residual neural network and achieved an AUC of 0.88 (ref. 11), whereas our study exploited feature extraction, which is less computationally expensive and more suitable for small medical datasets. Note that the CNN input that yielded the best performance in Truhn et al. only contained information from DCE. Nonetheless, their approach was similar to the image fusion method in our study, which is not the optimal mpMRI scheme according to our results. Although additional information would be needed to statistically compare these results from the literature with ours, our approach demonstrated comparable and, in some cases, higher performance than others. We believe that our study is the first comprehensive study to investigate three different deep transfer learning schemes for exploiting multiparametric MRI information for lesion classification. The findings demonstrated the superiority of one method, which can potentially inform future research in this field.

Our method extracts CNN features from the five max-pooling layers at the end of each convolutional block, then average-pools, normalizes, and concatenates them to form the CNN feature vectors17,18. Although a more common way of applying a pretrained CNN to medical images is to extract features from the fully connected layers at the end of the network architecture, that approach requires preprocessing to resize the images to a fixed input size. Our method allows the use of images of various sizes that correspond to the enclosed lesion sizes and takes full advantage of the low- to mid-level features learned by the network.

When performing inter-sequence image registration, we chose to use multi-modality rigid registration consisting of translation and rotation. Scaling, shearing, or deformable registration was not employed because altering the geometry of, and the texture within, the lesions is undesirable for quantitative image analysis. More in-depth registration optimizations can be explored in future studies. Image registration can be computationally expensive and time-consuming. Given that all classifier performances were equivalent with or without image registration, we suggest that image registration might not be a necessary step in this proposed method of distinguishing between benign and malignant breast lesions using mpMRI.

A limitation of this study was the selection of the equivalence margin. The margin in equivalence testing is ideally a predetermined clinically meaningful limit. However, due to complexities and impracticalities in applying the statistical principles of equivalence testing to diagnostic performance studies, there is currently no widely used standard procedure to establish this margin31. Nonetheless, we were able to demonstrate equivalence between all classifier pairs using a rather conservative margin of 5% for ∆AUC. Furthermore, the pretrained CNN requires 2D input, which limited the inclusion of the high-dimensional information contained in breast MRI exams. We chose to capture part of the 4D information in DCE-MRI by using second post-contrast MIP images in this study, and future work will include investigating the optimal approach to including high-dimensional information from medical images in deep transfer learning frameworks. Moreover, while not necessarily a limitation, we reported cross-validation performance scores instead of using a single training/validation/test split. Although a single split would be preferred if the data were abundant, we chose fivefold cross-validation to use the available data more efficiently and obtain higher statistical power. It is important to note that the nested cross-validation scheme resulted in a 64%/16%/20% split into independent training, validation, and test sets within each partition of the outer cross-validation loop, and thus overfitting due to data leakage did not occur. In addition, the dataset for this study was moderately sized and was from a single institution; therefore, the model optimized on our dataset might not be the best solution for a different dataset from another institution or population. The variation in image acquisition parameters in our dataset might also have impacted the results, but we believe that it positively contributed to the generalizability of the method to images acquired under different protocols.

In conclusion, our study proposed a mpMRI approach that significantly outperformed the CADx benchmark that uses DCE alone in the task of distinguishing between benign and malignant breast lesions. Our methodology is computationally efficient and does not require intensive image preprocessing. Future work will expand the analysis to include other valuable MRI sequences, such as diffusion-weighted imaging. In addition, while this study focused on the computational aspect of improving the performance of a CAD system, we would like to perform reader studies in the future to assess our system’s clinical significance when used as a secondary or concurrent reader for radiologists. Furthermore, by increasing the size of our database, the performance could potentially improve with a fine-tuned CNN and a standard training/validation/test split of the data instead of fivefold cross-validation. Finally, we intend to perform validation on independent datasets from other institutions in order to investigate the robustness of the methodology relative to imaging manufacturers, facility protocols, and patient populations.

Acknowledgements

The authors acknowledge other lab members, including Karen Drukker, PhD, MBA, Alexandra Edwards, MA, Hui Li, PhD, and John Papaioannou, MS, Department of Radiology, The University of Chicago, Chicago, IL for their contributions to the datasets and discussions. The work was partially supported by NIH QIN Grant U01CA195564, NIH NCI R15 CA227948, the University of Chicago Comprehensive Cancer Center, and an RSNA/AAPM Graduate Fellowship.

Author contributions

Q.H., H.M.W., and M.L.G. made substantial contributions to the design of the work. M.L.G. oversaw data acquisition and controlled the database. Q.H. performed analysis and interpretation of the data, created software used in the work, and drafted the manuscript. H.M.W. and M.L.G. performed interpretation of resulting calculations and substantively revised the manuscript.

Competing interests

QH declares no competing interests. HMW declares no competing interests. MLG is a stockholder in R2 technology/Hologic and QView, receives royalties from Hologic, GE Medical Systems, MEDIAN Technologies, Riverain Medical, Mitsubishi and Toshiba, and is a cofounder of and equity holder in Quantitative Insights (now Qlarity Imaging). It is the University of Chicago Conflict of Interest Policy that investigators disclose publicly actual or potential significant financial interest that would reasonably appear to be directly and significantly affected by the research activities.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Qiyuan Hu, Email: qhu@uchicago.edu.

Heather M. Whitney, Email: hwhitney@uchicago.edu

Maryellen L. Giger, Email: m-giger@uchicago.edu

References

1. Kuhl CK, et al. Mammography, breast ultrasound, and magnetic resonance imaging for surveillance of women at high familial risk for breast cancer. J. Clin. Oncol. 2005;23:8469–8476. doi: 10.1200/JCO.2004.00.4960.
2. Leithner D, et al. Clinical role of breast MRI now and going forward. Clin. Radiol. 2018;73:700–714. doi: 10.1016/j.crad.2017.10.021.
3. Menezes GLG, Knuttel FM, Stehouwer BL, Pijnappel RM, van den Bosch MAAJ. Magnetic resonance imaging in breast cancer: A literature review and future perspectives. World J. Clin. Oncol. 2014;5:61. doi: 10.5306/wjco.v5.i2.61.
4. Kuhl CK, et al. Do T2-weighted pulse sequences help with the differential diagnosis of enhancing lesions in dynamic breast MRI? J. Magn. Reson. Imaging. 1999;9:187–196. doi: 10.1002/(sici)1522-2586(199902)9:2<187::aid-jmri6>3.0.co;2-2.
5. Santamaría G, et al. Radiologic and pathologic findings in breast tumors with high signal intensity on T2-weighted MR images. Radiographics. 2010;30:533–548. doi: 10.1148/rg.302095044.
6. Westra C, Dialani V, Mehta TS, Eisenberg RL. Using T2-weighted sequences to more accurately characterize breast masses seen on MRI. Am. J. Roentgenol. 2014;202:W183–W190. doi: 10.2214/AJR.13.11266.
7. Giger ML, Karssemeijer N, Schnabel JA. Breast image analysis for risk assessment, detection, diagnosis, and treatment of cancer. Annu. Rev. Biomed. Eng. 2013;15:327–357. doi: 10.1146/annurev-bioeng-071812-152416.
8. Maforo N, Li H, Weiss W, Lan L, Giger M. SU-D-BRA-02: Radiomics of multi-parametric breast MRI in breast cancer diagnosis: A quantitative investigation of diffusion weighted imaging, dynamic contrast-enhanced, and T2-weighted magnetic resonance imaging. Med. Phys. 2015;42:3213. doi: 10.1118/1.4923882.
9. Bhooshan N, et al. Combined use of T2-weighted MRI and T1-weighted dynamic contrast-enhanced MRI in the automated analysis of breast lesions. Magn. Reson. Med. 2011;66:555–564. doi: 10.1002/mrm.22800.
10. Dalmis MU, et al. Artificial intelligence-based classification of breast lesions imaged with a multiparametric breast MRI protocol with ultrafast DCE-MRI, T2, and DWI. Investig. Radiol. 2019;54:325–332. doi: 10.1097/RLI.0000000000000544.
11. Truhn D, et al. Radiomic versus convolutional neural networks analysis for classification of contrast-enhancing lesions at multiparametric breast MRI. Radiology. 2019;290:290–297. doi: 10.1148/radiol.2018181352.
12. Chen W, Giger ML, Bick U. A fuzzy c-means (FCM)-based approach for computerized segmentation of breast lesions in dynamic contrast-enhanced MR images. Acad. Radiol. 2006;13:63–72. doi: 10.1016/j.acra.2005.08.035.
13. Antropova N, Abe H, Giger ML. Use of clinical MRI maximum intensity projections for improved breast lesion classification with deep convolutional neural networks. J. Med. Imaging. 2018;5:14503. doi: 10.1117/1.JMI.5.1.014503.
14. Morris E, Comstock C, Lee C. ACR BI-RADS® magnetic resonance imaging. In ACR BI-RADS® Atlas, Breast Imaging Reporting and Data System. American College of Radiology (2013).
15. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014). https://arxiv.org/abs/1409.1556.
16. Deng J, et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (2009).
17. Huynh BQ, Li H, Giger ML. Digital mammographic tumor classification using transfer learning from deep convolutional neural networks. J. Med. Imaging. 2016;3:34501. doi: 10.1117/1.JMI.3.3.034501.
18. Antropova N, Huynh BQ, Giger ML. A deep feature fusion methodology for breast cancer diagnosis demonstrated on three imaging modality datasets. Med. Phys. 2017;44:5162–5171. doi: 10.1002/mp.12453.
19. Scholkopf B, Smola AJ. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge: MIT Press; 2001.
20. Jolliffe I. Principal Component Analysis. New York: Springer; 2011.
21. Shawe-Taylor J, Sun S. A review of optimization methodologies in support vector machines. Neurocomputing. 2011;74:3609–3618. doi: 10.1016/j.neucom.2011.06.026.
22. Horsch K, Giger ML, Metz CE. Prevalence scaling: Applications to an intelligent workstation for the diagnosis of breast cancer. Acad. Radiol. 2008;15:1446–1457. doi: 10.1016/j.acra.2008.04.022.
23. Metz CE, Herman BA, Shen JH. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Stat. Med. 1998;17:1033–1053. doi: 10.1002/(SICI)1097-0258(19980515)17:9<1033::AID-SIM784>3.0.CO;2-Z.
24. Metz CE, Pan X. “Proper” binormal ROC curves: Theory and maximum-likelihood estimation. J. Math. Psychol. 1999;43:1–33. doi: 10.1006/jmps.1998.1218.
25. Efron B. Better bootstrap confidence intervals. J. Am. Stat. Assoc. 1987;82:171–185. doi: 10.1080/01621459.1987.10478410.
26. Collignon A, et al. Automated multi-modality image registration based on information theory. Inf. Process. Med. Imaging. 1995;3:263–274.
27. Ashburner J, Friston KJ. Rigid body registration. In Statistical Parametric Mapping: The Analysis of Functional Brain Images 49–62 (Academic Press, London, 2007).
28. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics. 1988;44:837–845. doi: 10.2307/2531595.
29. Sun X, Xu W. Fast implementation of DeLong’s algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Process. Lett. 2014;21:1389–1393. doi: 10.1109/LSP.2014.2337313.
30. Holm S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 1979;6:65–70.
31. Ahn S, Park SH, Lee KH. How to demonstrate similarity by using noninferiority and equivalence statistical testing in radiology research. Radiology. 2013;267:328–338. doi: 10.1148/radiol.12120725.
32. Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 3320–3328 (2014).
33. Donahue J, et al. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning 647–655 (2014).
34. Greenspan H, Van Ginneken B, Summers RM. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Trans. Med. Imaging. 2016;35:1153–1159. doi: 10.1109/TMI.2016.2553401.
35. Tajbakhsh N, et al. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Trans. Med. Imaging. 2016;35:1299–1312. doi: 10.1109/TMI.2016.2535302.
36. Shin H-C, et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging. 2016;35:1285–1298. doi: 10.1109/TMI.2016.2528162.
37. Parekh VS, Jacobs MA. Integrated radiomic framework for breast cancer and tumor biology using advanced machine learning and multiparametric MRI. NPJ Breast Cancer. 2017;3:43. doi: 10.1038/s41523-017-0045-3.
