Radiology: Artificial Intelligence. 2023 May 10;5(3):e220159. doi: 10.1148/ryai.220159

Transformer-based Deep Neural Network for Breast Cancer Classification on Digital Breast Tomosynthesis Images

Weonsuk Lee, Hyeonsoo Lee, Hyunjae Lee, Eun Kyung Park, Hyeonseob Nam, Thijs Kooi
PMCID: PMC10245183  PMID: 37293346

Abstract

Purpose

To develop an efficient deep neural network model that incorporates context from neighboring image sections to detect breast cancer on digital breast tomosynthesis (DBT) images.

Materials and Methods

The authors adopted a transformer architecture that analyzes neighboring sections of the DBT stack. The proposed method was compared with two baselines: an architecture based on three-dimensional (3D) convolutions and a two-dimensional model that analyzes each section individually. The models were trained with 5174 four-view DBT studies, validated with 1000 four-view DBT studies, and tested on 655 four-view DBT studies, which were retrospectively collected from nine institutions in the United States through an external entity. Methods were compared using area under the receiver operating characteristic curve (AUC), sensitivity at a fixed specificity, and specificity at a fixed sensitivity.

Results

On the test set of 655 DBT studies, both 3D models showed higher classification performance than did the per-section baseline model. The proposed transformer-based model showed a significant increase in AUC (0.88 vs 0.91, P = .002), sensitivity (81.0% vs 87.7%, P = .006), and specificity (80.5% vs 86.4%, P < .001) at clinically relevant operating points when compared with the single-DBT-section baseline. The transformer-based model used only 25% of the number of floating-point operations (FLOPs) used by the 3D convolution model while demonstrating similar classification performance.

Conclusion

A transformer-based deep neural network using data from neighboring sections improved breast cancer classification performance compared with a per-section baseline model and was more efficient than a model using 3D convolutions.

Keywords: Breast, Tomosynthesis, Diagnosis, Supervised Learning, Convolutional Neural Network (CNN), Digital Breast Tomosynthesis, Breast Cancer, Deep Neural Networks, Transformers

Supplemental material is available for this article.

© RSNA, 2023



Summary

A transformer-based deep neural network architecture that uses neighboring image sections to detect breast cancer was developed using 6174 digital breast tomosynthesis studies. The model outperformed a sectionwise baseline model and matched a three-dimensional convolution baseline while requiring substantially less computation.

Key Points

  • Two three-dimensional models that analyze neighboring image sections to make the final prediction, developed using 6174 digital breast tomosynthesis (DBT) studies, showed better breast cancer classification performance (area under the receiver operating characteristic curve) on a test set of 655 DBT studies than the baseline model that analyzes each DBT section independently (baseline: 0.88 vs Conv3D: 0.90, P = .002; baseline: 0.88 vs TimeSformer: 0.91, P = .003).

  • A transformer-based three-dimensional model showed classification performance similar to that of a convolution-based model and was more computationally efficient, using 75% fewer floating-point operations and having shorter inference time.

Introduction

Digital breast tomosynthesis (DBT) is a medical imaging technique in which an x-ray source moves through a limited angle around the breast while a detector records multiple projection images. These images are then reconstructed into a stack of two-dimensional (2D) sections, allowing improved lesion detection, characterization, and localization. Many studies show improvements in both screening and diagnostic imaging outcomes with DBT compared with 2D digital mammography (1,2). Although DBT is becoming the standard of care for the detection of breast cancer, its interpretation time remains a concern (3). Recently, computer-aided detection systems based on convolutional neural networks (CNNs) have been developed to improve breast cancer detection and reduce the workload of human readers (4).

A key challenge when using a neural network for DBT is the three-dimensional (3D) volume of data: each scan has a high spatial resolution and many sections, meaning one patient case can easily amount to a few gigabytes of data when uncompressed. Additionally, 3D CNNs (5) are difficult to apply because of their large computational cost. Most computer-aided detection methods for DBT therefore evaluate only a single section at a time (6,7) or synthesize the whole DBT stack into a single image based on the inference result of every section and evaluate that image (8–10). A downside of these approaches is that relations between sections are not exploited optimally.

Vision transformers (11,12), inspired by successes in natural language processing (13), have recently gained popularity in medical image analysis (14). Transformers make use of self-attention, which enables them to better capture context. They are especially useful for modeling long-range dependencies in the input (12), something convolution operators cannot do. Variants of transformers have been shown to be effective on 3D data for detection tasks, such as video action recognition (15,16), achieving state-of-the-art results with drastically higher efficiency than 3D CNNs (17).

In this study, we propose a method that takes neighboring sections into account to detect breast cancer on DBT images. Our method relies on a transformer equipped with divided space-time attention to learn relations between neighboring sections (17). The proposed method was trained and evaluated on a dataset collected from multiple institutions. We compare the classification performance of the proposed model with that of a baseline model that analyzes only a single DBT section at a time and a baseline model based on 3D convolutions.

Materials and Methods

Study data were retrospectively collected in compliance with the Health Insurance Portability and Accountability Act. This work was supported by funds secured by Lunit. All the authors are employees of Lunit.

Data

Our in-house DBT dataset comprises 6829 (1699 cancer, 3418 benign, 1712 normal) four-view Hologic DBT studies, which were retrospectively collected from nine institutions in the United States through an external entity. Cancer was confirmed with biopsy, benign examination findings were confirmed with biopsy or at least 1 year of follow-up imaging, and normal examination findings were confirmed with at least 1 year of follow-up imaging. For cancer and biopsy-proven benign examination findings, we restricted our data to one mammogram per patient. According to the Health Insurance Portability and Accountability Act Privacy Rule, all patient data from the examinations were de-identified using the Safe Harbor Method, which includes the removal of all 18 Protected Health Information elements. We also do not have access to a link allowing re-identification of data. Therefore, our study complies with the Health Insurance Portability and Accountability Act, and institutional review board approval was not required.

The dataset was split into training, validation, and test sets. A total of 655 (163 cancer, 328 benign, 164 normal) studies from one institution were used as the test set. The remaining studies were split randomly into a training set of 5174 (1286 cancer, 2590 benign, 1298 normal) studies and a validation set of 1000 (250 cancer, 500 benign, 250 normal) studies. The test set was not used for training or tuning.

All 1699 studies with findings positive for cancer were annotated by one of six board-certified radiologists with breast subspecialty, who had been trained in breast imaging for at least 6 months, by referring to the ground truth of the case. For each DBT study, the radiologists were asked to draw a contour delineating the lesion at the section showing the largest cross-sectional area of the lesion. This was considered more cost-effective than drawing contours in all sections and still captured the most important part of a lesion. In addition to the contour, the annotators were asked to classify the subtype of the lesion as either a calcification, a soft-tissue lesion (which includes architectural distortions, masses, and asymmetries), or both.
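To make the annotation format concrete, the sketch below shows one way such a record could be represented in Python; the class and field names are hypothetical and are not taken from the authors' tooling.

```python
# Hypothetical illustration of one lesion annotation record; names are assumptions.
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class LesionSubtype(Enum):
    CALCIFICATION = "calcification"
    SOFT_TISSUE = "soft_tissue"      # masses, asymmetries, architectural distortions
    BOTH = "both"


@dataclass
class LesionAnnotation:
    study_id: str
    view: str                        # e.g., "LMLO"
    section_index: int               # section showing the largest lesion cross-section
    contour: List[Tuple[int, int]]   # polygon vertices (row, col) in pixel coordinates
    subtype: LesionSubtype
```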

Model Development

A DBT scan is a stack of 2D sections reconstructed from 2D radiographs taken from multiple angles (1). Our method takes a DBT stack of reconstructed sections as input and generates a prediction for each section: a sectionwise likelihood that the section contains a malignant lesion and a heat map with a prediction for every pixel in every section, where each pixel value represents the likelihood that the pixel belongs to a malignant lesion.

Training a deep neural network on DBT data is challenging, mainly because of its high memory and computational requirements. Similar to 2D mammography, a DBT section is recorded at a high resolution (typically 50–80 μm) to capture fine details, like calcifications. Furthermore, the number of sections varies per view (50 to 100 images for each view), meaning a typical model that assumes a fixed-size input cannot readily be employed.

To trade off between the information provided to the model and the computational cost, we did not feed the whole DBT stack to the model, but only a subset of sections at a time. We sampled neighboring sections because suspicious lesions are usually visible only in a consecutive part of the stack. This way, it is easier to process a large, varying number of sections. During testing, our method makes predictions for the entire DBT stack.
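The following Python sketch illustrates this sampling scheme under the assumption of a window of five consecutive sections (the actual window size is not specified here): during training, one window around a random target section is drawn, and at inference time the window slides so that every section receives a prediction.

```python
# Sketch of neighbor-section sampling; the window size of 5 is an assumption.
import random
from typing import List


def sample_training_window(num_sections: int, window: int = 5) -> List[int]:
    """Pick one target section at random and return the indices of its
    `window` consecutive neighbors, clamped to the stack boundaries."""
    target = random.randrange(num_sections)
    half = window // 2
    start = min(max(target - half, 0), max(num_sections - window, 0))
    return list(range(start, min(start + window, num_sections)))


def inference_windows(num_sections: int, window: int = 5):
    """At test time, slide the window so that every section becomes the target
    (and therefore receives a prediction) exactly once."""
    half = window // 2
    for target in range(num_sections):
        start = min(max(target - half, 0), max(num_sections - window, 0))
        yield target, list(range(start, min(start + window, num_sections)))
```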

Our model consists of three networks: a backbone network, an interaction network, and an aggregation network (Fig 1). The backbone network extracts a feature map from each input section independently. The interaction network subsequently generates a context-aware representation of each section by interacting with neighboring section features. Finally, the aggregation network reduces the neighbor features and generates the final prediction score (the likelihood that the target section contains a malignant lesion) and a heat map for malignant lesions. Each network is explained in detail below.
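A minimal PyTorch-style sketch of how the three networks could be wired together is shown below; module interfaces, tensor shapes, and names are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the three-network design; shapes and names are assumptions.
import torch
import torch.nn as nn


class DBTClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, interaction: nn.Module,
                 aggregation: nn.Module):
        super().__init__()
        self.backbone = backbone        # 2D CNN applied to each section
        self.interaction = interaction  # TimeSformer or Conv3D over section features
        self.aggregation = aggregation  # reduces neighbors to score + heat map

    def forward(self, sections: torch.Tensor):
        # sections: (batch, num_sections, 1, height, width)
        b, s, c, h, w = sections.shape
        feats = self.backbone(sections.view(b * s, c, h, w))   # (b*s, C, H', W')
        feats = feats.view(b, s, *feats.shape[1:])             # (b, s, C, H', W')
        feats = self.interaction(feats)                        # context-aware features
        score, heatmap = self.aggregation(feats)               # per-section outputs
        return score, heatmap
```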

Figure 1:

Diagram shows the proposed method (based on TimeSformer), which considers neighboring sections directly in the model for the detection of breast cancer in digital breast tomosynthesis studies. The model comprises a backbone network that extracts features from individual sections, an interaction network that learns from interaction among neighboring section features, and an aggregation network that produces the final score and heat map of the target section. MLP = multilayer perceptron.

Backbone Network

The backbone network takes a single section as input and outputs a feature representation. There are various architectural choices for the backbone network, ranging from 2D CNNs to 3D CNNs or long short-term memory networks. Although our method aims to capture relations between neighboring sections, we chose a 2D backbone to accommodate pretraining with 2D mammograms. More details can be found in Appendix S1.
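As an illustration only, the snippet below shows one way a 2D backbone could be initialized from mammography-pretrained weights; the ResNet-34 architecture and checkpoint name are assumptions, and the actual backbone and pretraining procedure are described in Appendix S1.

```python
# Sketch of reusing 2D-pretrained weights for the per-section backbone.
# The architecture and checkpoint path are assumptions, not the authors' setup.
import torch
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet34(weights=None)
# Single-channel input for grayscale DBT sections.
resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
state = torch.load("mammo_pretrained_backbone.pt", map_location="cpu")  # assumed checkpoint
resnet.load_state_dict(state, strict=False)  # ignore heads that differ from pretraining
# Drop the global pooling and classification head so the backbone emits a
# spatial feature map for the interaction network.
backbone = nn.Sequential(*list(resnet.children())[:-2])
```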

Interaction Network

The interaction network aims to capture context from neighboring sections and operates on the spatial features extracted by the backbone network. We experimented with two architectures: TimeSformer (17) and a 3D convolution baseline (Conv3D). For the latter, we stacked four 3D residual blocks, each consisting of 3D convolutions, batch normalization, and activation layers (5,18).
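A minimal sketch of one such 3D residual block, stacked four times for the Conv3D baseline, is given below; channel counts and kernel sizes are assumptions.

```python
# Sketch of one 3D residual block for the Conv3D baseline; sizes are assumptions.
import torch.nn as nn


class ResidualBlock3D(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, sections, height, width)
        return self.act(x + self.body(x))


# The baseline stacks four such blocks, as described above.
conv3d_interaction = nn.Sequential(*[ResidualBlock3D(256) for _ in range(4)])
```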

TimeSformer (17) is a recently introduced transformer architecture for efficient video classification. It decomposes the input (for our data, every section in the scan) into patches that are subsequently used as input tokens to the transformer. Divided space-time attention is then applied to the tokens, processing temporal attention and spatial attention separately (Fig 2). This way, the 3D volume can be handled efficiently without sacrificing representational power. When applied to a DBT scan, the section axis plays the role of the time dimension, and the height and width form the spatial dimensions.
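The sketch below illustrates the core idea of divided space-time attention, with temporal attention over tokens at the same spatial location across sections followed by spatial attention within each section; it omits the layer normalization and multilayer perceptron of the full TimeSformer block (17), and the dimensions are illustrative.

```python
# Simplified sketch of divided space-time attention; see reference 17 for the full block.
import torch
import torch.nn as nn


class DividedSpaceTimeAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, sections, patches, dim)
        b, s, p, d = tokens.shape

        # Temporal attention: same patch location, attended across sections.
        t = tokens.permute(0, 2, 1, 3).reshape(b * p, s, d)
        t = self.temporal(t, t, t)[0].reshape(b, p, s, d).permute(0, 2, 1, 3)
        tokens = tokens + t

        # Spatial attention: all patch locations within each section.
        sp = tokens.reshape(b * s, p, d)
        sp = self.spatial(sp, sp, sp)[0].reshape(b, s, p, d)
        return tokens + sp
```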

Figure 2:

Divided space-time attention block of TimeSformer in the interaction network. For a token in the input (highlighted in the left block), temporal attention is calculated for the tokens in the same spatial location over the sections (middle block). Subsequently, the spatial attention is calculated for the tokens in the same section (right block).

Aggregation Network

The aggregation network combines the features of multiple sections and predicts the final section-level score and heat map for each target section. We used max pooling along the section direction to aggregate feature maps of neighbors. The aggregated feature map was then used to predict the score for the center section and for the pixel-level heat map.
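The following sketch illustrates this aggregation step; the max pooling over the section axis follows the description above, while the specific score and heat map heads are assumptions.

```python
# Sketch of the aggregation step: max pooling over the neighboring-section axis,
# followed by assumed heads for the section-level score and pixel-level heat map.
import torch
import torch.nn as nn


class AggregationHead(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.heatmap_head = nn.Conv2d(channels, 1, kernel_size=1)
        self.score_head = nn.Linear(channels, 1)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, sections, channels, height, width) for one window
        # centered on the target section.
        pooled = feats.max(dim=1).values                     # max over neighboring sections
        heatmap = torch.sigmoid(self.heatmap_head(pooled))   # per-pixel malignancy likelihood
        score = torch.sigmoid(self.score_head(pooled.mean(dim=(2, 3))))  # section-level score
        return score, heatmap
```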

Subgroup Analysis

To better understand the performance improvement, we split our test set into subgroups and analyzed how the 3D models perform on these specific groups. First, to determine the effect of taking neighboring sections into account on reading challenging or ambiguous examinations, we extracted a subset consisting of biopsy-proven cancer and benign studies. The biopsy-proven benign studies were recalled and biopsied and can therefore be considered hard-negative findings, as radiologists could not classify lesions using the image alone.

As a second experiment, the examinations containing biopsy-proven cancers were split according to two criteria. First, the dataset was split into three subsets on the basis of their radiologic findings (ie, soft-tissue lesion, calcification, or both) to determine in which group the model is the most effective. Second, the set was split on the basis of the estimated lesion size, which was based on the number of pixels in the annotated polygon and the pixel spacing in the Digital Imaging and Communications in Medicine header. The data were split into two ranges: diameter less than or equal to 2 cm and diameter greater than 2 cm.
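One plausible implementation of this size estimate, converting the pixel count inside the annotated polygon to an equivalent-circle diameter by using the DICOM pixel spacing, is sketched below; it assumes PixelSpacing is available at the top level of the header, which may not hold for every DBT object, and the equivalent-circle convention is an assumption.

```python
# Sketch of estimating lesion diameter from the annotated polygon and DICOM pixel spacing.
import math

import numpy as np
import pydicom
from skimage.draw import polygon


def estimated_diameter_cm(contour_rows, contour_cols, dicom_path: str) -> float:
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)
    row_spacing_mm, col_spacing_mm = (float(v) for v in ds.PixelSpacing)  # tag (0028,0030)

    rr, _ = polygon(np.asarray(contour_rows), np.asarray(contour_cols))
    area_mm2 = len(rr) * row_spacing_mm * col_spacing_mm   # pixel count inside the contour
    diameter_mm = 2.0 * math.sqrt(area_mm2 / math.pi)      # diameter of a circle of equal area
    return diameter_mm / 10.0


def size_subgroup(diameter_cm: float) -> str:
    # Subgroup assignment used in the analysis: <= 2 cm vs > 2 cm.
    return "small (<=2 cm)" if diameter_cm <= 2.0 else "large (>2 cm)"
```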

Statistical Analysis

We compared the methods by using three metrics: (a) the area under the receiver operating characteristic curve (AUC), (b) the sensitivity at a fixed specificity, and (c) the specificity at a fixed sensitivity. To compare sensitivity, we chose an operating point at a high specificity of 0.8, which is relevant for systems operating as detection aids, where one would prefer a small number of false-positive results. For the specificity comparison, we chose an operating point at a high sensitivity of 0.8, which is relevant, for example, for triaging applications in which the model would act as a prefilter.
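The operating points can be extracted from the empirical receiver operating characteristic curve as sketched below with scikit-learn; this is a generic illustration, not the authors' evaluation code.

```python
# Sketch of extracting the two operating points: sensitivity at specificity 0.8
# and specificity at sensitivity 0.8, from study-level labels and model scores.
from sklearn.metrics import roc_curve


def operating_points(y_true, y_score):
    fpr, tpr, _ = roc_curve(y_true, y_score)
    specificity = 1.0 - fpr

    # Best sensitivity among thresholds that still achieve >= 0.8 specificity.
    sens_at_spec80 = tpr[specificity >= 0.8].max()

    # Best specificity among thresholds that still achieve >= 0.8 sensitivity.
    spec_at_sens80 = specificity[tpr >= 0.8].max()
    return sens_at_spec80, spec_at_sens80
```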

The DeLong test (19) was used to generate confidence bounds and compare the AUCs of the different algorithms. To generate confidence bounds and compare the models at specific operating points, we used an asymptotic normal approximation (20) and the McNemar test. We also compared the computational cost of the different architectures by counting the number of floating-point operations (FLOPs), reported in gigaFLOPs, and by measuring relative model latency in wall-clock time.
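The sketch below shows generic versions of the operating-point statistics: a normal-approximation confidence interval for a proportion (eg, sensitivity or specificity) and a McNemar test on the paired correct/incorrect decisions of two models. It illustrates the cited tests in general form and is not the authors' analysis code.

```python
# Generic sketch of the operating-point statistics described above.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar


def normal_approx_ci(successes: int, total: int, z: float = 1.96):
    """Asymptotic normal (Wald) confidence interval for a proportion."""
    p = successes / total
    half = z * np.sqrt(p * (1.0 - p) / total)
    return p - half, p + half


def mcnemar_p(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """McNemar test on paired boolean correctness arrays of two models."""
    table = np.array([
        [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
        [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
    ])
    return mcnemar(table, exact=False, correction=True).pvalue
```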

Results

2D (Single DBT Section) versus 3D Comparison

Two 3D methods that take neighboring sections as input were compared with a 2D single-DBT-section baseline on the test set of 655 DBT studies (Table 1, Fig 3). Both 3D methods had higher AUCs than the baseline (Conv3D: 0.90 vs baseline: 0.88, P = .002 and TimeSformer: 0.91 vs baseline: 0.88, P = .003). Sensitivity at 0.8 specificity (Conv3D: 85.3% [139 of 163] vs baseline: 81.0% [132 of 163], P = .046 and TimeSformer: 87.7% [143 of 163] vs baseline: 81.0% [132 of 163], P = .006) and specificity at the 0.8 sensitivity (Conv3D: 85.8% [422 of 492] vs baseline: 80.5% [396 of 492], P < .001 and TimeSformer: 86.4% [425 of 492] vs baseline: 80.5% [396 of 492], P < .001) were also higher using the 3D methods.

Table 1:

Breast Cancer Classification Performance of the Three Compared Methods When Taking Neighboring Sections into Account


Figure 3:

Test set receiver operating characteristic (ROC) curves of the three models: Blue is a sectionwise baseline, orange is a three-dimensional convolution baseline (Conv3D), and green is the proposed model (TimeSformer).

The two 3D methods did not have significantly different AUCs (Conv3D: 0.90 vs TimeSformer: 0.91, P = .15). Sensitivity at 0.8 specificity (Conv3D: 85.3% [139 of 163] vs TimeSformer: 87.7% [143 of 163], P = .29) and specificity at 0.8 sensitivity (Conv3D: 85.8% [422 of 492] vs TimeSformer: 86.4% [425 of 492], P = .66) were also not significantly different.

At the fixed-specificity operating point, the proposed transformer-based model reduced the number of missed cancers in the test set from 31 (using the single-DBT-section baseline) to 20, a reduction of 35% (11 of 31). At the fixed-sensitivity operating point, the proposed model reduced the number of false-positive findings from 96 to 67, a reduction of 30% (29 of 96). An example output for a missed cancer is shown in Figure 4.

Figure 4:

An example visualization of output. Outputs from digital breast tomosynthesis sections were aggregated and visualized in C-view (synthetic two-dimensional image; Hologic). Left: Examined C-view image and annotated contour. Middle: Heat map created by using the baseline. Right: Heat map created by using the proposed method. LMLO = left mediolateral oblique.

Computational Efficiency

While both 3D methods improved performance to a similar degree, the proposed method based on the TimeSformer architecture required drastically less computation than Conv3D. If we assume the single-DBT-section baseline takes 5 minutes to infer a DBT stack, the same task takes 22 minutes 30 seconds with the 3D convolution model. TimeSformer requires only 25% of the FLOPs of Conv3D. Furthermore, the latency of TimeSformer is shorter still and comparable to that of the baseline. Note that the exact inference time strongly depends on the hardware and software used.
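For readers who want to reproduce this kind of comparison, the sketch below counts FLOPs with the fvcore library and measures wall-clock latency on a dummy input; the model, input shape, and number of runs are placeholders, and FLOP-counting conventions differ between tools.

```python
# Sketch of profiling computational cost: FLOP counting plus wall-clock latency.
import time

import torch
from fvcore.nn import FlopCountAnalysis


def profile(model: torch.nn.Module, input_shape=(1, 5, 1, 2048, 1024), runs: int = 10):
    """Return (GFLOPs, mean latency in seconds) for one forward pass on CPU."""
    dummy = torch.randn(*input_shape)
    gflops = FlopCountAnalysis(model, dummy).total() / 1e9  # counting convention is tool specific

    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            model(dummy)
    latency = (time.perf_counter() - start) / runs
    return gflops, latency
```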

Subgroup Analysis

First, we compared how models perform in the challenging subset, which contains only biopsy-proven cancer and benign studies. Both 3D methods achieved higher AUCs than did the baseline (Conv3D: 0.85 vs baseline: 0.83, P = .003 and TimeSformer: 0.86 vs baseline: 0.83, P = .002), while the two 3D methods were comparable (Conv3D: 0.85 vs TimeSformer: 0.86, P = .07).

Performance analyses on subgroups defined by radiologic finding and lesion size showed that the proposed method using neighbor context improved breast cancer detection in all subgroups, especially in soft-tissue lesions (AUCs, TimeSformer: 0.93 and Conv3D: 0.91 vs baseline: 0.89; P < .001 for both comparisons with baseline, P = .002 for comparing 3D models) and small lesions (TimeSformer: 0.91 and Conv3D: 0.90 vs baseline: 0.87; P < .001 for both comparisons with baseline, P = .02 for comparing 3D models) (Table 2). These subgroups are also known to be better detected on DBT images than on full-field digital mammograms (3), which suggests that our method takes advantage of the strengths of DBT.

Table 2:

Breast Cancer Classification Performance of the Three Compared Methods Split by Radiologic Finding and Lesion Size


Discussion

We proposed a method to detect breast cancer on DBT images that takes neighboring sections into account by using a transformer-based deep neural network architecture. The proposed method reduced the number of false-positive results. The method was trained on a large DBT dataset comprising 6174 studies and evaluated on a separate test set of 655 DBT studies. The proposed method improved AUC (0.91 vs 0.88, P = .003), sensitivity at fixed specificity (87.7% vs 81.0%, P = .006), and specificity at fixed sensitivity (86.4% vs 80.5%, P < .001) compared with the per-section baseline. Furthermore, we showed that using neighbor context improves model ability to separate challenging benign and cancerous lesions and to detect soft-tissue and small cancers. Last, we showed that adopting TimeSformer drastically reduces computation compared with using the 3D convolution baseline.

Our study had limitations. Although we made use of a large, multi-institutional dataset for development, the method was evaluated only on data from a single institution. Additionally, our dataset contained only DBT studies acquired with scanners from a single manufacturer (Hologic), and the entire dataset was sampled from a U.S. population. To fully assess the generalization performance, and thereby the clinical merit, of our approach, we plan to extend the dataset to multiple institutions, multiple device manufacturers, and a mixed patient sample in the future.

In this work, we have made use of a transformer model in combination with a convolutional architecture to better attend to relevant sections in the DBT stack. Recently, transformers have also been applied to raw images directly, without any convolutional operators. Some computer vision studies show that this can outperform convolutional architectures, although more data are typically required. In future work, we plan to collect more data and explore this architecture further.

Supported by funds secured by Lunit.

Disclosures of conflicts of interest: W.L. Employment at Lunit; stock/stock options for Lunit. Hyeonsoo Lee Employment at Lunit; stock/stock options for Lunit. Hyunjae Lee Employment at Lunit; stock/stock options for Lunit. E.K.P. Employment at Lunit; stock/stock options for Lunit. H.N. Employment at Lunit; stock/stock options for Lunit. T.K. Employment at Lunit; stock/stock options for Lunit.

Abbreviations:

AUC
area under the receiver operating characteristic curve
CNN
convolutional neural network
DBT
digital breast tomosynthesis
FLOPs
floating point operations
3D
three-dimensional
2D
two-dimensional

References

  • 1. Sechopoulos I. A review of breast tomosynthesis. Part I. The image acquisition process. Med Phys 2013;40(1):014301.
  • 2. Sechopoulos I. A review of breast tomosynthesis. Part II. Image reconstruction, processing and analysis, and advanced applications. Med Phys 2013;40(1):014302.
  • 3. Chong A, Weinstein SP, McDonald ES, Conant EF. Digital breast tomosynthesis: concepts and clinical practice. Radiology 2019;292(1):1–14.
  • 4. Sechopoulos I, Teuwen J, Mann R. Artificial intelligence for breast cancer detection in mammography and digital breast tomosynthesis: State of the art. Semin Cancer Biol 2021;72:214–225.
  • 5. Hara K, Kataoka H, Satoh Y. Learning spatio-temporal features with 3D residual networks for action recognition. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017; 3154–3160.
  • 6. Shoshan Y, Zlotnick A, Ratner V, Khapun D, Barkan E, Gilboa-Solomon F. Beyond non-maximum suppression: detecting lesions in digital breast tomosynthesis volumes. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2021; 772–781. Cham: Springer.
  • 7. Samala RK, Chan HP, Hadjiiski L, Helvie MA, Wei J, Cha K. Mass detection in digital breast tomosynthesis: Deep convolutional neural network with transfer learning from mammography. Med Phys 2016;43(12):6654–6666.
  • 8. Lotter W, Diab AR, Haslam B, et al. Robust breast cancer detection in mammography and digital breast tomosynthesis using an annotation-efficient deep learning approach. Nat Med 2021;27(2):244–249.
  • 9. Tardy M, Mateus D. Trainable summarization to improve breast tomosynthesis classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2021; 140–149. Cham: Springer.
  • 10. Bai J, Posner R, Wang T, Yang C, Nabavi S. Applying deep learning in digital breast tomosynthesis for automatic breast cancer detection: A review. Med Image Anal 2021;71:102049.
  • 11. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2010.11929 [preprint]. https://arxiv.org/abs/2010.11929. Posted October 22, 2020. Accessed March 1, 2022.
  • 12. Liu Z, Lin Y, Cao Y, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; 9992–10002.
  • 13. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
  • 14. Shamshad F, Khan S, Zamir SW, et al. Transformers in medical imaging: A survey. Med Image Anal 2023;102802.
  • 15. Arnab A, Dehghani M, Heigold G, Sun C, Lučić M, Schmid C. ViViT: a video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; 6816–6826.
  • 16. Liu Z, Ning J, Cao Y, et al. Video swin transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022; 3192–3201.
  • 17. Bertasius G, Wang H, Torresani L. Is space-time attention all you need for video understanding? In: Proceedings of the 38th International Conference on Machine Learning (ICML), 2021. https://proceedings.mlr.press/v139/bertasius21a.html.
  • 18. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016; 770–778.
  • 19. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44(3):837–845.
  • 20. Brown LD, Cai TT, DasGupta A. Interval estimation for a binomial proportion. Stat Sci 2001;16(2):101–133.
