An efficient context-aware approach for whole-slide image classification

Hongru Shen; Jianghua Wu; Xilin Shen; Jiani Hu; Jilei Liu; Qiang Zhang; Yan Sun; Kexin Chen; Xiangchun Li

doi:10.1016/j.isci.2023.108175

iScience. 2023 Oct 12;26(12):108175. doi: 10.1016/j.isci.2023.108175

Hongru Shen¹,⁶, Jianghua Wu²,⁶, Xilin Shen¹, Jiani Hu¹, Jilei Liu¹, Qiang Zhang³, Yan Sun⁴, Kexin Chen⁵,∗, Xiangchun Li¹,⁷,∗∗

PMCID: PMC10690557 PMID: 38047071

Summary

Computational pathology for gigapixel whole-slide images (WSIs) at slide level is helpful in disease diagnosis but remains challenging. We propose a context-aware approach termed WSI inspection via transformer (WIT) for slide-level classification via holistically modeling dependencies among patches on a WSI. WIT automatically learns a feature representation of the WSI by aggregating features of all image patches. We evaluated the classification performance of WIT against a state-of-the-art baseline method. WIT achieved an accuracy of 82.1% (95% CI, 80.7%–83.3%) in the detection of 32 cancer types on the TCGA dataset, 0.918 (0.910–0.925) in diagnosis of cancer on the CPTAC dataset, and 0.882 (0.87–0.890) in the diagnosis of prostate cancer from needle biopsy slides, outperforming the baseline by 31.6%, 5.4%, and 9.3%, respectively. WIT can pinpoint the WSI regions that are most influential for its decision. WIT represents a new paradigm for computational pathology, facilitating the development of digital pathology tools.

Subject areas: Oncology, pathology, Computer science

Highlights

• WIT aggregates representations of all image patches for slide-level classification
• WIT achieves high accuracy in the detection of 32 cancer types and diagnosis of cancer
• Saliency maps obtained from WIT are visually interpretable

Introduction

The development of digital pathology leads to accumulation of large-scale whole-slide imaging data, laying the foundation of big data for computational pathology. Rich morphological features buried in whole-slide image (WSI) provide diagnostic information of the disease and offer guidance on the decision for treatment. Advances in deep learning algorithms enable the analyses of gigapixel WSIs at scale for disease diagnosis,¹^,²^,³ prognosis,⁴^,⁵^,⁶^,⁷ and treatment selection.⁸^,⁹

Deep learning approaches have achieved human-level performance in recognizing natural images in the ImageNet competition.¹⁰^,¹¹^,¹²^,¹³ However, automatic recognition of WSI remains challenging due to the super-high spatial resolution of WSI as compared with images from ImageNet.¹⁰ To address this challenge, researchers divided WSI into small image patches and subsequently aggregated the features of image patches to obtain slide-level features.⁵^,¹⁴^,¹⁵^,¹⁶^,¹⁷ For example, Campanella and colleagues used standard multiple-instance learning (MIL) to diagnose prostate cancer, basal cell carcinoma, and axillary lymph node metastasis of breast cancer by first ranking image patches with regard to slide-level labels and using the most relevant image patch for slide-level classification.¹ Lu and colleagues developed a data-efficient weakly supervised approach¹⁸ for slide-level classification using attention-based pooling¹⁹ of all image patches instead of the most relevant patch used by standard MIL.¹ Based on this approach, Lu and colleagues introduced tumor origin assessment via deep learning (TOAD) to predict tissue-of-origins for cancer of unknown primary.²⁰ Meanwhile, this attention-based MIL method has been utilized for addressing the diagnostic tasks for cardiac allograft rejection screening in WSIs⁸ and prognostic prediction by fusing WSIs with different modalities of genomic data.⁴ Apart from these diagnostic endeavors, analyses of large-scale WSIs have proven feasible for the prediction of genetic markers. Coudray and colleagues reported a deep-learning-based approach for predicting somatic mutations in canonical driver genes for lung cancer via averaging the probabilities of image patches or counting the percentage of image patches classified as positive.¹⁵ In addition, multiple studies reported that micro-satellite instability can be predicted from WSIs in gastrointestinal cancer,²¹ colorectal cancer,²²^,²³^,²⁴ and endometrial carcinoma.²⁵

The transformer architecture designed for natural language understanding can capture long-range dependencies among different entities.²⁶ Transformer-based language architectures have achieved superior performance in various language understanding tasks.²⁶^,²⁷^,²⁸ The self-attention operation is the key module underlying the success of the transformer in that it captures dependencies in the input.²⁶ Although it was proposed for language understanding, the transformer is inherently task-agnostic. It has been widely adopted or revised for image recognition. Vision transformer (ViT) is a direct adoption of the transformer for image classification by splitting an image into multiple patches and taking the flattened image patches as input.²⁶ Thereafter, ViT-based architectures have been widely used in medical imaging analyses.²⁹^,³⁰^,³¹

Inspired by the success of transformer-based natural language understanding³²^,³³ and image recognition,²⁷ we present an approach called WSI inspection via transformer (WIT) for slide-level classification via holistically modeling dependencies among patches on the WSI. WIT takes as input the features of image patches that were extracted with an image model pretrained on ImageNet.³⁴ We collected a total of 22,457 WSIs from the TCGA, CPTAC, and PANDA projects to develop and systematically evaluate WIT for detection of 32 cancer types and diagnosis of cancer. The TCGA dataset consists of 11,623 WSIs covering 32 cancer types. The CPTAC dataset includes 3,414 WSIs from cancer patients and 1,638 WSIs from non-cancer controls. The PANDA dataset consists of 5,782 needle biopsy slides; 2,891 of them are prostate cancers and the rest are non-cancer controls. WIT achieved an accuracy of 82.1% in the detection of 32 cancer types on the TCGA dataset, 91.8% in diagnosis of cancer on the CPTAC dataset, and 88.2% on the PANDA dataset, outperforming the attention-based MIL baseline by 31.6%, 5.4%, and 9.3%, respectively. WIT can pinpoint the WSI regions that are most influential for its decision. WIT represents a new paradigm for computational pathology. It will facilitate the development of assistive tools for digital pathology.

Results

An overview of WIT

The procedure to develop WIT includes WSI segmentation and tiling, model development, and evaluation (Figure 1). Firstly, we segmented the WSI to identify tissue regions and subsequently tiled the WSI into patches of 256 × 256 pixels (Figure 1A). WIT takes these flattened image patches as input. We used a pretrained model to extract a feature with 1,024 dimensions for each image patch (See STAR methods). The position embeddings of image patches on that WSI, along with their extracted features, were fed into a transformer block. The transformer block consists of a multi-headed self-attention module and a point-wise feed-forward neural network. A residual connection is employed around each of these two sub-modules, followed by layer normalization²⁶ (Figure 1B). The multi-headed self-attention module learns the dependencies among different image patches and the influence of each patch on the output, such as slide labels (Figure 1B). WIT was evaluated for its capacity in slide classification and localization of image patches that exhibit significant association with slide labels (Figure 1C).

A flowchart illustrating the framework of WIT

(A) Illustration of the preprocessing steps: segmentation of tissue regions, patch tiling and flattening.

(B) The architecture of WIT.

(C) Evaluation of WIT for classification and model interpretability. WSI, Whole Slide Image; AbMIL, Attention-based Multiple Instance Learning.

High performance of WIT in tissue-of-origin localization

We systematically evaluated the classification performance of WIT on The Cancer Genome Atlas (TCGA) dataset for tissue-of-origin localization via 5-fold cross-validation (See STAR methods). The TCGA dataset consists of 11,623 formalin-fixed paraffin-embedded WSIs from 9,565 individuals covering 32 cancer types (Table S1). We examined the classification performance of WIT at model sizes of 1, 2, 5, and 17 megabytes (Table S2). We used the attention-based MIL model as the baseline model for comparison. The baseline has a model size of 1 megabyte.

The accuracy of WIT increased with model size. Its top-1 accuracy ranged from 73.1% (95% confidence interval [CI], 72.5%–73.9%) for WIT-1Mb to 82.1% (80.7%–83.3%) for WIT-17Mb, whereas the baseline had a top-1 accuracy of 64.2% (62.4%–66.0%) (Figure 2A). Top-2 and top-3 accuracies exhibited the same trend as top-1 accuracy (Figure 2A; Table S3). Meanwhile, the micro-average AUROCs of the four WIT models were also higher than that of the baseline model (Figure 2B). WIT-17Mb achieved high performance in localization of 32 cancer types with respect to precision and recall rate (Figure 2C). WIT-17Mb achieved an average precision of 77.3% and recall rate of 75.6%, outperforming the baseline method by 29.5% and 37.5%, respectively. The confusion matrix of the baseline method is shown in Figure S1D. WIT models of different sizes also had higher performance than the baseline method when stratified by cancer types (Figure 2D; Tables S4‒S6). In addition, the F1 scores achieved by different WIT models are higher than those of the baseline method (Figure 2E; Table S7). For example, WIT-1Mb had an average F1 score of 0.618 versus 0.554 as obtained by the baseline method, although WIT-1Mb and the baseline method had comparable model sizes.

The classification performance of WIT in localization of tissue origins for 32 cancer types on TCGA dataset

(A) Top-K accuracy for localization of tumor origins, $K \in \{1,2,3\}$.

(B) Micro-average area under the receiver operating curve.

(C) Patient-level performance from 5-fold cross-validation. Per origin count, precision, and recall rate are plotted next to the confusion matrix. The columns represent the true origin of the tumor, and rows represent the prediction by the WIT model.

(D) Area under the precision-recall curve (PRAUC) stratified by cancer types.

(E) Scatterplots of F1 scores between different models. AbMIL, attention-based multiple instance learning.

High performance of WIT in cancer diagnosis

WIT achieved high classification performance in the diagnosis of cancer on the CPTAC and PANDA datasets (See STAR methods). The CPTAC dataset consists of 5,052 formalin-fixed paraffin-embedded WSIs from 1,330 individuals (Table S8). The PANDA dataset consists of 5,782 prostate WSIs subjected to needle biopsies.³⁵

On the CPTAC dataset, WIT models achieved AUROCs ranging from 0.941 (95% CI, 0.934–0.949) to 0.953 (0.946–0.960), whereas the baseline model achieved an AUROC of 0.931 (0.931–0.969) (Figure 3A). WIT-17Mb achieved an accuracy of 0.918 (0.910–0.925), higher than that of the smaller WIT models and the baseline model. Similar trends were observed with respect to other classification metrics (Figures 3B and 3C; Table S9). On the PANDA dataset, WIT-17Mb achieved a significantly higher AUROC than the smaller WIT models and the baseline model (DeLong’s test, all adjusted p values < 2.2e-16; Figure 3D). Classification metrics such as accuracy, sensitivity, specificity, precision, negative predictive value, and F1 score achieved by WIT-17Mb were also significantly higher than those of the other models (Figures 3E and 3F; Table S10).

The classification performance of WIT in the diagnosis of cancer on CPTAC and PANDA datasets

(A and D) The receiver operating curves and area under the curves.

(B and E) Confusion matrices.

(C and F) Classification metrics of accuracy, sensitivity, specificity, precision, negative predictive value (NPV), and F1-score. AbMIL, attention-based multiple instance learning.

Model interpretability

The multi-headed self-attention modules in WIT measure the association between the classification representation and each image patch. Therefore, the attention scores can be interpreted as the association between each image patch and the classification output. We converted attention scores derived from WIT into human-interpretable heatmaps, which highlight the importance of WSI regions for prediction (See STAR methods). In localization of 32 cancer types, WIT captures tumor regions that pathologists consider morphologically characteristic of the respective cancer types in lung adenocarcinoma (LUAD, Figure 4A), rectum adenocarcinoma (Figure 4B), pancreatic adenocarcinoma (Figure 4C), and uterine corpus endometrial carcinoma (UCEC, Figure 4D). For example, WIT identifies micropapillary tufts forming floret structures as strong evidence in detection of LUAD (Figure 4A). In the diagnosis of cancer, WIT pinpoints the tumor regions of non-keratinizing squamous cells with solid pattern in lung squamous cell carcinoma (Figure 4E) and confluent glandular and cribriform structure in UCEC (Figure 4F). In addition, WIT is able to identify prostate adenocarcinoma (Figures 4G and 4H) and a cluster of small poorly formed glands (Figure 4G) from needle biopsy. We provide visualizations of attention maps for a number of slides for exploration on our interactive website (https://deeplearningplus.github.io/WIT-attention-maps/).

Attention maps of WIT for interpretability in localization of tissue origins and diagnosis of cancer from FFPE WSIs and biopsy

Boxes highlight the typical morphologic features corresponding to the textual description. The interactive visualization is available at https://deeplearningplus.github.io/WIT-attention-maps/.

Discussion

In our study, we proposed a context-aware deep learning approach, WIT, for slide-level localization of tumor origins and diagnosis of cancer from WSIs. WIT outperformed the attention-based MIL²⁰ baseline by significant margins across all classification tasks evaluated, especially in the detection of 32 cancer types where WIT achieved a micro-average area under the receiver operating curve (AUROC) of 0.991 (0.991–0.992) versus 0.968 (0.966–0.969) as obtained by the baseline method.

The high performance of WIT can be attributed to its context-aware ability to learn the potential nonlinear associations among image patches, whereas the baseline method treats different image patches as independent instances. As WIT was built upon the transformer,³⁶ the multi-headed self-attention module enables WIT to learn the interrelation of patches in different subspaces, whereas attention-based multiple-instance learning (MIL) is designed to aggregate multiple instances independently. Attention-based MIL methods such as CLAM,¹⁸ TOAD,²⁰ and CRANE⁸ have been widely and successfully adopted in addressing the challenges of computational pathology. WIT shares the advantage of CLAM and TOAD in that it uses only the slide-level labels without any manual annotation. However, both CLAM and TOAD share the common limitation of MIL-based approaches³⁷ in that they are context-independent rather than context-aware.

As compared with TOAD developed in the previous study, our method performs finer-grained classification. Our method performs classification for 32 cancer types, whereas TOAD performs classification for 18 cancer types. TOAD did not include MESO and DLBC and did not distinguish between READ and COAD; LUSC and LUAD; and KICH, KIRC, and KIRP. In contrast, our method treats each of these cancer subtypes as different classes.

Better performance for UVM, THCA, and PRAD when compared with DLBC, MESO, and READ is related to their morphological features. For example, UVM is characterized by highly distinctive features such as ciliary body location, diffuse-type tumor, ring melanoma of the iris, presence of vascular mimickers, and extraocular extension.³⁸ THCA presents with a papillary pattern or a follicular pattern with or without thyroid colloid.³⁹ PRAD is characterized by perineural invasion, glomerulations, and mucinous fibroplasia (also known as collagenous micronodule).⁴⁰ These features of UVM, THCA, and PRAD are each unique and predominantly different from those of other cancer types. Conversely, DLBC, MESO, and READ are more complex and challenging for accurate diagnosis. DLBCLs are characterized by partial or complete effacement of the normal architecture (nodal or extranodal) by medium- to large-sized lymphoid cells with vesicular chromatin. These features necessitate immunohistochemical staining in the clinical setting for confirmatory diagnosis.⁴¹ Mesothelioma cells are morphologically diverse. It is difficult to distinguish between epithelioid mesothelioma and metastatic carcinoma.⁴² READs are characterized by glandular tubular or diffuse nests depending on their differentiation. These features of tumor cells or structure are not specific among these tumors, and it is difficult to distinguish these tumors from STAD and COAD.

WIT has several specific advantages. First, WIT can be easily scaled into models of different sizes. Larger models have better classification performance than smaller ones. However, the high performance of different WIT models cannot be merely attributed to their model sizes as compared with the attention-based MIL baseline. For example, in the detection of 32 cancer types, WIT-1Mb achieved significantly higher overall accuracy in comparison to the baseline method [73.1% (95% CI, 72.5%–73.9%) versus 64.2% (62.4%–66.0%)] although their model sizes are comparable. Therefore, the high performance of WIT is likely due to its ability to take into account nonlinear associations among all image patches. Besides, overall accuracy steadily increases with model size (Table S2). Second, WIT is data-efficient in that we extracted image patches at ×20 magnification instead of full magnification. In this scenario, the 16 terabytes of the TCGA WSI dataset were converted into a dataset of 200 gigabytes, enabling fast experimentation. Third, the multi-head attention used by WIT enables model interpretability from different feature representation subspaces, allowing for different morphological features to be identified by different attention heads. For example, we observed that one attention head of WIT identified micropapillary tufts forming floret structures as strong evidence for lung adenocarcinoma (Figure 4A), whereas the other heads pay attention to different tissue structures such as normal pulmonary alveoli (Figure S2).

Conclusion

Weakly supervised learning approaches such as MIL have been successfully applied in addressing the challenges of computational pathology. However, their limitations are apparent in that they treat instances independently. Here, we addressed this challenge by presenting WIT, a transformer-based deep learning method for learning a feature representation of the whole slide by taking into account nonlinear associations among image patches. WIT will facilitate adoption of deep-learning-based solutions and enable knowledge discovery in computational pathology.

Limitations of the study

WIT is not without limitations. We used the ResNet50 model³⁴ pretrained on the ImageNet dataset as the feature extractor for image patches of WSIs. ImageNet is a collection of natural scene images. Therefore, using this pretrained ResNet50 model³⁴ to characterize image patches clipped from WSIs is suboptimal. This strategy was also adopted by CLAM, TOAD, and CRANE. Pretraining the feature extractor on image patches of WSIs may have the potential to improve the performance of WIT and all MIL-based methods. However, this will drastically increase the computational resources required. We will address this issue in our future study. In addition, the 2D spatial dependencies among image patches are lost, as WIT accepts flattened patches as input. Addressing this drawback with multi-dimensional transformers such as axial attention⁴³ may improve the performance of WIT.

STAR★Methods

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data

Raw and analyzed data	https://portal.gdc.cancer.gov	TCGA
Raw and analyzed data	https://cancerimagingarchive.net/datascope/cptac	CPTAC
Raw and analyzed data	https://www.kaggle.com/c/prostate-cancer-grade-assessment/data	PANDA

Software and algorithms

CLAM	(Lu et al.¹⁸)	https://github.com/mahmoodlab/CLAM
TOAD	(Lu et al.²⁰)	https://github.com/mahmoodlab/TOAD

Resource availability

Lead contact

Further information and requests for resources and materials should be directed to and will be fulfilled by the lead contact, Xiangchun Li (lixiangchun2014@foxmail.com).

Materials availability

This study did not generate new unique reagents.

Data and code availability

• All datasets were downloaded from public databases. The source list of these datasets is provided in the key resources table. Source code is available at https://github.com/deeplearningplus/WIT.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Experimental model and study participant details

WSI datasets

We collected a total of 22,457 WSIs from The Cancer Genome Atlas (TCGA dataset, n = 11,623), The Clinical Proteomic Tumor Analysis Consortium (CPTAC dataset, n = 5,052), and PANDA (PANDA dataset, n = 5,782).

TCGA dataset

The TCGA dataset covers 32 cancer types: BRCA, KIRC, THCA, UCEC, LGG, LUSC, LUAD, HNSC, COAD, SKCM, PRAD, STAD, BLCA, GBM, LIHC, KIRP, CESC, SARC, PAAD, PCPG, READ, ESCA, TGCT, THYM, KICH, OV, UVM, MESO, UCS, ACC, DLBC, and CHOL. The formalin-fixed paraffin-embedded (FFPE), hematoxylin and eosin (H&E) stained WSIs were used. The details are in Table S1.

CPTAC dataset

We collected a total of 5,052 WSIs from the Cancer Imaging Archive CPTAC Pathology Portal. The collected projects consisted of CPTAC-LUAD, CPTAC-LSCC, CPTAC-SAR, CPTAC-UCEC, CPTAC-UCEC, CPTAC-CCRCC, CPTAC-PDA, CPTAC-HNSCC, CPTAC-SAR and CPTAC-CM (Table S8). The FFPE, H&E stained WSIs from normal donors and cancer patients were used.

The PANDA dataset

This dataset consists of 5,782 slides from prostate cancer patients and non-cancer individuals subjected to needle biopsies. There are 2,891 non-cancer biopsy WSIs. We randomly sampled 2,891 cancer biopsy WSIs to mitigate class imbalance between cancer and non-cancer slides.

Method details

Whole-slide image (WSI) preprocessing

Tissue regions of each slide image were segmented using the CLAM Python package. We used ×20 magnification. We cropped the WSI into 256 × 256 patches within the segmented tissue regions and flattened them into an array. We extracted a feature of 1,024 dimensions for each image patch from the second residual layer of a ResNet50 model¹²^,³⁴ pretrained on the ImageNet dataset. The extracted features of image patches from a WSI were saved to disk.
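The tiling step can be illustrated with a minimal sketch. The helper `tile_coordinates` below is hypothetical and is not taken from CLAM or the WIT source code; it simply enumerates the top-left corners of 256 × 256 patches over a rectangular region and, unlike the actual pipeline, ignores tissue segmentation. The `overlap` parameter is included only because the heatmap visualization described later in the methods tiles with an overlap of 0.80.

```python
def tile_coordinates(width, height, patch_size=256, overlap=0.0):
    """Return top-left (x, y) corners of square patches covering a region.

    Hypothetical helper for illustration; it ignores tissue masks entirely.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Distance between neighbouring patch origins; overlap=0 gives a plain grid.
    step = max(1, int(round(patch_size * (1.0 - overlap))))
    return [
        (x, y)
        for y in range(0, height - patch_size + 1, step)
        for x in range(0, width - patch_size + 1, step)
    ]


# Non-overlapping 256 x 256 tiling of a toy 1024 x 512 region.
grid = tile_coordinates(1024, 512)
# Denser tiling of a toy 512 x 256 region with an overlap of 0.80.
dense = tile_coordinates(512, 256, overlap=0.80)
```

In a real pipeline, each coordinate would additionally be checked against the segmented tissue mask before the patch is read from the slide.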

WIT architecture

WIT consists of an embedding layer and a transformer encoder followed by a softmax layer.

Embedding layer

This layer takes as input the elementwise summation of image patch features and position embeddings of the flattened image patches. We used the pretrained ResNet50 model¹²^,²⁶ as the feature extractor for image patches.
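As a rough illustration of the element-wise summation, the sketch below adds a position embedding to each patch feature. The function name `build_inputs`, the toy dimension of 4, and the decision to prepend a CLS vector that also receives a position embedding are assumptions made for this example; the methods only state that a slide-level CLS representation is added at the start of the flattened feature array.

```python
def build_inputs(patch_features, position_table, cls_vector):
    """Prepend a CLS vector, then add position embeddings element-wise.

    Illustrative only: names and shapes are assumptions, not the WIT code.
    """
    sequence = [cls_vector] + patch_features
    if len(sequence) > len(position_table):
        raise ValueError("not enough position embeddings for this slide")
    return [
        [f + p for f, p in zip(token, position_table[idx])]
        for idx, token in enumerate(sequence)
    ]


dim = 4  # toy size; the extracted patch features have 1,024 dimensions
features = [[1.0] * dim, [2.0] * dim, [3.0] * dim]  # three flattened patches
positions = [[0.1 * i] * dim for i in range(4)]     # one row per sequence slot
cls = [0.0] * dim
inputs = build_inputs(features, positions, cls)
```

In a trained model the position table and the CLS vector would be learned parameters rather than fixed values.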

The transformer encoder

The encoder has two components: a multi-headed self-attention and a position-wise feedforward neural network.

The $i$-th self-attention head is formulated as²⁶:

$$\mathrm{Attention}_{i}(Q_{i}, K_{i}, V_{i}) = \mathrm{softmax}\left(\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}}\right) V_{i}$$

The input embeddings output by the embedding layer are projected to three matrices: query ($Q_{i}$), key ($K_{i}$), and value ($V_{i}$). $d_{k}$ is the dimension of the query, and $\sqrt{d_{k}}$ is used as a scaling factor to mitigate extremely small gradients.⁴⁴
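The scaled dot-product attention of a single head can be sketched in plain Python for small matrices. This is a didactic version written with nested lists, not the implementation used in WIT; the toy `Q`, `K`, and `V` values are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for list-of-list matrices."""
    scale = math.sqrt(len(K[0]))  # d_k is the key/query dimension
    weights = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    output = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return output, weights


# Two toy tokens whose queries match their own keys.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention(Q, K, V)
```

Each row of `w` sums to one, so every output row is a weighted average of the value rows, with more weight on the token whose key best matches the query.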

The multi-headed self-attention is the concatenation of multiple self-attention heads, allowing the transformer to attend to information in different feature representation subspaces. Multi-headed self-attention is formulated as⁴⁴:

$$\mathrm{MultiHeadAttention}(Q, K, V) = \mathrm{Concat}(\mathrm{Attention}_{1}, \ldots, \mathrm{Attention}_{h}) W^{O}$$

where $W^{O} \in \mathbb{R}^{h d_{v} \times d_{\mathrm{model}}}$ denotes the learned projection matrix.

The position-wise feedforward neural network (FFN) consists of two linear layers with a ReLU activation in between:

$$\mathrm{FFN}(x) = \max(0, x W_{1} + b_{1}) W_{2} + b_{2}$$

where $W_{1}$ and $W_{2}$ are weight matrices and $b_{1}$ and $b_{2}$ are the biases.
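The FFN formula can be checked on a small worked example. The sketch below applies it to a single row vector; the weights and biases are arbitrary toy values chosen so that the arithmetic is easy to follow by hand.

```python
def ffn(x, W1, b1, W2, b2):
    """Compute max(0, x W1 + b1) W2 + b2 for one row vector x."""
    # First linear layer followed by ReLU.
    hidden = [
        max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
        for j in range(len(b1))
    ]
    # Second linear layer.
    return [
        sum(h * W2[j][k] for j, h in enumerate(hidden)) + b2[k]
        for k in range(len(b2))
    ]


x = [1.0, -2.0]
W1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]    # 2 x 3
b1 = [0.0, 0.0, 0.5]
W2 = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # 3 x 2
b2 = [0.5, -0.5]
y = ffn(x, W1, b1, W2, b2)
```

Here `x W1 + b1` equals `[1.0, -2.0, -0.5]`, the ReLU keeps only the first entry, and the second layer maps the result to `[2.5, -0.5]`.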

Layer normalization⁴⁵ is applied before and after the FFN. Residual connections¹² are applied to improve information flow.
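One way to write out how the two sub-modules, the residual connections, and layer normalization fit together is given below. It assumes the post-normalization arrangement of the original transformer, which matches the overview statement that residual connections around the sub-modules are followed by layer normalization; the exact ordering in the WIT implementation is an assumption here, and $x'$ and $y$ are symbols introduced only for this illustration.

```latex
\begin{aligned}
x' &= \mathrm{LayerNorm}\bigl(x + \mathrm{MultiHeadAttention}(x, x, x)\bigr) \\
y  &= \mathrm{LayerNorm}\bigl(x' + \mathrm{FFN}(x')\bigr)
\end{aligned}
```

Under this reading, the normalization that closes the attention sub-module sits immediately before the FFN and a second normalization follows it.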

Model training

The WSIs were randomly sampled, and WIT was trained for 100 epochs. The weight and bias parameters of the model were initialized randomly, and slide-level labels were used as the ground truth. We used the cross-entropy loss⁴⁶ as the objective function for classification. The model parameters were updated via the AdamW optimizer with an initial learning rate of 2 × 10⁻⁵ and a weight decay of 1 × 10⁻⁵. WIT was trained with PyTorch (version 1.12.0) and transformers (version 4.21.1) on an NVIDIA DGX A100.
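For readers unfamiliar with the objective, the cross-entropy loss for a single slide can be sketched from raw class scores (logits) as follows. This is a plain-Python illustration of the standard formula, not the training code; in PyTorch the same quantity is provided by `torch.nn.CrossEntropyLoss`.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


uniform_loss = cross_entropy([0.0, 0.0], 0)          # two equally likely classes
confident_right = cross_entropy([10.0, 0.0, 0.0], 0)  # high score on true class
confident_wrong = cross_entropy([10.0, 0.0, 0.0], 1)  # high score on wrong class
```

The loss is small when the model places most of its probability on the slide-level label and grows as probability shifts to other classes.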

Different WIT models

We evaluated four WIT models with different parameters by varying the hidden size: WIT-1Mb, WIT-2Mb, WIT-5Mb and WIT-17Mb. Details of these models are provided in Table S2.

Baseline method

We used the attention-based MIL¹⁸^,¹⁹^,²⁰ implemented in TOAD²⁰ as the baseline method. Attention-based MIL is widely used in computational pathology studies. It treats a WSI as a bag and the image patches on that WSI as instances. It uses attention-based pooling to aggregate the features of all image patches to obtain slide-level feature representations.

Let $H = \{h_{1}, \ldots, h_{K}\}$ be a bag of $K$ instances. The MIL pooling is defined as³⁰:

$$z = \sum_{k = 1}^{K} a_{k} h_{k}$$

where $a_{k}$ is the attention score for the $k$-th instance, which is defined as³⁰:

$$a_{k} = \frac{\exp \{w_{k}^{T} \tanh (V h_{k}^{T})\}}{\sum_{j = 1}^{K} \exp \{w_{j}^{T} \tanh (V h_{j}^{T})\}}$$

where $w_{k}$, $\forall k = 1, \ldots, K$, and $V \in \mathbb{R}^{L \times M}$ are parameters. The tanh is used as the activation function. The network module is trained to assign an attention score $a_{k}$ for each patch³⁰:

$$a_{k} = \frac{\exp \{w_{k}^{T} (\tanh (V h_{k}^{T}) \odot \mathrm{sigm}(U h_{k}^{T}))\}}{\sum_{j = 1}^{K} \exp \{w_{j}^{T} (\tanh (V h_{j}^{T}) \odot \mathrm{sigm}(U h_{j}^{T}))\}}$$

where $U \in \mathbb{R}^{L \times M}$ are parameters, $\odot$ is element-wise multiplication, and sigm(·) is the sigmoid non-linearity.
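The gated attention pooling can be sketched in a few lines of plain Python. Two simplifications are assumptions of this example rather than statements about the baseline code: a single weight vector `w` is shared across instances, as is common in attention-based MIL implementations, and all parameter values are arbitrary toy numbers.

```python
import math


def matvec(M, v):
    """Multiply an L x M matrix by a length-M vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gated_attention_pool(H, V, U, w):
    """Return the pooled bag feature z and attention scores a."""
    scores = []
    for h in H:
        t = [math.tanh(s) for s in matvec(V, h)]                 # tanh branch
        g = [1.0 / (1.0 + math.exp(-s)) for s in matvec(U, h)]   # sigmoid gate
        scores.append(sum(wi * ti * gi for wi, ti, gi in zip(w, t, g)))
    # Softmax over instances turns raw scores into attention weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    a = [e / total for e in exps]
    # z is the attention-weighted sum of instance features.
    z = [sum(a[k] * H[k][c] for k in range(len(H))) for c in range(len(H[0]))]
    return z, a


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K = 3 instances, M = 2 features
V = [[0.5, -0.5], [0.3, 0.8]]             # L = 2
U = [[0.2, 0.1], [-0.4, 0.6]]
w = [1.0, -1.0]
z, a = gated_attention_pool(H, V, U, w)
```

Because the attention scores are non-negative and sum to one, `z` is a convex combination of the instance features, computed without any interaction between instances.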

Visualization of attention map

For a given self-attention head, let α be the self-attention matrix; $α_{i, j}$ is the attention weight between the $i$-th and $j$-th positions of the input sequence. The attention score of the $i$-th patch with the slide-level representation measures the contribution of the $i$-th patch to classification. CLS denotes a slide-level representation that we added at the start of the flattened feature array of image patches for each WSI and that is used for classification during training. The self-attention is obtained via:

$$\mathrm{softmax}\left(\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}}\right)$$

Assuming there are K patches in a WSI, the first row of each self-attention matrix (denoted as α₀) quantifies the influence of each patch on classification. α₀ is converted to normalized percentile scores and scaled to the interval $[0, 1]$ as proposed in CLAM.¹⁸ The normalized attention scores were converted to RGB colors using a diverging colormap and displayed on the corresponding spatial regions of the slide, with high attention displayed in red and low attention in deep purple, using Matplotlib (version 3.5.2). We tiled the WSI into 256 × 256 patches using an overlap of 0.80 to create more fine-grained heatmaps. Gaussian blur was used to smooth uneven pixel values in the heatmap image using OpenCV (version 4.7.0). We used the code of the CLAM Python package for attention map visualization.¹⁸ We used a diverging color scheme (i.e., the seismic palette in the Python matplotlib package) to represent the attention scores and overlaid them onto the WSI image. The redder a region, the higher the probability that it is cancer; the bluer a region, the higher the probability that it is non-cancer.
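The conversion of raw attention scores to percentile scores in $[0, 1]$ can be sketched as a rank transform. The function `percentile_scores` below is a hypothetical stand-in that uses average ranks for ties; the exact procedure in the CLAM package may differ in detail.

```python
def percentile_scores(scores):
    """Map raw scores to rank-based percentiles in (0, 1]."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / n for r in ranks]


# Toy attention row for four patches; two patches share the top score.
alpha_0 = [0.1, 0.4, 0.2, 0.4]
normalized = percentile_scores(alpha_0)
```

A rank-based transform of this kind makes heatmaps comparable across slides because the color of a patch depends on its rank within the slide rather than on the absolute attention value.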

Quantification and statistical analysis

Model evaluation

We used the area under the receiver operating curve (AUROC), accuracy, precision (also known as positive predictive value), recall rate, negative predictive value (NPV), and F1 score to assess the performance of WIT. Precision is the ratio of true positives to total predicted positives. Recall rate is the ratio of true positives to total actual positives. We reported the top-K accuracy for K = 1, 2, 3 on localization of 32 cancer types. NPV is defined as the number of true negatives divided by the number of samples predicted to be negative. F1 score is the harmonic mean of precision and recall rate.
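The metric definitions above translate directly into code. The sketch below uses made-up confusion counts and class probabilities purely for illustration, and it assumes all denominators are non-zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the binary classification metrics defined in the text."""
    precision = tp / (tp + fp)  # true positives / predicted positives
    recall = tp / (tp + fn)     # true positives / actual positives
    npv = tn / (tn + fn)        # true negatives / predicted negatives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "npv": npv,
            "f1": f1, "accuracy": accuracy}


def top_k_accuracy(prob_rows, labels, k):
    """Fraction of samples whose true label is among the k top-scored classes."""
    hits = 0
    for probs, label in zip(prob_rows, labels):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


metrics = binary_metrics(tp=8, fp=2, tn=9, fn=1)
probs = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5], [0.1, 0.5, 0.4]]
labels = [0, 1, 2]
top1 = top_k_accuracy(probs, labels, 1)
top2 = top_k_accuracy(probs, labels, 2)
```

In this toy example only the first sample is correct at top-1, while all three true labels fall within the top two predictions.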

Statistics and software

We conducted our experiments with Python (version 3.8.10), OpenSlide (version 1.2.0), Pillow (version 9.1.1), R (version 4.2.1), ggplot2 (version 3.3.6), ROCR (version 1.0.11), multiROC (version 1.1.1), and pROC⁴⁷ (version 1.18.0). The visualization of the precision-recall curve (PRC) and calculation of the area under the PRC were performed with ROCR. Calculation of the micro-averaged AUROC was performed with multiROC. Calculation of the AUROC was performed with pROC.⁴⁷ The 95% confidence intervals of the AUROC were calculated using DeLong’s method implemented in pROC. The 95% confidence intervals for accuracy, sensitivity, specificity, precision, negative predictive value, and F1 score were calculated with the Clopper-Pearson method.⁴⁸
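The Clopper-Pearson interval is the exact binomial confidence interval for a proportion. The sketch below solves for its bounds by bisection on the binomial cumulative distribution function using only the Python standard library. It is meant to show the idea on small counts and is not the routine used in the study; the direct summation is impractical for thousands of samples.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(successes, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a proportion."""

    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for f(p) = target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower bound: P(X >= successes | p) = alpha / 2.
    lower = 0.0 if successes == 0 else solve(
        lambda p: binom_cdf(successes - 1, n, p), 1.0 - alpha / 2.0)
    # Upper bound: P(X <= successes | p) = alpha / 2.
    upper = 1.0 if successes == n else solve(
        lambda p: binom_cdf(successes, n, p), alpha / 2.0)
    return lower, upper


lower, upper = clopper_pearson(8, 10)  # toy example: 8 correct out of 10
lower0, upper0 = clopper_pearson(0, 10)
```

For zero successes the upper bound has the closed form $1 - (\alpha/2)^{1/n}$, which provides a convenient check on the bisection.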

Additional resources

This study did not generate additional data.

Acknowledgments

We are grateful to the researchers who generously made their data publicly available. This work was supported by the National Natural Science Foundation of China (Grant No. 32270688 and 31801117 to X.L. and 82073287 to Q.Z.), National Key Research and Development Program of China (Grant No. 2021YFC2500400 to K.C.), and Program for Changjiang Scholars and Innovative Research Team in University in China (Grant No. IRT_14R40 to K.C.). This work was funded by Tianjin Key Medical Discipline (Specialty) Construction Project (TJYXZDXK-009A).

Author contributions

Xiangchun Li and Kexin Chen designed and supervised the study; Xiangchun Li and Hongru Shen performed data analysis and wrote the manuscript; Xiangchun Li developed the model; Jianghua Wu interpreted the whole-slide image data. Xiangchun Li, Hongru Shen, Xilin Shen, Jiani Hu, Jilei Liu, and Qiang Zhang collected data; Yan Sun provided comments on the results. Hongru Shen, Xiangchun Li, and Kexin Chen revised the manuscript.

Declaration of interests

The authors declare that they have no conflict of interest.

Published: October 12, 2023

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.isci.2023.108175.

Contributor Information

Kexin Chen, Email: chenkexin@tmu.edu.cn.

Xiangchun Li, Email: lixiangchun2014@foxmail.com.

Supplemental information

Document S1. Figures S1 and S2 and Tables S1‒S10

mmc1.pdf (1.7 MB, pdf)

References

1. Campanella G., Hanna M.G., Geneslaw L., Miraflor A., Werneck Krauss Silva V., Busam K.J., Brogi E., Reuter V.E., Klimstra D.S., Fuchs T.J. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 2019;25:1301–1309. doi: 10.1038/s41591-019-0508-1.
2. Ström P., Kartasalo K., Olsson H., Solorzano L., Delahunt B., Berney D.M., Bostwick D.G., Evans A.J., Grignon D.J., Humphrey P.A., et al. Artificial intelligence for diagnosis and grading of prostate cancer in biopsies: a population-based, diagnostic study. Lancet Oncol. 2020;21:222–232. doi: 10.1016/S1470-2045(19)30738-7.
3. Kotei E., Thirunavukarasu R. Computational techniques for the automated detection of mycobacterium tuberculosis from digitized sputum smear microscopic images: A systematic review. Prog. Biophys. Mol. Biol. 2022;171:4–16. doi: 10.1016/j.pbiomolbio.2022.03.004.
4. Chen R.J., Lu M.Y., Williamson D.F.K., Chen T.Y., Lipkova J., Noor Z., Shaban M., Shady M., Williams M., Joo B., Mahmood F. Pan-cancer integrative histology-genomic analysis via multimodal deep learning. Cancer Cell. 2022;40:865–878.e6. doi: 10.1016/j.ccell.2022.07.004.
5. Mobadersany P., Yousefi S., Amgad M., Gutman D.A., Barnholtz-Sloan J.S., Velázquez Vega J.E., Brat D.J., Cooper L.A.D. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc. Natl. Acad. Sci. USA. 2018;115:E2970–E2979. doi: 10.1073/pnas.1717139115.
6. Chen R.J., Lu M.Y., Wang J., Williamson D.F.K., Rodig S.J., Lindeman N.I., Mahmood F. Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis. IEEE Trans. Med. Imaging. 2022;41:757–770. doi: 10.1109/TMI.2020.3021387.
7. Courtiol P., Maussion C., Moarii M., Pronier E., Pilcer S., Sefta M., Manceron P., Toldo S., Zaslavskiy M., Le Stang N., et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nat. Med. 2019;25:1519–1525. doi: 10.1038/s41591-019-0583-3.
8. Lipkova J., Chen T.Y., Lu M.Y., Chen R.J., Shady M., Williams M., Wang J., Noor Z., Mitchell R.N., Turan M., et al. Deep learning-enabled assessment of cardiac allograft rejection from endomyocardial biopsies. Nat. Med. 2022;28:575–582. doi: 10.1038/s41591-022-01709-2.
9. Shamai G., Livne A., Polónia A., Sabo E., Cretu A., Bar-Sela G., Kimmel R. Deep learning-based image analysis predicts PD-L1 status from H&E-stained histopathology images in breast cancer. Nat. Commun. 2022;13:6753. doi: 10.1038/s41467-022-34275-9.
10. Krizhevsky A., Sutskever I., Hinton G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM. 2017;60:84–90.
11. Huang G., Liu Z., Van Der Maaten L., Weinberger K.Q. 2017. Densely Connected Convolutional Networks; pp. 4700–4708.
12. He K., Zhang X., Ren S., Sun J. 2016. Deep Residual Learning for Image Recognition; pp. 770–778.
13. Simonyan K., Zisserman A. Very deep convolutional networks for large-scale image recognition. Preprint at arXiv. 2014. doi: 10.48550/arXiv.1409.1556.
14. Hou L., Samaras D., Kurc T.M., Gao Y., Davis J.E., Saltz J.H. Patch-based Convolutional Neural Network for Whole Slide Tissue Image Classification. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2016;2016:2424–2433. doi: 10.1109/CVPR.2016.266.
15. Coudray N., Ocampo P.S., Sakellaropoulos T., Narula N., Snuderl M., Fenyö D., Moreira A.L., Razavian N., Tsirigos A. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 2018;24:1559–1567. doi: 10.1038/s41591-018-0177-5.
16. Wei J.W., Tafe L.J., Linnik Y.A., Vaickus L.J., Tomita N., Hassanpour S. Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. Sci. Rep. 2019;9:3358. doi: 10.1038/s41598-019-40041-7.
17. Su Z., Tavolara T.E., Carreno-Galeano G., Lee S.J., Gurcan M.N., Niazi M.K.K. Attention2majority: Weak multiple instance learning for regenerative kidney grading on whole slide images. Med. Image Anal. 2022;79. doi: 10.1016/j.media.2022.102462.
18. Lu M.Y., Williamson D.F.K., Chen T.Y., Chen R.J., Barbieri M., Mahmood F. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 2021;5:555–570. doi: 10.1038/s41551-020-00682-w.
19. Ilse M., Tomczak J.M., Welling M. Attention-based Deep Multiple Instance Learning. Preprint at arXiv. 2018. doi: 10.48550/arXiv.1802.04712.
20. Lu M.Y., Chen T.Y., Williamson D.F.K., Zhao M., Shady M., Lipkova J., Mahmood F. AI-based pathology predicts origins for cancers of unknown primary. Nature. 2021;594:106–110. doi: 10.1038/s41586-021-03512-4.
21. Kather J.N., Pearson A.T., Halama N., Jäger D., Krause J., Loosen S.H., Marx A., Boor P., Tacke F., Neumann U.P., et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 2019;25:1054–1056. doi: 10.1038/s41591-019-0462-y.
22. Lou J., Xu J., Zhang Y., Sun Y., Fang A., Liu J., Mur L.A.J., Ji B. PPsNet: An improved deep learning model for microsatellite instability high prediction in colorectal cancer from whole slide images. Comput. Methods Programs Biomed. 2022;225. doi: 10.1016/j.cmpb.2022.107095.
23. Echle A., Grabsch H.I., Quirke P., van den Brandt P.A., West N.P., Hutchins G.G.A., Heij L.R., Tan X., Richman S.D., Krause J., et al. Clinical-Grade Detection of Microsatellite Instability in Colorectal Tumors by Deep Learning. Gastroenterology. 2020;159:1406–1416.e11. doi: 10.1053/j.gastro.2020.06.021.
24. Yamashita R., Long J., Longacre T., Peng L., Berry G., Martin B., Higgins J., Rubin D.L., Shen J. Deep learning model for the prediction of microsatellite instability in colorectal cancer: a diagnostic study. Lancet Oncol. 2021;22:132–141. doi: 10.1016/S1470-2045(20)30535-0.
25. Wang T., Lu W., Yang F., Liu L., Dong Z., Tang W., Chang J., Huan W., Huang K., Yao J. IEEE; 2020. Microsatellite Instability Prediction of Uterine Corpus Endometrial Carcinoma Based on H&E Histology Whole-Slide Imaging; pp. 1289–1292.
26. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017;30:15–26.
27. Dosovitskiy A., Beyer L., Kolesnikov A., Weissenborn D., Zhai X., Unterthiner T., Dehghani M., Minderer M., Heigold G., Gelly S. An image is worth 16x16 words: Transformers for image recognition at scale. Preprint at arXiv. 2020. doi: 10.48550/arXiv.2010.11929.
28. Kotei E., Thirunavukarasu R. A Systematic Review of Transformer-Based Pre-Trained Language Models through Self-Supervised Learning. Information. 2023;14:187.
29. Gao X., Qian Y., Gao A. Covid-vit: Classification of covid-19 from ct chest images based on vision transformer models. Preprint at arXiv. 2021. doi: 10.48550/arXiv.2107.01682.
30. Karimi D., Vasylechko S.D., Gholipour A. Springer; 2021. Convolution-free Medical Image Segmentation Using Transformers; pp. 78–88.
31. Chen J., Lu Y., Yu Q., Luo X., Adeli E., Wang Y., Lu L., Yuille A.L., Zhou Y. Transunet: Transformers make strong encoders for medical image segmentation. Preprint at arXiv. 2021. doi: 10.48550/arXiv.2102.04306.
32. Brown T., Mann B., Ryder N., Subbiah M., Kaplan J.D., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020;33:1877–1901.
33. Zellers R., Holtzman A., Rashkin H., Bisk Y., Farhadi A., Roesner F., Choi Y. Defending against neural fake news. Adv. Neural Inf. Process. Syst. 2019;32:9054–9065.
34. Caron M., Misra I., Mairal J., Goyal P., Bojanowski P., Joulin A. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 2020;33:9912–9924.
35. Bulten W., Kartasalo K., Chen P.H.C., Ström P., Pinckaers H., Nagpal K., Cai Y., Steiner D.F., van Boven H., Vink R., et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat. Med. 2022;28:154–163. doi: 10.1038/s41591-021-01620-2.
36. Devlin J., Chang M.-W., Lee K., Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint at arXiv. 2018. doi: 10.48550/arXiv.1810.04805.
37. Maron O., Lozano-Pérez T. A framework for multiple-instance learning. Adv. Neural Inf. Process. Syst. 1997;10:570–576.
38. Chévez-Barrios P. In: Uveal Melanoma: Biology and Management. Bernicker E.H., editor. Springer International Publishing; 2021. Pathology of Uveal Melanoma; pp. 37–51.
39. Shah J.P. Thyroid carcinoma: epidemiology, histology, and diagnosis. Clin. Adv. Hematol. Oncol. 2015;13:3–6.
40. Magi-Galluzzi C. Prostate cancer: diagnostic criteria and role of immunohistochemistry. Mod. Pathol. 2018;31:12–21. doi: 10.1038/modpathol.2017.139.
41. Diebold J., Anderson J.R., Armitage J.O., Connors J.M., Maclennan K.A., Müller-Hermelink H.K., Nathwani B.N., Ullrich F., Weisenburger D.D. Diffuse large B-cell lymphoma: a clinicopathologic analysis of 444 cases classified according to the updated Kiel classification. Leuk. Lymphoma. 2002;43:97–104. doi: 10.1080/10428190210173.
42. Addis B., Roche H. Problems in mesothelioma diagnosis. Histopathology. 2009;54:55–68. doi: 10.1111/j.1365-2559.2008.03178.x.
43. Ho J., Kalchbrenner N., Weissenborn D., Salimans T. Axial attention in multidimensional transformers. Preprint at arXiv. 2019. doi: 10.48550/arXiv.1912.12180.
44. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. 2017. Attention Is All You Need; pp. 5998–6008.
45. Ba J.L., Kiros J.R., Hinton G.E. Layer normalization. Preprint at arXiv. 2016. doi: 10.48550/arXiv.1607.06450.
46. Zhang Z., Sabuncu M. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv. Neural Inf. Process. Syst. 2018;31:11–25.
47. Sing T., Sander O., Beerenwinkel N., Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21:3940–3941. doi: 10.1093/bioinformatics/bti623.
48. Clopper C.J., Pearson E.S. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika. 1934;26:404–413.

Associated Data

Data Availability Statement

• All datasets were downloaded from public databases. The source list of these datasets was provided in the key resources table. Source code is available at https://github.com/deeplearningplus/WIT.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
