Abstract
In medical image diagnosis, identifying the attention region, i.e., the region of interest for which the diagnosis is made, is an important task. Various methods have been developed to automatically identify target regions from given medical images. However, in actual medical practice, the diagnosis is made based on both the images and various clinical records. Consequently, pathologists examine medical images with prior knowledge of the patients, and the attention regions may change depending on the clinical records. In this study, we propose a method, called the Personalized Attention Mechanism (PersAM) method, by which the attention regions in medical images are adaptively changed according to the clinical records. The primary idea underlying the PersAM method is the encoding of the relationships between medical images and clinical records using a variant of the Transformer architecture. To demonstrate the effectiveness of the PersAM method, we applied it to a large-scale digital pathology problem involving identifying the subtypes of 842 malignant lymphoma patients based on their gigapixel whole-slide images and clinical records.
Keywords: Digital pathology, Multimodal analysis, Personalized attention, Transformer, Whole slide image
Introduction
Medical images are often diagnosed on the basis of specific regions of interest in the images rather than their entirety. For example, cancer pathologists typically focus on specific tumor regions rather than the entire pathological tissue specimen. In this study, we refer to such regions as attention regions. Developing computational methods to estimate the attention regions is an important task in medical image analysis for obtaining both high performance and explainability. In existing methods, the attention regions are predominantly estimated based solely on the images themselves.1, 2, 3 However, in clinical practice, pathologists use both the imaging information and various clinical records (including basic demographic details such as patient age and gender, the results of various medical examinations, and genetic information). It is well recognized among pathologists that patient-specific information can help them focus on specific tissues in the specimens or narrow the diagnostic target classes. For example, the region to focus on in a tissue slide changes depending on the organ from which the tissue specimen was taken, and the results of a medical interview and medical tests narrow down the suspected diseases. The additional use of clinical record information is also known to enhance performance in medical image analysis.4, 5, 6, 7 In this study, we introduce a framework, called the Personalized Attention Mechanism (PersAM) framework, that adaptively changes the attention regions in medical images according to patient-specific information. The PersAM framework mimics pathologists' decision-making and provides high explainability by modeling the relationship between medical images and clinical records.
In this paper, we focus on the PersAM framework in the context of digital pathology. In particular, we deal with malignant lymphoma as the target disease, although the proposed PersAM framework can be applied to other images in similar problem settings. In digital pathology, whole slide images (WSIs), which are large digital images of tissue slides produced by a scanner, are used as image data. A WSI can be as large as 100000×100000 pixels, and its tissue regions contain a mixture of tumor and normal regions. Therefore, it is especially important in digital pathology to identify the attention regions in a vast image and to make a diagnosis by focusing on specific areas of the tissue specimen. For example, in malignant lymphoma, pathologists diagnose diffuse large B-cell lymphoma (DLBCL) by focusing on large cells in a tissue specimen, whereas they diagnose follicular lymphoma (FL) by focusing on follicular structures. In practical diagnosis, as mentioned above, pathologists observe such regions while considering the patient's clinical record, which includes basic profiles and the results of some examinations. As the main target problem of this paper, we consider the PersAM framework in a digital cancer pathology task, where clinical records can be used together with the WSIs of tissue specimens as patient-specific information.
The problem of attention region estimation in digital cancer pathology can be formulated as a weakly supervised learning problem because only the class label for the entire image is given; annotations for the attention region are not. Some public databases include pathologists' annotations for tumor regions, but most problem settings that employ private datasets have only patient-level annotations and no patch-level annotations. Hence, in digital pathology using WSIs, attention region estimation and the associated machine learning task generally have to be performed with only patient-level annotations. Multiple Instance Learning (MIL)1,8,9 is one method used for such weakly supervised attention region estimation problems. In MIL, an image patch is considered an instance and the entire image (or a set of a large number of patches) is considered a bag. The problem then reduces to estimating the label of each image patch given the label of the bag (e.g., tumor or normal), where the image patches estimated to be tumors are interpreted as the attention regions. In the context of MIL in digital pathology, attention-based MIL is well known as a successful method.1, 2, 3 Attention-based MIL computes attention weights that indicate how much each instance contributes to the classification result. Instances in a WSI that have higher attention weights are interpreted as tumor regions in attention-based MIL for digital pathology. In this study, we introduce the PersAM framework, which can adaptively change attention regions depending on the clinical record information even if the input WSI is the same. The aforementioned attention-based MIL has no mechanism to change the attention regions when different clinical record information is given with the same WSI, since attention weights are calculated independently for each instance without considering the relationship between medical images and clinical records.
The proposed PersAM method employs a variant of the Transformer architecture to encode the relationships between the medical images and corresponding clinical records. In the Transformer architecture, the relationships between multiple components are expressed in the form of attentions.10 When the Transformer architecture is applied to computer vision, attention regions can be calculated by encoding the relationship between image patches.11 The Transformer architecture can also be extended to multimodal inputs such as images and table data, in which case it encodes the relationship between the instances of the multimodal inputs.12,13 By combining the medical images and clinical records and then computing the attentions, the proposed method enables personalized attention, which represents the strength of the relationship between each clinical record and each region (patch) in the image.
In this study, we present a weakly supervised attention region estimation problem formulated as an MIL problem and propose the PersAM method to obtain attention regions that can be adaptively changed according to patient clinical records. Fig. 1 illustrates the concept of the proposed PersAM method, where different clinical record information is given together with the same WSI as inputs. Red regions in the personalized attention maps represent the attention regions in each output, and the PersAM method can provide different attention regions depending on the clinical records even if the input WSIs are the same. This mimics a pathologist's decision-making, in which they observe tumor-specific regions in the tissue specimen while considering the corresponding patient's clinical record. With a slight abuse of terminology, we refer to both the framework and our proposed method as PersAM in this work. The proposed PersAM method provides 2 types of personalized attentions: exploratory and explanatory attentions. Exploratory attention is a class-independent attention that is determined solely by a WSI and a clinical record, i.e., the first regions of interest to the pathologist when observing a tissue specimen. On the other hand, explanatory attention is a class-dependent attention that is determined by a WSI, a clinical record, and class information, i.e., the regions of interest to the pathologist when predicting a disease. To obtain these attentions, the proposed model has a Transformer architecture that can encode the relationship among images, clinical records, and class information.
Fig. 1.
Overview of the proposed PersAM method. A WSI and a clinical record are fed into the model together. Regions with red color in Personalized attention represent attention regions in each output. The PersAM method provides us the personalized attention according to the clinical record, where attentions change depending on clinical factors, even if the same WSI is input. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
To demonstrate the effectiveness of the proposed PersAM method, we applied it to the pathological subtype classification of 842 patients with malignant lymphoma. The training dataset consisted of WSIs and clinical records, where each WSI was a gigapixel image of an entire pathological tissue slide, and the clinical record included the age, gender, target organ of the tissue section, results of an interview with a doctor, and blood test results. By combining pathological images and clinical records, the proposed method performed better than several baseline methods. Furthermore, we confirmed that the proposed PersAM method can successfully provide personalized attention in the Transformer architecture.
The main contributions of this work are summarized as follows.
1. Inspired by medical image diagnoses by pathologists in clinical practice, we introduce a framework for a personalized attention mechanism in which the attention is determined on the basis of patient-specific information.
2. For the problem of weakly supervised attention region estimation based on MIL, we propose a variant of the Transformer architecture.
3. We apply the proposed model to a large-scale digital pathology task to demonstrate the effectiveness of the proposed framework and method.
Preliminaries
Problem setup
In this paper, we focus on the PersAM framework in the context of subtype classification for digital cancer pathology. Let $[N] = \{1, \ldots, N\}$ be the set of natural numbers up to $N$. The training dataset is denoted as $\{(X_n, T_n, y_n)\}_{n \in [N]}$, where $N$ is the number of patients and each of $X_n$, $T_n$, and $y_n$ represents the pathological image, clinical record, and subtype class label of the $n$th patient, respectively. The image $X_n$ is a digitally scanned WSI of the entire pathological specimen. Because the WSI is usually a huge image of gigapixel size, it is too large to be directly fed into the model. Therefore, image patches extracted from $X_n$ are used as the inputs to the model, and we write $X_n = \{x_{n,\ell}\}_{\ell \in [L_n]}$, where $x_{n,\ell}$ is the $\ell$th image patch taken from the $n$th patient and $L_n$ is the number of image patches taken from $X_n$. The clinical record is represented as a set of numerical vectors and denoted as $T_n = \{t_{n,m}\}_{m \in [M]}$, where $t_{n,m}$ is the $m$th clinical factor represented as a vector and $M$ is the number of clinical factors. For example, in Experimental evaluation, we consider the case with $M = 2$, where the first clinical factor consists of the patient profile (such as age and gender) and the results of a medical interview, whereas the second clinical factor is a set of blood test results. In our clinical records, the patient profiles are represented by integer values or binary labels, the results of a medical interview are represented by binary labels, and the blood test results are represented by continuous values. If a clinical record contains text information such as findings, it can be used as a clinical factor $t$ after being vectorized in some manner. The details of the clinical record information are explained in Experimental evaluation. The subtype class label $y_n$ is represented as a $C$-dimensional one-hot vector.
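As a rough illustration of the data layout described above, the following Python sketch represents one patient as a set of image patches, M clinical factor vectors, and a one-hot subtype label. The class name `PatientCase`, the helper `one_hot`, and all toy values are hypothetical and not taken from the original work; patches are stood in for by identifier strings rather than real pixel arrays.

```python
from dataclasses import dataclass
from typing import List


def one_hot(class_index: int, num_classes: int) -> List[int]:
    """Return a C-dimensional one-hot vector for the given class index."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index out of range")
    return [1 if c == class_index else 0 for c in range(num_classes)]


@dataclass
class PatientCase:
    """One training example: patches from a WSI, clinical factors, and a label."""
    patches: List[str]                   # placeholder patch identifiers (L_n entries)
    clinical_factors: List[List[float]]  # M vectors, possibly of different lengths
    label: List[int]                     # C-dimensional one-hot subtype label


# Toy example with M = 2 clinical factors and C = 3 subtypes (values are made up).
case = PatientCase(
    patches=["patch_0", "patch_1", "patch_2"],
    clinical_factors=[[63.0, 1.0, 0.0], [4.5, 6.1]],
    label=one_hot(0, 3),
)
```

Keeping the clinical factors as separate vectors, rather than one concatenated vector, mirrors the text, where each factor later becomes its own token.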
In digital cancer pathology, a WSI includes both tumor cells and normal cells, and subtype diagnosis is conducted on the basis of a subset of the tumor cells. This means that, among the image patches taken from a WSI, only some are considered to contain useful information for subtype classification. We regard that subset of image patches as the attention region. We represent the attention degree of each image patch as an attention weight. Given a pathological WSI and a clinical record, our model provides the attention weights, each of which represents the importance of an image patch, and then makes a subtype classification based on the attention weights. As an example of the application of the PersAM framework to the digital pathology problem, we consider the case where the attention weights vary according to the clinical records. Clinical records possibly contain 2 types of information: (i) the parts of the pathological specimen that should be observed and (ii) the subtype under which the case is likely to be classified. In this study, we consider 2 types of personalized attentions, called exploratory and explanatory attentions, which are obtained from these 2 types of information, respectively. We introduce a variant of the Transformer architecture that can provide both the exploratory attention weight and the explanatory attention weight of each image patch.
Related works
Digital pathology. Pathological diagnosis plays an important role in medicine. Various computer-aided diagnosis methods for pathological images have been developed for various problems, such as classification,2,14,15 tumor region identification,16, 17, 18 segmentation,19, 20, 21, 22 survival prediction,23, 24, 25 and similar image retrieval.26, 27, 28 In digital pathology, a digital scan of the entire pathology specimen, called a WSI, is used as the target image. Because a WSI is usually huge (e.g., 100000×100000 pixels), it cannot be directly fed into a model. Therefore, image patches extracted from the WSI are often used as the inputs to a model. In pathological diagnosis based on WSIs, it is important to note that WSIs contain both tumor cells and normal cells. Therefore, if there is no annotation of the tumor cell region, it is necessary to first identify the tumor cell region and then make a pathological diagnosis. This problem is a weakly supervised learning problem in the sense that only the WSI is labeled and the tumor cell region is not.
Multiple instance learning (MIL). MIL is a weakly supervised learning problem in which labels are not given for instances but for a group of instances called a bag. In an MIL formulation of a binary classification problem, it is assumed that a positive bag contains at least 1 positive instance, whereas a negative bag contains only negative instances. By considering a WSI as a bag and an image patch as an instance, the subtype classification problem can be interpreted as an MIL problem in which class-specific image patches (e.g., tumor patches) are considered positive instances. Several MIL approaches have been developed for digital pathology tasks.1,8,9 Among them, attention-based MIL1, 2, 3 is particularly useful because the identified attention regions can be interpreted as class-specific image patches.
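The MIL assumption and the idea of attention-weighted aggregation can be sketched in a few lines of Python. This is only a minimal illustration: the function names are hypothetical, and the per-instance scores passed to `attention_pool` are assumed to be given, whereas in attention-based MIL they would come from a small learned network.

```python
import math
from typing import List, Tuple


def bag_label(instance_labels: List[int]) -> int:
    """Binary MIL assumption: a bag is positive iff at least one instance is positive."""
    return int(any(label == 1 for label in instance_labels))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    features: List[List[float]], scores: List[float]
) -> Tuple[List[float], List[float]]:
    """Weight instance features by softmax-normalized scores and sum them into a bag feature."""
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, pooled
```

The returned weights are what would be visualized as attention: instances with larger weights contribute more to the bag-level feature.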
Attention and explanation. Although the development of deep learning techniques has dramatically improved the accuracies of many medical image analysis tasks, it is critically important for medical practice to develop techniques that provide explanation of the results. Various visualization methods, such as Grad-CAM,29 have been proposed to interpret and explain the rationale for classification results. Singla et al.30 proposed a method that visualizes the medical image regions that serve as the basis for the classification of diseases for each concept derived from clinical report analysis. However, most visualization methods visualize the regions that contribute to the classification results after the classifier is applied to given images (rather than using clinical reports for finding and visualizing the image regions that contribute to the classification results, as in our proposed method). The attention-based MIL described above can also be considered as a method to provide explanation of the results because attention can be visualized as an informative region for making decisions. In this study, we employ the Transformer architecture10 as a basis for estimating the attention region in medical images. Although the Transformer was originally developed for natural language processing (NLP) tasks, it has been demonstrated to be effective for general computer vision tasks,11 including medical image analysis.31, 32, 33, 34 In particular, the Transformer has been effectively used to aggregate bag features in an MIL setting.32 The Transformer architecture can encode the relationship among a pair of components in input data, e.g., between 2 words in the case of NLP tasks, and between 2 image patches in the case of computer vision tasks. This study was inspired by the use of the Transformer in the context of vision and language,12,13 where it has been demonstrated that a change in the language token can change the attention of the image token. 
The Transformer has a mechanism to quantify the relevance of multimodal information in the form of attentions. The encoding mechanism for the relationship of input data can be expanded into multimodal input including image patches and table data, which enables computing attention from an image patch to table data, or attention from table data to an image patch. We expect that encoding the relationship between image patches and clinical factors can provide more appropriate personalized attentions based on the clinical record.
Multimodal learning. In clinical practice, doctors obtain additional information from patients' clinical records and image diagnoses are performed by considering such clinical factors as prior information. In multimodal analysis, clinical records that include the basic information of patients and some examination results can often be used in addition to digital images.4, 5, 6, 7 Yala et al.4 used a combination of mammography images and patient data (basic information, medical history, etc.) and demonstrated that the performance of breast cancer risk prediction could be improved. Multimodal analysis of dermoscopic images and patient data (age, gender, and body location) has also been studied to enhance the accuracy of skin lesions classification and melanoma detection.5
Multimodal analyses of medical images and clinical records have primarily been performed on radiology images, but recent works have reported that pathological images can also be combined with clinical records to improve the task performance. Li et al.9 combined tabular clinical data with histological images in an MIL setting, where 18 attributes, including age, genes, and tumor location, were used as inputs with multiscale histological images. Additional clinical factors have also been used in a mixture-of-experts model as inputs of a gating network.8 Chen et al.35 proposed a Transformer-based multimodal model for survival prediction using images and genetic data. Their method could visualize co-attentions between images and genetic information. However, these previous works primarily focused on performance improvement by using multimodal inputs, and detailed effects on attention regions by additional clinical factors have not been reported. The advantage of the proposed PersAM method is that it can provide personalized attention regions according to the clinical records, which mimics the actual pathological practice of human expert pathologists.
Proposed method
This study was conducted to develop an AI system for pathological diagnosis that mimics the actual diagnosis process of human pathologists. When a pathologist makes a diagnosis, they have patient information based on the clinical record, which is used as prior knowledge of the parts of the pathological image on which to focus. We call such an attention region in the early exploratory phase of the diagnosis exploratory attention. Exploratory attention is a class-independent attention that is determined solely by a WSI and a clinical record. Furthermore, after a pathologist has made a diagnosis, they should be able to explain which part of the pathological image they focused on. We call such an attention region in the later explanatory phase of the diagnosis explanatory attention. Explanatory attention is a class-dependent attention that is determined by a WSI, a clinical record, and class information.
Given the WSI and the clinical record of a patient, the proposed PersAM method can identify both exploratory and explanatory attention regions and make a diagnosis based on those identified attention regions. In Proposed network structure, we first introduce a Transformer-based network structure that enables pathological diagnosis based on the 2 types of attentions. Then, in Exploratory/explanatory attentions and subtype classification, we describe how the network can be used to identify both exploratory and explanatory attention regions to make a pathological diagnosis. The key property of the proposed PersAM method is that both exploratory and explanatory attentions can be adaptively changed according to patient clinical records even if the same WSI is given to the model. This mimics the actual diagnostic process employed by human pathologists, making the AI system highly explainable.
In this section, for ease of notation, when there is no ambiguity, we omit the subscript n in referring to the nth patient.
Proposed network structure
To realize pathological diagnosis based on exploratory and explanatory attentions, as shown in Fig. 2, we propose a new network structure consisting of 3 components: (i) feature extractor, (ii) multimodal encoder, and (iii) multimodal aggregator. We describe each of these 3 components below.
Fig. 2.
Illustration of the proposed network structure. The network consists of 3 components: (i) feature extractor, (ii) multimodal encoder, and (iii) multimodal aggregator. The feature extractors compute feature vectors H for each of the image patches and clinical factors to be fed into the Transformer architecture. The multimodal encoder has a role to characterize the relationship among multimodal information consisting of image patches, clinical factors, and class information. The multimodal aggregator aggregates the Transformer-encoded tokens for obtaining exploratory/explanatory attentions and subtype classification results.
Feature extractor
In this section, to simplify the notation, we describe the MIL setting where the entire WSI is considered as a bag and each of the multiple patches taken from the WSI as an instance. In Experimental evaluation, we consider the MIL setting where a WSI contains multiple bags; see Experimental evaluation for details. Let $X = \{x_\ell\}_{\ell \in [L]}$ be the WSI and $T = \{t_m\}_{m \in [M]}$ be the clinical record of a patient, where $x_\ell$, $\ell \in [L]$, is the $\ell$th patch image taken from the WSI $X$, whereas $t_m$, $m \in [M]$, is the $m$th clinical factor. The role of the feature extractor is to compute a feature vector for each of the image patches and clinical factors so that the input data can be used in the Transformer architecture. For image patches, we employ a convolutional neural network (CNN) $f: x_\ell \mapsto h^{p}_{\ell}$ that maps an image patch $x_\ell$ to a feature vector $h^{p}_{\ell} \in \mathbb{R}^{R}$, where $R$ is the dimension of the feature vector. For clinical factors, we employ a simple multi-layer perceptron (MLP) $g_m: t_m \mapsto h^{t}_{m}$ that maps the vector of the $m$th clinical factor into a feature vector $h^{t}_{m} \in \mathbb{R}^{R}$ that has the same dimension as $h^{p}_{\ell}$. We denote the sets of trainable parameters for $f$ and $\{g_m\}_{m \in [M]}$ as $\theta_f$ and $\theta_g$, respectively. We denote the combined feature vectors as
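$$H = [h^{p}_{1}, \ldots, h^{p}_{L}, h^{t}_{1}, \ldots, h^{t}_{M}] \in \mathbb{R}^{R \times (L+M)}. \tag{1}$$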
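To make the role of the feature extractor concrete, the sketch below maps clinical factors of different lengths to a common dimension R with small two-layer MLPs and places them after the patch features. It is an illustration under several assumptions: weights are random rather than trained, biases are omitted, ReLU is applied only to the hidden layer, random vectors stand in for the CNN features of the patches, and all helper names are hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weight: Matrix, x: Vector) -> Vector:
    """Matrix-vector product (biases omitted for brevity)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def make_mlp(in_dim: int, hidden_dim: int, out_dim: int,
             rng: random.Random) -> Callable[[Vector], Vector]:
    """Two-layer MLP g_m: maps a clinical factor of length in_dim to an R-dimensional feature."""
    w1 = random_matrix(hidden_dim, in_dim, rng)
    w2 = random_matrix(out_dim, hidden_dim, rng)

    def g(t: Vector) -> Vector:
        if len(t) != in_dim:
            raise ValueError("unexpected clinical factor length")
        return linear(w2, relu(linear(w1, t)))

    return g


rng = random.Random(0)
R, L = 8, 5                 # toy sizes; the experiments use R = 512
factor_dims = [18, 10]      # lengths of the M = 2 clinical factors
mlps = [make_mlp(d, 16, R, rng) for d in factor_dims]

# Random vectors stand in for the CNN features of the L image patches.
patch_features = [[rng.uniform(-1, 1) for _ in range(R)] for _ in range(L)]
clinical_factors = [[rng.random() for _ in range(d)] for d in factor_dims]
clinical_features = [g(t) for g, t in zip(mlps, clinical_factors)]

# Combined features: L patch tokens followed by M clinical factor tokens.
H = patch_features + clinical_features
```

Using one MLP per clinical factor lets factors with different lengths and value types share the same feature dimension as the image patches.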
Multimodal encoder
The multimodal encoder characterizes the relationship between multimodal information. We implement this component using a Transformer. The Transformer was initially developed for NLP tasks, where each feature vector is called a token. In the proposed network structure, in addition to the image patch tokens $\{h^{p}_{\ell}\}_{\ell \in [L]}$ and clinical factor tokens $\{h^{t}_{m}\}_{m \in [M]}$, we introduce class tokens $\{h^{\mathrm{cls}}_{c}\}_{c \in [C]}$, which are considered trainable parameters. Let
$$\tilde{H} = [h^{p}_{1}, \ldots, h^{p}_{L}, h^{t}_{1}, \ldots, h^{t}_{M}, h^{\mathrm{cls}}_{1}, \ldots, h^{\mathrm{cls}}_{C}] + E^{\mathrm{type}}, \tag{2}$$
where $E^{\mathrm{type}} \in \mathbb{R}^{R \times (L+M+C)}$ is called the token type embedding and is used to characterize the type of tokens. Here, the token type embedding is defined as
| (3) |
where 0 denotes a C-dimensional zero vector. Each token is characterized as an image patch, a clinical factor, or a class token by this token type embedding. Note that the same type token $e^{p}$ is used for all $L$ image patches because the image patches are randomly sampled from the WSI. The token type embedding parameters $e^{p}$ and $\{e^{t}_{m}\}_{m \in [M]}$ are considered trainable parameters.
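The effect of the token type embedding can be sketched as follows. The function below adds one shared embedding to every patch token and a separate embedding to each clinical factor token; leaving the class tokens unchanged (a zero offset) is an assumption based on the zero entry mentioned above, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def add_vectors(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def add_token_type_embedding(
    patch_tokens: List[Vector],
    clinical_tokens: List[Vector],
    class_tokens: List[Vector],
    e_patch: Vector,
    e_clinical: List[Vector],
) -> List[Vector]:
    """Return the L + M + C input tokens after adding token type embeddings."""
    if len(clinical_tokens) != len(e_clinical):
        raise ValueError("need one type embedding per clinical factor")
    # The same embedding is shared by all image patch tokens.
    tokens = [add_vectors(h, e_patch) for h in patch_tokens]
    # Each clinical factor token gets its own embedding.
    tokens += [add_vectors(h, e) for h, e in zip(clinical_tokens, e_clinical)]
    # Class tokens are copied unchanged (assumed zero offset).
    tokens += [list(h) for h in class_tokens]
    return tokens
```

Because randomly sampled patches carry no meaningful order, a single shared patch embedding is sufficient to tell the Transformer which tokens are image patches.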
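We denote the Transformer as a function that maps the input tokens in (2) to $Z \in \mathbb{R}^{R \times (L+M+C)}$, where $Z$ is the collection of the outputs of the Transformer, called Transformer-encoded tokens, denoted as
$$Z = [z^{p}_{1}, \ldots, z^{p}_{L}, z^{t}_{1}, \ldots, z^{t}_{M}, z^{\mathrm{cls}}_{1}, \ldots, z^{\mathrm{cls}}_{C}]. \tag{4}$$
We denote the set of trainable parameters for the Transformer as
$$\theta_{\mathrm{enc}} = \{e^{p}, \{e^{t}_{m}\}_{m \in [M]}, \{h^{\mathrm{cls}}_{c}\}_{c \in [C]}, \theta_{\mathrm{tf}}\}, \tag{5}$$
where $\theta_{\mathrm{tf}}$ denotes the other general parameters in the Transformer. By passing the tokens through the Transformer encoder repeatedly, the model can encode the relationship among image patches, clinical factors, and class tokens. Here, each element of the self-attention map corresponds to the relationship between 2 tokens and represents how strongly one token is attended to when the other token is given together as input data.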
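For readers less familiar with self-attention maps, the following simplified sketch computes one such map with scaled dot-product attention. It is not the encoder used in this work: it assumes a single head and uses the tokens directly as queries and keys, without learned projections.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_map(tokens: List[Vector]) -> List[List[float]]:
    """Row i holds the attention weights from token i to every token (each row sums to 1)."""
    scale = math.sqrt(len(tokens[0]))
    return [
        softmax([sum(a * b for a, b in zip(query, key)) / scale for key in tokens])
        for query in tokens
    ]
```

With L + M + C tokens as input, the map has (L + M + C) rows and columns, so it contains entries relating image patches to clinical factors as well as patches to other patches.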
Multimodal aggregator
The multimodal aggregator aggregates the Transformer-encoded tokens for obtaining exploratory/explanatory attentions and subtype classification results. Let
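$$q^{\mathrm{cls}}_{c} = W_q z^{\mathrm{cls}}_{c}, \quad q^{t}_{m} = W_q z^{t}_{m}, \tag{6}$$
$$k^{p}_{\ell} = W_k z^{p}_{\ell}, \quad k^{t}_{m} = W_k z^{t}_{m}, \tag{7}$$
$$v^{p}_{\ell} = W_v z^{p}_{\ell}, \tag{8}$$
where $z^{p}_{\ell}$, $z^{t}_{m}$, and $z^{\mathrm{cls}}_{c}$ denote the Transformer-encoded tokens of the $\ell$th image patch, the $m$th clinical factor, and the $c$th class, respectively; $\{q^{\mathrm{cls}}_{c}\}_{c \in [C]}$ and $\{q^{t}_{m}\}_{m \in [M]}$ are called queries for class tokens and clinical factor tokens, respectively; $\{k^{p}_{\ell}\}_{\ell \in [L]}$ and $\{k^{t}_{m}\}_{m \in [M]}$ are called keys for image patch tokens and clinical factor tokens, respectively; and $\{v^{p}_{\ell}\}_{\ell \in [L]}$ are called values for image patch tokens in the context of the Transformer. The matrices $W_q, W_v, W_k \in \mathbb{R}^{R \times R}$ are trainable parameters. We denote these 3 matrices collectively as $\theta_{\mathrm{agg}} = \{W_q, W_v, W_k\}$.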
The inner product between a query and a key represents the relevance between the corresponding multimodal information. First, the relevance between each image patch $x_\ell$ and each class $c$ is written as
| (9) |
where $\sigma(\cdot)$ is the sigmoid function. Next, the relevance between each image patch $x_\ell$ and the set of $M$ clinical factors $\{t_m\}_{m \in [M]}$ is written as
| (10) |
Furthermore, the relevance between each class and the set of $M$ clinical factors $\{t_m\}_{m \in [M]}$ is written as
| (11) |
The 3 types of relevance information in (9)–(11) are used to obtain exploratory/explanatory attentions and subtype classification results.
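The first two relevance computations can be sketched as below. Only the use of a query-key inner product and the sigmoid function is taken from the text; the absence of a scaling factor and, in particular, the averaging over the M clinical factor queries in `clinical_relevance` are assumptions made for illustration, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def classwise_relevance(class_queries: List[Vector],
                        patch_keys: List[Vector]) -> List[List[float]]:
    """Entry [l][c]: sigmoid of the inner product between patch key l and class query c."""
    return [[sigmoid(dot(q, k)) for q in class_queries] for k in patch_keys]


def clinical_relevance(clinical_queries: List[Vector],
                       patch_keys: List[Vector]) -> List[float]:
    """Entry [l]: sigmoid of the mean inner product between patch key l and the
    M clinical factor queries (the averaging is an assumption)."""
    m = len(clinical_queries)
    return [sigmoid(sum(dot(q, k) for q in clinical_queries) / m) for k in patch_keys]
```

Because each value passes through a sigmoid rather than a softmax over patches, several patches can receive relevance close to 1 at the same time.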
Exploratory/explanatory attentions and subtype classifications
Based on the relevance information in (9)–(11), we obtain 3 types of attentions: (i) class-wise attentions, (ii) exploratory attentions, and (iii) explanatory attentions. Fig. 3 illustrates the 3 types of attentions. We call the relevance values in (9), indexed by $(\ell, c) \in [L] \times [C]$, class-wise attentions because they are obtained as the relevance between the $\ell$th image patch and the $c$th class token without clinical factors. We regard $\{\psi_\ell\}_{\ell \in [L]}$ as exploratory attentions because they are obtained as the relevance between the $\ell$th image patch and the set of clinical factors $\{t_m\}_{m \in [M]}$. Note that the exploratory attentions are class-independent; thus, they can be considered as the attention regions in the WSI in the early exploratory phase of the diagnosis. The explanatory attentions are obtained by combining the class-wise attentions and the exploratory attentions as follows:
| (12) |
Fig. 3.
The 3 types of attentions considered in the proposed method. The class-wise attentions are obtained based on the relevance between image patches and class tokens. For this example, the model focuses on the almost entire WSI to identify the case as class 2 when only the WSI is used for the class prediction. The exploratory attentions are obtained based on the relevance between image patches and clinical factors and given to the regions focused on regardless of which class the WSI belongs to. The explanatory attentions are obtained by filtering the class-wise attentions with the exploratory attentions and are provided as a reason for the final determination by considering the relationship among multimodal information.
It can be interpreted that the explanatory attentions are obtained by filtering the class-wise attentions with the exploratory attentions.
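One simple way to realize this filtering is an element-wise product, sketched below. The product is an assumption chosen for illustration and is not confirmed by the text; the function name is hypothetical.

```python
from typing import List


def explanatory_attention(classwise: List[List[float]],
                          exploratory: List[float]) -> List[List[float]]:
    """Entry [l][c]: class-wise attention of patch l for class c,
    scaled by the exploratory attention of patch l (assumed product form)."""
    if len(classwise) != len(exploratory):
        raise ValueError("need one exploratory attention per image patch")
    return [[a * psi for a in row] for row, psi in zip(classwise, exploratory)]
```

Under this form, a patch keeps a high explanatory attention for a class only if it is relevant both to that class and to the clinical record.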
The subtype classification results are obtained based on a linear combination of the aggregate feature vector
| (13) |
where . Then, the class-wise probabilities are obtained by using a neural network (NN) with the softmax operator as follows:
| (14) |
where the output is a $C$-dimensional vector whose $c$th element represents the probability that the subtype of the patient is class $c$, and $\theta_{\mathrm{clf}}$ is the set of trainable parameters.
When the network is trained, the loss function consists of 2 loss components. The first loss component is simply the cross-entropy loss between the true one-hot class vector and the predicted class probability vector. The second loss component takes into account the specific property of the MIL setting, where the bag (WSI) is positive if any of the instances (image patches) is positive. To formulate this specific property, we consider
| (15) |
where $\pi_c$ is close to 1 if there exists at least 1 image patch with an attention value close to 1. The second loss component is defined as the binary cross-entropy between the $c$th element of the true one-hot class vector and $\pi_c$. This loss function is inspired by the probability aggregation approach studied in Zhe et al.36
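A common aggregation with the property just described is the noisy-OR form, one minus the product of (1 − attention) over all patches. The sketch below uses that form as an assumption, since the exact expression is not recoverable here. The second function linearly rescales each factor into [0.95, 1.0], which is one possible reading of the normalization mentioned in the implementation details; both function names are hypothetical.

```python
from typing import List


def noisy_or(attentions: List[float]) -> float:
    """1 - prod(1 - a): close to 1 as soon as any attention value is close to 1."""
    product = 1.0
    for a in attentions:
        product *= 1.0 - a
    return 1.0 - product


def noisy_or_rescaled(attentions: List[float],
                      low: float = 0.95, high: float = 1.0) -> float:
    """Same aggregation, with each factor (1 - a) linearly mapped into [low, high]
    so that the product over many patches does not collapse to zero."""
    product = 1.0
    for a in attentions:
        product *= low + (high - low) * (1.0 - a)
    return 1.0 - product
```

With 100 patches per bag, the unscaled product can become vanishingly small even for moderate attention values, which is why some form of rescaling is useful in practice.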
The proposed network contains trainable parameters $\theta_f$, $\theta_g$, $\theta_{\mathrm{enc}}$, $\theta_{\mathrm{agg}}$, and $\theta_{\mathrm{clf}}$. All the parameters in the model are simultaneously trained by minimizing the following loss function:
| (16) |
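The two loss components described above can be combined as in the following sketch. It assumes an unweighted sum of the cross-entropy term and the class-wise binary cross-entropy terms; any weighting between the two components is not specified here, and the function names and the clipping constant `EPS` are hypothetical.

```python
import math
from typing import List

EPS = 1e-12  # clipping constant to avoid log(0)


def cross_entropy(y_true: List[float], y_prob: List[float]) -> float:
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(y * math.log(max(p, EPS)) for y, p in zip(y_true, y_prob))


def binary_cross_entropy(target: float, prob: float) -> float:
    """Binary cross-entropy for a single class indicator."""
    prob = min(max(prob, EPS), 1.0 - EPS)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def total_loss(y_true: List[float], y_prob: List[float], pi: List[float]) -> float:
    """Cross-entropy on the class probabilities plus the sum over classes of the
    binary cross-entropy between y_c and pi_c (unweighted sum assumed)."""
    ce = cross_entropy(y_true, y_prob)
    bce = sum(binary_cross_entropy(y, p) for y, p in zip(y_true, pi))
    return ce + bce
```

The second term encourages at least one patch to receive high attention for the true class and discourages high attention for the other classes.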
The operations performed in the network structure are illustrated in Fig. 4.
Fig. 4.
Operations performed in the proposed network structure. Given a set of image patches and a set of clinical factors, the network predicts the subtype based on exploratory attentions and explanatory attentions. Class-wise and explanatory attentions are computed for each class, whereas exploratory attention is calculated for a case since it is a class-independent attention based solely on a WSI and a clinical record. Both exploratory and explanatory attentions can vary depending on clinical records even when an input WSI is the same. This mimics the actual pathological diagnosis by human pathologists.
Experimental evaluation
In the experiments, we first compared the proposed PersAM MIL with several baseline methods to confirm the improvement of classification performance. Then, the effectiveness of exploratory and explanatory attentions in the proposed method was evaluated.
Experimental setting
Dataset. Our database of malignant lymphoma was composed of N = 842 clinical cases with three subtypes: 277 DLBCL, 270 FL, and 295 reactive lymphoid hyperplasia (Reactive). Fig. 5 shows sample image patches for typical DLBCL, FL, and Reactive cases. DLBCL has large tumor cells over a wide region in the tissue specimen, and FL has follicular structures that contain tumor cells. In contrast, Reactive is classified as non-lymphoma, which has diverse cell structures but no tumor cells. All cases were clinically diagnosed by expert hematopathologists, and a WSI of a hematoxylin-and-eosin (H&E)-stained tissue specimen and a clinical record were given for each case. A gigapixel digitized WSI of the entire H&E-stained tissue slide was used as the input image. All the glass slides were digitized using a WSI scanner (Aperio GT 450; Leica Biosystems, Germany) at 40× magnification (0.26 μm/pixel), where the maximum image size was approximately 100000×100000 pixels. The OpenSlide37 software was used for handling WSIs and extracting image patches from them. An original clinical record includes the definitive subtype and the clinical factors. Note that patch-level annotations are not available, and the class label is given only to a WSI, not to image patches. The clinical factors consist of 28 elements that are grouped into M = 2 clinical factors: an 18-dimensional vector with patient basic information and interview results and a 10-dimensional vector with blood test results. The items included in each clinical factor are listed in Tables 1 and 2.
Fig. 5.
Samples of typical image patches for lymphoma cases. Each subtype has distinct histological features: DLBCL has large tumor cells over a wide region of the tissue specimen, FL has follicular structures that contain tumor cells, and Reactive has diverse cell structures but no tumor cells.
Table 1.
The detailed items in clinical factor t1 consisting of patient basic information and interview results. Items other than age are represented as binary labels, e.g., “fever” is 1 if the patient reported a high fever in the interview.
| Item | Value type |
|---|---|
| Age | Non-negative integer |
| Gender, organ (lymph node, tonsil, others), fever, weight loss, hepatomegaly, splenomegaly, swelling (none, whole body, neck, armpit, deep abdominal cavity, mouse diameter, septum, others) | Binary |
Table 2.
The detailed items in clinical factor t2 consisting of blood test results. All items are represented as continuous values indicating amount or percentile.
| Item | Value type |
|---|---|
| RBC, WBC, plt, LDH | Amount |
| Stab, seg, eosino, baso, mono, lympho | Percentile |
Implementation details. In the experiment, 224 × 224-pixel image patches were randomly extracted from an entire WSI, and 100 image patches were used as a bag to limit the computation and memory requirements. Each bag was assigned the label of the WSI from which it was generated, e.g., a bag generated from image patches of a DLBCL WSI was labeled as DLBCL. A maximum of 30 bags were generated from a single WSI in our experiment. The length of the feature vector was set to R = 512, inspired by TransMIL.32 To obtain the feature vectors of image patches hℓp, f employed a ResNet5038 pre-trained on ImageNet and a two-layer NN that had 1024 hidden units, 512 output units, and ReLU as its activation function, where the 2048-dimensional vector after the global average pooling layer of ResNet50 was converted into a 512-dimensional feature vector. Clinical factors were mapped to hmt by a two-layer NN that had 256 hidden units, 512 output units, and ReLU as its activation function, where both the 18-dimensional and 10-dimensional clinical factors were converted into 512-dimensional feature vectors. Class tokens {hccls}c∈[C] were designed as three 512-dimensional vectors, and then we could obtain the input data for the Transformer architecture. In (15), to prevent numerical underflow, 1 − was normalized to [0.95,1.0]. The dataset was divided into training, validation, and testing data in the ratio of 3:1:1, and the models were evaluated via 5-fold cross-validation, where the model with the smallest validation loss after the third epoch was used for testing.
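The bag construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the released implementation: the helper name `make_bags` is hypothetical, patch indices stand in for real 224 × 224 patches, and drawing disjoint patches for different bags of the same WSI is an assumption that the text does not confirm.

```python
import random


def make_bags(num_patches, label, bag_size=100, max_bags=30, seed=0):
    """Group randomly ordered patch indices into labeled bags.

    num_patches: number of candidate patches extracted from one WSI.
    label:       WSI-level subtype label, inherited by every bag.
    Returns a list of (patch_indices, label) pairs.
    """
    rng = random.Random(seed)
    # Only full bags are kept, and at most `max_bags` per WSI.
    n_bags = min(max_bags, num_patches // bag_size)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # random patch order (assumption: no patch reuse)
    bags = []
    for b in range(n_bags):
        patch_ids = indices[b * bag_size:(b + 1) * bag_size]
        bags.append((patch_ids, label))
    return bags


# Hypothetical WSI with 5000 candidate patches, diagnosed as DLBCL.
bags = make_bags(num_patches=5000, label="DLBCL")
print(len(bags), len(bags[0][0]), bags[0][1])
```

Because only the WSI-level label is available, every bag simply inherits it; no patch in the sketch carries a label of its own.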
For the Transformer, the numbers of layers and heads were set to 2 and 8, respectively, and the dropout rate was set to 0.1. The classifier gclf was an NN that had a hidden layer with 256 units and an output layer with 3 units to compute the class probability from a 512-dimensional aggregated feature vector z.
For stability in the optimization, label smoothing was applied as a regularization technique in calculating the loss function , where the label for a correct class was set to and the labels for incorrect classes were set to . As the optimization method, momentum SGD (Nesterov, weight decay = 10⁻⁴) was employed, and the model was trained for nine epochs. The learning rates were set to 10⁻⁴ for the parameter θf, 2 × 10⁻⁴ for the parameters θenc, θagg, and θclf, and 4 × 10⁻⁶ for the parameter θg, where the learning rate was multiplied by 0.1 every 3 epochs. Random horizontal flips and random rotations (0°, 90°, 180°, 270°) were applied to the input image patches as data augmentation. All the parameters of the model were optimized simultaneously in the above setting. It took about 20 h to perform 5-fold cross-validation on a computer with 8 Quadro RTX 5000 GPUs (NVIDIA, U.S.). Our source code is available from https://github.com/PersAM-MIL/PersAM.
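The step decay of the learning rate can be illustrated with a short sketch. It assumes 0-indexed epochs and that the decay applies to every parameter group (the text could also be read as applying it only to θg); the helper `step_lr` and the dictionary keys are hypothetical names.

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate at a 0-indexed epoch when the rate is multiplied
    by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Base learning rates quoted in the text, keyed by hypothetical group names.
BASE_LRS = {"theta_f": 1e-4, "theta_enc_agg_clf": 2e-4, "theta_g": 4e-6}

for epoch in range(9):  # nine training epochs
    rates = {name: step_lr(lr, epoch) for name, lr in BASE_LRS.items()}
    print(epoch, rates)
```

In PyTorch, the same schedule corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=3` and `gamma=0.1`; whether the released code uses that scheduler is not stated here.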
Subtype classification
We performed the 3-class classification experiment using the dataset outlined above.
Baseline methods. The proposed PersAM model was compared with the following baseline models:
1. MLP using clinical factors as input (clinical MLP)
Clinical MLP employs a three-layer NN that has hidden layers with 256 and 512 units and takes the 28-dimensional vector of clinical factors as input. Clinical MLP was trained for 500 epochs with the learning rate set to 10⁻³ without scheduling.
2. Attention-based MIL using images1 (img MIL)
Img MIL employs an attention-based MIL that aggregates 2048-dimensional feature vectors in a bag and predicts the class label from an aggregated feature vector using the classifier gclf with the hidden layer having 1024 units.
3. Attention-based MIL using images and clinical factors (img-clinical MIL)
In img-clinical MIL, in addition to the 512-dimensional aggregated feature vector computed from the image patches by the attention-based MIL, a 512-dimensional feature vector is computed from the 28-dimensional clinical factors. The two vectors are concatenated, and the classifier gclf predicts the class from the resulting 1024-dimensional feature vector through a hidden layer with 512 units.
4. Transformer-based MIL using images (img Transformer)
In img Transformer, a single class token is concatenated to the feature vectors of the L image patches. The classifier gclf predicts the class from the encoded class token, which is a common technique in Transformer-based classification models.
5. Transformer-based MIL using images and clinical factors (img-clinical Transformer)
Similar to img Transformer, img-clinical Transformer uses a single class token, which is concatenated to the feature vectors of the L image patches and the M clinical factors. Img-clinical Transformer predicts the class using the encoded class token as the input to the classifier gclf.
The optimization method and learning rates were the same as above, except for clinical MLP.
Results. The classification results are shown in Table 3, where each row shows the mean accuracy and standard error in 3-class classification by 5-fold cross-validation. The proposed method achieved the highest accuracy among all the methods compared. In particular, whereas the baselines that use both images and clinical factors gained little or nothing over their image-only counterparts (0.8230 vs. 0.8195 for MIL and 0.8147 vs. 0.8219 for the Transformer), the proposed method classified the subtype more accurately by properly aggregating image and clinical features through the multimodal aggregator.
Table 3.
Comparison of mean accuracy and standard error in 3-class classification by 5-fold cross-validation. The proposed method achieved the highest classification accuracy.
| Method | Accuracy |
|---|---|
| Clinical MLP | 0.5795 ± 0.0071 |
| Img MIL | 0.8195 ± 0.0090 |
| Img-clinical MIL | 0.8230 ± 0.0085 |
| Img Transformer | 0.8219 ± 0.0140 |
| Img-clinical Transformer | 0.8147 ± 0.0141 |
| Proposed | 0.8313 ± 0.0149 |
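For reference, a mean ± standard error entry like those in Table 3 can be computed from per-fold accuracies as sketched below. The fold accuracies are hypothetical placeholders, not the values behind Table 3, and the standard error is assumed to be the sample standard deviation (n − 1 denominator) divided by √n, which the text does not specify.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error) of a list of per-fold scores."""
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, standard_error


# Hypothetical accuracies from 5 cross-validation folds.
fold_accuracies = [0.80, 0.82, 0.84, 0.86, 0.83]
mean, se = mean_and_standard_error(fold_accuracies)
print(f"{mean:.4f} ± {se:.4f}")
```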
Attention visualization
Class-wise, exploratory, and explanatory attentions. We also performed visualization experiments to demonstrate that the proposed personalized attentions change adaptively according to the input clinical records. In the visualization results, attention weights ranging from 0 to 1 were mapped to colors ranging from blue to red. Figs. 6 and 7 show the visualization results of class-wise attentions, exploratory attentions, and explanatory attentions, where the images on the right are thumbnails of the original WSIs. In the matrices on the left, the columns show the class-wise attentions {aℓ, c}(ℓ,c)∈[L]×[C], the rows show exploratory attentions {ψℓ}ℓ∈[L], and each element shows explanatory attentions {}(ℓ,c)∈[L]×[C], where clinical records sampled from other cases of 3 different subtypes were input with the WSI of a patient instead of the original clinical factors of that patient. These fake clinical records were used to confirm that exploratory and explanatory attentions change when different clinical records are input with the same WSI.
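The mapping from attention weights to colors can be sketched as below. The colormap used for the figures is not named in the text, so a plain linear blue-to-red ramp is assumed, and `attention_to_rgb` is a hypothetical helper.

```python
def attention_to_rgb(weight):
    """Map an attention weight in [0, 1] to an 8-bit (R, G, B) tuple.

    0 maps to pure blue and 1 to pure red; values outside [0, 1]
    are clamped before conversion.
    """
    w = min(max(weight, 0.0), 1.0)
    return (round(255 * w), 0, round(255 * (1.0 - w)))


# Hypothetical attention weights for four patches of one bag.
for a in [0.0, 0.2, 0.8, 1.0]:
    print(a, attention_to_rgb(a))
```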
Fig. 6.
Three types of attentions for an FL case. The image on the right is a thumbnail of the original WSI. Red regions in each attention map represent attention regions according to the color bar. The left column, top row, and each element of the matrix show the class-wise attentions, exploratory attentions, and explanatory attentions, respectively. Note that the first (DLBCL) and third (Reactive) rows are the results when fake clinical records are provided to confirm the change of attentions. The explanatory attention for FL is enhanced when the clinical record of the FL case is input (2B), and the explanatory attention for Reactive is enhanced when a clinical record of a Reactive case is input (3C), presumably because this case is difficult to diagnose from the image alone.
Fig. 7.
Three types of attentions for a Reactive case (see the caption in Fig. 6). Note that the first (DLBCL) and the second (FL) rows are the results when fake clinical records are provided for confirming the change of attentions. Class-wise attentions for FL focus on the follicular regions (B) and those for Reactive focus on the outside follicular regions (C). With a Reactive clinical factor, explanatory attention for FL no longer has high values in the follicular region (3B) to strongly predict the subtype as Reactive (3C).
Fig. 6 shows the results for an FL case. The follicular structure, a subtype-specific region for FL, is known to be important in the diagnosis of FL cases. This case has large follicular structures in the tissue, and we can confirm that the exploratory attention {ψℓ}ℓ∈[L], i.e., the follicular region on which focus is placed, changes depending on the clinical records. This case is expected to be difficult to diagnose from images alone, and the explanatory attentions change according to the input clinical records.
Fig. 7 shows the result for a Reactive case. Some Reactive cases are known to have a similar appearance to FL cases. This case also has small follicular structures in the tissue, which is similar to FL, and exploratory attentions {ψℓ}ℓ∈[L] are enhanced in those regions. Class-wise attentions for FL focus on the follicular regions and those for Reactive focus on the outside follicular regions. These results are discussed in detail later with magnified image patches.
Changes in attention were observed for only part of the dataset; not all cases changed their attentions depending on the input clinical records. From the above results, we expect that whether the attentions change depends on whether the input WSI is pathologically typical: a case whose disease can be clearly determined from the WSI alone does not change its attentions when a different clinical record is input, whereas a case whose WSI has ambiguous features for identifying the subtype may change its attentions depending on the input clinical record. To qualitatively confirm this, an expert hematopathologist (one of the authors, an institution member with over 15 years of experience diagnosing more than 10000 cases of lymphoma) investigated whether each case was typical, both among cases whose attentions changed and among those whose attentions did not. We targeted FL cases so that the observations could be interpreted easily. The pathologist observed many WSIs of FL cases whose attentions did or did not change when different (fake) clinical records were input, and evaluated how typical each was of FL while blinded to whether its attentions had changed. A case that could be determined as FL from the WSI alone was evaluated as a typical FL case, and a case that could not be determined as FL from the WSI alone and required immunohistochemical (IHC) stains was evaluated as an atypical case. The results are discussed in Pathological viewpoint on attention.
Clinical-record-to-patch attentions. We refer to the self-attentions in the bag representation that indicate attentions from each clinical factor to image patches as “clinical-record-to-patch attentions”. As an additional experiment, we visualized how clinical-record-to-patch attentions changed according to the input clinical factors. Fig. 8 shows the visualization result of clinical-record-to-patch attentions for an FL case, where the clinical factors of the original (real) case are replaced with those of the representative (fake) cases in the clustering results. As in Figs. 6 and 7, fake cases were used to confirm that clinical-record-to-patch attentions change when different clinical records are input with the same WSI.
Fig. 8.
The visualization result of clinical-record-to-patch attentions for an FL case, where the clinical factor of the original case is replaced with those of the representative cases of k-medoids clustering result using 2-dimensional t-SNE embedded features. The plots on the left represent the embedded clinical factors, where ⋆ are the representative cases in each cluster. The images in the middle are the visualization results of clinical-record-to-patch attentions corresponding to blood test for each representative case. The images on the right are the visualization results of clinical-record-to-patch attentions corresponding to interview for each representative case. We can confirm that the proposed PersAM could adaptively change clinical-record-to-patch attentions depending on the input clinical records.
In the clustering, the k-medoids method was applied to 2-dimensional t-SNE embedded features calculated from the 28-dimensional clinical factors. We determined the number of clusters by visually inspecting the t-SNE embedded features and set it to k = 7. Instead of the original clinical factors, the clinical factors of the representative case of each cluster were input into the PersAM model together with the WSI of the original case. The plots on the left represent the embedded clinical factors, in which ⋆ marks the representative case of each cluster. The images in the middle and on the right are the visualization results of clinical-record-to-patch attentions corresponding to the blood test and the interview, respectively. Attention weights of image patches are normalized in each case for visualization.
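The selection of representative cases can be sketched with a small k-medoids routine applied to 2-dimensional points. This is an illustrative sketch under several assumptions: the text does not say which k-medoids variant was used, so the simple alternating (assign, then update) heuristic is shown; the points are synthetic stand-ins for the t-SNE embedding; and the function names are hypothetical.

```python
import math
import random


def assign(dist, medoids):
    """Label each point with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: row[medoids[c]])
            for row in dist]


def k_medoids(points, k, max_iter=100, seed=0):
    """Alternating k-medoids on 2-D points.

    Returns (medoid_indices, labels); each medoid is an actual data
    point, so it can serve as the representative case of its cluster.
    """
    rng = random.Random(seed)
    dist = [[math.dist(p, q) for q in points] for p in points]
    medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes the total distance to its cluster.
            best = min(members, key=lambda i: sum(dist[i][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Synthetic 2-D embedding: 7 blobs of 15 points each (placeholder data).
rng = random.Random(42)
centers = [(0, 0), (20, 0), (40, 0), (0, 20), (20, 20), (40, 20), (20, 40)]
embedded = [(cx + rng.gauss(0, 2), cy + rng.gauss(0, 2))
            for cx, cy in centers for _ in range(15)]
medoids, labels = k_medoids(embedded, k=7)
print("representative case indices:", medoids)
```

Unlike k-means centroids, medoids are real cases, which is what allows the clinical factors of a representative case to be fed to the model in place of the original ones.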
We can confirm that the proposed PersAM adaptively changes clinical-record-to-patch attentions depending on the clinical records that are input with an original WSI. The visualization results for representative cases 1, 2, 4, and 7 are similar to each other because their embedded clinical factors are located close to each other, whereas the result for case 3 focuses on the follicular structures and the result for case 5 focuses on the outside follicular structures. We thus confirmed that clinical-record-to-patch attentions also change depending on clinical factors, as a result of effectively encoding the relationship between image patches and clinical factors.
Pathological viewpoint on attention
Here, we discuss the detailed results of the attention visualization experiment together with the expert hematopathologist's comments. For Fig. 6, the hematopathologist commented that the change in the explanatory attentions in this case is reasonable because pathologists need to focus more on the follicular regions to identify FL cases (Fig. 9(a)) and on the outside follicular regions to identify Reactive cases (Fig. 9(b)). In general, as mentioned above, the follicular region is important for identifying FL cases, but the case in Fig. 6 has many outside follicular regions compared to typical FL cases. When a large part of the WSI consists of such outside follicular regions, the classification model is expected to focus on them when the clinical factors of a Reactive case are input with the WSI.
Fig. 9.
The magnified image patches of attention regions in Figs. 6 and 7. (a) and (b) are high-resolution image patches in Fig. 6(2B) and Fig. 6(3C). (c) and (d) are high-resolution image patches in Fig. 7(B) and Fig. 7(3C). The attention regions that changed depending on the clinical records show image patches typical of the corresponding subtype.
For Fig. 7, the hematopathologist commented that the change in the explanatory attentions in this case is also reasonable because the model focuses less on the follicular structure to identify Reactive cases. This case has follicular regions as shown in Fig. 9(c), and the histological features of the entire tissue specimen were not those of a typical Reactive case. In such cases, the pathologist cannot confidently identify a Reactive case from the WSI alone, even if the outside follicular regions have typical features of Reactive cases (Fig. 9(d)).
Furthermore, we discuss the results of the investigation of typicalness with the magnified images in Figs. 10 and 11. Fig. 10 shows cases evaluated as typical FL cases by the pathologist among the cases whose attentions did not change regardless of the input clinical records. All these cases were evaluated as typical FL cases because follicular regions exist throughout the tissue specimens, which enables pathologists to identify them as FL cases from the WSIs alone. Fig. 11 shows cases evaluated as atypical FL cases by the pathologist among the cases whose attentions changed depending on the input clinical records. Most of these cases have fewer follicular regions at low magnification, and the pathologist cannot determine them to be FL cases owing to the lack of definitive FL features. In Fig. 11(d), there are many nodes in the tissue specimen, so the pathologist suspects other diseases and requires IHC stains. The proposed PersAM method can thus provide a reasonable explanation that resembles pathologists' decision-making: the subtype of typical cases can be identified from tissue specimens alone regardless of clinical records, whereas the attention regions of atypical cases are affected by clinical records.
Fig. 10.
Low- and high-resolution image patches of the cases evaluated as typical FL cases. The attention of all the cases did not change when the different clinical factors were input with the original WSIs in the attention visualization. All WSIs have typical FL features and the pathologist can identify them as FL cases only from WSIs.
Fig. 11.
Low- and high-resolution image patches of the cases evaluated as atypical FL cases. The attentions of all these cases changed when different clinical factors were input with the original WSIs in the attention visualization. None of the WSIs has definitive features that identify it as FL.
Conclusion
In this study, to develop an AI system that mimics the diagnosis process of human pathologists, we proposed the PersAM method, which adaptively changes the attention regions according to patient clinical records. Our proposed method provided 3 types of attention regions, which were calculated considering the relationship among multimodal information. The results of experiments conducted with 842 malignant lymphoma cases verify the effectiveness of the PersAM method.
Funding
This work was partially supported by MEXT KAKENHI (20H00601), JST CREST (JPMJCR21D3), JST Moonshot R&D (JPMJMS2033-05), JST AIP Acceleration Research (JPMJCR21U2), NEDO (JPNP18002, JPNP20006) and RIKEN Center for Advanced Intelligence Project.
Ethical approval
Approval was obtained from the ethics committee of Kurume University, Nagoya Institute of Technology, and RIKEN.
Conflict of interest
The authors declare that they have no conflict of interest.
References
- 1. Maximilian I., Jakub T., Max W. International Conference on Machine Learning. PMLR; 2018. Attention-based deep multiple instance learning; pp. 2127–2136.
- 2. Noriaki H., Daisuke F., Ryoichi K., et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. Multi-scale domain-adversarial multiple-instance CNN for cancer subtype classification with unannotated histopathological images; pp. 3852–3861.
- 3. Bin L., Yin L., Eliceiri Kevin W. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning; pp. 14318–14328.
- 4. Adam Y., Constance L., Tal S., Tally P., Regina B. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology. 2019;292:60–66. doi: 10.1148/radiol.2019182716.
- 5. Jordan Y., William Y., Philipp T. Multimodal skin lesion classification using deep learning. Exp Dermatol. 2018;27:1261–1267. doi: 10.1111/exd.13777.
- 6. Kim-Han T., Pew-Thian Y., Shen D. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer; 2017. Multi-stage diagnosis of Alzheimer’s disease with incomplete multimodal data via multi-task deep learning; pp. 160–168.
- 7. Nie Dong L., Junfeng Z.H., et al. Multi-channel 3D deep feature learning for survival time prediction of brain tumor patients using multi-modal neuroimages. Scient Rep. 2019;9:1–14. doi: 10.1038/s41598-018-37387-9.
- 8. Mihir S., Pierre S., Zacharaki Evangelia I., et al. Deep multi-instance learning using multi-modal data for diagnosis of lymphocytosis. IEEE J Biomed Health Inform. 2020;25:2125–2136. doi: 10.1109/JBHI.2020.3038889.
- 9. Hang L., Yang F., Xiaohan X., et al. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2021. Multi-modal multi-instance learning using weakly correlated histopathological images and tabular clinical information; pp. 529–539.
- 10. Ashish V., Noam S., Niki P., et al. Advances in Neural Information Processing Systems. 2017. Attention is all you need; pp. 5998–6008.
- 11. Alexey D., Lucas B., Alexander K., et al. An image is worth 16x16 words: Transformers for image recognition at scale. Proceedings of International Conference on Learning Representation. 2021.
- 12. Zhou Y., Yuhao C., Jun Y., Dacheng T., Qi T. Multimodal unified attention networks for vision-and-language interactions. arXiv preprint. 2019 arXiv:1908.04107.
- 13. Yen-Chun C., Linjie L., Licheng Y., et al. European Conference on Computer Vision. Springer; 2020. UNITER: universal image-text representation learning; pp. 104–120.
- 14. Le H., Dimitris S., Kurc Tahsin M., Yi G., Davis James E., Saltz Joel H. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Patch-based convolutional neural network for whole slide tissue image classification; pp. 2424–2433.
- 15. Seyed M.H., Vishal M., Ganesh R., Rao Arvind U.K. Automated discrimination of lower and higher grade gliomas based on histopathological image analysis. J Pathol Inform. 2015:6. doi: 10.4103/2153-3539.153914.
- 16. Cireşan Dan C., Alessandro G., Gambardella Luca M., Jürgen S. International Conference on Medical Image Computing and Computer-assisted Intervention. Springer; 2013. Mitosis detection in breast cancer histology images with deep neural networks; pp. 411–418.
- 17. Angel C.-R., Ajay B., Fabio G., et al. Medical Imaging 2014: Digital Pathology; 9041:904103. International Society for Optics and Photonics; 2014. Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks.
- 18. Ehteshami B.B., Mitko V., Johannes V.D.P., et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318:2199–2210. doi: 10.1001/jama.2017.14585.
- 19. Yan X., Zhipeng J., Yuqing A., et al. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2015. Deep convolutional activation features for large scale brain tumor histopathology image classification and segmentation; pp. 947–951.
- 20. Peter N., Marick L., Fabien R., Thomas W. Segmentation of nuclei in histopathology images by deep regression of the distance map. IEEE Trans Med Imag. 2018;38:448–459. doi: 10.1109/TMI.2018.2865709.
- 21. Hiroki T., Yuki T., Akihiko Y., Ryoma B. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. Adaptive weighting multi-field-of-view CNN for semantic segmentation in pathology; pp. 12597–12606.
- 22. Kosuke T., Noriaki H., Yu I., Hidekata H., Ichiro T. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. Computing valid p-values for image segmentation by selective inference; pp. 9553–9562.
- 23. Xinliang Z., Jiawen Y., Feiyun Z., Junzhou H. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. WSISA: making survival prediction from whole slide histopathological images; pp. 7234–7242.
- 24. Ellery W., Steiner David F., Zhaoyang X., et al. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS One. 2020;15 doi: 10.1371/journal.pone.0233678.
- 25. Ziwang H., Hua C., Ruoqi W., Haitao W., Yuedong Y., Hejun W. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2021. Integration of patch features through self-supervised learning and transformer for survival analysis on whole slide images; pp. 561–570.
- 26. Narayan H., Hipp Jason D., Yun L., et al. Similar image search for histopathology: SMILY. NPJ Digit Med. 2019;2:1–9. doi: 10.1038/s41746-019-0131-z.
- 27. Shivam K., Tizhoosh Hamid R., Charles C., et al. Yottixel–an image search engine for large archives of histopathology whole slide images. Med Image Anal. 2020;65 doi: 10.1016/j.media.2020.101757.
- 28. Noriaki H., Yusuke T., Hiroki M., et al. Case-based similar image retrieval for weakly annotated large histopathological images of malignant lymphoma using deep metric learning. arXiv preprint. 2021 doi: 10.1016/j.media.2023.102752. arXiv:2107.03602.
- 29. Selvaraju Ramprasaath R., Michael C., Abhishek D., Ramakrishna V., Devi P., Dhruv B. Proceedings of the IEEE International Conference on Computer Vision. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization; pp. 618–626.
- 30. Sumedha S., Stephen W., Sofia T., Kayhan B. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2021. Using causal analysis for conceptual deep learning explanation; pp. 519–528.
- 31. Dawid R., Adriana B., Jacek T., Bartosz Z. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021. Kernel self-attention for weakly-supervised image classification using deep multiple instance learning; pp. 1721–1730.
- 32. Zhuchen S., Hao B., Yang C., et al. TransMIL: transformer based correlated multiple instance learning for whole slide image classification. Proceedings of Advances in Neural Information Processing Systems. 2021:2136–2147.
- 33. Mengkang L., Yongsheng P., Dong N., et al. MICCAI Workshop on Computational Pathology. PMLR; 2021. SMILE: sparse-attention based multiple instance contrastive learning for glioma sub-type classification using pathological images; pp. 159–169.
- 34. Zeyu G., Bangyang H., Xianli Z., et al. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2021. Instance-based vision transformer for subtyping of papillary renal cell carcinoma in histopathological image; pp. 299–308.
- 35. Chen Richard J., Lu Ming Y., Wei-Hung W., et al. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. Multimodal co-attention transformer for survival prediction in gigapixel whole slide images; pp. 4015–4025.
- 36. Zhe L., Chong W., Mei H., et al. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. Thoracic disease identification and localization with limited supervision; pp. 8290–8299.
- 37. Adam G., Benjamin G., Jan H., Drazen J., Mahadev S. OpenSlide: a vendor-neutral software foundation for digital pathology. J Pathol Inform. 2013:4. doi: 10.4103/2153-3539.119005.
- 38. Kaiming H., Xiangyu Z., Shaoqing R., Jian S. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Deep residual learning for image recognition; pp. 770–778.











