iScience. 2023 Sep 29;26(11):108041. doi: 10.1016/j.isci.2023.108041

A multi-center performance assessment for automated histopathological classification and grading of glioma using whole slide images

Lei Jin 1,2,9,10,, Tianyang Sun 3,9, Xi Liu 1,2,9, Zehong Cao 3, Yan Liu 1,2, Hong Chen 2,4, Yixin Ma 1,2, Jun Zhang 5, Yaping Zou 5, Yingchao Liu 6,∗∗, Feng Shi 3,∗∗∗, Dinggang Shen 3,7,8,∗∗∗∗, Jinsong Wu 1,2
PMCID: PMC10590813  PMID: 37876818

Summary

Accurate pathological classification and grading of gliomas are crucial for clinical diagnosis and treatment. Deep learning techniques hold promise for automated histopathological diagnosis. In this study, we collected 733 whole slide images from four medical centers, of which 456 were used for model training, 150 for internal validation, and 127 for multi-center testing. The study covers five common glioma types.

A subtask-guided multi-instance learning image-to-label training pipeline was employed. The pipeline leveraged “patch prompting” so that the model could converge at reasonable computational cost. Experiments showed an overall accuracy of 0.79 on the internal validation dataset and an overall accuracy of 0.73 on the multi-center testing dataset. The findings suggest a minor yet acceptable performance decrease on multi-center data, demonstrating the model’s strong generalizability and establishing a robust foundation for future clinical applications.

Subject areas: Cancer, Computational bioinformatics, Health informatics, Pathology

Graphical abstract


Highlights

  • Deep learning is applied to achieve accurate prediction on glioma WSIs

  • Histopathological slides are digitalized and classified according to CNS5 standard

  • Multi-center validation research is conducted to evaluate the diagnostic performance

  • The future study will consider multi-center prospective clinical trials



Introduction

Glioma is the most prevalent primary malignant brain tumor in adults. The classification of glioma subtypes is crucial in clinical settings.1,2,3 Traditionally, neuropathologists have relied on histologic diagnosis, visually examining histopathological slides under a microscope and taking hematoxylin and eosin (H&E) sections as the gold standard for classification. Histopathological diagnosis of glioma is a laborious process that includes manually examining images at both coarse and fine resolutions across large volumes of tissue samples. The pathologist also faces complex classification criteria,4 which call for detailed and exhaustive analysis based on experience.5,6 In recent years, slide scanners have been developed that digitize glass slides into images known as whole-slide images (WSIs).7 Advances in scanning technologies over the past 2–3 years have permitted large numbers of slides to be scanned, forging the way for computational pathology and, in particular, artificial intelligence-assisted diagnosis, which can improve pathologists’ efficiency and accuracy.8,9,10

Yonekura et al. previously reported a classical CNN model for the binary classification of high-grade glioma (HGG) versus low-grade glioma (LGG).11 Subsequently, Truong et al. developed a ternary glioma grading model based on ResNet18 using transfer learning, but this approach yielded lower accuracy.12 In our previous work, we developed a squeeze-and-excitation-based DenseNet model with weighted cross-entropy to classify the five major histological subtypes of glioma.13 The model was trained on a dataset of over 79,990 histopathology image patches from 267 patients and achieved a patient-level accuracy of 87.5% on a testing dataset derived from 56 additional patients. Building upon this model, and following the classification criteria recommended by the WHO CNS5 standard, we subsequently developed an automated diagnostic system named the histopathological auxiliary system for brain tumor (HAS-Bt) (see Video S1).14

Video S1. System demonstration of HAS-Bt

However, previous research utilized data from only a single center for training, validation, and testing. Single-center training and validation,15 where a deep learning diagnostic model is trained and tested on data from the same institution, has limitations that need to be addressed. First, limited generalizability: single-center training may result in a model that is overly optimized for the specific characteristics of the training dataset.16,17 Such a model may not generalize well to the different populations, imaging protocols, or equipment variations found at other institutions, leading to poor performance on new and diverse datasets. Second, bias and overfitting: single-center datasets can be inherently biased, as they often represent a specific patient population or imaging modality prevalent at that institution.18,19 The model may inadvertently learn and rely on these biases, leading to overfitting and poor performance on data with different characteristics; overemphasis on local patterns may compromise the model’s ability to capture the broader patterns relevant to the diagnostic task. Finally, limited diversity and sample size: single-center datasets typically have limited sample size and diversity,20 which can affect the robustness and reliability of the model’s performance evaluation. The lack of diversity may not adequately represent the full spectrum of clinical variations, limiting the model’s ability to handle unseen cases and increasing the risk of false positives or false negatives in actual applications.

Therefore, to address these limitations, we propose in this study a retrospective, multi-center testing method for an existing model (HAS-Bt) from our previous work13,14 based on the WHO CNS5 standard (Figure 1). The method involves digitalizing WSIs, training the model on data from different institutions, and evaluating its performance on independent multi-center datasets. This approach provides a more comprehensive assessment of the model’s generalizability, robustness, and real-world applicability, offering stronger evidence for its diagnostic capabilities.

Figure 1.


Schematics of HAS-Bt

A diagnostic system for glioma integrating software and hardware components. The system includes digitalization, patch-level prediction, and slide-level CNS tumor classification.

Results

The clinical applicability of HAS-Bt14 arises from its subtask-guided image-to-label multiple instance learning (MIL) classification model. In this study, we collected data from four centers, referred to as institutes A, B, C, and D (see STAR Methods for details). The performance of the model is reported as class-wise and overall classification accuracy, and the generalizability of the model is evaluated on the multi-center testing dataset.

Subtask-guided image-to-label MIL classification model

The two-stage MIL21,22 training pipeline demonstrated the feasibility of classifying digitalized histopathology of brain tumors. The general structure of the MIL was introduced in our previous study.14 Detailed contents and algorithms of the two stages can be found in STAR Methods.

In this work, we further extend the method to an end-to-end training process such that the model performs online feature extraction during training. “Patch prompting” was designed such that the model would converge at reasonable computational cost (Figure 2).

Figure 2.


Result overview

(A) Image-to-label multi-instance learning framework. For online training, input bags of instances are patches proposed by the patch-prompting logic. Each instance is forwarded to the two feature-extraction networks trained in our previous work, and feature bags are used as input to the MIL classifier.

(B) Data distribution of collected glioma subtypes.

(C) Illustration of histopathological images under 20× magnification by two scanners of different models.

(D) Confusion matrix of internal validation (left) and multi-center testing (right). The rows are ground truth, and the columns are predictions. Diagonal entries count the number of correct predictions.

(E) Accuracy (mean and SD) and subtype predictions during test-time augmentation for the validation and testing sets. Each grid cell counts the frequency of occurrence across the 9 test-time runs.

We inherited the model structure of the patch-level feature extractor and slide-level multi-instance learning setup14 while modifying the MIL training strategy into an image-to-label process instead of the image-to-feature, then feature-to-label framework. We designed a patch proposal method based on the pre-trained patch-level classification for the two subtasks. An illustration of the model pipeline is shown in Figure 2A.

“Patch prompting” made it feasible to perform end-to-end training with reasonable computational power and time efficiency. For each epoch, patches from the slide were selected based on the output of the trained patch-level classification networks. This design regularized the slide-level classification in that, for each training observation, only a small proportion of patch samples with representative features was used as input. This method provided a variety of patch combinations for each slide. Moreover, the input dimensionality of the pipeline decreased significantly since we did not take full bags of images as input, leading to better generalizability.

For each training observation, 1% of the tiled image patches classified as tumor (excluding background, normal brain tissue, bleeding, empyrosis, etc.) were selected from the WSI. An illustration of patch prompting is shown in Figure 3.
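The per-epoch selection described above can be sketched as follows. This is a minimal illustration with hypothetical names and toy labels (our own sketch, not the HAS-Bt code); the real pipeline selects from the trained patch-level networks’ outputs:

```python
import random

def prompt_patches(patch_labels, patch_ids, fraction=0.01,
                   tumor_classes=("O", "A"), seed=None):
    """Select a random fraction of patches whose patch-level prediction is a
    tumor class; background, normal tissue, bleeding, etc. are skipped.
    Drawing a fresh random subset each epoch yields varied bags per slide."""
    rng = random.Random(seed)
    tumor = [pid for pid, lab in zip(patch_ids, patch_labels)
             if lab in tumor_classes]
    k = max(1, int(len(tumor) * fraction))  # keep at least one patch
    return rng.sample(tumor, k)

# Toy slide: 500 of 1,000 tiled patches are predicted as tumor.
labels = ["O"] * 300 + ["background"] * 500 + ["A"] * 200
ids = list(range(len(labels)))
bag = prompt_patches(labels, ids, fraction=0.01, seed=0)
print(len(bag))  # 1% of the 500 tumor patches -> 5
```

Because selection is re-randomized per epoch, the same slide contributes different small bags over training, which is what regularizes the slide-level classifier.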

Figure 3.


Patch prompting for the image-to-label training pipeline

The bottom layer shows organized tiled image patches of a subject. The middle layer demonstrates the two subtasks of celltype prediction (left) and risk patch prediction (right). The top layer is an illustration of selected patches for one observation.

To evaluate the effectiveness of the model, a total of 733 WSIs of H&E-stained slides were collected from glioma patients, including those with oligodendroglioma grade 2 (O2), oligodendroglioma grade 3 (O3), astrocytoma grade 2 (A2), astrocytoma grade 3 (A3), and astrocytoma/glioblastoma grade 4 (A4). A total of 606 slides were used for model development, including 456 for training and 150 for validation. A dataset of the other 127 slides from 4 institutes was reserved for multi-center testing. The detailed dataset distribution is illustrated in Figure 2B. Table 1 gives the partition of glioma subtypes used for training, validation, and testing. The glass slides were digitalized using slide scanners from two manufacturers under 20× magnification. Figure 2C gives examples of the collected data.

Table 1.

Data distribution of five glioma subtypes in training, validation, and testing sets

Subtype   Training set   Validation set   Testing set
A2        96             40               28
O2        100            44               25
A3        88             28               23
O3        67             20               26
A4        105            18               25
Total     456            150              127

As auxiliary tasks, we further simplified the classification into cell-type classification and risk identification. The performance of the system on the internal validation and multi-center testing sets is reported in terms of accuracy, sensitivity (with mean ± standard deviation), and specificity for the glioma-subtyping task and the auxiliary tasks. Please refer to STAR Methods for the relevant formulas.

Performance assessment

The network was validated on 150 internal WSIs and independently tested on 127 WSIs from four different centers. The validation set, using only internal data from a single center, shows an average slide-level accuracy across five classes of 0.79 with a sensitivity of 0.79. The multi-center testing set shows an overall accuracy of 0.73 and a sensitivity of 0.72. The detailed results of each classification are displayed in Table 2. The performance of the model for each subtype on the corresponding validation and testing sets is described as a confusion matrix in Figure 2D. The results show a 6% decrease in performance from the internal validation set to the multi-center testing set.

Table 2.

Accuracy, sensitivity, specificity of validation and testing sets among three tasks

                  Validation set                               Testing set
Task       Class  Acc   Sens (TTA)  Sens (mean ± SD)  Spec     Acc   Sens (TTA)  Sens (mean ± SD)  Spec
Risk       HGG    0.87  0.86        0.85 ± 0.03       0.88     0.86  0.86        0.85 ± 0.03       0.89
           LGG          0.88        0.88 ± 0.02       0.86           0.86        0.85 ± 0.02       0.84
Cell type  O      0.94  0.92        0.89 ± 0.02       0.95     0.92  0.94        0.93 ± 0.02       0.91
           A            0.95        0.96 ± 0.02       0.92           0.91        0.91 ± 0.02       0.94
Glioma     A2     0.79  0.85        0.82 ± 0.02       –        0.73  0.86        0.85 ± 0.06       –
           O2           0.80        0.78 ± 0.02       –              0.84        0.78 ± 0.04       –
           A3           0.71        0.67 ± 0.05       –              0.43        0.39 ± 0.07       –
           O3           0.80        0.73 ± 0.06       –              0.77        0.80 ± 0.05       –
           A4           0.78        0.73 ± 0.08       –              0.72        0.78 ± 0.06       –

Accuracy is reported once per task; specificity was not reported for the five-class glioma task.

In addition, results for the two subtasks are reported. The first subtask involves risk evaluation, distinguishing between HGG and LGG; high-risk areas indicate histologically malignant features of HGG. The accuracy for validation and testing reached 0.87 and 0.86, respectively. The second subtask involves differentiating between the cell morphologies of oligodendrocytes (O) and astrocytes (A), with accuracies of 0.94 and 0.92, respectively. These binary classification results demonstrate strong performance in both tasks.

We introduced minor modifications to the patch-prompting method for a deterministic testing procedure. We leveraged the test-time augmentation (TTA)23 technique for more robust model prediction and aggregated outputs through soft voting; the formula is shown in STAR Methods.

For each testing subject, patch-level predictions and corresponding probabilities were used for patch prompting. For the celltype subtask, patches predicted as tumor (O or A) with a probability larger than 0.6 were selected. For the risk subtask, high-risk patches with probabilities larger than 0.6 were selected. The selected patches of each subtask were evenly split into 9 bags as inputs for testing. The plots demonstrate that in most cases each run of the TTA process makes a correct prediction, showing the robustness of our model. A comparison of class-wise sensitivity with the TTA process against the average sensitivity of 9 individual testing runs (Table 2) suggests a performance improvement from the test-time augmentation strategy. Furthermore, the results indicate that with a proportion of representative patches, the model can make reasonable predictions (Figure 2E).
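The selection-and-split step above can be sketched in a few lines. This is an illustrative sketch with hypothetical names, not the actual HAS-Bt implementation:

```python
def select_patches(preds, probs, keep=("O", "A"), threshold=0.6):
    """Keep indices of patches whose prediction is in `keep`
    with probability above the 0.6 threshold used in the text."""
    return [i for i, (c, p) in enumerate(zip(preds, probs))
            if c in keep and p > threshold]

def split_into_bags(indices, n_bags=9):
    """Round-robin split into n_bags disjoint bags, one per TTA run;
    bag sizes differ by at most one patch."""
    return [indices[i::n_bags] for i in range(n_bags)]

# Toy example: 50 patch predictions with per-patch probabilities.
preds = ["O", "background", "A", "O", "A", "O", "necrosis", "A", "O", "A"] * 5
probs = [0.9, 0.99, 0.7, 0.5, 0.8, 0.95, 0.9, 0.61, 0.65, 0.75] * 5
bags = split_into_bags(select_patches(preds, probs))
print(len(bags), sum(len(b) for b in bags))  # 9 bags covering all selected patches
```

Each of the 9 bags then goes through the MIL classifier once, producing the 9 probability vectors that are soft-voted.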

Regarding the patch-level results, we compared the input testing subjects with the corresponding attention maps. The attention maps shown in Figure 4 accurately highlight the tumor cell clusters, further supporting the efficacy and robustness of our deep learning model. The image patch prediction results were then combined to complete the final classification of WSIs and produce the slide-level diagnosis.

Figure 4.


WSIs overlaid with subtask predictions

(A) The celltype identification subtask demonstrates accurate clustering of celltype O for a subject with O3.

(B) Patches with high-risk biomarkers are identified.

(C) A patient with mixed class O-A, which is not fully covered by our training samples; our model nevertheless demonstrates strong generalizability in celltype classification, correctly clustering both cell types in the subject.

In summary, through this independent multi-center assessment utilizing a combination of internal and external datasets, we have established that the performance of the glioma pathological classification model we developed has stable generalizability and substantial clinical utility.

Discussion

The main purpose of this study was to examine the performance of our AI diagnosis model for histopathologically classifying five types of glioma on multi-center data. We collected 127 slides from four institutions in China as a multi-center dataset. On the single-center internal validation set, the overall accuracy of the model reached 0.79. However, the model showed a 6% reduction in accuracy on the multi-center testing set, which is in line with our expectation that a reasonable decrease in performance may occur.

In addition to the five-class classification task, our model also performs two important subtasks: cell subtyping and high-grade feature (risk) locating. The cell subtyping network classifies patches into astrocytoma or oligodendroglioma. The second subtask is high-grade feature locating, where microvascular proliferation (MVP) and necrosis (NEC) are the main common features used in glioma grading.24,25 Considering these markers’ association with increased malignant behavior, MVP and NEC are treated as key features in the grading network. These tasks guided the model to better perform the subtyping tasks. Within this network, a feature extractor is employed to reduce dimensions, disassemble patches, and embed them into feature maps. Subsequently, we leveraged a multi-head attention MIL pooling module to aggregate the features into a specific category; this process generates a decision that ensures permutation invariance. The results demonstrate excellent generalizability in glioma subtyping: the accuracy of cell morphology classification remains high, and the differentiation between high-risk and low-risk cases is equally strong.

In contrast to the fine classification performance on the previous tasks, we observed less satisfactory results for category A3, where subjects of this class were misclassified as A2 or A4. The main difference between A3 and A2 lies in nuclear division, information that is relatively difficult to capture; while MVP and NEC can be used to differentiate A3 from A4, the distinction between A2 and A3 is not as clear. The model appears able to make only a rough judgment based on cell density, for which there is no established quantification standard. Consequently, the classification results for A3 suffered the most.

Nevertheless, there is still a slight decrease in performance when transitioning from single-center validation to multi-center testing. Cao et al. developed a weakly supervised deep convolutional network for lung cancer classification from WSIs.26 When tested on a dataset identical to the training set, the model achieved AUCs of 0.95–0.97; when tested on real-world external heterogeneous cohorts, the AUCs dropped to 0.94. Similar to lung cancer, hepatic cancer has also been studied using deep learning for tumor diagnosis based on multi-phase contrast-enhanced CT and clinical data.27 In a single-center test, the diagnostic accuracy was 86.2%; when external data from another center were used, the accuracy decreased to 82.9%. A literature search reveals that a decline in performance on multi-center data testing is a commonly reported issue.28

This phenomenon can be interpreted in several ways, such as dataset bias and diversity. Single-center training may inadvertently introduce biases specific to that institution’s dataset. When transitioning to multi-center testing, the model encounters new data with different characteristics, potentially exposing the limitations of the training data and resulting in a decrease in performance. Increased complexity and variability are also key factors: when involving data from different institutions, potential batch effects need to be considered, as there are inherent variations in imaging techniques and data acquisition protocols. Unlike radiological medical images with the standardized DICOM format, the storage of whole-slide images varies across scanners. In our study, the WSIs in the internal and external datasets were scanned by different scanners, which produced different image outputs. Therefore, it is essential to harmonize data acquisition and preprocessing methods, including segmentation and color adjustments, to minimize variations introduced by different centers. Failure to adequately address these differences can pose challenges for the model’s generalizability, leading to a decrease in performance compared to the single-center setting.

Limitations of the study

There are three major limitations of this study. First, the poor performance on A3 indicates that a dedicated task for A3 is needed. We therefore propose using IHC as a valuable additional source of information,29,30 especially Ki67, for the model to learn mitosis; we plan to implement this idea based on an improved UNet structure to identify prime IHC parameters. Second, the current input for our model consists of 20× magnification images. It may be worth considering multi-resolution images as input in future iterations, so that the model can capture not only cell-level features but also more global patterns. Finally, this study has a relatively small sample size, which may not comprehensively reflect the model’s capabilities; we plan to address this by incorporating a larger number of patients in future iterations. These improvements will be merged with the output of the multi-head attention functions to create an AI brain tumor pathological diagnosis model that better fits the reality of clinical work.

Conclusion

Overall, the multi-center testing of our existing diagnostic model showed the expected decrease in accuracy. However, we remain confident that the model’s performance will improve through further research, paving the way for the model’s practical implementation in clinical settings.

STAR★Methods

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data

Analyzed data This paper Original WSI data are available from the corresponding authors upon reasonable request

Software and algorithms

Affinity Designer2 Serif Ltd https://affinity.serif.com/en-gb/learn/designer/desktop/
Origin OriginLab www.originlab.com
Visual Studio Code Microsoft github.com/microsoft/vscode
Original code about deep learning This paper https://github.com/Tianyangg/HAS-Bt

Other

NVIDIA Tesla V100 GPU NVIDIA https://www.nvidia.com/en-gb/data-center/tesla-v100/
NVIDIA RTX A4000 GPU NVIDIA https://www.nvidia.com/en-gb/design-visualization/rtx-a4000/

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Lei Jin (ozlei91@126.com).

Materials availability

This study did not generate new unique reagents.

Experimental model and study participant details

Ethics statement

The study was ethically approved by the Huashan Hospital Institutional Review Board (HIRB), Shanghai, China (No. KY2015-256). Informed consent was waived because of the retrospective nature of the study. The related systems of slide scanners have been approved by the Chinese National Medical Products Administration (NMPA, with certificate numbers of 20222223642, and 20212220117, respectively).

Data collection

In this study, we collected an internal dataset from Huashan Hospital, Fudan University (Institute A), scanned with a CytoExplorer ZJ300-CS3 (Zhongji Biological), and external datasets from The Affiliated Hospital of Qingdao University (Institute B), The Provincial Hospital Affiliated to Shandong First Medical University (Institute C), and The Affiliated Hospital of Southwest Medical University (Institute D). Slides from Institutes B to D were scanned with a Pannoramic MIDI II (3DHISTECH).

A total of 733 WSIs of H&E-stained slides were collected from glioma patients. These patients come from the pathological databases of Institutes A–D, of either gender and aged over 18 years. Furthermore, the included cases met the following criteria:

Inclusion criteria: i) Comprehension of the research process, agreement to use the patient data, and signing of the informed consent form by the patient or his or her guardian; ii) Tumor surgery performed (resection or biopsy); iii) Postsurgical pathological diagnosis of A2, O2, A3, O3, or A4.

Exclusion criteria: i) Unavailable slides (missing or broken); ii) Unqualified slide or insufficient tissue; iii) A controversial diagnosis from an experienced neuropathologist.

The training data include 388 slides from Institute A, together with 68 slides from Institutes B, C, and D used to finetune the model. A single-center dataset of 150 slides from Institute A served as the validation set. For testing, a multi-center dataset of 127 slides was used for independent testing: 43 from Institute A, 33 from Institute B, 29 from Institute C, and 22 from Institute D.

Method details

Study design

The study is designed as a retrospective, multi-center data testing of a glioma classification model based on the WHO CNS5 standard. It is divided into two parts: model training and performance assessment. Data were collected from four centers, and all pathological slides were digitalized using slide scanners. The performance of the model is reported as class-wise and overall classification accuracy, and the generalizability of the model is evaluated on the independent multi-center testing dataset.

Digitalization of WSIs

Industrial-scale whole-slide imaging scanners feature a high-precision electrical displacement platform that ensures precise slide positioning. For image acquisition, a high-resolution camera with advanced three-dimensional focusing control technology was utilized; this technology was specifically developed to enable rapid and accurate digital scanning of WSIs. The scanned patches from multiple perspectives are stitched and fused together to create a comprehensive WSI. WSIs were stored in TIFF (Tagged Image File Format). We leverage OpenSlide,31 a widely used open-source library for WSI manipulation, to tile the images into patches of 1842×2740 pixels with a 10-pixel overlap under 200× magnification, for better computational efficiency and to fit our experimental setup. Figure S1 illustrates the image tiling process.
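The tiling geometry is straightforward to reproduce. Below is a minimal sketch (ours, not the HAS-Bt code) that computes tile origins for the stated 1842×2740 patch size and 10-pixel overlap, clamping the final row and column so tiles stay inside the slide; with OpenSlide, each origin could then be read with `slide.read_region((x, y), 0, (tile_w, tile_h))`:

```python
def tile_origins(slide_w, slide_h, tile_w=1842, tile_h=2740, overlap=10):
    """Top-left coordinates of tiles covering a slide_w x slide_h image with
    a fixed overlap; assumes the slide is at least one tile in each dimension."""
    stride_x, stride_y = tile_w - overlap, tile_h - overlap
    xs = list(range(0, max(slide_w - tile_w, 0) + 1, stride_x))
    ys = list(range(0, max(slide_h - tile_h, 0) + 1, stride_y))
    # Clamp a final row/column so the right and bottom edges are covered.
    if xs[-1] + tile_w < slide_w:
        xs.append(slide_w - tile_w)
    if ys[-1] + tile_h < slide_h:
        ys.append(slide_h - tile_h)
    return [(x, y) for y in ys for x in xs]

origins = tile_origins(5000, 6000)
print(len(origins))  # 3 columns x 3 rows = 9 tiles for this toy slide size
```

Clamping the last row and column (rather than padding) keeps every tile fully inside the tissue image, at the cost of a slightly larger overlap at the edges.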

Review of the labels

To ensure accurate and reliable diagnostic labels for all retrospective data from multiple centers, we conducted a review of all included labels.32 In the first round of review, a junior pathologist diagnosed all WSIs. In addition to H&E-stained slides, we also incorporated the positive detection rates of immunohistochemical markers such as Ki67 into the diagnostic criteria. The diagnostic results were compared with the original center’s labels. Consistent labels were considered final and treated as the ground truth. In cases where inconsistencies were found, a second round of review was conducted by a senior pathologist with over 15 years of experience, whose diagnostic results were considered the ground truth for those cases.

Model details

Feature extractor

We built the feature extractors with DenseNet121 as the backbone, and the output layer was modified to suit the number of classification targets. The trained networks were adopted as feature extractors to embed the original, large pathology slide $X$ into a compressed representation $\bar{X}$:

$$\bar{X} = f(X)$$

Considering a WSI composed of $n$ patches, one feature vector $\bar{x}_i$ can be obtained from each image patch through the deep model, and the feature map $\bar{X} \in \mathbb{R}^{n \times m}$ can be extracted to represent the WSI.

Let $X$ denote the set of image patches:

$$X = \{x_1, x_2, \ldots, x_n\}$$

Compressor $f$ embeds the original pathology slide into a matrix $\bar{X} \in \mathbb{R}^{n \times m}$, where each row vector (each instance) $\bar{x}_i = f(x_i)$ corresponds to one image patch and $m$ is the embedded dimension:

$$\bar{X} = \{\bar{x}_i = f(x_i) \mid i \in \{1, 2, \ldots, n\}\}, \quad \bar{x}_i \in \mathbb{R}^{1 \times m}$$
Feature aggregator

We introduced a multi-head attention pooling block to compress the feature map $\bar{X} \in \mathbb{R}^{n \times m}$ across the instance dimension. Conventionally, the self-attention operation on the compressed representation $\bar{X}$ is defined as:

$$S = \tanh(\bar{X} W_1) W_2$$

$$Y = \mathrm{softmax}(S^{T}) \bar{X}$$

Here, $W_1 \in \mathbb{R}^{m \times p}$ and $W_2 \in \mathbb{R}^{p \times 1}$ are weighting parameters that can be optimized during training. $S \in \mathbb{R}^{n \times 1}$ is the attention scoring vector obtained from the self-attention mechanism, the entries of which indicate the importance value of each instance. $Y \in \mathbb{R}^{1 \times m}$ is the pooled feature map, which is invariant to instance-wise permutation: the features of the instances are weighted-averaged by self-attention. This method not only improves the feature aggregation performance but also helps to identify the representative regions during prediction.

A single channel of attention is not sufficient to describe the broad hidden space. Clinically, for example, different types of glioma are distinguished utilizing different aspects of H&E stained sections; colors, textures, and the occurrence of important image markers are different aspects that contribute to the diagnosis. Thus, various activation patterns can be triggered under a multiclass classification task. Herein, we introduced a multihead attention mechanism to obtain a comprehensive attention activation pattern. A schematic illustration of the inner workings of this mechanism is shown (Figure S2C). Considering a multihead attention block with h heads, the feature maps generated from the block are:

$$Y = \mathrm{Concat}\left[\mathrm{softmax}(S_1^{T})\bar{X}, \ldots, \mathrm{softmax}(S_h^{T})\bar{X}\right]$$

We designed h parallel attention pooling blocks and concatenated the pooled feature maps. It should be noted that a multihead attention block can also be interpreted as a Bayesian evaluation of attention maps. We added dropout layers and batch normalization operations inside the attention block. Thus, the variance of the attention maps measures the uncertainty of the attention activations, and the average of the maps serves as a stable pattern for activation.
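The scoring and concatenation described above can be sketched numerically. The following is a minimal NumPy illustration with random weights (in the actual model, $W_1$ and $W_2$ are learned by backpropagation, and dropout/batch normalization are applied inside the block):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # shift for numerical stability
    return e / e.sum()

def multihead_attention_pool(X, W1s, W2s):
    """Attention-based MIL pooling with h heads.
    X: (n, m) bag of instance features.
    Head k: scores S_k = tanh(X @ W1_k) @ W2_k, pooled Y_k = softmax(S_k)^T X;
    the h pooled vectors are concatenated into one slide-level feature."""
    pooled = []
    for W1, W2 in zip(W1s, W2s):
        S = np.tanh(X @ W1) @ W2        # (n, 1) attention scores
        A = softmax(S.ravel())          # (n,) weights over instances
        pooled.append(A @ X)            # (m,) weighted average of instances
    return np.concatenate(pooled)       # (h * m,)

rng = np.random.default_rng(0)
n, m, p, h = 12, 16, 8, 4
X = rng.normal(size=(n, m))
W1s = [rng.normal(size=(m, p)) for _ in range(h)]
W2s = [rng.normal(size=(p, 1)) for _ in range(h)]
y = multihead_attention_pool(X, W1s, W2s)
print(y.shape)  # (64,)

# Permutation invariance: shuffling instances leaves the pooled vector unchanged.
y_perm = multihead_attention_pool(X[rng.permutation(n)], W1s, W2s)
assert np.allclose(y, y_perm)
```

Because each head's weights are attached to instances rather than positions, reordering the patches in a bag cannot change the output, which is the permutation invariance the text refers to.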

Cell-subtyping network

We choose weighted cross-entropy loss to guide the optimizer. For a classification problem with T classes, the conventional cross-entropy loss is defined as:

$$\mathcal{L}_{CE} = -\sum_{j=1}^{T} y_j \log p_j$$

$\mathcal{L}$ indicates the loss function, $y_j$ is the true label, and $p_j$ is the probability value produced by the last layer of the deep learning model. We train one of the feature aggregation networks to classify NT (e.g., gliosis and inflammation) versus the five CNS tumors (O, A, MET, LYM, and EPE) using the module introduced previously. We set the number of heads $h = 8$ for the multihead attention mechanism and backpropagate the cross-entropy loss during optimization. We randomly drop 0–20% of the patches from a patient during training for better generalizability, simulating the variability of data acquisition.

Grading network

Training and augmentation strategy

The pretrained models were used as feature extractors with no error back-propagation through them, so that only the MIL model was updated from the input images. To enhance the diversity of the training data, we applied various augmentations to the patches, including random rotation, shifting, flipping, and shear transform. Additive Gaussian noise and Poisson noise were randomly added, further increasing the variability of the training samples.

We leveraged gradient descent to optimize the parameters of the neural networks. For classification tasks, we used a weighted Cross Entropy (CE) Loss with class labels as supervision. The formula of Weighted CE loss is described as:

$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} w_j \, y_{ij} \log(p_{ij})$$

where $N$ is the number of samples in the dataset, $C$ is the number of classes of the training target, $w_j$ is the weight of class $j$, $y_{ij}$ is the binary indicator of whether sample $i$ belongs to class $j$, and $p_{ij}$ is the predicted probability that sample $i$ belongs to class $j$.

We used balanced class weighting during training, where each class weight is the inverse of the number of samples in that class. This mitigates the class imbalance in the collected dataset.
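The weighted CE loss with inverse-frequency class weights can be sketched as follows (an illustrative NumPy version; the PyTorch implementation would instead pass the weights to its built-in cross-entropy loss):

```python
import numpy as np

def balanced_weights(labels, n_classes):
    """Class weight = inverse of the per-class sample count."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return 1.0 / np.maximum(counts, 1.0)

def weighted_ce(probs, labels, weights):
    """Weighted multi-class cross-entropy, averaged over N samples.

    probs:  (N, C) predicted class probabilities.
    labels: (N,) integer class labels (the one-hot y_ij collapses to
            selecting the true-class probability p_ij).
    """
    n = len(labels)
    p_true = probs[np.arange(n), labels]          # p_ij for the true class
    return -np.mean(weights[labels] * np.log(p_true + 1e-12))
```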

Test-time-augmentation (TTA)

For results aggregation, the 9 probability vectors produced by the MIL network were averaged, and the final label was assigned to the class with the highest averaged probability.

The soft-label voting equation for test-time augmentation is:

$y_j = \frac{1}{M}\sum_{m=1}^{M} f(x, \theta_m)_j$

where $y_j$ is the predicted probability of class $j$ for a given input sample, $M$ is the number of augmented samples generated by test-time augmentation, and $f(x, \theta_m)_j$ is the prediction of model $f$ for class $j$, given an augmented input sample $x$ and the corresponding model parameters $\theta_m$.
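The soft-label voting step can be sketched as follows (an illustrative version; `predict_fn` stands in for one forward pass of the trained MIL network on one augmented copy of the input):

```python
import numpy as np

def tta_predict(predict_fn, augmented_inputs):
    """Average the soft-label vectors over M augmented copies of one
    sample and assign the class with the highest averaged probability."""
    probs = np.stack([predict_fn(x) for x in augmented_inputs])  # (M, C)
    mean_probs = probs.mean(axis=0)                              # soft vote
    return mean_probs, int(np.argmax(mean_probs))
```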

Quantification and Statistical analysis

Multi-class classification

Overall accuracy is defined as:

$Accuracy = \frac{\sum_i TP_i}{\text{Number of Samples}}$

For both binary and multi-class classification:

$Sensitivity_i = \frac{TP_i}{TP_i + FN_i}$
$Specificity_i = \frac{TN_i}{TN_i + FP_i}$
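These per-class metrics, with sensitivity $TP_i/(TP_i+FN_i)$ and specificity $TN_i/(TN_i+FP_i)$ under the standard one-vs-rest convention, can be computed from a confusion matrix as follows (an illustrative sketch):

```python
import numpy as np

def class_metrics(conf):
    """Per-class sensitivity and specificity from a confusion matrix,
    where conf[i, j] = number of class-i samples predicted as class j."""
    total = conf.sum()
    tp = np.diag(conf).astype(float)   # correctly predicted class-i samples
    fn = conf.sum(axis=1) - tp         # class-i samples predicted as others
    fp = conf.sum(axis=0) - tp         # other classes predicted as class i
    tn = total - tp - fn - fp          # everything else
    return tp / (tp + fn), tn / (tn + fp)
```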

Mean and standard deviation

To estimate the mean and standard deviation of the model's performance, we calculated the class-wise sensitivity for each run of the TTA process.

The formula of the mean of class-wise sensitivity is:

$Mean_c = \frac{1}{N}\sum_{i=1}^{N} Sensitivity_{ci}$

The standard deviation is calculated as:

$STD_c = \sqrt{\frac{\sum_{i=1}^{N}\left(Sensitivity_{ci} - Mean_c\right)^2}{N}}$

where $N$ is the number of times the model is tested (in our case $N = 9$), $c$ is the class index, and $i$ is the index of the testing run.
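These statistics can be computed directly from the matrix of per-run class-wise sensitivities (an illustrative sketch; note the population standard deviation divides by $N$, matching the formula above):

```python
import numpy as np

def sensitivity_stats(sens_runs):
    """Per-class mean and population standard deviation of sensitivity.

    sens_runs: (N, C) array of class-wise sensitivities, one row per
    TTA testing run (N = 9 in this study).
    """
    sens_runs = np.asarray(sens_runs, dtype=float)
    # ddof=0 (the default) divides by N, i.e. the population std
    return sens_runs.mean(axis=0), sens_runs.std(axis=0)
```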

Acknowledgments

The authors would like to express their gratitude to the following three institutions: The Affiliated Hospital of Qingdao University, The Provincial Hospital Affiliated to Shandong First Medical University, and The Affiliated Hospital of Southwest Medical University, for sharing their digital pathological data, which enabled the progress of this multi-center study.

This project was supported by Shanghai Municipal Science and Technology Commission “Science and Technology Innovation Action Plan” Project (No.22S31905400), and Shanghai Municipal Health Commission “Clinical Research (Youth)” Project (No.20234Y0308).

Author contributions

Conceptualization, L.J., Yingchao.Liu., and J.W.; Methodology, T.S., Z.C., and F.S.; Investigation, X.L., Yan.Liu., H.C., Y.M., J.Z., and Y.Z.; Writing – Original Draft, L.J., T.S., X.L., and Z.C.; Writing – Review & Editing, F.S. and Yingchao.Liu.; Funding Acquisition, J.W. and D.S.; Resources, Yingchao.Liu.; Supervision, J.W. and D.S.

Declaration of interests

T.S., Z.C., F.S., and D.S. are employees of United Imaging Intelligence. Y.Z. and J.Z. are employees of Zhongji Biotechnology. These two companies had no role in designing or performing the study, or in analyzing and interpreting the data. All other authors report no competing interests relevant to this article.

Inclusion and diversity

We support inclusive, diverse, and equitable conduct of research.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work the author(s) consulted ChatGPT in order to improve language and readability. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.

Published: September 29, 2023

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.isci.2023.108041.

Contributor Information

Lei Jin, Email: ozlei91@126.com.

Yingchao Liu, Email: 13805311573@126.com.

Feng Shi, Email: feng.shi@uii-ai.com.

Dinggang Shen, Email: dinggang.shen@gmail.com.

Supplemental information

Document S1. Figures S1 and S2
mmc1.pdf (642.8KB, pdf)

Data and code availability

  • All data reported in this study, including WSIs and patch images, are available from the lead contact upon reasonable request.

  • The custom code in this study for training the deep learning and iterative reconstruction models was written in Python with PyTorch. The code is publicly available on GitHub: https://github.com/Tianyangg/HAS-Bt.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.


Supplementary Materials

Video S1. System demonstration of HAS-Bt
Download video file (32.1MB, mp4)


