Abstract
The recent advancements in computational pathology focus on extracting valuable prognostic insights from whole-slide images (WSIs). These methods primarily involve deep learning-based or handcrafted feature representations of the disease’s morphologic patterns associated with outcomes. However, determining the most prognostic regions within tumors remains challenging due to significant morphologic heterogeneity, even within manually annotated tumor areas. In other words, the question is not simply what type of representation is appropriate to predict cancer outcomes, but specifically where to mine those representations. To address this issue, we present DOVER, a deep learning framework to identify prognostically relevant (PR) cancer regions within WSIs. DOVER leverages patterns mined from tissue microarray (TMA) spots with associated long-term clinical outcomes. The prognostic patterns learned from the morphologically consistent individual TMA spots are then mapped onto larger WSIs to locate PR regions for subsequent feature representation and patient outcome prediction. DOVER improves prognostic prediction by over 20% in terms of c-index (p<0.05) across 2,041 patients (NSCLC: n=1,141; OPSCC: n=900). Moreover, correlations with quantitative immunofluorescence (QIF) images reveal a diverse distribution of CD8+, CD20+, CD4+, and tumor cells in DOVER-selected regions, reflecting a complex interplay between tumor and immune cells. DOVER identifies statistically significant differences between PR regions at both the molecular and morphological levels. DOVER could help identify specific spatial locations on WSIs from which to mine prognostic feature representations for subsequent prediction of clinical outcomes. With additional validation, DOVER could also potentially help guide AI-informed molecular profiling of tumors.
Keywords: Computational Pathology, Prognostic Prediction, Prognostic Relevance
1. Introduction
Computational pathology approaches have increasingly evolved from the detection of biological primitives or regions of interest (ROI) [1], [2] in whole-slide images (WSIs) of H&E-stained tissue specimens toward higher-level analyses, such as prognostication of clinically relevant outcomes across different cancer subtypes [3], [4], [5]. A substantial number of studies have focused on what to learn: the feature representation of images and biomarkers to associate with patient outcomes. The two broadly divergent feature representation methodologies utilized to accomplish this are deep learning-based (DL) [4] and domain-inspired handcrafted feature (HF)-based approaches [5], [6]. Handcrafted features are representations associated with specific primitives or morphologic structures on histopathology, especially those associated with cell nuclei and lymphocytes. Several studies have shown that these handcrafted features can be prognostic of clinically relevant outcomes in multiple cancer types [3], [6]. On the other hand, deep learning approaches rely on automatically learned feature representations from entire slides without explicit feature engineering [8], [9], [10]. Recently, numerous pretrained vision foundation models have been developed in computational pathology [7], [11], [12], [13] for diagnosis and prognosis tasks with promising results. These approaches have leveraged large-scale self-supervised learning to learn generic visual feature representations at either the patch or slide level [7], [14] for disease prognostication. While the type of feature representation used to prognosticate outcome is important and has been widely studied [3], [6], [8], [15], what has been less studied is where in the gigapixel WSIs the signals that contribute to outcome prediction reside [16].
Identifying regions for feature representation within tumors could refine biomarker discovery, as demonstrated in gastric cancer studies where multi-modal data improved the prediction of response to anti-HER2 therapy [5]. This clinical urgency arises because spatial heterogeneity within tumors impacts treatment efficacy by influencing biomarker expression and immune cell distributions, making precise localization of PR regions essential for advancing personalized therapy and improving patient outcomes. Several foundation models also utilize attention mechanisms to aggregate patch-level feature representations into slide-level representations. However, it is potentially suboptimal to use such attention-based slide-level interpretation directly for survival prediction, because (1) multiplicative attention-based methods (e.g., dilated attention in Prov-Gigapath [17]) learn hidden pairwise relationships between patches rather than the individual importance of each patch to the survival outcome, and (2) additive attention-based methods (e.g., CHIEF variants with non-gated attention [18]) assign scores to individual patches, yet offer no explicit guarantee that patches with higher scores contain actual prognostic information aligned with patient outcomes. Therefore, closing this gap between model prediction and clinical relevance remains a key challenge.
In this context, we distinguish the concept of a variable being prognostic from a region being prognostically relevant (PR). A feature descriptor is prognostic if it is statistically associated with patient outcomes. For instance, the density of tumor-infiltrating lymphocytes (TILs) can be prognostic in non-small cell lung cancer (NSCLC), wherein a higher TIL density is often associated with better overall patient survival [3], [19]. In contrast, a tissue region is prognostically relevant (PR) if it contains information that directly contributes to a model’s correct prediction of patient outcome. That is, not all regions containing prognostic features (e.g., high or low TIL density) necessarily aid in accurate prediction. For instance, consider a WSI of a prostate cancer specimen with a large region of Gleason grade group 1 (3+3) and a small focus of tumor corresponding to Gleason grade group 5 (>4+4). Even though grade group 1 dominates the real estate of the tumor, the outcome of the patient is driven by the small focus of grade group 5. In this example, the grade group 1 region would be less prognostically relevant, whereas the grade group 5 region would be highly prognostically relevant. Combining high- and low-PR regions will likely dilute the prognostic signal for downstream survival prediction. Moreover, in the case of WSI data, a tumor resides in a complex environment with different components, such as immune cells and extracellular matrix. The interactions between these components and the tumor are crucial for tumor development and progression [20]. This complex environment also implies significantly more morphological heterogeneity. Due to such substantial intra-tumoral (and potentially molecular) heterogeneity [21], [22], tiles from different spatial locations, even within the tumor ROI of a WSI, may harbor information of varying degrees of PR.
However, it is unclear where to sample the most PR regions across the slides to train a downstream survival predictor from a prognostic prediction perspective, even if tumor annotations are provided.
We elaborate on the importance of PR in Figure 1, wherein downstream survival models could benefit from tiles with a high PR score compared to those tiles with a low PR score. As shown in Figure 1, tiles with low PR could represent parts of the tumors that are prognostically irrelevant or less relevant (e.g., the pattern of Gleason grade group 3 within the local region). From high-PR regions (e.g., patterns of the Gleason grade group 5), the extracted feature representations could be more precisely correlated with the end outcome, regardless of whether the outcome is favorable or not. Hence, it is crucial to identify where within the vast area of tissue in WSIs the most PR regions lie, to generate the most accurate prognostic predictions.
Figure 1.

The performance of the tile-based survival analysis model depends on whether the samples have prognostic relevance (PR), as such regions contain the relevant survival information and patterns correlated with the patient outcome. Conversely, tiles without PR mostly contribute noise to the training data, as such regions do not contain relevant survival information. In practice, most random tile selection-based approaches may end up with too much noise in the training data and therefore fail to learn the survival model.
To mitigate the challenges above, in this work we introduce DOVER, a deep learning framework to identify prognostically relevant cancer regions from WSIs, which, regardless of the type of feature representation, assigns a PR score to each tile and identifies the highest-PR regions within the tumor for downstream prognosis prediction (Figure 2). DOVER utilizes patterns extracted from tissue microarray (TMA) spots linked to long-term clinical outcomes, which are typically homogeneous in tissue morphology. These prognostic patterns, learned from the relatively homogeneous morphology of individual TMA spots, are then mapped onto larger WSIs via adversarial training, in order to improve outcome prediction using the same feature representation. We hypothesize that tiles with PR can be approximated by a pretrained TMA-based survival classifier, which serves as a weak labeler. We validate the performance of DOVER in predicting the risk of overall survival (OS) for over 2,000 non-small cell lung cancer (NSCLC) and p16+ oropharyngeal squamous cell carcinoma (OPSCC) patients. In the scope of this study, where to generate feature representations for prognostic models is guided by the PR pattern derived from the prediction results of the referenced TMA-based survival classifier [6], [23]. To illustrate the importance of where features are extracted over what the feature representation is, we couple DOVER with two different downstream survival models: one HF-based and one foundation model-based. We demonstrate that, regardless of the choice of feature representation (HF or foundation model-based), training the downstream survival classifier on the high-PR regions selected by DOVER consistently yields superior performance.
Additionally, we compare DOVER with a range of tile selection and aggregation approaches, encompassing pathology attention-based multiple instance learning (ABMIL) methods, including CLAM, TransMIL, and DSMIL [24], [25], [26]. We further investigated the biological underpinnings of the DOVER-predicted regions by leveraging quantitative immunofluorescence (QIF) images stained with immune and tumor markers. The cell-level composition of the tumor microenvironment based on QIF was analyzed to disentangle the complicated tumor-immune cell interactions captured by the DOVER approach.
Figure 2.

Comprehensive workflow of developing and validating DOVER: (a) preparing the training data based on the prediction correctness of the referenced TMA-based survival classifier: PR labels are assigned according to whether or not a tile contributes to correct predictions of the TMA-based survival classifier; (b) predicting the whole-slide heatmap of prognostic relevance (PR) by mapping the homogeneous patterns from TMAs onto larger WSIs via adversarial training; (c) downstream prognostic validation based on morphological features extracted within DOVER high-PR regions; (d) molecular validation of cell types within DOVER high-PR regions.
2. Methods and Materials
Dataset Description
Formalin-fixed, paraffin-embedded H&E-stained WSIs and tissue microarrays (TMAs) were collected from five independent cohorts, including both NSCLC and OPSCC, resulting in a total of 2,041 patients. A detailed consort diagram is illustrated in Figure 3. The NSCLC cohort collected from Yale University in the form of a TMA was employed as the training and internal validation set (D1, N = 100). In addition to H&E slides, QIF with CD8, CD20, CD4, and PanCK markers was also available for adjacent tissue cuts in cohort D1. Another two NSCLC cohorts, comprising publicly available diagnostic WSIs from The Cancer Genome Atlas (TCGA), including D2, containing lung adenocarcinoma (LUAD, N = 530), and D3, containing lung squamous cell carcinoma (LUSC, N = 511), served as the independent test sets for NSCLC. Three cohorts of oropharyngeal squamous cell carcinoma (OPSCC) were retrospectively collected from Vanderbilt University (D4, N = 405), the Cleveland Clinic (CCF) (D5, N = 313), and Kaiser Clinical (D6, N = 182). For OPSCC, D4 was employed as the training (N=284) and internal validation (N=121) set, while D5 and D6 were used for independent testing. Tumor ROIs in the WSIs of D4 were pre-selected by pathologists to simulate the TMA data presentation in D1. These pre-selected regions, with relatively homogeneous tumor morphology, were employed to build DOVER and the associated downstream survival prediction models for OPSCC. D5 and D6, in the format of WSIs, were used to validate the OPSCC prognostic model. The WSIs and TMAs were digitally scanned at 20x with approximately 0.25 microns per pixel. The quality control tool HistoQC [27] was employed before the image processing steps to exclude slides with major tissue artifacts (e.g., tissue folding or blurring covering more than 50% of the pixel area).
Given the retrospective nature of the study, informed patient consent was waived, and approval for the study was obtained through the institutional review board.
Figure 3.

Consort diagram of how datasets were curated and assessed in this study with a cohort that comprises 2,041 patients in total, consisting of an NSCLC cohort (N=1141) and an OPSCC cohort (N=900). For the study of NSCLC, D1 was applied for training (N=70) and validation (N=30) for the DOVER, while D2 (N=529) and D3 (N=511), after tissue quality control were used as independent test sets. For the study of OPSCC, D4 was used for training (N=284) and validation (N=121) of DOVER, while D5 (N=307) and D6 (N=173) were applied as independent test sets.
PR Label Assignment for DOVER
For survival tasks, there are rarely accurate ground-truth labels and associated annotations to train a strongly supervised model to identify PR at the pixel level. Therefore, as an approximation, we define a tile as PR if it contributes to the correct prediction of a survival classifier. As described in supplemental material S1, we set up the pretrained TMA-based survival classifier as a weak labeler to generate low- and high-PR labels for the training data. This classifier is trained using handcrafted nuclei features extracted from TMA spots. Image tiles that lead to a "correct prediction" by this classifier (i.e., a prediction aligned with the actual patient survival outcome or risk stratification) are labeled as high PR, while misclassified tiles are labeled as low PR.
Patients’ actual survival outcomes were obtained from electronic health records during dataset curation. Because individual TMA spots are small and morphologically consistent, we attempt to establish correlations between the morphological patterns within the TMA spots and the presence of PR regions for accurate prognostic prediction. Consequently, each spot can be viewed as an individual, distinctive, and consistent pattern that can be seamlessly mapped onto the WSIs. This approach enables us to pinpoint corresponding patterns on the WSIs at a spot-by-spot level. Each spot on the TMA is assigned a PR-level label (according to whether the TMA-based survival classifier correctly predicts survival) during the training phase. During the evaluation of WSIs, we discern regions with varying PR scores, distinguishing between those with low and high PR. Details of training data generation are provided in supplemental material S1.
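The weak-labeling rule above can be sketched in a few lines. This is a minimal illustration assuming a binary risk-group encoding; the function and variable names are ours, not from the paper's code (the actual classifier is described in supplemental material S1).

```python
import numpy as np

def assign_pr_labels(predicted_group, true_group):
    """Weak PR labeling rule: a TMA spot is high PR (1) when the
    referenced survival classifier's predicted risk group agrees with
    the patient's observed outcome group, and low PR (0) otherwise.
    The binary group encoding here is illustrative."""
    predicted_group = np.asarray(predicted_group)
    true_group = np.asarray(true_group)
    return (predicted_group == true_group).astype(int)
```

For example, spots whose predicted risk group matches the observed outcome group receive label 1 (high PR), and mismatched spots receive label 0 (low PR).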
Identification of Prognostically Relevant Regions on WSIs
A fully convolutional network was trained with the given TMA tiles and PR labels. It takes tissue tiles as input and generates the corresponding heat map of PR as output. The training process was set up as a weakly supervised learning procedure for two reasons. First, the PR labels obtained from the pretrained TMA-based survival classifier (supplemental material S1) are merely an approximation. For instance, an input tile that leads to a TMA-based model prediction unaligned with the patient outcome does not necessarily indicate that the tile has no prognostic information relevant to the patient outcome; the PR information may simply require a different feature representation to generate a correct prognosis. Second, the TMA-based survival classifier assigns a binary label to each tile, while the training is done at the pixel level. Similar to Xu et al.’s work [28] on weakly supervised training, labels of all pixels were assigned as 1 for positive samples and 0 for negative samples. Both positive and negative samples may therefore contain label noise. Naturally, not all pixels in positive sample tiles are PR. This leads to false positives, meaning the output heat map of PR may also include irrelevant tissue regions as PR. However, a more important source of label noise is PR regions missed by the model due to its limited performance, i.e., false negatives. False negatives in slide or tile labels would ultimately force the model to deemphasize prognostically relevant regions. To mitigate the impact of false positive and false negative labels, our training process incorporates an adversarial attack as a regularization technique, applying adversarial noise to input samples to push the model’s decision boundary and correct for potential label noise. Details of training the PR identification model are provided in supplemental material S2.
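The adversarial-noise regularization can be illustrated with a minimal FGSM-style step on a toy logistic model. This is our simplification for exposition only; the paper applies adversarial noise to a fully convolutional network, with the actual attack settings given in supplemental material S2.

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps=0.05):
    """One FGSM-style adversarial perturbation for a toy logistic
    model p = sigmoid(w.x + b) under binary cross-entropy loss:
    move x a small step along the sign of the loss gradient, which
    pushes samples toward the decision boundary during training."""
    p = 1.0 / (1.0 + np.exp(-(np.dot(x, w) + b)))
    grad_x = (p - y) * w        # dL/dx for binary cross-entropy
    return x + eps * np.sign(grad_x)
```

Training on such perturbed inputs flattens the loss landscape around each sample, which is the regularization effect used here to tolerate noisy PR labels.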
Feature Representation from PR Regions for Prognostic Prediction
The general pipeline of DOVER is shown in Figure 2. The trained PR identification model scans each tile in the tissue region and outputs a pixel-level softmax score in the range [0, 1] to estimate the PR level of each pixel. For each tile, the average score of all pixels was calculated to represent the tile-level PR; because the PR identification model was trained with weak pixel-level labels derived from tile-level labels, where label noise may be introduced, averaging the pixel-level scores mitigates the impact of pixel-level misclassification due to label noise. For convenience, we use 0.5 as the threshold between high and low PR: tiles with average scores above 0.5 were deemed high PR, while those below the threshold were considered low PR. Only high-PR regions are used to generate feature representations for downstream empirical prognosis analysis. The downstream prognostic models were then trained and tested only in the DOVER-detected tissue regions. The implementation details of training, downstream survival model selection, and deployment of DOVER are elaborated in supplemental material S3.
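The tile-level scoring rule reduces to a few lines; a sketch with illustrative names, assuming the pixel-level PR heatmap for a tile is given as an array of softmax scores:

```python
import numpy as np

def tile_pr_label(pixel_scores, threshold=0.5):
    """Average the pixel-level softmax PR scores of a tile and
    threshold the mean at 0.5, as described above. Returns the mean
    score and a 'high'/'low' PR label."""
    mean_score = float(np.mean(pixel_scores))
    return mean_score, "high" if mean_score > threshold else "low"
```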
Statistical Analysis
We empirically evaluate whether DOVER-selected regions are PR by assessing the performance of downstream survival predictors trained within DOVER-selected regions. Cox Proportional Hazards Models (CPHM) were trained and evaluated on DOVER-selected areas with high- and low-PR levels and on each of the comparative tile selection strategies, as described in supplemental material Table S1. Feature representations within PR regions comprised two categories, i.e., domain-inspired hand-crafted features and foundation model-derived feature embeddings. Referenced nuclei-based HFs [6], [23], previously used for prognostic analysis of NSCLC and OPSCC, were employed to build the HF-based CPHM. The HF-based CPHMs were utilized to predict the OS outcome of NSCLC and OPSCC patients from independent test sets, as shown in Figure 3.
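The c-index reported throughout the results measures how often a survival model ranks patient risks consistently with observed survival times. As a reminder of the metric, here is a minimal sketch of Harrell's concordance index (our own implementation, not the paper's code):

```python
def concordance_index(times, events, risks):
    """Harrell's c-index: the fraction of comparable patient pairs in
    which the model assigns the higher risk to the patient with the
    shorter observed survival time. A pair (i, j) is comparable when
    the earlier time corresponds to an observed event (not censored).
    Tied risk predictions count as 0.5."""
    n, num, den = len(times), 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                den += 1
                if risks[i] > risks[j]:
                    num += 1
                elif risks[i] == risks[j]:
                    num += 0.5
    return num / den if den > 0 else float("nan")
```

A c-index of 0.5 corresponds to random ranking and 1.0 to perfect concordance, which is why a greater-than-20% relative improvement in c-index is a substantial effect.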
3. Results
Prognostication of Survival in NSCLC
HF-based (e.g., cell, graph, nuclear shape, texture) feature representations were generated from DOVER high-PR regions to build a prognostic classifier for NSCLC from D1 [6]. The independent test sets D2 and D3 were employed to evaluate 5-year OS outcome prediction among NSCLC patients of early stage (ES) only (I and II) and of all stages. Figure 4 illustrates the results of the log-rank test and the Kaplan-Meier (KM) survival analysis stratifying NSCLC patients into high- and low-risk groups, along with the hazard ratio (HR), 95% confidence interval (CI), and the corresponding log-rank p-values. The HF classifier based on DOVER-selected tiles could stratify patients into high- and low-risk groups: all LUAD patients (Figure 4a) with HR (95% CI) = 1.62 (1.22 – 2.16), p<0.001; ES-LUAD patients (Figure S1a) with HR (95% CI) = 1.54 (1.08 – 2.18), p=0.013; all LUSC patients (Figure 4b) with HR (95% CI) = 1.63 (1.18 – 2.24), p=0.001; and ES-LUSC patients (Figure S1b) with HR (95% CI) = 1.72 (1.19 – 2.51), p=0.002. All of the predictions were statistically significant at the p < 0.05 threshold. As a comparison, the HF-based classifier failed to obtain statistically significant prognostic predictions on D2 and D3 (Figure S1c–d) when employing feature representations from outside the DOVER high-PR regions.
Figure 4.

Comparison of prognostic prediction when employing DOVER high-PR (a-d) and low-PR regions (e-h): quantitative histomorphometric handcrafted features (a, b, e, and f) and foundation model features (c, d, g, and h) extracted from high-PR regions are able to significantly differentiate patients into different survival groups, whereas the same features extracted from low-PR tiles are not.
We further investigated the performance of the UNI feature-based NSCLC OS prediction model using DOVER-detected high-PR tiles (Figure 4c and d) and low-PR tiles (Figure 4g and h), respectively: all LUAD patients with high PR (Figure 4c) with HR (95% CI) = 1.40 (1.06 – 1.85), p=0.015; all LUSC patients with high PR (Figure 4d) with HR (95% CI) = 1.53 (1.03 – 2.09), p=0.016. On the other hand, UNI features failed to deliver prognostic predictions when the feature representations were generated from outside the high-PR regions (HR (95% CI) = 1.04 (0.78 – 1.38), p=0.789, and HR (95% CI) = 1.20 (0.74 – 1.96), p=0.424; Figure 4g and h).
Prognostication of Survival in OPSCC
We also evaluated DOVER’s performance in identifying regions with high PR in an OPSCC cohort. DOVER identified the top tiles with the highest PR from D4 to train the HF classifier. The independent test sets D5 and D6 were employed to evaluate 5-year OS outcome prediction performance among all OPSCC cases. The corresponding KM curves and log-rank test p-values, along with the HRs and 95% CIs, are reported in Figure 5: D5 with HR (95% CI) = 2.3 (1.5 – 3.53), p<0.001 (Figure 5a), and D6 with HR (95% CI) = 2.23 (1.07 – 4.65), p=0.009 (Figure 5b). On the other hand, the corresponding classifier based on low-PR regions failed to achieve statistically significant prognosis prediction, as shown in supplemental material S4, Figure S1 e–f. Hence, the results suggest that DOVER is able to identify tiles with high PR from OPSCC WSIs.
Figure 5.

The prediction of the OS outcome of OPSCC in D5 (left) and D6 (right) by the DOVER approach, with survival prediction by handcrafted features (a and b) and UNI features (c and d).
Evaluating Molecular Underpinnings of High- and Low-PR Tiles via Quantitative Immunofluorescence
To validate the biological underpinnings of the high- and low-PR tiles identified by DOVER, we employed QIF (CD8, CD20, CD4, and PanCK) stained images of adjacent tissue cuts from the same TMA cohort D1. QIF enabled evaluation of cellular-level protein expression by measuring the fluorescence intensity of each channel. The QIF and H&E images were first aligned and registered using the SURF feature-matching algorithm to compensate for the tissue deformation and displacement between adjacent tissue cuts [29]. As illustrated in Figure 6, DOVER-selected high-PR regions were enriched with a more balanced distribution of immune cells, CD8 (16%), CD20 (25%), and CD4 (19%), alongside tumor cells (40%). In contrast, low-PR regions determined by DOVER were dominated by tumor cells (48%), with scattered immune cells of imbalanced composition (CD8, CD20, and CD4 at 10%, 22%, and 20%, respectively). When comparing cell counts between DOVER low- and high-PR regions, both the median number and the interquartile range were larger in DOVER high-PR regions. In further subtype analysis, lung squamous cell carcinoma was enriched with tumor epithelial cells (42%) and CD8 (22%) cells, while CD20 (16%) and CD4 (20%) immune cells were notably more abundant in DOVER high-PR regions. In lung adenocarcinoma, tumor epithelial cells (40%), CD20 (26%), and CD4 (19%) were significantly more abundant than CD8 (15%) immune cells.
Figure 6.

Analysis of cell compositions by leveraging QIF images, between low-PR (coldspot) and high-PR (hotspot) regions: (a) cell composition from DOVER high-PR and low-PR regions; (b) comparison of cell subtype counting between high-PR and low-PR based on QIF.
Morphologic Interpretation of DOVER high-PR and low-PR regions
Quantitative differences in HFs between good and poor survival outcomes were tested (Wilcoxon rank-sum test) for three different region types suggested by DOVER. The corresponding distributions of the selected features are illustrated in the violin plots in Figure 7 (a–i). Significant differences in cell morphology, nuclear staining patterns, and local cellular neighborhood features were observed between good and poor outcome groups. In contrast, these same features failed to demonstrate statistically significant discriminative power when evaluated in either low-PR or randomly sampled regions. This differential performance suggests that high-PR regions may harbor biologically and prognostically relevant cellular phenotypes that are attenuated in low-PR regions. To further assess the DOVER approach, we employed UMAP to project the high-dimensional nuclei-based HFs onto a 2D plane to visualize the separability and difference of low- and high-PR regions in terms of nuclear morphology. As shown in Figure 7 (j and k), the distribution of high-PR nuclear morphologic features (red) is separable from the low-PR features (green). This, in turn, reflects DOVER’s ability to distinguish high-PR from low-PR regions. Moreover, this illustration also reveals the distinctive morphological attributes of high- versus low-PR regions. Ripley’s L-function was computed for immune cells (CD4, CD20, CD8) and tumor cells (PanCK) within high-PR regions (Figure S3), revealing significant clustering of all immune subtypes at short-to-moderate distances (12.5 to 50 μm), with CD4 showing the strongest clustering, followed by CD20 and CD8. Tumor cells, however, showed clustering to a lesser degree than immune cells within high-PR regions. Clustering diminished toward randomness at larger scales (about 200 μm) for all cell types. These findings demonstrate a nonrandom spatial organization and suggest proximity between immune and tumor cells in high-PR regions.
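The Ripley's L-function used in this analysis can be sketched as follows. This is a simplified version without the edge corrections typically applied in practice, and the function name is ours; under complete spatial randomness L(r) ≈ r, while L(r) > r indicates clustering at scale r.

```python
import numpy as np

def ripley_l(points, r, area):
    """Ripley's L-function L(r) = sqrt(K(r)/pi) for a 2-D point
    pattern, without edge correction. K(r) is estimated as
    area * (#ordered pairs within distance r) / (n * (n - 1))."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)          # exclude self-pairs
    k = area * (d < r).sum() / (n * (n - 1))
    return np.sqrt(k / np.pi)
```

For a tightly clustered point set, L(r) far exceeds r at short scales, which is the pattern reported for immune cells within high-PR regions.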
Figure 7.

Comparison between the dominant nuclei features between low-risk and high-risk patients from DOVER high-PR (a-c, hotspots), low-PR (d-f, coldspots), and entire tumor regions (g-i). UMAP projection of the nuclei-based HF from high-PR (hotspots) and low-PR (coldspots) to the 2D-plane for LUAD (j) and LUSC (k) patients. Embedding of red and green dots reflects the separability of high-PR (hotspots) and low-PR (coldspots) regions.
Comparison with Different Tile Sampling and Feature Representation Strategies
To demonstrate that DOVER-selected high-PR regions are more relevant to patient survival outcomes than low-PR regions, we compared the downstream survival prediction performance between models trained on hotspots (top-10 tiles with the highest PR scores rated by DOVER) and coldspots (bottom-10 tiles with the lowest PR scores). Details are included in supplemental material S4. Table 1 gives a comprehensive list of ablation studies to justify the added value of each component of DOVER. We verified whether adversarial attacks improve the robustness of DOVER’s training. Moreover, we assessed whether DOVER’s tile selection was at all necessary by comparing it with random tile sampling and with using all tiles for downstream survival classifier training. To further evaluate how DOVER-selected regions could improve the performance of the downstream survival model, regardless of the choice of feature representation (i.e., hand-crafted or deep learning), we also included a recently developed foundation model (UNI) as an alternative tile feature extractor [11]. 1024-dimensional feature vectors were extracted from the last layer of the UNI encoder for prognosis prediction across all validation sets. Moreover, we also included ABMIL approaches to automatically identify sub-regions for prognostic prediction. Specifically, we employed the CLAM, TransMIL, and DSMIL [24], [25], [26] approaches with UNI-derived [11] features and replaced the cross-entropy loss in the classifier head with a referenced negative log-likelihood survival loss to perform end-to-end survival prediction [21], [28]. Tissue regions were assigned different weights based on attention scores before being aggregated for slide-level prediction.
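The attention-weighted aggregation used by such ABMIL baselines can be illustrated with a generic CLAM-style additive attention pooling sketch in NumPy. The parameter matrices here are illustrative stand-ins for learned weights, not the actual CLAM, TransMIL, or DSMIL implementations (TransMIL in particular uses transformer self-attention rather than this additive form):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def abmil_pool(patch_feats, v, w):
    """Additive attention-based MIL pooling: score each patch
    embedding with a small tanh network, softmax the scores into
    attention weights, and return the attention-weighted slide-level
    embedding together with the per-patch weights."""
    h = np.tanh(patch_feats @ v)    # (n_patches, hidden)
    scores = h @ w                  # (n_patches,)
    attn = softmax(scores)          # weights sum to 1
    slide_feat = attn @ patch_feats # (feat_dim,) slide embedding
    return slide_feat, attn
```

The per-patch weights `attn` are what attention heatmaps visualize; as noted in the introduction, a high weight does not by itself guarantee that a patch carries prognostic information.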
To further illustrate the effect size of the survival prediction improvement using DOVER-selected regions, pairwise differences were assessed with 10,000-round paired permutation tests, and uncertainty was estimated via bias-corrected and accelerated (BCa) bootstrap (10,000-round resampling) confidence intervals for the c-index differences.
Table 1.
Description of different comparison strategies for the ablation study.
| Name | Description |
|---|---|
| DOVER (HF) | Empirical evaluation of DOVER using HF as feature representation in downstream survival tasks. |
| DOVER (UNI) | Empirical evaluation of DOVER using UNI-encoder-derived features in downstream survival tasks. Together with DOVER (HF), this demonstrates that the success of DOVER is not tied to a particular set of feature representations. |
| DOVER (No AD) | Train DOVER without adversarial noise. Use HF in downstream survival tasks. |
| ABMIL | Use ABMIL (CLAM, TransMIL, and DSMIL) to identify the PR regions and train an end-to-end survival classifier by replacing the cross-entropy loss with a referenced negative log likelihood loss. Showcasing the difference between DOVER and ABMIL methods. |
| Random (HF) | Train the downstream survival model using randomly sampled tiles from the WSIs, demonstrating that DOVER’s selection strategy outperforms random sampling. |
| Entire Tumor (HF) | Train the downstream survival model using all tiles from the WSIs, illustrating why tile selection is necessary for compelling survival prediction performance. |
Table 2 shows the HRs and p-values obtained from the different approaches to predict the OS outcome of NSCLC and OPSCC. The presented DOVER achieved statistically significant results from WSIs across all validation cohorts. Compared with DOVER (No AD), DOVER with adversarial training achieved higher hazard ratios, all of which were statistically significant. Interestingly, neither random tile sampling nor exhaustively computing over all WSI regions could generate statistically significant separation in patient survival. Figure 7 further indicates the significant differences in cell morphology, staining, and local neighborhoods between good and poor outcome cases of NSCLC patients in DOVER-identified high-PR regions. However, such differences were not observed in coldspots (low PR) or randomly selected regions. Features extracted from the foundation model also presented prognostic value in both lung and oropharyngeal cancer. DOVER could successfully differentiate high-risk from low-risk patients (HR=1.4, 95%CI: 1.06–1.85, p=0.015 for LUAD; HR=1.53, 95%CI: 1.03–2.29, p=0.016 for LUSC, Figure 4 c–d) when the UNI features were extracted from DOVER-detected areas. However, the UNI model could hardly generate statistical separation in terms of survival when the same type of features came from outside DOVER-selected regions (HR=1.04, 95%CI: 0.78–1.38, p=0.789 for LUAD; HR=1.2, 95%CI: 0.74–1.96, p=0.424 for LUSC, Figure 4 g–h). Similar results were observed within the OPSCC datasets; the UNI model could accurately predict prognosis when the underlying features were extracted from DOVER-selected regions (Figure S1). These results suggest that DOVER delivers consistent performance for both the HF and foundation model features chosen for downstream tasks (Table 2). DOVER also outperformed ABMIL approaches, achieving 5.8%–27.4% better survival prediction performance than the compared methods in terms of c-index (Table S3).
Specifically, c-indices improved by over 20% across all validation datasets (p < 0.05) when the prognostic models were trained on DOVER hotspots versus coldspots (Table S4). The improvements over the different ABMIL approaches are also statistically significant, except for CLAM on D4 (p = 0.0918). Moreover, the forest plots (Figure S2) of the BCa bootstrap 95% CIs of the c-index difference (Δc-index = DOVER_Hot − baseline) were mostly positive (intervals not crossing 0), indicating a consistent c-index gain for DOVER_Hot over baseline methods.
Table 2.
Comparison of DOVER with Different Approaches. Both DOVER (HF) and DOVER (UNI) outperform the comparison strategies, delivering statistically significant prognostic predictions.
| Tumor Type | Dataset | DOVER (HF) | DOVER (UNI) | DOVER (No AD) | CLAM | TransMIL | DSMIL | Random Sampling (HF) | Entire Tumor (HF) |
|---|---|---|---|---|---|---|---|---|---|
| NSCLC | D2 | HR=1.62, p<0.001 | HR=1.40, p=0.015 | HR=1.49, p=0.152 | HR=1.41, p=0.071 | HR=1.38, p=0.063 | HR=1.45, p=0.078 | HR=1.99, p=0.230 | HR=2.32, p=0.091 |
| NSCLC | D3 | HR=1.63, p=0.001 | HR=1.53, p=0.016 | HR=2.63, p=0.461 | HR=1.23, p=0.143 | HR=1.19, p=0.156 | HR=1.27, p=0.131 | HR=2.56, p=0.451 | HR=1.51, p=0.162 |
| OPSCC | D5 | HR=2.3, p<0.001 | HR=1.99, p=0.010 | HR=1.51, p=0.091 | HR=2.17, p=0.312 | HR=2.05, p=0.334 | HR=2.29, p=0.287 | HR=3.08, p=0.721 | HR=2.61, p=0.521 |
| OPSCC | D6 | HR=2.23, p=0.009 | HR=3.31, p=0.016 | HR=2.01, p=0.160 | HR=1.31, p=0.097 | HR=1.28, p=0.085 | HR=1.34, p=0.104 | HR=2.14, p=0.750 | HR=1.24, p=0.096 |
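The c-index comparisons and bootstrap confidence intervals reported above can be sketched in a few lines. The snippet below is an illustrative reimplementation, not the study's actual code: it computes Harrell's concordance index for right-censored survival data and a simple percentile bootstrap CI for the c-index difference between two risk models (the paper reports BCa intervals, whose bias and acceleration corrections are omitted here for brevity).

```python
import math
import random

def c_index(times, events, risks):
    """Harrell's concordance index for right-censored survival data.

    A pair is comparable when the earlier observed time ends in an
    event; it is concordant when the higher predicted risk belongs to
    the shorter survival time. Risk ties count as 0.5.
    """
    conc, comp = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = (i, j) if times[i] < times[j] else (j, i)
            if times[a] == times[b] or not events[a]:
                continue  # pair not comparable
            comp += 1
            if risks[a] > risks[b]:
                conc += 1.0
            elif risks[a] == risks[b]:
                conc += 0.5
    return conc / comp if comp else math.nan

def bootstrap_delta_ci(times, events, risk_a, risk_b,
                       n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for delta c-index = c(A) - c(B).

    Resamples patients with replacement; an interval entirely above 0
    indicates a consistent c-index gain of model A over model B.
    """
    rng = random.Random(seed)
    n = len(times)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [times[k] for k in idx]
        e = [events[k] for k in idx]
        deltas.append(c_index(t, e, [risk_a[k] for k in idx])
                      - c_index(t, e, [risk_b[k] for k in idx]))
    deltas.sort()
    return (deltas[int(alpha / 2 * n_boot)],
            deltas[int((1 - alpha / 2) * n_boot) - 1])
```

A forest plot such as Figure S2 simply stacks these per-dataset intervals; "intervals not crossing 0" corresponds to the returned lower bound being positive.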
4. Discussion
Recent advancements in computational pathology have transitioned from manually crafted histomorphometric features [31], [32], [33] to large-scale automatic feature learning, exemplified by the emergence of pathology foundation models [11], [34]. Determining what to extract from WSIs is crucial for developing prognostic models. However, feature representations from different locations within a WSI are not prognostically equivalent (Figure 1). Therefore, an inevitable question in developing prognostic classifiers [9] is where one should sample ROIs: for instance, should one sample within and/or outside the tumor to construct representative learning sets [35]?
While existing computational prognostic approaches have typically focused on either scanning all of the tumor regions and averaging the signal [36], or screening tiles based on a particular histological index (e.g., nuclear density or tissue staining intensity) [37], we posit that both of these approaches are suboptimal. ABMIL methods have recently been adopted for histopathology analysis under weak supervision, where detailed annotations are lacking [24]. Traditional ABMIL models, such as those leveraging Bahdanau attention [38], assign a weight to each patch and aggregate the weighted patches for slide-level predictions; TransMIL captures global instance relationships through transformer-based attention [25], and DSMIL uses cross-attention to emphasize critical patches [26]. These attention mechanisms primarily optimize predictive confidence via slide-level loss functions. While this approach is highly effective for high-level classification, it involves trade-offs with respect to fine-grained survival prediction and explicit interpretability, both key considerations when analyzing heterogeneous tumor regions [39], [40]. The implicit learning of patch importance lacks explicit, predefined interpretability criteria, potentially constraining generalization to diverse test datasets.
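The soft aggregation that traditional ABMIL performs can be made concrete with a minimal sketch. The snippet below is illustrative only (not any compared method's implementation): Bahdanau-style attention scores each patch embedding, normalizes the scores with a softmax, and forms the slide embedding as an attention-weighted average, so every patch contributes to the slide-level prediction.

```python
import numpy as np

def attention_mil_pool(patch_feats, V, w):
    """Bahdanau-style attention pooling over one slide's patches.

    patch_feats: (n_patches, d) per-patch feature vectors
    V: (d_attn, d) and w: (d_attn,) play the role of learned
    attention parameters (random here for illustration).
    Returns the slide-level embedding and per-patch attention weights.
    """
    scores = np.tanh(patch_feats @ V.T) @ w       # one score per patch
    scores -= scores.max()                        # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()  # softmax over patches
    return attn @ patch_feats, attn               # weighted average, weights
```

Because the softmax weights are strictly positive and sum to 1, every patch leaks into the slide embedding, prognostically relevant or not; this is exactly the aggregation-over-all-regions behavior that a hard region-selection step avoids.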
In this work, we presented an automated pipeline, DOVER, which leverages an adversarially trained fully convolutional network with HFs to prognosticate OS in NSCLC and OPSCC. An important contribution of our work is leveraging the morphologic consistency within individual TMA spots and transferring the prognostic information from TMAs to WSIs via a pretrained prognostic classifier. During training, each TMA spot was weakly supervised with a PR label. When evaluating WSIs, DOVER identifies tiles with varying PR scores, distinguishing those with low PR from those with high PR. Training DOVER consists of two interleaved optimizations: conventional gradient-based propagation and adversarial noise-based model regularization. While the regular optimization stage improves the discriminative ability of the network, the adversarial training specifically makes the model aware of patterns that might be prognostically less relevant or completely irrelevant and hence should be avoided during inference. This distinguishes DOVER from existing workflows [15], [24], [41] for predicting prognosis from WSI processing. While visual attention mechanisms [24] can learn to assign attention weights to different parts of an image, they require a final signal aggregation over all image regions, even when some of those regions might be prognostically irrelevant. In contrast, DOVER is trained to locate high-PR regions and avoid low-PR regions in advance of the feature representation step. This is also illustrated by the improved prognosis prediction based on feature representations derived from within DOVER high-PR regions, independent of whether those features emanated from hand-crafted approaches [6], [23] or from foundation models [11].
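The two interleaved optimizations described above can be sketched schematically. The snippet below is a stand-in, not DOVER's actual training code: it uses a hypothetical logistic tile classifier in place of the fully convolutional network, and FGSM-style input perturbation as one common instantiation of adversarial noise-based regularization, alternating a clean gradient step with an adversarial one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def interleaved_adv_step(Wb, x, y, lr=0.1, eps=0.05):
    """One clean update followed by one adversarial update.

    Schematic stand-in for two interleaved optimizations: a logistic
    tile classifier (hypothetical, in place of the FCN) is first
    updated on clean tile features, then on FGSM-perturbed inputs so
    that decisions driven by noise-like patterns are penalized.
    Wb = (w, b); x: (n, d) tile features; y: (n,) weak PR labels.
    """
    w, b = Wb
    for adversarial in (False, True):
        p = sigmoid(x @ w + b)
        if adversarial:
            # FGSM: push each input along the sign of d(loss)/dx
            grad_x = np.outer(p - y, w)
            x_in = x + eps * np.sign(grad_x)
            p = sigmoid(x_in @ w + b)
        else:
            x_in = x
        w = w - lr * (x_in.T @ (p - y)) / len(y)  # cross-entropy gradient
        b = b - lr * np.mean(p - y)
    return (w, b)
```

Calling this step in a loop trains the toy classifier on both clean and perturbed features, so tiles whose scores hinge on small noise-like perturbations are down-weighted.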
When compared with other tile selection strategies currently used in computational pathology, DOVER consistently separated NSCLC and OPSCC patients into statistically distinct risk groups. Without adversarial training specifically designed against noise, none of the survival analysis results were statistically significant, even when hazard ratios were higher. These results suggest that adversarial training makes the overall model more robust, allowing it to avoid the noise inherently embedded within WSIs while retaining the tiles with the most prognostic value. Consistent with [42], both random and exhaustive tile sampling yielded non-significant survival predictions, reaffirming our hypothesis that the entire tumoral region on a WSI is morphologically and molecularly heterogeneous. DOVER provides a PR score at the individual tile level, enabling detailed PR maps over entire WSIs. Additionally, the large-scale pathology foundation model, UNI, also needed the assistance of DOVER in localizing high-PR regions to generate prognostically valid feature representations (Figure 4c–d). By using the DOVER-guided region selection strategy, the same HFs could better prognosticate cancer survival across cancer types. When directly applying ABMIL methods without any prior ROI selection, we observed consistently poorer survival prediction within the validation sets. This could be because attention-based architectures struggle to learn PR patterns correlated with patient survival, as opposed to diagnostically relevant information (e.g., lesion vs. background). Our DOVER approach, coupled with adversarial training, demonstrates improved robustness in identifying high-PR regions [43].
While both HFs and UNI features could prognosticate survival under the guidance of DOVER region selection, the HFs consistently yielded higher hazard ratios than the UNI features. This could be because the HFs were previously validated on a carefully curated TMA dataset specifically designed to capture prognostic information [6], [31]. The UNI model, by contrast, is a generalized feature extractor trained in a self-supervised manner and potentially requires further knowledge distillation for individual pathology tasks [44].
Moreover, we employed QIF to evaluate the immune and tumor composition within the DOVER-identified high- and low-PR tiles in the context of NSCLC. The results indicate that DOVER tends to detect regions with a more diverse and balanced distribution of immune cell subtypes. This also illustrates that the tumor microenvironment is a sophisticated ecosystem comprising diverse immune cell types, cancer-associated cells, and additional supporting cells [45]. Our findings suggest a new way to interrogate gigapixel pathology slides and provide an efficient pipeline for integrating convolutional neural networks with interpretable HFs in computational pathology, especially for higher-level tasks such as prognostication. Moreover, cellular-level composition analysis demonstrates that DOVER focused more on complex immune cell cliques within the tumor epithelial cores, which in turn favors the aggregation of immune cell diversity by downstream HF extraction and analysis. Specifically, the significant increase of CD8+ and CD20+ cells in DOVER regions suggests that anti-tumor immune ecology could be highlighted through our adversarial semi-supervised training [45], [46].
Our work has limitations. While we collected a dataset of over 2,000 cases, all studies were performed retrospectively. In addition, only one of the cohorts included corresponding QIF images. A multi-center, prospective study coupled with comprehensive molecular profiling, such as spatial transcriptomics, would not only validate DOVER's prognostic performance in real clinical settings but also illuminate the cellular and molecular mechanisms that differentiate high-PR from low-PR regions. In fact, with the advancement of spatial proteomics-based technologies [47], one could similarly evaluate the proteomic features and associated pathways that differ between high-PR and low-PR regions. The approach presented in this study was limited to two cancer types and was evaluated solely from the perspective of predicting prognosis. With recent advances in artificial intelligence for prognosis and treatment prediction across multiple cancer types, such as anti-HER2 therapy response prediction in gastric cancer [5], [48] and neoadjuvant chemotherapy outcomes in breast cancer [49], an important future direction is to investigate whether DOVER can identify tumor regions informative of response to therapies such as chemotherapy and immunotherapy across a broader range of cancers [3]. Future endeavors could extend DOVER into a pan-cancer tool for localizing the regions most strongly associated not just with outcomes, but potentially also with treatment response. Efforts will also focus on broader integration with existing slide scanners, AI platforms, and digital pathology pipelines to facilitate seamless adoption in diverse clinical workflows and real-world settings.
5. Conclusion
DOVER is a computational pathology tool for identifying the most prognostically relevant regions within two different cancer types, though its applicability likely extends to multiple cancers and other non-oncologic diseases. We acknowledge the importance of prospective evaluation to generate the higher levels of evidence needed to translate this algorithm into clinical practice. With additional prospective validation, this approach could also help guide molecular profiling techniques, such as single-cell and genomic approaches [50], [51], [52], toward careful spatial interrogation of the most relevant regions within the tumor.
Supplementary Material
Highlights.
DOVER objectively identifies prognostically relevant (PR) regions in WSIs.
DOVER maps prognostic patterns from homogeneous TMA spots onto larger WSIs.
Feature representation from high-PR regions improves survival prediction.
Validated DOVER on over 2,000 patients with NSCLC and OPSCC.
Co-registered QIF shows high-PR regions have a mix of immune and cancer cells.
Acknowledgements
This work is made possible by the National Natural Science Foundation of China under award number 62301265; the National Cancer Institute under award numbers R01CA249992-01A1, R01CA202752-01A1, R01CA208236-01A1, R01CA216579-01A1, R01CA220581-01A1, R01CA257612-01A1, U01CA239055-01, U01CA248226-01, and U54CA254566-01; the National Heart, Lung and Blood Institute under award numbers R01HL151277-01A1 and R01HL158071-01A1; the National Institute of Biomedical Imaging and Bioengineering under award number R43EB028736-01; the National Center for Research Resources under award number C06 RR12463-01; VA Merit Review Award IBX004121A from the United States Department of Veterans Affairs Biomedical Laboratory Research and Development Service; the Office of the Assistant Secretary of Defense for Health Affairs, through the Breast Cancer Research Program (W81XWH-19-1-0668), the Prostate Cancer Research Program (W81XWH-15-1-0558, W81XWH-20-1-0851), the Lung Cancer Research Program (W81XWH-18-1-0440, W81XWH-20-1-0595), and the Peer Reviewed Cancer Research Program (W81XWH-18-1-0404, W81XWH-21-1-0345); the Kidney Precision Medicine Project (KPMP) Glue Grant; DoD Prostate Cancer Research Program Idea Development Award W81XWH-18-1-0524; Clinical and Translational Science Collaborative (CTSC) Cleveland Annual Pilot Award 2020 UL1TR002548; sponsored research agreements from Bristol Myers Squibb, Boehringer Ingelheim, Eli Lilly, and AstraZeneca; the DoD Peer Reviewed Cancer Research Program (W81XWH-22-1-0236); a Winship Invest Pilot Grant; an American Cancer Society Institutional Research Grant from the Winship Cancer Institute; the Mike Slive Foundation for Prostate Cancer Research Grant; NCI U01CA113913; and the NIH AIM-AHEAD Consortium Development Program. K.Y. was supported by the RSNA Research Fellow Grant and the ASTRO-LUNGevity Foundation Radiation Oncology Seed Grant.
Research reported in this publication was supported in part by the Cancer Tissue and Pathology shared resource of Winship Cancer Institute of Emory University and NIH/NCI under award number P30CA138292.
Conflict of Interest
A.M. is an equity holder in Picture Health, Elucid Bioimaging, and Inspirata Inc. Currently, he serves on the advisory board of Picture Health, Aiforia Inc, and SimBioSys. He also currently consults for SimBioSys. He also has sponsored research agreements with AstraZeneca, Boehringer-Ingelheim, Eli-Lilly and Bristol Myers-Squibb. His technology has been licensed to Picture Health and Elucid Bioimaging. He is also involved in three different R01 grants with Inspirata Inc. S.E.V. has received research funding from Pfizer. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article, or the decision to submit it for publication. The remaining authors declare no potential conflict of interest.
Code and Data Availability
The codes for training and predicting based on the framework proposed in this work are available at Code Ocean capsule at https://codeocean.com/capsule/2945139. The TCGA histological images and their corresponding demographic data were downloaded from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov). All other data supporting the findings of this study are available from the corresponding authors upon reasonable request.
References
- [1]. “A Deep Learning Approach for Histology-Based Nucleus Segmentation and Tumor Microenvironment Characterization,” Modern Pathology, vol. 36, no. 8, p. 100196, Aug. 2023, doi: 10.1016/j.modpat.2023.100196.
- [2]. Griem J et al., “Artificial Intelligence–Based Tool for Tumor Detection and Quantitative Tissue Analysis in Colorectal Specimens,” Modern Pathology, vol. 36, no. 12, Dec. 2023, doi: 10.1016/j.modpat.2023.100327.
- [3]. Wang X et al., “Spatial interplay patterns of cancer nuclei and tumor-infiltrating lymphocytes (TILs) predict clinical benefit for immune checkpoint inhibitors,” Science Advances, vol. 8, no. 22, p. eabn3966, Jun. 2022, doi: 10.1126/sciadv.abn3966.
- [4]. Song B et al., “CT radiomic signature predicts survival and chemotherapy benefit in stage I and II HPV-associated oropharyngeal carcinoma,” NPJ Precision Oncology, vol. 7, no. 1, p. 53, 2023.
- [5]. Chen Z et al., “Predicting gastric cancer response to anti-HER2 therapy or anti-HER2 combined immunotherapy based on multi-modal data,” Sig Transduct Target Ther, vol. 9, no. 1, p. 222, Aug. 2024, doi: 10.1038/s41392-024-01932-y.
- [6]. Wang X et al., “Prediction of recurrence in early stage non-small cell lung cancer using computer extracted nuclear features from digital H&E images,” Scientific Reports, vol. 7, no. 1, p. 13543, Oct. 2017, doi: 10.1038/s41598-017-13773-7.
- [7]. Wang X et al., “A pathology foundation model for cancer diagnosis and prognosis prediction,” Nature, pp. 1–9, Sep. 2024, doi: 10.1038/s41586-024-07894-z.
- [8]. Lu L, Dercle L, Zhao B, and Schwartz LH, “Deep learning for the prediction of early on-treatment response in metastatic colorectal cancer from serial medical imaging,” Nat Commun, vol. 12, no. 1, Art. no. 1, Nov. 2021, doi: 10.1038/s41467-021-26990-6.
- [9]. Qaiser T et al., “Usability of deep learning and H&E images predict disease outcome-emerging tool to optimize clinical trials,” NPJ Precis Oncol, vol. 6, p. 37, Jun. 2022, doi: 10.1038/s41698-022-00275-7.
- [10]. Song B et al., “Deep learning informed multimodal fusion of radiology and pathology to predict outcomes in HPV-associated oropharyngeal squamous cell carcinoma,” eBioMedicine, vol. 114, Apr. 2025, doi: 10.1016/j.ebiom.2025.105663.
- [11]. Chen RJ et al., “Towards a general-purpose foundation model for computational pathology,” Nat Med, vol. 30, no. 3, pp. 850–862, Mar. 2024, doi: 10.1038/s41591-024-02857-3.
- [12]. Waqas A et al., “Revolutionizing Digital Pathology With the Power of Generative Artificial Intelligence and Foundation Models,” Laboratory Investigation, vol. 103, no. 11, Nov. 2023, doi: 10.1016/j.labinv.2023.100255.
- [13]. Vorontsov E et al., “Virchow: A Million-Slide Digital Pathology Foundation Model,” Jan. 18, 2024, arXiv: arXiv:2309.07778, doi: 10.48550/arXiv.2309.07778.
- [14]. Neidlinger P et al., “A deep learning framework for efficient pathology image analysis,” 2025. [Online]. Available: https://arxiv.org/abs/2502.13027
- [15]. Yu K-H et al., “Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features,” Nature Communications, vol. 7, p. 12474, 2016.
- [16]. Mukashyaka P, Sheridan TB, Foroughi pour A, and Chuang JH, “SAMPLER: unsupervised representations for rapid analysis of whole slide tissue images,” eBioMedicine, vol. 99, p. 104908, Jan. 2024, doi: 10.1016/j.ebiom.2023.104908.
- [17]. Xu H et al., “A whole-slide foundation model for digital pathology from real-world data,” Nature, vol. 630, no. 8015, pp. 181–188, Jun. 2024, doi: 10.1038/s41586-024-07441-w.
- [18]. Wang X et al., “A pathology foundation model for cancer diagnosis and prognosis prediction,” Nature, vol. 634, no. 8035, pp. 970–978, Oct. 2024, doi: 10.1038/s41586-024-07894-z.
- [19]. Corredor G et al., “Spatial Architecture and Arrangement of Tumor-Infiltrating Lymphocytes for Predicting Likelihood of Recurrence in Early-Stage Non-Small Cell Lung Cancer,” Clin. Cancer Res, vol. 25, no. 5, pp. 1526–1534, 2019.
- [20]. Zhang S, Regan K, Najera J, Grinstaff MW, Datta M, and Nia HT, “The peritumor microenvironment: physics and immunity,” Trends in Cancer, vol. 9, no. 8, pp. 609–623, Aug. 2023, doi: 10.1016/j.trecan.2023.04.004.
- [21]. Swanton C, “Intratumor heterogeneity: evolution through space and time,” Cancer Research, vol. 72, no. 19, pp. 4875–4882, 2012.
- [22]. Xie C et al., “Beyond Classification: Whole Slide Tissue Histopathology Analysis By End-To-End Part Learning,” in Proceedings of the Third Conference on Medical Imaging with Deep Learning, PMLR, Sep. 2020, pp. 843–856. [Online]. Available: https://proceedings.mlr.press/v121/xie20a.html
- [23]. Lu C, Lewis JS, Dupont WD, Plummer WD, Janowczyk A, and Madabhushi A, “An oral cavity squamous cell carcinoma quantitative histomorphometric-based image classifier of nuclear morphology can risk stratify patients for disease-specific survival,” Mod Pathol, vol. 30, no. 12, pp. 1655–1665, Dec. 2017.
- [24]. Lu MY, Williamson DFK, Chen TY, Chen RJ, Barbieri M, and Mahmood F, “Data-efficient and weakly supervised computational pathology on whole-slide images,” Nat Biomed Eng, vol. 5, no. 6, Art. no. 6, Jun. 2021, doi: 10.1038/s41551-020-00682-w.
- [25]. Shao Z et al., “TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2021, pp. 2136–2147. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/10c272d06794d3e5785d5e7c5356e9ff-Abstract.html
- [26]. Li B, Li Y, and Eliceiri KW, “Dual-stream Multiple Instance Learning Network for Whole Slide Image Classification with Self-supervised Contrastive Learning,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA: IEEE, Jun. 2021, pp. 14313–14323, doi: 10.1109/CVPR46437.2021.01409.
- [27]. Janowczyk A, Zuo R, Gilmore H, Feldman M, and Madabhushi A, “HistoQC: An Open-Source Quality Control Tool for Digital Pathology Slides,” JCO Clinical Cancer Informatics, vol. 3, pp. 1–7, 2019.
- [28]. Xu G et al., “CAMEL: A Weakly Supervised Learning Framework for Histopathology Image Segmentation,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2019, doi: 10.1109/iccv.2019.01078.
- [29]. Bay H, Ess A, Tuytelaars T, and Van Gool L, “Speeded-Up Robust Features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, Jun. 2008, doi: 10.1016/j.cviu.2007.09.014.
- [30]. Zadeh SG and Schmid M, “Bias in Cross-Entropy-Based Training of Deep Survival Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 9, pp. 3126–3137, 2021, doi: 10.1109/TPAMI.2020.2979450.
- [31]. Wang X et al., “A prognostic and predictive computational pathology image signature for added benefit of adjuvant chemotherapy in early stage non-small-cell lung cancer,” EBioMedicine, vol. 69, p. 103481, Jul. 2021, doi: 10.1016/j.ebiom.2021.103481.
- [32]. Lu C et al., “Feature-driven local cell graph (FLocK): New computational pathology-based descriptors for prognosis of lung cancer and HPV status of oropharyngeal cancers,” Med Image Anal, vol. 68, p. 101903, Feb. 2021, doi: 10.1016/j.media.2020.101903.
- [33]. “Deep learning model improves tumor-infiltrating lymphocyte evaluation and therapeutic response prediction in breast cancer,” NPJ Breast Cancer. [Online]. Available: https://www.nature.com/articles/s41523-023-00577-4
- [34]. Vorontsov E et al., “Virchow: A Million-Slide Digital Pathology Foundation Model,” Jan. 17, 2024, arXiv: arXiv:2309.07778, doi: 10.48550/arXiv.2309.07778.
- [35]. Bilal M et al., “An aggregation of aggregation methods in computational pathology,” Medical Image Analysis, vol. 88, p. 102885, Aug. 2023, doi: 10.1016/j.media.2023.102885.
- [36]. Tao W, Zhang Z, Liu X, and Yang M, “A fusion deep learning framework based on breast cancer grade prediction,” Digital Communications and Networks, Dec. 2023, doi: 10.1016/j.dcan.2023.12.003.
- [37]. Wetstein SC et al., “Deep learning-based breast cancer grading and survival analysis on whole-slide histopathology images,” Sci Rep, vol. 12, no. 1, Art. no. 1, Sep. 2022, doi: 10.1038/s41598-022-19112-9.
- [38]. Bahdanau D, Cho K, and Bengio Y, “Neural Machine Translation by Jointly Learning to Align and Translate,” 2014.
- [39]. Qu L, Ma Y, Luo X, Guo Q, Wang M, and Song Z, “Rethinking Multiple Instance Learning for Whole Slide Image Classification: A Good Instance Classifier Is All You Need,” IEEE Trans. Cir. and Sys. for Video Technol, vol. 34, no. 10, pp. 9732–9744, Oct. 2024, doi: 10.1109/TCSVT.2024.3400876.
- [40]. Wang Z et al., “Targeting tumor heterogeneity: multiplex-detection-based multiple instance learning for whole slide image classification,” Bioinformatics, vol. 39, no. 3, Mar. 2023, doi: 10.1093/bioinformatics/btad114.
- [41]. Skrede O-J et al., “Deep learning for prediction of colorectal cancer outcome: a discovery and validation study,” The Lancet, vol. 395, no. 10221, pp. 350–360, Feb. 2020, doi: 10.1016/S0140-6736(19)32998-8.
- [42]. Khan AM and Yuan Y, “Biopsy variability of lymphocytic infiltration in breast cancer subtypes and the ImmunoSkew score,” Sci Rep, vol. 6, p. 36231, Nov. 2016, doi: 10.1038/srep36231.
- [43]. Liu P, Ji L, Ye F, and Fu B, “AdvMIL: Adversarial multiple instance learning for the survival analysis on whole-slide images,” Med Image Anal, vol. 91, p. 103020, Jan. 2024, doi: 10.1016/j.media.2023.103020.
- [44]. Sun X, Zhang P, Zhang P, Shah H, Saenko K, and Xia X, “DIME-FM: DIstilling Multimodal and Efficient Foundation Models,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France: IEEE, Oct. 2023, pp. 15475–15487, doi: 10.1109/ICCV51070.2023.01423.
- [45]. De Visser KE and Joyce JA, “The evolving tumor microenvironment: From cancer initiation to metastatic outgrowth,” Cancer Cell, vol. 41, no. 3, pp. 374–403, Mar. 2023, doi: 10.1016/j.ccell.2023.02.016.
- [46]. Edin S et al., “The Prognostic Importance of CD20+ B lymphocytes in Colorectal Cancer and the Relation to Other Immune Cell subsets,” Sci Rep, vol. 9, no. 1, p. 19997, Dec. 2019, doi: 10.1038/s41598-019-56441-8.
- [47]. “Spatial signatures for predicting immunotherapy outcomes using multi-omics in non-small cell lung cancer,” Nature Genetics. [Online]. Available: https://www.nature.com/articles/s41588-025-02351-7
- [48]. “Artificial intelligence in gastrointestinal cancer research: Image learning advances and applications,” Cancer Letters, vol. 614, p. 217555, Apr. 2025, doi: 10.1016/j.canlet.2025.217555.
- [49]. Shi Z et al., “MRI-based Quantification of Intratumoral Heterogeneity for Predicting Treatment Response to Neoadjuvant Chemotherapy in Breast Cancer,” Radiology, vol. 308, no. 1, p. e222830, Jul. 2023, doi: 10.1148/radiol.222830.
- [50]. Jia Q, Chu H, Jin Z, Long H, and Zhu B, “High-throughput single-cell sequencing in cancer research,” Sig Transduct Target Ther, vol. 7, no. 1, pp. 1–20, May 2022, doi: 10.1038/s41392-022-00990-4.
- [51]. Malone ER, Oliva M, Sabatini PJB, Stockley TL, and Siu LL, “Molecular profiling for precision cancer therapies,” Genome Medicine, vol. 12, no. 1, p. 8, Jan. 2020, doi: 10.1186/s13073-019-0703-1.
- [52]. Matsubara J et al., “First-Line Genomic Profiling in Previously Untreated Advanced Solid Tumors for Identification of Targeted Therapy Opportunities,” JAMA Netw Open, vol. 6, no. 7, p. e2323336, Jul. 2023, doi: 10.1001/jamanetworkopen.2023.23336.
