Abstract
Artificial intelligence shows promise for evaluating primary breast cancer, including nodal status and molecular subtype. Here, we present a resource-aware deep learning pipeline that combines a Vision Transformer (ViT) feature extractor with an attention-based multiple instance learning (MIL) aggregator to predict pathological tumor-node-metastasis (pTNM) stage from hematoxylin and eosin (H&E) whole-slide images (WSIs). Motivated by deployment in constrained settings, we operate at 2.5× magnification (≈4.0 μm/pixel), well below the 20–40× typically used in computational pathology. For feature extraction, we evaluated three backbones: (1) the UNI foundation model, (2) UNI fine-tuned on the BReAst Carcinoma Subtyping (BRACS) dataset, and (3) a ResNet-50 fine-tuned on BRACS. The embeddings from the best-performing UNI fine-tuned network were used as input to the MIL model, which was trained and validated on 247 WSIs from 214 patients in the Semmelweis cohort. Performance was assessed on three test sets: an internal hold-out from the Semmelweis cohort (82 WSIs from 72 patients), curated subsets of the Nightingale High-Risk Breast Cancer Prediction (NG) dataset (9489 WSIs from 574 patients), and TCGA-BRCA (731 WSIs from 678 patients), all preprocessed identically at 2.5×. The pipeline achieved area under the receiver operating characteristic curve values of 0.663, 0.672, and 0.632 for the internal, NG, and TCGA-BRCA test sets, respectively. Whereas operating at 2.5× may limit access to fine cellular cues, our results indicate that stage-relevant information can still be captured at this resolution. This study provides a transparent, compute-efficient, WSI-only baseline for pTNM stage prediction, supporting feasibility in resource-constrained environments.
Keywords: Breast cancer, Pathological TNM (pTNM), Whole-slide images, Digital pathology, Foundation models, Vision transformer (ViT), Multiple instance learning (MIL), Weakly supervised learning, Low magnification
Highlights
• Demonstrated WSI-only pTNM stage prediction at 2.5× for stages I–III.
• Delivered a transparent, compute-efficient baseline for low-resource settings.
• Combined UNI foundation model with BRACS fine-tuning and attention MIL.
• Externally validated on Nightingale (NG) and TCGA-BRCA with consistent 2.5× preprocessing.
• Achieved ROC AUCs of 0.663 (internal), 0.672 (NG), and 0.632 (TCGA-BRCA).
Introduction
In the USA, in 2025, approximately 316,950 women are expected to be diagnosed with invasive breast cancer, 16% of whom will be younger than 50 years of age. In addition, 59,080 women will be diagnosed with ductal carcinoma in situ (DCIS).1 Worldwide, 2.3 million new cases and 685,000 deaths were estimated in 2020.2, 3 Despite advances in breast cancer detection, accurate staging, which is crucial for guiding treatment decisions, remains a significant challenge. The widely used tumor-node-metastasis (TNM) system classifies tumors based on size, lymph node involvement, and the presence of distant metastasis.4 Pathological TNM (pTNM) staging, determined by histopathological examination of surgically removed tissue, typically provides a more definitive prognostic assessment than clinical TNM (cTNM) staging, which is based on data from imaging, physical examination, and needle biopsy findings. Accurate TNM staging requires a multimodal, multidisciplinary approach involving both radiology and pathology. Traditional radiological imaging methods, such as mammography, ultrasound, magnetic resonance imaging, computed tomography, and positron emission tomography, have limitations in precisely capturing the TNM stage. Analysis of surgically removed breast tissue can provide valuable information on the microscopic extent of the tumor, and pathological evaluation of sentinel or axillary lymph nodes further improves staging accuracy. However, such microscopic evaluation is time-consuming and requires significant sub-specialty expertise.5, 6 This reliance on manual histopathological examination highlights the need for more effective diagnostic techniques7, 8, 9 that can reduce the burden on pathologists while supporting accurate staging. Recent advances in deep learning have shown promise in detecting both well-known patterns and subtle features in histopathology images that may escape human detection,10, 11 leading to increased integration of AI models into clinical workflows.12, 13, 14, 15, 16
Several studies have explored the use of deep learning in breast cancer diagnostics,17, 18, 19, 20, 21, 22, 23, 24 building on previous work using classical machine learning approaches.25, 26, 28, 29, 30 Examples include quality control (QC) in breast cancer biopsies,31 automated grading,32 and prediction of molecular subtypes and biomarkers from digital slides of hematoxylin and eosin (H&E)-stained tissue sections.33 Although these modeling endpoints are related to breast cancer diagnosis, predicting the pTNM stage directly from whole-slide images (WSIs) remains underexplored. Most prior histopathology-based staging studies focus on isolated components, such as lymph node metastasis detection,34, 35, 36 or incorporate additional genomic, demographic, and clinical data.14, 37, 38
WSI-based staging pipelines generally operate at 20–40× objective magnification (≈0.50–0.25 μm/pixel) to capture cell-level morphology. The large pixel counts of these high-resolution images substantially increase computational, storage, and processing-time requirements, limiting the clinical application of AI models in low-resource settings or with large-scale patient cohorts. Scaling the training of AI models to large, diverse WSI cohorts is a further challenge, and one that low-resolution image inputs could ease. One strategy to address this bottleneck is to use domain-specific pretrained encoders for feature extraction. These vision models are trained on large and diverse histopathology datasets without manual annotations,39 thereby learning feature representations for use in multiple downstream tasks. However, it is not well understood how well such encoders perform for predicting breast cancer pTNM stage. This motivates a systematic evaluation of pretrained encoding frameworks for staging tasks, alongside assessment of whether low-magnification WSIs can be used for staging predictions.
Here, we investigate whether pTNM stage prediction is feasible at 2.5× (≈4.0 μm/pixel) magnification, which significantly reduces computational cost, storage footprint, and processing time. With fixed-size tiling, the number of patches (and thus the dominant driver of patch-level compute and input/output (I/O)) scales approximately with tissue pixel area and therefore quadratically with the magnification ratio. Compared with 20× and 40× pipelines, operating at 2.5× requires ≈64× and ≈256× fewer patches, respectively, to cover the same tissue area (Fig. 2C), providing a hardware-agnostic proxy for reductions in disk I/O, feature-extraction forward passes, and intermediate embedding storage40; end-to-end runtime additionally depends on storage bandwidth, image decoding, batching/parallelism, and available CPU/GPU resources. All datasets in this study—training on the Semmelweis cohort and external testing on curated subsets of Nightingale (NG)41, 42 and TCGA-BRCA43—were processed identically at 2.5× to ensure methodological consistency and comparability. The goal is to mimic a deployment scenario where high-magnification scanning and large-scale compute resources are unavailable. We adopt a WSI-only design to simplify the deployment of algorithms under conditions where additional clinical and demographic data may not be readily available, and to facilitate training in low-resource environments. Accordingly, we primarily envision this approach for retrospective and real-world cohorts where digitized primary-tumor H&E slides are available but structured clinicopathological metadata are incomplete or heterogeneous. In this setting, a WSI-derived pTNM stage estimate can help mitigate missingness and support cohort stratification or flag potentially discordant records for targeted review; given the primary-tumor-only input, it is not intended to replace conventional staging performed within routine clinical workflows.
Fig. 2.
Tissue masking and patch extraction at 2.5×, with illustrative patch-count scaling across magnifications. (A) Representative whole-slide image (WSI) thumbnail at 2.5× (effective resolution 4.0 μm/pixel; scale bar = 1 mm) from the TCGA-BRCA cohort (slide: TCGA-A2-A0YH-01Z-00-DX1). (B) Non-overlapping patch grid at 2.5× showing the extracted patches within the tissue mask (tissue retained; background excluded), yielding N patches for this slide. Patches are 224 × 224 pixels, corresponding to 224 × 4.0 μm ≈ 0.90 mm per side at 2.5×. (C) Conceptual illustration of patch-count scaling with magnification for fixed-size 224 × 224-pixel patches: relative to 2.5× (N patches), 10× requires 16 × N patches and 40× requires 256 × N patches to cover the same tissue area, implying comparable fold-increases in storage and patch-level computational workload. Panel (C) is shown for conceptual comparison only; all patch extraction in this study was performed at 2.5×.
Recent advancements in digital pathology have introduced foundation models such as UNI,44 Virchow,39 and Prov-GigaPath,45 which differ in parameter count, training data scale, and architectural efficiency. UNI, with 0.3 billion parameters trained on 100 million images, offers a favorable balance between performance and efficiency, as supported by recent benchmarking.46 In this work, we evaluate three backbone networks for feature extraction in a multiple instance learning (MIL) pipeline: (1) UNI, (2) UNI fine-tuned on the BReAst Carcinoma Subtyping (BRACS) dataset, and (3) a BRACS-fine-tuned ResNet-50. The fine-tuned UNI (UNI-FT) generated the best embeddings of breast pathology images in the initial testing phase and was therefore used for prediction of pTNM stages.
In summary, we present a resource-aware deep learning pipeline for pTNM breast cancer stage prediction from H&E WSIs, trained on the Semmelweis cohort and externally validated on NG and TCGA-BRCA, with the aim of establishing a transparent, compute-efficient baseline demonstrating feasibility in constrained environments rather than delivering a top performing model. Our working hypothesis is that features learnable from primary-tumor WSIs at 2.5× magnification are sufficient to capture proxies of tumor extent (pT; T1–T3) and nodal involvement (N0 vs N+), enabling case-level classification of pTNM stages I–III without additional clinical variables. Operating at 2.5× may limit access to nuclear and subcellular morphology, a limitation that should be considered when interpreting the algorithmic results. Unless otherwise specified, all references to “stage” or “staging” in this article refer to the pTNM stage as determined by histopathological evaluation.
Methods
Data
We used four datasets in this study, each serving a distinct role in model development and evaluation. The BRACS dataset was employed for encoder fine-tuning and testing. The Semmelweis cohort was used for model training and internal validation. For external validation, we relied on two independent cohorts: the NG dataset and TCGA-BRCA. A comprehensive summary of dataset characteristics—including cohort size, specimen type, staging manual, scanner type, and other key attributes—is provided in Appendix Table B.1. This table highlights heterogeneity across datasets and potential sources of domain shift. All slides were processed at a downsampled resolution corresponding to a target magnification of 2.5×; the micrometers-per-pixel (mpp) value varies slightly by scanner (3.68–4.00 mpp). Below, we describe each dataset in the order of its role in the study. Across all cohorts, we used case-level pTNM labels restricted to breast cancer stages I–III. Where lymph node WSIs were available (notably in NG), we intentionally excluded lymph node tissue to simulate a scenario in which only primary-tumor slides are digitized under resource constraints and to avoid trivial leakage of nodal information into the stage labels. We excluded de novo metastatic cases with distant metastasis; thus, our models do not predict stage IV. We also excluded cTNM-only cases and harmonized stage groups as described for each dataset. Specimen-type annotations (such as surgical resection or core needle biopsy) and tissue-preparation annotations (such as formalin-fixed paraffin-embedded (FFPE) and frozen) were not uniformly available across cohorts, so we were not able to evaluate model performance stratified by specimen type or tissue preparation.
BReAst Carcinoma Subtyping (BRACS) dataset
The publicly available BRACS dataset9 was used to fine-tune our backbone networks. It comprises 547 WSIs from 189 patients, annotated by 3 expert pathologists across 7 lesion categories: normal (N), pathological benign (PB), usual ductal hyperplasia (UDH), atypical ductal hyperplasia (ADH), flat epithelial atypia (FEA), DCIS, and invasive carcinoma (IC) (see Fig. 3). WSIs were scanned at 40× (0.25 mpp) using an Aperio AT2 scanner and include both resection specimens and core needle biopsies.
Fig. 3.
Visualization of BRACS lesion annotations and patch exemplars. Two example WSIs from the BRACS dataset with annotations of seven different breast lesions with increasing structural complexity and atypia made by expert pathologists (top), along with a visualization of patch samples from each lesion class (bottom). Annotation colors on the WSIs match the font colors of the abbreviations below the patch samples. Abbreviations: normal (N), pathological benign (PB), usual ductal hyperplasia (UDH), flat epithelial atypia (FEA), atypical ductal hyperplasia (ADH), ductal carcinoma in situ (DCIS), and invasive carcinoma (IC).
Following the original patient-level train/validation/test split, we read each WSI at a downsampled resolution corresponding to 2.5× target magnification (4.00 mpp; 16× downscale from 40×/0.25 mpp). Annotation polygons (defined at 40×) were scaled into the 2.5× coordinate space before patching. Non-overlapping 224 × 224-pixel patches were systematically extracted from annotated regions, each patch inheriting the expert label. This yielded 7763 patches in total: 863 N, 1733 PB, 759 UDH, 598 ADH, 575 FEA, 1355 DCIS, and 1900 IC. Patches were divided into a training set (6067), validation set (638), and test set (1058) stratified by histological entity. These patch-level lesion labels supervised fine-tuning of all encoders as a seven-class classification task. BRACS does not include TNM staging information of cases and was used exclusively for backbone fine-tuning and comparative benchmarking.
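To make the coordinate handling concrete, the following minimal sketch shows how 40× annotation polygons can be mapped into the 2.5× coordinate space before gridding patches. Function names and the simplified bounding-box test are illustrative assumptions, not the released implementation.

```python
# Sketch: scale 40x annotation polygons into 2.5x space, then grid patches.
import numpy as np

SCALE = 0.25 / 4.00  # 40x (0.25 mpp) -> 2.5x (4.00 mpp): a 16x downscale

def scale_polygon(polygon_40x: np.ndarray) -> np.ndarray:
    """Map an (N, 2) array of 40x (x, y) vertices into 2.5x coordinates."""
    return polygon_40x * SCALE

def patch_grid(poly_2p5x: np.ndarray, patch: int = 224):
    """Yield top-left corners of non-overlapping 224-px patches inside the
    polygon's bounding box (the real pipeline tests polygon containment)."""
    x0, y0 = poly_2p5x.min(axis=0).astype(int)
    x1, y1 = poly_2p5x.max(axis=0).astype(int)
    for y in range(y0, y1 - patch + 1, patch):
        for x in range(x0, x1 - patch + 1, patch):
            yield x, y  # each patch inherits the polygon's lesion label
```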
Semmelweis dataset
The Semmelweis cohort was used to train and internally evaluate the staging pipeline. This dataset was provided by the Department of Pathology, Forensic and Insurance Medicine, Semmelweis University and includes 286 female patients diagnosed and treated for invasive breast carcinoma (stages I–III) between 1999 and 2014. All histopathology samples were derived from resection specimens (mastectomy, lumpectomy). According to the 8th edition of UICC TNM staging, the distribution of pT and pN categories was as follows: 134 pT1; 126 pT2; 17 pT3; 9 pT4; 151 pN0; 85 pN1; 32 pN2; and 18 pN3. All patients were M0. Since explicit stage I–III groupings were not available, pTNM stage labels were constructed from pT, pN, and M values using anatomical rules aligned with AJCC 7th edition criteria to ensure comparability with the other datasets. Importantly, the Semmelweis cohort inherently satisfied the inclusion criteria also applied to external datasets (e.g., exclusion of stage IV cases and patients with neoadjuvant therapy, which can distort morphology and correlate with stage), so no additional filtering was required. Patients' records were collected from the electronic medical databases of the Semmelweis University (MedSolution, Medrec) in accordance with Ethical Approval: ETT TUKEB: BM/27896-3/2024. The collected and anonymized clinicopathological data pertinent to this project included age, occurrence of distant metastasis during the follow-up period, pT category, and pN category. The H&E-stained representative tumor slides were scanned using a Pannoramic 1000 scanner (3DHISTECH Ltd., Budapest, Hungary) at 40× magnification (0.24 mpp). Most patients had a single WSI. Cases with multiple WSIs were grouped under the same patient-level case. Consistent with the preprocessing pipeline, WSIs were read at a downsampled resolution corresponding to a target magnification of 2.5× (3.84 mpp; a 16× downscale from 40×/0.24 mpp), and non-overlapping 224 × 224-pixel patches were extracted following the tissue segmentation described in the Image segmentation and patching section. All Semmelweis WSIs were from surgical resection specimens and no images from core needle biopsies were included.
A stratified patient-level split was applied to construct training, validation, and internal test sets, using stratification on pTNM stage group (I–III) and age tertiles (three equally populated bins derived from cohort age quantiles) to balance the folds. The internal test set comprised 25% of patients (72 patients, 82 WSIs), whereas the remaining cases were used for model development with 5-fold stratified cross-validation. Class distribution details are provided in Appendix Table C.2.
Nightingale high-risk breast cancer prediction (NG) dataset
The NG dataset, used exclusively as an external test set in this study, was released during Phase I of the High-Risk Breast Cancer Prediction Contest.41
The public cohort included 2567 breast cancer patients with 3255 tissue samples and 52,262 WSIs collected between 2014 and 2020 at 2 major hospitals and smaller healthcare facilities in Portland, Oregon. The dataset was curated by experts within the Providence Hospital network, based at Saint Joseph Medical Center in Burbank, California. All images were de-identified to ensure patient privacy, and a substantial amount of clinical metadata was provided for each case. WSIs were digitized at 40× magnification (0.23 mpp) using a Hamamatsu NanoZoomer S360 scanner, yielding images typically on the order of 100,000 × 150,000 pixels. Both surgical resections and core needle biopsies were included, all stained with H&E, with a median of 13 slides per tissue sample. For downstream analysis, we read each WSI at a target magnification of 2.5× (3.68 mpp; a 16× downscale from 40×/0.23 mpp) and extracted non-overlapping 224 × 224-pixel patches following the tissue segmentation described in the Image segmentation and patching section.
The dataset includes both clinical (cTNM) and pathological (pTNM) staging information for many cases, depending on data availability and timing of diagnosis. For this study, we relied exclusively on pTNM labels. TNM staging (T - tumor size; N - lymph node involvement; M - metastasis) ranges from stage 0 to IV, where stage 0 indicates in-situ, pre-invasive breast cancer (DCIS), and stages I–IV describe extents of invasive disease, including tumor size, involvement of the chest wall and skin, inflammatory breast cancer subtype, and the extent of metastatic spread. Whereas the T and N staging components can be determined through pathological data from histological examinations, the M stage requires whole-body imaging and, in some cases, confirmation through image-guided needle biopsy of lesions identified in the scans.
To construct a cohort suitable for staging prediction, we applied a multi-step filtering procedure (Fig. 1). We excluded patients without stage labels, as well as those with evidence of neoadjuvant therapy or biopsy after treatment. We excluded neoadjuvant-treated cases because therapy can substantially alter tumor histomorphology and correlates with advanced stage, risking label leakage. We also excluded TNM stage IV cases and patients with cTNM staging and retained those with pTNM stages I–III. This stringent curation yielded a study cohort of 575 patients with stages I–III breast cancer. A single additional patient was later excluded after slide-level QC (see the Data quality control and embedding aggregation section), resulting in a final NG study cohort of 574 patients with 9489 WSIs. Predictions for each case were aggregated across all available WSIs (resections and/or core needle biopsies), and we did not perform a biopsy-only or single-WSI analysis in this study.
Fig. 1.
Cohort selection flowcharts for Nightingale (NG) and TCGA-BRCA. Inclusion and exclusion steps used to derive the external test cohorts. Left (NG): retained pTNM stage I–III cases and excluded neoadjuvant or post-treatment sampling, stage IV disease, and cTNM-only cases. Slide-level quality control using t-SNE and DBSCAN removed lymph node and artifact slides, yielding 574 patients (9489 WSIs). Right (TCGA-BRCA): retained pTNM stage I–III cases and excluded missing or ambiguous staging, stage IV disease, unknown or confounded treatment history, neoadjuvant therapy, post-treatment sampling, male patients, and prior cancer. Only diagnostic slides (DX1–DX4) were kept, yielding 678 patients (731 WSIs).
Whereas the dataset does not explicitly document the AJCC edition used for TNM stage groupings, alignment between provided pT, pN, and M values with manually reconstructed stage groups indicated that approximately 80% of cases matched AJCC 7th edition criteria. We therefore used the supplied pTNM labels without modification, acknowledging minor edition-related variability as a limitation unlikely to impact interpretation.
Detailed demographic, clinical, and stage distribution statistics for the final NG study cohort are reported in Appendix Tables C.1 and C.2.
TCGA-BRCA dataset
The TCGA-BRCA dataset, serving as an additional external test set in this study, includes 1084 breast cancer patients with FFPE resection specimens (mastectomy, lumpectomy) scanned at 20–40× magnification (0.50–0.25 mpp) across multiple contributing institutions.43 To construct a study cohort comparable to the other datasets, we applied filtering criteria analogous to those used for the NG dataset (Fig. 1). Patients with missing or ambiguous pTNM staging (AJCC 7th edition) were excluded, as were those with stage IV disease. We further removed cases with unknown or confounded treatment history, including neoadjuvant therapy and post-treatment sampling, to avoid therapy-induced morphological confounding and potential label leakage. Male patients and those with prior cancer diagnoses were excluded, and only high-quality diagnostic slides (DX1–DX4) were retained for analysis.
The final TCGA-BRCA study cohort comprised 678 female patients with stages I–III breast cancer, represented by 731 diagnostic WSIs. All TCGA-BRCA slides were read at a target magnification of 2.5× (4.00 mpp) and processed using the same tissue segmentation (Image segmentation and patching section) and non-overlapping 224 × 224-pixel patch extraction as for the other cohorts, ensuring comparability. Class distribution details are provided in Appendix Table C.2. TCGA-BRCA comprises resection WSIs (FFPE with some frozen sections); we did not stratify performance by preparation type.
Image segmentation and patching
For the BRACS dataset, WSIs were analyzed at 2.5× magnification, and 224 × 224-pixel patches were systematically extracted from annotated regions. Each patch inherited the label of the corresponding expert annotation. To standardize appearance, Macenko stain normalization47 was applied during fine-tuning and evaluation of the backbone networks (UNI, UNI-FT, and ResNet-50-FT), using the median stain vector computed from BRACS patches.
For the Semmelweis dataset and the external test sets (NG and TCGA-BRCA), WSIs were preprocessed using our custom-developed method to handle large gigapixel dimensions efficiently.48, 49, 50 Preprocessing began with a segmentation step to exclude background areas and isolate tissue regions. This involved creating a low-resolution segmentation mask, typically a few hundred pixels in size, by leveraging RGB intensity differences between tissue and background. The mask was then resized to the target 2.5× magnification level for patch extraction. During patch extraction, we applied patch-level filtering to remove non-tissue patches and gross artifacts, including pen markings, air bubbles, tissue folds, slide-edge regions, foreign material (e.g., dust), and extensive blank glass/background. The remaining patches, along with their spatial coordinates within the tissue image, were prepared for subsequent analysis. Hyperparameters, such as histogram bins and intensity thresholds, were tuned by visual inspection to optimize tissue outlining. The segmentation process was optimized to handle terabytes of WSIs efficiently by utilizing simple commands on low-resolution images and parallelizing the workflow, with storage read and write speed being the primary computational bottleneck. From the Semmelweis dataset and both external test sets, non-overlapping 224 × 224-pixel patches were extracted at a magnification level of 2.5× (3.84 mpp for Semmelweis, 3.68 mpp for Nightingale, and 4.00 mpp for TCGA-BRCA). Macenko stain normalization (using the median stain vector derived from BRACS patches) was applied to patches before feature extraction, and patch-level features were then extracted with the UNI-FT encoder. For downstream analysis, WSIs from each patient in the Semmelweis, NG, and TCGA-BRCA cohorts were grouped into patient-level “WSI bags,” with each bag containing all WSIs belonging to that patient and linked to the corresponding pTNM stage label.
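The following sketch illustrates the general shape of this masking-and-patching step using openslide-python and Pillow. Thresholds, helper names, and the direct level-0 read followed by downscaling are simplifying assumptions; the released pipeline tunes its hyperparameters by visual inspection and reads from pyramid levels where available.

```python
# Sketch: low-resolution tissue mask, then non-overlapping 2.5x patches.
import numpy as np
import openslide
from PIL import Image

def tissue_mask(slide: openslide.OpenSlide, thumb_max=512, rgb_thresh=220):
    """Low-res binary mask: pixels darker than the near-white background."""
    thumb = np.asarray(slide.get_thumbnail((thumb_max, thumb_max)).convert("RGB"))
    return thumb.mean(axis=2) < rgb_thresh  # True where tissue

def extract_patches_2p5x(slide, mask, target_mpp=4.0, patch=224, min_tissue=0.5):
    base_mpp = float(slide.properties[openslide.PROPERTY_NAME_MPP_X])
    down = target_mpp / base_mpp                  # e.g., 16 for a 0.25 mpp scan
    w, h = slide.dimensions                       # level-0 size
    w25, h25 = int(w / down), int(h / down)       # size at 2.5x
    mask_img = Image.fromarray(mask.astype(np.uint8) * 255).resize((w25, h25))
    mask25 = np.asarray(mask_img) > 0
    for y in range(0, h25 - patch + 1, patch):
        for x in range(0, w25 - patch + 1, patch):
            if mask25[y:y + patch, x:x + patch].mean() < min_tissue:
                continue                          # skip background-dominated tiles
            region = slide.read_region((int(x * down), int(y * down)), 0,
                                       (int(patch * down), int(patch * down)))
            yield (x, y), region.convert("RGB").resize((patch, patch))
```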
BRACS patches carried patch-level lesion labels and were used exclusively for backbone fine-tuning and patch-classification benchmarking. By contrast, Semmelweis, NG, and TCGA-BRCA patches did not have per-patch labels; their patient-level pTNM labels supervised the MIL model in a weakly supervised setting.
Operating at 2.5× (≈4.0 mpp) reduces the linear resolution by 16× relative to 40× (≈0.25 mpp) and by 8× relative to 20× (0.50 mpp), implying ≈256× and ≈64× fewer pixels, respectively, for the same tissue area. With fixed 224 × 224-pixel tiling, the number of extracted patches (and thus I/O, embedding computation, and intermediate storage) decreases by the same factors. Quantitative patch-count summaries (median [Q1, Q3] patches per WSI and per patient, and total extracted patches at 2.5×) for the Semmelweis, NG, and TCGA-BRCA cohorts are reported in Appendix Table C.3. Whereas end-to-end gains depend on storage bandwidth and parallelism, this coarse-to-fine reduction materially improves throughput for preprocessing, feature extraction, and MIL training/inference at scale. A representative example of the tissue mask and non-overlapping patch grid at 2.5×, together with an illustration of patch-count scaling across magnifications, is shown in Fig. 2.
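As a worked example of this quadratic scaling, the snippet below computes the relative patch-count factors for fixed 224-pixel tiles at each magnification.

```python
# Patch count grows with the square of the magnification ratio for fixed tiles.
for mag, mpp in [(2.5, 4.0), (10, 1.0), (20, 0.5), (40, 0.25)]:
    factor = (mag / 2.5) ** 2          # relative to the 2.5x baseline
    print(f"{mag:>4}x ({mpp} mpp): {factor:>5.0f} x N patches")
# 2.5x -> 1 x N, 10x -> 16 x N, 20x -> 64 x N, 40x -> 256 x N
```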
Data quality control and embedding aggregation
We performed a dataset-specific slide-level QC for the NG cohort. This QC was not necessary for the Semmelweis and TCGA-BRCA cohorts because these cohorts consist of high-quality images that, following the standard preprocessing (Image segmentation and patching section), could be used directly in our analyses.
For NG, we used a standardized embedding-and-clustering workflow to identify non-tumor or low-quality slides in a fully unsupervised, label-agnostic manner. Patch-level features were mean-pooled per WSI to form a 1024-dimensional slide embedding. These embeddings were projected with t-SNE for visualization, and density-based clustering with DBSCAN (ε = 5.5, MinPts = 14) was applied in the t-SNE space to group slides by shared morphology. Representative slides from each cluster were visually reviewed. Clusters corresponding to lymph node tissue; blurred/smeared tissue; and out-of-focus tissue and other artifacts were excluded, whereas a small DBSCAN noise set (Cluster−1, n = 56) was confirmed to contain good-quality tumor tissue and was retained (Fig. 7). This QC removed 1340 slides in total (completely excluding one patient and removing partial slides for others). QC clustering used fixed UNI-FT embeddings and did not use stage labels; the staging model itself was trained exclusively on Semmelweis data.
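A minimal sketch of this QC workflow with scikit-learn is shown below; random arrays stand in for the UNI-FT patch features, and the retained cluster IDs reflect the review outcome described above.

```python
# Sketch: mean-pooled slide embeddings -> t-SNE -> DBSCAN in the 2D space.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# One (n_patches_i, 1024) array per slide; random placeholders for UNI-FT features.
rng = np.random.default_rng(0)
patch_embeddings = [rng.normal(size=(rng.integers(50, 500), 1024)) for _ in range(200)]

slide_embs = np.stack([e.mean(axis=0) for e in patch_embeddings])  # (n_slides, 1024)
coords = TSNE(n_components=2, random_state=0).fit_transform(slide_embs)
labels = DBSCAN(eps=5.5, min_samples=14).fit_predict(coords)       # -1 = noise set

# Clusters are reviewed visually; per the QC above, the main tumor cluster
# (assumed here to be cluster 0) and the verified noise set (-1) are retained.
keep = np.isin(labels, [0, -1])
```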
Fig. 7.
Slide-level quality control in the Nightingale (NG) dataset. (A) t-SNE projection of slide-level embeddings colored by cancer stage. (B) Same t-SNE projection, colored by DBSCAN cluster IDs, which separated the main tumor cluster (Cluster 0) and smaller clusters corresponding to non-tumor or low-quality slides (Clusters 1–3), with an additional small noise set (Cluster−1). (C) Representative slide thumbnails from Clusters 0–3: tumor slide (Cluster 0), blurred/smeared tissue (Cluster 1), out-of-focus and other artifacts (Cluster 2), and lymph node (Cluster 3). Slides in Clusters 1–3 were excluded, whereas Cluster−1 slides (n = 56, good-quality tumor tissue) were retained together with Cluster 0 for downstream analyses.
Following QC for NG, patient-level embeddings were computed by mean-pooling all valid slide embeddings for each patient. The same patient-level aggregation procedure was applied consistently across Semmelweis, NG (post-QC), and TCGA-BRCA. These 1024-dimensional patient embeddings were visualized with t-SNE to explore dataset similarity and stage-related separability (Appendix Fig. A.1).
Multiple instance learning
Using our backbone networks, the extracted patches were transformed into instance-level feature embeddings. The embeddings from each WSI bag pertaining to a patient were used as the input to a lightweight, scalable, and interpretable MIL model, trained to predict the pTNM stage groups (I, II, and III). The MIL model transformed the embeddings via fully connected (FC) layers into feature vectors $\mathbf{h}_i \in \mathbb{R}^{d}$. The set of feature vectors from $k$ instances in a WSI bag was represented as $H = \{\mathbf{h}_1, \ldots, \mathbf{h}_k\}$. To determine the significance of each instance in predicting the stage label of a WSI bag, an attention module was applied before the final classifier layer with Softmax activation over the $k$ instances. This module learned a weight distribution quantifying each instance's contribution to the bag-level label $y$. The attention weights are defined as:

$$a_i = \frac{\exp\left\{\mathbf{w}^\top \tanh\left(\mathbf{V}\mathbf{h}_i\right)\right\}}{\sum_{j=1}^{k} \exp\left\{\mathbf{w}^\top \tanh\left(\mathbf{V}\mathbf{h}_j\right)\right\}},$$

where $\mathbf{w} \in \mathbb{R}^{L}$ and $\mathbf{V} \in \mathbb{R}^{L \times d}$ are learnable parameters, and $L$ is the hidden-layer dimension. The bag-level embedding $\mathbf{z}$ was then obtained by weighting and summing instance features:

$$\mathbf{z} = \sum_{i=1}^{k} a_i \mathbf{h}_i.$$

Bag-level predictions were generated by applying an FC layer to $\mathbf{z}$. During training, the backbone network weights were kept frozen, whereas the attention-based aggregation function and classifier were optimized via gradient descent. The overall architecture is illustrated in Fig. 5.
Fig. 5.
Architecture of the proposed breast cancer stage prediction pipeline. The pipeline consists of a backbone network (UNI-FT) and a MIL model with an attention module (UNI-FT-MIL). The UNI-FT backbone is used to generate feature embeddings from WSIs, which are then grouped into WSI bags. The MIL model applies an attention mechanism to weigh the embeddings and to generate a bag representation. An FC layer outputs the probability for each of the pathological TNM (pTNM) stages (I–II–III). Notably, the weights of UNI-FT remain frozen, whereas the MIL is trained de novo for stage prediction.
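The following PyTorch sketch captures the aggregator described above: an instance-level FC block, a two-layer attention module with Softmax over instances, and a bag-level classifier. Dimensions and dropout follow the Multiple instance learning (MIL) model section; the exact layer composition of the released code may differ.

```python
# Minimal attention-MIL sketch (dimensions per the MIL model section).
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, in_dim=1024, hid=256, n_classes=3, p_drop=0.6):
        super().__init__()
        self.features = nn.Sequential(                 # instance-level FC block
            nn.Linear(in_dim, hid), nn.LeakyReLU(), nn.Dropout(p_drop),
            nn.Linear(hid, hid), nn.LeakyReLU(), nn.Dropout(p_drop),
            nn.Linear(hid, hid), nn.LeakyReLU(), nn.Dropout(p_drop),
        )
        self.attention = nn.Sequential(                # a_i ~ softmax(w^T tanh(V h_i))
            nn.Linear(hid, hid), nn.Tanh(), nn.Dropout(p_drop),
            nn.Linear(hid, 1),
        )
        self.classifier = nn.Linear(hid, n_classes)

    def forward(self, bag):                            # bag: (k, in_dim)
        h = self.features(bag)                         # (k, hid)
        a = torch.softmax(self.attention(h), dim=0)    # (k, 1), sums to 1 over k
        z = (a * h).sum(dim=0)                         # bag embedding (hid,)
        return self.classifier(z), a.squeeze(-1)       # logits, attention weights
```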
Training procedure
Backbone networks
The emergence of foundation models in medical artificial intelligence, such as the UNI model, has significantly advanced the field, particularly in digital pathology.44 Trained on large-scale datasets, these models represent a shift from task-specific systems toward more versatile, general-purpose networks that can be applied directly or fine-tuned for a wide range of downstream tasks. This paradigm offers a major advantage over traditional models, which are constrained by narrower training data and limited adaptability to diverse clinical contexts.
Among these, the UNI foundation model has emerged as a leading example. UNI is a Vision Transformer (ViT-Large) pretrained on more than 100 million image patches from 100,000 diagnostic H&E-stained WSIs, covering over 20 tissue types, including 3363 breast WSIs. It has demonstrated strong performance across more than 30 computational pathology tasks, including disease detection, cancer type classification, and rare disease analysis.44 Given its pretraining scale and reported benchmark performance, UNI is a strong feature-extraction choice for downstream tasks such as stage prediction in low-resource settings.
We employed two versions of the UNI model: (1) UNI with default pretrained weights, providing an out-of-the-box baseline, and (2) UNI-FT, obtained by supervised fine-tuning on the BRACS dataset, with the hypothesis that domain-specific fine-tuning would enhance breast cancer-specific feature representation and improve stage prediction. In addition to this domain adaptation, fine-tuning may also help align the encoder to the lower magnification (2.5×) used in our pipeline, given that UNI was predominantly pretrained on 20× WSIs.
Fine-tuning was performed on the BRACS training and validation subsets (total 6705 patches), whereas the held-out test subset was reserved for evaluation. The Adam optimizer51 was used with a differential learning rate schedule: 10⁻⁶ for the ViT-Large encoder and 10⁻⁴ for the prediction head. Data augmentation (random rotations, flips, zooms, and color channel perturbations) was applied to improve generalization. From both UNI and UNI-FT backbones, 1024-dimensional embeddings were extracted from the class token (CLS) output of the ViT-Large final layer.
For comparison, we also fine-tuned a ResNet-50 model, originally trained on natural images, as a widely used benchmark for feature extraction. ResNet-50 was initialized with ImageNet weights and fine-tuned on the BRACS training subset using the same protocol as UNI-FT. The Adam optimizer with the same differential learning rate schedule was applied using the same augmentation strategy as for UNI-FT. For ResNet-50-FT, 2048-dimensional embeddings were extracted from the model's final average pooling layer.
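A minimal sketch of the differential learning-rate setup is shown below; the generic timm ViT-Large identifier stands in for the actual UNI checkpoint loading, which is documented in the public repository.

```python
# Sketch: per-group learning rates for encoder vs. prediction head.
import torch
import timm

# Generic ViT-L stand-in for the UNI checkpoint (illustrative, not the real weights).
backbone = timm.create_model("vit_large_patch16_224", num_classes=0)  # feature output
head = torch.nn.Linear(backbone.num_features, 7)   # 7 BRACS lesion classes

optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-6},  # ViT-Large encoder
    {"params": head.parameters(), "lr": 1e-4},      # prediction head
])
```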
Multiple instance learning (MIL) model
The MIL model was trained on the Semmelweis cohort (214 patients, 247 WSIs) using 5-fold cross-validation. A feature-processing block preceding the attention module consisted of 3 FC layers with a hidden feature dimension of 256, LeakyReLU activations, and a dropout rate of 0.6. The attention module included 2 FC layers with an attention hidden dimension of 256 and a dropout rate of 0.6, followed by a Softmax operation to generate attention weights. Attention-based aggregation was chosen to allow the model to focus on the most discriminative regions within a WSI bag.
Training was performed for 50 epochs with a batch size of 1, using the AdamW optimizer52 with a weight decay of 0.01 and a fixed learning rate of 2 × 10⁻⁵. Model weights were initialized randomly. To handle class imbalance, we employed focal loss with focusing parameter γ = 1, to mildly downweight easy examples while avoiding over-penalizing harder ones such as stage II cases. Class-specific α weights were defined as the square root of the inverse class frequencies, ensuring that rarer classes contributed proportionally more to the loss. This loss formulation was combined with a square-root-based resampling strategy, scaling class sizes in proportion to the square root of class frequencies and anchoring them to the size of the minority class, in order to balance the contribution of different classes during training. Together, these approaches aimed to moderate imbalance by reducing the dominance of majority classes without overcompensating minority ones, promoting more stable learning dynamics. Hyperparameters were tuned by grid search using the same 5-fold cross-validation protocol, and the best model from each fold was selected based on the lowest validation loss.
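The snippet below sketches this loss under the stated settings (γ = 1, α proportional to the square root of inverse class frequencies); the function names and the lack of α normalization are illustrative assumptions, and the resampling step is omitted.

```python
# Sketch: multiclass focal loss with sqrt-inverse-frequency alpha weights.
import torch
import torch.nn.functional as F

def make_alpha(class_counts):
    """alpha_c proportional to sqrt(1 / freq_c); unnormalized in this sketch."""
    freqs = torch.tensor(class_counts, dtype=torch.float) / sum(class_counts)
    return torch.sqrt(1.0 / freqs)

def focal_loss(logits, target, alpha, gamma=1.0):
    """logits: (B, C); target: (B,) class indices."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    at = alpha.to(logits.device)[target]
    return (-at * (1 - pt) ** gamma * log_pt).mean()
```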
Threshold tuning using precision-recall curves
The precision-recall (PR) curve and the area under the PR curve (PR AUC) are crucial metrics that offer a complementary view of model performance, especially in cases of significant class imbalance, as they emphasize a model's ability to identify positive samples while minimizing false positives and negatives. PR curves, which plot precision against recall at various decision thresholds, enable the determination of optimal thresholds using the Fβ score, where β balances precision and recall.
In our analysis, threshold tuning was performed on the Semmelweis dataset, using the validation folds from 5-fold cross-validation. For each cancer stage, we examined PR curves in a one-vs-rest approach to identify optimal thresholds for converting probability outputs into class labels. We set β = 1.0 (corresponding to the F1-score) and selected threshold values that maximized the F1-score for each stage individually. To ensure robust estimation, we considered all points within the top 5% of the F1-score distribution across individual PR curves rather than relying solely on the single maximum. Thresholds were determined per stage across validation folds, and their means and standard errors, together with the best F1-scores at those thresholds, are presented in Table 1. These thresholds were fixed and consistently applied in all subsequent evaluations, including the Semmelweis internal test and the NG and TCGA-BRCA external test sets, to generate more meaningful confusion matrices.
Table 1.
PR-curve–derived thresholds and best F1-scores on the Semmelweis cohort. Optimal decision thresholds for stages I–III obtained on the Semmelweis cohort via one-vs-rest precision–recall analysis across 5-fold cross-validation, with corresponding best F1-scores at those thresholds. Thresholds were selected by maximizing the F1-score on validation folds and are reported as mean ± standard error across folds. These thresholds were subsequently fixed for generating class labels on the Semmelweis internal test set and the NG and TCGA-BRCA external test sets.
| Model | Metric | Stage I | Stage II | Stage III |
|---|---|---|---|---|
| UNI-FT-MIL | Thresholds | 0.387 ± 0.018 | 0.290 ± 0.010 | 0.374 ± 0.033 |
| | Best F1-scores | 0.614 ± 0.019 | 0.711 ± 0.007 | 0.432 ± 0.011 |
Inspired by clinical decision-making—where advanced-stage diagnoses are prioritized when sufficient evidence is present—we derived class label predictions using a hierarchical approach. First, the predicted probability for stage III was compared against its threshold; if it exceeded this threshold, the sample was classified as stage III. Otherwise, the probability for stage II was compared against its threshold, and similarly for stage I, until classification was determined. This approach mitigated class imbalance and reduced false negatives, which are particularly costly in clinical applications. Appendix Table D.1 presents the best precision and recall scores achieved at these optimal thresholds, aggregated across the validation folds of the 5-fold cross-validation. For deployment, we recommend optional per-site recalibration on a small labeled tuning set with probability calibration and selection of an appropriate decision threshold aligned to local cost preferences.
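A sketch of this hierarchical rule is shown below, using the mean per-stage thresholds from Table 1; the fallback to the highest-probability stage when no threshold is exceeded is an assumption of this sketch.

```python
# Sketch: hierarchical thresholding, checking stage III first, then II, then I.
import numpy as np

THRESHOLDS = {"III": 0.374, "II": 0.290, "I": 0.387}  # CV means from Table 1
STAGES = ["I", "II", "III"]

def predict_stage(probs):
    """probs: [p_I, p_II, p_III] for one case."""
    for stage in ("III", "II", "I"):
        if probs[STAGES.index(stage)] >= THRESHOLDS[stage]:
            return stage
    return STAGES[int(np.argmax(probs))]  # fallback (assumption of this sketch)
```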
Model evaluation and statistical analysis
The evaluation involved two distinct tasks.
First, the performance of the three backbone networks (UNI, UNI-FT, and ResNet-50-FT) was assessed on the BRACS dataset for classifying image patches into seven breast lesion categories. Key metrics for this patch-level classification task included F1-score, precision, recall, and the area under the receiver operating characteristic curve (ROC AUC).
Second, a patient-level (case-level) evaluation assessed the UNI-FT-MIL model for predicting pTNM cancer stages (I, II, and III) on the Semmelweis internal test set and the external test sets (NG and TCGA-BRCA). Stage predictions were made at the patient (case) level by aggregating all slides per case into a single bag. Performance metrics included accuracy, F1-score, precision, recall, ROC AUC, and PR AUC. ROC curves were generated to complement PR-based thresholding by evaluating model discrimination across all thresholds. For ROC curves, the reported AUC values represent class-specific one-vs-rest performance. Shaded areas indicate 95% confidence intervals (CIs) calculated via bootstrapping. Unless otherwise noted, all multiclass metrics (accuracy, F1-score, precision, recall, ROC AUC, and PR AUC) are macro-averaged across stages I, II, and III to assign equal weight to each class in the presence of class imbalance; per-class results and confusion matrices are also reported where applicable. Given differing class prevalences across the external cohorts (NG vs TCGA-BRCA), we foreground ROC AUC for cross-site comparability; PR AUC is emphasized for threshold selection and internal analyses.
To quantify the uncertainty of performance metrics for both the patch-level BRACS classification and the stage-prediction tasks, bootstrapping with 10,000 class-stratified resamples was performed. From the bootstrapped distributions, 95% CIs were derived using the 2.5th and 97.5th percentiles.
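The class-stratified bootstrap can be sketched as follows; metric_fn is any scalar metric such as scikit-learn's roc_auc_score, and the helper name is illustrative.

```python
# Sketch: class-stratified bootstrap with percentile 95% CIs.
import numpy as np

def bootstrap_ci(y_true, y_score, metric_fn, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        idx = np.concatenate([                       # resample within each class
            rng.choice(np.where(y_true == c)[0], size=(y_true == c).sum())
            for c in np.unique(y_true)
        ])
        stats.append(metric_fn(y_true[idx], y_score[idx]))
    return np.percentile(stats, [2.5, 97.5])         # 95% CI bounds
```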
To assess the statistical significance of performance differences between backbone networks on BRACS, two-sided paired permutation tests with 10,000 permutations were used. A significance threshold of p < 0.05 indicated a statistically significant difference in performance, whereas p ≥ 0.05 indicated no significant difference.
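A sketch of the paired permutation test under a sign-flip formulation is given below; the per-case score vectors and helper name are assumptions, as the exact test statistic is not specified beyond the description above.

```python
# Sketch: two-sided paired permutation (sign-flip) test on per-case scores.
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    observed = diff.mean()
    count = 0
    for _ in range(n_perm):
        signs = rng.choice([-1, 1], size=diff.size)  # randomly swap pairs
        count += abs((signs * diff).mean()) >= abs(observed)
    return count / n_perm                            # two-sided p-value
```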
Computational resources and software
Preprocessing (image segmentation and patching) for BRACS, Semmelweis, and TCGA-BRCA was executed on an on-premises server with 32 CPU cores, 256 GB RAM, and an NVIDIA RTX 4090 (24 GB VRAM) using multithreaded jobs; the primary bottleneck was solid-state-drive throughput. The NG preprocessing was run on the legacy Nightingale Cloud (before retirement) on a cpu.1× instance (16 cores, 124 GB RAM) with a similar multithreaded setup.
Feature extraction for BRACS, Semmelweis, and TCGA-BRCA was performed on the same RTX 4090 server; NG feature extraction used Nightingale Cloud gpu.nvidia-a10g.1× instances (8 CPU cores, 32 GB RAM, NVIDIA A10 with 24 GB VRAM).
Encoder fine-tuning (UNI, UNI-FT, and ResNet-50-FT) was carried out on the same RTX 4090 server.
MIL training and all evaluations (Semmelweis internal test, NG, and TCGA-BRCA external tests) were run on an institutional on-premises server (within the Nightingale Network access environment) equipped with 32 CPU cores, 198 GB RAM, and 3 NVIDIA V100 GPUs (32 GB VRAM each). All training runs used a single V100 GPU.
All code was implemented in Python with PyTorch as the deep-learning framework and the libraries NumPy, pandas, scikit-learn, h5py, OpenCV, Pillow, openslide-python, timm, and huggingface-hub for model backbones. Exact package versions and reproducible environments are provided in the public repository, enabling end-to-end replication of preprocessing, feature extraction, training, and evaluation. All figures were generated using Matplotlib, seaborn, Plotly, and Lucidchart.
As quantified in the Image segmentation and patching section, operating at 2.5× reduces patch counts by ≈64× relative to 20× (and ≈256× vs 40×) for the same tissue area. In practice, this yields proportional reductions in disk I/O, number of forward passes for feature extraction, and intermediate storage. Per-batch VRAM needs are governed by the encoder and patch size (224 × 224), not magnification, so inference comfortably runs on a single consumer GPU (8–12 GB VRAM) or, if necessary, CPU-only with longer runtimes. The MIL stage operates on slide-level bags of embeddings and is lightweight (CPU or modest GPU). Compared to a 20× pipeline, 2.5× typically enables shifting from multi-GPU or cloud instances to a single local GPU (or CPU-only) with substantial throughput and storage savings, subject to site-specific I/O and parallelism. A practical minimal setup for end-to-end inference at 2.5× is 8–16 CPU cores, 32 GB RAM, and a single 8–12 GB VRAM GPU. Fine-tuning encoders benefits from ≥16–24 GB VRAM but is optional for deployment. With embedding dimension D and d-type size b bytes, per-WSI embedding storage scales as Npatches × D × b; since Npatches drops by ≈64× at 2.5× vs 20×, storage and forward-pass counts drop by the same factor (e.g., D = 1024, float32 ≈ 4 kB/patch). We did not report pipeline run time or per-case inference time, because these measures depend strongly on storage bandwidth, I/O parallelism, and local GPU/CPU availability. To support site-specific estimates, the public codebase provides end-to-end scripts that can be readily instrumented (e.g., with standard Python timing utilities) to profile preprocessing, feature extraction/embedding, and MIL inference.
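For site-specific planning, the embedding-storage relation can be expressed as a one-line helper; the function name is illustrative.

```python
# Per-WSI embedding storage: N_patches x D x b bytes (D = 1024, float32 here).
def embedding_storage_mb(n_patches, dim=1024, bytes_per_val=4):
    return n_patches * dim * bytes_per_val / 1e6

print(embedding_storage_mb(1_000))   # ~4.1 MB for 1000 patches at 2.5x
```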
Results
We examined whether the feature representations learned by pretrained encoders could be improved through additional training on domain-specific histopathology images. To this end, we fine-tuned the UNI ViT and ResNet-50 on breast tissue patches from the publicly available BRACS dataset, yielding UNI-FT and ResNet-50-FT, respectively. The BRACS dataset was chosen because it provides comprehensive patch-level annotations spanning benign breast histologies, atypical changes of epithelia, and in-situ and invasive carcinoma. Tiles from the BRACS dataset were used to evaluate UNI, UNI-FT, and ResNet-50-FT under identical conditions using the original patient-level train-validation-test split. Based on this evaluation, UNI-FT achieved the best overall performance and was therefore selected as the feature encoder for all subsequent breast cancer staging experiments.
Performance of backbone networks on the BRACS dataset
We tested the performance of the three encoder backbone networks—UNI, UNI-FT, and ResNet-50-FT—to classify breast tissue patches from the BRACS dataset into seven distinct classes.
Pathologists labeled polygons corresponding to normal breast tissue, various hyperplastic and metaplastic changes in benign breast epithelium, three types of premalignant lesions, DCIS, and invasive ductal carcinoma (see Fig. 3). Patches were extracted from annotated regions, each assigned a case and polygon label. To prevent data leakage, we adopted the original patient-level split from the BRACS study,9 ensuring that all WSIs and polygons from a given patient were assigned exclusively to the training, validation, or test set. The BRACS dataset, containing patches annotated by expert pathologists, was used for both fine-tuning the backbones and for testing their classification performance. The fine-tuning and evaluation architecture is shown in Fig. 4. During fine-tuning and testing, the UNI backbone was kept frozen and only the FC classification layers were updated, whereas for UNI-FT and ResNet-50-FT, both the backbone and FC layers were fine-tuned. To directly assess the quality of the learned embeddings, we passed them through the FC classification head to obtain per-class probabilities. For metrics requiring class labels (F1-score, precision, and recall), the predicted class was assigned by argmax over the seven-class probability vector.
Fig. 4.
Framework for BRACS patch-level classification with backbone fine-tuning. Fine-tuning and classification of BRACS patches using three backbone networks: UNI, UNI-FT, and ResNet-50-FT. For UNI, only the FC layers were fine-tuned, whereas for UNI-FT and ResNet-50-FT, both the backbone and FC layers were fine-tuned. Each backbone generated feature embeddings for patches extracted from WSIs, which were then passed through an FC head to output probabilities for each of the seven tissue classes: normal (N), pathological benign (PB), usual ductal hyperplasia (UDH), atypical ductal hyperplasia (ADH), flat epithelial atypia (FEA), ductal carcinoma in situ (DCIS), and invasive carcinoma (IC). The same architecture (backbone and FC layers) was used for both fine-tuning and testing.
The evaluation results are presented in Table 2. Whereas per-class scores differ from those reported in the original BRACS study,9 both UNI and UNI-FT achieved higher macro-averaged metrics than ResNet-50-FT. Across all metrics, UNI-FT consistently outperformed the other two backbones.
Table 2.
Patch-level classification performance on BRACS. Evaluation results for patch classification with seven labels on the BRACS dataset using UNI, UNI-FT, and ResNet-50-FT (Res-FT). Results are from the patient-level train-val-test split of the original BRACS study. For metrics requiring discrete labels (F1-score, precision, and recall), predictions were obtained by selecting the class with the highest predicted probability (argmax over the seven-class probabilities). Abbreviations: normal (N), pathological benign (PB), usual ductal hyperplasia (UDH), flat epithelial atypia (FEA), atypical ductal hyperplasia (ADH), ductal carcinoma in situ (DCIS), and invasive carcinoma (IC). The 95% confidence intervals (CIs) were calculated using bootstrapping on the test set.
N, PB, UDH, FEA, and ADH are benign categories; DCIS and IC are malignant. Values are point estimates with 95% CIs in brackets.

| Metric | Model | N | PB | UDH | FEA | ADH | DCIS | IC | Macro avg |
|---|---|---|---|---|---|---|---|---|---|
| F1 | UNI | 0.594 [0.532, 0.653] | 0.212 [0.148, 0.278] | 0.225 [0.157, 0.292] | 0.346 [0.255, 0.433] | 0.208 [0.138, 0.279] | 0.353 [0.278, 0.423] | 0.787 [0.748, 0.824] | 0.389 [0.361, 0.417] |
| | UNI-FT | 0.618 [0.559, 0.673] | 0.292 [0.221, 0.360] | 0.343 [0.263, 0.418] | 0.378 [0.286, 0.466] | 0.227 [0.151, 0.303] | 0.352 [0.275, 0.429] | 0.821 [0.786, 0.854] | 0.433 [0.404, 0.462] |
| | Res-FT | 0.516 [0.447, 0.581] | 0.200 [0.132, 0.269] | 0.252 [0.183, 0.318] | 0.340 [0.256, 0.419] | 0.244 [0.187, 0.301] | 0.183 [0.116, 0.254] | 0.721 [0.675, 0.764] | 0.351 [0.324, 0.377] |
| Precision | UNI | 0.482 [0.417, 0.548] | 0.254 [0.177, 0.339] | 0.209 [0.143, 0.279] | 0.337 [0.242, 0.436] | 0.216 [0.142, 0.294] | 0.333 [0.257, 0.412] | 0.906 [0.867, 0.941] | 0.391 [0.364, 0.419] |
| | UNI-FT | 0.484 [0.421, 0.545] | 0.328 [0.248, 0.412] | 0.342 [0.257, 0.429] | 0.357 [0.262, 0.453] | 0.277 [0.182, 0.377] | 0.381 [0.293, 0.472] | 0.868 [0.825, 0.908] | 0.434 [0.404, 0.464] |
| | Res-FT | 0.465 [0.392, 0.539] | 0.329 [0.221, 0.443] | 0.220 [0.156, 0.287] | 0.288 [0.210, 0.368] | 0.181 [0.135, 0.230] | 0.230 [0.145, 0.322] | 0.918 [0.878, 0.955] | 0.376 [0.348, 0.405] |
| Recall | UNI | 0.775 [0.702, 0.843] | 0.181 [0.124, 0.243] | 0.244 [0.169, 0.324] | 0.356 [0.258, 0.456] | 0.200 [0.132, 0.276] | 0.374 [0.291, 0.459] | 0.696 [0.645, 0.747] | 0.404 [0.376, 0.432] |
| | UNI-FT | 0.855 [0.792, 0.912] | 0.263 [0.194, 0.333] | 0.345 [0.259, 0.430] | 0.402 [0.300, 0.506] | 0.192 [0.124, 0.266] | 0.328 [0.248, 0.412] | 0.779 [0.732, 0.825] | 0.452 [0.424, 0.480] |
| | Res-FT | 0.580 [0.497, 0.662] | 0.144 [0.093, 0.201] | 0.294 [0.214, 0.376] | 0.414 [0.310, 0.519] | 0.375 [0.289, 0.463] | 0.153 [0.094, 0.217] | 0.594 [0.539, 0.648] | 0.365 [0.336, 0.393] |
| ROC AUC | UNI | 0.913 [0.893, 0.932] | 0.702 [0.663, 0.740] | 0.729 [0.689, 0.767] | 0.860 [0.828, 0.890] | 0.656 [0.606, 0.707] | 0.728 [0.683, 0.772] | 0.953 [0.939, 0.965] | 0.792 [0.776, 0.807] |
| | UNI-FT | 0.927 [0.908, 0.943] | 0.747 [0.709, 0.783] | 0.787 [0.749, 0.823] | 0.871 [0.840, 0.900] | 0.734 [0.689, 0.777] | 0.769 [0.726, 0.810] | 0.954 [0.940, 0.966] | 0.827 [0.812, 0.841] |
| | Res-FT | 0.875 [0.847, 0.902] | 0.712 [0.673, 0.751] | 0.708 [0.665, 0.749] | 0.824 [0.783, 0.861] | 0.654 [0.606, 0.701] | 0.678 [0.633, 0.724] | 0.940 [0.924, 0.955] | 0.770 [0.754, 0.786] |
Permutation-based statistical testing confirmed that UNI-FT significantly outperformed ResNet-50-FT across all evaluated metrics—accuracy, F1-score, precision, recall, specificity, ROC AUC, and PR AUC (all p < 0.05). UNI also achieved significantly higher accuracy and specificity than ResNet-50-FT (p < 0.05), but differences in other metrics were not statistically significant. Furthermore, UNI-FT achieved significantly higher scores than UNI across all metrics (all p < 0.05), indicating superior overall performance. Based on these results, UNI-FT was selected as the feature encoder for the downstream breast cancer staging experiments. Having selected UNI-FT, we next used its frozen 1024-dimensional embeddings as inputs to an attention-based MIL classifier to predict patient-level pTNM stage.
Breast cancer stage predictions on the Semmelweis dataset
Cross-validation and internal test set performance
To use UNI-FT for predicting pTNM breast cancer stages in the Semmelweis cohort, we replaced the FC layers with a MIL model, resulting in what we refer to as UNI-FT-MIL (see Fig. 5). By aggregating embeddings across patches from multiple slides, UNI-FT-MIL was trained to predict the stage of breast cancer for each patient. Since stage IV breast cancer, characterized by metastatic disease to distant organs (lung, liver, bone, brain, etc.), is rare in newly diagnosed patients, we focused our analysis on patients diagnosed with stage I, II, or III disease. We first set aside a hold-out test set comprising 25% of cases. Using the pTNM stage as the label for each case, we trained UNI-FT-MIL using a 5-fold cross-validation approach on the remaining training data. The performance of the UNI-FT-MIL, determined by the macro-average of each metric across the 5 folds, is shown in Table 3. For metrics requiring class labels—such as accuracy, F1-score, precision, and recall—predicted probabilities were converted to class labels based on optimal thresholds derived from PR curves (further details are provided in the Threshold tuning using precision-recall curves section).
Table 3.
Performance of UNI-FT-MIL on the Semmelweis cohort (cross-validation and internal test set). Performance metrics for pathological TNM (pTNM) stage prediction. For metrics requiring class labels (accuracy, F1-score, precision, and recall), predictions were converted to class labels using optimized thresholds. Cross-validation results are the mean metric values across the 5 folds, with 95% confidence intervals (CIs) computed from the standard error. Internal test set results were obtained by averaging predictions from an ensemble of the best models from each fold, with 95% CIs calculated using bootstrapping.
| Metric (UNI-FT-MIL) | Cross-validation mean | Cross-validation 95% CI | Internal test set mean | Internal test set 95% CI |
|---|---|---|---|---|
| ROC AUC | 0.685 | [0.614, 0.755] | 0.663 | [0.560, 0.757] |
| PR AUC | 0.539 | [0.469, 0.609] | 0.543 | [0.445, 0.675] |
| Accuracy | 0.463 | [0.391, 0.535] | 0.556 | [0.444, 0.667] |
| F1 | 0.442 | [0.365, 0.519] | 0.544 | [0.416, 0.657] |
| Precision | 0.466 | [0.346, 0.587] | 0.539 | [0.421, 0.656] |
| Recall | 0.448 | [0.375, 0.521] | 0.569 | [0.440, 0.692] |
Next, we used the best models in the cross-validation to predict the stage of cases in the internal test set. The results are shown in Table 3. To illustrate the models' ability to distinguish between cancer stages, we generated ROC curves using a one-vs-rest approach for each stage, as well as a confusion matrix. Fig. 6A shows the bootstrapped ROC curves for the UNI-FT-MIL model, and Fig. 6B presents the confusion matrix based on threshold optimization. After row-wise normalization, each cell shows the percentage of predictions for each observed stage, alongside actual sample counts in parentheses. Serious under-staging (true stage III → predicted stage I) occurred in 7% of stage III cases (1/14), whereas minor misclassification (true stage II → predicted stage I) occurred in 27% of stage II cases (10/37). Of the 14 stage III cases, 9 (64%) were correctly classified as stage III. Overall, classification performance was better for stages I and III than for stage II.
Fig. 6.
Performance evaluation on the Semmelweis internal test set. (A) Bootstrapped one-vs-rest ROC curves for pathological TNM (pTNM) stages I, II, and III for the MIL model using UNI-FT embeddings. The ROC curves illustrate the model's ability to differentiate between cancer stages using a one-vs-rest approach. The area under the curve (AUC) values shown in the legend reflect the performance for each stage individually. Shaded areas represent the 95% confidence intervals (CIs) from bootstrapping. (B) Confusion matrix after threshold optimization (see the Threshold tuning using precision-recall curves section). Each cell contains the row-wise normalized percentage of predictions for each observed stage, with actual sample counts displayed in parentheses.
Fig. 6A shows that UNI-FT-MIL achieved ROC AUC values of 0.666 for stage I, 0.534 for stage II, and 0.788 for stage III. Confidence intervals were moderately wide, reflecting variability due to limited case numbers.
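The bootstrap procedure behind such intervals can be summarized as resampling cases with replacement and recomputing the AUC per replicate; a minimal sketch follows, in which the replicate count and seed are illustrative assumptions rather than the study's exact configuration.

```python
# Sketch of a bootstrap 95% CI for a one-vs-rest ROC AUC (as in Fig. 6A bands).
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """y_true: (n,) binary labels; y_score: (n,) scores for the positive class."""
    rng = np.random.default_rng(seed)
    aucs = []
    n = len(y_true)
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)               # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:       # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return np.mean(aucs), (lo, hi)
```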
In summary, the UNI-FT backbone provided feature representations that allowed the MIL model to predict the pTNM stage of breast cancer with moderate accuracy in the Semmelweis cohort (cross-validation ROC AUC = 0.685, 95% CI: 0.614–0.755; internal test set ROC AUC = 0.663, 95% CI: 0.560–0.757). Whereas performance on the internal test set was moderate, particularly for stage II, these experiments establish the feasibility of extracting stage-related information from primary tumor tissue alone. To ensure consistency in the external evaluation, we first performed dataset-specific slide-level QC for the NG cohort (Data quality control in Nightingale (NG) and patient-level embedding analysis section) and then assessed generalizability on two independent external cohorts (Evaluation on external datasets section).
Data quality control in Nightingale (NG) and patient-level embedding analysis
Before external evaluation, we conducted slide-level QC on the NG dataset. Patients in the NG dataset typically contributed more slides per case than in the Semmelweis or TCGA-BRCA datasets. Initial inspection suggested that several slides contained lymph nodes or were affected by severe artifacts, such as image blurring or a different tissue preparation producing a smear-like appearance with loss of tissue architecture. To systematically identify such slides, we computed slide-level embeddings by mean aggregation of patch features (1024 dimensions per slide), visualized the embedding space with t-SNE, and applied DBSCAN clustering (Fig. 7A–B). This analysis revealed a large central cluster corresponding to tumor tissue and several smaller clusters corresponding to lymph nodes or artifacts (Fig. 7B–C). A small set of slides was assigned to DBSCAN cluster −1 (the noise label). Visual review confirmed that these 56 slides were good-quality tumor tissue, so they were retained together with the main tumor cluster (cluster 0). In contrast, slides in the artifact clusters (clusters 1–3) were excluded, removing 1340 slides in total. Only one patient lost all slides; every other patient retained at least one valid slide.
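A compact sketch of this QC procedure is given below. The DBSCAN parameters, and the choice to cluster in the 2-D t-SNE space rather than in the raw 1024-d embedding space, are assumptions for illustration rather than the study's tuned settings.

```python
# Sketch of slide-level QC: mean-pool patch features into one embedding per
# slide, project with t-SNE, and cluster with DBSCAN to flag outlier slides.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

def slide_qc(patch_features_per_slide, eps=3.0, min_samples=10):
    """patch_features_per_slide: list of (n_patches_i, 1024) arrays, one per slide."""
    slide_emb = np.stack([f.mean(axis=0) for f in patch_features_per_slide])
    coords = TSNE(n_components=2, random_state=0).fit_transform(slide_emb)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    # label -1 is DBSCAN "noise"; in the study such slides were reviewed
    # visually before deciding whether to keep or drop them
    return coords, labels
```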
After QC, we aggregated slide embeddings to the patient level by mean pooling across all valid slides, yielding a single 1024-dimensional embedding per patient. We then visualized patient-level embeddings across Semmelweis, TCGA-BRCA, and NG datasets using t-SNE. The resulting projection showed three distinct clusters corresponding to dataset origin, illustrating the expected domain shift of features between the three cohorts (Appendix Fig. A.1). All subsequent analyses were performed on this QC dataset, and the impact of such dataset-specific variation on predictive performance is examined in the following subsection.
Evaluation on external datasets
We assessed the performance of UNI-FT-MIL on two independent external test sets: curated subsets of the NG and TCGA-BRCA datasets (see Appendix Table B.1 for access conditions).
On the NG dataset, performance was comparatively strong. UNI-FT-MIL achieved a ROC AUC of 0.672 (95% CI: 0.613–0.728), slightly higher than on the Semmelweis internal test set (0.663), indicating that the model maintained its performance on an independent cohort from a different institution. To further contextualize these results, we examined stage-specific errors. Serious under-staging (true stage III → predicted stage I) occurred in 15% of stage III cases (4/26), whereas minor misclassification (true stage II → predicted stage I) occurred in 6% of stage II cases (4/70).
On TCGA-BRCA, performance was weaker, as expected. This cohort is a multi-institution dataset with substantial stain variability and a mixture of FFPE and frozen sections, posing a considerable domain shift relative to the Semmelweis training data (Appendix Fig. A.1). UNI-FT-MIL achieved a ROC AUC of 0.632 (95% CI: 0.602–0.662), a relative decrease of 4.68% from the ROC AUC obtained on the internal test set (0.663). This suggests that while the UNI-FT encoder provides robust feature extraction, the MIL classifier struggles to handle the domain shift and to generalize across heterogeneous tissue preparation and staining protocols. We examined the corresponding stage-specific error patterns on TCGA-BRCA. Serious under-staging (true stage III → predicted stage I) occurred in 20% of stage III cases (30/152), whereas minor misclassification (true stage II → predicted stage I) occurred in 24% of stage II cases (95/404). From a clinical perspective, serious under-staging is the more concerning direction of error because it carries greater clinical consequence; we report these results to characterize model error patterns, not to support clinical staging.
Whereas further work is needed to confirm robustness, these findings highlight the potential of resource-aware pipelines to transfer across real-world datasets.
Discussion
This study assessed the feasibility of predicting pTNM stage in breast cancer directly from H&E WSIs using a resource-aware pipeline operating at 2.5× magnification, corresponding to a pixel size of ≈4.0 μm. In the clinic, pathological staging often upstages the initial cTNM assessment: microscopic evaluation may reveal lymph node involvement or a larger primary tumor than suggested by in-vivo imaging. Therefore, refining the prediction of pathological stage from routine histopathology specimens using AI models may support clinical decision-making and treatment planning. AI algorithms have shown promise in identifying histopathological features that may escape human observation and in accelerating diagnostic workflows,25, 53 with potential benefits for speed, cost, and earlier detection of premetastatic disease markers.10, 54 Accordingly, we investigated whether WSIs alone contain sufficient morphological information to predict pTNM stage in patients with stages I–III. Operationally, we trained models to learn morphology linked to tumor extent (pT; T1–T3) and nodal risk (N0 vs N+), yielding a case-level pathological stage group (I–III) prediction from primary-tumor WSIs alone. Here, we present a transparent, compute-efficient WSI-only baseline, using H&E slides without supplementary clinical data, rather than a baseline trained on 20×/40× images: operating at 2.5× yields ≈256× fewer patches to process, embed, and aggregate than 40×, aligning with low-resource deployment. Our models trained on 2.5× image patches were still able to learn cues of tissue architecture relevant to staging. Collectively, the WSI-only design, which isolates morphology, makes the model straightforward to transfer across institutions and to implement in low-resource environments where large volumes of H&E slides are produced.
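To make the ≈256× figure explicit: with a fixed patch size, the number of patches needed to tile a slide scales with the square of the linear magnification ratio, so

```latex
% Worked example of the patch-count reduction from magnification choice
\frac{N_{40\times}}{N_{2.5\times}} \approx \left(\frac{40}{2.5}\right)^{2} = 16^{2} = 256 .
```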
Our approach combined a pretrained transformer-based pathology encoder with attention-based MIL in a weakly supervised pipeline using case-level pTNM labels. On the BRACS patch-classification task (seven tissue classes), UNI fine-tuned on BRACS (UNI-FT) outperformed the fine-tuned ResNet-50 (ResNet-50-FT), motivating its use for staging (Performance of backbone networks on the BRACS dataset section). Trained on Semmelweis and evaluated uniformly at 2.5× across all cohorts, UNI-FT-MIL achieved ROC AUCs of 0.663 on the Semmelweis internal test set, 0.672 on NG, and 0.632 on TCGA-BRCA (Evaluation on external datasets section). These results indicate that staging-relevant information is present at low magnification and that foundation-model embeddings can support downstream classification in a resource-aware setting, despite modest cohort sizes and anticipated cross-institution domain shift (see Appendix Table B.1).
Relative to prior work, our problem framing and constraints differ. Several studies have focused on predicting axillary lymph node metastasis, a task related to yet distinct from comprehensive cancer staging, often at higher magnification and/or with multimodal inputs. For example, attention-based MIL with CNN features reported a ROC AUC of 0.816 on an independent cohort of 1058 breast cancer patients for axillary lymph node metastasis prediction.34 Multimodal frameworks integrating clinicopathological variables with WSI features have also shown promise,35 achieving a ROC AUC of 0.809 for binary lymph node metastasis classification across a multicenter cohort of 3701 patients using multiscale (5×, 10×, and 20×) analysis, outperforming clinicopathology-only (ROC AUC 0.770) and WSI-only (ROC AUC 0.709) models. Clinical-only models36 have also been explored for sentinel lymph node metastasis prediction on preoperative data from 1832 breast cancer patients, achieving a ROC AUC of 0.740. In contrast, our goal was a uniform 2.5× WSI-only feasibility baseline across multiple cohorts, leveraging a pretrained pathology foundation model for embeddings rather than assembling a high-magnification, multimodal, or fully optimized pipeline. Unlike much of the prior literature that relied on older CNNs and larger development cohorts, we explicitly evaluated the off-the-shelf utility of a ViT-based foundation encoder (UNI) for pTNM prediction. We applied lightweight, breast-specific fine-tuning of UNI on BRACS to assess whether domain alignment improves representations. The improvement of UNI's feature representations after fine-tuning suggests that stage-related features of tumor extent and growth pattern can be captured at 2.5× magnification. Furthermore, our study provides evidence that the features learned by a foundation model can be repurposed effectively for a model that performs in resource-poor settings.
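For orientation, a hypothetical sketch of such lightweight fine-tuning is shown below: a ViT-L encoder (as packaged in timm) with a seven-class linear head for BRACS, trained end-to-end at a small learning rate. The checkpoint name and hyperparameters are placeholders, not the study's exact configuration; UNI weights themselves are gated and subject to their original license.

```python
# Illustrative fine-tuning setup: ViT-L feature extractor + linear head for
# the seven BRACS tissue classes. Checkpoint and hyperparameters are assumed.
import timm
import torch
import torch.nn as nn

encoder = timm.create_model("vit_large_patch16_224", pretrained=False,
                            num_classes=0)      # num_classes=0 -> pooled features
head = nn.Linear(encoder.num_features, 7)       # seven BRACS tissue classes
model = nn.Sequential(encoder, head)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()
# training loop over 224x224 patches at 2.5x omitted for brevity
```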
Key features of the study were chosen to enhance comparability and stability under constrained resources. First, all cohorts underwent detailed curation and were processed at the same magnification (Semmelweis for training and internal testing, NG and TCGA-BRCA for external testing), minimizing magnification-related confounds. Macenko stain normalization was also applied throughout to reduce staining variability and promote cross-cohort generalization. Second, we mitigated class imbalance via focal loss (class-specific α and γ = 1) and square-root sampling, and we used PR-based threshold tuning to convert probabilities into class labels with operating points chosen on validation folds (Threshold tuning using precision-recall curves section). Third, the NG dataset underwent an additional QC step using slide-level embedding visualization with t-SNE and DBSCAN clustering, in view of the many slides per patient and the frequent presence of lymph node or artifact slides, helping ensure that patient-level embeddings reflected primary tumor tissue (Fig. 7). Together, these choices provide a reproducible, resource-conscious baseline for cross-cohort evaluation.
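To illustrate the imbalance-handling choices named above, the sketch below implements a multi-class focal loss with class-specific α and γ = 1, together with square-root sampling weights suitable for a weighted sampler; the α values shown are placeholders, not the study's tuned weights.

```python
# Sketch of multi-class focal loss (class-specific alpha, gamma = 1) and
# square-root sampling weights; alpha values are illustrative placeholders.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=(0.3, 0.3, 0.4), gamma=1.0):
    """logits: (batch, 3); targets: (batch,) int stage labels."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p(true class)
    pt = log_pt.exp()
    a = torch.tensor(alpha, device=logits.device)[targets]    # per-class weight
    return (-a * (1 - pt) ** gamma * log_pt).mean()

def sqrt_sampling_weights(labels):
    """Per-sample weights ∝ 1/sqrt(class count), e.g. for WeightedRandomSampler."""
    counts = torch.bincount(labels)
    return (1.0 / torch.sqrt(counts.float()))[labels]
```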
However, this study also has several limitations: (1) the low magnification of histopathology images at 2.5× restricts access to nuclear details (e.g., mitoses, atypia), which negatively affects algorithmic performance relative to high-resolution pipelines; (2) thresholds optimized from validation PR curves may not be fully cohort-agnostic and could require recalibration before deployment on new datasets. Thresholds should be set together with clinicians, balancing the different costs of errors (e.g., missing a stage-III case), and may include a "flag for review" option for low-confidence results; (3) we did not prospectively benchmark the efficiency of the end-to-end model (e.g., slides per hour, per-case processing time), storage footprint, or compute cost; a fair and reproducible measurement of compute requires standardized hardware and storage configurations and is best addressed in deployment-oriented evaluations; (4) the generalization of UNI-FT remains challenged by domain shifts, including differences in staining, specimen preparation, scanner hardware, and institutional workflows (see Appendix Table B.1), even after stain normalization. Our cohort sizes were modest (e.g., Semmelweis: 286 patients, 329 WSIs), and class imbalances were present due to variables we could not account for, such as specimen type: the Semmelweis and TCGA-BRCA cohorts comprise resection specimens only, whereas NG contains a mix of resections and core needle biopsies. Such differences can alter morphology and context and thus affect transfer. Although we excluded patients with documented neoadjuvant therapy and post-staging samples wherever possible to avoid distortions in morphology and stage-related confounds, residual uncertainty can persist when metadata are incomplete; (5) finally, interpretability analyses of our models (e.g., attention heatmaps, saliency analyses) were out of scope but are important for clinical adoption.
An important question that will need to be resolved in the future is whether WSI-derived predictions add independent prognostic information beyond grade, tumor size, or molecular subtype in multivariable models.
Several avenues for future research emerge from this feasibility baseline: expanding training data with more diverse, publicly accessible cohorts; multiscale modeling or selective refinement at 20×/40× to recover fine cellular cues; advanced domain-shift mitigation by broadening data augmentation beyond stain normalization and leveraging targeted fine-tuning on small labeled subsets from new sites (adapting the encoder and/or MIL head with parameter-efficient updates such as adapters/LoRA), alongside adversarial alignment and test-time adaptation; and explicit interpretability to localize stage-relevant regions for pathologists. In addition, comparing supervised fine-tuning against domain-adaptive self-supervised adaptation of the encoder on unlabeled target WSIs may reduce representation shift without requiring new labels. From a deployment perspective, the model is best positioned as a triage/decision-support tool; prospective studies should evaluate workflow impact and calibrate thresholds to local tolerance for false positives/negatives. In such a setting, predictions would serve only as an assist to the oncologist to help in patient management, rather than as a replacement for conventional staging. Future evaluations should also stratify performance by specimen type (core needle biopsy vs resection) and by single-WSI scenarios to mirror diagnostic workflows. Enriched embeddings with spatially aware aggregation remain promising, as specialized histopathology pretraining and spatial modeling often improve downstream performance.55 Finally, the framework could be extended to include stage IV once sufficient data are available and to multimodal pipelines that integrate WSIs with radiology, genomics, and clinical variables, moving a transparent, low-resource pipeline toward greater clinical utility while preserving its practical advantages.
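As one concrete instance of the parameter-efficient updates mentioned above, the sketch below wraps a frozen pretrained linear layer with a trainable low-rank (LoRA-style) residual; the rank and scaling are illustrative assumptions, and this is a generic pattern rather than a component of the present pipeline.

```python
# Illustrative LoRA-style wrapper for site-specific adaptation: freeze the
# pretrained weight and learn a low-rank residual. Rank/scale are placeholders.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, scale=1.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init:
        self.scale = scale                        # output starts equal to base

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```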
Ethics declarations
The study was conducted in accordance with the Declaration of Helsinki. The use of the private dataset (Semmelweis cohort) was approved by the Hungarian Medical Research Council (ETT TUKEB: BM/27896-3/2024).
Declaration of generative AI and AI-assisted technologies in the manuscript preparation process
During the preparation of this work, the authors used ChatGPT in order to improve the readability of the manuscript. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.
Code availability
The source code used in this study is available in our GitHub repository at https://github.com/bedohazizsolt/wsi-breast-cancer-staging-mil. The repository includes scripts, configuration files, and documentation necessary to reproduce the training, evaluation, and analysis procedures reported in this work. Instructions for applying the filtering and splitting logic described in the manuscript are also provided to facilitate reproducibility. Model weights derived from the UNI encoder are subject to the original licensing terms. Where direct redistribution is restricted, we provide code and configuration to reproduce fine-tuning.
Funding
This work was supported by the National Research, Development and Innovation Office of Hungary Grants 2020-1.1.2-PIACI-KFI-2021-00298 (Zs.B., A.B.); the European Union project RRF-2.3.1-21-2022-00004 within the framework of the MILAB Artificial Intelligence National Laboratory (I.Cs.); the RRF-2.3.1-21-2022-00006 Data-Driven Health Division of National Laboratory for Health Security (P.P., O.K.); and OTKA K128780 (P.P.). B.S.K. acknowledges the support from NIH/NCI R21CA286375.
The funding agencies had no role in study design, data collection, data analyses, interpretation, writing of the report, or any aspect pertinent to the study.
Acknowledgments
We acknowledge Nightingale Open Science for launching the High-Risk Breast Cancer Prediction Contest and for providing access to the Nightingale platform and novel datasets, with particular gratitude to Nick Foster, Josh Risley, and Senthil Nachimuthu for their continued support.
We are grateful to Semmelweis University for providing the dataset used for training and internal testing in this study.
We thank Ziad Obermeyer for his valuable feedback and insights on this manuscript.
We acknowledge the Wigner Scientific Computational Laboratory GPU Lab for providing essential computational infrastructure, and the Hungarian Health Management Association for facilitating institutional access to the Nightingale (NG) dataset through its cooperation agreement with the University of Chicago, with special thanks to Attila Borbás for technical support in data transfer and server management.
The results published here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
We acknowledge financial support from the National Research, Development and Innovation Office of Hungary (2020-1.1.2-PIACI-KFI-2021-00298), the European Union projects RRF-2.3.1-21-2022-00004 (MILAB Artificial Intelligence National Laboratory) and RRF-2.3.1-21-2022-00006 (National Laboratory for Health Security), OTKA K128780, and NIH/NCI R21CA286375.
Declaration of competing interest
The authors declare the following financial interests/personal relationships, which may be considered as potential competing interests: Zsolt Bedohazi reports financial support was provided by the National Research, Development and Innovation Office of Hungary (Grant 2020-1.1.2-PIACI-KFI-2021-00298). Andras Biricz reports financial support was provided by the National Research, Development and Innovation Office of Hungary (Grant 2020-1.1.2-PIACI-KFI-2021-00298). Istvan Csabai reports financial support was provided by the European Union – MILAB Artificial Intelligence National Laboratory (Grant RRF-2.3.1-21-2022-00004). Peter Pollner reports financial support was provided by the Data-Driven Health Division, National Laboratory for Health Security (Grant RRF-2.3.1-21-2022-00006). Oz Kilim reports financial support was provided by the Data-Driven Health Division, National Laboratory for Health Security (Grant RRF-2.3.1-21-2022-00006). Peter Pollner reports financial support was provided by OTKA – Hungarian Scientific Research Fund (Grant K128780). Beatrice S. Knudsen reports financial support was provided by NIH – National Cancer Institute (Grant R21CA286375). All other authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.
Contributor Information
Zsolt Bedőházi, Email: zsoltbedohazi@inf.elte.hu.
Péter Pollner, Email: pollner.peter@emk.semmelweis.hu.
Appendix A. Supplementary embedding visualization
Fig. A.1.
t-SNE visualization of patient embeddings across datasets. Each point represents a patient embedding, obtained by aggregating patch-level features across valid slides (with slide-level quality control (QC) applied only to Nightingale (NG)). Points are colored by pTNM stage (I, II, and III) and shaped by study cohort (Semmelweis, TCGA-BRCA, and NG). Clustering is primarily driven by dataset origin, with limited stage-based separation.
Appendix B. Data card for the breast cancer datasets used in the study
Table B.1.
Data card for source datasets and curated study cohorts. Overview of four breast cancer histopathology datasets used in this work, summarizing provenance, size, specimen types, scanning parameters, patch-extraction level, stage labeling, access, and contacts. Abbreviations: WSI – whole-slide image; QC – quality control; mpp – micrometers-per-pixel; pTNM – pathological Tumor-Node-Metastasis staging system; ToS – Terms of Service.
| Attribute | BRACS | Semmelweis | Nightingale (NG) | TCGA-BRCA |
|---|---|---|---|---|
| Name | BReAst Carcinoma Subtyping (BRACS) | Internal cohort | Nightingale high-risk breast cancer prediction (NG) dataset | TCGA-BRCA |
| Release date | 2022 | 2024 | 2021 (Phase I Contest) | 2016 |
| Citation | 9 | Private dataset; no citation available | 41 | 43 |
| Provenance | Public research consortium | Semmelweis University | Providence Hospital Network | National Cancer Institute (NCI) |
| Full dataset size | 189 patients (547 WSIs) | 286 patients (329 WSIs) | 2567 patients (52,262 WSIs) | 1084 patients (3111 WSIs) |
| Filtering criteria | No filtering applied | Dataset included only invasive, M0 cases | Excluded cases without stage labels, neoadjuvant therapy, stage IV, cTNM-only, or post-treatment sampling; slide-level WSI QC (t-SNE and DBSCAN) removed lymph-node and artifact-heavy slides | Excluded cases with missing/ambiguous pTNM, stage IV, neoadjuvant therapy, post-treatment sampling/unknown or confounded treatment history, prior cancer, and male patients; retained diagnostic slides only (DX1–DX4) |
| Study cohort size | 189 patients (547 WSIs) | 286 patients (329 WSIs) | 574 patients (9489 WSIs) | 678 patients (731 WSIs) |
| Specimen types | Resection specimens (mastectomy) and core needle biopsies | Resection specimens (mastectomy, lumpectomy) | Surgical specimens and core needle biopsies | Resection specimens (mastectomy, lumpectomy) |
| Scanning parameters | 40× (Aperio AT2 scanner, 0.25 mpp) | 40× (3DHistech Pannoramic 1000, 0.24 mpp) | 40× (Hamamatsu NanoZoomer S360, 0.23 mpp) | 20–40× (multiple scanners, most 0.25 mpp, some 0.5 mpp) |
| Patch extraction level | 2.5× (4.00 mpp) | 2.5× (3.84 mpp) | 2.5× (3.68 mpp) | 2.5× (typically 4.00 mpp; minor scanner-dependent variation) |
| Stage type | Not available—no stage labels | pTNM (UICC 8th edition—stage groupings constructed per AJCC 7th) | pTNM (edition undocumented; stage labels directly available; alignment checks indicated close correspondence to AJCC 7th edition) | pTNM (AJCC 7th edition) |
| Class Distr. | Not available—no stage labels | Stage I: 84 (29.37%); Stage II: 145 (50.70%); Stage III: 57 (19.93%) | Stage I: 478 (83.28%); Stage II: 70 (12.20%); Stage III: 26 (4.52%) | Stage I: 122 (17.99%); Stage II: 404 (59.59%); Stage III: 152 (22.42%) |
| Access & License | Public—CC BY-NC 4.0 | Private—Proprietary | Restricted—institutional agreement via Nightingale Network; bound to ToS | Public—CC BY 3.0 with additional TCGA data use policies |
| Access URL | https://www.bracs.icar.cnr.it/ | Not publicly available | https://docs.ngsci.org/access-data | https://portal.gdc.cancer.gov/projects/TCGA-BRCA |
| Point of contact | Giosuè Scognamiglio (National Tumor Institute), Pushpak Pati (IBM Research), Nadia Brancati (ICAR-CNR) | Anna-Mária Tőkés (Semmelweis University) | Nightingale Open Science team (contact details available on dataset portal) | National Cancer Institute (NCI) |
Appendix C. Characteristics of the datasets used in the study
Table C.1.
Clinical and demographic profile of the Nightingale (NG) study cohort. Summary for the curated Nightingale (NG) external test cohort (n = 574), including sex distribution, age, overall mortality, survival time, race, ethnicity, and pathological TNM (pTNM) stage distribution with stage-specific mortality. Values are presented as counts (n) with percentages in parentheses, and medians with ranges for continuous variables.
| Clinical information of the Nightingale (NG) study cohort | | |
|---|---|---|
| Sex, n (%) | Female | 574 (100%) |
| Age | Median, range (years) | 63 (26–86) |
| Known (n) | 565 | |
| Unknown (n) | 9 | |
| Mortality, n (%) | No | 527 (91.81%) |
| Yes | 47 (8.19%) | |
| Survival time | Median, range (months) | 38 (1–88) |
| Known (n) | 47 | |
| Unknown (n) | 527 | |
| Race, n (%) | White or Caucasian | 499 (86.93%) |
| Black or African American | 4 (0.70%) | |
| American Indian or Alaska Native | 1 (0.17%) | |
| Asian | 23 (4.01%) | |
| Other | 23 (4.01%) | |
| Unknown | 24 (4.18%) | |
| Ethnicity, n (%) | Non-Hispanic or Latino | 510 (88.85%) |
| Hispanic or Latino | 24 (4.17%) | |
| Unknown | 40 (6.97%) | |
| Stage, n (%) | I | 478 (83.28%) |
| II | 70 (12.20%) | |
| III | 26 (4.52%) | |
| Mortality by stage, n (%) | I | 24 (5.02%) |
| II | 9 (12.86%) | |
| III | 14 (53.85%) | |
| Total | 47 (8.19%) | |
Table C.2.
Stage distribution across study cohorts. Distribution of pathological TNM (pTNM) stages I–III (counts and percentages) for the Semmelweis cohort, its training and internal test splits, and the curated TCGA-BRCA and Nightingale (NG) external cohorts. Counts reflect the number of patients per stage in each dataset.
| Stage | Semmelweis cohort | Training set | Internal test set | TCGA-BRCA | Nightingale (NG) |
|---|---|---|---|---|---|
| I | 84 (29.37%) | 63 (29.44%) | 21 (29.17%) | 122 (17.99%) | 478 (83.28%) |
| II | 145 (50.70%) | 108 (50.47%) | 37 (51.39%) | 404 (59.59%) | 70 (12.20%) |
| III | 57 (19.93%) | 43 (20.09%) | 14 (19.44%) | 152 (22.42%) | 26 (4.52%) |
| Total | 286 (100%) | 214 (100%) | 72 (100%) | 678 (100%) | 574 (100%) |
Table C.3.
Cohort-level summary of whole-slide images and patch counts at 2.5×. Patch counts are computed after tissue masking and nonoverlapping 224 × 224 patch extraction at 2.5× for each cohort. “Patches/patient” summarizes the distribution of the total number of extracted patches aggregated across all WSIs belonging to a patient (i.e., a patient-level WSI bag). Values are reported as median [Q1, Q3], where Q1 and Q3 denote the 25th and 75th percentiles.
| Cohort | Patients (n) | WSIs (n) | Patches/WSI, median [Q1, Q3] | Patches/patient, median [Q1, Q3] | Total patches |
|---|---|---|---|---|---|
| Semmelweis | 286 | 329 | 371 [288, 444] | 380 [298, 496] | 122,507 |
| Nightingale (NG) | 574 | 9489 | 193 [100, 312] | 1228 [458, 5812] | 2,058,794 |
| TCGA-BRCA | 678 | 731 | 225 [92, 334] | 234 [100, 349] | 163,782 |
Appendix D. Cross-validation metrics at the optimal thresholds
Table D.1.
Best precision and recall at PR-optimized thresholds (Semmelweis 5-fold CV). Stage-wise best precision and recall at the decision thresholds selected by maximizing the F1-score on one-vs-rest precision–recall curves across the Semmelweis training/validation folds. Values are reported as mean ± standard error over the 5 folds. Thresholds follow the hierarchical decision rule described in Methods.
| Model | Metric | Stage I | Stage II | Stage III |
|---|---|---|---|---|
| UNI-FT-MIL | Best precision | 0.552 ± 0.041 | 0.609 ± 0.017 | 0.443 ± 0.049 |
| UNI-FT-MIL | Best recall | 0.757 ± 0.036 | 0.873 ± 0.027 | 0.609 ± 0.076 |
Data availability
The BRACS dataset is publicly available upon registration at https://www.bracs.icar.cnr.it/ and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

The Semmelweis dataset was provided by the Department of Pathology, Forensic and Insurance Medicine, Semmelweis University, and is not publicly available due to institutional restrictions. Access may be granted for academic research upon reasonable request to the corresponding author and subject to approval by Semmelweis University.

The Nightingale (NG) dataset was obtained from the Nightingale Open Science platform (https://docs.ngsci.org/datasets/brca-psj-path/, https://doi.org/10.48815/N5159B). This platform provides curated medical image datasets with ground-truth labels for research purposes. As of June 30, 2025, the Nightingale Cloud platform has been retired, and access is now provided via Globus data transfer (https://docs.ngsci.org/access-data). Most datasets are accessible to researchers affiliated with academic or non-profit institutions using institutional credentials; however, access to the NG dataset requires an institutional agreement through the Nightingale Network. All Nightingale datasets remain restricted to non-commercial academic research. Updated terms of use are available at https://docs.ngsci.org/terms-of-service.

The TCGA-BRCA dataset, from which our external test subset was derived, is publicly available via The Cancer Genome Atlas (TCGA) portal (https://portal.gdc.cancer.gov/) in accordance with TCGA data access policies.

For the public datasets (NG and TCGA-BRCA), the cohort-construction logic, cross-validation and split definitions (where applicable), and the corresponding labels and model predictions used in this study have been shared to ensure reproducibility. For the Semmelweis dataset, we share the filtering and splitting scripts and aggregated descriptors; underlying protected data and metadata are available only under the access conditions described above.
References
- 1. Breastcancer.org. U.S. Breast Cancer Statistics. 2025. https://www.breastcancer.org. Accessed July 2025.
- 2. Harbeck N., Penault-Llorca F., Cortes J., et al. Breast cancer. Nat Rev Dis Primers. 2019;5(1):66. doi:10.1038/s41572-019-0111-2.
- 3. Sung H., Ferlay J., Siegel R.L., et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):209–249. doi:10.3322/caac.21660.
- 4. Brierley J.D., Gospodarowicz M.K., Wittekind C., editors. TNM Classification of Malignant Tumours. 8th ed. Chichester, West Sussex, UK / Hoboken, NJ: Wiley Blackwell; 2017.
- 5. Giuliano A.E., Edge S.B., Hortobagyi G.N. Eighth edition of the AJCC cancer staging manual: breast cancer. Ann Surg Oncol. 2018;25(7):1783–1785. doi:10.1245/s10434-018-6486-6.
- 6. Saslow D., Boetes C., Burke W., et al. American Cancer Society guidelines for breast screening with MRI as an adjunct to mammography. CA Cancer J Clin. 2007;57(2):75–89. doi:10.3322/canjclin.57.2.75.
- 7. Plichta J.K., Campbell B.M., Mittendorf E.A., Hwang E.S. Anatomy and breast cancer staging: is it still relevant? Surg Oncol Clin N Am. 2018;27(1):51–67. doi:10.1016/j.soc.2017.07.010.
- 8. Zhou J., Lei J., Wang J., et al. Validation of the 8th edition of the American Joint Committee on Cancer pathological prognostic staging for young breast cancer patients. Aging (Albany NY). 2020;12(8):7549–7560. doi:10.18632/aging.103111.
- 9. Brancati N., Anniciello A.M., Pati P., et al. BRACS: a dataset for breast carcinoma subtyping in H&E histology images. Database (Oxford). 2022;2022:baac093. doi:10.1093/database/baac093.
- 10. McKinney S.M., Sieniek M., Godbole V., et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89–94. doi:10.1038/s41586-019-1799-6.
- 11. Suh Y.J., Jung J., Cho B.-J. Automated breast cancer detection in digital mammograms of various densities via deep learning. J Pers Med. 2020;10(4):211. doi:10.3390/jpm10040211.
- 12. Khened M., Kori A., Rajkumar H., Krishnamurthi G., Srinivasan B. A generalized deep learning framework for whole-slide image segmentation and analysis. Sci Rep. 2021;11(1). doi:10.1038/s41598-021-90444-8.
- 13. Shmatko A., Ghaffari Laleh N., Gerstung M., Kather J.N. Artificial intelligence in histopathology: enhancing cancer research and clinical oncology. Nat Cancer. 2022;3(9):1026–1038. doi:10.1038/s43018-022-00436-4.
- 14. Alam M.R., Seo K.J., Abdul-Ghafar J., et al. Recent application of artificial intelligence on histopathologic image-based prediction of gene mutation in solid cancers. Brief Bioinform. 2023;24(3). doi:10.1093/bib/bbad151.
- 15. Hossain M.S., Shahriar G.M., Syeed M.M.M., et al. Region of interest (ROI) selection using vision transformer for automatic analysis using whole slide images. Sci Rep. 2023;13(1). doi:10.1038/s41598-023-38109-6.
- 16. Badve S.S. Artificial intelligence in breast pathology – dawn of a new era. npj Breast Cancer. 2023;9(1):5. doi:10.1038/s41523-023-00507-4.
- 17. Wang H., Roa A.C., Basavanhally A.N., et al. Mitosis detection in breast cancer pathology images by combining handcrafted and convolutional neural network features. J Med Imaging. 2014;1(3):034003. doi:10.1117/1.JMI.1.3.034003.
- 18. Janowczyk A., Madabhushi A. Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. J Pathol Inform. 2016;7(1):29. doi:10.4103/2153-3539.186902.
- 19. Ehteshami Bejnordi B., Veta M., Johannes van Diest P., et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318(22):2199–2210. doi:10.1001/jama.2017.14585.
- 20. Wang D., Khosla A., Gargeya R., Irshad H., Beck A.H. Deep learning for identifying metastatic breast cancer. 2016. arXiv:1606.05718. https://arxiv.org/abs/1606.05718.
- 21. Liu Y., Gadepalli K., Norouzi M., et al. Detecting cancer metastases on gigapixel pathology images. 2017. arXiv:1703.02442.
- 22. Lin H., Chen H., Dou Q., Wang L., Qin J., Heng P.-A. ScanNet: a fast and dense scanning framework for metastatic breast cancer detection from whole-slide images. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). 2018:539–546.
- 23. Kim Y.-G., Kim S., Cho C.E., et al. Effectiveness of transfer learning for enhancing tumor classification with a convolutional neural network on frozen sections. Sci Rep. 2020;10(1). doi:10.1038/s41598-020-78129-0.
- 24. Khaliliboroujeni S., He X., Jia W., Amirgholipour S. End-to-end metastasis detection of breast cancer from histopathology whole slide images. Comput Med Imaging Graph. 2022;102:102136. doi:10.1016/j.compmedimag.2022.102136.
- 25. Beck A.H., Sangoi A.R., Leung S., et al. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Sci Transl Med. 2011;3(108):108ra113. doi:10.1126/scitranslmed.3002564.
- 26. Naji M.A., Filali S.E., Aarika K., Benlahmar E.H., Abdelouhahid R.A., Debauche O. Machine learning algorithms for breast cancer prediction and diagnosis. Procedia Comput Sci. 2021;191:487–492.
- 28. Chen H., Wang N., Du X., Mei K., Zhou Y., Cai G. Classification prediction of breast cancer based on machine learning. Comput Intell Neurosci. 2023;2023. doi:10.1155/2023/6530719.
- 29. Rabiei R., Ayyoubzadeh S.M., Sohrabei S., Esmaeili M., Atashi A. Prediction of breast cancer using machine learning approaches. J Biomed Phys Eng. 2022;12(3):297–308. doi:10.31661/jbpe.v0i0.2109-1403.
- 30. Gurcan M.N., Boucheron L.E., Can A., Madabhushi A., Rajpoot N.M., Yener B. Histopathological image analysis: a review. IEEE Rev Biomed Eng. 2009;2:147–171. doi:10.1109/RBME.2009.2034865.
- 31. Sandbank J., Bataillon G., Nudelman A., et al. Validation and real-world clinical application of an artificial intelligence algorithm for breast cancer detection in biopsies. npj Breast Cancer. 2022;8(1):129. doi:10.1038/s41523-022-00496-w.
- 32. Jaroensri R., Wulczyn E., Hegde N., et al. Deep learning models for histologic grading of breast cancer and association with disease prognosis. npj Breast Cancer. 2022;8(1):113. doi:10.1038/s41523-022-00478-y.
- 33. Couture H.D., Williams L.A., Geradts J., et al. Image analysis with deep learning to predict breast cancer grade, ER status, histologic subtype, and intrinsic subtype. npj Breast Cancer. 2018;4(1):30. doi:10.1038/s41523-018-0079-1.
- 34. Xu F., Zhu C., Tang W., et al. Predicting axillary lymph node metastasis in early breast cancer using deep learning on primary tumor biopsy slides. Front Oncol. 2021;11:759007. doi:10.3389/fonc.2021.759007.
- 35. Ding Y., Yang F., Han M., et al. Multi-center study on predicting breast cancer lymph node status from core needle biopsy specimens using multi-modal and multi-instance deep learning. npj Breast Cancer. 2023;9(1):58. doi:10.1038/s41523-023-00562-x.
- 36. Shahriarirad R., Meshkati Yazd S.M., Fathian R., Fallahi M., Ghadiani Z., Nafissi N. Prediction of sentinel lymph node metastasis in breast cancer patients based on preoperative features: a deep machine learning approach. Sci Rep. 2024;14(1):1351. doi:10.1038/s41598-024-51244-y.
- 37. Alam M.R., Abdul-Ghafar J., Yim K., et al. Recent applications of artificial intelligence from histopathologic image-based prediction of microsatellite instability in solid cancers: a systematic review. Cancers. 2022;14(11):2590. doi:10.3390/cancers14112590.
- 38. Mobadersany P., Yousefi S., Amgad M., et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc Natl Acad Sci U S A. 2018;115(13):E2970–E2979. doi:10.1073/pnas.1717139115.
- 39. Vorontsov E., Bozkurt A., Casson A., et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nat Med. 2024. doi:10.1038/s41591-024-03141-0.
- 40. Li Z., Li B., Eliceiri K.W., Narayanan V. Computationally efficient adaptive decompression for whole slide image processing. Biomed Opt Express. 2023;14(2):667–686. doi:10.1364/BOE.477515.
- 41. Bifulco C., Piening B., Bower T., et al. Identifying High-Risk Breast Cancer Using Digital Pathology Images. 2021.
- 42. Mullainathan S., Obermeyer Z. Solving medicine's data bottleneck: Nightingale Open Science. Nat Med. 2022;28(5):897–899. doi:10.1038/s41591-022-01804-4.
- 43. Lingle W., Erickson B.J., Zuley M.L., et al. The Cancer Genome Atlas Breast Invasive Carcinoma Collection (TCGA-BRCA) (version 3) [data set]. The Cancer Imaging Archive; 2016.
- 44. Chen R.J., Ding T., Lu M.Y., et al. Towards a general-purpose foundation model for computational pathology. Nat Med. 2024;30(3):850–862. doi:10.1038/s41591-024-02857-3.
- 45. Xu H., Usuyama N., Bagga J., et al. A whole-slide foundation model for digital pathology from real-world data. Nature. 2024;630(8015):181–188. doi:10.1038/s41586-024-07441-w.
- 46. Campanella G., Chen S., Verma R., et al. A clinical benchmark of public self-supervised pathology foundation models. 2024. arXiv:2407.06508. https://arxiv.org/abs/2407.06508.
- 47. Macenko M., Niethammer M., Marron J.S., et al. A method for normalizing histology slides for quantitative analysis. In: 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro. 2009:1107–1110.
- 48. Guo Z., Liu H., Ni H., et al. A fast and refined cancer regions segmentation framework in whole-slide breast pathological images. Sci Rep. 2019;9(1):882. doi:10.1038/s41598-018-37492-9.
- 49. Chan L., Hosseini M.S., Rowsell C., Plataniotis K.N., Damaskinos S. HistoSegNet: semantic segmentation of histological tissue type in whole slide images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2019.
- 50. Lu M.Y., Williamson D.F., Chen T.Y., Chen R.J., Barbieri M., Mahmood F. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat Biomed Eng. 2021;5(6):555–570. doi:10.1038/s41551-020-00682-w.
- 51. Kingma D.P., Ba J. Adam: a method for stochastic optimization. 2014. arXiv:1412.6980. https://arxiv.org/abs/1412.6980.
- 52. Loshchilov I., Hutter F. Decoupled weight decay regularization. 2019. arXiv:1711.05101.
- 53. Bera K., Schalper K.A., Rimm D.L., Velcheti V., Madabhushi A. Artificial intelligence in digital pathology – new tools for diagnosis and precision oncology. Nat Rev Clin Oncol. 2019;16(11):703–715. doi:10.1038/s41571-019-0252-y.
- 54. Campanella G., Hanna M.G., Geneslaw L., et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med. 2019;25(8):1301–1309. doi:10.1038/s41591-019-0508-1.
- 55. Chen S., Campanella G., Elmas A., et al. Benchmarking embedding aggregation methods in computational pathology: a clinical data perspective. 2024. arXiv:2407.07841. https://arxiv.org/abs/2407.07841.