Abstract
Background
Stromal tumor-infiltrating lymphocytes (sTILs) have significant prognostic value for breast cancer patients, but their accurate assessment can be very challenging. We comprehensively studied the pitfalls faced by pathologists with different levels of professional experience, and explored the clinical applicability of reference card (RC)-assisted and artificial intelligence (AI)-assisted methods in assessing sTILs.
Materials and methods
Three rounds of ring studies (RSs) involving 12 pathologists from four hospitals were conducted. AI algorithms based on the field of view (FOV) and whole section were proposed to create RCs and to compute whole-slide image interpretations, respectively. Stromal regions identified and the associated sTIL scores by the AI method were provided to the pathologists as references. Fifty cases of surgical resections were used for interobserver concordance analysis in RS1. A total of 200 FOVs with challenge factors were assessed in RS2 for accuracy of the RC-assisted and AI-assisted methods, while 167 cases were used to validate their clinical performance in RS3.
Results
With the assistance of RCs, the intraclass correlation coefficient (ICC) in RS1 increased significantly to 0.834 [95% confidence interval (CI) 0.772-0.889]. The largest enhancement in ICC, from moderate (ICC: 0.592; 95% CI 0.499-0.677) to good (ICC: 0.808; 95% CI 0.746-0.857) was observed for heterogeneity. Accuracy evaluation showed significant grade improvement for heterogeneity and stromal factor FOVs among senior, intermediate, and junior groups. The ICC of heterogeneity and stromal factor analysis by the AI-assisted method achieved a level comparable to that of the senior group with RC assistance. The area under the receiver operating characteristic (ROC) curve, denoted as AUC, for AI-assisted sTIL scores in predicting pathological complete response after neoadjuvant therapy was 0.937, which was superior to visual assessment with an AUC of 0.775.
Conclusion
RC- and AI-assisted technology can reduce the uncertainty of interpretation caused by heterogeneous distribution.
Key words: stromal TILs, concordance, reference card, AI, breast cancer
Highlights
• Pathologists’ assessments of TILs generally achieve lower consistency in FOVs dominated by challenge factors.
• Heterogeneity is the primary factor of variability in assessments, followed by discrepancies in stromal delineation.
• Multi-assistant AI methods may help pathologists address histological heterogeneity challenges.
Introduction
In recent years, immunotherapy has garnered increasing attention from medical professionals, prompting a rigorous exploration of the immune microenvironment in breast cancer (BC). The interplay between tumors and their microenvironment serves as the foundation for cancer cell proliferation and metastasis, exerting a profound influence on tumor progression. Tumor-infiltrating lymphocytes (TILs), serving as a surrogate for the tumor immune microenvironment, have emerged as a clinically relevant biomarker that can be pathologically evaluated in practice using standardized methods. Notably, triple-negative breast cancer (TNBC) and human epidermal growth factor receptor 2 (HER2)-positive BC typically exhibit greater TIL infiltration than other BC subtypes. In early-stage TNBC with or without adjuvant therapy, the quantity of stromal TILs (sTILs) was positively correlated with survival, showing that the quantity of sTILs identifies a subset of early-stage TNBC patients who might benefit from de-escalated adjuvant systemic treatments.1, 2, 3 The levels of sTILs may also serve as potential predictors for the therapeutic response and prognosis of TNBC patients undergoing neoadjuvant chemotherapy, and changes in their levels before and after neoadjuvant therapy can be used to dynamically monitor treatment efficacy.4, 5, 6, 7, 8 In addition, many studies have also focused on the clinical relevance of TILs in HER2-positive BC.4,9, 10, 11 Based on the above evidence, TILs gained 1B evidence for TNBC and 2B for HER2-positive BC.12,13 The Food and Drug Administration granted approval for TIL therapy based on the findings from the C-144-01 trial presented at the 2023 European Society for Medical Oncology (ESMO) Immuno-Oncology Congress, thereby unveiling a promising avenue for TIL therapy in BC.14 Recent artificial intelligence (AI)-related research has reported the beneficial results of TILs in solid cancers such as BC, melanoma, and colorectal cancer.15, 16, 17 The International Immuno-oncology Biomarker Working Group (ITWG) presented a comprehensive analysis of the benefits of computational TIL assessment, but emphasized key barriers to clinical translation.18 In this study, we developed AI-computed reference cards (RCs) and introduced an RC-assisted interpretation method based on fields of view (FOVs), which will be conducive to carrying out large-scale clinical practice and extensive clinical verification.
The ITWG has published guidelines for scoring standards to enhance the standardization of pathologists’ interpretations.19, 20, 21 However, traditional visual assessment (VA) by pathologists lacks reproducibility,22 especially in identifying the residual lesions after neoadjuvant therapy (NAT), thereby compromising its clinical value as an indicator. As recommended by the ITWG, TILs should be assessed as a continuous parameter and reported for the stromal compartment (= % stromal TILs), which is determined by the effective lymphoid component (the numerator in the formula) and the division of stromal regions (the denominator in the formula). Therefore, this study mainly focuses on the heterogeneous distribution of the ‘numerator’ and the stromal factors of the ‘denominator’, and further discusses two questions as follows: (i) what are the primary sources of confusion among pathologists with different levels of professional experience? (ii) can RCs and AI-assisted technology effectively improve the precision and clinical applicability of sTILs? A gold-standard method was established to verify the accuracy.
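As a compact restatement of the numerator and denominator described above, the score can be written as a single ratio. The symbols $A_{\mathrm{MIC}}$ and $A_{\mathrm{stroma}}$ are notation introduced here for illustration only; they do not appear in the original text.

```latex
\[
  \text{sTILs}\,(\%) \;=\; \frac{A_{\mathrm{MIC}}}{A_{\mathrm{stroma}}} \times 100
\]
```

Here $A_{\mathrm{MIC}}$ denotes the stromal area occupied by mononuclear inflammatory cells (the ‘numerator’) and $A_{\mathrm{stroma}}$ the total area of tumor-associated stroma (the ‘denominator’). Heterogeneity mainly perturbs the estimate of the numerator, whereas stromal delineation errors perturb the denominator.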
Materials and methods
Study population and field of view selection
Hematoxylin–eosin (H&E) slides of 266 patients with invasive BC were collected at the Fourth Hospital of Hebei Medical University. Slides with technical deficiencies were excluded, including histological artifacts caused by prolonged ischemic time, poor fixation or issues during processing, as well as crush artifacts. Finally, H&E slides of 217 cases were scanned using a UNIC digital scanner (Precision 600 Series, UNIC Technologies, Inc., Beijing, China) at 40× magnification.
Fifty scanned whole sections from excision specimens were used to evaluate the scoring concordance across pathologists by VA. Cases with the top 10% greatest standard deviation were identified (red points in Supplementary Figure S1, available at https://doi.org/10.1016/j.esmoop.2025.105095). The original scanned slides of the cases were reviewed to identify histological factors contributing to the variation in sTIL assessment in these cases. Often multiple factors were present in each slide. Heterogeneity in sTIL distribution was identified as the most prevalent challenge. Other challenges associated with delineating the scoring stromal regions included limited stroma within tumor (small volume of intratumoral stroma present for evaluation) and potential interference factors such as necrosis, histiocyte response, cholesterol deposition, lipofuscin deposition, peritumoral retraction clefting, and adipose tissue, as illustrated in Figure 1. From this set, we selected 129 image FOVs with heterogeneity (1-mm² regions) and 71 fields characterized by stromal region challenges (1-mm² regions).
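The selection of high-variability cases described above can be sketched as follows. This is only an illustration under stated assumptions: the case identifiers and scores are invented, the sample standard deviation is assumed (the text does not say whether the sample or population form was used), and `top_variance_cases` is a hypothetical helper name.

```python
import math
import statistics


def top_variance_cases(scores_by_case, fraction=0.10):
    """Return case IDs whose inter-pathologist SD falls in the top `fraction`.

    scores_by_case maps a case ID to the list of sTIL scores (in %) given by
    the pathologists. The sample standard deviation is an assumption.
    """
    sds = {case: statistics.stdev(scores) for case, scores in scores_by_case.items()}
    # Always flag at least one case, rounding the cut-off count upwards.
    n_top = max(1, math.ceil(len(sds) * fraction))
    ranked = sorted(sds, key=sds.get, reverse=True)
    return ranked[:n_top]


# Invented scores for three cases and four pathologists (illustration only).
scores = {
    "case_01": [10, 10, 15, 10],
    "case_02": [5, 40, 20, 60],
    "case_03": [30, 30, 35, 30],
}
flagged = top_variance_cases(scores)
print(flagged)
```

Cases flagged in this way would then be reviewed manually to identify the histological factors behind the disagreement.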
Figure 1.
Pitfalls introduced in assessing sTILs in breast cancer. (A-B) Marked heterogeneity in sTILs density within the tumor. (B) Lymphocytic infiltrates of varying densities separated by collagenous stroma. (C-I) Challenges associated with delineating the stromal regions including limited stromal areas within tumor (C), necrosis (D), adipose tissue (E), peritumoral retraction clefting (F), histiocyte response (G), cholesterol deposition (H), and lipofuscin deposition (I). A: H&E staining, 4×; B-I: H&E staining, 10×. H&E, hematoxylin–eosin; sTILs, stromal tumor-infiltrating lymphocytes.
All tissues and data were retrieved under the permission of the Institutional Research Ethics Board of the Fourth Hospital of Hebei Medical University with the declared number of 2024KY108. Since this study did not involve interaction with human subjects and/or use of individuals’ personal identifiable information, the use of existing pathological materials did not require informed consent and did not reveal identifiable patient information. The study was carried out in accordance with the ethics standards of the participating institutions and the tenets of the Declaration of Helsinki.
Clinical performance validation for AI-assisted sTIL assessment
H&E images of 167 TNBC biopsy samples before neoadjuvant therapy were used for clinical performance verification. Tumors with <1% stained cells were considered to be negative for estrogen receptor and progesterone receptor. Immunohistochemistry results of 0, 1+, or 2+ (FISH-negative) were considered negative for HER2. Pathologists evaluated sTILs in all cases using RC-assisted (RC-AS) and AI-assisted (AI-AS) methods. The pathological complete response (pCR) outcome was coded as 1, and the clinical performance of these methods in predicting pCR was assessed using the area under the receiver operating characteristic (ROC) curve (AUC).
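For readers unfamiliar with the AUC, a minimal sketch of how it can be computed from binary pCR labels and continuous sTIL scores is given below. It uses the rank-based definition (the probability that a randomly chosen pCR case has a higher score than a randomly chosen non-pCR case). The function name `roc_auc` and the data are hypothetical; the study itself used SPSS and GraphPad Prism.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison: P(score_pos > score_neg), ties count 0.5.

    labels: 1 for pCR, 0 for non-pCR. scores: sTIL percentages.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pCR and one non-pCR case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: six biopsies with pCR outcome and sTIL score (%).
pcr = [1, 1, 1, 0, 0, 0]
stil = [60, 40, 10, 30, 5, 10]
auc = roc_auc(pcr, stil)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of pCR from non-pCR cases.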
Pathologist recruitment
We organized a multi-institutional ring study for sTIL assessment in invasive BC, recruiting 12 board-certified pathologists from four provincial and municipal hospitals. The pathologists were divided into three groups according to their experience: senior (≥10 years, pathologists A, B, C, and D), intermediate (≥5 years but <10 years, pathologists E, F, G, and H), and junior (≥2 years but <5 years, pathologists I, J, K, and L). All pathologists attended training sessions on the assessment protocols in strict accordance with the ITWG guidelines.18,19
Establishing sTIL gold standard
All H&E slides were re-stained with leukocyte common antigen (LCA; clone PD7/26+2B11), which was used to assist pathologists in setting up the gold standard. Three pathologists with >15 years of professional experience, who did not participate in the subsequent ring study, scored the slides, and the mean of their scores was taken as the gold standard. The intraclass correlation coefficient (ICC) among the three pathologists exceeded 0.900 (excellent correlation), with an ICC of 0.914 [95% confidence interval (CI) 0.867-0.947]. To validate the accuracy of the RC-AS and AI-AS methods, the consistency between the gold-standard scores and the scores provided by the two methods was evaluated. The average value of pathologist scores was used to analyze the consistency between the senior, intermediate, and junior groups and the gold standard, respectively.
Ring study design
Three parts were conducted in this study, with the VA, RC-AS, and AI-AS methods used in each part. (i) Fifty scanned whole sections from excision specimens were assessed to evaluate interobserver concordance. (ii) Each pathologist scored 200 FOVs with challenge factors for the analyses of both concordance and accuracy. (iii) sTIL assessment on the slide level was conducted for 167 scanned BC core biopsy slides for validating clinical performance. When using the AI-AS method, the pathologists were provided with the stromal regions automatically delineated and the associated sTIL scores by our AI method as references. The pathologists reviewed the sTIL scores and intervened with modifications only when deemed necessary. A high degree of concordance was observed among the 12 pathologists when applying the AI-AS method for scoring. Therefore, for the statistical analysis of the AI-AS data in this study, we employed the mean value. The proposed AI-AS method was described in the subsection ‘AI methods for sTILs quantitative analysis’ with more details stated in the Supplementary Material, available at https://doi.org/10.1016/j.esmoop.2025.105095. In the ring studies (RSs), the images were randomly reordered for each method, and 2-week wash-out periods were implemented.
Development and application of RCs
FOV-based sTIL analysis by the AI algorithm was used to create the RCs, which were designed to assist in precisely scoring newly observed samples. According to the recommendations of the ITWG, we developed RCs consisting of a series of images depicting nine distinct densities of sTILs at 5%, 10%, 15%, 20%, 30%, 40%, 50%, 60%, and 70%, respectively. Each density was illustrated with an H&E-stained FOV alongside its annotated image, in which color coding was used to highlight the scoring mononuclear inflammatory cells (MICs), stromal regions, and other regions, as illustrated in Supplementary Figure S2, available at https://doi.org/10.1016/j.esmoop.2025.105095. During RC-AS interpretation, each pathologist compared the observed sTIL patterns in the tissue sample with the reference images on the cards to provide an assisted final evaluation.
AI methods for sTIL quantitative analysis
According to the ITWG guidelines, we developed deep learning-based automatic sTIL interpretation algorithms for both image FOVs and whole-slide images (WSIs), named as the field of view (AI-F) method and whole-slide (AI-W) method, respectively. The AI-F method supported the sTIL assessment for RCs. It provided sTIL scores together with detailed cell- and region-level results, as demonstrated in Supplementary Figure S3, available at https://doi.org/10.1016/j.esmoop.2025.105095, to assist the pathologists during the RSs. The AI-W approach employed the AI-F method as a core building block to compute the slide-level sTIL score as illustrated in Figure 2. Full details on the AI algorithm, the dataset, and model training beyond the following information are presented in the Supplementary Material, available at https://doi.org/10.1016/j.esmoop.2025.105095.
Figure 2.
System diagram of the AI-W algorithm. The AI-F method was integrated as a building block to analyze all image FOVs in the WSI. The processing of FOVs with both high and low expressions of sTILs was exemplified, with intermediate cell-level and region-level results demonstrated in a pipeline. All of the mononuclear inflammatory cell (MIC)-occupied pixels in the WSI constituted the numerator of sTILs, while the total number of tumor-associated stroma pixels was the denominator. AI-F, artificial intelligence-field of view; AI-W, artificial intelligence-whole slide; FOVs, fields of view; sTILs, stromal tumor-infiltrating lymphocytes; WSI, whole-slide image.
Deep learning-based FOV interpretation algorithm
In the AI-F algorithm, we designed a three-module hierarchical framework: (i) the cell-level analysis via MIC detection and segmentation, (ii) the tissue region-level segmentation of tumor-associated stroma, and (iii) the FOV-level sTIL scoring based on computing the MIC-occupied area. To develop the deep learning neural networks, a board-certified senior pathologist (ZM) manually selected the image FOVs from WSIs. In this way, a training set involving 13 patient cases and a validation set of 14 cases, consisting of 853 662 and 456 831 MIC annotations, respectively, were constructed to cover the MIC diversity for cell detection. For the region segmentation model, we utilized a public dataset of 151 H&E images23 in combination with our private dataset, which includes polygon annotations for 1039 BC cases from The Cancer Genome Atlas (TCGA) repository.24,25
At the cell level, a deep learning model was trained for MIC detection that was formulated as a cell centroid regression problem. To this end, a training dataset of over 800 000 MIC centroids was automatically annotated under the supervision of LCA immunohistochemistry. Instead of manual annotation, we obtained these cell labels via image registration of H&E slides and their paired re-stained LCA slides, followed by nucleus instance segmentation,26 color deconvolution27 of LCA images, intensity thresholding for positive staining, and centroid computation of LCA-positive cells. With these MIC annotations, we constructed heatmap images of two-dimensional Gaussian-shaped responses centered at MIC centroid positions. Using the pairs of H&E FOVs and their heatmaps, a fully connected network28,29 was trained to predict MIC heatmaps for H&E images. The centroids could be computed by finding the local maxima within the regression heatmap. Our MIC detection model was evaluated on the validation dataset and obtained an F1-score30 of 0.826. Following MIC centroid detection, we also segmented the MIC instances using the information provided by Koohbanani et al.26 The resulting MIC nucleus contours were used in the FOV-level analysis.
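The heatmap construction and peak-finding steps described above can be illustrated with a small sketch. It covers only the target construction (Gaussian responses at centroid positions) and the post-processing (local maxima); the trained network that maps H&E pixels to heatmaps is omitted. The value of `sigma`, the detection `threshold`, the 3×3 neighbourhood, and the use of a pixel-wise maximum to combine overlapping Gaussians are all assumptions made for illustration, not details taken from the paper.

```python
import math


def gaussian_heatmap(height, width, centroids, sigma=2.0):
    """Build a target heatmap with a 2D Gaussian response at each (row, col) centroid."""
    heat = [[0.0] * width for _ in range(height)]
    for cy, cx in centroids:
        for y in range(height):
            for x in range(width):
                g = math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
                # Overlapping responses are merged with a maximum (assumption).
                heat[y][x] = max(heat[y][x], g)
    return heat


def local_maxima(heat, threshold=0.5):
    """Return (row, col) pixels that exceed `threshold` and all 8 neighbours."""
    h, w = len(heat), len(heat[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heat[y][x]
            if v < threshold:
                continue
            neighbours = [
                heat[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
                if (j, i) != (y, x)
            ]
            if all(v > n for n in neighbours):
                peaks.append((y, x))
    return peaks


# Two hypothetical MIC centroids in a 20 x 30 pixel patch.
centroids = [(5, 5), (12, 20)]
heat = gaussian_heatmap(20, 30, centroids)
peaks = local_maxima(heat)
print(peaks)
```

On a predicted (rather than ideal) heatmap, the threshold trades off missed cells against false detections, which is what the reported F1-score summarizes.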
At the region level, to delineate the tumor-associated stroma for sTIL assessment, we trained a breast tissue multi-class region segmentation (BMRS) model. We utilized both a public dataset23 and our privately annotated dataset of images extracted from TCGA.24 On this combined dataset, six tissue categories were defined, including the tumor, stroma, dense inflammation, necrosis, normal epithelium, and other tissues (i.e. miscellaneous contents, including adipose tissue, blood vessels, muscle fibers, and duct cavities). For these categories, the BMRS model achieved Dice coefficients31 of 0.802, 0.823, 0.727, 0.814, 0.703, and 0.683 on the validation set, respectively. The design of these types was inspired by the paper by Amgad et al.,23 aiming to reflect the integral components of breast tissue and therefore handle varied challenges in estimating the tumor-associated stroma for sTIL assessment. Pixel-wise predictions of the stroma and dense inflammation classes together contributed to the tumor-associated stroma region, while the pixels of other classes were excluded from sTIL scoring.
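To make the region-level step concrete, the sketch below computes a per-class Dice coefficient and derives the tumor-associated stroma mask as the union of the stroma and dense inflammation classes, as stated above. The integer label codes and the tiny label maps are hypothetical; the actual BMRS model and its label encoding are described only in the Supplementary Material.

```python
# Hypothetical integer codes for the six tissue categories.
TUMOR, STROMA, DENSE_INFLAMMATION, NECROSIS, NORMAL_EPITHELIUM, OTHER = range(6)


def dice(pred, truth, label):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for one class in two label maps."""
    p = sum(1 for row in pred for v in row if v == label)
    t = sum(1 for row in truth for v in row if v == label)
    inter = sum(
        1
        for row_p, row_t in zip(pred, truth)
        for a, b in zip(row_p, row_t)
        if a == label and b == label
    )
    # Convention (assumption): a class absent from both maps scores 1.0.
    return 1.0 if p + t == 0 else 2.0 * inter / (p + t)


def tumor_associated_stroma(pred):
    """Boolean mask: stroma and dense inflammation pixels count, all others are excluded."""
    return [[v in (STROMA, DENSE_INFLAMMATION) for v in row] for row in pred]


# Invented 2 x 3 label maps (prediction versus annotation).
pred = [[TUMOR, STROMA, STROMA],
        [DENSE_INFLAMMATION, NECROSIS, OTHER]]
truth = [[TUMOR, STROMA, TUMOR],
         [DENSE_INFLAMMATION, NECROSIS, OTHER]]

stroma_dice = dice(pred, truth, STROMA)
mask = tumor_associated_stroma(pred)
print(round(stroma_dice, 3), mask)
```

Excluding necrosis, adipose tissue, and the other non-stromal classes at this stage is what keeps interference factors out of the denominator of the sTIL score.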
Finally, to determine the area occupied by MICs, we created a mask of the segmented MIC instances followed by a morphological dilation operation. To handle the cases of densely populated MICs stated in the ITWG guidelines,19,20 we partitioned the area of tumor-associated stroma into blocks and applied an occupancy test on them to distinguish the occupied blocks. After summing up all the occupied blocks in the FOV, the sTIL score was computed as the ratio of MIC-occupied area to the total area of tumor-associated stroma. We refer our readers to the Supplementary Material, available at https://doi.org/10.1016/j.esmoop.2025.105095, for details.
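One possible reading of the block-based occupancy scoring above is sketched here. The block size, the dilation radius, and the occupancy rule (a block counts as occupied when at least half of its stromal pixels are covered by the dilated MIC mask) are assumptions chosen for illustration; the authors' actual parameters and test are given in their Supplementary Material and may differ.

```python
def dilate(mask, radius=1):
    """Binary dilation with a square (2*radius+1) structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for j in range(max(0, y - radius), min(h, y + radius + 1)):
                    for i in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[j][i] = True
    return out


def stil_score(stroma, mic, block=2, min_occupancy=0.5, radius=1):
    """sTIL score (%) = stromal area of occupied blocks / total stromal area.

    stroma, mic: boolean masks of equal shape. All parameters are hypothetical.
    """
    mic = dilate(mic, radius)
    h, w = len(stroma), len(stroma[0])
    occupied_area = 0
    stroma_area = 0
    for by in range(0, h, block):
        for bx in range(0, w, block):
            s = m = 0  # stromal pixels and MIC-covered stromal pixels in this block
            for y in range(by, min(h, by + block)):
                for x in range(bx, min(w, bx + block)):
                    if stroma[y][x]:
                        s += 1
                        if mic[y][x]:
                            m += 1
            stroma_area += s
            # Occupancy test (assumed rule): enough of the block's stroma is covered.
            if s > 0 and m / s >= min_occupancy:
                occupied_area += s
    return 0.0 if stroma_area == 0 else 100.0 * occupied_area / stroma_area


# Invented 4 x 4 FOV: all stroma, one MIC pixel in the top-left corner.
stroma = [[True] * 4 for _ in range(4)]
mic = [[False] * 4 for _ in range(4)]
mic[0][0] = True
score = stil_score(stroma, mic)
print(score)
```

Counting whole occupied blocks rather than individual nucleus pixels reflects the area-based definition of sTILs, in which a densely infiltrated region counts as fully occupied even though small gaps exist between nuclei.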
Whole-slide assessment algorithm
To compute the whole-slide sTIL score, the AI-W algorithm first extracted non-overlapping grid-like FOVs from a WSI. Then, by applying the AI-F method, we obtained FOV results of MIC centroids, MIC instance contours, and segmented stromal regions. Finally, all the predicted tumor-associated stromal regions within the invasive margin19,20 were included for assessment, and the whole-slide sTIL score was calculated in a similar way as in the AI-F method with an additional step of summing up all eligible areas.
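The slide-level aggregation can be sketched as follows, assuming each FOV has already been reduced to a MIC-occupied area and a stromal area. The tuple layout, the `in_margin` flag standing in for the invasive-margin check, and both function names are hypothetical.

```python
def grid_fovs(width, height, fov):
    """Return (x, y, w, h) tiles of a non-overlapping grid covering the slide."""
    return [
        (x, y, min(fov, width - x), min(fov, height - y))
        for y in range(0, height, fov)
        for x in range(0, width, fov)
    ]


def slide_score(fov_results):
    """Pool areas over eligible FOVs, then take one ratio for the whole slide.

    fov_results: list of (mic_area, stroma_area, in_margin) tuples.
    """
    mic = sum(m for m, s, ok in fov_results if ok)
    stroma = sum(s for m, s, ok in fov_results if ok)
    return 0.0 if stroma == 0 else 100.0 * mic / stroma


tiles = grid_fovs(width=5, height=3, fov=2)

# Invented per-FOV areas in pixels; the last FOV lies outside the invasive margin.
results = [(10, 100, True), (50, 300, True), (90, 100, False)]
wsi_score = slide_score(results)
print(len(tiles), wsi_score)
```

Pooling areas before dividing weights each FOV by its stromal content, so a FOV with little stroma cannot dominate the slide-level score the way it could if per-FOV percentages were simply averaged.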
Statistical analysis
The ICC was employed for the concordance study with the following settings: two-way random effects, single measures, and absolute agreement. Using 95% CI, ICC values were categorized as follows: <0.5 indicated poor correlation, between 0.5 and 0.75 represented moderate correlation, between 0.75 and 0.9 reflected good correlation, and >0.9 indicated excellent correlation.32 The AUC was used to assess the accuracy of predicting pCR. All statistical analyses were conducted using SPSS 26.0 (IBM Corp., Armonk, NY) and GraphPad Prism 8.0.1 (GraphPad Software, La Jolla, CA).
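For reference, the ICC form specified above (two-way random effects, single measures, absolute agreement, often written ICC(2,1)) can be computed from the two-way ANOVA mean squares as sketched below. This is an independent illustration rather than the SPSS procedure used in the study. The function names are hypothetical, the example ratings are the classic six-subject, four-rater table commonly used to illustrate ICC forms, and the handling of values exactly at a category boundary is an assumption.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, single measures, absolute agreement.

    ratings: list of n subjects, each a list of k rater scores.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols                     # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def icc_category(icc):
    """Map an ICC value to the four concordance levels used in the text."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Classic textbook ratings: 6 subjects scored by 4 raters.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
value = icc_2_1(example)
print(round(value, 2), icc_category(value))
```

Because absolute agreement is required, systematic offsets between raters (the between-rater mean square) lower the coefficient, which is why this form is stricter than a consistency-type ICC.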
Results
Concordance analysis for traditional visual assessment, RC-assisted and AI-assisted methods in RS1
The full sTIL scores from 12 pathologists for 50 images in RS1 are shown in Supplementary Figure S4, available at https://doi.org/10.1016/j.esmoop.2025.105095. The ICC for VA was 0.730 (95% CI 0.644-0.812), indicating moderate concordance. For the RC-assisted interpretations, the ICC improved to 0.834 (95% CI 0.772-0.889), reaching the good concordance level. By employing the RC-AS method, the senior group achieved the highest ICC value of 0.933 (95% CI 0.896-0.959), reflecting excellent concordance. Furthermore, improvements were observed across the intermediate and junior groups, with ICC rising from 0.745 (95% CI 0.638-0.832) to 0.857 (95% CI 0.789-0.909) and from 0.625 (95% CI 0.498-0.742) to 0.730 (95% CI 0.624-0.820), respectively. For the AI-assisted method applied to the 50 samples, the ICC value reached 0.966 (95% CI 0.951-0.978). These results demonstrated that all three groups benefited from the AI-AS approach.
Analysis of visual assessment in FOVs with challenge factors in RS2
The concordance results in the hierarchical analysis of challenge factors are shown in Figure 3. By VA, the overall ICC for FOVs with heterogeneity and stromal factors decreased to 0.594 (95% CI 0.503-0.674). Heterogeneity led to a lower ICC value than stromal factors: the ICC was 0.592 (95% CI 0.499-0.677) for the heterogeneity analysis and 0.603 (95% CI 0.469-0.718) for stromal region division. The senior, intermediate, and junior groups all achieved ‘moderate’ concordance in terms of heterogeneity. As to the stromal factors, the senior and junior groups had ‘good’ concordance (Figure 3A).
Figure 3.
Concordance results in the hierarchical analysis of challenge factors. Inter-pathologist concordance in VA (A) and reference card-assisted interpretation (B). Heter: heterogeneity; OC, overall challenge factors (heterogeneity and stromal factors); Stro, all challenges associated with delineating the stromal regions.
We further analyzed the consistency of the pathologist assessment without assistance compared with the gold standard. Heterogeneity resulted in ‘poor’ concordance of ICC for 6/12 pathologists (two intermediate and four junior pathologists). Furthermore, 8/12 pathologists achieved the ‘poor’ concordance with the stromal factors. The visual fields with the greatest absolute differences from the gold standard (approximately the top 10% of all FOVs) were identified. It was observed that the primary contributing factors were the presence of variably spaced clusters of cancer cells surrounded by dense tight lymphocytic infiltrate, separated by collagenous stroma with sparse infiltrate, and limited interstitial space (Supplementary Figure S5, available at https://doi.org/10.1016/j.esmoop.2025.105095).
RC-assisted method improved pathologists’ concordance in FOVs with challenge factors
The largest improvement in ICC with RC assistance was observed in heterogeneity analysis, where the score increased from 0.592 (95% CI 0.499-0.677) under VA to 0.808 (95% CI 0.746-0.857) with RCs, elevating the concordance level from ‘moderate’ to ‘good’. Specifically, the ICC on heterogeneity improved from ‘moderate’ to ‘excellent’ concordance in the senior group, and from ‘moderate’ to ‘good’ in both intermediate and junior groups.
In the stromal factor analysis, the ICC reached 0.720 (95% CI 0.631-0.798), although the improvement was less pronounced compared with heterogeneity. Both senior and intermediate groups achieved improvements in ICC. Interestingly, the junior group achieved a concordance ICC of 0.809 (95% CI 0.742-0.866) under VA, surpassing the intermediate group’s ICC of 0.635 (95% CI 0.453-0.763). However, with the assistance of RCs, the junior group’s concordance decreased to 0.742 (95% CI 0.657-0.816), transitioning from ‘good’ to ‘moderate’ (Figure 3B).
This result shows that the RC-AS method assisted only some junior pathologists in understanding stromal factors. The specific impacts and reasons are detailed in the section ‘Efficacy of RC-AS and AI-AS in stromal factor subgroup analysis’, incorporating accuracy assessments within a subgroup analysis.
Accuracy verification of RC- and AI-assisted methods
With the assistance of RCs, the accuracy evaluation using ICC between individual pathologists’ interpretations and the gold standard showed notable improvement. In the heterogeneity analysis, ICC values for 6 out of 12 pathologists (namely two intermediate and four senior pathologists) demonstrated ‘good’ concordance or better. For stromal factors, 9 out of 12 pathologists (namely four senior, three intermediate, and two junior pathologists) achieved ‘moderate’ concordance or better (Figure 4A). Additionally, the average ICC for the three groups in terms of both heterogeneity and stromal factor FOVs indicated a statistically significant improvement in grades achieved (Figure 5).
Figure 4.
Accuracy evaluation results. (A) ICC values of individual pathologists’ assessments and the gold standard for heterogeneity and stromal factor FOVs, respectively: traditional visual assessment (VA) method (red dashed lines) and the reference card-assisted (RC-AS) method (blue dashed lines). (B) ICC values and 95% CI for RC-AS and AI-assisted methods, with concordance ranges (excellent, good, moderate, and poor) represented as bands of varying grayscale intensities. AI-AS, artificial intelligence-assisted; CI, confidence interval; FOVs, fields of view; ICC, intraclass correlation coefficient.
Figure 5.
Comparison of accuracy evaluation among different groups. Accuracy evaluation using ICC for pathologist groups with varying experience levels: comparison between VA (green) and RC-assisted method (red) concerning heterogeneity (A) and stromal factors (B). ICC, intraclass correlation coefficient; RC, reference card; VA, visual assessment.
Using the AI-assisted method, the ICC for challenge factors in RS2 was significantly higher, reaching 0.906 (95% CI 0.865-0.933), compared with VA. For both heterogeneity and stromal factors, AI-AS demonstrated at least ‘good’ concordance with the gold standard, achieving ICC values of 0.930 (95% CI 0.895-0.953) and 0.858 (95% CI 0.766-0.913), respectively (Figure 4B). Notably, these ICC values for heterogeneity and stromal factor analyses were on par with those obtained by senior pathologists.
Efficacy of RC-AS and AI-AS in stromal factor subgroup analysis
Based on the findings on stromal factors in the section ‘RC-assisted method improved pathologists’ concordance in FOVs with challenge factors’, we further conducted a subgroup analysis on the role of RC-AS in aiding pathologists’ interpretations, with results demonstrated in Supplementary Figure S6, available at https://doi.org/10.1016/j.esmoop.2025.105095. This analysis divided all images of stromal factors into two subgroups: limited stroma within tumor (LS), and interference factors (IF). Results showed that in the IF subgroup, the RC-AS method improved both consistency and accuracy for all pathologists without altering their levels. In the LS subgroup, the consistency of intermediate pathologists increased from ‘moderate’ for VA to ‘good’ for RC-AS but that of junior pathologists decreased.
All pathologist levels saw improvements in accuracy, consistent with previous findings. These results in Supplementary Figure S6(b), available at https://doi.org/10.1016/j.esmoop.2025.105095, indicated that the RC-AS method primarily benefited pathologists in the LS subgroup, particularly for intermediate and junior pathologists. Due to the lack of image content featuring interference factors in RCs, the RC-AS method showed reduced efficacy in assisting pathologists. The AI-AS method also demonstrated superior accuracy in the LS subgroup, with ICC values of 0.895 (95% CI 0.786-0.948) compared with 0.793 (95% CI 0.627-0.890) in the IF subgroup.
Clinical performance verification of RC- and AI-assisted methods
The cohort consisted of 167 female patients diagnosed with TNBC, as summarized in Supplementary Table S1, available at https://doi.org/10.1016/j.esmoop.2025.105095. The median age at diagnosis was 49 years (mean 49.7 years, range 27-71 years). All cases were invasive ductal carcinoma (100.0%). Before surgery, all patients received the ACT regimen. The majority of tumors (59.9%) were classified as clinical stage T2 (cT2) with a size >2 cm at diagnosis, and 52.1% were high grade (G3). pCR was achieved in 47.3% (79/167) of patients.
The sTIL scores assessed visually by pathologists yielded an AUC of 0.775 (95% CI 0.704-0.845) in predicting pCR. In comparison, the AI-assisted interpretation demonstrated the highest AUC of 0.937 (95% CI 0.902-0.972), suggesting a marked improvement in predictive performance. The RC-AS method also produced promising results, falling between those of the VA and AI-assisted methods (Supplementary Figure S7, available at https://doi.org/10.1016/j.esmoop.2025.105095). Both the RC-assisted and AI-assisted methods thus showed better clinical performance than VA.
Discussion
The assessment of sTILs is subject to considerable variability, which raises significant concerns. The main contributing factors include the heterogeneity in lymphocyte distribution and the presence of diversified ‘impurities’ within the tumor stroma. These challenges can lead to variations in risk estimation, particularly in early TNBC, where accurate classification around the cut-off points is crucial for treatment decisions. While employing scoring and averaging techniques across multiple areas can improve consistency among pathologists, this approach comes at the cost of increased workload. Therefore, it is imperative to explore effective strategies to address the limitations of sTIL assessment, such as incorporating reference images or leveraging AI assistance.18,21,22,33,34
Multicenter studies highlighted substantial interobserver variability, with heterogeneity being one of the most prominent challenges perplexing pathologists.34,35 Our research confirmed these findings and provided an in-depth analysis of the factors contributing to this variability. Compared with stromal factors, heterogeneity was found to have a more significant impact on pathologists’ performance, particularly for those at junior and intermediate levels. This suggested that heterogeneity was more likely to induce variability in assessment, whereas stromal factors tended to maintain a relatively stable influence across different levels of professional experience. However, further validation of accuracy is necessary to confirm this stability in pathologists’ performance.
The accuracy of pathologists’ evaluations, especially across different levels of experience, was significantly influenced by both heterogeneity and stromal factors. Among these, stromal factors had a particularly profound impact on junior and intermediate pathologists. We observed that sparse lymphocytes infiltrating between widely spaced clusters of epithelial cells were the primary source of uncertainty. Pathologists tended to prioritize dense lymphocytic aggregates while overlooking the more dispersed regions, a conclusion also reported by Kos et al.34 Our study found that the majority of sTIL results in this situation were 40%-45%. Identifying high and low cut-off values remains a challenge for pathologists, and quantitative assessment tools can be valuable under such circumstances. The accuracy of pathologists’ assessments is particularly influenced by stromal factors, and this effect is more pronounced in those with less professional experience. Pathologists may lack a comprehensive understanding of certain challenge factors, leading to notable errors in delineating stromal regions, especially in cases with limited or irregular stroma involvement. Therefore, incorporating tools such as stromal outlines and hotspot markers is critical to improve the accuracy.
The comprehensive analysis of repeatability and accuracy revealed that the consistency among junior pathologists was at a ‘good’ level, significantly higher than the ‘moderate’ level observed in intermediate pathologists, particularly in relation to stromal regions. While the VA accuracy of junior pathologists remained in the ‘poor’ range, intermediate pathologists demonstrated higher accuracy, suggesting that junior pathologists require further development in their assessment abilities, particularly when it comes to addressing stromal region issues. The varying interpretations of challenge factors among intermediate pathologists can result in significant discrepancies in their assessments, leading to reduced reproducibility and moderate accuracy. Notably, an intermediate pathologist with exemplary performance (pathologist F) had greater exposure to BC cases and more opportunities to practice sTIL evaluation. This implies that targeted training with a substantial number of cases could be an effective approach to enhancing VA in the decision-making process, which could be considered for implementation in medical institutions lacking access to assisted technology.
The use of RCs significantly enhanced both reproducibility and accuracy among pathologists, particularly by mitigating the effects of heterogeneity. This improvement was especially marked among senior and certain intermediate pathologists, who apply their expertise in delineating stromal regions and use RCs to circumvent interference from heterogeneity, thereby achieving optimal VA accuracy. Two noteworthy findings emerged from this study: (i) despite the assistance of RCs, intermediate pathologists still exhibited the lowest reproducibility compared with their junior and senior colleagues; (ii) with RC assistance, junior pathologists showed reduced reproducibility in cases involving stromal challenges, although their accuracy was significantly enhanced. Our analysis revealed that while pathologists across all experience levels demonstrated substantial improvements in accuracy when addressing stromal issues, reproducibility did not improve correspondingly. This trend was not observed for heterogeneity. Consequently, it can be inferred that RCs primarily optimized the reproducibility and accuracy of the ‘numerator’ components, that is, the estimate of the lymphocytic infiltrate rather than the delineation of the stromal area. The limited reproducibility among intermediate and junior pathologists, even with the assistance of RCs, was particularly evident in cases involving stromal region challenges. This underscores the need for further enhancements beyond RCs to address stroma-related factors effectively.
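The distinction between the ‘numerator’ and the stromal region can be made concrete with a short Python sketch. It assumes the usual Working Group definition of the sTIL score as the percentage of stromal area occupied by mononuclear inflammatory cells; the function name and the pixel counts are hypothetical and serve only to show that two readers who agree on the lymphocytic area can still report different scores if they outline the stroma differently.

```python
def stil_score(lymphocyte_area: float, stromal_area: float) -> float:
    """Return the sTIL score in percent.

    lymphocyte_area: area covered by mononuclear inflammatory cells
                     (the 'numerator'), in arbitrary units such as pixels.
    stromal_area:    delineated tumor-associated stromal area
                     (the 'denominator'), in the same units.
    """
    if stromal_area <= 0:
        raise ValueError("stromal_area must be positive")
    return 100.0 * lymphocyte_area / stromal_area


# Hypothetical field of view: both readers agree on the lymphocytic area,
# but reader B outlines only half as much stroma as reader A.
lymphocytes = 1200
reader_a = stil_score(lymphocytes, stromal_area=6000)
reader_b = stil_score(lymphocytes, stromal_area=3000)
print(reader_a, reader_b)
```

In this invented case the same numerator yields 20% for reader A and 40% for reader B, which is the kind of discrepancy that a reference for the numerator alone cannot remove.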
In light of these issues, this study highlights the potential of AI technology to address limitations in sTIL evaluation. Using deep learning-based algorithms, the proposed method identifies lymphocytic components and segments tumor-associated stroma to enable automated sTIL scoring. A key contribution is the incorporation of stromal region labeling, which overcomes the limitations of the RCs. By partitioning entire sections into regions, this approach helps pathologists to identify meaningful lymphoid components more efficiently. The accuracy of the AI-assisted method remained stable irrespective of heterogeneity or stromal factors. Of particular note is its ability to markedly attenuate the impact of heterogeneity, approaching an ‘excellent’ level of accuracy. Even in challenging cases involving limited or irregular stroma, traditionally a major impediment to pathologists’ VA, the AI-AS method achieved ‘good’ accuracy and significantly improved the pathologists’ VA accuracy. This method helps pathologists with varying levels of professional experience to mitigate the interference posed by challenge factors, and it surpassed the performance of intermediate and junior pathologists relying solely on RCs. Moreover, in our analysis of pCR prediction, RC and AI-AS technology surpassed conventional VA, further underscoring their predictive value for decision-making support.
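For readers less familiar with the AUC used in the pCR comparison, the following Python sketch shows how it can be computed from sTIL scores and binary pCR labels as the probability that a randomly chosen pCR case scores higher than a randomly chosen non-pCR case (ties count as one half). This rank-based formulation is a standard equivalent of the area under the ROC curve, not necessarily the software routine used in the study; the function name is hypothetical and the scores are invented.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    scores: sTIL scores (in %), one per case.
    labels: 1 if the case achieved pCR, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count pairs in which the pCR case outranks the non-pCR case.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented scores for four cases; the last two achieved pCR.
labels = [0, 0, 1, 1]
method_1 = [10, 30, 50, 70]   # separates the two groups completely
method_2 = [10, 50, 30, 70]   # one non-pCR case outranks one pCR case
print(roc_auc(method_1, labels), roc_auc(method_2, labels))
```

With these invented values the first scoring method reaches an AUC of 1.0 and the second 0.75, showing how a less accurate score for a single case lowers the discrimination measure.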
Although our study indicates that AI can help pathologists to improve on traditional interpretation methods, we acknowledge its limitations. First, using immunohistochemical LCA staining as the gold standard might bias the algorithm toward higher scores, because the range of cells labeled by LCA is slightly larger than that recommended by the Working Group. Furthermore, re-staining slides can result in antigen loss and may affect the accuracy of secondary confirmation by immunohistochemistry. Second, we excluded cases with technical deficiencies (e.g. histological artifacts secondary to crushing) and mucinous tumors, which means that the proposed AI-AS technology offers no demonstrated advantage in these settings. In clinical practice, pathologists should pre-screen for intact, morphologically assessable areas. Third, AI-assisted technology achieved lower accuracy in the presence of stromal interference factors, for example peritumoral retraction clefting, histiocyte response, cholesterol deposition, and lipofuscin deposition. The ability of AI to handle such interference factors needs to be improved. Furthermore, axially sectioned fibroblasts, apoptotic nuclei, and perinuclear clearing can mimic lymphocytes, potentially leading to misclassification by the AI algorithm.34 In future work, we will (i) test and enhance the generalizability of the sTIL interpretation algorithm on multicenter data, and (ii) further improve the models to differentiate cell types more accurately and to segment a larger number of tissue types.
The standardized assessment of sTILs is an important task in pathology. However, pathologists’ assessment is hindered by heterogeneity and stromal factors. To address these challenges, we presented RC- and AI-assisted methods to support pathologists. The results showed a benefit for pathologists with varying levels of experience, which may provide valuable evidence for clinical practice.
Acknowledgements
The authors would like to sincerely thank the pathologists Yan Ding, Xu Wang, and Fang Li, who did not participate in the interpretation study, for their assistance in establishing the gold standard. The authors also thank the pathologists Lei Wang, Huiyan Deng, Meng Yue, Yao Liu, Lijing Cai, Kun Wang, Zhanli Jia, Chang Liu, Jiuyan Shang, and Xuemei Sun, who participated in the ring study.
Funding
This work was supported by the Hebei Natural Science Foundation [grant number H2024206504].
Disclosure
The authors have declared no conflicts of interest.
Data sharing
Data analyzed in this study are available from the corresponding author on reasonable request.
Supplementary data
References
- 1. Loi S., Drubay D., Adams S., et al. Tumor-infiltrating lymphocytes and prognosis: a pooled individual patient analysis of early-stage triple-negative breast cancers. J Clin Oncol. 2019;37(7):559–569. doi: 10.1200/JCO.18.01010.
- 2. Leon-Ferre R.A., Polley M.Y., Liu H., et al. Impact of histopathology, tumor-infiltrating lymphocytes, and adjuvant chemotherapy on prognosis of triple-negative breast cancer. Breast Cancer Res Treat. 2018;167(1):89–99. doi: 10.1007/s10549-017-4499-7.
- 3. de Jong V.M.T., Wang Y., Ter Hoeve N.D., et al. Prognostic value of stromal tumor-infiltrating lymphocytes in young, node-negative, triple-negative breast cancer patients who did not receive (neo)adjuvant systemic therapy. J Clin Oncol. 2022;40(21):2361–2374. doi: 10.1200/JCO.21.01536.
- 4. Denkert C., von Minckwitz G., Darb-Esfahani S., et al. Tumour-infiltrating lymphocytes and prognosis in different subtypes of breast cancer: a pooled analysis of 3771 patients treated with neoadjuvant therapy. Lancet Oncol. 2018;19(1):40–50. doi: 10.1016/S1470-2045(17)30904-X.
- 5. Dieci M.V., Criscitiello C., Goubar A., et al. Prognostic value of tumor-infiltrating lymphocytes on residual disease after primary chemotherapy for triple-negative breast cancer: a retrospective multicenter study. Ann Oncol. 2014;25(3):611–618. doi: 10.1093/annonc/mdt556.
- 6. Demaria S., Volm M.D., Shapiro R.L., et al. Development of tumor-infiltrating lymphocytes in breast cancer after neoadjuvant paclitaxel chemotherapy. Clin Cancer Res. 2001;7(10):3025–3030.
- 7. Ladoire S., Arnould L., Apetoh L., et al. Pathologic complete response to neoadjuvant chemotherapy of breast carcinoma is associated with the disappearance of tumor-infiltrating foxp3+ regulatory T cells. Clin Cancer Res. 2008;14(8):2413–2420. doi: 10.1158/1078-0432.CCR-07-4491.
- 8. Luen S.L., Salgado R., Loi S. Residual disease and immune infiltration as a new surrogate endpoint for TNBC post neoadjuvant chemotherapy. Oncotarget. 2019;10(45):4612–4614. doi: 10.18632/oncotarget.27081.
- 9. Loi S., Michiels S., Salgado R., et al. Tumor infiltrating lymphocytes are prognostic in triple negative breast cancer and predictive for trastuzumab benefit in early breast cancer: results from the FinHER trial. Ann Oncol. 2014;25(8):1544–1550. doi: 10.1093/annonc/mdu112.
- 10. Dieci M.V., Conte P., Bisagni G., et al. Association of tumor-infiltrating lymphocytes with distant disease-free survival in the ShortHER randomized adjuvant trial for patients with early HER2+ breast cancer. Ann Oncol. 2019;30(3):418–423. doi: 10.1093/annonc/mdz007.
- 11. Conte P., Frassoldati A., Bisagni G., et al. Nine weeks versus 1 year adjuvant trastuzumab in combination with chemotherapy: final results of the phase III randomized Short-HER study. Ann Oncol. 2018;29(12):2328–2333. doi: 10.1093/annonc/mdy414.
- 12. Dieci M.V., Miglietta F., Guarneri V. Immune infiltrates in breast cancer: recent updates and clinical implications. Cells. 2021;10(2):223. doi: 10.3390/cells10020223.
- 13. Valenza C., Taurelli Salimbeni B., Santoro C., Trapani D., Antonarelli G., Curigliano G. Tumor infiltrating lymphocytes across breast cancer subtypes: current issues for biomarker assessment. Cancers (Basel). 2023;15(3):767. doi: 10.3390/cancers15030767.
- 14. Chesney J., Lewis K.D., Kluger H., et al. Efficacy and safety of lifileucel, a one-time autologous tumor-infiltrating lymphocyte (TIL) cell therapy, in patients with advanced melanoma after progression on immune checkpoint inhibitors and targeted therapies: pooled analysis of consecutive cohorts of the C-144-01 study. J Immunother Cancer. 2022;10(12). doi: 10.1136/jitc-2022-005755.
- 15. Yousif M., van Diest P.J., Laurinavicius A., et al. Artificial intelligence applied to breast pathology. Virchows Arch. 2022;480(1):191–209. doi: 10.1007/s00428-021-03213-3.
- 16. Ugolini F., De Logu F., Iannone L.F., et al. Tumor-infiltrating lymphocyte recognition in primary melanoma by deep learning convolutional neural network. Am J Pathol. 2023;193(12):2099–2110. doi: 10.1016/j.ajpath.2023.08.013.
- 17. Cai M., Zhao K., Wu L., et al. Artificial intelligence-based analysis of tumor-infiltrating lymphocyte spatial distribution for colorectal cancer prognosis. Chin Med J (Engl). 2024;137(4):421–430. doi: 10.1097/CM9.0000000000002964.
- 18. Amgad M., Stovgaard E.S., Balslev E., et al. Report on computational assessment of tumor infiltrating lymphocytes from the International Immuno-Oncology Biomarker Working Group. NPJ Breast Cancer. 2020;6:16. doi: 10.1038/s41523-020-0154-2.
- 19. Salgado R., Denkert C., Demaria S., et al. The evaluation of tumor-infiltrating lymphocytes (TILs) in breast cancer: recommendations by an International TILs Working Group 2014. Ann Oncol. 2015;26(2):259–271. doi: 10.1093/annonc/mdu450.
- 20. Hendry S., Salgado R., Gevaert T., et al. Assessing tumor-infiltrating lymphocytes in solid tumors: a practical review for pathologists and proposal for a standardized method from the International Immunooncology Biomarkers Working Group: part 1: assessing the host immune response, TILs in invasive breast carcinoma and ductal carcinoma in situ, metastatic tumor deposits and areas for further research. Adv Anat Pathol. 2017;24(5):235–251. doi: 10.1097/PAP.0000000000000162.
- 21. Dieci M.V., Radosevic-Robin N., Fineberg S., et al. Update on tumor-infiltrating lymphocytes (TILs) in breast cancer, including recommendations to assess TILs in residual disease after neoadjuvant therapy and in carcinoma in situ: a report of the International Immuno-Oncology Biomarker Working Group on Breast Cancer. Semin Cancer Biol. 2018;52(Pt 2):16–25. doi: 10.1016/j.semcancer.2017.10.003.
- 22. Denkert C., Wienert S., Poterie A., et al. Standardized evaluation of tumor-infiltrating lymphocytes in breast cancer: results of the ring studies of the international immuno-oncology biomarker working group. Mod Pathol. 2016;29(10):1155–1164. doi: 10.1038/modpathol.2016.109.
- 23. Amgad M., Elfandy H., Hussein H., et al. Structured crowdsourcing enables convolutional segmentation of histology images. Bioinformatics. 2019;35(18):3461–3467. doi: 10.1093/bioinformatics/btz083.
- 24. Editorial. The future of cancer genomics. Nat Med. 2015;21(2):99. doi: 10.1038/nm.3801.
- 25. Tomczak K., Czerwińska P., Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol. 2015;2015(1):68–77. doi: 10.5114/wo.2014.47136.
- 26. Koohbanani N.A., Jahanifar M., Tajadin N.Z., Rajpoot N. NuClick: a deep learning framework for interactive segmentation of microscopic images. Med Image Anal. 2020;65. doi: 10.1016/j.media.2020.101771.
- 27. Ruifrok A.C., Johnston D.A. Quantification of histochemical staining by color deconvolution. Anal Quant Cytol Histol. 2001;23(4):291–299.
- 28. Chen H., Qi X., Yu L., Dou Q., Qin J., Heng P.-A. DCAN: deep contour-aware networks for object instance segmentation from histology images. Med Image Anal. 2017;36:135–146. doi: 10.1016/j.media.2016.11.004.
- 29. Yang L., Zhang Y., Chen J., Zhang S., Chen D.Z. Suggestive annotation: a deep active learning framework for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Quebec City, Quebec, Canada: Springer; 2017. pp. 399–407.
- 30. Cireşan D.C., Giusti A., Gambardella L.M., Schmidhuber J. Mitosis detection in breast cancer histology images with deep neural networks. In: Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2013: 16th International Conference, Nagoya, Japan, September 22-26, 2013, Part II. 2013.
- 31. Milletari F., Navab N., Ahmadi S.-A. V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV); 2016.
- 32. Koo T.K., Li M.Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155–163. doi: 10.1016/j.jcm.2016.02.012.
- 33. Sun P., He J., Chao X., et al. A computational tumor-infiltrating lymphocyte assessment method comparable with visual reporting guidelines for triple-negative breast cancer. EBioMedicine. 2021;70. doi: 10.1016/j.ebiom.2021.103492.
- 34. Kos Z., Roblin E., Kim R.S., et al. Pitfalls in assessing stromal tumor infiltrating lymphocytes (sTILs) in breast cancer. NPJ Breast Cancer. 2020;6:17. doi: 10.1038/s41523-020-0156-0.
- 35. Van Bockstal M.R., François A., Altinay S., et al. Interobserver variability in the assessment of stromal tumor-infiltrating lymphocytes (sTILs) in triple-negative invasive breast carcinoma influences the association with pathological complete response: the IVITA study. Mod Pathol. 2021;34(12):2130–2140. doi: 10.1038/s41379-021-00865-z.