Abstract
Purpose:
Although tumor-infiltrating lymphocyte (TIL) assessment has been acknowledged to have both prognostic and predictive importance in triple negative breast cancer (TNBC), it is subject to inter- and intra-observer variability that has prevented widespread adoption. Here we constructed a machine learning-based breast cancer TIL scoring approach and validated its prognostic potential in multiple TNBC cohorts.
Experimental Design:
Using the QuPath open source software, we built a neural-network classifier for tumor cells, lymphocytes, fibroblasts and “other” cells on hematoxylin-eosin (H&E) stained sections. We analyzed the classifier-derived TIL measurements with five unique constructed TIL variables. A retrospective collection of 171 TNBC cases was used as the discovery set to identify the optimal association of machine-read TIL variables with patient outcome. For validation, we evaluated a retrospective collection of 749 TNBC patients comprising four independent validation subsets.
Results:
We found that all five machine TIL variables had significant prognostic association with outcomes (p≤0.01 for all comparisons) but showed cell specific variation in validation sets. Cox regression analysis demonstrated that all five TIL variables were independently associated with improved overall survival after adjusting for clinicopathological factors including stage, age and histological grade (p≤0.003 for all analyses).
Conclusions:
TIL variables defined by a neural network-driven cell classifier were robust and independent prognostic factors in several independent validation cohorts of TNBC patients. These objective, open source TIL variables are freely available to download and can now be considered for testing in a prospective setting to assess clinical utility.
Keywords: Tumor-infiltrating lymphocytes (TILs), survival, triple negative breast cancer
Introduction
Recent clinical trials have demonstrated that host anti-tumor immunity, as measured by stromal tumor-infiltrating lymphocytes (sTILs), has clinical importance in primary triple negative breast cancer (TNBC) (1–4). Data from numerous studies have shown that increased sTIL levels are associated with favorable recurrence-free survival and better response to neoadjuvant treatment in early stage and metastatic TNBC (1,4–8). However, the clinical utility of sTILs is still limited in the daily practice of breast cancer patient care due to subjectivity and lack of standardization. Therefore, the International Immuno-Oncology Biomarker Working Group on Breast Cancer has undertaken efforts to standardize TIL assessments. They have introduced a TIL assessment guideline that has reached good but not perfect reproducibility in a series of international ring studies (9–12).
However, inter- and intra-observer variability could remain an issue in real-world clinical practice due to the difficulty and subjectivity of quantitatively evaluating histological features (13–15). Automated digital analysis using machine learning-derived algorithms could provide a solution to the problems of standardization and operator variance due to subjectivity (13). Recently developed machine learning-based TIL analysis algorithms have demonstrated prognostic potential. In these studies, TILs were analyzed within either the intra-tumoral stroma or the whole tumor region (16–21), mimicking how pathologists assess TILs, and each marker reflected distinct TIL spatial information in its given compartment. Moreover, only a few of the reported TIL scoring algorithms have been tested in TNBC cohorts (16,18,20) and essentially all are “black box”. That is, the algorithm cannot be translated back to numbers of specific cells or cell features, and the algorithms are neither open source nor easily implemented by current real-world regional pathology departments.
Here, we have used an open source software platform (QuPath) to build a machine learning-based breast cancer TIL scoring algorithm based on recognition of specific cell types. Mathematical operations can then be performed on the cell types to yield potentially meaningful TIL measurement variables not easily calculated by pathologists. We do this by first defining four cell types (tumor cells, TILs, fibroblast cells, and others) and then mathematically combining them to create variables that assess the proportion of TILs within different cell populations or the density of TILs within variable tissue compartments. The aim of our study was to identify and validate a transparent, accessible method for collection and use of a cell type-based variable that is operator independent (reproducible) and prognostic in TNBC cohorts.
Materials and Methods
Patient cohorts
Our retrospective collection of 920 TNBC patients included five independent cohorts: three from the Pathology Department of Yale School of Medicine, one from The Cancer Genome Atlas (TCGA) and one from the Skåne healthcare region in southern Sweden, based on data from the Swedish National Breast Cancer Quality Registry (NKBC) (Table 1). The WTS (whole tissue slides) Yale cohort consists of 171 breast cancer patients diagnosed between 1985 and 2012 with 66.1 months median follow-up. The TMA (Tissue Microarray) Yale1 cohort comprises 139 patients diagnosed between 1962 and 2006 with 63.8 months median follow-up. The TMA Yale2 cohort consists of 278 breast cancer patients diagnosed between 1981 and 2012 with 64.8 months median follow-up. The publicly available WTS-TCGA cohort comprises 116 patients operated on between 1996 and 2013 with 13 months median follow-up (https://portal.gdc.cancer.gov/repository, last accessed June 2020). The WTS-Sweden cohort consists of 216 patients enrolled in the prospective, observational, population-based SCAN-B study (ClinicalTrials.gov ID NCT02306096, PMID:29341157) between 2010 and 2015 with 49.7 months median follow-up and has been described elsewhere (22). In the Yale and TCGA cohorts, TNBC was defined as breast cancer with <1% of cells with IHC-staining for ER and PR and an IHC HER2-staining score < 2, or, for patients with IHC 2+, a non-amplified ISH-status. In Sweden, the definition of TNBC is a tumor with <10% of cells with IHC-staining for ER and PR (thus including tumors with 1-9% stained cells) and an IHC HER2-staining score < 2, or, for patients with IHC 2+, a non-amplified ISH-status. Representative tumor areas for the TMA Yale1 and TMA Yale2 sets were selected by pathologists based on H&E-stained slides. Tumor cores (each 0.6 mm in diameter) were punched and arrayed into a recipient block by the Yale Pathology Tissue Service (YPTS) facility.
In the TMA Yale1 cohort, all cases have duplicate TMA cores, while in the TMA Yale2 cohort, the number and the percentage of cases having one-fold, two-fold and three-fold core redundancy were 76 (27%), 124 (45%) and 78 (28%), respectively (Table S1). For the TMA cohorts, the average of multiple cores per tumor (0.57 mm2 per tumor) was analyzed. For whole tissue section slide sets, one whole slide per patient, selected by a pathologically trained research scientist, was used for this study. Average areas of assessment for WTS Yale, WTS TCGA and WTS Sweden were 51.2 mm2, 59 mm2 and 96.1 mm2, respectively (Table S2). Our classifier training set contained 97 TMA spots originating from 95 patients with breast cancer derived by random selection from multiple older Yale cohorts.
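As a rough plausibility check on the TMA area figure, the following sketch computes the area of one 0.6 mm core and the mean analyzed area per tumor from the redundancy counts given above. It assumes that every core is an intact circle that is analyzed in full and that the reported 0.57 mm2 is the summed core area per tumor; both are assumptions for illustration, not statements from the study, and the function names are hypothetical.

```python
import math

CORE_DIAMETER_MM = 0.6  # core diameter stated in the text


def core_area_mm2(diameter_mm=CORE_DIAMETER_MM):
    """Area of one circular TMA core in mm^2 (assumes an intact, fully analyzed core)."""
    return math.pi * (diameter_mm / 2) ** 2


def mean_area_per_tumor(redundancy_counts):
    """Mean summed core area per tumor.

    redundancy_counts maps cores-per-tumor to the number of tumors, e.g. {2: 139}.
    """
    total_tumors = sum(redundancy_counts.values())
    total_cores = sum(cores * n for cores, n in redundancy_counts.items())
    return core_area_mm2() * total_cores / total_tumors


# TMA Yale1: all 139 cases have duplicate cores.
yale1_area = mean_area_per_tumor({2: 139})
# TMA Yale2: 76, 124 and 78 cases with one, two and three cores.
yale2_area = mean_area_per_tumor({1: 76, 2: 124, 3: 78})

print(f"one core: {core_area_mm2():.3f} mm2")
print(f"TMA Yale1: {yale1_area:.2f} mm2 per tumor")
print(f"TMA Yale2: {yale2_area:.2f} mm2 per tumor")
```

Under these assumptions both cohorts average roughly two cores per tumor, which is in line with the reported per-tumor area.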
Table 1.
Clinicopathological information of discovery and validation sets
| Characteristic | Category | WTS Yale | TMA Yale1 | TMA Yale2* | WTS TCGA | WTS Sweden |
|---|---|---|---|---|---|---|
| | | n (%) | n (%) | n (%) | n (%) | n (%) |
| Cases | | 171 (100%) | 139 (100%) | 278 (100%) | 116 (100%) | 216 (100%) |
| Age | < 50 | 49 (28.7%) | 54 (38.8%) | 109 (39.2%) | 40 (34.5%) | 49 (22.7%) |
| | ≥ 50 | 92 (53.8%) | 85 (61.2%) | 151 (54.3%) | 76 (65.5%) | 167 (77.3%) |
| | NA | 30 (17.5%) | 0 (0.0%) | 18 (6.5%) | 0 (0.0%) | 0 (0.0%) |
| Race | White | 122 (71.3%) | 124 (89.2%) | 148 (53.2%) | 75 (65.0%) | |
| | African American | 38 (22.2%) | 9 (6.5%) | 60 (21.6%) | 33 (28.0%) | |
| | other | 9 (5.3%) | 1 (0.7%) | 13 (4.7%) | 3 (3.0%) | |
| | NA | 2 (1.2%) | 5 (3.6%) | 57 (20.5%) | 5 (4.0%) | 216 (100%) |
| Tumor size (cm) | Median, range | 2, 0.45-6.5 | 2.5, 0.5-12 | 2, 0.45-10 | | 2.1, 0.5-10 |
| | ≤ 2 | 74 (43.3%) | 53 (38.1%) | 130 (46.8%) | | 85 (39.4%) |
| | > 2 | 64 (37.4%) | 71 (51.1%) | 120 (43.2%) | | 123 (56.9%) |
| | NA | 33 (19.3%) | 23 (16.5%) | 28 (10.1%) | 116 (100%) | 8 (3.7%) |
| Histological grade | Well-DI** | 2 (1.2%) | 2 (1.4%) | 3 (1.1%) | | 0 (0.0%) |
| | Moderate-DI | 39 (22.8%) | 42 (30.2%) | 61 (21.9%) | | 22 (10.2%) |
| | Poor-DI | 122 (71.3%) | 49 (35.3%) | 190 (68.3%) | | 191 (88.4%) |
| | NA | 8 (4.7%) | 55 (39.6%) | 24 (8.6%) | 116 (100%) | 3 (1.4%) |
| Stage | I | 60 (35.1%) | 19 (13.7%) | 102 (36.7%) | 21 (18.0%) | 60 (27.8%) |
| | II | 78 (45.6%) | 51 (36.7%) | 128 (46.0%) | 72 (62.0%) | 49 (22.7%) |
| | III | 13 (7.6%) | 53 (38.1%) | 34 (12.2%) | 19 (16.0%) | 52 (24.1%) |
| | IV | 7 (4.1%) | 7 (5.0%) | 6 (2.2%) | 1 (1.0%) | 49 (22.7%) |
| | NA | 13 (7.6%) | 9 (6.5%) | 8 (2.9%) | 3 (3.0%) | 6 (2.8%) |
| Chemotherapy | Chemo | 56 (32.7%) | | 138 (49.6%) | 58 (50%) | 158 (73.1%) |
| | No chemo | 6 (3.5%) | | 31 (11.2%) | 58 (50%) | 55 (25.5%) |
| | NA | 109 (63.7%) | 139 (100%) | 109 (39.2%) | | 3 (1.4%) |
| Follow up (months) | OS median, range | 66.1, 0.5-233.1 | 63.8, 2.4-455.6 | 64.8, 3.3-338 | 13, 0.07-109 | 49.7, 1.9-84.8 |

* TMA Yale2 and WTS Yale sets have 65 overlapped cases which were collected from different tissue blocks and in either TMA or WTS format
** DI: differentiation
This study has complied with all relevant ethical regulations (Declaration of Helsinki, CIOMS, Belmont Report, U.S. Common Rule), and it was approved by the Yale Human Investigation Committee under protocol #9505008219, #0304025173 and #0003011706. Patients in each cohort provided informed consent or (especially for older tissues) the tissue was obtained through Yale Human Investigation Committee protocol #9505008219, #0304025173 and #0003011706 which allows waiver of consent in some cases. The data were analyzed anonymously. The SCAN-B study was approved by the Regional Ethical Review Board in Lund, Sweden (applicable registration numbers 2009/658, 2015/277, 2016/742, 2018/267, and 2019/01252 for this study) as outlined in (22). All patients provided written informed consent prior to enrolment. The clinicopathological characteristics of all study cohorts are listed in Table 1.
Digital-image analysis
In the Yale cohorts, the Aperio ScanScope CS2 platform (Leica Biosystems, Wetzlar, Germany) was used to scan H&E-stained slides at 20x with a pixel size of 0.4986 μm x 0.4986 μm. WTS TCGA images were downloaded from the NIH GDC portal specimen repository (23) (https://portal.gdc.cancer.gov/repository). In the WTS Sweden cohort, H&E-stained slides were digitized using the NanoZoomer 2.0-HT (Hamamatsu Photonics K.K., Hamamatsu, Japan) platform at 20x, with a pixel size of 0.4537 μm x 0.4537 μm. The QuPath open-source software platform (version 0.1.2) was used to build an automated TIL scoring algorithm (24,25). As the date of H&E staining varied both between and within cohorts, we refined the H&E stain estimates for each digitized slide using the “estimate stain vectors” function in QuPath. Watershed cell detection was used (25) to segment the cells in the image with the following settings: detection image: hematoxylin OD; requested pixel size: 0.5 μm; background radius: 8 μm; median filter radius: 0 μm; sigma: 1.5 μm; minimum cell area: 10 μm2; maximum cell area: 400 μm2; threshold: 0.1; maximum background intensity: 2; cell expansion: 5 μm. The quality control of the cell segmentation was performed by two pathologists (DR and BA). In order to classify detected cells into tumor cells, immune cells (TILs), fibroblast cells, and others (false detections, background) (Fig. S1), we used a neural network with eight hidden layers (maximum iterations: 100) as the machine-learning method. The features used in the classification were previously described (21). In order to help the algorithm perform an accurate classification, we also added smoothed object features at 25 μm and 50 μm radius to supplement the existing measurements of individual cells. Multiple rounds of cell classification review and correction were required to achieve the most accurate algorithm on the classifier training set, resulting in an algorithm named “CNN11”.
Complete step-by-step instructions and the CNN11 TIL algorithm are available at https://medicine.yale.edu/lab/rimm/.
Building breast cancer TIL quantification algorithm
A flowchart for the quantitative analysis of tissue images based on open-source software for the TIL assessment was established (Fig. 1). The stain vector was first estimated after uploading H&E images. This step is required to normalize different staining properties and batches and is image specific. This is followed by cell segmentation using standardized watershed cell detection parameters. Next, a cell classifier was trained using a neural network with tumor cell, TIL, fibroblast, and other (false detection, background) cell types, which are color coded by type. A temporary classifier was built and applied to the rest of the images in the classifier training set. After several rounds of cell classification review and correction, a trained classifier (CNN11) was locked down once it was considered to have satisfactory performance, as defined by pathologist assessment where most cells are correctly classified (Fig. 1A) (21). Note that no cell classifying algorithm achieves 100% success; rather, the correct classification is made for >95% of cells. As a result, after application of CNN11, the program delivers the number of cells in the image in each of four categories: 1) tumor cells, 2) TILs, 3) fibroblasts and 4) others.
Figure 1.

Flowchart from algorithm training and development to algorithm validation. A: The stain vector was first estimated after uploading H&E images. This is followed by cell segmentation using standardized watershed cell detection parameters. Next, a CNN was trained using a neural network with tumor cells, TILs, fibroblasts and other type or background cells, with color coding of each type. A temporary classifier was built and applied to the rest of the images in the classifier training set. After many rounds of cell classification review and correction, a trained classifier (CNN11) was locked once it was morphologically judged to be at least 95% accurate on most images. B: Application of the trained classifier resulted in TIL measurements, which were calculated as the following TIL variables: eTILs%, etTILs%, esTILs%, eaTILs (mm2) and easTILs (see methods section for definition of variables). Associations between TIL variables and patient outcome were identified in WTS Yale (discovery set) using the optimal cut-points determined by X-tile software. All TIL variables were subsequently tested in validation sets including TMA Yale1, TMA Yale2, WTS TCGA and WTS Sweden. C: Workflow explaining how TIL quantification is performed on an H&E whole tissue image. Tumor region definition is followed by stain vector estimation to normalize hematoxylin and eosin colors. Then, cell segmentation is performed using standardized watershed cell detection parameters, followed by cell classification using the trained classifier. Finally, TIL measurements are combined into the constructed variables. * Pathologist’s supervision is required.
For WTS sets, the tumor region annotation was defined based on the guideline introduced by the International Immuno-Oncology Biomarker Working Group (10) as follows: i) Include TILs within the borders of the invasive tumor, including both “central tumor” and “invasive margin”. ii) All mononuclear cells (including lymphocytes and plasma cells) should be scored, but polymorphonuclear leukocytes are excluded. iii) Exclude TILs at a distance outside of the tumor borders. iv) Exclude TILs around DCIS and normal lobules. v) Exclude areas with crush artifacts, necrosis and regressive hyalinization, as well as the previous core biopsy site. Quality control of the algorithm to classify detected cells was performed by two pathologists (DR and BA). Example workflow and images using the cell classification mask are shown in Fig. 1C and Fig. 2. We excluded areas with tissue artifacts, e.g., necrosis, before running the algorithm. The algorithm does not replace the pathologist, since the algorithm cannot select the correct area for analysis, nor can it eliminate common artifacts. Finally, the algorithm output is quality controlled after the cell assignment. In the validation sets, the tumor region annotations were defined by different pathologists (DR, BA, JH). Furthermore, the algorithm was run by different observers (YB and BA) in order to demonstrate user independence. We observed catastrophic segmentation failure in 1-2% of cases, which must be flagged for repeated analysis or eliminated if some unusual artifact is present that triggers failure. This is most common in invasive lobular carcinoma and, in a rare minority, in cases with a high proportion of intra-tumoral TILs (iTILs). Stromal TILs (sTILs) of both the WTS Yale and WTS Sweden cohorts were also visually assessed in the traditional manner by expert breast pathologists in the US (KC) and in Sweden (JH) according to the guidelines published by the International Immuno-Oncology Biomarker Working Group.
Figure 2.

Representative images of four sample cases showing the H&E images (A, C, E and G) and the cell classification masks (B, D, F and H). E and F: representative images of a sample case with inaccurate cell classification; these rare fields are ultimately censored. G and H: only invasive breast cancer regions were selected and analyzed. Color code of cell classification mask: tumor cells (red), TILs (purple), fibroblasts (green) and others (yellow). Scale bar from A to F: 20 μm; scale bar from G to H: 200 μm.
Construction of TIL variables
Tissue annotation measurements derived from the breast cancer classification algorithm CNN11 consist of: (1) assignment of each cell to one of the four cell types defined above, (2) annotation area (mm2) (whole tumor region), and (3) cumulative area of each cell type (mm2).
TIL measurements were analyzed in the following constructed variables (Fig. S1):
1. eTILs% = 100 * [# of TILs / (# of Tumor cells + # of TILs)]; representing the proportion of TILs relative to tumor cells plus TILs.
2. etTILs% = 100 * (# of TILs / # of Total cells); representing the proportion of TILs over all detected cells.
3. esTILs = 100 * [# of TILs / (# of Total cells - # of Tumor cells)]; representing the proportion of TILs among non-tumor (stromal) cells.
4. eaTILs (mm2) = # of TILs / Sum of tumor region areas analyzed (mm2); representing the density of TILs over the tumor region.
5. easTILs = 100 * [Sum of TIL area (mm2) / Stroma area (mm2)]; mimics the international TIL working group variable as read by pathologists.
Note that, for the fifth variable (easTILs), Stroma area (mm2) = Sum of tumor region areas analyzed (mm2) – Sum of tumor cell area (mm2). easTILs represents the density of TILs over the stroma area, which mimics the pathologist scoring of sTILs per instructions from the International Immuno-Oncology Biomarker Working Group on Breast Cancer (10). However, all the variables include iTILs in the measurements. Cases with high iTIL proportions were excluded from the analysis.
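The five definitions above can be sketched as a single function over the per-image classifier output. This is a minimal illustration, not the CNN11 implementation: the `CellSummary` container and its field names are hypothetical, and it assumes that "Total cells" counts all four classes, including "others".

```python
from dataclasses import dataclass


@dataclass
class CellSummary:
    """Hypothetical per-image summary of classifier output (names are illustrative)."""
    n_tumor: int
    n_til: int
    n_fibroblast: int
    n_other: int
    region_area_mm2: float      # sum of tumor region areas analyzed
    tumor_cell_area_mm2: float  # cumulative area of tumor cells
    til_area_mm2: float         # cumulative area of TILs


def til_variables(s):
    """Compute the five TIL variables from cell counts and areas."""
    # Assumption: total cells = all detected cells across the four classes.
    total = s.n_tumor + s.n_til + s.n_fibroblast + s.n_other
    stroma_area = s.region_area_mm2 - s.tumor_cell_area_mm2
    return {
        "eTILs%": 100 * s.n_til / (s.n_tumor + s.n_til),
        "etTILs%": 100 * s.n_til / total,
        "esTILs": 100 * s.n_til / (total - s.n_tumor),
        "eaTILs": s.n_til / s.region_area_mm2,  # cells per mm2
        "easTILs": 100 * s.til_area_mm2 / stroma_area,
    }


example = CellSummary(
    n_tumor=6000, n_til=2000, n_fibroblast=1500, n_other=500,
    region_area_mm2=2.0, tumor_cell_area_mm2=0.8, til_area_mm2=0.12,
)
for name, value in til_variables(example).items():
    print(f"{name}: {value:.1f}")
```

The example counts are invented for illustration. A production version would also need to guard against zero denominators, for instance an annotation that contains no non-tumor cells.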
Statistical analysis
Overall survival (OS) was defined as the elapsed time from the date of primary diagnosis of the tumor to the date of death from any cause, or to the date of last follow-up, at which patients still alive were censored. We visualized continuous data and their association with patient outcome using X-tile software (26). The statistically significant threshold (cut-point) of each TIL variable determined by X-tile software was then tested in validation sets (Fig. 1B) (26). Kaplan–Meier analysis supported by the log-rank test was performed with GraphPad Prism (GraphPad Software Inc., San Diego, CA) to assess prognostic potential. The Mann–Whitney test was used to investigate the association between the automated easTILs score and visual sTILs by pathologists. Spearman’s rho coefficient (r) between pathologist-read sTIL and automated easTIL scores was assessed. To test independent prognostic potential, multivariate Cox regression analysis was applied using JMP Pro 15 software (SAS Institute, Inc., Cary, NC). In all statistical analyses, the level of significance was set at p < 0.05.
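To make the survival analysis concrete, the sketch below implements a basic Kaplan–Meier estimator and splits patients into high and low groups at a given cut-point. It is a plain-Python illustration of the standard product-limit estimator, not the GraphPad Prism procedure used in the study; the function names and the toy data are hypothetical.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up in months; events: 1 = death, 0 = censored.
    Returns [(t, S(t))] at each distinct death time.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Patients censored at t are still counted as at risk at t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def km_by_group(scores, times, events, cutpoint):
    """Dichotomize at score >= cutpoint and estimate one curve per group."""
    groups = {"high": ([], []), "low": ([], [])}
    for s, t, e in zip(scores, times, events):
        key = "high" if s >= cutpoint else "low"
        groups[key][0].append(t)
        groups[key][1].append(e)
    return {key: kaplan_meier(t, e) for key, (t, e) in groups.items()}


# Toy data: eTILs% scores, months of follow-up, death indicator.
curves = km_by_group(
    scores=[10, 30, 5, 25, 40],
    times=[5, 8, 8, 12, 20],
    events=[1, 1, 0, 0, 1],
    cutpoint=18.2,
)
for group, curve in curves.items():
    print(group, [(t, round(s, 3)) for t, s in curve])
```

The `>=` comparison follows the "high (≥ cut-point)" convention used for the TIL variables in this study.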
Results
Development of the CNN11 TIL algorithm in the WTS Yale discovery cohort
Machine learning is often used to define a black-box algorithm associated with human TIL scores or outcome, without information related to the parameters of traditional pathology. Here we take a different approach, training the algorithm to define familiar cell types and then performing mathematical operations on the cell type counts, as described in the methods, Figure 1 and Fig. S1. Thus, from a single cell finding (segmentation) algorithm, called CNN11, we generate five candidate variables that we can test and compare, to select the optimal variable for future use. Note that this approach, even though automated, results in variables familiar to traditional pathology. The five variables were chosen as described above to test for prognostic value and potential future use as objective biomarkers.
Next, the numerical values for each variable were tested for association with patient outcome. This is no longer a training process, but rather an optimal cut-point discovery; thus, we obtained outcome information on a whole tissue section set of TNBC cases (WTS Yale set) that is independent from the CNN11 cell segmentation algorithm training set. This discovery set was used to find the optimal cut-point for each of the five TIL variables.
To test the variables for prognostic value in this discovery set, we used the X-tile software to visualize the association of each TIL variable with patient outcome at every possible cut-point. Note that we chose to try five variables from the start, but the number of variables that could be generated is much larger. Figure 3 shows the optimal cut-point of each variable and the unadjusted p-value obtained when the cohort is dichotomized at that cut-point. The cut-point, expressed as a percentage or cells/mm2, is inset in each plot. For example, patients with high eTILs% (≥ 18.2%) had a statistically significantly better overall survival (OS) rate compared to patients with low eTILs% (hazard ratio (HR): 0.35, 95% confidence interval (CI)=0.20-0.61, p=0.0002). Similar clinical associations were observed for the other four TIL variables in the WTS Yale cohort, each with its own optimal cut-point: etTILs% (threshold: ≥16.9%, HR=0.35, 95% CI=0.19-0.63, p=0.0005); esTILs (threshold: ≥57.4%, HR=0.35, 95% CI=0.18-0.65, p=0.001); eaTILs (mm2) (threshold: ≥1195.6 cells/mm2, HR=0.35, 95% CI=0.20-0.63, p=0.0005) and easTILs (threshold: ≥19.9%, HR=0.30, 95% CI=0.16-0.54, p<0.0001). Pathologist-read sTIL scores were also significantly linked to OS (threshold: ≥19.9%, HR=0.44, 95% CI=0.23-0.83, p=0.01) (Fig. 3).
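The idea of scanning every candidate cut-point can be illustrated with a two-group log-rank test. The sketch below is a generic approximation of optimal cut-point discovery, not the X-tile algorithm itself; the function names, the `min_group` parameter and the toy data are assumptions made for illustration. Because the best of many cut-points is selected, the resulting p-value is unadjusted and optimistic, which is why the cut-points were subsequently tested in independent validation sets.

```python
import math


def logrank_chi2(times, events, in_high):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    observed_minus_expected = 0.0
    variance = 0.0
    death_times = sorted({t for t, e in zip(times, events) if e})
    for t in death_times:
        n = sum(1 for x in times if x >= t)                       # at risk, both groups
        n1 = sum(1 for x, g in zip(times, in_high) if x >= t and g)  # at risk, high group
        d = sum(1 for x, e in zip(times, events) if x == t and e)    # deaths at t
        d1 = sum(1 for x, e, g in zip(times, events, in_high)
                 if x == t and e and g)                           # deaths at t, high group
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0


def logrank_p(chi2):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


def scan_cutpoints(scores, times, events, min_group=2):
    """Return (cutpoint, chi2, unadjusted p) for the best-separating cut-point."""
    results = []
    for cut in sorted(set(scores)):
        in_high = [s >= cut for s in scores]
        n_high = sum(in_high)
        if n_high < min_group or len(scores) - n_high < min_group:
            continue  # skip splits that leave a group too small
        chi2 = logrank_chi2(times, events, in_high)
        results.append((cut, chi2, logrank_p(chi2)))
    return max(results, key=lambda r: r[1]) if results else None


# Toy data in which higher scores tend to go with longer survival.
scores = [5, 8, 10, 12, 25, 30, 35, 40]
times = [6, 9, 12, 15, 60, 70, 80, 90]
events = [1, 1, 1, 1, 0, 0, 1, 0]
print(scan_cutpoints(scores, times, events))
```

In small samples the scan may favor unbalanced splits, which is one reason to enforce a minimum group size and to treat the selected cut-point as provisional until validated.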
Figure 3.

Identification of QuPath TIL prognostic role in discovery set (WTS Yale). Kaplan-Meier curves of overall survival (OS) in the WTS Yale discovery set by eTILs% dichotomized at the value of 18.2% (A), etTILs% dichotomized at the value of 16.9% (B), esTILs% dichotomized at the value of 57.4% (C), eaTILs (mm2) dichotomized at the value of 1195.6 cells/mm2 (D), easTILs dichotomized at the value of 19.9% (E) and pathologist sTILs% at 19.9% (F). Corresponding hazard ratios with 95% CI and P values are illustrated. Note that P values in this figure are not corrected for the multiple testing that occurs in optimal cut-point discovery.
Performance of the CNN11 TIL algorithm constructed variables in validation sets
Next, to test each of the five variables for prognostic value, we validated them on four completely independent cohorts: TMA Yale1, TMA Yale2, WTS TCGA and WTS Sweden. Using the TIL variable-specific cut-points defined above, all five variables significantly separated patients of the TMA Yale1 set into favorable and unfavorable prognostic subsets [eTILs%: HR=0.64, 95% CI=0.43-0.94, p=0.025; etTILs%: HR=0.51, 95% CI=0.32-0.81, p=0.004; esTILs: HR=0.48, 95% CI=0.25-0.89, p=0.02; eaTILs (mm2): HR=0.48, 95% CI=0.31-0.74, p=0.0009 and easTILs: HR=0.65, 95% CI=0.43-0.98, p=0.04] (Fig. S2). In the TMA Yale2 cohort, eTILs%, etTILs% and esTILs scores were significantly associated with OS, while eaTILs and easTILs scores were not [eTILs%: HR=0.43, 95% CI=0.26-0.69, p=0.0005; etTILs%: HR=0.47, 95% CI=0.28-0.77, p=0.003; esTILs: HR=0.42, 95% CI=0.24-0.76, p=0.004; eaTILs (mm2): HR=0.62, 95% CI=0.37-1.01, p=0.06; easTILs: HR=0.78, 95% CI=0.48-1.26, p=0.31] (Fig. S3), showing the inherently different properties of the variables.
In the clinical setting, TIL assessment is performed exclusively on whole tissue slides; thus, we applied the CNN11 algorithm variables to two external and independent WTS cohorts. Using the optimal cut-points derived from the discovery set on the WTS TCGA cohort, patients with high eTILs% had a significantly more favorable OS (eTILs%: HR=0.09, 95% CI=0.01-0.70, p=0.02) than the low eTILs% group. Similarly, patients with either high etTILs% or high eaTILs (mm2) had significantly favorable outcomes [etTILs%: HR=0.10, 95% CI=0.01-0.80, p=0.03; eaTILs (mm2): HR=0.10, 95% CI=0.01-0.76, p=0.03] (Fig. 4). In the WTS Sweden cohort, only easTILs was significantly linked to OS (easTILs: HR=0.54, 95% CI=0.31-0.92, p=0.02) (Fig. S4). The Swedish cohort had more high-stage patients than the TCGA or Yale cohorts, and this may explain the differential performance of the five TIL variables. Future studies will assess the interaction of these variables with tumor stage.
Figure 4.

Validation of QuPath TIL algorithms in WTS TCGA. Kaplan-Meier curves of overall survival (OS) in the WTS TCGA set by eTILs% dichotomized at the value of 18.2% (A), etTILs% dichotomized at the value of 16.9% (B), esTILs% dichotomized at the value of 57.4% (C), eaTILs (mm2) dichotomized at the value of 1195.6 cells/mm2 (D) and easTILs dichotomized at the value of 19.9% (E). Corresponding hazard ratios with 95% CI and P values are illustrated.
The eTILs% and etTILs% variables showed significant associations with clinical outcomes and were validated in three out of four validation sets. Meanwhile, the esTILs, eaTILs (mm2) and easTILs variables were validated in two out of four validation sets (Table S3). When all the validation sets were combined into a single cohort, all five TIL variables had significant associations with OS, with or without adjustment for staging status, age and histological grade, based on multivariate Cox regression analysis (Table 2).
Table 2.
Cox regression analysis of TIL variables for OS in combined validation sets
| Parameter | Univariate (N=749) HR | Univariate 95% CI | Univariate p | Multivariate* (N=529) HR | Multivariate* 95% CI | Multivariate* p |
|---|---|---|---|---|---|---|
| All validation cases | | | | | | |
| eTILs% low (<18.2%) | 1 | | | 1 | | |
| eTILs% high (≥18.2%) | 0.53 | 0.41-0.68 | <0.0001 | 0.53 | 0.39-0.70 | <0.0001 |
| etTILs% low (<16.9%) | 1 | | | 1 | | |
| etTILs% high (≥16.9%) | 0.51 | 0.39-0.67 | <0.0001 | 0.56 | 0.41-0.76 | 0.0002 |
| esTILs% low (<57.4%) | 1 | | | 1 | | |
| esTILs% high (≥57.4%) | 0.51 | 0.38-0.68 | <0.0001 | 0.49 | 0.35-0.70 | <0.0001 |
| eaTILs (mm2) low (<1195.6) | 1 | | | 1 | | |
| eaTILs (mm2) high (≥1195.6) | 0.55 | 0.42-0.72 | <0.0001 | 0.56 | 0.42-0.77 | 0.0003 |
| easTILs low (<19.9) | 1 | | | 1 | | |
| easTILs high (≥19.9) | 0.61 | 0.47-0.79 | 0.0002 | 0.57 | 0.42-0.78 | 0.0003 |

* multivariate analysis was adjusted by staging status (stage I&II: low; III&IV: high), age (<50 or ≥50) and histological grade
Multivariate Cox regression model analyses were run in order to test the independent prognostic potential of the machine-read algorithm variables adjusted for stage, age and histological grade. All of the CNN11-derived TIL algorithm variables remained significant with similar HR and overlapping CI values (HR≤0.57, 95% CI=0.35-0.78, p≤0.0003 for all comparisons) (Table 2). Of note, eTILs% and esTILs% appeared to be the two more robust markers (eTILs%: HR=0.53, 95% CI=0.39-0.70, p<0.0001; esTILs%: HR=0.49, 95% CI=0.35-0.70, p<0.0001) (Table 2). While it is statistically unsound to compare p-values, hazard ratios may be compared. As shown in Table 2, the hazard ratios are similar between all five TIL variables, but two, eTILs% and esTILs%, are consistently better performing algorithms.
Further analysis of the pathologist-read sTIL assessment in the WTS Sweden cohort revealed that patients with high sTILs had better outcomes than the low sTILs patient group (HR=0.52, 95% CI=0.30-0.90, p=0.02). This observation raises a general question about how TIL algorithm variables compare to pathologist reads. When we compared the CNN11-derived easTILs variable score with the pathologist-read sTILs assessment, a good correlation was observed in both the WTS Yale (Spearman’s r coefficient=0.61, p<0.0001) and WTS Sweden cohorts (Spearman’s r coefficient=0.63, p<0.0001). Further analysis showed that cases with high sTILs had significantly higher easTILs in both cohorts (Fig. S5). Finally, we also compared the performance of CNN11-derived variables on TMAs vs. WTS. We found a moderate correlation between TMA and WTS specimens (Table S4 and Fig. S6).
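For readers who wish to reproduce this kind of agreement check on their own paired scores, Spearman's coefficient can be computed as the Pearson correlation of average ranks. The following is a minimal plain-Python sketch with hypothetical function names and invented example scores; it is not the statistical software used in this study and it does not compute a p-value.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: pathologist-read sTILs% vs. automated easTILs scores.
pathologist_stils = [5, 10, 10, 30, 60]
automated_eastils = [4.2, 12.5, 9.8, 25.1, 41.0]
print(round(spearman_rho(pathologist_stils, automated_eastils), 3))
```

Average ranks matter here because pathologist reads are often recorded in coarse steps, so ties are common.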
Discussion
In recent years, TILs have been acknowledged to have both prognostic and predictive importance in patients with early or metastatic TNBC (1,4–8). However, pathologist-read TIL assessments can be a significant source of variability (14,15). Furthermore, adjacent microenvironmental cellular populations are not accounted for due to the difficulty in making such assessments. In an attempt to quantify TILs, computational studies have either focused on mimicking the guidelines introduced by the International Immuno-Oncology Biomarker Working Group (16–18,20) or generated black box algorithms (19,27). Our algorithm to define cell types (CNN11) is similarly black box. However, the output variables used as prognostic biomarkers are transparent cell type computations and thus more familiar to pathologists.
Over the last couple of years, different machine-learning approaches have been proposed to score anti-tumor immunity, resulting in a variety of TIL biomarkers with potential clinical applicability. Some of these machine-learning tools have been based on patch classification, while others mainly relied on object (cell) detection and classification. Another widely adopted approach is tissue pattern recognition, which distinguishes tissue regions and evaluates TILs in different tissue compartments (e.g., in intra-tumoral stroma) (16–20). However, many of these machine learning-derived TIL biomarkers lack broad validation, which is essential for clinical adoption. Furthermore, there is a need for studies comparing various TIL variables focusing on different spatial aspects. Here we showed that machine learning-based TIL scoring is able to provide comprehensive information on TILs in the TNBC tumor microenvironment that is not easily determined by pathologist assessment.
In our study, using the QuPath platform, we have developed a cell classifier to score TIL measurements, deriving five TIL variables (eTILs%, etTILs%, esTILs, eaTILs (mm2) and easTILs) representing the proportion of TILs in relation to cell counts in the whole tumor region and to the area of different tumor regions (e.g., intra-tumoral stroma) (21,24). The digital analysis approach we used is an unsupervised nuclei segmentation followed by a neural network-based machine-learning cell classification. The advantage of this concept is that it requires relatively small training sets. However, it has the limitation that segmentation sensitivity and classification performance are dependent on biological and technical image variation, which may lead to overfitting of the classifier to the training set. To address this limitation, we have validated the classifier in four independent TNBC cohorts that were retrieved from different institutions and varied in time of diagnosis and format, using both TMA and WTS formats. Furthermore, in our assessment protocol, we keep pathologist review in the loop, both to select the tumor and related stroma and to exclude catastrophic algorithm failures. Our results demonstrated the prognostic role of all five TIL variables based on successful validation in independent TMA and WTS sets. We have seen similar prognostic results with eTILs% in melanoma (21).
Although not proven in this pilot study, the potential clinical utility of objective TIL assessment is to enable clinicians to identify patients who might benefit from immunotherapy (28) or from de-escalation of therapy, where chemotherapy could be omitted in populations that are extremely unlikely to develop recurrent disease (29). Although our machine-read TIL scores with pre-defined cut-points were validated in independent cohorts, these cut-points are most likely not generally applicable, as they were not adjusted for stage, tumor type, or type of therapy. For this reason, we propose this machine-read score as a continuous variable; clinical utility studies adjusted for the aforementioned factors are needed to develop a specific cut-point for each indication. Similarly, the International Immuno-Oncology Biomarker Working Group has not recommended a generally applicable TIL threshold for clinical practice; however, in recently published studies, the thresholds at which sTILs had an impact on prognosis were between 10% and 30% (5,12). In our validation sets, after application of cut-points derived in the discovery set (WTS Yale set), we found optimal prognostic cut-points in the same general range, but with tighter variance, and with observer-independent reproducibility.
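The step of applying a cut-point fixed in a discovery set to a validation cohort, and then comparing survival between the resulting groups, can be sketched as follows. This is a minimal standard-library illustration, not the analysis pipeline used in this study: the function names, the toy follow-up data, and the cut-point value of 10.0 are all hypothetical, and the survival estimate is a plain Kaplan-Meier product-limit calculation.

```python
from typing import List, Sequence, Tuple


def dichotomize(scores: Sequence[float], cut_point: float) -> List[str]:
    """Label each score 'high' if it is >= cut_point, otherwise 'low'."""
    return ["high" if s >= cut_point else "low" for s in scores]


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is 1 for an observed death and 0 for censored follow-up.
    """
    curve: List[Tuple[float, float]] = []
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy validation cohort (invented values): TIL score, follow-up in years, event flag.
scores = [5.0, 12.5, 30.0, 8.0, 22.0]
times = [2.0, 3.0, 3.0, 5.0, 8.0]
events = [1, 1, 0, 1, 0]

CUT_POINT = 10.0  # hypothetical value, standing in for a discovery-set cut-point
groups = dichotomize(scores, CUT_POINT)

# Survival curve for the whole toy cohort, then one curve per group.
overall_curve = kaplan_meier(times, events)
curves = {
    label: kaplan_meier(
        [t for t, g in zip(times, groups) if g == label],
        [e for e, g in zip(events, groups) if g == label],
    )
    for label in ("high", "low")
}
```

Because the cut-point is passed in rather than re-optimized, the same value can be applied unchanged to each validation cohort, which is the property that distinguishes validation from a second round of discovery.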
This pilot effort has a number of potential limitations. One limitation of our work is algorithm assignment error as a potential pitfall in machine-read scoring. For example, false-positive TILs have been detected in areas containing apoptotic figures, neutrophils, and low-grade tumors with monotonously uniform nuclei. Furthermore, our model was not fine-tuned to distinguish iTILs from sTILs because iTILs account for only 1–3% of TILs in the vast majority of cases. These errors tend to be catastrophic in rare cases, requiring a pathologist to review the cases for exclusion or re-analysis. However, pathologist review and validation of cell assignment is a reasonable step in the process, since we believe computational algorithms, in the near term, will assist, not replace, pathologists. Our goal was not to build a fully automated TIL scoring application, but a computer-assisted, open source tool that might help pathologists to improve reproducibility. Therefore, a pathologist is still essential for quality control and systematic performance evaluation.
Perhaps the most significant limitation of this work is the fact that all of the cohorts are retrospective collections. As such, they have heterogeneous treatment and variable inclusion criteria. Furthermore, differences in tissue handling and H&E staining might further contribute to the variability between the cohorts. However, machine-read TIL scores showed prognostic performance in the validation sets despite the variability among the cohorts, which further supports their robustness. In this study, prognostic association was assessed without special consideration of administered therapy, as this varied between cohorts. To truly evaluate prognostic value, the rigor of a clinical trial is best. However, tissue and data from a clinical trial are a very precious and often limited resource. Thus, pilot discovery studies such as this one need to use retrospective cohorts for proof of concept.
Finally, a variable to mention, although not necessarily a limitation, is that the WTS Sweden cohort was scanned using a different brand of slide scanner (a NanoZoomer, rather than an Aperio). It was not possible to rescan all of the Swedish cohort with the Aperio scanner, and it is beyond the scope of this work to compare the effect of different slide scanners on the final results. However, this variable may generate variation that should be acknowledged as a potential limitation of this work and should be considered when interpreting the results seen in the Swedish cohort. Future efforts will compare slide scanning hardware.
In conclusion, we demonstrated that machine-learning derived TIL algorithm variables were significantly associated with outcomes in TNBC patients. They were also shown to be objective and independent prognostic factors in several validation cohorts. With further investigation in a clinical trial, we believe that this objective tool could be useful in the clinical setting for objective quantification of TILs.
Supplementary Material
Translational Relevance:
The presence of high TILs (tumor infiltrating lymphocytes) has been shown to be predictive of the response to chemotherapy and is also a prognostic factor associated with a better outcome in breast cancer, especially in early stage triple-negative (TNBC) and HER2-positive breast cancers. Despite efforts to standardize TIL assessment, the subjective nature and degree of variability in evaluation have prevented its broad adoption. Using QuPath open source software, we built an algorithm for H&E image-based automated assessment of TILs and invented a method for TIL assessment that is beyond human capability. Using one discovery set and four validation sets from three institutions, we found that machine-read TIL variables stratified patients with TNBC into favorable and poor prognosis cohorts, where higher TIL scores were significantly associated with better overall survival. This open source method of assessment is broadly accessible, and machine-read TIL scoring is now ready for consideration for prospective testing to prove clinical utility.
Acknowledgements:
This automated TIL algorithm research was supported by the Breast Cancer Research Foundation (David L. Rimm).
David L. Rimm was supported by the National Institutes of Health (NIH) Yale SPORE in Lung Cancer Career Development Program (NIH P50 CA196530).
We would like to thank the Yale Pathology Tissue Service TMA facility (YPTS) for its valuable contributions.
Johan Staaf was supported by The Governmental Funding of Clinical Research within the National Health Service (ALF) 2018/40612, The Swedish Cancer Society (CAN 2018/685) and a 2018 Senior Investigator Award (SIA190013), as well as the Mrs. Berta Kamprad Foundation (FBKS-2020-5-282).
Disclosure of Potential Conflict of Interest
David L. Rimm has served as an advisor for Astra Zeneca, Agendia, Amgen, BMS, Cell Signaling Technology, Cepheid, Daiichi Sankyo, Genoptix/Novartis, GSK, Konica Minolta, Merck, NanoString, PAIGE.AI, Perkin Elmer, Roche, Sanofi, Ventana and Ultivue. Astra Zeneca, Cepheid, NavigateBP, NextCure, Nanostring, Lilly, and Ultivue fund research in David L. Rimm’s lab. Ana Bosch has participated in Advisory Board meetings for Pfizer and Novartis and received a travel grant from Roche. Other authors have no potential conflicts of interest.
References
- 1. Adams S, Gray RJ, Demaria S, Goldstein L, Perez EA, Shulman LN, et al. Prognostic value of tumor-infiltrating lymphocytes in triple-negative breast cancers from two phase III randomized adjuvant breast cancer trials: ECOG 2197 and ECOG 1199. J Clin Oncol 2014;32(27):2959–66 doi 10.1200/JCO.2013.55.0491.
- 2. Dieci MV, Mathieu MC, Guarneri V, Conte P, Delaloge S, Andre F, et al. Prognostic and predictive value of tumor-infiltrating lymphocytes in two phase III randomized adjuvant breast cancer trials. Annals of Oncology 2015;26(8):1698–704 doi 10.1093/annonc/mdv239.
- 3. Loi S, Michiels S, Salgado R, Sirtaine N, Jose V, Fumagalli D, et al. Tumor infiltrating lymphocytes are prognostic in triple negative breast cancer and predictive for trastuzumab benefit in early breast cancer: results from the FinHER trial. Annals of Oncology 2014;25(8):1544–50 doi 10.1093/annonc/mdu112.
- 4. Loi S, Sirtaine N, Piette F, Salgado R, Viale G, Van Eenoo F, et al. Prognostic and predictive value of tumor-infiltrating lymphocytes in a phase III randomized adjuvant breast cancer trial in node-positive breast cancer comparing the addition of docetaxel to doxorubicin with doxorubicin-based chemotherapy: BIG 02-98. J Clin Oncol 2013;31(7):860–7 doi 10.1200/JCO.2011.41.0902.
- 5. Denkert C, von Minckwitz G, Darb-Esfahani S, Lederer B, Heppner BI, Weber KE, et al. Tumour-infiltrating lymphocytes and prognosis in different subtypes of breast cancer: a pooled analysis of 3771 patients treated with neoadjuvant therapy. Lancet Oncol 2018;19(1):40–50 doi 10.1016/S1470-2045(17)30904-X.
- 6. Savas P, Salgado R, Denkert C, Sotiriou C, Darcy PK, Smyth MJ, et al. Clinical relevance of host immunity in breast cancer: from TILs to the clinic. Nat Rev Clin Oncol 2016;13(4):228–41 doi 10.1038/nrclinonc.2015.215.
- 7. Pruneri G, Gray KP, Vingiani A, Viale G, Curigliano G, Criscitiello C, et al. Tumor-infiltrating lymphocytes (TILs) are a powerful prognostic marker in patients with triple-negative breast cancer enrolled in the IBCSG phase III randomized clinical trial 22-00. Breast Cancer Res Treat 2016;158(2):323–31 doi 10.1007/s10549-016-3863-3.
- 8. Adams S, Schmid P, Rugo HS, Winer EP, Loirat D, Awada A, et al. Pembrolizumab monotherapy for previously treated metastatic triple-negative breast cancer: cohort A of the phase II KEYNOTE-086 study. Ann Oncol 2019;30(3):397–404 doi 10.1093/annonc/mdy517.
- 9. Denkert C, Wienert S, Poterie A, Loibl S, Budczies J, Badve S, et al. Standardized evaluation of tumor-infiltrating lymphocytes in breast cancer: results of the ring studies of the international immuno-oncology biomarker working group. Mod Pathol 2016;29(10):1155–64 doi 10.1038/modpathol.2016.109.
- 10. Salgado R, Denkert C, Demaria S, Sirtaine N, Klauschen F, Pruneri G, et al. The evaluation of tumor-infiltrating lymphocytes (TILs) in breast cancer: recommendations by an International TILs Working Group 2014. Ann Oncol 2015;26(2):259–71 doi 10.1093/annonc/mdu450.
- 11. Hendry S, Salgado R, Gevaert T, Russell PA, John T, Thapa B, et al. Assessing Tumor-infiltrating Lymphocytes in Solid Tumors: A Practical Review for Pathologists and Proposal for a Standardized Method From the International Immunooncology Biomarkers Working Group: Part 1: Assessing the Host Immune Response, TILs in Invasive Breast Carcinoma and Ductal Carcinoma In Situ, Metastatic Tumor Deposits and Areas for Further Research. Adv Anat Pathol 2017;24(5):235–51 doi 10.1097/PAP.0000000000000162.
- 12. Loi S, Drubay D, Adams S, Pruneri G, Francis PA, Lacroix-Triki M, et al. Tumor-Infiltrating Lymphocytes and Prognosis: A Pooled Individual Patient Analysis of Early-Stage Triple-Negative Breast Cancers. J Clin Oncol 2019;37(7):559–69 doi 10.1200/JCO.18.01010.
- 13. Klauschen F, Muller KR, Binder A, Bockmayr M, Hagele M, Seegerer P, et al. Scoring of tumor-infiltrating lymphocytes: From visual estimation to machine learning. Semin Cancer Biol 2018;52(Pt 2):151–7 doi 10.1016/j.semcancer.2018.07.001.
- 14. Kos Z, Roblin E, Kim RS, Michiels S, Gallas BD, Chen W, et al. Pitfalls in assessing stromal tumor infiltrating lymphocytes (sTILs) in breast cancer. NPJ Breast Cancer 2020;6:17 doi 10.1038/s41523-020-0156-0.
- 15. Wein L, Savas P, Luen SJ, Virassamy B, Salgado R, Loi S. Clinical Validity and Utility of Tumor-Infiltrating Lymphocytes in Routine Clinical Practice for Breast Cancer Patients: Current and Future Directions. Front Oncol 2017;7:156 doi 10.3389/fonc.2017.00156.
- 16. Amgad M, Sarkar A, Srinivas C, Redman R, Ratra S, Bechert CJ, et al. Joint Region and Nucleus Segmentation for Characterization of Tumor Infiltrating Lymphocytes in Breast Cancer. Proc SPIE Int Soc Opt Eng 2019;10956 doi 10.1117/12.2512892.
- 17. Heindl A, Sestak I, Naidoo K, Cuzick J, Dowsett M, Yuan Y. Relevance of Spatial Heterogeneity of Immune Infiltration for Predicting Risk of Recurrence After Endocrine Therapy of ER+ Breast Cancer. J Natl Cancer Inst 2018;110(2) doi 10.1093/jnci/djx137.
- 18. Le H, Gupta R, Hou L, Abousamra S, Fassler D, Torre-Healy L, et al. Utilizing Automated Breast Cancer Detection to Identify Spatial Distributions of Tumor-Infiltrating Lymphocytes in Invasive Breast Cancer. Am J Pathol 2020;190(7):1491–504 doi 10.1016/j.ajpath.2020.03.012.
- 19. Saltz J, Gupta R, Hou L, Kurc T, Singh P, Nguyen V, et al. Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images. Cell Rep 2018;23(1):181–93 e7 doi 10.1016/j.celrep.2018.03.086.
- 20. Yuan Y, Failmezger H, Rueda OM, Ali HR, Graf S, Chin SF, et al. Quantitative image analysis of cellular heterogeneity in breast tumors complements genomic profiling. Sci Transl Med 2012;4(157):157ra43 doi 10.1126/scitranslmed.3004330.
- 21. Acs B, Ahmed FS, Gupta S, Wong PF, Gartrell RD, Sarin Pradhan J, et al. An open source automated tumor infiltrating lymphocyte algorithm for prognosis in melanoma. Nat Commun 2019;10(1):5440 doi 10.1038/s41467-019-13043-2.
- 22. Staaf J, Glodzik D, Bosch A, Vallon-Christersson J, Reutersward C, Hakkinen J, et al. Whole-genome sequencing of triple-negative breast cancers in a population-based clinical study. Nat Med 2019;25(10):1526–33 doi 10.1038/s41591-019-0582-4.
- 23. Cancer Genome Atlas Research N, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45(10):1113–20 doi 10.1038/ng.2764.
- 24. Bankhead P, Loughrey MB, Fernandez JA, Dombrowski Y, McArt DG, Dunne PD, et al. QuPath: Open source software for digital pathology image analysis. Sci Rep 2017;7(1):16878 doi 10.1038/s41598-017-17204-5.
- 25. Malpica N, de Solorzano CO, Vaquero JJ, Santos A, Vallcorba I, Garcia-Sagredo JM, et al. Applying watershed algorithms to the segmentation of clustered nuclei. Cytometry 1997;28(4):289–97.
- 26. Camp RL, Dolled-Filhart M, Rimm DL. X-tile: a new bio-informatics tool for biomarker assessment and outcome-based cut-point optimization. Clin Cancer Res 2004;10(21):7252–9 doi 10.1158/1078-0432.CCR-04-0713.
- 27. Corredor G, Wang X, Zhou Y, Lu C, Fu P, Syrigos K, et al. Spatial Architecture and Arrangement of Tumor-Infiltrating Lymphocytes for Predicting Likelihood of Recurrence in Early-Stage Non-Small Cell Lung Cancer. Clin Cancer Res 2019;25(5):1526–34 doi 10.1158/1078-0432.CCR-18-2013.
- 28. Luen SJ, Salgado R, Dieci MV, Vingiani A, Curigliano G, Gould RE, et al. Prognostic implications of residual disease tumor-infiltrating lymphocytes and residual cancer burden in triple-negative breast cancer patients after neoadjuvant chemotherapy. Ann Oncol 2019;30(2):236–42 doi 10.1093/annonc/mdy547.
- 29. Park JH, Jonas SF, Bataillon G, Criscitiello C, Salgado R, Loi S, et al. Prognostic value of tumor-infiltrating lymphocytes in patients with early-stage triple-negative breast cancers (TNBC) who did not receive adjuvant chemotherapy. Ann Oncol 2019;30(12):1941–9 doi 10.1093/annonc/mdz395.
