Abstract
Loss of organized structure is a hallmark of malignant transformation in breast cancer. Traditionally, such morphological features are captured by descriptive histological assessments, such as grade, that represent reliable diagnostic and prognostic determinants. Nonetheless, the predictive value of these semiquantitative approaches is limited by their subjective nature and the computational restrictions inherent to discrete integer-based scoring systems. Here, we describe an application of topological measurements and statistical modeling to derive continuous mathematical scores that quantitatively reflect the level of organized structure within human breast cancer tissues. This approach generated quantifiable biomarkers, assessable on a continuous scale, that predicted breast cancer survival. Compared to traditional biomarkers, these topology-based measurements showed higher prognostic accuracy with less variation associated with race and ethnicity. Integration of these biomarkers with gene expression data produced topology-derived gene signatures that predicted therapeutic response and uncovered gene regulatory networks linking metabolism with the breast cancer tumor microenvironment in racially diverse breast cancer cohorts. Overall, this study demonstrates the potential of spatial and topological biomarkers in breast cancer treatment and diagnosis. Application and adaptation of methods that quantify tumor architectural features to develop prognostic and predictive algorithms exemplify the immense future promise of defining linkages between biology, medicine, and mathematics.
Introduction
This year, it is predicted that there will be over 2 million newly diagnosed cases of cancer in the United States, and nearly 600,000 men, women, and children living with a cancer diagnosis will die from their disease (1). Among women, breast cancer is the second leading cause of death from cancer in the United States, with more than 300,000 newly diagnosed cases and a projected yearly mortality approaching 42,250 (1–3). Several advances have been made in the development of both diagnostic and predictive modalities aimed at the integration of diverse informative morphological, biochemical, and molecular biomarkers, with the collective goal of improving breast cancer diagnosis, survival prediction, and advanced estimates of the response to therapy (4, 5). Nonetheless, there remains a critical and unmet need for the further development of objective biomarkers that will enable more accurate and reliable diagnosis of disease, including assessment of disease risk, severity, and outcome. Current methods for studying tumor morphology in tissue sections utilize conventional histological stains that selectively distinguish tissue components based on their biochemical or molecular composition (e.g., nucleic acid, protein, or lipid). Other, more selective methods employ antibodies or hybridization probes to detect specific analytes whose coordinates within the tissue can be directly detected and amplified by a secondary antibody or complementary nucleic acid linked to an enzyme or fluorescent probe (6). These methods facilitate visualization through the detection of enzyme-catalyzed precipitation of a chromogen substrate, for brightfield detection, or fluorescent light emission in immunofluorescent assays. In each case, the presence, structure, and position of the analytes are visually conveyed.
Both conventional chromogen-based and immunofluorescent-based assays are now available in multiplexed formats, providing a means of simultaneously characterizing each cell within the tissue of interest through the examination of multiple analytes (7–11). These methods also share common analytical formats that compare the 2-dimensional relationship between cells and groups of cells, where each cell is spatially mapped by its x and y coordinate position and assigned specific phenotypes based on the presence and level of specific molecular analyte or antigen detection (12, 13). To leverage this rapidly evolving technology, several methods have been recently described to improve the efficacy, extent, and impact of spatial feature characterization within histological images (14–16). The recent application of artificial intelligence to the analysis of digital images of tissue sections, labeled with histological stains, antibodies, or both, has broadened and extended the practical application and impact of digital pathology in the diagnosis and characterization of breast cancer and the tumor microenvironment. This rich spatial information has enabled more structure-focused computational approaches aimed at examining relationships within tissue architecture and morphology through the application of topology and spatial statistics (17–21). Using the mathematical theory of persistent homology, we have developed a topological approach that provides a quantitative assessment of tissue architecture, including clusters, shapes, and voids. It captures and quantifies architectural features that include, but are not limited to, those intrinsic to glandular and vascular structures and varying patterns of stromal cellular infiltration. Through explicit modeling of the topology of tissue architecture, combined with its mapping into machine-learning-ready topological features, this approach provides unique insights into tissue structural morphology.
A proof-of-principle study demonstrating that such topological features could be applied as objective parameters suitable for use in a correlative context with breast cancer clinical diagnoses and outcomes data has been recently published (13).
In this study, we performed a quantitative multiplex immunofluorescent (qmIF) analysis of breast cancer tissue microarrays (TMAs) from a racially diverse cohort of breast cancer patients residing in a rural 29-county region of eastern North Carolina (N=555 with 8.5-year follow-up) (12, 13, 22, 23). The immunofluorescent analysis included markers for DNA (DAPI stain) and primary antibodies against CD8 (cytotoxic effector T-cells), CD68 (macrophage surface marker), pan-cytokeratin (tumor/epithelial cell marker), and PD-L1 (immune-checkpoint surface receptor) (12, 13, 22, 23). These images were then analyzed for spatial features (nearest-neighbor analysis, Ripley’s K-function) and for topological features by persistent homology, including a newly introduced PD-L1 intensity-dependent filtration that expands the application, interpretation, and predictive value of the spatial statistics. Results from these measurements were then combined with clinical features, including patient gene expression and clinical outcome data, to develop clinically predictive spatial biomarkers through the application of machine learning, gene pathway analysis, and statistical modeling.
Materials and Methods
Data Preparation
Patient cohort, Tissue Microarray.
The study was performed using de-identified formalin-fixed and paraffin-embedded (FFPE) tissue samples and de-identified clinical information abstracted from medical records that were fully annotated for the patient features examined in this study (a total of 482 of 550). Data was obtained with IRB approval from East Carolina University and the National Institutes of Health intramural research program. Tissue samples were obtained at the time of surgery in the adjuvant setting between 2001 and 2010 at Pitt County Memorial Hospital (now Vidant Medical Center) in Greenville, North Carolina. All patient samples and data obtained were de-identified and approved by the East Carolina University Institutional Review Board as a human subject exempt project, for which no informed consent is needed. The study was conducted in accordance with the Declaration of Helsinki. Patients selected were those who underwent surgery for Stage 0 to Stage IV breast cancer between 2001 and 2010 at Vidant Medical Center, Greenville, NC. Race, ethnicity, or “ancestry” was self-reported at the initial visit and captured in the medical record. Survival was recorded retrospectively from the medical records and the cancer registry with a median follow-up of 8.5 years. Less than 5% of patient tissue was exposed to chemotherapy (mostly for debulking before surgery). Replicate tissue microarrays were constructed with 1 mm cores.
Quantitative Multiplex Immunofluorescence.
The tissue microarray was stained with the Ultivue UltiMapper I/O PD-L1 kit, comprising CD8-, CD68-, PD-L1-, and pan-cytokeratin-specific antibodies and DAPI, according to the manufacturer’s specifications, as previously described (12). Image data were collected at 20x magnification. The fluorescent dye intensities were normalized to a range of 0-255. We performed image analysis using the HALO (Indica Labs, v 3.0.311.201) software platform at the full available magnification.
The TMA spots were decomposed into individual analysis regions using the HALO TMA module, ignoring cores that were 85% empty space. A coordinate system was established for each spot, with the origin being the bottom-left corner of the square TMA boundary, with a unit coordinate equivalent to one pixel or 0.5 microns. Watershed nuclear identification was performed on the DAPI channel with a nuclear contrast threshold of 0.5 and a nuclear segmentation aggressiveness of 92% to differentiate the overlapping nuclei common in leukocyte clusters. Nuclei were required to be between 10 and 250 in size. A cytoplasmic region was grown from the nuclear boundary up to a radius of 4.2 microns. Cells were required to be less than 500 in size.
The average stain intensity within the cytoplasmic region was measured, and the positive-dye status for the antigens was defined as follows: CD8 (15), CD68 (8), panCK (10), and PD-L1 (13). Overall, we observed low backgrounds and strong signals (see Supplementary Figure 1 and Supplementary Figure 2). Phenotypes are defined using coincidence/anti-coincidence logic of the positive-indicator dye status. The logical combination for the main cell types is stromal (not panCK), T-cell (CD8), macrophage (CD68), and tumor (panCK and not CD8 and not CD68). To incorporate the PD-L1 biomarker in the study, we either use the continuous-valued PD-L1 intensity or categorize each cell into two sub-phenotypes, PD-L1+ and PD-L1- (Supplementary Figure 1). We observed that using continuous PD-L1 intensity in topological analysis is the most powerful solution. The result of the phenotyping analysis is a text file for each tissue sample consisting of entries listing information about each cell location, including the manual phenotyping result and raw staining intensities per cellular compartment using the defined coordinate system. The cell-point location was taken as the center of the rectangle that fully bounds the cell. Cores with fewer than 1000 identified nuclei were excluded from the analysis.
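The coincidence/anti-coincidence logic described above can be sketched as follows. This is a minimal illustration and not the HALO implementation: the function names are hypothetical, and it assumes that a marker is positive when its average cytoplasmic intensity is strictly greater than the stated threshold on the 0-255 scale.

```python
# Positivity thresholds on the normalized 0-255 intensity scale, as listed in the text.
THRESHOLDS = {"CD8": 15, "CD68": 8, "panCK": 10, "PD-L1": 13}


def positive_markers(intensities, thresholds=THRESHOLDS):
    """Return the set of markers whose average intensity exceeds its threshold.

    Assumption: "positive" means strictly greater than the threshold.
    """
    return {m for m, cut in thresholds.items() if intensities.get(m, 0) > cut}


def phenotype_flags(intensities):
    """Apply the coincidence/anti-coincidence logic to one cell.

    The flags are not mutually exclusive: a CD8+ cell without pan-cytokeratin
    staining is both "stromal" and "T-cell".
    """
    pos = positive_markers(intensities)
    return {
        "stromal": "panCK" not in pos,
        "T-cell": "CD8" in pos,
        "macrophage": "CD68" in pos,
        "tumor": "panCK" in pos and "CD8" not in pos and "CD68" not in pos,
        "PD-L1+": "PD-L1" in pos,
    }


# Example: a pan-cytokeratin-positive, PD-L1-positive cell.
print(phenotype_flags({"CD8": 3, "CD68": 2, "panCK": 40, "PD-L1": 20}))
```

Returning boolean flags rather than a single label keeps the overlapping definitions (for example, stromal and T-cell) explicit.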
Spatial Profiling Features
Basic Statistics: Counting Analysis.
The most basic characterization of tumor-infiltrating lymphocytes is simply counting them. We computed the following cellular counts: all, stromal, tumor, CD8, and CD68. Since PD-L1 is often characterized as dichotomous, determined by the presence of any cells that express PD-L1 above a threshold, we assigned each sample a binary value corresponding to the presence of PD-L1. To further take advantage of colocalized qmIF markers, we computed counts of PD-L1-expressing cells (tumor, stroma, CD68, CD8). We also computed PD-L1 proportion scores (tumor, stroma, CD8, CD68), defined as the ratio of the number of PD-L1-positive cells of a given cell type to the total number of cells of that type.
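As an illustration of these counting features, the sketch below computes per-type counts, PD-L1-positive counts, proportion scores, and the binary PD-L1 presence flag for one sample. The input format (a list of per-cell dictionaries of boolean flags) and the feature names are assumptions made for this example; a proportion score is returned as `None` when a sample contains no cells of the given type.

```python
CELL_TYPES = ("tumor", "stromal", "T-cell", "macrophage")


def counting_features(cells):
    """Counting features for one sample.

    `cells` is a list of dicts of boolean flags, one dict per cell,
    with one key per cell type plus a "PD-L1+" key (hypothetical format).
    """
    features = {"total_count": len(cells)}
    for cell_type in CELL_TYPES:
        of_type = [c for c in cells if c[cell_type]]
        positive = [c for c in of_type if c["PD-L1+"]]
        features[f"{cell_type}_count"] = len(of_type)
        features[f"{cell_type}_PD-L1+_count"] = len(positive)
        # Proportion score: PD-L1+ cells of this type / all cells of this type.
        features[f"{cell_type}_PD-L1_proportion"] = (
            len(positive) / len(of_type) if of_type else None
        )
    # Dichotomous PD-L1 status: 1 if any cell is PD-L1 positive.
    features["PD-L1_present"] = int(any(c["PD-L1+"] for c in cells))
    return features


def make_cell(cell_type, pdl1):
    """Build a toy cell record with exactly one cell type set."""
    cell = {t: t == cell_type for t in CELL_TYPES}
    cell["PD-L1+"] = pdl1
    return cell


sample = [make_cell("tumor", True), make_cell("tumor", False),
          make_cell("tumor", False), make_cell("tumor", False),
          make_cell("stromal", False)]
print(counting_features(sample)["tumor_PD-L1_proportion"])
```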
Basic Statistics: PD-L1 Intensity Analysis.
In addition to the presence of cells that express PD-L1 above a threshold, we also compute statistics on the PD-L1 expression intensity profile for each sample, defined as the distribution of PD-L1 intensities across all cells or a subset of cells within a sample. We characterize the PD-L1 intensity profile by the mean and standard deviation.
Basic Statistics: Nearest-Neighbor Analysis.
Nearest-neighbor analysis was used to characterize the spatial arrangement of cells identified by qmIF by finding the point, in a given population, that is closest to the current point of interest. In this case, we search for the nearest neighbor belonging to the phenotype of all tumor cells, keeping track of the nearest-neighbor distances. We use this nearest-neighbor analysis to characterize the spatial arrangement of different phenotypes in proximity to tumor cells. We then compute the average of the nearest-neighbor distance distributions. Shown in Figure 1A is an example nearest-neighbor graph and associated distribution with and without using PD-L1 thresholds.
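A brute-force sketch of this nearest-neighbor feature is given below. It assumes that, for each cell of a given phenotype, the nearest tumor cell is located and the resulting distances are averaged; the function names are hypothetical, and a spatial index (for example, a k-d tree) would be preferable for full-size cores.

```python
import math


def nearest_neighbor_distances(sources, targets, same_population=False):
    """Distance from each source point to its nearest target point.

    Points are (x, y) tuples. Set same_population=True when `sources` and
    `targets` are the same list (e.g., tumor near tumor) so that a point is
    not matched with itself.
    """
    distances = []
    for i, (sx, sy) in enumerate(sources):
        best = math.inf
        for j, (tx, ty) in enumerate(targets):
            if same_population and i == j:
                continue
            best = min(best, math.hypot(sx - tx, sy - ty))
        if best < math.inf:  # skip sources that have no valid target
            distances.append(best)
    return distances


def mean_nearest_neighbor_distance(sources, targets, same_population=False):
    """Average of the nearest-neighbor distance distribution (None if empty)."""
    d = nearest_neighbor_distances(sources, targets, same_population)
    return sum(d) / len(d) if d else None


cd8_cells = [(0, 0), (10, 0)]
tumor_cells = [(3, 4), (10, 2)]
print(mean_nearest_neighbor_distance(cd8_cells, tumor_cells))
```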
Figure 1.

Spatial and topological features extracted from multiplex immunofluorescent images of breast cancer. A, (left) Example of point clouds generated from multiplex quantitative immunofluorescent analysis. (right) Shown are example spatial profiles for collected simple spatial features, including nearest-neighbor analyses measuring clustering and closeness between different biomarkers. B, Ripley’s K-function analysis used to measure the spatial clustering or dispersion within and between the point clouds. C, Topological analysis of the point clouds conducted to extract various topological features by applying spatial filtration using Euclidean balls. D, Spatial filtration analysis using Delaunay triangulation. E, Delaunay triangulation-based spatial filtration using PD-L1 intensity. (C-E, bottom, right) Persistence diagrams generated from each respective filtration.
Spatial Statistics: Ripley’s K-functions.
Nearest-neighbor analysis can effectively characterize the distances between tumor cells and the closest cells. However, nearest-neighbor analysis is sensitive to outliers (skewing the distribution) and completely misses any larger-scale spatial arrangements. To more comprehensively characterize the spatial arrangement of tumor-infiltrating lymphocytes, we use Ripley’s K-functions (24), which we define to be the number of cells within a fixed radius of a tumor cell. We apply Ripley’s K-functions to assess the number of CD8, CD68, other stromal, tumor, and PD-L1-positive cells surrounding tumor cells within several fixed radii: 15, 30, and 60 microns. In addition to the value of Ripley’s K-functions at these fixed radii, we also consider the rate of increase of Ripley’s K-function, known as the gradient. The gradient is related to how quickly the density of clustering changes at each fixed radius. We used the spatstat R package to compute Ripley’s K-functions (25). We also developed an intensity-based Ripley’s K-function to consider the PD-L1 intensities of expressing cells, defined as the sum of PD-L1 intensities on CD8, CD68, other stromal, and tumor cells surrounding tumor cells within the same radii considered in the spatial analysis. This enabled the characterization of areas of high PD-L1 intensity. We again compute two characterizations at each radius: the value and the gradient. We used a custom implementation of the spatstat R package to compute intensity-based Ripley’s K-functions.
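The neighbor-count definition used above, and its intensity-weighted variant, can be sketched as follows. This is not the spatstat computation (which includes edge corrections and normalization by point density); it assumes the per-tumor-cell counts are averaged over all tumor cells, that coordinates are already in microns, and that the gradient is approximated by a central finite difference. All function names are hypothetical.

```python
import math


def neighbor_count_k(tumor, others, radius, weights=None):
    """Average (optionally weighted) number of `others` within `radius` of a tumor cell.

    With weights=None every neighbor counts as 1; passing per-cell PD-L1
    intensities as `weights` gives the intensity-based variant, i.e., the
    summed intensity of the neighbors.
    """
    if not tumor:
        return 0.0
    if weights is None:
        weights = [1.0] * len(others)
    total = 0.0
    for tx, ty in tumor:
        for (ox, oy), w in zip(others, weights):
            if math.hypot(tx - ox, ty - oy) <= radius:
                total += w
    return total / len(tumor)


def k_gradient(tumor, others, radius, step=1.0, weights=None):
    """Central finite-difference estimate of the rate of increase at `radius`."""
    high = neighbor_count_k(tumor, others, radius + step, weights)
    low = neighbor_count_k(tumor, others, radius - step, weights)
    return (high - low) / (2.0 * step)


tumor_cells = [(0, 0)]
cd8_cells = [(10, 0), (20, 0), (50, 0)]
pdl1_intensity = [5.0, 7.0, 9.0]
for r in (15, 30, 60):
    print(r, neighbor_count_k(tumor_cells, cd8_cells, r),
          neighbor_count_k(tumor_cells, cd8_cells, r, pdl1_intensity))
```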
Topological Features: Persistence Diagrams.
Formally, topological features are computed by inspecting and tracking the topology of the so-called sublevel sets of an input filter function $f$, where the sublevel set at threshold $t$ comprises all regions of the image domain whose filter value is at most $t$, i.e., $\{x : f(x) \le t\}$. In the example in Figure 1, the filter function is the distance to the nearest cell, $f(x) = \min_{c \in C} \lVert x - c \rVert$, in which $C$ is the set of all cells. One can think of this as thresholding the domain using a fixed threshold $t$. This gives a subdomain that might contain topological structures. In a real setting, it is unrealistic to find an ideal threshold. Instead, we inspect all thresholds holistically: we increase the threshold from small to large, and the subdomain grows from empty to the whole domain, which corresponds to growing the union of disks of radius $t$. During this process, topological structures appear and disappear. We track the life span of each topological structure and encode it into a concise representation called the persistence diagram. In the following, we define two distinct filter functions. The first is used to characterize the spatial arrangement of cells. The second is used to characterize the spatial arrangement of PD-L1 intensity.
Spatial filtration.
The first, which we call the spatial filter, is simply defined as the nearest-neighbor distance from a cell. This corresponds to the growing-disk example in Figure 1C. The corresponding sublevel set at threshold $t$ is then defined as a union of Euclidean balls of radius $t$ centered on the tumor cells. As the radius grows, the balls intersect, resulting in the birth of new topological structures and the death of existing ones. We apply the spatial filtration to compute topological characterizations of CD8, CD68, other stromal, tumor, and PD-L1-positive cells. Technically, we first decompose the 2D domain into a discretization called the Delaunay triangulation (26) (Figure 1D). The triangulation consists of vertices (cells), edges, and triangles. Their union covers the whole domain spanned by the cells (more exactly, the convex hull of the vertices). Next, we filter this triangulation using the function values associated with each of the elements. An edge’s function value is half of its length. A triangle’s function value is the radius of its circumcircle. As we continuously increase the threshold and add all elements one by one, we obtain a filtration of the domain and can track all topological structures appearing and disappearing through the filtration.
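The function values described above, and the resulting births and deaths of connected components, can be illustrated with a short standard-library sketch. It is not the GUDHI computation: for simplicity it uses all pairwise edges rather than only Delaunay edges (an assumption that does not change the connected-component pairs, since the minimum spanning tree is contained in the Delaunay triangulation), and it tracks only connected components, not holes. The function names are hypothetical.

```python
import math


def edge_value(p, q):
    """Filtration value of an edge: half of its length."""
    return math.dist(p, q) / 2.0


def triangle_value(p, q, r):
    """Filtration value of a triangle: the radius of its circumcircle."""
    a, b, c = math.dist(q, r), math.dist(p, r), math.dist(p, q)
    area = abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2.0
    if area == 0.0:
        raise ValueError("degenerate (collinear) triangle")
    return a * b * c / (4.0 * area)


def h0_persistence(points):
    """(birth, death) pairs of connected components under the spatial filtration.

    Every cell is born at threshold 0; a component dies when the growing
    disks of two different components first touch (union-find merge).
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    edges = sorted(
        (edge_value(points[i], points[j]), i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    pairs = []
    for value, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            pairs.append((0.0, value))
    if points:
        pairs.append((0.0, math.inf))  # the last component never dies
    return pairs


print(triangle_value((0, 0), (6, 0), (0, 8)))
print(h0_persistence([(0, 0), (2, 0), (10, 0)]))
```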
Intensity-reweighted spatial filtration.
The second filtration is novel and based on PD-L1 intensity. Instead of filtering based on the distances between the tumor cells, the filtration is based on tracking topological structures that appear and disappear while sequentially increasing the PD-L1 intensity threshold. For this, we consider the sublevel sets of the piecewise-constant extension of the opposite of the PD-L1 intensity filter (which is initially defined only on the tumor cells and not on the whole 2D domain). The tracking is once again encoded into a persistence diagram. We again construct a Delaunay triangulation and filter its elements (vertices, edges, triangles) with the function value. This time, however, we use a lower-star filtration: the function value of each element is the maximal function value among all its vertices. This is illustrated in Figure 1E.
Persistence diagrams are not suitable as direct input to machine learning algorithms. Therefore, they must be transformed into vectors that can be combined with other features (such as those defined above). After exploring different representations of persistence diagrams, including persistence images (27) and persistence landscapes (28), we chose to vectorize persistence diagrams using persistence images (27). Since the dimensionality of persistence images is large (100×100 pixels), we use kernel principal component analysis and keep the first 50 components. These components are used as features for our statistical analyses. We used GUDHI (29) to compute persistence diagrams and images. Note that multidimensional persistence has been explored in recent years (30, 31). We plan to investigate the predictive power of multidimensional persistence in our setting in the future.
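The two steps above, assigning lower-star function values and vectorizing a persistence diagram, can be illustrated with the following standard-library sketch. It is not the GUDHI implementation used in this study: the simplex encoding (tuples of vertex indices), the Gaussian width `sigma`, the linear weighting by persistence, and the 10×10 grid are all assumptions made to keep the example small.

```python
import math


def lower_star_values(intensity, simplices):
    """Lower-star filtration values on the negated PD-L1 intensity.

    `intensity[v]` is the PD-L1 intensity of vertex (cell) v, and each simplex
    is a tuple of vertex indices. Negating the intensity makes the brightest
    cells enter the filtration first.
    """
    f = [-x for x in intensity]
    return {s: max(f[v] for v in s) for s in simplices}


def persistence_image(pairs, resolution=10, sigma=0.1, extent=(0.0, 1.0)):
    """Rasterize (birth, death) pairs into a resolution x resolution image.

    Each pair is mapped to (birth, persistence), weighted by its persistence,
    and spread with an isotropic Gaussian. Rows index persistence and columns
    index birth.
    """
    low, high = extent
    step = (high - low) / resolution
    centers = [low + (k + 0.5) * step for k in range(resolution)]
    image = [[0.0] * resolution for _ in range(resolution)]
    for birth, death in pairs:
        pers = death - birth
        for r, y in enumerate(centers):
            for c, x in enumerate(centers):
                d2 = (x - birth) ** 2 + (y - pers) ** 2
                image[r][c] += pers * math.exp(-d2 / (2.0 * sigma ** 2))
    return image


simplices = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 1, 2)]
print(lower_star_values([200, 50, 120], simplices))
flat = [v for row in persistence_image([(0.25, 0.70)]) for v in row]
print(len(flat))  # feature vector length before dimensionality reduction
```

The flattened image is what would then be passed to a dimensionality-reduction step such as kernel PCA.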
Feature Details.
In Table 1, we list the specific features for each of the characterizations used. “Stroma” refers to detected cells that do not stain positively for pan-cytokeratin. “Other Stroma” refers to detected cells that do not stain positively for pancytokeratin, CD8, or CD68.
Table 1.
List of the specific features for each of the spatial and topological characterizations used in this study. “Stroma” refers to detected cells that do not stain positively for pan-cytokeratin. “Other Stroma” refers to detected cells that do not stain positively for pan-cytokeratin, CD8, or CD68.
| Analysis | Phenotypes | PD-L1 usage | Features |
|---|---|---|---|
| Counting | All Phenotypes | without PD-L1 | Total Cell Count, Tumor Cell Count, Stroma Cell Count, CD68 Cell Count, CD8 Cell Count |
| Counting | All Phenotypes | with PD-L1 | Tumor PD-L1+ Cell Count, Stroma PD-L1+ Cell Count, CD68 PD-L1+ Cell Count, CD8 PD-L1+ Cell Count, presence of PD-L1+ cells (binary variable), Tumor PD-L1 Proportion Score, Stroma PD-L1 Proportion Score, CD8 PD-L1 Proportion Score, CD68 PD-L1 Proportion Score |
| Counting | Tumor Only | without PD-L1 | Total Cell Count, Tumor Cell Count |
| Counting | Tumor Only | with PD-L1 | Tumor PD-L1+ Cell Count, presence of PD-L1+ cells (binary variable), Tumor PD-L1 Proportion Score |
| Nearest-Neighbor | All Phenotypes | without PD-L1 | CD8 near Tumor, CD68 near Tumor, Stroma near Tumor, Tumor near Tumor |
| Nearest-Neighbor | All Phenotypes | with PD-L1 | CD8 PD-L1+ near Tumor, CD68 PD-L1+ near Tumor, Stroma PD-L1+ near Tumor, Tumor PD-L1+ near Tumor, any PD-L1+ cell near Tumor |
| Nearest-Neighbor | Tumor Only | without PD-L1 | Tumor near Tumor |
| Nearest-Neighbor | Tumor Only | with PD-L1 | Tumor PD-L1+ near Tumor |
| PD-L1 Intensity | All Phenotypes | | All cells (average and standard deviation), Tumor (average and standard deviation), Stroma (average and standard deviation), CD8 (average and standard deviation), CD68 (average and standard deviation) |
| PD-L1 Intensity | Tumor Only | | Tumor (average and standard deviation) |
| Ripley’s K-function (values at 250, 500, 1000, 1500, 2000, and 2500 pixel distances) | All Phenotypes | without PD-L1 | Tumor, Stroma, CD8, CD68 |
| Ripley’s K-function | All Phenotypes | with PD-L1 | Tumor PD-L1+, Stroma PD-L1+, CD8 PD-L1+, CD68 PD-L1+ |
| Ripley’s K-function | All Phenotypes | with PD-L1 intensity | Tumor, Stroma, CD8, CD68 (cells are weighted by PD-L1 intensity) |
| Ripley’s K-function | Tumor Only | without PD-L1 | Tumor |
| Ripley’s K-function | Tumor Only | with PD-L1 | Tumor PD-L1+ |
| Ripley’s K-function | Tumor Only | with PD-L1 intensity | Tumor (cells are weighted by PD-L1 intensity) |
| Topological | All Phenotypes | without PD-L1 | Tumor, Stroma, CD8, CD68, Other Cells (cells that are not positive in all four markers); computed using spatial filtration |
| Topological | All Phenotypes | with PD-L1 | Tumor PD-L1+, Stroma PD-L1+, CD8 PD-L1+, CD68 PD-L1+, Other Cells PD-L1+; computed using spatial filtration |
| Topological | All Phenotypes | with PD-L1 intensity | Tumor, Stroma, CD8, CD68, Other Cells; computed using PD-L1 intensity filtration |
| Topological | Tumor Stromal | without PD-L1 | Tumor; computed using spatial filtration |
| Topological | Tumor Stromal | with PD-L1 | Tumor PD-L1+; computed using spatial filtration |
| Topological | Tumor Stromal | with PD-L1 intensity | Tumor; computed using PD-L1 intensity filtration |
Learning-Based Scores and Statistical Analysis
Comparing all spatial and intensity characterizations against the standard-practice clinical variables may under-represent the importance of these characterizations in the cohort. To address this, we train models without the clinical baseline features and apply the trained models to a hold-out test dataset. The model output on the hold-out test data forms the TME risk score. The features we used to train the model include simple features, Ripley’s K-functions, and topology. To prevent overfitting, we drop the sparse topological feature columns, where a column is considered sparse if at least 1/4 of its elements are 0. In this context, ensuring equal scales across input features is important, so the simple, K-function, and topology features are scaled by their standard deviation. Further, we stratify the test samples into high and low scores using a predetermined cut-off threshold, defined as the median expected lifetime of the training samples. From this stratification, we obtain Kaplan-Meier curves and an associated log-rank p-value. We also use the concordance score (C-index) to evaluate the performance of the trained survival model. The Cox proportional hazard ratios, Kaplan-Meier curves, and log-rank test are calculated with the lifelines Python package. To evaluate the pipeline, we split the samples into training (2/3) and testing (1/3) groups 7 times independently (with different random states).
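Two of the steps above, dropping sparse feature columns and scoring a survival model by its concordance, can be sketched in plain Python. This is an illustration only and not the lifelines implementation used in this study: the function names are hypothetical, the C-index shown is a simple Harrell-type estimate that ignores tied event times, and it assumes a higher score means higher risk (shorter expected survival).

```python
def keep_dense_columns(rows, max_zero_fraction=0.25):
    """Indices of feature columns that are NOT sparse.

    A column is dropped when at least `max_zero_fraction` of its entries are 0.
    `rows` is a list of equal-length feature lists, one per sample.
    """
    n = len(rows)
    kept = []
    for col in range(len(rows[0])):
        zeros = sum(1 for row in rows if row[col] == 0)
        if zeros / n < max_zero_fraction:
            kept.append(col)
    return kept


def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs ordered correctly by the risk score.

    A pair (i, j) is comparable when sample i had an observed event and a
    strictly shorter follow-up time than sample j. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored sample cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


features = [[1, 0, 3], [2, 0, 0], [3, 5, 1], [4, 6, 2]]
print(keep_dense_columns(features))
print(concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.8, 0.1]))
```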
An L1 penalty-based feature selection process is used in this pipeline. For each feature set, we select the most important variables for each train-test split by performing survival regression on the training set, and we keep the variables that are selected in more than a given number of the 7 splits. For the baseline features, we select the 17 most important variables (of 33 in total) per split and keep those that appear more than 4 times in 7 splits (15 variables kept). For the baseline and topology features, we select the 87 most important variables (of 487 in total) per split and keep those that appear more than 5 times in 7 splits (35 variables kept). For the baseline and intensity K-function features, we select the 32 most important variables (of 57 in total) per split and keep those that appear more than 4 times in 7 splits (22 variables kept). For the baseline, topology, and intensity K-function features, we select the 91 most important variables (of 511 in total) per split and keep those that appear more than 3 times in 7 splits (54 variables kept). We then trained the model on the selected features to assess the quality of the selection. As shown before, models based on the selected features showed predictive power similar to that of models based on the full feature set.
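The final voting step of this selection procedure (keeping variables that recur across splits) can be sketched as follows. The per-split selections would come from the L1-penalized survival regression, which is not reproduced here; the feature names and the function name are hypothetical.

```python
from collections import Counter


def stable_features(selected_per_split, min_count):
    """Features selected in MORE than `min_count` of the train-test splits.

    `selected_per_split` holds one collection of selected feature names per
    split; a feature is counted at most once per split.
    """
    counts = Counter(f for split in selected_per_split for f in set(split))
    return sorted(f for f, n in counts.items() if n > min_count)


# Toy example with 7 splits and made-up feature names.
splits = [
    ["cd8_near_tumor", "tumor_count"],
    ["cd8_near_tumor", "topo_pc_3"],
    ["cd8_near_tumor", "tumor_count"],
    ["cd8_near_tumor", "tumor_count"],
    ["cd8_near_tumor", "tumor_count"],
    ["cd8_near_tumor"],
    ["cd8_near_tumor", "tumor_count", "topo_pc_3"],
]
print(stable_features(splits, min_count=4))
```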
For hyperparameter tuning, we use the average unpenalized partial log-likelihood as the model selection criterion. A basic grid search is used to find the best hyperparameter, which here is the weight of the L1 penalty. The search range is [0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.08, 0.1, 0.12, 0.15, 0.2, 0.5, 1]. We use a separate validation set within the training set to choose the hyperparameter.
Statistical analysis
The analysis employed a variety of statistical techniques. The optimal cutoff for all 4 spatial features was determined using maximally selected rank statistics, allowing for classification into low- and high-risk groups (32). This cutoff approach was applied to the overall data (All) and to subclasses of the data, i.e., the NHB (Non-Hispanic Black) and NHW (Non-Hispanic White) populations. Kaplan-Meier survival curves were constructed to visualize survival differences between these groups. A Cox proportional hazards regression model was applied to assess the relationship between biomarker expression and survival outcomes, with hazard ratios (HR) and 95% confidence intervals (CI) calculated (33). The analysis was conducted using the survival package in R (R Project for Statistical Computing, RRID: SCR_001905), which is available on CRAN (RRID: SCR_003005) (https://cran.r-project.org/web/packages/survival/survival.pdf). In survival modeling, the time-dependent area under the receiver operating characteristic curve (AUROC) was used to evaluate the performance of multivariate survival models (34). Statistical significance between the groups was tested using ANOVA (35) and Wald’s test (36). Spearman’s rank correlation test (37) was used to assess the correlation between variables, and an unsupervised hierarchical clustering approach was applied for data stratification and pattern identification. Furthermore, forest plot analysis using optimized cutoff values in total patients, African American (AA) patients, or European American (EA) patients was performed to visualize and compare the relative differences between the HR (CI) of different biomarkers. The Ingenuity Pathway Analysis software (IPA version 01-22-01) (RRID: SCR_008653) was used to explore the functional networks associated with our list of significant genes. This analysis facilitated a deeper understanding of the biological processes influenced by the significant genes and provided insights into potential therapeutic targets or biological mechanisms. Relationships between complex pathways were visualized using Cytoscape (RRID: SCR_003032).
Results
Discovery pipeline for defining predictive spatial and topological breast cancer biomarkers
Digital images of tumor cores (1 mm) from 550 patients stained with DAPI and a panel of cellular markers for immune and tissue phenotyping, including antibodies against CD8, CD68, pan-cytokeratin, and PD-L1 (12, 13, 22, 23), were captured for downstream analysis of nuclear, cytoplasmic, and membrane staining following nuclear segmentation. The cellular phenotype for each DAPI-detected cell (nuclear staining) was determined using manual thresholds for each quantified stain to assign discrete phenotypes to each cell. The phenotype and position of the labeled cells were then displayed as point clouds (Figure 1A, left). The point clouds were the primary input for all subsequent analyses, including basic counting of cells of different phenotypes, nearest-neighbor profiling (Figure 1A, right; see Materials and Methods and Supplementary Figures 1 and 2), and Ripley’s K-function analysis using classic spatial statistics to identify and quantify cell aggregation or clustering (Figure 1B; see Materials and Methods). These spatial profiling approaches measure proximity between cells of the same or different phenotypes by either inspecting the distribution of nearest-neighbor distances or counting cells surrounding given cells at increasing distances.
In addition to conventional spatial profiling of tumors and the tumor microenvironment (TME), we have implemented an approach from applied mathematics to quantify shapes and structures based on topology (13). The field of topological data analysis (TDA) (38–40) seeks to mathematically define spatial structures and shapes based on how they are arranged and has been previously applied as an approach to understand and characterize tissue morphology (41). However, within the field of TDA, the theory of persistent homology generates a mathematical assessment of topological structures arising from a series of continuous observations/measurements over multiple scales. Such topological structures include connected components (structures that could be represented by cell-to-cell adhesions) and voids or holes (structures that could be represented by vascular or glandular lumens). In practice, these structures are extremely hard to quantify in histological images because they are often incomplete and, indeed, form a continuous spectrum of structures. To capture the topology of different levels of formations, persistent homology constructs a nested sequence of progressively growing shapes and encompasses their evolving topological scaffolds. The net effect or objective is to define those topological structures that persist despite a progressive change in the resolution of the image during filtration. The sequence starts with all cells of interest and then proceeds by growing the space occupied by these cells, by increasing their radius. This collective space-filling sequence is referred to as Euclidean balls, where each cell coordinate, centered at every cell, develops as a progressive “space-filling” expansion, called a spatial filtration. Throughout the filtration (or decreasing resolution), topology evolves as structures appear and disappear (Figure 1C). 
Connected components, reflecting aspects of clustered arrangements of cells at different sizes and densities, are formed and merged into each other. Higher-order topological structures, i.e., voids/loops/holes, are formed as cells along their boundaries merge, and disappear when their interiors are filled (Figure 1C). During the filtration, the appearance time (called birth, x-axis) and disappearance time (called death, y-axis) of these topological structures are recorded as a persistence diagram (Figure 1C, bottom right). Alternatively, the same filtration can be performed using the so-called “Delaunay Triangulation”, which connects the cell coordinates to form triangles such that the circumcircle of any triangle contains no other cell coordinates or points (Figure 1D, bottom left). Similar to Euclidean Balls, a filtration by Delaunay Triangulation progressively occupies space by adding connections/edges between the cell coordinates based on a growing length threshold, thus adding triangles of increasing sizes. Likewise, connected components and voids/holes are born and lost throughout the filtration (Figure 1D), resulting in a persistence diagram, identical to the earlier spatial filtration by Euclidean Balls, that similarly maps the progressive gain and loss of connections and voids as the image resolution decreases (Figure 1D, bottom right). We also incorporate intensity values assigned to cells for specific biomarkers, such as PDL1, in the analysis. In this intensity-driven filtration, points or cells within the point cloud are added progressively, from highest to lowest, according to the biomarker intensity of interest. Adjacent edges/triangles are added along with these points, creating incremental space-filling, until the whole Delaunay Triangulation is completed (Figure 1E). As in Figure 1D, this filtration produces a similar, though unique, persistence diagram (Figure 1E, bottom right). 
When plotting the diagram, we invert and min/max normalize the intensity values so that, for each captured structure, its birth time/intensity is smaller than its death time/intensity.
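The intensity-driven filtration can be sketched in a few lines of Python. The example below is a simplified illustration under stated assumptions rather than the study's implementation: it builds the Delaunay edges by brute force from the empty-circumcircle property (practical only for a handful of points), tracks connected components only, and uses hypothetical function names, coordinates, and PDL1 intensities. Cells enter from highest to lowest intensity, and the inverted, min/max-normalized intensity serves as the filtration value.

```python
import math
from itertools import combinations


def orientation(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, d):
    """True if d lies strictly inside the circumcircle of triangle abc."""
    (ax, ay), (bx, by), (cx, cy) = [(p[0] - d[0], p[1] - d[1]) for p in (a, b, c)]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    return det * orientation(a, b, c) > 0


def delaunay_edges(points):
    """Brute-force Delaunay edges: keep triangles with an empty circumcircle."""
    edges = set()
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        if orientation(a, b, c) == 0:
            continue  # collinear triple has no circumcircle
        others = (m for m in range(len(points)) if m not in (i, j, k))
        if any(in_circumcircle(a, b, c, points[m]) for m in others):
            continue
        edges.update({(i, j), (j, k), (i, k)})
    return edges


def intensity_h0_persistence(points, intensity):
    """Component births/deaths when cells enter from brightest to dimmest."""
    lo, hi = min(intensity), max(intensity)
    span = (hi - lo) or 1.0
    # Invert and min/max normalize: brightest cell -> 0, dimmest cell -> 1.
    value = [1.0 - (v - lo) / span for v in intensity]
    neighbors = {i: set() for i in range(len(points))}
    for i, j in delaunay_edges(points):
        neighbors[i].add(j)
        neighbors[j].add(i)
    parent, birth, pairs = {}, {}, []

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for v in sorted(range(len(points)), key=lambda i: value[i]):
        parent[v], birth[v] = v, value[v]
        for u in neighbors[v]:
            if u not in parent:
                continue  # neighbor has not entered the filtration yet
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            # Elder rule: the component born later dies at the merge.
            young, old = (ru, rv) if birth[ru] > birth[rv] else (rv, ru)
            if birth[young] < value[v]:
                pairs.append((birth[young], value[v]))
            parent[young] = old
    pairs.append((min(value), math.inf))  # oldest component never dies
    return pairs


# Hypothetical cells and PDL1 intensities (arbitrary units).
cell_xy = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0), (0.0, 2.0)]
pdl1 = [2.0, 10.0, 4.0, 8.0]
intensity_diagram = intensity_h0_persistence(cell_xy, pdl1)
```

Because the brightest cells receive the smallest filtration values, every recorded pair has a birth value below its death value, matching the plotting convention described above. For real point clouds, a library triangulation would replace the brute-force step.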
The data from the persistence diagrams and basic spatial profiling features (including the counting and K-functions measured for each patient sample) are then combined with the patient’s clinical characteristics, survival, and clinical outcome data for downstream analysis using machine learning and statistical modeling (see Materials and Methods). This is ultimately used to develop a tissue morphology-based topological algorithm that predicts survival (see schematic pipeline in Figure 2). The accuracy of survival prediction is dependent on both the type of analysis and the type of data included in the combined spatial and topological calculations (Supplementary Table 1). To validate the efficacy of different spatial profiling features, we used these features to train a survival prognosis model to predict the survival months of each patient. We evaluated the model through 5-fold cross-validation and reported the prediction power of different combinations of spatial/topological features. Supplementary Table 1 compares the overall accuracy of survival prediction, measured by the concordance index (C-index), obtained with different spatial/topological feature combinations. Data input combinations included all phenotypes (stromal, CD8, CD8/PDL1, CD68, CD68/PDL1, Tumor, Tumor/PDL1) where the tumor was defined by pan-cytokeratin positive staining. Cox’s proportional hazards modeling was applied to compare combinations of intensity-dependent K-function, or all combined with K-function and Topology function. Among the Base features (counting and nearest neighbor distance features), Base + Topology, Base + PDL1, or all three, the combination of all three gave the highest performance in terms of survival prediction (C-index of 0.6726 (+/−0.0150)), thus demonstrating the significance of topological profiling.
Notably, the prediction accuracy decreased significantly (C-index 0.5641 (+/−0.0098)) when PD-L1 measurements were not included in the analysis. This suggested a significant role played by the presence of this immune checkpoint protein (42, 43), expressed by tumor, CD68, and CD8 cells, in the accuracy of survival predictions that incorporate spatial and topological analysis. Analysis using features extracted from tumor cells exclusively (Tumor only) reveals that the intrinsic organized structure of the tumor alone provides a very high predictive value (C-index = 0.6695 (+/−0.0360)). Again, the prediction power is substantially decreased when PDL1 expression is not included (C-index = 0.6097 (+/−0.0167)). These findings indicate that most of the survival-predictive features in the patient samples reside in the intrinsic shape, spatial, and topological properties of the tumor, independent of the tumor microenvironment. Moreover, accuracy also appears less dependent on PD-L1 expression when incorporating features from the tumor only.
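For readers unfamiliar with the concordance index reported above, the following Python sketch shows one common way it can be computed (Harrell's C) together with a simple 5-fold split. It is a minimal sketch under stated assumptions: the function names, the toy survival data, and the risk scores are hypothetical, tied survival times are ignored, and the exact estimator and fold assignment used in this study are described in Materials and Methods, not here.

```python
import random


def concordance_index(times, events, risk):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    before patient j's last follow-up. It is concordant when the patient
    who died earlier was assigned the higher risk; ties count one half.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored patients cannot anchor a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def k_fold_indices(n, k=5, seed=0):
    """Shuffle patient indices and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[fold::k] for fold in range(k)]


# Hypothetical example: survival months, event flags (1 = death), risk scores.
months = [5, 10, 15, 20]
died = [1, 1, 0, 1]
predicted_risk = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(months, died, predicted_risk)
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the values quoted above should be read.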
Figure 2.

Analytical pipeline for development of topological breast cancer biomarkers. A, Schematic diagram of steps involved in the pipeline to develop topology-based morphological biomarkers. Steps include generation of a point cloud from QmIF, analysis and measurement of both spatial and topological biomarkers followed by merging with clinical features for combined machine learning, logistic regression and statistical modeling to identify predictive features and feature combinations.
When compared to standard H&E images of corresponding patient tumor sections, the topological features faithfully capture and reflect clinically relevant features of tumor morphology, including gland formation and histologic grade (Figure 3). For example, a comparison of the H&E morphology of tissues with high versus low topological scores demonstrates readily distinguishable differences in the tumor morphology, displaying much more discernible glandular structure in the high-scoring tissue. This relationship is demonstrated in both calculations involving all phenotypes (Figure 3A) and those based on tumor-only measurements (Figure 3B).
Figure 3.

Sample comparison of multiplex immunofluorescent and H&E staining of tumor samples demonstrating high versus low Base Topology Scores. A, Examples of the immunofluorescent and H&E-stained histologic images of tumor samples comparing high versus low base topology (left) and base topology with intensity K-function (right) scores based on calculations using spatial features of All phenotypes. B, Similar examples of tumor samples representing high versus low base topology (left) and base topology with intensity K-function (right) measures derived from calculations using spatial features of Tumor Only.
Developing Topology-based scores as Predictive Breast Cancer Biomarkers
As shown in Figures 4A and 4B, all the scores based on spatial statistics and topology show a significant correlation with survival in the breast cancer cohort. Higher scores correlate with longer survival, consistent with an association between increased organized tumor structure, lower tumor grade, and more favorable overall outcomes. Based on the optimal cut-off values defined by the log-rank statistics, the spatial scores developed using topological metrics demonstrate the strongest association with survival (p-values < 0.0001) (see Figure 4B). This is illustrated by comparisons between Base: HR=0.49 (0.34-0.69), p-value < 0.0001 and Base + Topology: HR=0.27 (0.19-0.38), p-value < 0.0001; as well as the difference between Base + Intensity K-Function: HR=0.44 (0.3-0.65), p-value < 0.0001 and Base + Topology + Intensity K-Function: HR=0.28 (0.2-0.4), p-value < 0.0001 (Figure 4B). Notably, significant predictive accuracy persists in measures including both All Phenotypes and Tumor Only (Supplementary Figure 3). Moreover, although accuracy appears less dependent on PD-L1 expression when incorporating features from the tumor only, PD-L1 remains a significant contributor to survival prediction by Kaplan-Meier (KM) analysis regardless of whether calculated with All Phenotypes or Tumor-only features (Supplementary Figure 4).
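The idea of choosing a cut-off by log-rank statistics can be sketched as a scan over candidate thresholds. The Python example below is a simplified stand-in, not the procedure used for Figure 4: it computes the ordinary two-group log-rank chi-square statistic and keeps the threshold that maximizes it, without the multiple-testing adjustment that maximally selected rank statistics apply. Function names, the `min_group` parameter, and the toy data are hypothetical.

```python
def logrank_chi2(times, events, in_high):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    obs_minus_exp, variance = 0.0, 0.0
    for t in sorted({s for s, e in zip(times, events) if e}):
        at_risk = [g for s, g in zip(times, in_high) if s >= t]
        n, n1 = len(at_risk), sum(at_risk)  # total and high-group at risk
        d = sum(1 for s, e in zip(times, events) if e and s == t)
        d1 = sum(1 for s, e, g in zip(times, events, in_high)
                 if e and s == t and g)
        obs_minus_exp += d1 - d * n1 / n  # observed minus expected deaths
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


def best_cutoff(scores, times, events, min_group=2):
    """Return (statistic, cutoff) maximizing the log-rank statistic."""
    best = (0.0, None)
    for cut in sorted(set(scores)):
        high = [s > cut for s in scores]
        if min(sum(high), len(high) - sum(high)) < min_group:
            continue  # skip splits that leave a group too small
        stat = logrank_chi2(times, events, high)
        if stat > best[0]:
            best = (stat, cut)
    return best


# Hypothetical topology scores, survival months, and event flags (1 = death).
topo_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
surv_months = [2, 3, 4, 5, 20, 22, 25, 30]
death_flag = [1, 1, 1, 1, 0, 0, 0, 0]
statistic, cutoff = best_cutoff(topo_scores, surv_months, death_flag)
```

Because the threshold is selected to maximize the statistic, an unadjusted p-value from the chosen split would be optimistic; this is the reason dedicated maximally selected rank statistic methods correct for the search.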
Figure 4.

Spatial and topological features predict breast cancer outcome independent of subtype. A, Optimized survival cutoff for all 4 spatial feature combinations determined using all phenotypes and tumor alone respectively, determined by the method of maximally selected rank statistic. B, Kaplan Meier analysis showing that high topological scores predict favorable survival. C, Comparison of topological scores derived for All Phenotypes compared to D, those derived from Tumor Only.
Gene or protein expression signatures are long-established methods for the clinical classification of breast cancers (44, 45). The intrinsic molecular subtypes are the most commonly used classification (44). By this method, classifications are based on the expression of specific proteins or genes, including estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2). Categories in this classification include Luminal A (ER-positive), Luminal B (ER-positive, with HER2 or high Ki67 expression), HER2 (HER2 over-expressed), and triple-negative breast cancer (TNBC)/basal-like (negative for both ER and PR without overexpressed HER2). In terms of clinical outcome, HER2, Luminal B, and TNBC/Basal-like subtypes are viewed as more aggressive than Luminal A (44–47). Notably, when all phenotypes are analyzed, none of the calculations for Base, Base + Intensity K-Function, Base + Topology, or Base + Topology + Intensity K-Function showed significant differences in distribution by subtype (Figure 4C). In contrast, when the calculations are restricted to only the spatial features of the tumor (Tumor Only), the subtypes are more readily distinguished by Base + Intensity K-Function, Base + Topology, and Base + Topology + Intensity K-Function (Figure 4D). Here, the differences appear to be driven more by the clustering of lower scores in TNBC in contrast to the clustering of higher scores in Luminal A (see Figure 4D).
Topology-based biomarkers show less racial variation in survival prediction.
Though traditional gene and protein markers have shown significant utility in predicting breast cancer outcomes, recent studies have revealed substantial race-associated variation in predicting overall survival, recurrence, and response to therapy (23, 48–51). This has been especially the case for hormone receptor biomarkers and biomarkers linked to hormone receptor status (23, 52) (see Supplementary Figure 5). In contrast, significantly less predictive variation is noted in the hazard ratios and confidence intervals generated by the spatial and topology-based scoring when breast cancer cohorts stratified by race/ethnicity (NHW, non-Hispanic White versus NHB, non-Hispanic Black) are compared (Figure 5A versus Figure 5B). This is illustrated by the comparison of NHW Base: HR=0.49 (0.31-0.76), p-value=0.0013 versus NHB Base: HR=0.31 (51) (0.17-0.56), p-value<0.0001; NHW Base+Intensity K-Function: HR=0.39 (0.24-0.65), p-value=0.00019 versus NHB Base+Intensity K-Function: HR=0.44 (0.28-0.71), p-value=0.00058; NHW Base+Topology: HR=0.18 (0.1-0.32), p-value<0.0001 versus NHB Base+Topology: HR=0.33 (0.21-0.54), p-value<0.0001; and NHW Base+Topology+Intensity K-Function: HR=0.21 (0.12-0.345) versus NHB Base+Topology+Intensity K-Function: HR=0.34 (0.21-0.55), p-value<0.0001 (Figure 5B). These similarities across races are in sharp contrast to other traditional protein biomarkers. A head-to-head comparison of the estrogen receptor (ER), the hormone receptor pioneer transcriptional factor protein GATA3, and the hormone receptor pioneer protein FOXA1 to the spatial and topological markers by Forest Plot reveals stark racial differences in survival prediction (Figures 5C and 5D). When both spatial and topological biomarkers, measured using all phenotypes, are compared to ER, GATA3, and FOXA1, neither the spatial biomarkers nor the biomarkers utilizing topological features show strong differences associated with race. 
In contrast, both GATA3 and ER show very distinct differences in survival prediction or association, each showing non-significant survival prediction in NHB in contrast to NHW patients (Figures 5C and 5D) (23). In each case, the spatial and topological features reveal significant sensitivity and accuracy in predicting breast cancer survival independent of race. Moreover, the topological biomarkers trend toward higher sensitivity and comparable or better accuracy (Figure 5C). Similar results are obtained regardless of whether the analysis is limited exclusively to tumor cells or includes all phenotypes (Figure 5D and Supplementary Figure 6).
Figure 5.

Survival prediction by Spatial and Topological score shows little variation based on Race and ethnicity. A,B Kaplan-Meier analysis reveals that spatial and topological features are highly sensitive predictors of breast cancer survival and show very little variation associated with Race when analyzing either NHW (A) or NHB (B) patient populations. C,D Similarly, Forest plot analysis shows that survival prediction sensitivity (HR) and accuracy (confidence intervals) are greater using spatial and topological features compared to the protein biomarkers, ER, GATA3, and FOXA1 (23). These differences persist whether comparing scores calculated using all phenotypes (C) or calculated with scores from tumor-only features (D).
Topological biomarkers show comparable prediction across breast cancer subtypes.
Thus far, spatial and topological biomarkers consistently demonstrate higher survival prediction sensitivity and lower variation across the entire breast cancer cohort without consideration for subtypes, which, as described above, are differentially associated with survival. Therefore, these biomarkers were used to compare survival prediction across breast cancer subtypes (Figure 6A). As demonstrated in Figure 6A, both the Base+Topology: HR=0.15 (0.08-0.31), p-value<0.0001 (top left), and the Base+Topology+Intensity K-Function: HR=0.11 (0.08-0.25), p-value<0.0001 (top right) showed a comparable prediction of survival in patients with Luminal A breast cancers. Similar predictive sensitivity was observed for these values in Luminal B breast cancers, Base+Topology: HR=0.17 (0.08-0.36), p-value<0.0001 versus Base+Topology+Intensity K-Function: HR=0.23 (0.11-0.49), p-value<0.0001. Despite a smaller sample size, these measures continued to show comparable predictive value in the HER2 breast cancer subtype, with Base+Topology: HR=0.11 (0.04-0.31), p-value<0.0001 versus Base+Topology+Intensity K-Function: HR=0.14 (0.05-0.37), p-value<0.0001. Notably, compared to the hormone receptor-positive breast cancer subtypes, the topological biomarkers show both reduced sensitivity and accuracy in patients with TNBC (Figure 6A, bottom right). Base+Topology estimates show HR=0.55 (0.32-0.94), p-value=0.028 versus Base+Topology+Intensity K-Function: HR=0.16 (0.04-0.76), p-value=0.0084. These findings suggest lower sensitivity and accuracy of survival prediction in subtypes with distributions “enriched” or skewed to lower topological scores, which is consistent with both the lower tissue organization and higher-grade characteristic of the TNBC subtype (Figure 6A). These differences persisted in both the presence and absence of PD-L1 (Supplementary Figure 7).
Figure 6.

Topology-based scores show comparable survival prediction across subtypes. A, KM plots comparing topology-derived scores for survival prediction sensitivity (HR) and accuracy (confidence interval) across the subtypes. B, Forest plot comparison of the predictive performance of topology-based scores against traditional biomarkers by univariate comparison. C, Multivariate logistic regression analysis of topology scores derived from All Phenotypes. D, Multivariate logistic regression analysis of topology scores derived from tumor-only predictions showing dependence on subtype and ER status.
Topology-based gene signatures are independent predictors of breast cancer survival.
As has been previously described for protein biomarkers (12, 22, 53), we have used RNA-seq from our patient samples to develop topology-based gene expression scores or gene modules based on the stratification of patients by their topological score. This provides surrogate gene expression biomarkers representing topology scores that facilitate the assessment of their predictive value in proportional hazards models using available cohorts that have gene expression data (Figure 6). The predictive values of several different breast cancer biomarkers, patient traits, and protein-based gene modules were compared by univariate modeling, including Kaiso_Cyto (a signature based on the cytoplasmic enrichment for the Kaiso protein, an established negative predictor of survival) (22), Kaiso Nuclear (a signature based on the nuclear enrichment for Kaiso protein and an established negative predictor of survival independent of Kaiso_Cyto) (22), ER (a positive protein predictor of survival), PR (a positive protein predictor of survival) (54), HER2 (a negative protein predictor of survival) (55), Race=NHB (a negative predictor of survival) (56), membrane-bound E-CAD (a positive predictor of survival) (57), increasing grade (a negative predictor of survival) (5), subtype (where TNBC is a negative predictor of survival), the tumor suppressor p120 (a positive predictor of survival), increasing stage (a negative predictor of survival), and the spatial/topological features Base+Topology, Base+Topology+Intensity K-Function, Base alone, and Base+Intensity K-Function (all predicting favorable survival) (see forest plot, Figure 6B). Very similar relationships with similar magnitude were observed when measurements were limited to tumor-only determinants (Supplemental Figures 8A and 8B). When used in a multivariate Cox proportional hazard model, Base + Topology remains an independent predictor of survival, independent of the Kaiso_Cyto, Kaiso_Nuc, Stage, and p120 gene modules (Figures 6C, D).
Notably, logistic regression models incorporating topology scores generated from All-Phenotypes (AUC= 0.823) (Figure 6C) are comparable to those generated from Tumor-Only measurements (AUC= 0.813) (Figure 6D). Moreover, the inclusion of ER and Subtype in the Tumor-Only models allows the models to achieve comparable predictive accuracy with All-Phenotypes. This interesting finding suggests that the predictive features intrinsic to the tumor microenvironment (i.e., immune cell components) are partially recovered in Tumor-Only modeling by properties associated with hormone receptor and subtype classifications, implicating biologically relevant linkages between subtype, receptor status, and features of the tumor microenvironment. For the most part, All-Phenotypes show moderately greater predictive sensitivity compared to Tumor Only. However, both measurements demonstrate comparable accuracy (Supplemental Figures 8A, B).
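The AUC values used above to compare the All-Phenotypes and Tumor-Only models summarize how well a model's predicted probabilities rank patients with the outcome above those without it. A minimal Python sketch of that calculation follows; the function name, labels, and predicted probabilities are hypothetical and are not taken from the study's models.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical outcome labels (1 = event) and model-predicted probabilities.
outcome = [1, 1, 0, 0, 1]
predicted = [0.9, 0.6, 0.7, 0.2, 0.7]
auc = roc_auc(outcome, predicted)
```

Two models evaluated on the same patients can be compared by calling `roc_auc` once per model with the same `outcome` vector.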
Topology-based signatures predict therapeutic response and associate with novel pathways.
A comparison of the various topology-based gene modules with those previously described demonstrates a significant correlation between the topology-based gene modules (Figure 7A, blue) and those developed based on immune-associated pathways, such as TIL (tumor-infiltrating lymphocytes). As described previously, these gene modules were generated using survival data calculated by applying different survival cutoff values as indicated (e.g., median, optimized, and Tumor-Only features) (12, 22, 23). Notably, the spatial and topological modules all show strong anti-correlations with the epithelial-to-mesenchymal transition (EMT)-based gene modules (Figure 7B). This is consistent with the features of EMT, a prerequisite phenotypic transition in cells that promotes mesenchymal features, loss of adhesion, and invasion, and with its major role as a functional driver of the loss of organized glandular structure (58).
Figure 7.

Topology-based gene modules anti-correlate with EMT and STROMAL gene signatures, predict response to therapy, and are enriched in novel molecular signaling pathways. A, Hierarchical clustering of topology-based gene signatures. B, Correlation of topology-based signatures with established gene signatures used in clinical trials. C, Topology-based gene signatures show a correlation with response to neoadjuvant BrCa therapy (70). D, Overlap between All Phenotype and Tumor Only derived signature genes. E, Gene regulatory network populated by topology-derived gene signatures derived from All Phenotypes. F, The IL-4I1 gene regulatory network is populated by topology-derived gene signatures based on Tumor Only features. G, Patient breast cancers with high versus low spatial topology scores stained for IL-4I1 protein. H, Correlation between spatial topology scores, IL-4I1 protein, and IL-4I1 mRNA.
An analysis of gene expression and therapeutic response data from publicly available breast cancer clinical trials (Supplementary Table 2) demonstrates that topology-based gene signatures readily predict treatment outcome (i.e., pathological complete response (pCR) vs residual disease (RD)) in response to neoadjuvant therapy (Figure 7C). This suggests that the level of organized structure within the tumor, before treatment, plays a major role in predicting patient response to neoadjuvant therapy (Figure 7C). The gene modules, most correlated with pCR, contained signatures that were derived from topological scores that included features from both All-Phenotypes and Tumor Only data sets. Interestingly, the most predictive gene modules were derived from the Topological and Intensity K-function scores calculated using features from the Tumor Only data (Figure 7C). Similar relationships were observed in the analysis of an independent data set (Supplementary Figure 9A-D).
Topology-based gene modules are selectively enriched in novel gene regulatory networks.
To define genes and gene pathways that are uniquely enriched in each topology-based gene module, the relationship between constituent genes in the Base, Base+Intensity K-Function, Base+Topology, and the Base+Topology+Intensity K-Function gene modules was identified and illustrated by a Venn diagram (Figure 7D) and analyzed through the Ingenuity Pathway software (Figures 7E–F). The genes unique to the Base+Topology+Intensity K-Function derived gene module using All-Phenotypes data (N=175) (Figure 7D, left) were found to maximally populate a beta-catenin centered gene regulatory network enriched in cell and tumor migration, cell and tumor invasion, cell movement, and migration of tumor cell lines (Figure 7E). Genes unique to the Base Topology with Intensity K-Function gene module using Tumor Only data (N=142) (Figure 7D, right) overpopulated a highly interconnected regulatory pathway downstream of IL-4I1, an enzyme important in tryptophan metabolism (Figure 7F). As a secreted enzyme, IL-4I1 metabolizes tryptophan to the indole metabolites, Kyn, and kynurenic acid, which function as potent activators of the aryl-hydrocarbon receptor (AHR) (Figure 7F) (59–61). AHR is a cytoplasmic receptor and transcription factor that, when activated by binding to tryptophan metabolites, plays a major role in driving breast cancer growth, proliferation, migration, and invasion (59–63). AHR is expressed across all subtypes of breast cancer, including ER-positive tumors, and has been shown to play a key role in breast cancer progression as a major ligand-activated transcription factor driving gene modules associated with mesenchymal tumor properties and EMT (62, 63). Furthermore, recent studies have implicated AHR and its ligands, generated downstream of IL-4I1, as central regulators of immune checkpoints that suppress B and T-cell adaptive immunity across multiple different tissues and cell types (61, 64).
As shown in Figures 7G-H, quantitative comparison shows that the protein expression of IL-4I1, in both the nucleus and cytoplasm of patient breast cancer tumors, is anti-correlated with topological scores (also see Supplementary Figure 10). These observations strongly implicate a central role for this regulatory limb in driving or responding to disruptions in organized tumor morphology within the tumor microenvironment and in promoting cancer motility, metastasis, and immune suppression, and they support a possible role for IL-4I1 as a relevant breast cancer biomarker.
Discussion
In this study, we used spatial statistics and topology to quantitatively define the level of organized structure in breast cancer tissues visualized by multiplex immunofluorescence. The phenotype and spatial coordinates of tumors, immune cells, and other cell types within the tumor microenvironment were captured and analyzed. We applied spatial and topological features to quantitatively assess these complex, yet signal-rich, spatial relationships within and between the tumor tissues. These features were then combined with clinical data using statistical modeling to define novel biomarkers associated with patient outcome and survival. These spatial biomarkers were then used in combination with patient-specific gene expression data to develop topology-based gene signatures (TbGs) that could be used as gene expression surrogates to assess their predictive value in independent data sets generated and made publicly available through clinical trials.
One notable finding in the study is that when predicting survival, spatial and topological biomarkers demonstrate accuracy comparable to established protein and RNA biomarkers, but show significantly less variation in predictive accuracy based on race. This provocative finding is noteworthy since several recent studies and clinical trial data demonstrate substantial variation in the ability of a variety of gene and protein-based biomarkers and gene signatures to predict clinical outcomes in breast cancer based on race and ethnicity and the role of the tumor microenvironment in these differences (56, 65–68). These studies include observations of traditional protein biomarkers commonly evaluated by immunohistochemistry (e.g., ER, FOXA1, and GATA3) and commercially available gene signatures commonly used in clinical trials (23, 69, 70). Several studies have suggested a demonstrable role for the immune tumor microenvironment as a driver associated with race and ethnicity (49, 71–75). Given the growing appreciation for the impact and role of race, ethnicity, and other social determinants of health in linking environmental exposures to breast cancer risk and outcome, there may be significant value in the capacity to distinguish those features of breast cancer and the tumor environment that are dependent on environmental exposures from those that are intrinsic to the tumor. These findings suggest that loss of organized structure is an intrinsic feature of cancer that is invariant or shows less variation in association with elements linked to environmental exposure or the interaction between race and the “built” or man-made environment. Nonetheless, there were isolated examples in this study of notable differences in survival prediction accuracy that were differentially associated with spatial and topological features linked to race. 
For example, as shown in Figures 5A and 5B, for non-topology features (both Base and Base_iK), we observe clear differences in prediction power across races. In contrast, for topological features (both Base_topo and Base_topo_iK), the prediction power is much more robust and shows much less variation based on race. Though other differences across races for certain topological features were also detectable, understanding their possible implications will require larger datasets for future evaluation.
Aside from the novel insights, technical innovations described in this work include: 1) the introduction of filtration weighted by the PDL1 intensity, to enable the generation of topological signatures with higher biological relevance; 2) the introduction of topological features into the training of survival analysis models; and 3) the adaptation of these models to derive topology-based gene signatures that uncover and define novel linkages between metabolism and breast cancer progression.
One of the most intriguing aspects of this study is the finding that topology-dependent gene signatures are enriched for pathways linked to metabolic regulation of aryl-hydrocarbon receptor signaling. As described above, IL-4I1 is a secreted enzyme that is a major driver of the production of ligands for aryl-hydrocarbon receptors (AHR), which have broad influences over important functions and properties within the tumor microenvironment. Aryl-hydrocarbons are well-known cancer-causing environmental pollutants, including hexachlorobenzene and chlorpyrifos, polycyclic aromatic hydrocarbons, polychlorinated dibenzo-p-dioxins and dibenzofurans, polychlorinated biphenyls, parabens, and phthalates. These ligands and their receptors have been the broad focus of numerous clinical and preclinical studies designed to find and understand the role of the aryl-hydrocarbon receptor in breast cancer development and progression. IL-4I1 plays a central role in the metabolism of tryptophan to kynurenic acid (KynA) and indole-3-pyruvic acid (I3P) and their metabolites indole-3-acetic acid (IAA), indole-3-aldehyde (I3A), and indole-3-lactic acid (ILA), all of which are potent activators of AHR signaling (59). Our finding that the IL-4I1-regulated metabolic pathway and its production of “intrinsic activators” of the aryl-hydrocarbon pathway is highly associated with low spatial and topological scores or “loss of organized structure” implicates IL-4I1 as a “regulator” or “responder” to changes in organized epithelial structure. Whether loss of organized structure occurs upstream or downstream of activation of IL-4I1, this enzyme and its control of aryl-hydrocarbon signaling will be an intriguing future direction as we explore the utility of spatial and topological biomarkers in breast cancer treatment and diagnosis. Such findings are also likely to provide new therapeutic insights into linkages between human metabolism, environmental exposures, and cancer.
Our application and adaptation of such methods to quantify shape and histological morphology using persistent homology to develop prognostic and predictive algorithms exemplify how defining new linkages between biology, medicine, and mathematics has immense future promise. With the increased availability and accuracy of machine and deep-learning based algorithms that generate point clouds distinguishing tumor and immune cells from other stromal cells on immunohistochemically stained slides, the application of topology-derived algorithms to standard-of-care PD-L1-based IHC workflows has the potential to increase sensitivity, accuracy, and predictive value in contemporary clinical settings.
Supplementary Material
Significance.
Topological features of breast cancer histology can be quantified on a continuous scale and used to accurately predict breast cancer patient survival and response to therapy.
Acknowledgments
We thank Rami Vanguri for his contributions to the preliminary studies (13).
Funding:
National Institutes of Health, R01CA253368, R01CA266040 (KG); National Institutes of Health, R01CA297843, R01NS1431, R01GM148970 (CC); National Institutes of Health, R01CA267897 (JAM); National Institutes of Health, P01CA174653, P01CA285250, R35CA253126, P30CA013696, R01AG077020, U01CA261822 (RR)
Footnotes
The authors have no conflicts of interest to disclose.
Data Availability
RNA-seq data are available at SRA archives https://www.ncbi.nlm.nih.gov/sra/SRP158272 (22, 23). Protein-based immunohistochemistry and immunofluorescent intensity (x,y) coordinate point cloud data have been uploaded to Figshare (https://figshare.com/s/b5652eb7712fa83cf8bc) as previously described (22). All other raw data generated in this study are available upon request from the corresponding author.
References
- 1. Siegel RL, Giaquinto AN, Jemal A, Cancer statistics, 2024. CA Cancer J Clin 74, 12–49 (2024).
- 2. Islami F et al., American Cancer Society’s report on the status of cancer disparities in the United States, 2023. CA Cancer J Clin 74, 136–166 (2024).
- 3. Bray F et al., Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 10.3322/caac.21834 (2024).
- 4. Wang M et al., Determining breast cancer histological grade from RNA-sequencing data. Breast Cancer Res 18, 48 (2016).
- 5. Lee AH, Ellis IO, The Nottingham prognostic index for invasive carcinoma of the breast. Pathol Oncol Res 14, 113–115 (2008).
- 6. Mebratie DY, Dagnaw GG, Review of immunohistochemistry techniques: Applications, current status, and future perspectives. Semin Diagn Pathol 41, 154–160 (2024).
- 7. Harms PW et al., Multiplex Immunohistochemistry and Immunofluorescence: A Practical Update for Pathologists. Mod Pathol 36, 100197 (2023).
- 8. Francisco-Cruz A, Parra ER, Tetzlaff MT, Wistuba II, Multiplex Immunofluorescence Assays. Methods in molecular biology (Clifton, N.J.) 2055, 467–495 (2020).
- 9. Wrobel J, Harris C, Vandekar S, Statistical Analysis of Multiplex Immunofluorescence and Immunohistochemistry Imaging Data. Methods in molecular biology (Clifton, N.J.) 2629, 141–168 (2023).
- 10. Locke D, Hoyt CC, Companion diagnostic requirements for spatial biology using multiplex immunofluorescence and multispectral imaging. Front Mol Biosci 10, 1051491 (2023).
- 11. Kumar G, Pandurengan RK, Parra ER, Kannan K, Haymaker C, Spatial modelling of the tumor microenvironment from multiplex immunofluorescence images: methods and applications. Frontiers in immunology 14, 1288802 (2023).
- 12. Singhal SK et al., Protein expression of the gp78 E3 ligase predicts poor breast cancer outcome based on race. JCI Insight 7 (2022).
- 13. Aukerman A et al., Persistent homology based characterization of the breast cancer immune microenvironment: a feasibility study. Journal of Computational Geometry 12, 183–206 (2021).
- 14. Vasaturo A, Galon J, Multiplexed immunohistochemistry for immune cell phenotyping, quantification and spatial distribution in situ. Methods Enzymol 635, 51–66 (2020).
- 15. Glasson Y et al., Single-cell high-dimensional imaging mass cytometry: one step beyond in oncology. Semin Immunopathol 45, 17–28 (2023).
- 16. Le Rochais M, Hemon P, Pers JO, Uguen A, Application of High-Throughput Imaging Mass Cytometry Hyperion in Cancer Research. Frontiers in immunology 13, 859414 (2022).
- 17. Benjamin K et al., Multiscale topology classifies cells in subcellular spatial transcriptomics. Nature 630, 943–949 (2024).
- 18. Vipond O et al., Multiparameter persistent homology landscapes identify immune cell spatial patterns in tumors. Proc Natl Acad Sci U S A 118 (2021).
- 19. Lawson P, Sholl AB, Brown JQ, Fasy BT, Wenk C, Persistent homology for the quantitative evaluation of architectural features in prostate cancer histology. Scientific reports 9, 1139 (2019).
- 20. Heindl A, Nawaz S, Yuan Y, Mapping spatial heterogeneity in the tumor microenvironment: a new era for digital pathology. Laboratory investigation 95, 377–384 (2015).
- 21. Nawaz S, Yuan Y, Computational pathology: Exploring the spatial dimension of tumor ecology. Cancer letters 380, 296–303 (2016).
- 22. Singhal SK et al., Kaiso (ZBTB33) subcellular partitioning functionally links LC3A/B, the tumor microenvironment, and breast cancer survival. Communications biology 4, 150 (2021).
- 23. Byun JS et al., Racial Differences in the Association Between Luminal Master Regulator Gene Expression Levels and Breast Cancer Survival. Clin Cancer Res 26, 1905–1914 (2020).
- 24. Ripley BD, Spatial statistics (John Wiley & Sons, 2005).
- 25. Baddeley A, Turner R, Spatstat: an R package for analyzing spatial point patterns. Journal of statistical software 12, 1–42 (2005).
- 26. De Berg M, Computational geometry: algorithms and applications (Springer Science & Business Media, 2000).
- 27. Adams H et al., Persistence images: a stable vector representation of persistent homology. Journal of Machine Learning Research 18, 1–35 (2017).
- 28. Bubenik P, Statistical topological data analysis using persistence landscapes. J. Mach. Learn. Res. 16, 77–102 (2015).
- 29. Maria C, Boissonnat J-D, Glisse M, Yvinec M, The Gudhi library: Simplicial complexes and persistent homology. In International Congress on Mathematical Software (Springer, 2014), pp 167–174.
- 30. Vipond O, Multiparameter persistence landscapes. Journal of Machine Learning Research 21, 1–38 (2020).
- 31. Carriere M, Blumberg A, Multiparameter persistence image for topological machine learning. Advances in Neural Information Processing Systems 33, 22432–22444 (2020).
- 32. Hothorn T, Lausen B, On the exact distribution of maximally selected rank statistics. Computational Statistics & Data Analysis 43, 121–137 (2003).
- 33. Cox DR, Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34, 187–202 (1972).
- 34. Hanley JA, McNeil BJ, The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).
- 35. Girden ER, ANOVA: Repeated measures. Sage 84 (1992).
- 36. Ward MD, Ahlquist JS, Maximum Likelihood for Social Science: Strategies for Analysis (Cambridge University Press, 2018), p. 36.
- 37. Zar JH, Significance Testing of the Spearman Rank Correlation Coefficient. Journal of the American Statistical Association 67, 578–580 (1972).
- 38. Edelsbrunner H, Harer JL, Computational topology: an introduction (American Mathematical Society, 2022).
- 39. Dey TK, Wang Y, Computational topology for data analysis (Cambridge University Press, 2022).
- 40. Munch E, A user’s guide to topological data analysis. Journal of Learning Analytics 4, 47–61 (2017).
- 41. Gibson MC, Patel AB, Nagpal R, Perrimon N, The emergence of geometric order in proliferating metazoan epithelia. Nature 442, 1038–1041 (2006).
- 42. Cha JH, Chan LC, Li CW, Hsu JL, Hung MC, Mechanisms Controlling PD-L1 Expression in Cancer. Mol Cell 76, 359–370 (2019).
- 43. Sun C, Mezzadra R, Schumacher TN, Regulation and Function of the PD-L1 Checkpoint. Immunity 48, 434–452 (2018).
- 44. Perou CM, Parker JS, Prat A, Ellis MJ, Bernard PS, Clinical implementation of the intrinsic subtypes of breast cancer. The Lancet. Oncology 11, 718–719; author reply 720–721 (2010).
- 45. Nielsen TO et al., A comparison of PAM50 intrinsic subtyping with immunohistochemistry and clinical prognostic factors in tamoxifen-treated estrogen receptor-positive breast cancer. Clin Cancer Res 16, 5222–5232 (2010).
- 46. O’Brien KM et al., Intrinsic breast tumor subtypes, race, and long-term survival in the Carolina Breast Cancer Study. Clin Cancer Res 16, 6100–6110 (2010).
- 47. Parker JS et al., Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of clinical oncology: official journal of the American Society of Clinical Oncology 27, 1160–1167 (2009).
- 48. Martini R et al., African Ancestry Associated Gene Expression Profiles in Triple Negative Breast Cancer Underlie Altered Tumor Biology and Clinical Outcome in Women of African Descent. Cancer discovery 10.1158/2159-8290.Cd-22-0138 (2022).
- 49. Troester MA et al., Racial Differences in PAM50 Subtypes in the Carolina Breast Cancer Study. J Natl Cancer Inst 110 (2018).
- 50. Williams LA et al., Differences in race, molecular and tumor characteristics among women diagnosed with invasive ductal and lobular breast carcinomas. Cancer Causes Control 30, 31–39 (2019).
- 51. Ma H et al., Quantitative measures of estrogen receptor expression in relation to breast cancer-specific mortality risk among white women and black women. Breast Cancer Res 15, R90 (2013).
- 52. Carey LA et al., Race, breast cancer subtypes, and survival in the Carolina Breast Cancer Study. JAMA 295, 2492–2502 (2006).
- 53. Byun JS et al., Racial Differences in the Association between Luminal Master Regulator Gene Expression Levels and Breast Cancer Survival. Clin Cancer Res 10.1158/1078-0432.CCR-19-0875 (2020).
- 54. Beato M, Wright RHG, Dily FL, 90 YEARS OF PROGESTERONE: Molecular mechanisms of progesterone receptor action on the breast cancer genome. J Mol Endocrinol 65, T65–T79 (2020).
- 55. Prat A et al., Molecular features and survival outcomes of the intrinsic subtypes within HER2-positive breast cancer. J Natl Cancer Inst 106 (2014).
- 56. Silber JH et al., Characteristics associated with differences in survival among black and white women with breast cancer. JAMA 310, 389–397 (2013).
- 57. Horne HN et al., E-cadherin breast tumor expression, risk factors and survival: Pooled analysis of 5,933 cases from 12 studies in the Breast Cancer Association Consortium. Sci Rep 8, 6574 (2018).
- 58. Neelakantan D et al., EMT cells increase breast cancer metastasis via paracrine GLI activation in neighbouring tumour cells. Nat Commun 8, 15773 (2017).
- 59. Zeitler L, Murray PJ, IL4i1 and IDO1: Oxidases that control a tryptophan metabolic nexus in cancer. J Biol Chem 299, 104827 (2023).
- 60. Sabnis RW, Novel IL4I1 Inhibitors for Treating Cancer. ACS Med Chem Lett 14, 700–701 (2023).
- 61. Sadik A et al., IL4I1 Is a Metabolic Immune Checkpoint that Activates the AHR and Promotes Tumor Progression. Cell 182, 1252–1270.e34 (2020).
- 62. Miret NV, Pontillo CA, Buján S, Chiappini FA, Randi AS, Mechanisms of breast cancer progression induced by environment-polluting aryl hydrocarbon receptor agonists. Biochem Pharmacol 216, 115773 (2023).
- 63. Vogel CFA et al., Targeting the Aryl Hydrocarbon Receptor Signaling Pathway in Breast Cancer Development. Frontiers in immunology 12, 625346 (2021).
- 64. Li T et al., Thymol targeting interleukin 4 induced 1 expression reshapes the immune microenvironment to sensitize the immunotherapy in lung adenocarcinoma. MedComm (2020) 4, e355 (2023).
- 65. Yao S et al., Breast Tumor Microenvironment in Black Women: A Distinct Signature of CD8+ T Cell Exhaustion. J Natl Cancer Inst 10.1093/jnci/djaa215 (2021).
- 66. Singh M, Konduri SD, Bobustuc GC, Kassam AB, Rovin RA, Racial Disparity Among Women Diagnosed With Invasive Breast Cancer in a Large Integrated Health System. Journal of patient-centered research and reviews 5, 218–228 (2018).
- 67. Ademuyiwa FO et al., US breast cancer mortality trends in young women according to race. Cancer 121, 1469–1476 (2015).
- 68. Sparano JA et al., Race and hormone receptor-positive breast cancer outcomes in a randomized chemotherapy trial. J Natl Cancer Inst 104, 406–414 (2012).
- 69. Holowatyj AN et al., Racial Differences in 21-Gene Recurrence Scores Among Patients With Hormone Receptor-Positive, Node-Negative Breast Cancer. Journal of clinical oncology: official journal of the American Society of Clinical Oncology 36, 652–658 (2018).
- 70. Ignatiadis M et al., Gene modules and response to neoadjuvant chemotherapy in breast cancer subtypes: a pooled analysis. Journal of clinical oncology: official journal of the American Society of Clinical Oncology 30, 1996–2004 (2012).
- 71. Du Z et al., Evaluating Polygenic Risk Scores for Breast Cancer in Women of African Ancestry. J Natl Cancer Inst 10.1093/jnci/djab050 (2021).
- 72. Kim G et al., The Contribution of Race to Breast Tumor Microenvironment Composition and Disease Progression. Front Oncol 10, 1022 (2020).
- 73. Albain KS et al., Race, ethnicity and clinical outcomes in hormone receptor-positive, HER2-negative, node-negative breast cancer in the randomized TAILORx trial. J Natl Cancer Inst 10.1093/jnci/djaa148 (2020).
- 74. O’Meara T et al., Immune microenvironment of triple-negative breast cancer in African-American and Caucasian women. Breast cancer research and treatment 10.1007/s10549-019-05156-5 (2019).
- 75. Ding YC et al., Molecular subtypes of triple-negative breast cancer in women of different race and ethnicity. Oncotarget 10, 198–208 (2019).
