Summary
Multiplexed spatial proteomics profiling platforms expose the intricate geometric structure of cells in the tumor microenvironment (TME). The spatial arrangement of cells has been shown to have important clinical implications, correlating with disease prognosis and treatment response. These datasets require new statistical methods to test whether cell-level images are associated with patient-level outcomes. We propose the topological kernel association test (TopKAT), which combines persistent homology with kernel testing to determine whether geometric structures created by cells predict continuous, binary, or survival outcomes. TopKAT quantifies the topological structure of cells in each image using persistence diagrams and compares the similarities between persistence diagrams on the basis of the number and lifespan of the detected homologies among cells. We show that TopKAT can be more powerful than existing approaches, particularly when cells arise along the boundary of a ring, and demonstrate its utility in breast cancer and colorectal cancer applications.
Keywords: topological data analysis, persistent homology, kernel association testing, kernel machine regression, multiplexed spatial proteomics, cell-level imaging, tumor microenvironment
Highlights
• TopKAT is an association test between spatial proteomics and sample-level outcomes
• TopKAT tests whether topological patterns among cells associate with clinical outcomes
• TopKAT increases power to detect patterns in the spatial arrangement of cells
• TopKAT detected meaningful patterns among cells in breast and colorectal cancers
The bigger picture
The tumor microenvironment refers to the ecosystem of cells, tissue, and blood vessels that populate a tumor. Growing evidence suggests that the cellular composition of the tumor microenvironment, including the cell types present and their locations, correlates with how patients fare in the clinic. In particular, the complex structures formed by cells may distinguish between patients with different clinical profiles, such as those more likely to respond to treatment or those who may have better survival outcomes. Using one of the several available technologies to image the tumor microenvironment to retrieve this cellular information can allow researchers to determine which cellular structures are associated with better clinical outcomes. In the future, this may guide treatment decisions and risk stratification based on the unique molecular qualities of each patient’s tumor.
The organization of cells within tumors is hypothesized to associate with patient outcomes, such as clinical prognosis and treatment response. To test this hypothesis, statistical association tests between the arrangement of cells and patient-level outcomes are needed. A new method, TopKAT, is proposed to address this need. TopKAT quantifies the geometric patterns created by cells, such as loops or rings, and tests whether these patterns associate with outcomes such as continuous, binary, or time-to-event endpoints.
Introduction
The use of spatially resolved, multiplexed proteomics profiling is rapidly expanding in oncology research.1,2,3,4 The ability of spatial proteomic technologies to comprehensively probe the intricate spatial arrangement of tumor, immune, and stromal cells in the tumor microenvironment (TME) provides critical knowledge on how the TME affects tumorigenesis and disease progression,5,6 treatment response,7 and eventual patient survival.8,9 Multiplexed spatial proteomics has been successfully implemented to elucidate the effect of the TME on clinical outcomes in breast,7,8,10 colorectal,11 and lung12,13 cancers, among other diseases.14,15
Despite the growing use of spatial proteomic technologies,16 tools for analyzing the resulting cell-level data lag far behind technical developments, with a serious dearth of tailored, flexible, and robust analytic methods for studying how cellular arrangements are related to patient-level outcomes. Many current findings have been derived from inconsistent, informal or qualitative, or even invalid statistical approaches. Studies using formal, inferential statistical analyses have focused on summarizing the spatial arrangement of cells within a circle of radius r using spatial summary statistics.17,18,19,20 These summary statistics, such as Ripley’s K,21 quantify the degree to which the cells exhibit clustering, dispersion, or complete spatial randomness (CSR). This neglects the complex geometric arrangements cells form, for example, around dense tumor regions, necrotic areas, or blood vessels. This geometric information may be clinically or biologically relevant and may be difficult to detect using summary statistics such as Ripley’s K. In addition, these methods rely on the assumption of homogeneity or the idea that the number of cells per unit area is constant irrespective of location. If this assumption is violated, as it often is due to gaps or tears in the tissue during sample processing,22 we may lose power to detect an effect on clinical outcomes. Finally, existing approaches are also restricted to a single clinical-outcome type. Several recent approaches work for binary outcomes (e.g., treatment response)20,23,24 but do not accommodate other important outcomes, such as right-censored survival or quantitative outcomes. The lack of appropriate and adaptable tools poses a serious challenge, limiting full utilization and interpretation of the rich data arising in spatial proteomic studies.
The goal of our work is to characterize the geometry of how cells organize in tissue and examine if this geometric characterization associates with patient-level outcomes, such as survival and treatment response. To identify clinically relevant geometric structures among cells, we combine the topological data analysis (TDA) technique known as persistent homology (PH)25 with kernel association testing to produce the topological kernel association test (TopKAT). Operationally, TopKAT uses PH to characterize the size and number of homologies, namely, connected components and loops, formed by cells in tissue. Then, TopKAT compares whether samples with similar topological or geometric cell structures also exhibit similar clinical phenotypes using kernel association testing. While TDA has seen some basic utility in clinical omics,26,27,28,29 it is notably absent from mainstream analyses of spatial single-cell technologies, and existing deployments have not been integrated with formal statistical hypothesis testing. To address this, we link PH with the nonparametric kernel association testing framework commonly used for other genomic data types.30,31,32 While we illustrate TopKAT using multiplexed spatial proteomics, this method can be applied to other data types that yield cell-level resolution, such as spatial transcriptomics, to produce biological insights. We demonstrate through simulation studies that TopKAT offers higher power and better type I error control than existing alternatives, which utilize spatial summary statistics. Further, we apply TopKAT to data from two studies of triple-negative breast cancer (TNBC)7,8 to illustrate the importance of topological structure among cells in describing patient survival and response to immunotherapy treatment.
Results
Overview of TopKAT
TopKAT is a global test for the association between topological (geometric) structures within the spatial distribution of cells in segmented and phenotyped cell-level data and sample-level outcomes, adjusting for potential confounders. TopKAT can accept data from any profiling modality as long as the data have been processed to identify cell boundaries (segmentation) and phenotypes. The motivation of TopKAT is that the geometry of the spatial distribution of cells may characterize a phenotype with clinical implications.
Operationally, TopKAT follows a three-step procedure (Figure 1). First, TopKAT captures the topological structure of the cells within each image using PH and by applying a Rips filtration.33 The Rips filtration is used to reveal the presence, size, and quantity of topological structures known as degree-0 and degree-1 homologies. Degree-0 homologies refer to connected components and degree-1 homologies refer to loops or rings. Since our single-cell images are two-dimensional, we are interested in detecting connected components (which represent dense masses of cells) and loops (which represent regions where cells arise along the boundary of a ring but are absent from the center). The topological structure of each image is characterized in a summary statistic termed a persistence diagram.33
Figure 1.
Flow chart illustrates the steps of TopKAT
The input data are n segmented cell-level images (0). TopKAT then characterizes the topological structures created by cells in each image using a Rips filtration, which generates a nested sequence of graphs with cells connected by edges if they are no more than 2ϵ apart (1). The detected topological structures are summarized for each image using persistence diagrams (2). TopKAT then calculates the distance between each pair of persistence diagrams, which is converted to a similarity and quantified in kernel matrices for connected components and loops (3). Each column in the kernel matrix represents the similarity between a specific persistence diagram and all other persistence diagrams in the dataset. The kernel matrices are used in a kernel machine regression model to test for an association between the persistence diagrams and a clinical outcome.
The second step is to compute a pairwise distance matrix quantifying the distance (or dissimilarity) between each pair of persistence diagrams. We construct a distance matrix for each homology (connected components and loops) that allows us to describe the differences between samples based on each type of topological structure. We then convert these distance matrices to kernel matrices, which describe the pairwise similarity between persistence diagrams.
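The conversion from distances to similarities can be sketched as follows. This is a minimal illustration, not the TopKAT implementation: it assumes Gower's centering (the kernel named in the Discussion) and uses a placeholder dissimilarity, the absolute difference in summed lifespans, which stands in for the total dissimilarity and may differ from the definition TopKAT uses. All function names are hypothetical, and each persistence diagram is assumed to be a list of finite (birth, death) pairs for a single homology degree.

```python
def total_lifespan(diagram):
    """Sum of (death - birth) over the features of one persistence diagram."""
    return sum(death - birth for birth, death in diagram)


def distance_matrix(diagrams):
    """Pairwise placeholder dissimilarity between persistence diagrams.

    ASSUMPTION: |difference in summed lifespans| stands in for the
    total dissimilarity used by TopKAT, whose exact form is not given here.
    """
    totals = [total_lifespan(d) for d in diagrams]
    return [[abs(a - b) for b in totals] for a in totals]


def gower_kernel(dist):
    """Gower-centered kernel K = -1/2 * J * D^2 * J, with J = I - 11'/n."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    # D is symmetric, so row means and column means coincide.
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering: subtract row and column means, add back the grand mean.
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]


# Toy usage: three diagrams for one homology degree.
diagrams = [[(0, 1)], [(0, 3)], [(0, 2), (0, 4)]]
D = distance_matrix(diagrams)
K = gower_kernel(D)
```

In practice one such kernel matrix would be built per homology degree (connected components and loops), so that each can enter the association test separately or in combination.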
Finally, we use a kernel testing framework34 to test for an association between the similarity in persistence diagrams and similarities in continuous, binary, or survival outcomes, adjusting for covariates. To include information from each homology group, we consider a series of weighted combinations of the kernel matrices for the connected components and loops. We iterate across each potential combination and aggregate the resulting p values using the Cauchy combination test.35 If similarities among the persistence diagrams (quantity and size of homologies) align with similarities in the outcome, topological structures within the TME are associated with outcomes, such as treatment response or survival.
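The aggregation step can be sketched as follows, assuming the standard form of the Cauchy combination test, in which each p value is mapped to a Cauchy variate and the weighted average is mapped back. The convex combination used in `combine_kernels` is an assumption about how the weighted kernel combinations are formed; both function names are hypothetical.

```python
import math


def combine_kernels(k0, k1, w):
    """Weighted combination w * K0 + (1 - w) * K1 of two kernel matrices.

    ASSUMPTION: a convex combination with 0 <= w <= 1; the weighting
    scheme used by TopKAT may differ.
    """
    n = len(k0)
    return [[w * k0[i][j] + (1 - w) * k1[i][j] for j in range(n)]
            for i in range(n)]


def cauchy_combination(p_values, weights=None):
    """Combine p values with the Cauchy combination test.

    Each p value is transformed to tan((0.5 - p) * pi); the weighted mean
    of these is referred to a standard Cauchy distribution. P values of
    exactly 0 or 1 should be truncated before calling this function.
    """
    if weights is None:
        weights = [1.0] * len(p_values)
    total = sum(weights)
    stat = sum(w * math.tan((0.5 - p) * math.pi)
               for w, p in zip(weights, p_values)) / total
    return 0.5 - math.atan(stat) / math.pi


# Toy usage: one p value per candidate weighting of the two kernels.
combined_p = cauchy_combination([0.01, 0.20, 0.60])
```

The appeal of this aggregation is that the combined p value has a closed form and does not require estimating the correlation among the individual tests.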
There are several advantages to using TopKAT. First, PH describes the geometry of the spatial distribution of cells, which can be used to compare the topological differences between samples with distinct outcomes. This description of the spatial distribution of cells is robust to perturbations in the data due to technical artifacts.33 This is essential because our analysis is downstream of a complex normalization, segmentation, and phenotyping pipeline, which may contribute to noise in the images, such as mislabeled cells. Second, PH can be used to identify global structures within the TME, such as large masses of immune cells or regions within the TME where immune cells are unable to penetrate. These attributes are difficult to capture with other approaches, such as spatial summary statistics.22 These global structures have been shown to be clinically informative in breast cancer and colorectal cancer (CRC),7,8,36 which we explore below and in Notes S7 and S8.
TopKAT shows performance gains in simulation study
We evaluated the validity (type I error control) and power of TopKAT on simulated images and compared it to existing approaches to assess the advantage of using topological information to predict clinical outcomes. We studied three variations of TopKAT: in the first, we only considered similarities in the number and size of connected components across images; in the second, we only considered similarities among loops; and in the third, we aggregated the kernel matrices across similarities in both homology groups using the Cauchy combination test (Note S2). We compared TopKAT to existing approaches grounded in the spatial point process model, including SPOT,17 FunSpace,18 and SPF,19 which we henceforth refer to as “spatial methods.” SPOT, FunSpace, and SPF calculate a spatial summary statistic called Besag’s L, denoted as L(r), to characterize the degree to which cells exhibit clustering, dispersion, or complete spatial randomness (CSR) within a circle of radius r. The pipeline to relate L(r) to a clinical outcome differs by method: SPOT calculates L(r) at a sequence of r values, tests for an association with a clinical outcome at each r, and aggregates the p values; FunSpace calculates L(r) across a sequence of r values to yield L(r) functions for each image and decomposes the variation among these functions using functional principal-component analysis within a regression model; and SPF similarly calculates L(r) functions for each image and uses generalized functional regression models to associate these functions with a clinical outcome. The different summary statistics produced by these approaches are visualized in Figure 2.
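For reference, Besag's L is a variance-stabilizing transformation of Ripley's K. The standard definitions for a stationary point process with intensity λ (cells per unit area) are given below; the symbol λ is introduced here for illustration and does not appear in the original text.

```latex
\begin{aligned}
K(r) &= \lambda^{-1}\,\mathbb{E}\bigl[\text{number of other cells within distance } r \text{ of a typical cell}\bigr],\\
L(r) &= \sqrt{K(r)/\pi},\\
\text{under CSR:}\quad K(r) &= \pi r^{2} \;\Rightarrow\; L(r) = r.
\end{aligned}
```

Values of L(r) above r therefore suggest clustering at scale r, and values below r suggest dispersion.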
Figure 2.
Comparing the spatial summary statistics used in TopKAT vs. SPOT, SPF, and FunSpace
TopKAT uses the persistence diagram to summarize the topological structures created by cells. SPOT, SPF, and FunSpace use Besag’s L, a second-order summary statistic of the spatial point process model, to characterize the clustering, or lack thereof, among cells.
We simulated n = 100 samples exhibiting different geometric structures among the cells. We split the samples into two groups of 50 and selected a different geometric structure to randomly simulate within each group. This included random numbers of squares (“square”), loops (“loop”), bivariate Gaussian distributions (“clusters”), simulated tissue images using the scSpatialSIM R package37 (“simulated tissue”), and CSR. Examples of these images are provided in Note S5. The simulation conditions are referred to based on the geometric structures chosen for each group, e.g., squares vs. loop or loop vs. CSR. To assess power, survival outcomes for these two groups were simulated from exponential distributions with different rates, with a hazard ratio of 2. To assess validity, the outcomes across both groups were simulated from the same exponential distribution. We simulated survival outcomes, though TopKAT accommodates continuous and binary outcomes, which are illustrated in our data applications.
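The outcome-generating step can be sketched as follows. For exponential survival times the hazard equals the rate, so a hazard ratio of 2 corresponds to doubling the rate in one group. The baseline rate, the seed, and the function name are hypothetical, and no censoring is simulated because the censoring mechanism is not described in this section.

```python
import random


def simulate_survival(n_per_group=50, base_rate=1.0, hazard_ratio=2.0, seed=0):
    """Simulate exponential survival times for two groups of images.

    Group 0 has hazard base_rate; group 1 has hazard base_rate * hazard_ratio.
    Setting hazard_ratio=1.0 gives the null scenario used to assess
    type I error. ASSUMPTION: base_rate and seed are illustrative values.
    """
    rng = random.Random(seed)
    group = [0] * n_per_group + [1] * n_per_group
    times = [
        rng.expovariate(base_rate * (hazard_ratio if g == 1 else 1.0))
        for g in group
    ]
    return group, times


# Power scenario (hazard ratio 2) and null scenario (hazard ratio 1).
group, times_power = simulate_survival(hazard_ratio=2.0)
_, times_null = simulate_survival(hazard_ratio=1.0)
```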
The results are given in Table 1. Across all conditions, TopKAT offers higher power and, across most conditions, more tightly controls type I error than the spatial approaches. This suggests that (1) topological information (connected components and loops) captures the different geometric and spatial arrangements among the cells and (2) our choice of distance measure and kernel function captures the similarities and differences in this topological information between images. TopKAT far exceeds the spatial methods in power under scenarios in which loops were simulated (loop vs. CSR or square vs. loop). All three variations of TopKAT exhibited power between 0.83 and 0.90, whereas the spatial methods exhibited power between 0.52 and 0.69. This is because TopKAT explicitly detects the presence and size of loops, while the spatial methods capture cell density within a specific radius. TopKAT, however, also detects differences in cell density, as shown by the square vs. CSR and cluster vs. CSR conditions. Here, the images contained either tight masses of cells (squares, which reflect large connected components), diffuse masses of cells (clusters, which also reflect connected components), or random noise (CSR). Under cluster vs. CSR, TopKAT and the spatial methods showed comparable power (around 0.88 for TopKAT and between 0.78 and 0.85 for the spatial methods); under square vs. CSR, the gap was wider (0.88 to 0.91 for TopKAT vs. 0.68 to 0.77 for the spatial methods). Finally, TopKAT is sensitive to differences due to tissue structure. Under simulated tissue, TopKAT detected that differences in tissue structure were associated with sample-level phenotypes. Identifying these structural associations is essential for real applications.29
Table 1.
Power and type I error rates for TopKAT vs. spatial methods (SPOT, SPF, and FunSpace) across a range of simulated images for a survival outcome
| Model | Square vs. CSR: power | Square vs. CSR: type I error | Loop vs. CSR: power | Loop vs. CSR: type I error | Square vs. loop: power | Square vs. loop: type I error | Cluster vs. CSR: power | Cluster vs. CSR: type I error | Simulated tissue: power | Simulated tissue: type I error |
|---|---|---|---|---|---|---|---|---|---|---|
| TopKAT (dim 0) | 0.899 | 0.044 | 0.894 | 0.054 | 0.830 | 0.054 | 0.886 | 0.048 | 0.877 | 0.050 |
| TopKAT (dim 1) | 0.882 | 0.050 | 0.895 | 0.060 | 0.863 | 0.059 | 0.885 | 0.045 | 0.876 | 0.052 |
| TopKAT (omnibus) | 0.907 | 0.048 | 0.895 | 0.058 | 0.851 | 0.055 | 0.884 | 0.047 | 0.878 | 0.052 |
| SPOT | 0.770 | 0.051 | 0.610 | 0.055 | 0.593 | 0.052 | 0.845 | 0.052 | 0.800 | 0.059 |
| SPF | 0.754 | 0.059 | 0.609 | 0.063 | 0.685 | 0.067 | 0.781 | 0.058 | 0.726 | 0.069 |
| FunSpace | 0.681 | 0.047 | 0.516 | 0.054 | 0.578 | 0.058 | 0.825 | 0.050 | 0.669 | 0.059 |
TopKAT (dim 0) refers to TopKAT using only the kernel matrix derived from the persistence of dimension-0 homologies (connected components). TopKAT (dim 1) refers to TopKAT using only the kernel matrix derived from the persistence of dimension-1 homologies (loops). TopKAT (omnibus) refers to the test aggregated across both homology groups.
Within TopKAT, we compared using topological information from each homology group: only the connected components (“dim 0” in Table 1) or only the loops (“dim 1” in Table 1). We contrasted these with an omnibus test across both groups (“omnibus” in Table 1). Intuitively, in scenarios with dense regions of cells (e.g., square vs. CSR), testing solely based on similarities among the connected components should yield more power. We observed this to be the case, as TopKAT (dim 0) power (0.899) slightly exceeded the power of TopKAT (dim 1) (0.882) in this condition. In scenarios where loops were simulated, detecting only loops should be more powerful than detecting only connected components. We found that this was the case, though the gain in power was slight (0.895 for TopKAT [dim 1] vs. 0.894 for TopKAT [dim 0]). Overall, neglecting the “correct” topological structure for each scenario did not lead to a marked reduction in power. For example, using only the loops yielded a power of 0.88 (vs. 0.90 for TopKAT [dim 0]) under square vs. CSR. This reflects the subtle topological information available in these images. For example, in images simulated under CSR, there are small loops present among the randomly scattered cells, which is what distinguishes these images from large square masses. In general, however, TopKAT, using an omnibus approach, offered similar or higher power compared with using a single homology group.
TopKAT reveals clinically relevant geometric structures in TNBC
We applied TopKAT to multiplexed ion beam imaging time-of-flight (MIBI-TOF) data obtained from a study of TNBC.8 This study used MIBI-TOF to probe the spatial expression of 36 proteins in breast tumor tissue biopsied from 38 patients with TNBC. The goal of the study was to explore the associations between the cellular structure of the TME and clinical endpoints, including overall survival. The data contain images selected as representative of the cellular structure within each tumor sample. Cells were phenotyped as either immune cells (CD4 T cells, CD8 T cells, CD3 T cells, natural killer cells, B cells, macrophages, dendritic cells, neutrophils, or monocytes) or non-immune cells (tumor cells, epithelial cells, mesenchymal cells, or endothelial cells).
Keren et al. discovered that the biopsies exhibited distinct “structured” TMEs. Biopsies could be categorized as (1) compartmentalized, where immune and tumor cells segregated into distinct regions from each other (Figure 3A); (2) mixed, where immune and tumor cells colocalized (Figure 3B); or (3) cold, where few immune cells were detected within the tumor compartment of the biopsy. These categories further exhibited differential survival outcomes, with compartmentalization associated with improved survival, underscoring the clinical relevance of the cellular structure of the TME. Of the n = 38 samples, 18 exhibited mixed TMEs, 15 exhibited compartmentalization, and 5 were immune cold. These categories were determined based on a “mixing score,” which was computed as the number of immune-tumor cell interactions divided by the number of immune-immune cell interactions. The resulting categories (mixed, compartmentalized, and cold) were provided in the published data. The goals of our analysis were to (1) characterize the global structure of immune cells among the biopsies using PH and (2) relate this structure to overall survival.
Figure 3.
Constructing the persistence diagrams derived from the immune cells for two example mixed and compartmentalized tumor microenvironments from a study of triple-negative breast cancer
Examples of compartmentalized (patient 16) and mixed (patient 2) samples show distinct spatial arrangements of immune cells. Two steps within the Rips filtration applied to these images are shown for ϵ = 25 and ϵ = 75. Finally, persistence diagrams constructed using the immune cell locations in these images are shown at the bottom. These diagrams show the lifespans of connected components and loops formed by immune cells. In the compartmentalized TME, connected components and loops tend to persist longer than in the mixed TME. Loops also tend to be larger in compartmentalized TMEs, as evidenced by their longer lifespans.
In Note S6, we describe how TopKAT significantly differentiates between mixed, compartmentalized, and cold TMEs. Here, we focus on mixed and compartmentalized samples and closely examine the topological structures inherent to immune cells among these biopsies. We illustrate two example samples, one compartmentalized and one mixed, in Figure 3. Comparing these two, the compartmentalized TME contained connected components and loops that persisted longer than in the mixed TME. To see if this trend persisted across compartmentalized and mixed TMEs, we calculated the average maximum lifespan among connected components and among loops for compartmentalized and mixed TMEs. We calculated this by selecting the connected component and loop from each image with the longest lifespan. We then averaged these among the compartmentalized TMEs and among the mixed TMEs. This provides a measure of the average size of the largest connected component and loop within these groups. Indeed, across samples, the average maximum lifespan for connected components in compartmentalized TMEs was slightly larger than in mixed TMEs (179.69 vs. 171.66) and was much larger for loops (215.29 vs. 164.45). This is intuitive: compartmentalized TMEs exhibit large masses of immune cells that persist longer as connected components within the Rips filtration than in mixed TMEs. Compartmentalized TMEs may also show large gaps or loops representing regions where tumor cells resided that prevented immune cells from infiltrating.
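The average maximum lifespan described above can be sketched as follows. The sketch assumes each persistence diagram is a list of finite (birth, death) pairs for a single homology degree, so it would be run once for connected components and once for loops. The function names and the toy values are hypothetical and are not taken from the study data.

```python
def max_lifespan(diagram):
    """Longest lifespan (death - birth) among the features of one diagram."""
    return max(death - birth for birth, death in diagram)


def average_max_lifespan(diagrams_by_group):
    """Average, within each group, of the per-image maximum lifespan."""
    return {
        group: sum(max_lifespan(d) for d in diagrams) / len(diagrams)
        for group, diagrams in diagrams_by_group.items()
    }


# Toy loop diagrams for two images per TME category (illustrative values only).
toy = {
    "compartmentalized": [[(10, 200), (5, 30)], [(20, 240)]],
    "mixed": [[(10, 150)], [(30, 60), (15, 175)]],
}
averages = average_max_lifespan(toy)
```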
We then followed the procedure given in the interpretation section to identify the distance between cells at which mixed and compartmentalized TMEs showed the most distinctions in topological structure. TopKAT yielded the smallest p value at a distance of 73.6 (Figure S2). In Figures 4A and 4B, we illustrate the simplicial complex built based on the immune cells at this distance on the same two example samples. These figures show a close recapitulation of the immune cell structures visualized in Figures 3A and 3B. Then, using the full sample set, we explored how often immune cells were connected at a distance of 73.6. On average, compartmentalized TMEs exhibited a large number of connections within B cells, CD4 T cells, macrophages, and CD8 T cells (Figure 4C). These connections potentially reflect the presence of tertiary lymphoid structures, which arise at sites of consistent inflammation.36 On average, mixed TMEs exhibited connections among these immune cell types but also included more diverse connections among CD3 T cells, regulatory T cells (Tregs), and neutrophils (Figure 4D).
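Counting connections between cell phenotypes at a fixed distance can be sketched as follows. Two cells are treated as connected when their Euclidean distance is at most the chosen threshold, which corresponds to an edge of the simplicial complex at that scale. The input layout (tuples of x, y, and phenotype), the function name, and the toy coordinates are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def count_connections(cells, distance):
    """Count connected cell pairs by (unordered) phenotype pair.

    cells: iterable of (x, y, phenotype) tuples for one image.
    distance: maximum separation at which two cells are connected.
    """
    counts = Counter()
    for (x1, y1, t1), (x2, y2, t2) in combinations(cells, 2):
        if math.hypot(x1 - x2, y1 - y2) <= distance:
            # Sort the phenotype labels so (A, B) and (B, A) share one key.
            counts[tuple(sorted((t1, t2)))] += 1
    return counts


# Toy image with four cells (illustrative coordinates only).
cells = [
    (0, 0, "B cell"),
    (10, 0, "B cell"),
    (50, 0, "CD8 T cell"),
    (500, 500, "macrophage"),
]
connections = count_connections(cells, distance=73.6)
```

Averaging these counts over the images in each TME category would give per-category summaries of the kind shown in Figures 4C and 4D.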
Figure 4.
Exploring the connections among immune cells at a clinically relevant distance from a study of triple-negative breast cancer
We obtained the distance at which mixed and compartmentalized tumor microenvironments (TMEs) are most distinct topologically, which occurs at distance 2ϵ = 73.6.
(A and B) Two example compartmentalized (A) and mixed (B) samples with this distance overlaid are shown.
(C and D) We then illustrate the average number of connections between immune cell phenotypes at this distance among compartmentalized and mixed TMEs, respectively.
Finally, we examined if similarities in the topology of each TME predicted overall survival. Consistent with the original study findings, we observed that mixed, compartmentalized, and cold TMEs corresponded to distinct survival outcomes (p = 0.011). Note that the spatial methods considered in our simulation study (SPOT, SPF, and FunSpace) could not consistently recapitulate these results. For these approaches, we considered a range of radii between 0 and 512. We did not find significant associations between the spatial arrangement of immune cells and survival using these approaches. Only SPOT and FunSpace were able to detect pairwise differences in immune cell structure between mixed and cold TMEs (both methods) and compartmentalized and cold TMEs (FunSpace only).
In Note S7, we describe using TopKAT to analyze imaging mass cytometry data from another study in TNBC to examine how the TME differs between responders and non-responders to immunotherapy treatment.7 In Note S8, we describe using TopKAT to analyze CODEX data from a study of CRC to examine how tertiary lymphoid structures differ between two CRC subtypes.11
Discussion
Spatial profiling with cell-level resolution, including multiplexed spatial proteomics, can reveal the cellular architecture of the TME, which is known to associate with clinical outcomes. There is an absence of valid and robust statistical methods for testing for this association and a lack of consistency among methods across studies. To address this gap, we combined PH and kernel association testing in TopKAT. TopKAT is a global test of whether similarities in the quantity and scale of homologies among the single-cell images align with similarities in a clinical outcome to determine if topological structure among the cells is clinically informative.
TopKAT contributes to both statistical methodology and our ability to derive insights from spatially resolved data with cell-level resolution, particularly multiplexed spatial proteomics. In terms of statistical methodology, TopKAT accommodates statistical inference based on the persistence diagram, a summary statistic often used descriptively,33 and offers a unified framework for hypothesis testing by leveraging a fast and simple kernel. As a result, TopKAT facilitates unique clinical and biological insights from spatially resolved, cell-level data through a number of advantages. First, in contrast to existing approaches, leveraging TDA focuses on how cells form geometric structures, including loops, rather than the relationships between pairs of cells. Second, PH detects these structures across a range of scales, offering a holistic assessment of how cells are organized. These advantages allow us to compare different samples on the basis of the presence, absence, and size of geometric features, revealing which samples are, geometrically speaking, more similar. Third, PH possesses a “smoothing” effect in the sense that it mitigates noise, such as mislabeled cells or technical artifacts, within images of the heterogeneous biopsies. This is due to the advantage of PH, whereby small perturbations to the input data lead to small or negligible changes in the detection of topological features.38 PH thus produces a rich, yet robust, description of how cells are arranged in tissue. Fourth, integration with kernel approaches allows one to utilize these frameworks for rigorous and robust statistical inference across various study designs, clinical outcome types, and analytical objectives.
We illustrated the power and type I error rate of TopKAT on simulated data and found that TopKAT outperforms existing methods for revealing the relationship between cell structures in the image and clinical outcomes. This was especially apparent when the cell architecture contained loops that PH is well suited to capture. TopKAT may be modified to up- or down-weight the contribution of each degree-k homology to suit the hypotheses of the investigator, or investigators may choose to explore only a single homology degree that pertains to their interests. In our simulations, we found that aggregating across connected components and loops offered similar or higher power than considering one dimension alone. This may be particularly beneficial in scenarios where the importance of each homology is unknown. However, if a priori knowledge suggests a certain homology may be more strongly associated with sample-level outcomes, the weighted aggregation of the kernel matrices may be modified to suit. An example scenario where this could arise is in samples known to exhibit regions of necrosis, where immune cells may not be detected.
There are several limitations and directions to expand upon this work. First, TopKAT does not distinguish between cell types in the construction of the filtration. In our two TNBC applications, we used TopKAT to examine the spatial architecture of cells irrespective of phenotype. After the fact, we examined which immune cell types were highly connected within the simplicial complex formed at a clinically relevant distance between cells. Examining these connections may reveal important structures among distinct cell phenotypes. However, the interpretation based on the TopKAT p value is still agnostic to cell type. TopKAT could be run on each cell type separately or on any subset of the cell types, though this would not distinguish the contributions of each cell type in constructing the homologies. An alternative would be to use multiparameter PH, which could account for both cell type and distance between cells in the construction of the persistence diagram.26 We also presented TopKAT based on the construction of a Rips filtration, the use of a total dissimilarity based on the lifespans of homologies, and a Gower’s centered kernel. These choices yielded a fast and intuitive test but could be modified. Different distances or dissimilarities may uncover more pronounced differences among the persistence diagrams, such as the commonly used Wasserstein or bottleneck distances.25 However, we opted for the total dissimilarity metric to compare persistence diagrams because of its computational speed. Additionally, other kernels, such as an exponential kernel, may better capture the relationship between the topological structure of the cells and patient outcomes. The impact of these choices is beyond the scope of this article but warrants further evaluation. Another limitation is that this method does not accommodate multiple regions of interest imaged within the same biopsy. 
This could be addressed by considering an aggregation of the persistence diagrams across images within a biopsy, but this remains to be explored in future work. Another interesting direction is to treat the maximum value of ϵ used in the construction of the persistence diagrams as a tuning parameter and use cross-validation to choose the value that yields the strongest association between persistence diagrams and outcomes. However, this will require careful consideration of how to define the maximum value across images of varying dimensions and how to optimally perform cross-validation across images. Finally, TopKAT is a global test of association between the topological structure of the images and outcomes. One could consider, say, examining if the average lifespan of a particular homology, such as the connected components, associates with patient outcomes and estimate the effect of an increase in the average lifespan on the outcome.
Methods
Notation
Assume we have n two-dimensional multiplexed spatial proteomics images obtained from tissue biopsies collected from n individuals. Each image is represented as a matrix with one row per segmented and phenotyped cell, containing the two-dimensional (x,y) coordinates of that cell; the number of cells, M, may differ between images. In our case, we are interested in examining the arrangement of cells in the TME and use language pertaining to this application. However, this methodology is appropriate for any study involving the spatial arrangement of cells in tissue. Let y: n × 1 represent a vector of patient-level outcomes, which may be continuous, binary, or right-censored survival endpoints. Let X: n × p represent a matrix of p clinical covariates, e.g., sex or age, that we would like to adjust for in our kernel association test.
Our goal is to test for an association between the topological features within each image and clinical outcomes, adjusting for covariates. To achieve this, we proceed in the following steps.
(1) We first summarize the topological information contained in each image using persistence diagrams. We obtain these diagrams using PH and by constructing a Vietoris-Rips filtration.25
(2) We then construct an n × n pairwise distance matrix quantifying the distance between each pair of persistence diagrams. We obtain a distance matrix for each homology group (connected components and loops). To facilitate the use of kernel machine regression, we convert the distance matrices to kernel matrices, which quantify the similarity between persistence diagrams for each homology.
(3) Finally, we use kernel machine regression to relate the similarity in persistence diagrams to clinical outcomes, adjusted for covariates.
We describe these steps in more detail below.
Computing PH
The first step in the TopKAT pipeline is to summarize the topological information in each image using PH. PH is a TDA technique that captures the size of connected components (homologies of degree 0) and loops (homologies of degree 1) within a two-dimensional point cloud.33 To capture these features, PH relies on constructing an evolving graph of nodes and edges where the nodes represent each point in the point cloud (in our case, cells in an image). The graph, known as a simplicial complex, evolves based on a scale parameter ε that is varied from 0 to ∞. As ε changes, connected components and loops will appear (birth) and disappear (death) in the resulting graph, and the difference between the birth and death scales is termed the lifespan of a feature. Geometric features with longer lifespans correspond to larger (i.e., “more persistent”) features of the data. The most persistent connected components and loops are the ones we hope to capture and that will introduce the most information in our subsequent kernel association test.
The process of varying ε to generate a sequence of simplicial complexes is called constructing a filtration. We compute a Vietoris-Rips filtration (also known as a Rips filtration) because it is fast, efficient, and intuitive. Let S represent a two-dimensional spatial image of cells. For ε ≥ 0, the simplicial complex generated within the Rips filtration is defined as
$$V_\epsilon(S) = \{\sigma \subseteq S : d(i,j) \le 2\epsilon \ \text{for all } i, j \in \sigma\}$$ (Equation 1)
In plain terms, the simplicial complex at ε consists of all subsets of vertices, σ, of S in which any two vertices are at most 2ε apart. To construct Vε(S), we first compute the ε-neighborhood graph of S.33 This graph is defined as
$$G_\epsilon(S) = (S, E_\epsilon), \qquad E_\epsilon = \{\{i,j\} \subseteq S : i \ne j,\ d(i,j) \le 2\epsilon\}$$ (Equation 2)
where d(i,j) is the Euclidean distance between the centroids of two cells i and j. Given the ε-neighborhood graph, we then calculate its clique complex, which involves identifying cliques within the graph. These are collections of vertices for which each pair of vertices is connected by an edge.33 Thus, as ε grows, we obtain a sequence of nested subsets, $V_{\epsilon_1}(S) \subseteq V_{\epsilon_2}(S) \subseteq \cdots$ for $\epsilon_1 \le \epsilon_2 \le \cdots$, which allows us to study the evolution of connected components and loops in this space. We recommend allowing ε to increase within each image until all cells are connected to detect topological features with any lifespan. However, users may choose a maximum value of ε that is clinically or biologically meaningful, such as detecting topological features within a certain pre-specified distance.
The values of ε at which each geometric feature is born and dies are recorded and summarized using a persistence diagram.33 Formally, a persistence diagram, Z, is a multiset of points defined as
$$Z = \{(h_j, b_j, d_j) : j = 1, \dots, |Z|\} \cup \Delta$$ (Equation 3)
where hj represents the homology degree of the jth point, bj represents the distance at which the jth feature is born, and dj represents the distance at which the jth feature dies. |Z| counts the total number of connected components and loops detected during filtration, and Δ refers to all the points along the diagonal y = x line.39 We will use the persistence diagram as our summary statistic to capture the geometric information about the spatial distribution of cells in each multiplexed image. Computation of PH is efficiently implemented in the TDAstats R package.40
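To make the birth and death bookkeeping concrete, the following is a minimal Python sketch of the degree-0 part of a Rips filtration only. It is not the TDAstats implementation used in this work. It assumes that births and deaths are recorded on the inter-cell distance scale (2ε), omits the single component that never dies, and does not compute loops, which require a boundary-matrix reduction. The function name `degree0_persistence` and the toy coordinates are illustrative.

```python
import math
from itertools import combinations


def degree0_persistence(points):
    """Return degree-0 persistence points (h, birth, death) for a 2D point cloud.

    Every cell starts as its own connected component at distance 0.
    Two components merge when the shortest edge between them enters the
    filtration, so the merge distance is recorded as a death.
    Assumption: births and deaths are stored as inter-cell distances.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    diagram = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j          # merge the two components
            diagram.append((0, 0.0, dist))   # one component dies at this distance
    return diagram


# Toy image: three nearby cells and one distant cell (hypothetical coordinates).
cells = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
pd0 = degree0_persistence(cells)
for point in pd0:
    print(point)
```

In this toy image, the two short-lived points come from the tight cluster merging, and the one long-lived point reflects the isolated cell joining the cluster, which is the kind of persistent feature the test is meant to pick up.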
Note that the construction of the Rips filtration is independent of cell type. Users may filter images to a single cell type of interest or may include all those belonging to a certain class, such as all T cells or all immune cells. If applied to a single cell type, the persistence diagrams will characterize the topological structure created by that cell type. If applied to multiple cell types, the persistence diagrams will describe the topological structure created by all cells without differentiating among the subtypes. In the case of applying TopKAT to multiple cell types, users may explore the cell-cell relationships more deeply as described in the interpretation section below.
Distance and kernel matrix construction
We would like to use our estimated persistence diagrams, Z=(Z1, …,Zn), as covariates in a test of association between topological structure in our images and clinical outcomes. However, the structure of persistence diagrams makes them challenging to treat as covariates in a standard linear model. To circumvent this issue, we use a kernel machine learning framework and incorporate the persistence diagrams into this framework via the construction of a kernel matrix. To obtain kernel matrix K, which describes the similarity between persistence diagrams, we must first construct a pairwise distance matrix between the samples and convert this to a kernel matrix. Since PH emphasizes uncovering “persistent” degree-0 and degree-1 homologies, we use the following dissimilarity measure, which we term total dissimilarity:
$$D_k(Z_i, Z_j) = \sum_{p=1}^{m_k} \left| l_{k,p}^{(i)} - l_{k,p}^{(j)} \right|$$ (Equation 4)
where $m_k$ is the maximum number of homologies of degree k between Zi and Zj and $l_{k,p}^{(\cdot)} = d_{k,p}^{(\cdot)} - b_{k,p}^{(\cdot)}$ is the lifespan of the pth longest-living homology of degree k in persistence diagram Z·, in which $d_{k,p}^{(\cdot)}$ is the distance at which this feature died and $b_{k,p}^{(\cdot)}$ is the distance at which this feature was born. To compute this distance, the features of degree k are first ordered from highest to lowest lifespan. Any difference in the number of features is filled in with zeros. The sum of deviations between ordered lifespan statistics is computed as given in Equation 4. Dk(Zi,Zj) emphasizes a similarity in (1) quantity and (2) lifespan of degree-k homologies found in persistence diagrams Zi and Zj. Intuitively, Zi and Zj, with a similar number of degree-k features with similar lifespans, will have a comparatively small Dk(Zi,Zj). This yields two distance matrices, one for each homology degree, k = 0,1. These distance matrices, denoted Dk, are converted to kernel matrices using a Gower’s centered kernel:
$$K_k = -\frac{1}{2}\left(I - \frac{\mathbf{1}\mathbf{1}^\top}{n}\right) D_k^{2} \left(I - \frac{\mathbf{1}\mathbf{1}^\top}{n}\right)$$ (Equation 5)
where $D_k^{2}$ denotes the element-wise square of Dk, I: n × n is the identity matrix, and 1 is an n × 1 column vector of 1s. We convert the Dk to kernel matrices to leverage kernel machine regression, which we discuss in the next section.
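The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the TopKAT package code. It assumes that the deviations in the total dissimilarity are absolute differences between ordered lifespans and that persistence diagrams are stored as lists of (degree, birth, death) tuples. The diagrams Z1, Z2, and Z3 are made-up toy values.

```python
def lifespans(diagram, degree):
    """Lifespans (death - birth) of the features of one degree, longest first."""
    return sorted((d - b for h, b, d in diagram if h == degree), reverse=True)


def total_dissimilarity(diag_i, diag_j, degree):
    """Sum of absolute differences between ordered lifespans.

    The shorter list is padded with zeros so both diagrams contribute the
    same number of terms (assumption: deviations are absolute differences).
    """
    li, lj = lifespans(diag_i, degree), lifespans(diag_j, degree)
    m = max(len(li), len(lj))
    li = li + [0.0] * (m - len(li))
    lj = lj + [0.0] * (m - len(lj))
    return sum(abs(a - b) for a, b in zip(li, lj))


def gower_kernel(D):
    """Gower-centered kernel: -1/2 * J * D^2 * J with J = I - 11^T / n.

    D^2 is the element-wise square. Double centering is done by subtracting
    row and column means and adding back the grand mean.
    """
    n = len(D)
    A = [[-0.5 * D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(A[i]) / n for i in range(n)]
    col_mean = [sum(A[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [A[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


# Toy persistence diagrams as (degree, birth, death) tuples (hypothetical values).
Z1 = [(0, 0.0, 1.0), (0, 0.0, 2.0), (1, 1.0, 3.0)]
Z2 = [(0, 0.0, 1.5), (1, 0.5, 1.0), (1, 2.0, 2.5)]
Z3 = [(0, 0.0, 2.0), (0, 0.0, 1.0), (0, 0.0, 0.5)]
diagrams = [Z1, Z2, Z3]

# Degree-0 distance matrix and its kernel matrix.
D0 = [[total_dissimilarity(a, b, 0) for b in diagrams] for a in diagrams]
K0 = gower_kernel(D0)
print(D0)
print(K0)
```

Here Z1 and Z3 share their two longest degree-0 lifespans, so their dissimilarity is small, whereas Z2 has a single connected-component feature and is farther from both.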
Kernel machine regression
We now would like to relate the persistence diagrams, Z, to a clinical outcome, y, via a kernel machine regression model. Here, we focus on a survival outcome, but this framework extends to continuous and binary outcomes, which are discussed in Notes S3 and S4, respectively. For a survival outcome, suppose we observe (Yi, Δi) for each individual i = 1, …, n, where Yi = min(Ti,Ci) is the recorded event time (either the time of the survival event Ti or the censoring time Ci) and Δi = I(Ti ≤ Ci) is an event indicator. To test if Z is associated with survival time, we use a kernel machine Cox proportional hazards model:
$$\lambda(t; X, Z) = \lambda_0(t)\exp\{X\beta + f(Z)\}$$ (Equation 6)
where β represents the effect of known confounders on the log-hazard for survival and f(·) is a smooth, unknown, and centered vector-valued function that is assumed to belong to a space spanned by a positive-definite kernel K(·,·).34 The kernel function, K(·,·), quantifies the similarities between samples Zi and Zj, i ≠ j, which, in our case, are persistence diagrams. We are interested in testing if f(Z) has an effect on the log-hazard for survival. Intuitively, if similarities between persistence diagrams align with similarities in outcomes, we should expect to see a non-zero effect of f(Z) on y.31
Within this kernel machine Cox regression model, the null and alternative hypotheses we wish to test are
$$H_0: f(Z) = 0 \quad \text{versus} \quad H_1: f(Z) \ne 0$$ (Equation 7)
To test H0, we leverage the relationship between kernel machine Cox regression and linear mixed modeling.41 We can rewrite the model given in Equation 6 as
$$\lambda(t; X, Z) = \lambda_0(t)\exp\{X\beta + K_k\alpha\}$$ (Equation 8)
where (β,α) are unknown parameters to be estimated. Our null hypothesis then becomes H0:Kkα = 0. Maximizing the penalized partial likelihood corresponding to Equation 8 is equivalent to estimating a Cox survival model with a random intercept for each sample:
$$\lambda(t; X, Z) = \lambda_0(t)\exp\{X\beta + h\}, \qquad \mathrm{Cov}(h) = \tau K_k$$ (Equation 9)
where h is a vector of random effects with covariance proportional to the kernel matrix for homology degree k, Kk. Therefore, testing H0:Kkα = 0 is the same as testing H0:τ = 0. We can test this null hypothesis using the variance-component score test, which requires only fitting the null model under τ = 0, λ(t;X,Z) = λ0(t)exp(Xβ). This implies that the test is a valid statistical test even if an undesirable kernel is chosen. However, the choice of kernel will impact power.31,42 An exploration of the properties of a desirable kernel is beyond the scope of this work.
The variance-component score test statistic for the kernel machine Cox model based on degree k is
$$Q_k = \hat{M}^\top K_k \hat{M}$$ (Equation 10)
where $\hat{M}$ represents a vector of martingale residuals under H0, where $\hat{M}_i = \Delta_i - \hat{\Lambda}_0(Y_i)\exp(X_i\hat{\beta})$, in which $\hat{\Lambda}_0(t) = -\log \hat{S}_0(t)$ is Breslow’s estimator of the baseline hazard function under the null and $\hat{S}_0(t)$ is the estimator of the baseline survival function. To handle tied survival times, we use Efron’s approximation.43 More information is given in Plantinga et al.42 Under H0, Qk asymptotically follows a mixture of χ2 distributions. The p value can then be computed analytically using the Davies method.44
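As a rough illustration of how the score statistic is assembled, the Python sketch below computes null-model martingale residuals and the quadratic form for the simplest possible case. It makes several simplifying assumptions that differ from the method described above: there are no covariates (so the Breslow estimator reduces to the Nelson-Aalen estimator), ties are handled with the Breslow convention rather than Efron’s approximation, and the p value is obtained by permutation instead of the Davies method. All function names and data are illustrative.

```python
import random


def martingale_residuals(times, events):
    """Null-model martingale residuals, Delta_i - Lambda0_hat(Y_i), no covariates."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    cumulative, cum_hazard = 0.0, {}
    for s in event_times:
        deaths = sum(1 for t, e in zip(times, events) if t == s and e == 1)
        at_risk = sum(1 for t in times if t >= s)
        cumulative += deaths / at_risk       # hazard increment at time s
        cum_hazard[s] = cumulative

    def baseline_cum_hazard(t):
        earlier = [s for s in event_times if s <= t]
        return cum_hazard[earlier[-1]] if earlier else 0.0

    return [e - baseline_cum_hazard(t) for t, e in zip(times, events)]


def score_statistic(resid, K):
    """Quadratic form M^T K M."""
    n = len(resid)
    return sum(resid[i] * K[i][j] * resid[j] for i in range(n) for j in range(n))


def permutation_pvalue(resid, K, n_perm=999, seed=0):
    """Permutation p value (a stand-in for the Davies method, no covariates)."""
    rng = random.Random(seed)
    observed = score_statistic(resid, K)
    order = list(range(len(resid)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        if score_statistic([resid[i] for i in order], K) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: four patients, one censored (hypothetical values).
times = [2.0, 3.0, 5.0, 7.0]
events = [1, 0, 1, 1]
v = [1.5, 0.5, -0.5, -1.5]                  # centered scores defining a toy kernel
K = [[a * b for b in v] for a in v]         # rank-one, positive semi-definite

resid = martingale_residuals(times, events)
Q = score_statistic(resid, K)
p = permutation_pvalue(resid, K)
print(resid, Q, p)
```

Only the null model is fitted, which mirrors why the score test is cheap: the kernel enters solely through the quadratic form.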
In Notes S1 and S2, we discuss a small sample correction and accommodating kernel matrices from both connected components and loops. Both approaches were used in the TNBC application described above. We also discuss adaptations of TopKAT for continuous and binary outcomes in Notes S3 and S4.
Interpretation
PH examines the longevity of topological features in each image throughout the filtration. We consider a stepwise approach to determining the distance between cells, 2ε, within the Rips filtration at which we observe the strongest association between persistence diagrams and outcomes.
We first select the maximum distance at which the death of a topological feature was observed across all n persistence diagrams. We call this distance $2\epsilon_{\max}$. We then consider a sequence of length T of distances between 0 and $2\epsilon_{\max}$. For each $\epsilon_t$, t = 0, …,T ($\epsilon_0 = 0$), we threshold the persistence diagrams to only features that were born and had died during filtration by distance $2\epsilon_t$. We perform the kernel machine regression test described above for each t and store the resulting omnibus p value. At the end, we identify which distance yielded the smallest p value. We denote the selected distance $2\epsilon_{\text{best}}$. $2\epsilon_{\text{best}}$ is then interpreted as the maximum distance between cells at which the lifespans and number of topological features were most strongly associated with the outcome. Note that in this case, we are treating the p value as a statistic and would not recommend interpreting these as p values in the traditional sense.
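The stepwise search can be sketched as a simple loop. The Python code below is a hedged illustration, not the package implementation. It assumes diagrams are lists of (degree, birth, death) tuples on the distance scale, uses an evenly spaced grid, and takes the association test as a user-supplied callable `test_fn` (hypothetical) that maps a list of thresholded diagrams to a p value. The `placeholder_test` used in the example is not a statistical test; it only stands in so the loop can run.

```python
def threshold_diagram(diagram, max_distance):
    """Keep features that had died by max_distance (birth <= death, so born too)."""
    return [(h, b, d) for h, b, d in diagram if d <= max_distance]


def scan_thresholds(diagrams, test_fn, n_steps=10):
    """Evaluate test_fn on diagrams thresholded along an evenly spaced grid.

    Returns the distance with the smallest p value, that p value, and the
    full list of (distance, p value) pairs. The p values are used as a
    selection statistic only.
    """
    max_death = max(d for diagram in diagrams for _, _, d in diagram)
    grid = [max_death * t / n_steps for t in range(n_steps + 1)]
    results = []
    for distance in grid:
        thresholded = [threshold_diagram(diagram, distance) for diagram in diagrams]
        results.append((distance, test_fn(thresholded)))
    best_distance, best_p = min(results, key=lambda pair: pair[1])
    return best_distance, best_p, results


def placeholder_test(thresholded):
    """Placeholder returning a made-up score; replace with a real kernel test."""
    n_features = sum(len(diagram) for diagram in thresholded)
    return abs(n_features - 3) / 10 + 0.01


# Toy diagrams (hypothetical values).
toy_diagrams = [
    [(0, 0.0, 1.0), (0, 0.0, 2.0), (1, 1.0, 4.0)],
    [(0, 0.0, 1.5), (1, 0.5, 3.0)],
]
best_distance, best_p, results = scan_thresholds(toy_diagrams, placeholder_test, n_steps=4)
print(best_distance, best_p)
```

In practice, very small thresholds leave most diagrams empty, so a real test function would need to handle or skip those grid points.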
Once the “best” distance is obtained, we can conduct further post hoc analyses. For example, we can consider which cell types are highly connected within the resulting simplicial complex constructed at the identified distance.
Resource availability
Lead contact
Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Sarah Samorodnitsky (samoros@mskcc.org).
Materials availability
This study did not generate new materials.
Data and code availability
• The data for the TNBC application given in the main manuscript can be found at https://www.angelolab.com/mibi-data. The data for the TNBC application given in Note S7 can be found at https://zenodo.org/records/7990870. The data for the CRC application given in Note S8 can be found at https://data.mendeley.com/datasets/mpjzbtfgfr/1.
• The software to implement TopKAT is available in an R package available at https://sarahsamorodnitsky.github.io/TopKAT/. The version of the R package published with this article is archived on Zenodo at https://doi.org/10.5281/zenodo.17273281.45
Acknowledgments
This work was supported in part by NIH grant U10 CA180819 and The Hope Foundation for Cancer Research.
Author contributions
S.S. conceived and implemented the method. K.C., A.L., W.L., N.Z., Y.-C.C., and M.C.W. provided input on method development and interpretation of the results. All authors contributed to writing and reviewing the manuscript.
Declaration of interests
K.C. is a shareholder in Geneoscopy LLC, Georgiamune, and AME Therapeutics and has received consulting fees from Geneoscopy LLC, PACT Pharma, Tango Therapeutics, Flagship Labs 81 LLC, the Rare Cancer Research Foundation, Noetik, the Jaime Leandro Foundation, Georgiamune, and AME Therapeutics.
Published: January 9, 2026
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.patter.2025.101456.
References
1. Bodenmiller B. Highly multiplexed imaging in the omics era: understanding tissue structures in health and disease. Nat. Methods. 2024;21:2209–2211. doi: 10.1038/s41592-024-02538-6.
2. Lewis S.M., Asselin-Labat M.L., Nguyen Q., Berthelet J., Tan X., Wimmer V.C., Merino D., Rogers K.L., Naik S.H. Spatial omics and multiplexed imaging to explore cancer biology. Nat. Methods. 2021;18:997–1012. doi: 10.1038/s41592-021-01203-6.
3. Chen Y., Liu Y., Han L. Spatial landscape of the tumor immune microenvironment. Trends Cancer. 2023;9:459–460. doi: 10.1016/j.trecan.2023.03.006.
4. Lundberg E., Borner G.H.H. Spatial proteomics: a powerful discovery tool for cell biology. Nat. Rev. Mol. Cell Biol. 2019;20:285–302. doi: 10.1038/s41580-018-0094-y.
5. Quail D.F., Joyce J.A. Microenvironmental regulation of tumor progression and metastasis. Nat. Med. 2013;19:1423–1437. doi: 10.1038/nm.3394.
6. de Visser K.E., Joyce J.A. The evolving tumor microenvironment: From cancer initiation to metastatic outgrowth. Cancer Cell. 2023;41:374–403. doi: 10.1016/j.ccell.2023.02.016.
7. Wang X.Q., Danenberg E., Huang C.S., Egle D., Callari M., Bermejo B., Dugo M., Zamagni C., Thill M., Anton A., et al. Spatial predictors of immunotherapy response in triple-negative breast cancer. Nature. 2023;621:868–876. doi: 10.1038/s41586-023-06498-3.
8. Keren L., Bosse M., Marquez D., Angoshtari R., Jain S., Varma S., Yang S.R., Kurian A., Van Valen D., West R., et al. A structured tumor-immune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Cell. 2018;174:1373–1387.e19. doi: 10.1016/j.cell.2018.08.039.
9. Ali H.R., Chlon L., Pharoah P.D.P., Markowetz F., Caldas C. Patterns of immune infiltration in breast cancer and their clinical implications: a gene-expression-based retrospective study. PLoS Med. 2016;13 doi: 10.1371/journal.pmed.1002194.
10. Danenberg E., Bardwell H., Zanotelli V.R.T., Provenzano E., Chin S.F., Rueda O.M., Green A., Rakha E., Aparicio S., Ellis I.O., et al. Breast tumor microenvironment structures are associated with genomic features and clinical outcome. Nat. Genet. 2022;54:660–669. doi: 10.1038/s41588-022-01041-y.
11. Schürch C.M., Bhate S.S., Barlow G.L., Phillips D.J., Noti L., Zlobec I., Chu P., Black S., Demeter J., McIlwain D.R., et al. Coordinated cellular neighborhoods orchestrate antitumoral immunity at the colorectal cancer invasive front. Cell. 2020;182:1341–1359.e19. doi: 10.1016/j.cell.2020.07.005.
12. Jackson H.W., Fischer J.R., Zanotelli V.R.T., Ali H.R., Mechera R., Soysal S.D., Moch H., Muenst S., Varga Z., Weber W.P., Bodenmiller B. The single-cell pathology landscape of breast cancer. Nature. 2020;578:615–620. doi: 10.1038/s41586-019-1876-x.
13. Karimi E., Yu M.W., Maritan S.M., Perus L.J.M., Rezanejad M., Sorin M., Dankner M., Fallah P., Doré S., Zuo D., et al. Single-cell spatial immune landscapes of primary and metastatic brain tumours. Nature. 2023;614:555–563. doi: 10.1038/s41586-022-05680-3.
14. Damond N., Engler S., Zanotelli V.R.T., Schapiro D., Wasserfall C.H., Kusmartseva I., Nick H.S., Thorel F., Herrera P.L., Atkinson M.A., Bodenmiller B. A map of human type 1 diabetes progression by imaging mass cytometry. Cell Metab. 2019;29:755–768.e5. doi: 10.1016/j.cmet.2018.11.014.
15. Maity S., Huang Y., Kilgore M.D., Thurmon A.N., Vaasjo L.O., Galazo M.J., Xu X., Cao J., Wang X., Ning B., et al. Mapping dynamic molecular changes in hippocampal subregions after traumatic brain injury through spatial proteomics. Clin. Proteomics. 2024;21:32. doi: 10.1186/s12014-024-09485-6.
16. Method of the year 2024: spatial proteomics. Nat. Methods. 2024;21:2195–2196. doi: 10.1038/s41592-024-02565-3.
17. Samorodnitsky S., Campbell K., Ribas A., Wu M.C. A spatial omnibus test (spot) for spatial proteomic data. Bioinformatics. 2024;40 doi: 10.1093/bioinformatics/btae425.
18. Vu T., Seal S., Ghosh T., Ahmadian M., Wrobel J., Ghosh D. Funspace: A functional and spatial analytic approach to cell imaging data using entropy measures. PLoS Comput. Biol. 2023;19 doi: 10.1371/journal.pcbi.1011490.
19. Vu T., Wrobel J., Bitler B.G., Schenk E.L., Jordan K.R., Ghosh D. Spf: a spatial and functional data analytic approach to cell imaging data. PLoS Comput. Biol. 2022;18 doi: 10.1371/journal.pcbi.1009486.
20. Canete N.P., Iyengar S.S., Ormerod J.T., Baharlou H., Harman A.N., Patrick E. spicyr: spatial analysis of in situ cytometry data in r. Bioinformatics. 2022;38:3099–3105. doi: 10.1093/bioinformatics/btac268.
21. Ripley B.D. Modelling spatial patterns. J. Roy. Stat. Soc. B. 1977;39:172–192.
22. Wrobel J., Harris C., Vandekar S. Statistical analysis of multiplex immunofluorescence and immunohistochemistry imaging data. Methods Mol. Biol. 2023;2629:141–168. doi: 10.1007/978-1-0716-2986-4_8.
23. Seal S., Neelon B., Angel P.M., O’Quinn E.C., Hill E., Vu T., Ghosh D., Mehta A.S., Wallace K., Alekseyenko A.V. Spaceanova: Spatial co-occurrence analysis of cell types in multiplex imaging data using point process and functional anova. J. Proteome Res. 2024;23:1131–1143. doi: 10.1021/acs.jproteome.3c00462.
24. Dayao M.T., Trevino A., Kim H., Ruffalo M., D’Angio H.B., Preska R., Duvvuri U., Mayer A.T., Bar-Joseph Z. Deriving spatial features from in situ proteomics imaging to enhance cancer survival analysis. Bioinformatics. 2023;39:i140–i148. doi: 10.1093/bioinformatics/btad245.
25. Edelsbrunner H., Harer J.L. American Mathematical Society; 2022. Computational Topology: An Introduction.
26. Vipond O., Bull J.A., Macklin P.S., Tillmann U., Pugh C.W., Byrne H.M., Harrington H.A. Multiparameter persistent homology landscapes identify immune cell spatial patterns in tumors. Proc. Natl. Acad. Sci. 2021;118 doi: 10.1073/pnas.2102166118.
27. Benjamin K., Bhandari A., Kepple J.D., Qi R., Shang Z., Xing Y., An Y., Zhang N., Hou Y., Crockford T.L., et al. Multiscale topology classifies cells in subcellular spatial transcriptomics. Nature. 2024;630:943–949. doi: 10.1038/s41586-024-07563-1.
28. Aukerman A., Carrière M., Chen C., Gardner K., Rabadán R., Vanguri R. Persistent homology based characterization of the breast cancer immune microenvironment: A feasibility study. J. Comput. Geometry. 2022;12:183–206.
29. Crawford L., Monod A., Chen A.X., Mukherjee S., Rabadán R. Predicting clinical outcomes in glioblastoma: an application of topological and functional data analysis. J. Am. Stat. Assoc. 2020;115:1139–1150.
30. Wu M.C., Lee S., Cai T., Li Y., Boehnke M., Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
31. Zhao N., Chen J., Carroll I.M., Ringel-Kulka T., Epstein M.P., Zhou H., Zhou J.J., Ringel Y., Li H., Wu M.C. Testing in microbiome-profiling studies with mirkat, the microbiome regression-based kernel association test. Am. J. Hum. Genet. 2015;96:797–807. doi: 10.1016/j.ajhg.2015.04.003.
32. Ghosh T., Lui V., Rudra P., Seal S., Vu T., Hsieh E., Ghosh D. The cytokernel user’s guide. dim. 2022;126:8.
33. Otter N., Porter M.A., Tillmann U., Grindrod P., Harrington H.A. A roadmap for the computation of persistent homology. EPJ Data Sci. 2017;6:17–38. doi: 10.1140/epjds/s13688-017-0109-5.
34. Liu D., Lin X., Ghosh D. Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics. 2007;63:1079–1088. doi: 10.1111/j.1541-0420.2007.00799.x.
35. Liu Y., Xie J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J. Am. Stat. Assoc. 2020;115:393–402. doi: 10.1080/01621459.2018.1554485.
36. Schumacher T.N., Thommen D.S. Tertiary lymphoid structures in cancer. Science. 2022;375 doi: 10.1126/science.abf9419.
37. Soupir A., Wilson C., Creed J., Wrobel J., Ospina O., Fridley B. (2023). scSpatialSIM: A Point Pattern Simulator for Spatial Cellular Data. https://github.com/FridleyLab/scSpatialSIM. R package version 0.1.3.3.
38. Cohen-Steiner D., Edelsbrunner H., Harer J. Proceedings of the twenty-first annual symposium on Computational geometry. Association for Computing Machinery; 2005. Stability of Persistence Diagrams; pp. 263–271.
39. Berry E., Chen Y.C., Cisewski-Kehe J., Fasy B.T. Functional summaries of persistence diagrams. J. Appl. Comput. Topol. 2020;4:211–262.
40. Wadhwa R.R., Williamson D.F.K., Dhawan A., Scott J.G. Tdastats: R pipeline for computing persistent homology in topological data analysis. J. Open Source Softw. 2018;3:860. doi: 10.21105/joss.00860.
41. Cai T., Tonini G., Lin X. Kernel machine approach to testing the significance of multiple genetic markers for risk prediction. Biometrics. 2011;67:975–986. doi: 10.1111/j.1541-0420.2010.01544.x.
42. Plantinga A., Zhan X., Zhao N., Chen J., Jenq R.R., Wu M.C. Mirkat-s: a community-level test of association between the microbiota and survival times. Microbiome. 2017;5:17. doi: 10.1186/s40168-017-0239-9.
43. Efron B. The efficiency of cox’s likelihood function for censored data. J. Am. Stat. Assoc. 1977;72:557–565.
44. Davies R.B. The distribution of a linear combination of χ2 random variables. Appl. Stat. 1980;29:323–333.
45. Samorodnitsky S. (2025). sarahsamorodnitsky/topkat: v1.0.0. Zenodo. doi: 10.5281/zenodo.17273281.