Abstract
Crohn’s disease (CD) is a chronic and relapsing inflammatory condition that affects segments of the gastrointestinal tract. CD activity is determined by histological findings, particularly the density of neutrophils observed on hematoxylin and eosin (H&E) stained images. However, understanding the broader morphometry and local cell arrangement beyond cell counting and tissue morphology remains challenging. To address this, we characterize six distinct cell types from H&E images and develop a novel approach for the local spatial signature of each cell. Specifically, we create a 10-cell neighborhood matrix, representing neighboring cell arrangements for each individual cell. Utilizing t-SNE for non-linear spatial projection in scatter-plot and Kernel Density Estimation contour-plot formats, our study examines patterns of differences in the cellular environment associated with the odds ratio of spatial patterns between active CD and control groups. This analysis is based on data collected at two research institutes. The findings reveal heterogeneous nearest-neighbor patterns, signifying distinct tendencies of cell clustering, with a particular focus on the rectum region. These variations underscore the impact of data heterogeneity on cell spatial arrangements in CD patients. Moreover, the spatial distribution disparities between the two research sites highlight the significance of collaborative efforts among healthcare organizations. All research analysis pipeline tools are available at https://github.com/MASILab/cellNN.
Keywords: Cell spatial analysis, Pattern recognition, Crohn’s disease
1. INTRODUCTION
Crohn’s disease (CD) is a complex inflammatory bowel disease (IBD) affecting the gastrointestinal tract, characterized by persistent and recurring bowel inflammation.1 The prevalence of IBD has been on the rise, leading to increased medical expenditures. Notably, the medical cost of CD was estimated to be $3.48 billion per year in 2015, and it is projected to reach $3.72 billion per year by 2025, constituting a significant portion of the overall US national costs.2 The Gut Cell Atlas Crohn’s Disease Consortium is an ambitious initiative supported by The Leona M. and Harry B. Helmsley Charitable Trust, aiming to develop comprehensive cellular reference maps for CD. The primary focus of this initiative is to compare tissues from CD patients with those from healthy controls (https://www.gutcellatlas.helmsleytrust.org/). By mapping different human cell types and analyzing gene and protein expression with respect to anatomical location and CD, this project offers a unique opportunity to advance our understanding of the human gut and its implications in CD.
In tumor research, spatial analysis using H&E-stained images has been extensively explored. For instance, Failmezger et al. revealed key tumor microenvironment features through topological tumor graphs in melanoma specimens,3 while Xu et al. assessed tumor mutational burden and immune infiltrates in bladder cancer patients using H&E and IHC imaging.4 They also developed a method to detect prognostic tumor infiltrating lymphocytes (TIL) density in colorectal carcinoma patients.5 Saltz et al. generated TIL maps from H&E images, correlating them with survival in diverse tumor types.6 However, the application of spatial analysis to IBD remains underexplored.
Determination of CD activity depends on histological findings, particularly the density of neutrophils observed via hematoxylin and eosin (H&E) staining.7 There are many ways to assess the samples from CD patients. For instance, pathologists can understand the cell neighborhood changes by zooming in and out on the biopsies. The invasion of neutrophils is known to be a sign of active inflammation and is a well-known problem.8 Pathologist-assigned disease severity scores for CD biopsies are often given at the slide level, though the disease features that resulted in the scoring might not present homogeneously across the slide. As Figure 1 shows, the high density of neutrophils does not appear in all areas of the tissue on a slide. The primary motivation of this work is to ask whether we can quantify the relationships between cells, beyond the abundance of neutrophils or other cell types.
In this study, we deviate from conventional morphological feature-based CD activity classification. Instead, our focus is on defining a graph-based metric aimed at enabling spatial analysis. Our core hypothesis is that distinct patterns might characterize the relationships among various cell types in the context of CD. Utilizing established tools to identify six specific cell types from H&E images, we introduce an innovative approach to capture the local spatial characteristics of each cell. In brief, we create a 10-cell neighborhood matrix that outlines the arrangement of neighboring cells for each individual cell. Through the utilization of t-SNE for non-linear spatial projection in both scatter-plot and Kernel Density Estimation (KDE) contour-plot formats, our research examines disparities in cellular environments, particularly those linked to the odds ratio of spatial patterns between CD active and control groups. The contribution of this study is three-fold:
- We propose a graph-based signature to represent the local arrangement of each cell.
- We develop a comprehensive visualization workflow to identify the relationship of signature patterns among CD active, CD normal, and healthy control cohorts.
- We present investigations from two research institutes with heterogeneous data acquisition, focusing on the rectum.
2. METHOD
Our objective is to comprehend alterations in the cellular neighborhood. To accomplish this, it is essential to establish biomarkers capable of detecting the nuanced local histological orientation and interrelationships among co-located cells. Here, we investigate a potential avenue for achieving this goal. Initially, we outline a graph-based spatial characterization for each cell. Subsequently, we introduce a visualization technique designed to handle a substantial cell count. Lastly, the quantification method is elucidated. The comprehensive analysis workflow is depicted in Figure 3.
2.1. Data pre-processing
Prior to generating the spatial signature of each cell, we first segment the whole slide image (WSI) using a pre-trained segmentation deep learning model (HoverNet9) from the Colon Nuclei Identification and Counting (CoNIC) Challenge dataset,10 which can identify six cell types: neutrophils, epithelial cells, lymphocytes, plasma cells, eosinophils, and connective tissue. The HoverNet CoNIC pre-trained segmentation model operates only on patches of 256×256 pixels at 20× magnification. We therefore employ the CLAM11 method to remove the background of the WSI and divide the gigapixel image into patches of the required size, segment each patch (at 20×) using the pre-trained model, and then merge all the patches back into the original WSI space to collect the global coordinates of the cells.
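The merging step can be illustrated with a minimal sketch. It assumes each patch is described by the WSI coordinate of its top-left corner together with the patch-local centroids and type labels produced by the segmentation model. The function names `patch_to_global` and `merge_patches`, the tuple layout, and the `scale` parameter are hypothetical and are not taken from the HoverNet or CLAM code bases.

```python
import numpy as np


def patch_to_global(local_xy, patch_origin, scale=1.0):
    """Map centroids from patch-local pixels to WSI-level coordinates.

    local_xy     : (n, 2) array-like of (x, y) centroids inside one patch
    patch_origin : (x0, y0) of the patch's top-left corner in WSI coordinates
    scale        : WSI pixels per patch pixel (assumed 1.0 when both are at 20x)
    """
    local_xy = np.asarray(local_xy, dtype=float).reshape(-1, 2)
    return local_xy * scale + np.asarray(patch_origin, dtype=float)


def merge_patches(patches, scale=1.0):
    """Concatenate (origin, local_xy, cell_types) triples into global arrays."""
    coords, types = [], []
    for origin, local_xy, cell_types in patches:
        coords.append(patch_to_global(local_xy, origin, scale))
        types.append(np.asarray(cell_types, dtype=int))
    if not coords:
        return np.empty((0, 2)), np.empty(0, dtype=int)
    return np.vstack(coords), np.concatenate(types)
```

If patches are read at a different pyramid level than the coordinate space used downstream, `scale` would need to carry that ratio; the value 1.0 here is only a placeholder assumption.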
2.2. Graph-based spatial signature
To build the spatial signature, we query the 11 nearest points for each cell (the cell itself plus its 10 neighbors) and convert the result into a count matrix over the six cell types. Figure 2 depicts two cell samples of the same type but with different local arrangements. To achieve this, we utilize the KD-tree algorithm,12,13 which creates a binary tree partitioning the cell coordinate space into smaller regions for fast searching of nearest neighbor points in multi-dimensional spaces. Each cell is assigned 11 indices, including its own, which leaves 10 neighbors once the cell itself is excluded. To clean up feature noise, we remove edge cases in which a cell has fewer than 10 exclusive neighbors (10-NN).
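A minimal sketch of this signature, using the `KDTree` class from scikit-learn, is given below. The function name `knn_signatures`, the integer encoding of cell types as 0 to 5, and the handling of tissues with fewer than 11 cells are assumptions made for illustration. The sketch also assumes cell centroids are distinct, so that every cell appears in its own query result.

```python
import numpy as np
from sklearn.neighbors import KDTree

N_TYPES = 6  # assumed encoding: one integer label per cell type
K = 10       # neighbors per cell, excluding the cell itself


def knn_signatures(coords, types, k=K, n_types=N_TYPES):
    """Return an (n_cells, n_types) matrix counting each cell's k neighbors by type."""
    coords = np.asarray(coords, dtype=float)
    types = np.asarray(types, dtype=int)
    n = len(coords)
    if n < k + 1:
        # Not enough cells to give anyone k exclusive neighbors (assumed edge case).
        return np.empty((0, n_types), dtype=int)

    tree = KDTree(coords)
    # Query k + 1 points because each cell is returned as its own nearest hit.
    _, idx = tree.query(coords, k=k + 1)

    neighbor_types = types[idx]  # shape (n, k + 1), includes the cell itself
    counts = np.zeros((n, n_types), dtype=int)
    for t in range(n_types):
        counts[:, t] = (neighbor_types == t).sum(axis=1)

    # Remove the cell's own contribution so each row sums to k.
    counts[np.arange(n), types] -= 1
    return counts
```

Each row of the returned matrix is one cell's 10-NN signature; rows can then be grouped by the type of the center cell before visualization.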
2.3. Visualization
We aggregate the count matrix of 10-NN signatures for all cells in the target dataset, comprising CD active and control groups, into a large matrix, considering that each WSI may contain over ten thousand cells. Subsequently, we utilize a standard 2-D t-SNE with the KL divergence as the cost function.14 The t-SNE technique effectively maintains the relative distances between neighboring data points, accentuating local structure over global structure. Initially, when employing t-SNE scatter plots, it becomes straightforward to visualize the distinct exclusive regions for the two categories. However, due to the vast number of data points, numerous overlaps are expected, making it difficult to convey the data point density through the scatter plot. As a result, we opt to transform the t-SNE embedding into KDE contour plots to identify regions of interest with diverse probabilities. This visualization approach establishes a non-linear spatial space encompassing all cell data points from the entire input dataset.
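The embedding and density steps can be sketched as follows, assuming scikit-learn's `TSNE` and SciPy's `gaussian_kde` as stand-ins for whichever implementations the pipeline actually uses. The function names, the perplexity value, the random seed, and the grid resolution are illustrative assumptions rather than settings reported in this paper.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.manifold import TSNE


def embed_signatures(signatures, perplexity=30.0, seed=0):
    """Project (n_cells, 6) signature counts to 2-D with t-SNE."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(np.asarray(signatures, dtype=float))


def kde_grid(embedding, grid_size=100):
    """Evaluate a Gaussian KDE of the embedding on a regular grid."""
    kde = gaussian_kde(embedding.T)  # gaussian_kde expects shape (dims, n_points)
    x_min, y_min = embedding.min(axis=0)
    x_max, y_max = embedding.max(axis=0)
    xs = np.linspace(x_min, x_max, grid_size)
    ys = np.linspace(y_min, y_max, grid_size)
    xx, yy = np.meshgrid(xs, ys)
    density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)
    return xs, ys, density
```

One plausible usage is to embed all cells of one type from both groups together, then call `kde_grid` separately on the rows belonging to each group so that their densities can be drawn as contours in the same t-SNE space, for example with matplotlib's `contour`.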
2.4. Quantification
We investigate regions of interest (ROI) in the t-SNE visualization using bounding boxes (BBox). To comprehend the occurrence probability of a specific 10-nearest neighbor (10-NN) pattern in two different CD activity groups, we calculate the odds ratio. The odds ratio is defined as the fraction of cells from CD activity group 1 divided by the fraction of cells from CD activity group 2 within the BBox.
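One reading of this definition is sketched below: both fractions are taken over the cells that fall inside the BBox, so the ratio reduces to the number of group 1 cells divided by the number of group 2 cells in the box. This interpretation, the function name `bbox_odds`, and the `(x_min, x_max, y_min, y_max)` box layout are assumptions. Under this reading the value is not normalized by the total number of cells in each group.

```python
import numpy as np


def bbox_odds(embedding, is_group1, bbox):
    """Ratio of group-1 to group-2 cell fractions inside a bounding box.

    embedding : (n, 2) array of t-SNE coordinates
    is_group1 : (n,) boolean array, True for group 1 and False for group 2
    bbox      : (x_min, x_max, y_min, y_max) in embedding coordinates
    """
    embedding = np.asarray(embedding, dtype=float)
    is_group1 = np.asarray(is_group1, dtype=bool)
    x_min, x_max, y_min, y_max = bbox

    inside = (
        (embedding[:, 0] >= x_min) & (embedding[:, 0] <= x_max)
        & (embedding[:, 1] >= y_min) & (embedding[:, 1] <= y_max)
    )
    n1 = int(np.sum(inside & is_group1))
    n2 = int(np.sum(inside & ~is_group1))
    total = n1 + n2

    if total == 0:
        return float("nan")  # empty box: the ratio is undefined
    if n2 == 0:
        return float("inf")  # box contains only group-1 cells
    return (n1 / total) / (n2 / total)
```

A value above 1 would indicate that the 10-NN patterns captured by the box occur more often among group 1 cells, and a value below 1 the opposite.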
3. EXPERIMENTS AND RESULTS
3.1. Dataset
CD can manifest anywhere in the gastrointestinal tract. In our study, we applied our proposed workflow to two datasets acquired from different institutions, with a specific focus on the anatomical region of the rectum. The first dataset was obtained from the Emory University School of Medicine (Emory), also part of the Gut Cell Atlas Crohn’s Disease Consortium, and includes 8 biopsies from children. Among these, 4 biopsies are from a healthy control group, while the remaining 4 biopsies are classified as CD active.
The second dataset comprises 143 biopsies stained with H&E, obtained in a deidentified form from Vanderbilt University Medical Center (VUMC) under Institutional Review Board (IRB) approval, specifically Vanderbilt IRB #191738 and #191777.15 All biopsies were collected from adult CD patients and scored by a single pathologist, resulting in 97 biopsies classified as normal and 46 biopsies marked as active, categorized into subcategories of mild, moderate, and severe. It is worth noting that the VUMC dataset lacks a healthy control group. Thus, as a “normal” comparator group, the term “CD normal” refers to patients diagnosed with CD whose collected tissues were normal on pathologic review, showing no acute or chronic changes on pathology due to medical therapies.
3.2. Experiment design
For each cell type category, which includes neutrophils, epithelial cells, lymphocytes, plasma cells, eosinophils, and connective tissue, we conduct t-SNE visualization and compute the odds ratios for each research institute separately. Table 1 presents a summary of the data points for each site and the six cell type categories. To explore the potential ROIs, we investigate them in both the t-SNE scatter plots, where we identify exclusive regions using BBox, and the t-SNE KDE contour plots, where our attention is on areas with relatively high probabilities. All experiments were executed on a workstation with 96 CPU cores, 250 GiB of RAM, and an NVIDIA RTX A6000 GPU.
Table 1: Summary of the data points for each site and the six cell type categories.
Count | Institute 1: VUMC, CD normal | Institute 1: VUMC, CD active | Institute 2: Emory, Healthy control | Institute 2: Emory, CD active
---|---|---|---|---
Sample biopsies | 97 | 46 | 4 | 4
Neutrophils (neu) | 16,424 | 33,103 | 384 | 1,022
Epithelial cell (epi) | 5,528,718 | 2,409,172 | 50,479 | 209,950
Lymphocytes (lym) | 4,037,761 | 2,946,595 | 37,326 | 379,809
Plasma cell (pla) | 738,568 | 607,164 | 9,651 | 21,486
Eosinophils (eon) | 50,173 | 61,557 | 1,148 | 1,437
Connective tissue (con) | 1,406,897 | 887,472 | 17,838 | 60,172
3.3. Results
We provide the outcomes of spatial pattern analysis pertaining to individual cell types across the two institutes as follows.
Neutrophils (Figure 4).
The VUMC dataset reveals that CD active tissues tend to exhibit a higher count of 10-NN shapes containing neutrophils and lymphocytes, as well as plasma, with a moderate amount of connective tissue. Moreover, it indicates that when lymphocytes dominate the surroundings without epithelial cells, CD active occurrences surpass those in CD normal tissue. CD normal tissues, on the other hand, tend to have a greater involvement of epithelial cells. In the Emory dataset, it is evident that lymphocytes play a crucial role in the 10-NN interaction with neutrophils. Interestingly, even when both lymphocytes and epithelial cells are dominant, the prevalence remains higher in CD active cases. Furthermore, the dominance of lymphocytes and plasma is associated with a high occurrence rate of healthy controls.
Eosinophils (Figure 5).
In the VUMC dataset, there is a notable significance of lymphocytes and plasma in CD active tissues, whereas eosinophils in CD normal tissues tend to exhibit a higher tendency towards involvement with epithelial and plasma components. However, the Emory dataset reveals a contrasting pattern.
Connective (Figure 6).
In the VUMC dataset, shape patterns between CD active and CD normal tissues are remarkably similar. CD active tissues show moderate elevation in lymphocyte and connective tissue presence, while CD normal tissues display increased involvement of epithelial elements. Conversely, the Emory dataset presents greater diversity. It mirrors the VUMC CD active pattern; moreover, dominant connective tissue aligns with healthy controls, and higher plasma presence associates with the control group.
Plasma (Figure 7).
No distinct pattern is identified that is specific to CD active tissues. In the case of CD normal tissues, there is a tendency for an even distribution of involvement among epithelial, lymphocyte, plasma, and connective tissue. This pattern is similarly observed in the Emory dataset. However, when the surroundings solely consist of lymphocytes and plasma, this combination appears more frequently in CD active cases.
Lymphocytes (Figure 8).
In the VUMC dataset, no notable pattern difference in lymphocyte distribution between CD active and CD normal tissues is discerned, where lymphocytes dominate the 10-NN arrangement. This similarity in pattern is also observed in the Emory dataset for CD active cases. However, within the Emory dataset, healthy control cases exhibit a more varied pattern, with the additional presence of plasma and connective tissue in the local environment, alongside lymphocytes.
Epithelial (Figure 9).
Both datasets from the different institutions consistently indicate that epithelial cells predominantly interact with other epithelial cells, regardless of whether the tissue is CD active, CD normal, or from a healthy control.
4. DISCUSSION AND CONCLUSION
In this paper, we explore a broader morphometric aspect of CD, the local arrangement of cells, shifting the focus away from neutrophil counting alone. Our proposed approach introduces a graph-based signature that is customized to portray a distinct local arrangement for each cell. Additionally, we have designed a comprehensive visualization workflow that illuminates the interconnections among signature patterns within both the CD active patient and CD normal cohorts among adults, as well as between the CD active patient cohort and healthy controls among children. Our study encompasses investigations carried out across two distinct research institutes, centered around the rectum region. The diversified conclusions stemming from shape patterns, extending beyond neutrophils, can be attributed to the diverse array of data acquisition methods. For instance, the VUMC dataset consists of adult samples and lacks healthy patient tissue, so its control group comprises CD biopsies classified as normal. In contrast, the Emory dataset is sourced from pediatric samples, introducing an additional layer of variation in the data collection process.
In conclusion, our findings indicate that cells do not consistently co-locate in the same manner between CD active and CD normal cohorts in the adult rectum dataset, or between CD active and healthy control cohorts in the pediatric rectum dataset. To delve deeper into these distinct shape patterns, we can leverage RNA-seq to identify cell types, unveil gene expression patterns, facilitate biomarker discovery, enable spatial mapping, and uncover disease mechanisms. Moreover, integrating morphological spatial features, such as considering actual cell distances in the neighborhood of the spatial signature, could potentially enhance our analysis. The observed disparities in spatial distributions across the two research institutes underscore the significance of collaborative endeavors among healthcare institutions.
5. ACKNOWLEDGMENTS
This research was supported by The Leona M. and Harry B. Helmsley Charitable Trust grants G-1903–03793 and G-2103–05128, NSF CAREER 1452485, NSF 2040462, and in part using the resources of the Advanced Computing Center for Research and Education (ACCRE) at Vanderbilt University, Nashville, TN. The Vanderbilt Institute for Clinical and Translational Research (VICTR) is funded by the National Center for Advancing Translational Sciences (NCATS) Clinical Translational Science Award (CTSA) Program, Award Number 5UL1TR002243–03. The National Institute of Diabetes and Digestive and Kidney Diseases, the Department of Veterans Affairs I01BX004366, I01CX002171, and I01CX002573, the Vanderbilt University Medical Center institutional funding and Patient-Centered Outcomes Research Institute (PCORI; contract CDRN-1306–04869), NIH grants T32GM007347, R01DK135597, and R01DK103831 also support this work. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We extend gratitude to NVIDIA for their support by means of the NVIDIA hardware grant. ChatGPT was utilized for proofreading and grammar checking, and the results were validated by human review to ensure accuracy of facts and intent.
REFERENCES
- [1] Baumgart DC and Sandborn WJ, “Crohn’s disease,” The Lancet 380(9853), 1590–1605 (2012).
- [2] Hamdeh S, Aziz M, Altayar O, Olyaee M, Murad MH, and Hanauer SB, “Early vs late use of anti-tnfa therapy in adult patients with crohn disease: a systematic review and meta-analysis,” Inflammatory bowel diseases 26(12), 1808–1818 (2020).
- [3] Failmezger H, Muralidhar S, Rullan A, de Andrea CE, Sahai E, and Yuan Y, “Topological tumor graphs: a graph-based spatial model to infer stromal recruitment for immunosuppression in melanoma histology,” Cancer research 80(5), 1199–1209 (2020).
- [4] Xu H, Cong F, and Hwang TH, “Machine learning and artificial intelligence–driven spatial analysis of the tumor immune microenvironment in pathology slides,” European Urology Focus 7(4), 706–709 (2021).
- [5] Xu H, Cha YJ, Clemenceau JR, Choi J, Lee SH, Kang J, and Hwang TH, “Spatial analysis of tumor-infiltrating lymphocytes in histological sections using deep learning techniques predicts survival in colorectal carcinoma,” The Journal of Pathology: Clinical Research 8(4), 327–339 (2022).
- [6] Saltz J, Gupta R, Hou L, Kurc T, Singh P, Nguyen V, Samaras D, Shroyer KR, Zhao T, Batiste R, et al., “Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images,” Cell reports 23(1), 181–193 (2018).
- [7] Pai RK, Lauwers GY, and Pai RK, “Measuring histologic activity in inflammatory bowel disease: why and how,” Advances in anatomic pathology 29(1), 37–47 (2022).
- [8] Gao S-Q, Huang L-D, Dai R-J, Chen D-D, Hu W-J, and Shan Y-F, “Neutrophil-lymphocyte ratio: a controversial marker in predicting crohn’s disease severity,” International journal of clinical and experimental pathology 8(11), 14779 (2015).
- [9] Graham S, Vu QD, Raza SEA, Azam A, Tsang YW, Kwak JT, and Rajpoot N, “Hover-net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images,” Medical image analysis 58, 101563 (2019).
- [10] Graham S, Jahanifar M, Vu QD, Hadjigeorghiou G, Leech T, Snead D, Raza SEA, Minhas F, and Rajpoot N, “Conic: Colon nuclei identification and counting challenge 2022,” arXiv preprint arXiv:2111.14485 (2021).
- [11] Lu MY, Williamson DF, Chen TY, Chen RJ, Barbieri M, and Mahmood F, “Data-efficient and weakly supervised computational pathology on whole-slide images,” Nature Biomedical Engineering 5(6), 555–570 (2021).
- [12] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al., “Scikit-learn: Machine learning in python,” the Journal of machine Learning research 12, 2825–2830 (2011).
- [13] De Berg M, [Computational geometry: algorithms and applications], Springer Science & Business Media (2000).
- [14] Van der Maaten L and Hinton G, “Visualizing data using t-sne,” Journal of machine learning research 9(11) (2008).
- [15] Bao S, Chiron S, Tang Y, Heiser CN, Southard-Smith AN, Lee HH, Ramirez MA, Huo Y, Washington MK, Scoville EA, et al., “A cross-platform informatics system for the gut cell atlas: integrating clinical, anatomical and histological data,” in [Medical Imaging 2021: Imaging Informatics for Healthcare, Research, and Applications], 11601, 8–15, SPIE (2021).