Abstract
Tumors are complex assemblies of cellular and acellular structures patterned on spatial scales from microns to centimeters. Study of these assemblies has advanced dramatically with the introduction of methods for highly multiplexed tissue imaging. These reveal the intensities and spatial distributions of 20-100 proteins in 10³–10⁷ cells per specimen in a preserved tissue microenvironment. Despite extensive work on extracting single-cell image data, all tissue images are afflicted by artifacts (e.g., folds, debris, antibody aggregates, optical effects, image processing errors) that arise from imperfections in specimen preparation, data acquisition, image assembly, and feature extraction. We show that artifacts dramatically impact single-cell data analysis, in extreme cases preventing meaningful biological interpretation. We describe an interactive quality control software tool, CyLinter, that identifies and removes data associated with imaging artifacts. CyLinter greatly improves single-cell analysis, especially for archival specimens sectioned many years prior to data collection, including those from clinical trials.
Keywords: CyLinter, multiplex image analysis, quality control (QC), single-cell
INTRODUCTION
Normal and tumor tissues are complex assemblies of many cell types whose proportions and properties are controlled by cell-intrinsic molecular programs and interactions among components of the tumor microenvironment (TME). For example, the initiation and progression of diseases such as cancer depend on competition between immunoediting by tissue-resident or circulating immune cells and immunosuppression by tumor cells. The relatively recent development of highly multiplexed tissue imaging methods (e.g., MxIF, CyCIF, CODEX, 4i, mIHC, MIBI, IBEX, and IMC)1-7 has made it possible to collect single-cell data on 20-100 proteins and other biomolecules in preserved 2D and 3D tissue microenvironments4,8-11. When these images are segmented and staining intensities are quantified, it is possible to generate single-cell data on cell types, their functional states, and their spatial interactions. Such data are powerful complements to those obtained from dissociative methods such as scRNA-seq12-14. Imaging approaches compatible with the formaldehyde-fixed, paraffin-embedded (FFPE) specimens universally acquired for clinical diagnosis are particularly powerful because they can tap into large archives of human biopsy and resection specimens15,16 and also assist in the study of mouse models of disease17.
Although machine learning methods operating at the pixel level can extract diagnostic features from histology images, particularly images of specimens stained with Hematoxylin and Eosin (H&E)—a common stain in clinical pathology18—most high-plex imaging studies aim to collect single-cell data that can be interpreted mechanistically. This requires segmenting images to identify individual cells and their features. The resulting single-cell data are recorded in “spatial feature tables”, which are analogous to count tables in scRNA-seq19. In their simplest form, spatial feature tables generated by pipelines such as MCMICRO (an automated multiplex image assembly and feature extraction pipeline)19 contain the X,Y coordinates of cells (commonly the centroids of the nuclei) and their integrated signal intensities20. Cell types (e.g., cytotoxic T cells immunoreactive to CD45, CD3 and CD8 antibodies) and their spatial locations are then inferred from these tables and spatial analysis performed to identify recurrent short- and long-range interactions significantly associated with an independent variable such as drug response, disease progression, or genetic perturbation.
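As a rough illustration of how a cell type can be inferred from a spatial feature table, the sketch below gates cells on marker positivity using only the Python standard library. The column names, intensity values, and cutoffs are hypothetical and chosen for illustration; they are not the MCMICRO output schema or the gating strategy used in this study.

```python
import csv
import io

# Hypothetical spatial feature table: one row per segmented cell.
# Column names and intensity values are illustrative assumptions.
FEATURE_TABLE_CSV = """CellID,X_centroid,Y_centroid,CD45,CD3,CD8
1,10.5,20.0,900,850,700
2,55.2,18.3,880,40,30
3,70.1,64.9,25,30,20
"""

# Hypothetical per-marker positivity cutoffs (arbitrary intensity units).
GATES = {"CD45": 500.0, "CD3": 500.0, "CD8": 500.0}


def load_feature_table(text):
    """Parse CSV text into a list of per-cell dicts with numeric values."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({key: float(value) for key, value in record.items()})
    return rows


def call_cell_type(cell, gates):
    """Assign a coarse label from marker positivity (simple threshold gating)."""
    positive = {marker for marker, cutoff in gates.items() if cell[marker] >= cutoff}
    if {"CD45", "CD3", "CD8"} <= positive:
        return "cytotoxic T cell"
    if "CD45" in positive:
        return "other immune cell"
    return "non-immune cell"


cells = load_feature_table(FEATURE_TABLE_CSV)
labels = {int(cell["CellID"]): call_cell_type(cell, GATES) for cell in cells}
```

Because each label depends directly on per-cell intensities, any artifact that inflates or suppresses a channel in a region of tissue will shift cells across these gates.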
High-plex imaging of human cohorts has been performed using both tissue microarrays (TMAs), which comprise 0.3 to 1.5 mm diameter “cores” from dozens to hundreds of clinical specimens arrayed on a slide, and whole-slide imaging of single specimens up to 4 cm². The latter is required for clinical research and diagnosis both to achieve sufficient statistical power21 and as an FDA requirement22. Analyzing imaging data from TMAs and whole-slide images requires specialized image processing algorithms (ideally organized into computational pipelines)23 because datasets can contain as many as 10⁹ cells. In this paper, we show that accurate processing of tissue images is complicated by the presence of imaging artifacts that contribute significant noise to single-cell data, confounding many types of image-derived, single-cell analysis. Tissue folds, slide debris (e.g., lint), and staining artifacts are commonly observed in histological specimens, especially those stored on glass slides for an extended period (e.g., archival samples stored for several years). Unfortunately, this is a common situation in the setting of clinical trial specimens. In our study, this scenario is represented by 25 slides from the TOPACIO clinical trial of Niraparib in Combination with Pembrolizumab in Patients with Triple-negative Breast Cancer or Ovarian Cancer (NCT02657889)24, which was completed in 2021. We demonstrate the impact of artifacts on data analysis by acquiring data from TOPACIO tissue specimens and by re-analyzing high-plex datasets from several recently published studies. We then develop an approach to removing single-cell data affected by microscopy artifacts using an interactive software tool (CyLinter, code and tutorial at https://labsyspharm.github.io/cylinter/) that is integrated into the Python-based Napari image viewer25. Finally, we show that CyLinter can salvage otherwise uninterpretable multiplex imaging data, for example from the TOPACIO trial.
Our findings suggest that artifact removal should become a standard component of image processing pipelines used for spatial profiling of tissues.
RESULTS
Identifying recurrent image artifacts in multiplex IF images
To categorize imperfections and image artifacts commonly encountered in high-plex images of tissue, we examined five datasets collected using three different imaging methods: (1) 20-plex CyCIF26 images of 25 triple-negative breast cancer (TNBC) specimens, collected from patients in the TOPACIO clinical trial24; (2) a 22-plex CyCIF image of a colorectal cancer (CRC) resection21; (3) a 21-plex TMA dataset23 comprising 123 healthy and cancerous tissue cores; (4) 16-plex CODEX27 images of two sections from a single specimen of head and neck squamous cell carcinoma (HNSCC); and (5) a 19-plex mIHC28 image of normal human tonsil23 (Extended data Fig. 1a-e and Supplementary Table 1). CyCIF images were collected at 20x magnification (0.65 μm/pixel) and processed using the MCMICRO29 image analysis pipeline to yield stitched and registered multi-tile image files and their associated single-cell spatial feature tables. Single-cell data were visualized as UMAP embeddings following clustering with HDBSCAN, an algorithm for hierarchical density-based clustering30. Images were also inspected by experienced microscopists and board-certified histopathologists to identify imaging artifacts.
All specimens comprised tissue sections cut at 5 μm thickness and mounted on slides in the standard manner. This involves sectioning FFPE blocks with a microtome and floating sections on water prior to mounting on glass slides. Even in the hands of skilled histologists, this process can introduce folds in the tissue. We identified multiple instances of tissue folds in large and small specimens (such as TMA cores and core biopsies; Fig. 1a and Extended data Fig. 2a). Moreover, we found that cells within tissue folds corresponded to discrete clusters in UMAP feature space due to higher-than-average signal intensities across imaging channels relative to unaffected regions of tissue (Fig. 1a and Extended data Fig. 2b-c). Slide debris in the shape of lint fibers and hair was also common (Extended data Fig. 2d), as were bright antibody aggregates exhibiting non-specific staining patterns in the tissue that formed discrete clusters in UMAP space (Fig. 1b). Despite having relatively low numbers of segmented cells, regions of necrotic tissue also exhibited high levels of background antibody labeling (Extended data Fig. 2e). Some samples also contained air bubbles that were likely introduced when coverslips were overlaid on specimens prior to imaging (asterisks; Extended data Fig. 2f). Artifacts such as tissue folds and air bubbles can be reduced, but not completely eliminated, by skilled experimentalists.
Additional artifacts were incurred at the time of image acquisition. These included out-of-focus image tiles (in many cases due to sections not lying completely flat on the slide; Extended data Fig. 2g), fluctuations in background intensity between image tiles (Extended data Fig. 2h), and miscellaneous image aberrations that significantly increased signal intensities over image background and led to the formation of discrete clusters in UMAP space (Fig. 1c, Extended data Fig. 2i). We also observed errors in image tile alignment (Extended data Fig. 2j) and image registration (Extended data Fig. 2k), which represent image processing steps that are critical for deriving precise single-cell data. These stitching and registration errors can sometimes be solved computationally, although poor overall tissue quality and nuclear stain over-saturation can limit the accuracy of the recovered data.
Some artifacts are uniquely associated with cyclic imaging methods such as CyCIF26,31, CODEX27, and mIHC28 that generate high-plex images through repeated rounds of lower-plex imaging followed by fluorophore dissociation or inactivation. For instance, tissue movement (Fig. 1d) and progressive deterioration (Fig. 1e) over imaging cycles cause cells present in early rounds of imaging to be lost in later cycles. These cells appear negative for all subsequent markers, confounding cell type assignment and leading to artifactual clusters in feature space (Fig. 1f). The factors determining the extent of tissue loss from specimen to specimen remain incompletely understood, but we have found that tissue damage can occur from dewaxing or antigen retrieval32 and that tissue sections with relatively low surface area (e.g., the fine-needle biopsies in TOPACIO samples 70, 89, 95, 96) or low cellularity (e.g., adipose tissue) are especially prone to tissue movement and cell loss.
In some cases, the origins of a given artifact were unknown, but based on our observations and published histology studies33,34, many likely arise from a combination of 1) pre-analytical variables (generally defined as variables arising prior to staining a specimen), 2) unwanted fluorescent objects introduced during staining, imaging, and washing steps (e.g., lint and antibody aggregates), 3) errors in data acquisition, and 4) the intrinsic properties of the tissue itself. Overall, we found that specimens from the TOPACIO dataset were the most severely affected by these artifacts, whereas the CRC specimen, from Lin et al.,21 which had been freshly-sectioned and carefully processed, was least affected. Only one slide was available from each TOPACIO patient due to high demand for specimens collected during the course of the clinical trial, making repeat imaging impossible. Thus, it was particularly important to correct imaging artifacts in the TOPACIO specimens.
Microscopy artifacts obscure the analysis and interpretation of tissue-derived, single-cell data
Clustering with HDBSCAN yielded 22 clusters for the CRC dataset (~9.8 × 10⁵ cells total) with 0.7% of cells remaining unclustered due to ambiguous features (Fig. 2a). Silhouette analysis35 showed that four clusters (6, 15, 17, and 21) remained under-clustered (i.e., contained cells with negative silhouette scores) despite parameter tuning (Fig. 2b). Agglomerative hierarchical clustering based on mean marker intensities revealed four meta-clusters (Fig. 2c) that, upon initial inspection, appeared to correspond to tumor (meta-clusters A, B), stromal (C), and immune cells (D). However, multiple clusters contained cells with unexpected marker combinations. For example, cluster 6 contained cells from both immune and stromal lineages; inspection of the original image confirmed a mix of B cells, T cells, and stromal cells in this cluster (Fig. 2d). The formation of clusters 9 and 11 appeared to be caused by bright antibody aggregates in the desmin (Fig. 2e) and vimentin (Extended data Fig. 3a) channels, respectively, whereas contaminating lint fibers led to the formation of cluster 12 (Fig. 2f). Cell detachment was evident in cluster 14 (Fig. 2g), and cluster 10 comprised a domain of vimentin-positive tissue of unknown origin (Extended data Fig. 3b). Three additional artifactual clusters (2, 8, and 19; Fig. 2h) were caused by a region of tissue that was apparently not exposed to anti-CD3 and anti-CD45RO antibodies in imaging cycle 3. We reasoned that this artifact was likely due to human error during the performance of a complex 3D imaging study21.
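The silhouette criterion used above can be made concrete with a small sketch. For each cell, the score compares the mean distance to other members of its own cluster (a) with the mean distance to the nearest other cluster (b) as (b - a) / max(a, b), so a negative score marks a cell that sits closer to another cluster than to its own. The function below is a minimal standard-library illustration on toy 2D points, assuming Euclidean distance and at least two clusters; it is not the implementation used for the analysis in this paper.

```python
import math


def silhouette_scores(points, labels):
    """Return one silhouette score per point (Euclidean distance).

    Assumes at least two distinct cluster labels are present.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return scores


# Toy data: two compact groups, plus one point that lies inside the first
# group but carries the label of the second.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (0.5, 0.5)]
labels = [0, 0, 0, 1, 1, 1, 1]
scores = silhouette_scores(points, labels)
```

In this toy case only the last point receives a negative score, which is the per-cell signature of the under-clustering described for CRC clusters 6, 15, 17, and 21.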
To systematically inspect the individual cells comprising CRC clusters, we extracted and organized into image galleries 20 x 20 μm (30 x 30 pixel) image patches of randomly selected cells from each cluster (Online Supplementary Fig. 1, https://www.synapse.org/#!Synapse:syn24193163/files). To make the galleries easier to interpret, we limited the channels displayed to the three most highly expressed protein markers per cluster (based on a cluster-normalized heatmap; Fig. 2c). Clusters not overtly affected by artifacts (e.g., 0, 1, 3, 7, and 16) overwhelmingly contained cells with a consistent morphology and staining pattern. For example, CRC cluster 0 comprised a phenotypically homogenous group of keratinocytes (Fig. 2i), while CRC cluster 1 represented normal epithelial crypt cells with high levels of E-Cadherin at intercellular junctions (Fig. 2j). Some clusters contained a single type of cell, but with remarkably non-uniform signal intensities. For example, cluster 3 cells expressed a combination of markers consistent with memory helper T cells (i.e., CD45, CD4, and CD45RO) but with high variation among replicate thumbnails (Fig. 2k,l). These cells were drawn from multiple tissue regions, not a single region affected by artifacts (Extended data Fig. 3c), and manual adjustment of image contrast on a per-channel and per-cell basis made the cells appear more uniform (Fig. 2m and Extended data Fig. 3d). As expected for this T cell subset, CD45, CD4, and CD45RO staining was significantly correlated in individual cells (R=0.5 to 0.67; Extended data Fig. 3e), explaining the tight cluster in the UMAP embedding, but with sufficient variation to make multi-channel images of the same cell type look quite different. We believe that this reflects natural biological variation rather than simply dataset noise.
The 20-plex TOPACIO dataset gave rise to 492 HDBSCAN clusters (among a subset of ~3.0 × 10⁶ cells drawn from the ~1.9 × 10⁷ total segmented nuclei), with 875,204 cells (29%) remaining unclustered and exhibiting no discernable spatial pattern (Fig. 3a and Extended data Fig. 4a). Most clusters were associated with positive silhouette scores, indicating a good fit (Extended data Fig. 4b). While some clusters arose from cells in a single specimen, the majority (441/492) contained cells from more than half of the 25 TOPACIO samples (Extended data Fig. 4c). This included many small clusters containing fewer than 3,000 cells (392/492, Extended data Fig. 4d). Agglomerative hierarchical clustering based on mean marker intensities revealed six meta-clusters (Fig. 3b). However, the heatmap exhibited an unexpected pattern of dichotomous marker expression with very bright signals for some markers and very dim signals for others. The exception was meta-cluster C, comprising the majority of cells (~1.7 × 10⁶), which exhibited bright signals across all channels (Fig. 3b) and was located towards the center of the UMAP embedding (Extended data Fig. 4e). On further inspection, we found that a significant image alignment problem at the bottom of patient sample 55 gave rise to the single cluster (15) comprising meta-cluster A (Extended data Fig. 4f). Meta-clusters B, D, E, and F were found to be caused by the presence of cells with channel intensities at or near zero resulting from image background subtraction (see Supplementary Note 1). Overall, the same type of analysis that generated interpretable data for the CRC specimen, albeit with some artifacts, yielded nearly unintelligible results with the TOPACIO dataset.
Visual inspection of the 156,300 individual channel image tiles comprising the TOPACIO dataset in down-sampled images revealed that ~5,487 tiles (3.5%) were affected by antibody aggregates, illumination aberrations, or slide debris. FOXP3 was the most affected channel, with artifacts present in >30% of tiles across all samples (Fig. 3c), likely due to insufficient antibody washing prior to imaging. Artifacts tended to be less abundant in gross tissue resections compared to fine-needle and punch-needle biopsies, but were uncorrelated with patient response to therapy, as expected (Fig. 3d). Image patch galleries drawn at random from 48 of the 492 TOPACIO clusters also revealed numerous tissue and imaging artifacts including bright fluorescent signals, over-saturated nuclear counterstain, poor segmentation, and low fluorescent signals (Fig. 3e-g and Online Supplementary Fig. 2). Thus, the number of artifacts in the TOPACIO specimens was substantially higher than in the CRC specimen and we speculated that this might explain the uninterpretability of much of the data.
Identifying and removing noisy single-cell data with CyLinter
To remove imaging artifacts from tissue images we developed the CyLinter plugin for the Napari25 multi-channel image viewer. The tool consists of a set of Python-based QC modules that process single-cell data for identification and removal of image artifacts using computer-assisted human review (Fig. 4a) and can be incorporated into the MCMICRO image processing pipeline29. CyLinter takes four files as input for each specimen: 1) a stitched and registered multiplex image (TIFF/OME-TIFF format), 2) a single-channel binary image showing the boundaries between segmented cells, 3) a cell segmentation mask generated by MCMICRO or a stand-alone segmentation algorithm, and 4) a spatial feature table (CSV format)20 comprising the location and computed signal intensities for each segmented cell within an image derived from the segmentation mask (Extended data Fig. 5a-d, respectively). With a dataset comprising multiple images and spatial feature tables, CyLinter aggregates single-cell data into one Pandas (Python) dataframe36. During QC, cells affected by artifacts are then removed from the dataframe. CyLinter is flexible, as it allows QC modules to be run iteratively and progress to be bookmarked within and between modules.
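The aggregation and redaction steps described above can be sketched as follows. CyLinter itself uses a Pandas dataframe; the version below substitutes plain Python lists of dictionaries so that it runs without dependencies, and the sample names, column names, and helper functions are hypothetical.

```python
def aggregate_tables(tables):
    """Combine per-sample tables into one list, tagging each row with its sample."""
    combined = []
    for sample, rows in tables.items():
        for row in rows:
            combined.append({"Sample": sample, **row})
    return combined


def drop_cells(combined, flagged):
    """Return rows whose (Sample, CellID) pair was not flagged as artifactual."""
    return [row for row in combined if (row["Sample"], row["CellID"]) not in flagged]


# Hypothetical per-sample spatial feature tables (one dict per segmented cell).
tables = {
    "sample_A": [{"CellID": 1, "DNA": 410.0}, {"CellID": 2, "DNA": 35.0}],
    "sample_B": [{"CellID": 1, "DNA": 395.0}],
}

combined = aggregate_tables(tables)

# Cell IDs repeat across samples, so redaction is keyed on (Sample, CellID).
flagged = {("sample_A", 2)}
cleaned = drop_cells(combined, flagged)
```

Keying redaction on the sample and cell identifier together matters because segmentation masks typically restart cell numbering for each image.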
The first CyLinter module, selectROIs, is used to view a multi-channel image to identify obvious artifacts, such as regions of tissue damage, antibody aggregates, and large illumination aberrations (Extended data Fig. 5e). Lasso tools native to the Napari image viewer are used to define regions of interest (ROIs) corresponding to artifacts and affected cells are removed from the spatial feature table (see https://labsyspharm.github.io/cylinter/ for details). We found that negative selection (in which highlighted cells are dropped from further analysis) worked effectively for the CRC image (Fig. 4b), but the TOPACIO dataset was affected by too many artifacts to use such an approach. Thus, CyLinter implements an optional positive ROI selection mode, in which users select regions of tissue devoid of artifacts for retention in the dataset (Fig. 4c). Although human identification of artifacts is effective, it can be slow. Therefore, CyLinter includes a companion algorithm that works with the selectROIs module to automatically flag likely artifacts for human review (Extended data Fig. 5f-i). The algorithm is based on classical image processing approaches (see Methods) that identify features with intensities lying outside the distribution of biological signals (e.g., illumination aberrations, antibody aggregates, and tissue folds). More sophisticated machine learning models such as multi-layer perceptrons and other neural networks are now being developed for integration into the CyLinter QC pipeline37,38,39.
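The difference between negative and positive ROI selection can be illustrated with a short sketch. It assumes each ROI is a closed polygon given as a list of (x, y) vertices and that each cell is represented by its centroid; the ray-casting test and the function names are illustrative choices, not CyLinter's internal implementation.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def apply_rois(cells, rois, mode="negative"):
    """Filter cells by ROI membership.

    mode="negative": drop cells inside any ROI (ROIs mark artifacts).
    mode="positive": keep only cells inside some ROI (ROIs mark clean tissue).
    """
    if mode not in ("negative", "positive"):
        raise ValueError("mode must be 'negative' or 'positive'")
    kept = []
    for cell in cells:
        in_roi = any(point_in_polygon(cell["x"], cell["y"], roi) for roi in rois)
        if in_roi == (mode == "positive"):
            kept.append(cell)
    return kept


# Hypothetical centroids and a single square ROI.
cells = [
    {"id": 1, "x": 1.0, "y": 1.0},
    {"id": 2, "x": 5.0, "y": 5.0},
    {"id": 3, "x": 9.0, "y": 1.0},
]
square_roi = [(0, 0), (2, 0), (2, 2), (0, 2)]

kept_negative = apply_rois(cells, [square_roi], mode="negative")
kept_positive = apply_rois(cells, [square_roi], mode="positive")
```

The same ROI therefore yields complementary results in the two modes, which is why positive selection is the more practical choice when artifacts cover most of a specimen.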
CyLinter’s dnaIntensity module allows users to inspect histogram distributions of per-cell mean nuclear intensities within each cycle. Nuclei at the extreme left side of the distribution often correspond to cells lying outside of the focal plane (Fig. 4d), while those to the right correspond to poorly segmented cells and those within tissue folds (Fig. 4e); this module redacts data based on user-assigned lower and upper thresholds (Extended data Fig. 5j). Instances of over- and under-segmentation can be identified based on the area of each segmentation instance (typically expressed in number of pixels) and then removed using the dnaArea module (Extended data Fig. 5k). This method was particularly effective at removing many over-segmented cells in the CRC image (Fig. 4f) and under-segmented cells, which were common among tightly-packed columnar epithelial cells in normal colon specimens (e.g., EMIT TMA core 84; Fig. 4g).
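Both modules amount to keeping cells whose value for one feature falls inside a user-chosen window. A minimal sketch of that gating step is given below; the feature names, intensity and area values, and thresholds are hypothetical, and in practice the thresholds are set interactively from the histograms.

```python
def gate_by_range(cells, key, lower, upper):
    """Split cells into (kept, redacted) by an inclusive [lower, upper] window."""
    kept, redacted = [], []
    for cell in cells:
        (kept if lower <= cell[key] <= upper else redacted).append(cell)
    return kept, redacted


# Hypothetical per-cell features: mean nuclear intensity and nuclear area (pixels).
cells = [
    {"id": 1, "dna_mean": 50.0, "area": 120},   # dim: likely out of focus
    {"id": 2, "dna_mean": 400.0, "area": 110},  # typical nucleus
    {"id": 3, "dna_mean": 950.0, "area": 130},  # bright: e.g. within a tissue fold
    {"id": 4, "dna_mean": 420.0, "area": 15},   # tiny: over-segmented fragment
    {"id": 5, "dna_mean": 380.0, "area": 600},  # large: under-segmented clump
]

# Step 1: intensity window (analogous to dnaIntensity).
after_intensity, dropped_intensity = gate_by_range(cells, "dna_mean", 100.0, 800.0)

# Step 2: area window applied to the survivors (analogous to dnaArea).
after_area, dropped_area = gate_by_range(after_intensity, "area", 50, 300)
```

Returning the redacted cells alongside the retained ones keeps them available for later review rather than discarding them outright.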
In cyclic imaging methods, nuclei are re-imaged every cycle40,41 and CyLinter’s cycleCorrelation module exploits this to identify cells that were lost or substantially damaged during imaging by computing histograms of log₁₀-transformed ratios of DNA intensity between the first and last imaging cycles for a particular image (Extended data Fig. 5l). Cells that are lost give rise to a discrete peak with log₁₀[DNA₁/DNAₙ] > 0 and gating the histogram eliminates these cells from the data table (Fig. 4h). A further module (pruneOutliers) exploits the fact that artifacts are often (but not always) associated with brighter signals relative to real ones. With the pruneOutliers module, it is possible to simultaneously visualize histograms of per-cell signal intensities from all samples in a tissue batch, then assign lower and upper percentile cutoffs (Fig. 4i and Extended data Fig. 5m). Not all channels will contain artifacts, and cells falling outside of the thresholds can therefore be visualized in tissues to ensure that selected data points are indeed artifacts. We have found that this approach is particularly effective at removing cells affected by antibody aggregates, which can be small and tedious to curate through the selectROIs module.
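The two gating ideas in this paragraph can be sketched in a few lines. The first function computes the log10 ratio of first-cycle to last-cycle DNA intensity, which is near zero for stable cells and strongly positive for cells that were lost; the second applies lower and upper percentile cutoffs to a single channel. The intensity values, the ratio cutoff of 0.5, and the linear-interpolation percentile are assumptions made for illustration, not CyLinter defaults.

```python
import math


def dna_log_ratio(dna_first, dna_last):
    """log10 of first-cycle over last-cycle DNA intensity (dna_first must be > 0)."""
    if dna_last <= 0:
        return math.inf  # no nuclear signal left: treat the cell as lost
    return math.log10(dna_first / dna_last)


def gate_stable_cells(cells, max_log_ratio):
    """Keep cells whose log10(DNA_first / DNA_last) does not exceed the cutoff."""
    return [
        cell for cell in cells
        if dna_log_ratio(cell["dna_first"], cell["dna_last"]) <= max_log_ratio
    ]


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    low, high = math.floor(position), math.ceil(position)
    return ordered[low] + (ordered[high] - ordered[low]) * (position - low)


def prune_outliers(values, lower_q, upper_q):
    """Return indices of values lying within the [lower_q, upper_q] percentiles."""
    low_cut = percentile(values, lower_q)
    high_cut = percentile(values, upper_q)
    return [i for i, value in enumerate(values) if low_cut <= value <= high_cut]


# Hypothetical DNA intensities in the first and last imaging cycles.
cells = [
    {"id": 1, "dna_first": 1000.0, "dna_last": 950.0},  # stable
    {"id": 2, "dna_first": 1000.0, "dna_last": 10.0},   # detached: ratio = 2.0
    {"id": 3, "dna_first": 800.0, "dna_last": 820.0},   # stable
]
stable = gate_stable_cells(cells, max_log_ratio=0.5)

# Hypothetical channel intensities with one bright antibody aggregate.
intensities = [10, 12, 11, 13, 12, 11, 10, 12, 13, 5000]
kept_indices = prune_outliers(intensities, lower_q=0, upper_q=90)
```

Percentile cutoffs are relative to each channel's own distribution, which is why the text recommends viewing the gated cells in the image before committing to their removal.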
Correcting for bias in user-guided histology QC via unsupervised cell clustering
Human-guided artifact detection is subject to errors and biases. CyLinter therefore implements a metaQC module (Extended data Fig. 5n) that performs unsupervised clustering on equal combinations of redacted and retained data. Cells flagged for redaction that fall within predominantly clean clusters in the retained data can be added back to the dataset, while those retained in the dataset that co-cluster with predominantly noisy cells (which were presumably missed during QC) can be removed from the data table. Like the metaQC module, CyLinter’s clustering module (Extended data Fig. 5o) allows users to perform UMAP42 or t-SNE43 data dimensionality reduction and HDBSCAN30 density-based clustering to identify discrete cell populations in high-dimensional feature space. After achieving an optimal cluster solution, the setContrast module adjusts per-channel image contrast settings (Extended data Fig. 5p) that are applied to all tissues in a batch. The curateThumbnails module then selects individual cells at random from each cluster and curates image galleries for visual confirmation of clusters as bona fide cell states or residual dataset noise (Extended data Fig. 5q). Together, these additional clustering-based QC steps allow a user to revise any prior cleaning and clustering modules using a more objective approach than image inspection alone.
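The reclassification logic behind metaQC can be sketched as follows, assuming that clustering of the combined redacted and retained cells has already been performed and that each cell carries a cluster label and its current QC call. The thresholds that define "predominantly clean" and "predominantly noisy" clusters (20% and 80% redacted) are hypothetical values chosen for illustration.

```python
from collections import Counter


def meta_qc(cells, clean_max=0.2, noisy_min=0.8):
    """Revise QC calls using the redacted fraction of each cluster.

    cells: list of dicts with "cluster" and "redacted" (bool) keys.
    Returns a list of booleans giving the final redaction call per cell.
    """
    totals = Counter(cell["cluster"] for cell in cells)
    redacted = Counter(cell["cluster"] for cell in cells if cell["redacted"])

    final = []
    for cell in cells:
        fraction = redacted[cell["cluster"]] / totals[cell["cluster"]]
        if fraction <= clean_max:
            final.append(False)  # predominantly clean cluster: add cell back
        elif fraction >= noisy_min:
            final.append(True)  # predominantly noisy cluster: remove cell
        else:
            final.append(cell["redacted"])  # mixed cluster: keep original call
    return final


# Hypothetical clustering result: "a" is mostly clean, "b" mostly noisy, "c" mixed.
cells = (
    [{"cluster": "a", "redacted": False}] * 9 + [{"cluster": "a", "redacted": True}]
    + [{"cluster": "b", "redacted": True}] * 9 + [{"cluster": "b", "redacted": False}]
    + [{"cluster": "c", "redacted": True}] * 5 + [{"cluster": "c", "redacted": False}] * 5
)
final_calls = meta_qc(cells)
```

Leaving mixed clusters unchanged is a deliberately conservative choice in this sketch; such clusters are the ones most in need of visual review.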
Impact of CyLinter-based quality control on the TOPACIO and CRC datasets
Applying CyLinter to the CRC dataset resulted in the removal of ~23% of total cells (Fig. 5a). Over-segmentation was the largest problem, affecting ~16% of cells, with 2% or less of the data being dropped by any one of the remaining QC modules. Thus, use of better segmentation algorithms (many of which are in development) would in principle have allowed 93% of the data to be retained. Using CyLinter to perform HDBSCAN clustering on the cleaned CRC dataset, we identified 78 clusters (Fig. 5b), 56 more than the pre-QC CRC clustering (Fig. 2a). Silhouette scores were predominantly positive, suggesting an optimal clustering (Fig. 5c), and agglomerative hierarchical clustering yielded six meta-clusters with marker expression patterns corresponding to populations of tumor cells (meta-cluster A; Fig. 5d), stromal cells (B), memory T cells (C), macrophages (D), B cells (E), and effector T cells (F). Using curateThumbnails we confirmed that all 78 clusters were free of visual artifacts (Fig. 5e,f and Online Supplementary Fig. 3) and reasoned that the increase in the number of clusters in the post-QC CRC embedding was due to the removal of pre-QC outliers that constrained the remainder of the cells to a relatively narrow region of the UMAP feature space. For example, pre-QC CRC cluster 6 (Fig. 2d) resolved in the post-QC embedding into seven discrete sets of cells with distinct markers and spatial locations (Fig. 5g,h, Extended data Fig. 6a and Online Supplementary Fig. 3). We concluded that even small numbers of artifacts corresponding to outliers in image intensity dramatically affect data interpretation.
In the case of the TOPACIO dataset, CyLinter removed 84% of the total cells. The majority (~53%) were removed during the process of positive ROI selection, in which selected regions of tissue largely devoid of obvious artifacts were curated (Fig. 6a). Bright outliers attributed to antibody aggregates, cell detachment, mis-segmentation, and dim/over-saturated nuclei accounted for ~14%, 12%, 4%, and 1% of redacted data, respectively. Overall, the post-QC TOPACIO dataset comprised ~3.0 × 10⁶ cells (~16% of total segmentation instances in the pre-QC dataset) and HDBSCAN clustering identified 43 clusters in the UMAP embedding (Fig. 6b). Silhouette analysis revealed positive scores for cells in most clusters except for those in cluster 42 (Extended data Fig. 6b) – the largest cluster in the embedding. Agglomerative hierarchical clustering based on mean marker intensities yielded four meta-clusters corresponding to stromal (meta-cluster A; Fig. 6c), tumor (B), lymphoid (C), and myeloid (D) cells. Using CyLinter’s curateThumbnails module to inspect image patches of cells from each cluster, we found that most cells had a high degree of concordance in morphology and marker expression among replicates and were consistent with extant cell types (Online Supplementary Fig. 4). For example, post-QC TOPACIO cluster 0 corresponded to cells with small, round nuclei with intense plasma membrane staining for CD4 and nuclear staining for FOXP3 (Fig. 6d), consistent with regulatory T cells, while cells in cluster 42 had high panCK and moderate ECAD staining (the latter at cell-cell junctions) indicative of de-differentiating breast epithelial cells (Fig. 6e). Coloring the post-QC UMAP embedding by select pre-QC cluster labels confirmed that many pre-QC clusters were in fact composed of different cell types (Fig. 6f).
For example, pre-QC TOPACIO cluster 174 contained cells that resolved into at least 14 different post-QC clusters representing an array of different lymphoid, tumor, and stromal cell populations. Multiple other pre-QC clusters exhibited a similar pattern. These data show that abundant imaging artifacts in the TOPACIO dataset not only resulted in an unrealistically large number of clusters, but that these clusters were still under-clustered insofar as they contained cells of different type. We found that redacted cells from both the CRC and TOPACIO datasets showed no discernable pattern in their location, suggesting minimal sampling bias in the QC of these two datasets (Extended data Fig. 6c,d).
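The cell accounting reported for the TOPACIO dataset can be checked with simple arithmetic. The snippet below uses only the approximate percentages and cell counts quoted in the text, so the results are accurate only to the rounding of those figures.

```python
# Approximate percentages of pre-QC cells removed at each step, as quoted in the text.
removed_by_step = {
    "positive ROI selection": 53,
    "bright outliers (antibody aggregates)": 14,
    "cell detachment": 12,
    "mis-segmentation": 4,
    "dim/over-saturated nuclei": 1,
}

total_removed_pct = sum(removed_by_step.values())  # expected to match the 84% reported
retained_pct = 100 - total_removed_pct             # share of cells surviving QC

pre_qc_cells = 1.9e7                               # ~1.9 x 10^7 segmented nuclei
post_qc_cells = pre_qc_cells * retained_pct / 100  # compare with the ~3.0 x 10^6 reported
```

The per-step percentages sum to the reported total, and the retained fraction applied to the pre-QC count reproduces the reported post-QC dataset size to within rounding.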
Despite significant improvement in both post-QC CRC and TOPACIO clustering, visual inspection revealed clusters with unexpected immunomarker expression patterns. For example, cells in post-QC CRC cluster 13 had high levels of tumor/epithelial markers such as Keratin, ECAD, and PCNA, as well as intermediate levels of T cell markers such as CD3, CD45RO, CD45, and CD8α (Extended data Fig. 6e). There is no known cell type that expresses these marker combinations and we found that cluster 13 contained keratin+ tumor cells neighboring CD8α+ T cells (Extended data Fig. 6f), resulting in some pixels being incorrectly assigned to neighboring cells, a phenomenon referred to as lateral spillover44. Future integration of tools for correcting lateral spillover, such as REDSEA44, into CyLinter may help with this problem, but until then, these instances must be identified by visual inspection.
DISCUSSION
In this paper we show that artifacts commonly present in highly multiplexed tissue images can have a dramatic impact on single-cell analysis. Artifacts such as tissue folds, lint or hair, and antibody aggregates are often outliers with respect to intensity and/or shape and give rise to elements in spatial feature tables that substantially interfere with clustering algorithms such as HDBSCAN, lead to uninterpretable elements in UMAP embeddings, and obscure accurate cell type assignment. Inspection of CyCIF, CODEX, and mIHC images suggests that artifacts can be subdivided into: 1) those intrinsic to the specimen itself (e.g., tissue folds), 2) those arising during staining and image acquisition (e.g., antibody aggregates), and 3) those arising during image processing (e.g., segmentation errors). The first class is unavoidable and does not usually interfere with visual review by human experts because humans can easily discern and ignore these types of artifacts, but it can dramatically affect computational analysis, particularly of large whole-slide images. The second and third classes of artifacts can be minimized by careful experimental practices and good instrumentation, and we find that as few as 5-10% of cells need to be removed in the best cases. However, specimens that have been mounted on slides many years prior to imaging and stored under suboptimal conditions are more problematic. Unfortunately, specimens of this type are commonly encountered in correlative studies of clinical trial data, and high demand for trial specimens means that often only one slide is available for each specimen, so it is impossible to go back and fix errors that may arise. In these situations, it is imperative that robust artifact removal and QC procedures be used so that results can be obtained from invaluable clinical specimens.
Quality control is recognized as a critical step in the acquisition of scRNA-seq data, and a robust ecosystem of QC tools has been developed in that domain45,46. CyLinter is among the first tools to be developed for QC of single-cell data from highly multiplexed tissue images. It works with any TIFF/OME-TIFF files and their corresponding spatial feature tables (CSV format) and is available both as a stand-alone plugin for the popular Napari multi-channel image viewer25 and as a component of the MCMICRO19 pipeline. CyLinter relies on human visual review, making it effective with a wide range of image and specimen types. However, this process can be time-consuming. To make the QC task easier, CyLinter includes a variety of computational approaches for efficiently identifying artifacts or, in some cases, selecting only artifact-free regions of tissue. Currently, CyLinter QC of a ~20-specimen dataset by a single reviewer can take a few days; this compares favorably with several weeks to collect data, and a month or more to perform detailed spatial analysis. In the future, many of these tasks will likely first be performed by machine learning models trained to identify different classes of artifacts. However, we have found that training such classifiers is quite difficult and unsupervised methods have failed thus far. In contrast, many classical image processing approaches – gating for example – have proven effective. Our long-term plan is to incorporate semi-automated machine learning into the human-in-the-loop CyLinter pipeline with a particular focus on automated QC for large batches of similar specimens. It may also be possible to correct for some errors and impute missing values rather than simply redacting data points – the current approach is intentionally conservative in this regard.
When datapoints are removed from a dataset there is always concern that key findings will be biased. This is particularly true in cases such as the TOPACIO dataset, in which the majority of the cells were redacted. The same problem holds for scRNA-Seq, although much of the problem arises prior to sequencing, for example during tissue dissociation, microfluidic or flow cytometry sorting, and library preparation45,47. In the case of tissue imaging, it is possible to inspect redacted datapoints for patterns. Analysis can then be conditioned on any biases identified. In the case of clinical cohorts, it is particularly important to ensure that data redaction during QC does not affect one subgroup or trial more than another.
Our experience with over 1,000 whole-slide images from dozens of tissue and tumor types has continued to demonstrate the necessity of ongoing visual review of high-plex tissue images, ideally with assistance from a trained pathologist. Any hypothesis generated through analysis of data in a spatial feature table must be confirmed through careful inspection of the underlying images, much as sequence polymorphisms were historically validated by review and presentation of the primary data (i.e., gels, sequencer output, etc.). The existing generation of spatial feature tables not only contains errors and omissions, but it also poorly represents much of the morphological information in an image. Continued innovation in QC procedures and image processing algorithms is needed to overcome these issues.
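One simple way to look for subgroup bias in redaction is to compare the fraction of cells removed in each subgroup. The following is a minimal sketch, not part of CyLinter: the record layout and the field names `cell_id` and `arm` are hypothetical, and the data are made up purely for illustration.

```python
from collections import Counter


def redaction_by_group(cells, kept_ids, group_key="arm"):
    """Return {group: fraction of that group's cells removed during QC}.

    `cells` is the pre-QC table (one dict per cell) and `kept_ids` is the
    set of cell identifiers that survived QC. Field names are hypothetical.
    """
    total = Counter()
    dropped = Counter()
    for cell in cells:
        group = cell[group_key]
        total[group] += 1
        if cell["cell_id"] not in kept_ids:
            dropped[group] += 1
    return {group: dropped[group] / total[group] for group in total}


# Toy pre-QC table: four cells in each of two hypothetical subgroups.
cells = [{"cell_id": i, "arm": "A" if i <= 4 else "B"} for i in range(1, 9)]
kept_ids = {1, 2, 3, 5}  # cells retained after QC

fractions = redaction_by_group(cells, kept_ids)
for group, fraction in sorted(fractions.items()):
    print(f"subgroup {group}: {fraction:.0%} of cells redacted")
```

A large gap between subgroups, as in this toy example, would be a signal to condition downstream analysis on the imbalance or to revisit the QC decisions for the affected specimens.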
METHODS
Software Implementation
CyLinter software is written in Python 3, archived on the Anaconda package repository, version-controlled on Git/GitHub (https://github.com/labsyspharm/cylinter), instantiated as a configurable Python Class object, and validated for Mac and PC operating systems. The tool can be installed at the command line using the Anaconda package installer (see the CyLinter website: https://labsyspharm.github.io/cylinter/ for details) and is executed with the following command: cylinter configuration.yml, where configuration.yml is an experiment-specific YAML configuration file. An optional --module flag can be passed before specifying the path to the configuration file to begin the pipeline at a specified module. More details on configuration settings can be found at the CyLinter website and GitHub repository (https://github.com/labsyspharm/cylinter)49. The tool uses the Napari image viewer50 for image browsing and annotation tasks. The tool also uses numerical and image-processing routines from multiple Python data science libraries, including pandas, numpy, matplotlib, seaborn, SciPy, scikit-learn, and scikit-image. OME-TIFF files are read using tifffile and processed into multi-resolution pyramids using a combination of Zarr and dask routines that allow for rapid panning and zooming of large (hundreds of GB) images. The CyLinter pipeline consists of multiple QC modules, each implemented as a Python function, that perform different visualization, data filtration, or analysis tasks. Several modules return redacted versions of the input spatial feature table, while others perform analysis tasks such as cell clustering. CyLinter is freely available for academic re-use under the MIT license. A minimal example dataset consisting of 4 tissue cores from the EMIT TMA22 dataset can be downloaded from the Synapse data repository (Synapse ID: syn52468155) by following instructions at the CyLinter website (https://labsyspharm.github.io/cylinter/exemplar/).
All CyLinter analyses presented in this work were performed on a commercially available 2019 MacBook Pro equipped with eight 2.4 GHz Intel Core i9 processors (5.0 GHz Turbo Boost) and 32 GB of 2400 MHz DDR4 memory. Imaging data analyzed in this study were stored on and accessed from an external hard drive with 12 TB capacity. Implemented software versions are as follows: Python 3.11.5, CyLinter 0.0.47.
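The idea of a pipeline built from QC modules, each a Python function that takes a spatial feature table and returns a (possibly redacted) table, can be illustrated with a short sketch. This is not CyLinter's code: the module names, thresholds, and column names (`area`, `mean_dna`) are hypothetical, and the `start_at` argument only loosely mimics the behavior of the --module flag by skipping earlier modules.

```python
def drop_small_cells(table, min_area=50):
    """Hypothetical QC module: remove cells with very small segmentation area."""
    return [row for row in table if row["area"] >= min_area]


def drop_bright_outliers(table, max_intensity=60000):
    """Hypothetical QC module: remove cells with saturated nuclear signal."""
    return [row for row in table if row["mean_dna"] <= max_intensity]


# Ordered list of (name, function) pairs; names are illustrative only.
MODULES = [
    ("area_filter", drop_small_cells),
    ("intensity_filter", drop_bright_outliers),
]


def run_pipeline(table, modules, start_at=None):
    """Apply modules in order, optionally beginning at a named module."""
    names = [name for name, _ in modules]
    if start_at is not None and start_at not in names:
        raise ValueError(f"unknown module: {start_at}")
    first = 0 if start_at is None else names.index(start_at)
    for _, module in modules[first:]:
        table = module(table)  # each module returns a redacted table
    return table


table = [
    {"cell_id": 1, "area": 120, "mean_dna": 21000},
    {"cell_id": 2, "area": 30, "mean_dna": 18000},
    {"cell_id": 3, "area": 140, "mean_dna": 65000},
]

full = run_pipeline(table, MODULES)
resumed = run_pipeline(table, MODULES, start_at="intensity_filter")
print([row["cell_id"] for row in full])
print([row["cell_id"] for row in resumed])
```

Because each module has the same table-in, table-out shape, modules can be reordered or skipped without changing the others; in this sketch, resuming at a later module simply applies the remaining filters to whatever table is passed in.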
Automated Artifact Detection in CyLinter
An algorithm consisting of classical image analysis steps was designed to automatically identify prevalent artifacts commonly found in highly multiplexed images (e.g., illumination aberrations, antibody aggregates, and tissue folding). The model is applied on a channel-by-channel basis and works on down-sampled versions of each channel, rescaling pixel values to uint8 bit depth for efficient processing. A series of operations in mathematical morphology consisting of erosion and local mean smoothing followed by dilation are applied to transform each down-sampled image channel. These three steps utilize a disk kernel, where the kernel size is a user-defined parameter assumed to have a diameter on the order of 3-5 single cells, conditional on image pixel size. This kernel is then expanded to find local maxima seed points corresponding to putative artifacts. Each artifact is extracted via a flood fill operation according to a specific tolerance parameter that is adjusted in real-time by the user. The union of the flood fill regions produces a binary artifact mask that is resized to the original image dimensions; cells falling within mask boundaries are then dropped from the corresponding spatial feature table.
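The final steps described above (flood fill from seed points, union of the filled regions into a mask, and removal of cells inside the mask) can be sketched in plain Python. This is an illustrative simplification under several assumptions, not the CyLinter implementation: the morphological pre-processing and seed detection are omitted and seeds are supplied directly, the fill uses 4-connectivity and compares each pixel to the seed value, and instead of resizing the mask to full resolution the cell coordinates are divided by a hypothetical down-sampling factor `scale`.

```python
from collections import deque


def flood_fill(image, seed, tolerance):
    """Return the set of (row, col) pixels connected to `seed` whose
    intensity is within `tolerance` of the seed pixel's intensity."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


def artifact_mask(image, seeds, tolerance):
    """Union of the flood-filled regions grown from every seed point."""
    mask = set()
    for seed in seeds:
        mask |= flood_fill(image, seed, tolerance)
    return mask


def drop_masked_cells(cells, mask, scale):
    """Keep cells whose centroid, mapped to the down-sampled grid,
    falls outside the artifact mask."""
    return [
        cell for cell in cells
        if (int(cell["y"] // scale), int(cell["x"] // scale)) not in mask
    ]


# Toy down-sampled uint8 channel with one bright blob (a putative artifact).
image = [
    [10, 10, 10, 10, 10],
    [10, 200, 210, 10, 10],
    [10, 205, 255, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 90],
]
mask = artifact_mask(image, seeds=[(2, 2)], tolerance=60)

# Cell centroids in full-resolution pixels; the image is 4x down-sampled.
cells = [
    {"cell_id": 1, "x": 9, "y": 9},
    {"cell_id": 2, "x": 1, "y": 1},
    {"cell_id": 3, "x": 6, "y": 5},
    {"cell_id": 4, "x": 18, "y": 18},
]
kept = drop_masked_cells(cells, mask, scale=4)
print(sorted(mask))
print([cell["cell_id"] for cell in kept])
```

Raising the tolerance grows each region outward from its seed, which is why exposing it as an interactively adjustable parameter gives the reviewer direct control over how much tissue around an artifact is redacted.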
t-CyCIF
The CyCIF approach to multiplex imaging involves iterative cycles of antibody incubation with tissue, imaging, and fluorophore deactivation as described previously26; protocols and methods related to CyCIF are available on Protocols.io (see “Detailed Experimental Protocols” below). Briefly, multiplex CyCIF images were collected using a RareCyte CyteFinder II HT Instrument equipped with a 20x (0.75 NA) objective and 2x2 pixel binning. This setup allowed for the acquisition of 4-channel image tiles with dimensions 1280x1080 pixels and a corresponding pixel size of 0.65 μm/pixel. All four channels are imaged during each round of CyCIF, one of which is always reserved for nuclear counterstain (Hoechst or DAPI) to visualize cell nuclei. RCPNL files containing 16-bit imaging data were generated (one per image tile) during each imaging cycle.
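As a quick check on scale, the tile dimensions and pixel size given above imply the physical field of view of each tile. The short calculation below is a sketch that ignores any overlap between neighboring tiles used for stitching, which is not specified here; the function name is hypothetical.

```python
def tile_size_um(width_px, height_px, um_per_px):
    """Physical width and height of one image tile in micrometers."""
    return width_px * um_per_px, height_px * um_per_px


# Values taken from the acquisition settings described above.
width_um, height_um = tile_size_um(1280, 1080, 0.65)
print(f"{width_um:.0f} x {height_um:.0f} um")
```

Each tile therefore covers a field of roughly 0.83 mm by 0.70 mm.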
Image Processing
Raw microscopy image tiles (RCPNL files) for the datasets described in this study were processed into stitched, registered, and segmented OME-TIFF51 files using the MCMICRO image-processing pipeline29. Corresponding cell x feature CSV files (i.e., spatial feature tables) were also generated by MCMICRO. Specific algorithms implemented in MCMICRO for the processing of each dataset are as follows: BaSiC—a Fiji/ImageJ plugin for background and shading correction used to perform flatfield and darkfield image correction52; ASHLAR—a program for seamless mosaic image processing across imaging cycles40; Coreograph (used for the EMIT dataset, https://github.com/HMS-IDAC/UNetCoreograph)—for dearraying the mosaic TMA image into individual TIFF and CSV files per core; UnMICST41—used for cell segmentation; employs the U-Net53 deep learning architecture for semantic segmentation; S3segmenter (https://github.com/HMS-IDAC/S3segmenter); MCQuant (https://github.com/labsyspharm/quantification) for per cell feature extraction including X,Y spatial coordinates, segmentation areas, mean marker intensities, and nuclear morphology attributes.
TOPACIO
The TOPACIO dataset used in this study consists of 25 de-identified formalin-fixed, paraffin-embedded (FFPE) tissue sections (5 μm thick) of triple-negative breast cancer from patients enrolled in the TOPACIO clinical trial (ClinicalTrials.gov Identifier: NCT02657889). Specimens were collected via one of three different biopsy methods (fine needle, punch needle, or gross tumor resection) and procured from Tesaro and Merck pharmaceutical companies as part of the recently completed trial. Sections were mounted onto Superfrost Plus glass microscope slides (Fisher Scientific, 12-550-15) then dewaxed and antigen-retrieved using a Leica BOND RX Fully Automated Research Stainer prior to multiplex data acquisition by CyCIF. Images were acquired at 20x magnification with 2x2 binning over 10 CyCIF cycles using 27 markers (19 plus Hoechst evaluated in this study).
CRC
The CRC dataset consists of a whole-slide section (1.6 cm²) of human colorectal adenocarcinoma tissue (section# 097) imaged at 20x magnification with 2x2 binning over 10 CyCIF cycles using 24 markers (21 plus Hoechst evaluated in the current study) and collected as part of the Human Tumor Atlas Network (HTAN).
EMIT TMA22
The EMIT TMA dataset consists of human tissue specimens from 42 patients organized as a multi-tissue microarray (HTMA427) under an excess tissue protocol (clinical discards) approved by the IRB at Brigham and Women's Hospital (BWH IRB 2018P001627). Two (2) 1.5 mm diameter cores were acquired from each of 60 tissue regions with the goal of acquiring one or two examples of as many tumors as possible (with matched normal tissue from the same resection when feasible). Overall, the TMA contains 123 cores, including 3 “marker cores” consisting of normal kidney cortex that were added to the TMA in an arrangement that makes it possible to orient the overall TMA image. Excluding the marker cores, 44 cores were from males and 76 were from females between 21 and 86 years of age. The EMIT TMA22 dataset was acquired at 20x magnification with 2x2 binning over 10 CyCIF cycles using 27 markers (20 plus Hoechst evaluated in the current study) and is available for download from the Synapse data repository (https://www.synapse.org/#!Synapse:syn22345750).
HNSCC (CODEX)
The HNSCC CODEX dataset consists of two sections of the same deidentified specimen of head & neck squamous carcinoma (HNSCC) imaged at 20x magnification with 2x2 binning over 9 CODEX cycles using 15 markers plus DAPI.
Normal Tonsil (mIHC)
The mIHC dataset consists of a deidentified whole-slide tonsil specimen from a 4-year-old female of European ancestry procured from the Cooperative Human Tissue Network (CHTN), Western Division, as part of the HTAN SARDANA Trans-Network Project and imaged at 20x magnification with 2x2 binning (0.5 μm/pixel) over 5 mIHC cycles using 18 markers plus Hoechst.
Detailed Experimental Protocols
1. FFPE Tissue Pre-treatment Before t-CyCIF on Leica Bond RX V.2 (dx.doi.org/10.17504/protocols.io.bji2kkge)
2. Tissue Cyclic Immunofluorescence (t-CyCIF) V.2 (dx.doi.org/10.17504/protocols.io.bjiukkew)
Ethics and IRB Statement
The research described in this manuscript was performed on previously published imaging data in part obtained through the Human Tumor Atlas Network (HTAN) and tissue samples from the recently completed TOPACIO clinical trial (ClinicalTrials.gov Identifier: NCT02657889) which was conducted in accordance with ethical principles founded in the Declaration of Helsinki. This study received central approval by the Dana-Farber Cancer Institute (DFCI) institutional review board, protocol 15-550, and/or relevant competent authorities at each clinical trial site. All patients provided written informed consent to participate in the study. All samples and data have been deidentified for the work performed at Harvard Medical School, approved under Institutional Review Boards (IRB) protocol 19-0186. The research complies with all relevant ethical regulations, was reviewed and approved by the IRBs at HMS and DFCI and is considered Non-Human Subjects Research.
Acknowledgements
This work was supported by Ludwig Cancer Research and the Ludwig Center at Harvard (P.K.S., S.S.) and by NIH NCI grants U2C-CA233280 and U2C-CA233262 (P.K.S., S.S.). Development of computational methods and image processing software is supported by a Team Science Grant from the Gray Foundation (P.K.S., S.S.), the Gates Foundation grant INV-027106 (P.K.S.), the David Liposarcoma Research Initiative (P.K.S., S.S.) and the Emerson Collective (P.K.S.). S.S. is supported by the BWH President’s Scholars Award. We gratefully acknowledge Juliann Tefft for superb editorial support; Kai Wucherpfennig and Sascha Marx for providing the HNSCC CODEX dataset; Zoltan Maliga and Connor Jacobson for providing CyCIF EMIT TMA22 images; and the Dana-Farber/Harvard Cancer Center for use of the Specialized Histopathology Core, which provided TMA construction and sectioning services. We also thank Yu-An Chen for assisting in the collection of CyCIF data from the SARDANA-097 tissue sample performed as part of the National Cancer Institute (NCI) Human Tumor Atlas Network (HTAN).
Footnotes
Competing Interests
P.K.S. is a cofounder and member of the Board of Directors of Glencoe Software, a member of the Board of Directors for Applied Biomath and a member of the Scientific Advisory Board for RareCyte, NanoString and Montai Health; he holds equity in Glencoe, Applied Biomath and RareCyte. P.K.S. is a consultant for Merck, and the Sorger lab has received research funding from Novartis and Merck in the past 5 years. P.K.S. declares that none of these relationships have influenced the content of this manuscript. E.A.M. reports compensated service on Scientific Advisory Boards for AstraZeneca, BioNTech and Merck; uncompensated service on Steering Committees for Bristol Myers Squibb and Roche/Genentech; speakers’ honoraria and travel support from Merck Sharp & Dohme; and institutional research support from Roche/Genentech (via an SU2C grant) and Gilead. She also reports research funding from Susan Komen for the Cure, for which she serves as a Scientific Advisor, and uncompensated participation as a member of the American Society of Clinical Oncology Board of Directors. J.L.G. serves or has previously served on advisory boards and/or as a scientific advisory board member for Array BioPharma/Pfizer, AstraZeneca, BD Biosciences, Carisma, Codagenix, Duke Street Bio, GlaxoSmithKline, Kowa, Kymera, OncoOne and Verseau Therapeutics, and has research grants from Array BioPharma/Pfizer, Duke Street Bio, Eli Lilly, GlaxoSmithKline and Merck. The other authors declare no competing interests.
Reporting Summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data Availability Statement
New data associated with this paper are available at the HTAN Data Portal (https://data.humantumoratlas.org). Previously published data are available through public repositories. See Supplementary Table 1 for a complete list of datasets and their associated identifiers and repositories. Online Supplementary Figures 1-4 and the CyLinter demonstration dataset can be accessed at Sage Synapse (https://www.synapse.org/#!Synapse:syn24193163/files).
Code Availability Statement
CyLinter source code is available for academic re-use under the MIT open-source license agreement at GitHub (https://github.com/labsyspharm/cylinter)49. Code used to produce the findings of the study is available at https://github.com/labsyspharm/cylinter-paper.
References
- 1. Angelo M. et al. Multiplexed ion beam imaging of human breast tumors. Nat. Med. 20, 436–442 (2014).
- 2. Gerdes M. J. et al. Highly multiplexed single-cell analysis of formalin-fixed, paraffin-embedded cancer tissue. Proc. Natl. Acad. Sci. U.S.A. 110, 11982–11987 (2013).
- 3. Giesen C. et al. Highly multiplexed imaging of tumor tissues with subcellular resolution by mass cytometry. Nat. Methods 11, 417–422 (2014).
- 4. Goltsev Y. et al. Deep Profiling of Mouse Splenic Architecture with CODEX Multiplexed Imaging. Cell 174, 968–981.e15 (2018).
- 5. Gut G., Herrmann M. D. & Pelkmans L. Multiplexed protein maps link subcellular organization to cellular states. Science 361, (2018).
- 6. Tsujikawa T. et al. Quantitative Multiplex Immunohistochemistry Reveals Myeloid-Inflamed Tumor-Immune Complexity Associated with Poor Prognosis. Cell Rep 19, 203–217 (2017).
- 7. Lin J.-R. et al. Highly multiplexed immunofluorescence imaging of human tissues and tumors using t-CyCIF and conventional optical microscopes. eLife 7, (2018).
- 8. Färkkilä A. et al. Immunogenomic profiling determines responses to combined PARP and PD-1 inhibition in ovarian cancer. Nat Commun 11, 1459 (2020).
- 9. Launonen I.-M. et al. Single-cell tumor-immune microenvironment of BRCA1/2 mutated high-grade serous ovarian cancer. Nat Commun 13, 835 (2022).
- 10. Schürch C. M. et al. Coordinated Cellular Neighborhoods Orchestrate Antitumoral Immunity at the Colorectal Cancer Invasive Front. Cell 182, 1341–1359.e19 (2020).
- 11. Wagner J. et al. A Single-Cell Atlas of the Tumor and Immune Ecosystem of Human Breast Cancer. Cell 177, 1330–1345.e18 (2019).
- 12. Burger M. L. et al. Antigen dominance hierarchies shape TCF1+ progenitor CD8 T cell phenotypes in tumors. Cell 184, 4996–5014.e26 (2021).
- 13. Gaglia G. et al. Temporal and spatial topography of cell proliferation in cancer. Nature Cell Biology 24, 316–326 (2022).
- 14. Nirmal A. J. et al. The spatial landscape of progression and immunoediting in primary melanoma at single cell resolution. Cancer Discov candisc.1357.2021 (2022) doi: 10.1158/2159-8290.CD-21-1357.
- 15. Burger M. L. et al. Antigen dominance hierarchies shape TCF1+ progenitor CD8 T cell phenotypes in tumors. Cell 184, 4996–5014.e26 (2021).
- 16. Gaglia G. et al. Temporal and spatial topography of cell proliferation in cancer. Nat Cell Biol 24, 316–326 (2022).
- 17. Gaglia G. et al. Lymphocyte networks are dynamic cellular communities in the immunoregulatory landscape of lung adenocarcinoma. Cancer Cell 41, 871–886.e10 (2023).
- 18. Fletcher C. D. M. Diagnostic histopathology of tumors. (2013).
- 19. Schapiro D. et al. MCMICRO: a scalable, modular image-processing pipeline for multiplexed tissue imaging. Nat Methods 19, 311–315 (2022).
- 20. Schapiro D. et al. MITI minimum information guidelines for highly multiplexed tissue images. Nat Methods 19, 262–267 (2022).
- 21. Lin J.-R. et al. Multiplexed 3D atlas of state transitions and immune interaction in colorectal cancer. Cell 186, 363–381.e19 (2023).
- 22. Center for Devices and Radiological Health. Technical Performance Assessment of Digital Pathology Whole Slide Imaging Devices. U.S. Food and Drug Administration; http://www.fda.gov/regulatory-information/search-fda-guidance-documents/technical-performance-assessment-digital-pathology-whole-slide-imaging-devices (2019).
- 23. Schapiro D. et al. MCMICRO: a scalable, modular image-processing pipeline for multiplexed tissue imaging. Nat Methods 1–5 (2021) doi: 10.1038/s41592-021-01308-y.
- 24. Vinayak S. et al. Open-label Clinical Trial of Niraparib Combined With Pembrolizumab for Treatment of Advanced or Metastatic Triple-Negative Breast Cancer. JAMA Oncology 5, 1132–1140 (2019).
- 25. Chiu C.-L., Clack N. & Community T. N. napari: a Python Multi-Dimensional Image Viewer Platform for the Research Community. Microscopy and Microanalysis 28, 1576–1577 (2022).
- 26. Lin J.-R. et al. Highly multiplexed immunofluorescence imaging of human tissues and tumors using t-CyCIF and conventional optical microscopes. eLife 7, (2018).
- 27. Goltsev Y. et al. Deep Profiling of Mouse Splenic Architecture with CODEX Multiplexed Imaging. Cell 174, 968–981.e15 (2018).
- 28. Tsujikawa T. et al. Quantitative Multiplex Immunohistochemistry Reveals Myeloid-Inflamed Tumor-Immune Complexity Associated with Poor Prognosis. Cell Rep 19, 203–217 (2017).
- 29. Schapiro D. et al. MCMICRO: A scalable, modular image-processing pipeline for multiplexed tissue imaging. http://biorxiv.org/lookup/doi/10.1101/2021.03.15.435473 (2021) doi: 10.1101/2021.03.15.435473.
- 30. McInnes L., Healy J. & Astels S. hdbscan: Hierarchical density based clustering. JOSS 2, 205 (2017).
- 31. Lin J.-R., Fallahi-Sichani M. & Sorger P. K. Highly multiplexed imaging of single cells using a high-throughput cyclic immunofluorescence method. Nat Commun 6, 8390 (2015).
- 32. Antigen Retrieval - an overview ∣ ScienceDirect Topics. https://www-sciencedirect-com.ezp-prod1.hul.harvard.edu/topics/medicine-and-dentistry/antigen-retrieval.
- 33. Bancroft’s theory and practice of histological techniques. (Elsevier, 2019).
- 34. Histologic preparations: common problems and their solutions. (College of American Pathologists, 2009).
- 35. Rousseeuw P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math 20, 53–65 (1987).
- 36. McKinney W. Data Structures for Statistical Computing in Python. in 56–61 (2010). doi: 10.25080/Majora-92bf1922-00a.
- 37. Ruff L. et al. A Unifying Review of Deep and Shallow Anomaly Detection. Proc. IEEE 109, 756–795 (2021).
- 38. Prabhakaran S. et al. Addressing persistent challenges in digital image analysis of cancerous tissues. http://biorxiv.org/lookup/doi/10.1101/2023.07.21.548450 (2023) doi: 10.1101/2023.07.21.548450.
- 39. Shen D., Wu G. & Suk H.-I. Deep Learning in Medical Image Analysis. Annu. Rev. Biomed. Eng. 19, 221–248 (2017).
- 40. Muhlich J., Chen Y.-A., Russell D. & Sorger P. K. Stitching and registering highly multiplexed whole slide images of tissues and tumors using ASHLAR software. http://biorxiv.org/lookup/doi/10.1101/2021.04.20.440625 (2021) doi: 10.1101/2021.04.20.440625.
- 41. Yapp C. et al. UnMICST: Deep learning with real augmentation for robust segmentation of highly multiplexed images of human tissues. http://biorxiv.org/lookup/doi/10.1101/2021.04.02.438285 (2021) doi: 10.1101/2021.04.02.438285.
- 42. McInnes L., Healy J. & Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426 [cs, stat] (2018).
- 43. van der Maaten et al. Visualizing high-dimensional data using t-SNE. J Mach Learn Res 9, 2579–2605 (2008).
- 44. Bai Y. et al. Adjacent Cell Marker Lateral Spillover Compensation and Reinforcement for Multiplexed Images. Front Immunol 12, 652631 (2021).
- 45. Denisenko E. et al. Systematic assessment of tissue dissociation and storage biases in single-cell and single-nucleus RNA-seq workflows. Genome Biology 21, 130 (2020).
- 46. McCarthy D. J., Campbell K. R., Lun A. T. L. & Wills Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
- 47. Li H. & Humphreys B. D. Single Cell Technologies: Beyond Microfluidics. Kidney360 2, 1196–1204 (2021).
- 48. Sternberg. Biomedical Image Processing. Computer 16, 22–34 (1983).
- 49. Baker Gregory. CyLinter. (2021) doi: 10.5281/ZENODO.7186909.
- 50. Sofroniew Nicholas et al. napari: a multi-dimensional image viewer for Python. (2022) doi: 10.5281/ZENODO.3555620.
- 51. Goldberg I. G. et al. The Open Microscopy Environment (OME) Data Model and XML file: open tools for informatics and quantitative analysis in biological imaging. Genome Biol 6, R47 (2005).
- 52. Peng T. et al. A BaSiC tool for background and shading correction of optical microscopy images. Nat Commun 8, 14836 (2017).
- 53. Ronneberger O., Fischer P. & Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (eds. Navab N., Hornegger J., Wells W. M. & Frangi A. F.) vol. 9351 234–241 (Springer International Publishing, 2015).