Abstract
The ability to align individual cellular information from multiple experimental sources is fundamental for a systems-level understanding of biological processes. However, currently available tools are mainly designed for single-cell transcriptomics matching and integration, and generally rely on a large number of shared features across datasets for cell matching. This approach underperforms when applied to single-cell proteomic datasets due to the limited number of parameters simultaneously accessed and lack of shared markers across these experiments. Here, we introduce a cell-matching algorithm, matching with partial overlap (MARIO), that accounts for both shared and distinct features and includes vital filtering steps to avoid suboptimal matching. MARIO accurately matches and integrates data from different single-cell proteomic and multimodal methods, including spatial techniques, and has cross-species capabilities. MARIO robustly matched tissue macrophages identified from COVID-19 lung autopsies via codetection by indexing imaging to macrophages recovered from COVID-19 bronchoalveolar lavage fluid by cellular indexing of transcriptomes and epitopes by sequencing, revealing unique immune responses within the lung microenvironment of patients with COVID.
Subject terms: Data integration, Statistical methods, Systems biology, Proteomic analysis
MARIO is a robust tool for accurately matching multimodal single-cell datasets.
Main
The rapid developments in single-cell technologies have fundamentally transformed the investigation of complex biological systems. The ability to individually measure the genomic1, epigenomic2, transcriptomic3, and proteomic4 states at the single-cell level marks an exciting era in biology. Single-cell transcriptomics and targeted proteomics are the two main approaches commonly used to delineate cell populations and infer functionality or disease states. Single-cell transcriptomics is theoretically able to assess the entire transcriptome of a target cell, with 5,000–10,000 unique gene transcripts captured on average for each cell. A key drawback of this method is the relative sparseness of the data generated, particularly for less abundant genes. On the other hand, antibody-based single-cell proteomics has gradually progressed over the years, from the initial detection of a handful of protein targets5,6, to about 40 targets via mass cytometry7, over 100 protein targets via sequencing8,9 and, most recently, more than 40 protein targets spatially resolved in their native tissue context10–13. Emerging sequencing-based approaches such as cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) and RNA expression and protein sequencing assay can simultaneously probe the RNA and protein levels for each single cell, albeit with the tradeoff of dissociating cells from their original spatial location.
Given the frequent overlap in proteins measured across dissociated single cells via sequencing, and intact tissues via antibody-imaging, an orthogonal approach would leverage information from one modality to inform the other. Such an effort would use biological measurements obtained on one modality (for example, CITE-seq) to inform cells measured using another modality (for example, codetection by indexing or CODEX) for a comprehensive assessment of the localization of both proteins and RNAs within tissue samples. It is therefore vital to be able to align individual cells across these experiments.
Several computational approaches for integrative analysis of single-cell data across multiple modalities currently exist14–18. However, most of these methods are tailored toward single-cell sequencing-based analysis, such as single-cell RNA-sequencing (scRNA-seq) and single-cell assay for transposase-accessible chromatin sequencing, and are not designed for protein-based assays. The number of shared features across proteomic datasets is orders of magnitude smaller than that in single-cell sequencing datasets, and the signals within these limited shared features alone are typically insufficient to produce high-quality and interpretable pairwise cell-matching results. In addition, the intrinsically greedy (and thus at most locally optimal) nature of the mutual nearest neighborhood (mNN) matching algorithm routinely used in available methods limits the ability to fully use the correlation structure within the distinct protein features. Thus, there is an unmet need for a new strategy specifically designed for matching and integrating single-cell datasets based on limited but robust proteomic parameters.
To meet this need, we have developed matching with partial overlap (MARIO): the matching process leverages both shared and distinct features between datasets, and is nongreedy by optimizing a global objective. We additionally developed two quality control steps, the matchability test and joint regularized filtering, to avoid suboptimal matching and prevent uninterpretable over-integration. Benchmarking of MARIO across various single-cell proteomic data generated from different modalities (cytometry by time of flight (CyTOF), CITE-seq, and CODEX) and from different species (human and nonhuman primates (NHPs)) demonstrated consistent outperformance of cell–cell-matching accuracy over available methods. Finally, we matched macrophages from a CODEX multiplex immunofluorescence lung autopsy dataset to CITE-seq bronchoalveolar lavage fluid (BALF) macrophage cells using MARIO to uncover a spatially orchestrated immune conditioning by complement-expressing macrophages and neutrophils in COVID-19. To make MARIO freely available to the public, we implemented the algorithm in the Python package MARIO, along with an R version available online at https://github.com/shuxiaoc/mario-py.
Results
Matching single cells using partially shared features
There are unique challenges in the implementation of a cell-matching algorithm using proteomic information. First, each study is often bespoke and rarely shares identical antibody panels, although a portion of the proteins measured is generally the same. Thus, the matching process must be able to achieve stable pairing of cells with the limited number of features; this is in contrast to transcriptomics data where often several hundreds to thousands of shared features are available16,17. Second, underlying correlations between shared and distinct features often exist within and between datasets as a result of panel design and fundamental biological principles. It is therefore pertinent to incorporate information from both shared and distinct protein features. Third, the matching problem corresponds to a well-defined objective function. The mNN-type algorithms can be thought of as solving this objective function in a greedy fashion, which often fails to attain the global optimum (see the Methods for mathematical details). As such, the matching problem should be solved to attain the global optimum rather than a local optimum. Finally, key quality control steps are crucial to ensure the accuracy and interpretability of the postulated cell–cell-matching results.
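The difference between a greedy and a globally optimal solution of the matching objective can be seen on a toy example. The sketch below is illustrative only: the greedy routine is a simple nearest-free-neighbor pairing used as a stand-in (it is not the mNN algorithm itself), the 2 x 2 distance matrix is hypothetical, and the global solution uses SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def greedy_match(dist):
    """Pair each row, in order, with its nearest column that is still free."""
    free = set(range(dist.shape[1]))
    pairs = []
    for i in range(dist.shape[0]):
        j = min(free, key=lambda c: dist[i, c])
        free.remove(j)
        pairs.append((i, j))
    return pairs


def global_match(dist):
    """Pair rows and columns so that the summed distance is minimal."""
    rows, cols = linear_sum_assignment(dist)
    return list(zip(rows.tolist(), cols.tolist()))


def total_cost(dist, pairs):
    """Sum of distances over the chosen pairs."""
    return float(sum(dist[i, j] for i, j in pairs))


# Hypothetical distances between two cells in each dataset.
dist = np.array([[1.0, 2.0],
                 [1.5, 9.0]])

print(greedy_match(dist), total_cost(dist, greedy_match(dist)))   # cost 10.0
print(global_match(dist), total_cost(dist, global_match(dist)))   # cost 3.5
```

Here the greedy pairing takes the locally best choice for the first cell and is then forced into an expensive pairing for the second, whereas the global solution accepts a slightly worse first pairing to lower the total.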
To address these challenges, we developed MARIO: a robust framework that accurately matches cells across single-cell proteomic datasets for downstream analysis (Fig. 1a,b). MARIO first performs a pairwise cell matching using shared features. To do this, after proper transformation, normalization and batching, we use singular value decomposition on shared features to construct a cross-data distance matrix based on the Pearson correlation of the reduced matrix. An initial cell–cell pairing is then obtained by solving a linear assignment problem that searches for a distance-minimizing injective map between the two collections of cells. The two datasets are next aligned using this initial matching, and both shared and distinct features of the two datasets are projected onto a common subspace using canonical correlation analysis (CCA)19, as it incorporates the hidden correlations between different proteomic features not shared between the datasets. A cross-dataset distance is then obtained using the canonical scores, and a refined matching is obtained via linear assignment on the new distance. By taking the means of the top sample canonical correlations as a proxy of matching quality, MARIO then finds the best convex combination weight to interpolate the initial and refined matchings. This allows users to data-adaptively backtrack toward the initial matching when the refined matching becomes unreliable (Fig. 1c).
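The initial matching step described above can be sketched as follows. This is a minimal illustration, not the MARIO package's implementation: the function names, the per-feature z-scoring, the stacking of both datasets before the singular value decomposition and the default number of components are all assumptions made for the example. The CCA refinement and the interpolation between initial and refined matchings are not shown.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def standardize(a):
    """Z-score each feature (column); a stand-in for dataset-specific normalization."""
    return (a - a.mean(axis=0)) / (a.std(axis=0) + 1e-12)


def initial_matching(x_shared, y_shared, n_components=10):
    """Match each row of x_shared to a distinct row of y_shared (needs n_x <= n_y).

    Both inputs hold the shared features only, in the same column order.
    Returns the match vector and the cross-data distance matrix.
    """
    x, y = standardize(x_shared), standardize(y_shared)
    # Right singular vectors of the stacked data give a common low-dimensional basis.
    _, _, vt = np.linalg.svd(np.vstack([x, y]), full_matrices=False)
    k = min(n_components, vt.shape[0])  # k should be at least 3 for a useful correlation
    x_red, y_red = x @ vt[:k].T, y @ vt[:k].T
    # The "correlation" metric is 1 - Pearson correlation between two rows.
    dist = cdist(x_red, y_red, metric="correlation")
    # Distance-minimizing injective map from x cells to y cells.
    rows, cols = linear_sum_assignment(dist)
    match = np.full(x.shape[0], -1)
    match[rows] = cols
    return match, dist
```

In this sketch `match[i]` is the index of the cell in the second dataset paired with cell `i` of the first, and no cell of the second dataset is used twice.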
After obtaining the interpolated matching, MARIO next performs a matchability test to determine whether or not the datasets selected for integration by the user are suitable for such a joint analysis. The matchability test is performed by flipping the sign of each row of the two datasets with some flipping probability, so that most of the underlying inter-dataset correlations (if these exist) are abrogated. This process is repeated several times to build a distribution of the background canonical correlations of the samples with a low underlying correlation. Comparison of the deviation of the sample canonical correlations from the background distribution reveals whether strong underlying information exists to connect the datasets (Fig. 1d).
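The idea behind the sign-flipping background distribution can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: it keeps the cell pairing fixed instead of rerunning the matching on the flipped data, it uses a small ridge term (`reg`) for numerical stability, and the function names, the default flipping probability and the empirical P value are choices made for this example rather than details taken from the Methods.

```python
import numpy as np


def canonical_correlations(a, b, k=3, reg=1e-3):
    """Top-k sample canonical correlations between row-paired matrices a and b."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    n = a.shape[0]

    def inv_sqrt(cov):
        # Inverse matrix square root through an eigendecomposition.
        w, v = np.linalg.eigh(cov)
        return (v / np.sqrt(w)) @ v.T

    wa = inv_sqrt(a.T @ a / n + reg * np.eye(a.shape[1]))
    wb = inv_sqrt(b.T @ b / n + reg * np.eye(b.shape[1]))
    # Singular values of the whitened cross-covariance are the canonical correlations.
    s = np.linalg.svd(wa @ (a.T @ b / n) @ wb, compute_uv=False)
    return s[:k]


def matchability_test(x, y_matched, n_null=100, flip_prob=0.5, k=3, seed=0):
    """Compare observed canonical correlations with a sign-flipped background."""
    rng = np.random.default_rng(seed)
    x = x - x.mean(axis=0)
    y = y_matched - y_matched.mean(axis=0)
    observed = canonical_correlations(x, y, k).mean()
    null = np.empty(n_null)
    for t in range(n_null):
        # Independent random row signs in each dataset break cross-dataset correlation.
        sx = np.where(rng.random(x.shape[0]) < flip_prob, -1.0, 1.0)
        sy = np.where(rng.random(y.shape[0]) < flip_prob, -1.0, 1.0)
        null[t] = canonical_correlations(x * sx[:, None], y * sy[:, None], k).mean()
    p_value = (1 + np.sum(null >= observed)) / (1 + n_null)
    return observed, null, p_value
```

When the two datasets share real structure, the observed mean canonical correlation sits well above every background value and the empirical P value is small; when they do not, the observed value falls inside the background distribution.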
Although datasets passing the matchability test are highly correlated, the matching at the individual cell level could still be erroneous. To address this problem, we developed a process termed jointly regularized filtering to automatically filter out low-quality matches without a priori biological knowledge. The filtering process is carried out by optimizing a regularized k-means objective. This objective is a superposition of two parts, where the first part contains individual k-means clustering objectives for both datasets and the second part penalizes the Hamming distance between the two individual cluster label vectors and a hypothesized ‘global’ label vector. Use of such a strategy stems from our hypothesis that although the populations being measured in two different experiments may contain modality-specific characteristics (thus the existence of ‘individual’ cluster labels), both originate from a biologically analogous population (thus the existence of a global cluster label that is close to the two individual cluster labels). If, for a matched pair of cells, the individual labels obtained by jointly regularized clustering are not the same, this matched pair is likely spurious and thus disregarded (Fig. 1d). After this filtering step, the resulting individually matched cells are subjected to CCA, and the canonical scores are used as the reduced components in the final embeddings. We implemented generalized CCA (gCCA) to achieve joint embedding of more than two datasets, and subsequently used the gCCA canonical scores as dimensionally reduced components in the final embeddings (Fig. 1e). Mathematical details can be found in the Methods.
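A simplified version of the filtering objective can be written down directly. For one matched pair, the Hamming penalty against the global label is zero when the two individual labels agree and equals a single penalty weight when they differ (the global label can always be set to one of the two). The sketch below alternates between label and centroid updates on that reduced objective. It is an illustration under assumptions, not the published procedure: the penalty weight `lam`, the farthest-point initialization and all function names are hypothetical, and the exact weighting is given in the Methods.

```python
import numpy as np


def _sq_dists(a, centers):
    """Squared Euclidean distance from every row of a to every center."""
    return ((a[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


def _initial_labels(a, k):
    """Farthest-point seeding on one dataset, then nearest-seed labels."""
    seeds = [0]
    d = ((a - a[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        seeds.append(int(d.argmax()))
        d = np.minimum(d, ((a - a[seeds[-1]]) ** 2).sum(axis=1))
    return _sq_dists(a, a[seeds]).argmin(axis=1)


def joint_regularized_filter(x, y, k, lam, n_iter=20):
    """x[i] and y[i] are a matched pair. Returns (keep_mask, labels_x, labels_y)."""
    n = x.shape[0]
    z1 = z2 = _initial_labels(x, k)  # start both label vectors from clusters of x
    c1 = np.zeros((k, x.shape[1]))
    c2 = np.zeros((k, y.shape[1]))
    for _ in range(n_iter):
        for j in range(k):  # centroid update; an empty cluster keeps its centroid
            if np.any(z1 == j):
                c1[j] = x[z1 == j].mean(axis=0)
            if np.any(z2 == j):
                c2[j] = y[z2 == j].mean(axis=0)
        d1, d2 = _sq_dists(x, c1), _sq_dists(y, c2)
        # Option A: one common label for the pair, no penalty.
        shared = (d1 + d2).argmin(axis=1)
        cost_shared = (d1 + d2)[np.arange(n), shared]
        # Option B: each cell takes its own best label, paying the penalty lam.
        cost_free = d1.min(axis=1) + d2.min(axis=1) + lam
        agree = cost_shared <= cost_free
        z1 = np.where(agree, shared, d1.argmin(axis=1))
        z2 = np.where(agree, shared, d2.argmin(axis=1))
    return z1 == z2, z1, z2
```

Pairs for which `keep_mask` is false are the ones whose two cells prefer different clusters strongly enough to pay the penalty, and these are the pairs that would be discarded.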
To verify the merit of MARIO in a ‘ground-truth’ setting, we tested the matching performance on simulated data with high-granularity cell types. We used Symsim20 to simulate single-cell epitome-like datasets21: data with 20 cell populations from two different modalities, with a total of 60 features generated. To mimic scenarios of different antibody panel setup, 20 out of 60 features were shared across two datasets, with a further 20 distinct features each. Indeed, MARIO showed improved matching capability in the simulated ground-truth case with limited protein features available and shared across datasets (MARIO 81.54%, second best Scanorama 63.3%) (Extended Data Fig. 1).
Matching and integration of multimodal single-cell protein datasets
We evaluated the performance of MARIO on two distinctive datasets generated using individual cells isolated from healthy human bone marrow. The first is a sequencing-based CITE-seq dataset consisting of 29,007 cells stained with an antibody panel of 29 markers17, and the second is a mass cytometry-based CyTOF dataset consisting of 102,977 cells stained with an antibody panel of 32 markers22. Twelve markers were common to both datasets. MARIO successfully matched and aligned these two datasets as shown by visual inspection (Fig. 2a). The intricate data structures were preserved post-MARIO integration, with clear separation of cells belonging to phenotypically distinctive populations in dimensionally reduced t-distributed stochastic neighbor embedding (t-SNE) plots (Fig. 2b). The original cell-type annotations based on the shared low-level annotation (Fig. 2b, top left), and on preexisting annotations from each dataset (Fig. 2b, top right and bottom left) were highly conserved after MARIO integration. Subsequent joint clustering of the post-MARIO integrated data using the canonical correlation scores also corroborated highly accurate cell-type delineation (Fig. 2b, bottom right).
We next designed three different scenarios to further characterize the integration performance of MARIO and to compare its performance against the single-cell integration methods Seurat17, fastMNN14 and Scanorama15. In the first case, shared protein markers were removed from each dataset individually (in an accumulative fashion and in alphabetical order) to simulate the distinctive antibody panel designs across datasets. MARIO consistently outperformed other methods in terms of matching accuracy, independently of the excluded protein targets (full 12-shared panel total accuracy MARIO, 96.01%; second best Scanorama, 91.46%; dropping eight shared antibodies MARIO, 91.45%; second best Scanorama, 71.22%) (Fig. 2c and Extended Data Fig. 2a).
We additionally evaluated the integration quality among these methods, using metrics, including Structure alignment score, Silhouette F1 score, adjusted Rand index (ARI) F1, Cluster mixing score, and lower dimensional embedding, based on each method’s post-integration latent space scores (Extended Data Fig. 2a,b and Supplementary Figs. 1 and 2). In addition, we removed shared protein markers as previously tested, but in the order of importance score (Methods), where less important markers were dropped first. This process mimics the natural logic of building antibody panels, and in such a dropping scheme, MARIO still consistently outperformed other methods in terms of matching accuracy (Supplementary Figs. 3 and 4).
In the second test, random noise was gradually spiked into the datasets to simulate the variability of intrinsic signal-noise in real world data. The matchability test implemented in MARIO was able to detect and alert the user when data quality was insufficient for confident matching (Fig. 2d). In contrast, the elevated noise resulted in an increase in the number of cells being forcefully paired in other tested methods (reaching close to 100%), albeit with low accuracy (ranging from 50 to 80% in accuracy).
In the third scenario, an entire group of cell types was removed from the destination dataset (that is, the set being matched to) to mimic fluctuations of cell-type composition between datasets. MARIO outperformed all other tested methods by successfully suppressing the incorrect matching of these missing cell types (Fig. 2e).
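The two quantities compared across these scenarios, the proportion of cells that receive a match and the accuracy among matched cells, can be computed as in the short sketch below. It assumes that a match counts as correct when the two paired cells carry the same cell-type annotation and that unmatched cells are marked with -1; the function name and this encoding are choices made for the example.

```python
import numpy as np


def matching_summary(labels_x, labels_y, match):
    """Return (proportion of x cells matched, accuracy among matched cells).

    match[i] is the index in y paired with cell i of x, or -1 if cell i
    was left unmatched (for example, after filtering).
    """
    labels_x, labels_y, match = map(np.asarray, (labels_x, labels_y, match))
    matched = match >= 0
    proportion_matched = float(matched.mean())
    if not matched.any():
        return proportion_matched, float("nan")
    correct = labels_x[matched] == labels_y[match[matched]]
    return proportion_matched, float(correct.mean())


# Hypothetical annotations: three of four cells are matched, two of them correctly.
prop, acc = matching_summary(["B", "T", "T", "NK"],
                             ["T", "B", "NK", "T"],
                             [1, 0, 2, -1])
print(prop, acc)
```

Reporting both numbers together matters because a method can reach a matched proportion close to 100% simply by forcing pairs, which is the behavior contrasted with MARIO in the noise spike-in scenario.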
Because the matching accuracy for CyTOF to CITE-seq cell pairs among all the main cell types with MARIO was consistently high (Supplementary Fig. 5a), we could confidently infer the transcriptome of single cells measured using CyTOF from their CITE-seq counterparts. We confirmed that the expression patterns of cell type-specific markers were in good agreement between CyTOF proteins, CITE-seq proteins and CITE-seq RNA transcripts (Fig. 2f,g and Supplementary Fig. 5b,c). Moreover, the expression patterns of CD45RO protein and S100A4 and CCR7 RNAs from CITE-seq assisted the delineation of memory and naive CD4 T cell subtypes in the integrated dataset, which was individually unavailable for manual annotation in the CyTOF dataset alone. Therefore, this integrated analysis better defines cell states than do these modalities individually.
We subsequently evaluated the performance of MARIO on two healthy human peripheral blood mononuclear cell (PBMC) datasets measured using CITE-seq and CyTOF. Fifteen proteins were common across these two datasets. MARIO successfully integrated the two datasets (Extended Data Fig. 3a) with high accuracy (Extended Data Fig. 3b). Our results reveal that the expression of key genes on both protein (CyTOF and CITE-seq) and RNA (CITE-seq) levels is in high agreement with their corresponding phenotypic cell-of-origin assignments (Extended Data Fig. 3c). Further benchmarking using the three cases described above showed similar superior matching accuracy for MARIO regardless of antibody panel setup (Extended Data Fig. 4a; for the full 15-antibody shared panel, total accuracy was 90.62% for MARIO and 87.55% for the second best method, Seurat; after dropping eight shared antibodies, total accuracy was 86.34% for MARIO and 81.03% for the second best method, Scanorama). In the evaluation of suppression of over-integration due to poor quality data, mNN methods forced matching of almost all cells with accuracy below 70%, whereas MARIO alerted the user of poor data quality (Extended Data Fig. 4b). Finally, integration with MARIO, but not with mNN methods, was robust even with extensive cell-type composition changes (Extended Data Fig. 4c,d and Supplementary Fig. 6).
Cross-species analysis reveals species and stimuli-specific responses
We performed MARIO matching of four CyTOF datasets from studies in which (1) human whole blood cells were isolated from individuals challenged with H1N1 virus23, (2) human whole blood cells were stimulated with IFNγ24, (3) rhesus macaque whole blood cells were stimulated with IFNγ and (4) cynomolgus monkey whole blood cells were stimulated with IFNγ (Fig. 3a). Dataset 1 was generated using 42 markers, and datasets 2–4 were generated using 39 markers. We observed a high degree of concordance between cell types when visualizing the human–human and human-NHP datasets via t-SNE using MARIO integrated canonical scores (Fig. 3a). In contrast, datasets without MARIO integration process showed an unhomogenized pattern in the t-SNE visualization, indicating the necessity of performing MARIO integration for robust cross-comparisons across these four datasets (Fig. 3b). MARIO cell-type assignment was accurate among different cell types (Supplementary Fig. 7a). There were minimal differences, as measured using Euclidean distance, between paired cells calculated by canonical scores (Supplementary Fig. 7b).
Successful application of MARIO for robust matching and integration across three species and two stimulation conditions granted the opportunity to visually observe subtle changes in expression patterns across different cell types and datasets (Fig. 3c and Supplementary Fig. 7c). We observed an increase in proliferation of CD4 T cells in human blood cells after both influenza viral challenge and IFNγ stimulation, as marked by the upregulation of Ki-67, but no increase in proliferation was detected after stimulation of NHP blood cells. We also observed the upregulation of pSTAT3 in the natural killer cell population within human and NHP samples treated with IFNγ compared to human participants challenged with influenza, although overall pSTAT3 expression was higher in the influenza group. These results are consistent with previous observations25–27. Finally, there was an increased p38 expression in all cell types across all samples, reflective of the conserved functionality of p38 during cell inflammatory and stress responses28,29. In contrast, such effects were hard to identify visually in the t-SNE plots from preintegration data (Supplementary Fig. 7d).
Our benchmarking results showed superior matching accuracy using MARIO regardless of antibody panel setup. When using 39 shared antibodies, the total accuracy was 93.26% for MARIO and 86.20% for the second best method (Seurat); when eight shared antibodies were dropped, the total accuracy for IFNγ treatment was 86.79% for MARIO and 82.23% for the second best method (Scanorama) (Extended Data Fig. 5). The mNN methods forced matching of almost 100% of the cells with an accuracy less than 70% with increased spike-in noise, whereas MARIO alerted the user of insufficient information for matching (Supplementary Fig. 8a). MARIO, unlike the mNN methods we tested, was robust in resisting cell-type composition changes (Supplementary Figs. 8b, 9 and 10). Additionally, we removed shared protein markers as previously tested in the order of their importance score, with MARIO consistently outperforming other methods in matching accuracy (Supplementary Figs. 11 and 12).
We similarly applied this strategy to data from IL-4-stimulated human and NHP whole blood cells, and compared them to human influenza viral challenge blood cells (Supplementary Fig. 13a,b). On IL-4 stimulation, we saw an upregulation of Ki-67 in human CD4 T cells but not NHP cells, much akin to IFNγ stimulation, and high expression of pSTAT3 in the natural killer cells of IL-4-stimulated blood, but not in PBMCs from humans challenged with influenza (Supplementary Fig. 13c). In line with IFNγ stimulation, the p38 response was consistent across species and treatments. Our results consistently showed that, regardless of antibody panel setup, MARIO had superior matching accuracy (Supplementary Fig. 14), prevented over-integration (Supplementary Fig. 15a), was robust to cell-type composition changes (Supplementary Fig. 15b) and generated accurate lower dimensional embedding (Supplementary Figs. 15c and 16).
Accurate tissue architectural reconstruction via matching
Matching cells from sequencing modalities on to multiplex proteomics imaging data has been previously attempted using existing integration algorithms (for example, STvEA using Seurat v.2)30. We reasoned that a highly accurate cell matching and integration from MARIO could infer the spatial localization of transcripts within individual cells. We performed MARIO on spatially resolved data from murine splenic cells collected using antibody-based CODEX imaging (28 protein markers)13 and data from dissociated murine splenic cells assayed using CITE-seq (206 protein markers)31; 28 protein markers (all the markers in the CODEX dataset) were shared.
We first visually verified successful MARIO matching and integration using dimensionally reduced t-SNE plots (Fig. 4a). Cell–cell matching was accurate across different cell types (Extended Data Fig. 6a). This enabled accurate single-cell information transfer between cells measured using CITE-seq and CODEX spatially resolved cells. We visually observed highly concordant spatial organization of cell types annotated using CODEX or CITE-seq information and further observed a clear distribution pattern of single-cell transcript expression levels (based on matched individual CITE-seq cells for CODEX cells) corresponding to their expected spatial localization in the spleen (Fig. 4b and Extended Data Fig. 6b). For example, Il7r is concentrated in the T cell zone as expected32; Myc and Cxcr5 are localized to activated and proliferating T and B cells within the germinal center33,34; Ms4a1 and Bhlhe41 are highly expressed in the B cell zone and B cells in the red pulp region35–38 and Il1b is expressed outside the B cell zone39. t-SNE (visualized using CODEX proteins alone) overlays of the matched protein and RNA expression confirmed expected RNA expression profiles within given cell types (Extended Data Fig. 6c).
We next sought to further refine cells from the B lymphocyte lineage by gating the B cell population from the CODEX dataset. Four subpopulations of B cells were identified: transitional type 1 B cells, marginal zone B cells, mature B cells and follicular/germinal center B cells. Visual inspection of the spatial location of these four subtypes of B cell confirmed localization within mouse spleens consistent with previous observations (Extended Data Fig. 6d)40,41. MARIO matching thus enabled a detailed examination of the differentially expressed transcripts within these B cell subtypes resolved by CODEX, revealing a distinctive transcriptional program reflective of their phenotype (Fig. 4c)42–44. These genes were significantly upregulated (P adjusted < 0.05, Wilcoxon Test, two-sided) in the corresponding gated populations of CODEX B cells. In addition, we confirmed that the B cell subtypes as originally annotated by transcriptomic information from Gayoso et al. localized to the corresponding spatial niches after MARIO matching (Extended Data Fig. 6e).
For this CODEX to CITE-seq matching, MARIO had matching accuracy superior to mNN methods (Extended Data Fig. 7a). For the full 28-antibody shared panel, the total accuracy was 87.76% for MARIO and 87.40% for the second best method (fastMNN). Dropping eight shared antibodies resulted in total accuracies of 85.31% for MARIO and 82.01% for the second best method (fastMNN). MARIO prevented over-integration due to poor quality data, whereas the mNN methods forced matching (Extended Data Fig. 7b). MARIO was also robust in resisting changes to cell-type composition (Extended Data Fig. 7c,d and Supplementary Fig. 17). Additionally, we removed shared protein markers as previously tested, but in the order of importance score, and MARIO still consistently outperformed other methods in matching accuracy (Supplementary Fig. 18).
Generation of a multi-omic COVID-19 lung molecular atlas
We reasoned that the ability to perform integrative and inferential analysis across biologically analogous clinical cohorts, measured at different institutions with varying technologies, would further our understanding of the facets of COVID-19 biology. One such cohort comes from a study in which BALF samples were subjected to CITE-seq45. We additionally profiled 76 lung tissue regions from 23 individuals who succumbed to COVID-19 using CODEX imaging with 54 markers (Supplementary Table 1) and observed the abundance of macrophages in both CITE-seq (15.8% of total cells) and CODEX (31.3% of total cells) cohorts. The large overlap in antibody panels of both studies allowed the robust matching and subsequent functional interrogation of macrophages with high granularity (Fig. 5a).
We were able to stratify the macrophages into two populations based on their transcriptional signatures of complement pathway activity (Fig. 5b; C1Q low and high). Such stratification is challenging without using MARIO matching and solely relying on macrophage-related protein markers, including canonical M1 and M2 markers (Supplementary Fig. 19). However, protein expression of these two classes of macrophages partially corresponded to an M1 phenotype for C1Q low macrophages, and an immunosuppressive M2 phenotype for C1Q high macrophages (Fig. 5b). We further observed that the C1Q high transcriptional program was enriched in antigen processing and presentation, whereas that of the C1Q low population consisted of several immune chemotaxis and migration pathways including that of neutrophil chemoattractants (Extended Data Fig. 8a). The top differentially expressed transcripts included CXCL8, CCL7, and TMEM176B, with previously described roles in regulating neutrophil recruitment and migration46–48. The roles of proteins encoded by IL1B, S100A8, and CCL2 in the recruitment of aberrant neutrophils have been recently elucidated in NHP and mice models of SARS-CoV-2 lung pathology49, and are also reflected by elevated transcript levels in C1Q low macrophages (Extended Data Fig. 8b).
In the five previously established functional clusters of interferon stimulated genes (ISG)50,51, we observed distinctive ISG transcriptional programs in C1Q low and high macrophages (Extended Data Fig. 8c; P adjusted < 0.05, Wilcoxon Test, two-sided) across all clusters (Extended Data Fig. 8c, ISG clusters 1–5). Our results indicate the activation of the innate immunological pathway, including several previously characterized genes (SERPINB9, CKAP4, CCL2 and SPHK1)52–55, in C1Q low macrophages to inhibit SARS-CoV-2 replication and entry. The failure to subsequently regulate and dampen this innate response resulted in unchecked host immune responses and collateral tissue damage for C1Q low macrophages, while C1Q high macrophages have elevated complement cascade activation (for example, LGALS3BP56) and express genes correlating with mild rather than severe COVID-19 symptoms (for example, SIGLEC1, ref. 57).
In line with the transcriptional signatures for aberrant neutrophil infiltration, we noted a correlation between the presence of C1Q low macrophages and increased infiltrating neutrophils (Fig. 5c–e; ρ = −0.453, P < 0.01). This elevated neutrophil presence was also confirmed visually (Fig. 5f,g and Extended Data Fig. 9a). Spatial cell–cell interaction analysis showed differences in these two subclasses of macrophages and their proximity to other cell types, such as high frequency of C1Q high macrophages proximal to CD4 and CD8 T cells, B cells, myeloid cells and other macrophages (Extended Data Fig. 9b). We next centered C1Q high and low macrophages for an anchor analysis58 to understand the microenvironment as a function of distance around these two groups of macrophages. Our analysis confirmed the distinctive microenvironments around these macrophages, as evident from the differential organization of macrophages, plasma cells, vasculature and CD8 T cells (Extended Data Fig. 9c).
We finally performed protein and nucleic acid in situ imaging (PANINI)58 to visualize the messenger RNA of a complement marker, C1QA, the neutrophil marker CD15 and the macrophage marker CD68 on COVID-19 tissue microarray sections to experimentally validate the spatially resolved gene-expression patterns predicted by MARIO (Fig. 5h). We confirmed the robust expression patterns of C1QA mRNA, CD68 and CD15 proteins in the tissue sections (Extended Data Fig. 9d). We observed a significant correlation between the percentages of experimentally validated and MARIO-predicted C1Q high macrophages, both at the patient level (P = 0.019, ρ = 0.574) and at the per tissue core level (P < 0.01, ρ = 0.521, Spearman-ranked test, Fig. 5i,j). In line with anchor analysis from MARIO-inferred data, we confirmed a significantly decreased neutrophil density around C1Q high macrophages in the PANINI validation experiment (Fig. 5k). The RNA spatial pattern from our PANINI experiment, performed on a separate, nonadjacent section of the same patient tissue core, recapitulated the prediction from the MARIO-matched data (Fig. 5l,m). The spatial correlation between MARIO-predicted and PANINI-validated expression levels of C1QA in macrophages was highly consistent (C1QA signal per region P < 0.01, ρ = 0.597, Spearman-ranked test, Fig. 5n). This ρ value was close to the maximum possible spatial correlation of the tissue structure as determined using cell density per region (P < 0.01, ρ = 0.602, Extended Data Fig. 9e), validating the highly accurate inferential capabilities of MARIO.
Parameter choices, computational resource usage and algorithmic alternatives
MARIO is generally highly robust with respect to parameter choices (Supplementary Figs. 20 and 21). Given the globally optimal nature of the core matching algorithm implemented in MARIO, the time required to run the MARIO pipeline scales cubically with the number of cells. To circumvent this, in the actual implementation of the pipeline, matching is automatically performed in batches, so that time and memory usage are linear rather than cubic in the dataset size. We also further developed a sparsification technique that reduces the search space to accelerate the matching process. Empirically, we found that MARIO can be run on datasets with moderate sample sizes within reasonable time and memory usage (Supplementary Fig. 22). We also observed that the distance matrix constructed in MARIO (using Pearson correlation) is computationally efficient and generally produces better matching outcomes compared to more complicated distance matrices (Extended Data Fig. 10). We also tested alternative algorithms, such as optimal transport (SpaOTsc59), as another potential approach for matching of cells beyond the scope of this work (Supplementary Fig. 23).
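The effect of batching on cost follows from simple arithmetic: with a fixed batch size b, a dataset of n cells is split into about n/b batches, each costing on the order of b³ operations, for a total on the order of n·b², which is linear in n. A minimal sketch of batched matching is shown below. How the package actually forms its batches is not described here, so the random, equally sized partition of both datasets, the correlation distance on raw features and the function name are assumptions for illustration; the sparsification step is not shown.

```python
import math

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def batched_matching(x, y, batch_size, seed=0):
    """Match rows of x to distinct rows of y, one batch at a time (needs n_x <= n_y)."""
    rng = np.random.default_rng(seed)
    n_batches = math.ceil(x.shape[0] / batch_size)
    # Randomly partition both datasets into the same number of batches.
    x_batches = np.array_split(rng.permutation(x.shape[0]), n_batches)
    y_batches = np.array_split(rng.permutation(y.shape[0]), n_batches)
    match = np.full(x.shape[0], -1)
    for xi, yi in zip(x_batches, y_batches):
        # Each assignment problem only involves one batch from each dataset.
        dist = cdist(x[xi], y[yi], metric="correlation")
        rows, cols = linear_sum_assignment(dist)
        match[xi[rows]] = yi[cols]
    return match
```

The tradeoff in this simple scheme is that a cell can only be paired with cells that fall in its partner batch, so the result is optimal within each batch rather than across the whole dataset.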
Discussion
MARIO is a powerful matching and integration framework for single cells that allows the retention of distinct features. It is particularly suitable for the integration of single-cell proteomic datasets with limited antibody panel overlap. The analysis pipeline builds on several rigorous mathematical advances. First, the matching is constructed by globally (rather than locally) optimizing over a new distance matrix that incorporates both the explicit correlations in shared features and the hidden correlations among distinct features. Second, the accuracy and robustness of the matching are ensured by two theoretically principled quality control processes: the matchability test and jointly regularized filtering60. Third, the integrated embeddings are obtained via CCA or gCCA, which incorporates the information in both the shared and distinct features.
Despite the clear advantages of MARIO, it has some technical limitations. First, its accuracy and robustness may come at the cost of longer analysis times compared with mNN-based approaches. Second, matching across datasets presupposes that the datasets are very similar; if certain cell types or cell states are missing in one modality, the matching and integration performance can be affected. Third, although MARIO showed better tolerance of antibody panel differences between the datasets being matched than mNN-based methods in all the benchmarking scenarios tested in this paper, the matching accuracy will still eventually drop below a biologically relevant level when too little information is shared across datasets. The exact minimal requirement for matching will therefore depend on the datasets themselves, the marker panels and the biological goal the user wants to achieve. Fourth, while the distance matrix constructed in MARIO defaults to using Pearson correlation, we supply the option to use nonlinear kernels (Laplacian) instead to construct the distance matrix, to better accommodate the specific requirements of future users. Last, linear assignment (MARIO), mNN (for example, Seurat, Scanorama, fastMNN and more) and various other recent methods (for example, SpaOTsc) are all capable of matching cells across modalities. Future iterations of these approaches will be of broad interest to the field.
The need to study biological processes within their tissue context is increasingly evident, with direct relevance to the physiological context of health and disease. The ability to match similar biological samples measured using distinctive single-cell assays will be paramount for hypothesis generation and guidance for experimental design. We are confident that MARIO will serve as a useful methodology and resource for the community with direct applications to a plethora of experimental platforms and biological contexts.
Methods
Complete methods, including details of the data analysis process and extensions of the method summarized below, are available in the Supplementary Notes.
MARIO pipeline
Before being input to MARIO, data should go through the standard preprocessing pipeline (for example, normalization and scaling) recommended for their modality of origin. Suppose the two datasets are denoted as X and Y, where $X \in \mathbb{R}^{n_x \times p_x}$ consists of $n_x$ cells and $p_x$ features and $Y \in \mathbb{R}^{n_y \times p_y}$ consists of $n_y$ cells and $p_y$ features. The matching implemented by MARIO is a linear assignment problem and therefore requires $n_x \le n_y$. If the input data sizes do not fulfill this requirement, X can be randomly segmented into equal-sized batches, and matching is then performed on each batch, as per the user’s request. Among all the features, $p_{\mathrm{shared}}$ features are shared across both datasets, whereas the rest of the features are distinct to either X or Y. Thus, we can write both datasets as horizontal concatenations of a shared part and a distinct part:
$$X = \left(X_{\mathrm{shared}},\; X_{\mathrm{distinct}}\right), \qquad Y = \left(Y_{\mathrm{shared}},\; Y_{\mathrm{distinct}}\right).$$
The cell matching between X and Y is defined as an injective map Π, represented as a binary matrix of dimension $n_x \times n_y$, such that $\Pi_{ij} = 1$ if and only if the ith cell in X shares a similar biological state with the jth cell in Y.
Initial matching with shared features
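We first construct an initial estimator of Π using shared features alone. The procedure starts by denoising the shared parts via thresholding their singular values. Consider the singular value decomposition of the vertical concatenation of the shared parts $X_{\mathrm{shared}}$ and $Y_{\mathrm{shared}}$:
$$\begin{pmatrix} X_{\mathrm{shared}} \\ Y_{\mathrm{shared}} \end{pmatrix} = \begin{pmatrix} U_x \\ U_y \end{pmatrix} \Sigma V^\top,$$
where the vertical concatenation of $U_x$ and $U_y$ collects the left singular vectors, $\Sigma$ is a diagonal matrix that collects the singular values in descending order, and $V$ collects the right singular vectors. Let $r_{\mathrm{shared}}$ be the number of components to keep. We then compute the denoised versions of $X_{\mathrm{shared}}$ and $Y_{\mathrm{shared}}$ by
$$\tilde X_{\mathrm{shared}} = (U_x)_{\cdot,1:r_{\mathrm{shared}}}\, \Sigma_{1:r_{\mathrm{shared}}}\, (V_{\cdot,1:r_{\mathrm{shared}}})^\top, \qquad \tilde Y_{\mathrm{shared}} = (U_y)_{\cdot,1:r_{\mathrm{shared}}}\, \Sigma_{1:r_{\mathrm{shared}}}\, (V_{\cdot,1:r_{\mathrm{shared}}})^\top,$$
respectively, where for a matrix A, we let $A_{\cdot,1:r}$ denote its first r columns and for a diagonal matrix $\Sigma$, we let $\Sigma_{1:r}$ denote the submatrix formed by taking its first r rows and columns. We then construct a cross-data distance matrix $D_{\mathrm{shared}} \in \mathbb{R}^{n_x \times n_y}$, whose entries are given by
$$(D_{\mathrm{shared}})_{ij} = 1 - \rho_{ij},$$
where $\rho_{ij}$ is the Pearson correlation coefficient between the ith row of $\tilde X_{\mathrm{shared}}$ and the jth row of $\tilde Y_{\mathrm{shared}}$. The initial estimator of Π is given by:
$$\hat\Pi_{\mathrm{init}} \in \operatorname*{arg\,min}_{\Pi}\; \langle \Pi, D_{\mathrm{shared}} \rangle, \tag{1}$$
where the minimization is over valid matchings Π and, for two matrices A and B, we let $\langle A, B \rangle$ denote the Frobenius inner product. This optimization problem is an instance of minimal weight bipartite matching (also known as the rectangular linear assignment problem) in the literature61. We refer readers to ref. 62 for the optimality of this procedure.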
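The initial matching step can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions and not the code of the MARIO package: the function names (denoise_shared, correlation_distance, initial_matching) are hypothetical, the distance is taken to be one minus the Pearson correlation, and the assignment is solved with SciPy's linear_sum_assignment, which accepts rectangular cost matrices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def denoise_shared(x_shared, y_shared, r):
    """Keep the top-r singular components of the vertically stacked shared features."""
    n_x = x_shared.shape[0]
    stacked = np.vstack([x_shared, y_shared])
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    low_rank = (u[:, :r] * s[:r]) @ vt[:r, :]
    return low_rank[:n_x], low_rank[n_x:]


def correlation_distance(a, b):
    """Entry (i, j) is 1 - Pearson correlation between row i of a and row j of b."""
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    a_c = a_c / np.linalg.norm(a_c, axis=1, keepdims=True)
    b_c = b_c / np.linalg.norm(b_c, axis=1, keepdims=True)
    return 1.0 - a_c @ b_c.T


def initial_matching(x_shared, y_shared, r):
    """Return, for each row of x_shared, the index of its matched row in y_shared."""
    x_dn, y_dn = denoise_shared(x_shared, y_shared, r)
    dist = correlation_distance(x_dn, y_dn)
    # Rectangular linear assignment: each X row gets a distinct Y row (needs n_x <= n_y).
    _, cols = linear_sum_assignment(dist)
    return cols, dist
```

The returned index vector plays the role of the binary matching matrix: entry i holds the column of Y matched to the ith cell of X.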
Refined matching with distinct features
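Given the initial matching $\hat\Pi_{\mathrm{init}}$, we can approximately align cells in X and Y: the rows of X and $\hat\Pi_{\mathrm{init}} Y$ correspond to pairs of cells with similar biological states, up to mismatches induced by the estimation error of $\hat\Pi_{\mathrm{init}}$. Such an approximate alignment opens up the possibility of estimating the latent representations of X and Y by CCA.
Let $r_{\mathrm{all}}$ be the number of components to keep. Collecting the top $r_{\mathrm{all}}$ sample canonical vectors of the pair $(X, \hat\Pi_{\mathrm{init}} Y)$ into matrices
$$\hat W_x \quad \text{and} \quad \hat W_y,$$
the latent representation of X can be estimated by the sample canonical scores $X \hat W_x$ of X. The same projection can be done on the Y data by computing $Y \hat W_y$.
We can now compute the cross-data distance matrix $D_{\mathrm{all}}$ directly on the latent space, whose entries are
$$(D_{\mathrm{all}})_{ij} = 1 - \rho\!\left( (X \hat W_x)_{i\cdot},\; (Y \hat W_y)_{j\cdot} \right),$$
where $\rho(\cdot,\cdot)$ denotes the Pearson correlation coefficient between two vectors.
We finally solve for a refined matching by
$$\hat\Pi_{\mathrm{refined}} \in \operatorname*{arg\,min}_{\Pi}\; \langle \Pi, D_{\mathrm{all}} \rangle. \tag{2}$$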
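As an illustration of the refinement step, the sketch below computes sample canonical loadings from the approximately aligned pair, projects all cells of both datasets, and re-solves the assignment on the latent space. It is a sketch under assumptions rather than the MARIO implementation: the function names are hypothetical, CCA is computed with a plain QR and SVD routine that assumes more cells than features and full column rank, and the latent distance is taken to be one minus the row-wise Pearson correlation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def cca_loadings(x_aligned, y_aligned, r):
    """Top-r sample canonical loading vectors of two row-aligned matrices (QR + SVD)."""
    x_c = x_aligned - x_aligned.mean(axis=0)
    y_c = y_aligned - y_aligned.mean(axis=0)
    q_x, r_x = np.linalg.qr(x_c)
    q_y, r_y = np.linalg.qr(y_c)
    u, _, vt = np.linalg.svd(q_x.T @ q_y)
    # Map the rotations back to loadings on the original (centred) features.
    w_x = np.linalg.solve(r_x, u[:, :r])
    w_y = np.linalg.solve(r_y, vt.T[:, :r])
    return w_x, w_y


def correlation_distance(a, b):
    """Entry (i, j) is 1 - Pearson correlation between row i of a and row j of b."""
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    a_c = a_c / np.linalg.norm(a_c, axis=1, keepdims=True)
    b_c = b_c / np.linalg.norm(b_c, axis=1, keepdims=True)
    return 1.0 - a_c @ b_c.T


def refined_matching(x, y, init_cols, r):
    """Refine a matching using all features; init_cols[i] is the Y row matched to X row i."""
    y_aligned = y[init_cols]
    w_x, w_y = cca_loadings(x, y_aligned, r)
    x_scores = (x - x.mean(axis=0)) @ w_x
    # Project every Y cell (matched or not) with the same centring used to fit CCA.
    y_scores = (y - y_aligned.mean(axis=0)) @ w_y
    dist = correlation_distance(x_scores, y_scores)
    _, cols = linear_sum_assignment(dist)
    return cols, dist
```

Because the loadings are fitted on the aligned pair but applied to all of Y, cells of Y that were unmatched initially can still be selected by the refined matching.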
Interpolation of initial and refined matchings
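The quality of the refined matching $\hat\Pi_{\mathrm{refined}}$ is highly contingent on the quality of the distinct features. If the distinct features are extremely noisy, incorporating them may hurt the performance, in which case it is more desirable to revert to the initial matching $\hat\Pi_{\mathrm{init}}$. We developed a data-adaptive way of deciding how much distinct information should be incorporated when we estimate the matching from the data.
To start with, we discretize the unit interval [0, 1] into a grid (for example, {0, 0.1, …, 0.9, 1}). For each λ on the grid, we interpolate the two kinds of distance matrix, the one built from the shared features ($D_{\mathrm{shared}}$) and the one built on the CCA latent space ($D_{\mathrm{all}}$), by taking their convex combination
$$D_\lambda = (1 - \lambda)\, D_{\mathrm{shared}} + \lambda\, D_{\mathrm{all}},$$
from which we can solve for the λ-interpolated matching
$$\hat\Pi_\lambda \in \operatorname*{arg\,min}_{\Pi}\; \langle \Pi, D_\lambda \rangle. \tag{3}$$
Note that $\hat\Pi_0 = \hat\Pi_{\mathrm{init}}$ and $\hat\Pi_1 = \hat\Pi_{\mathrm{refined}}$. After aligning X and Y using $\hat\Pi_\lambda$, we compute the top k sample canonical correlations (k defaults to 10 in the MARIO package), whose mean is taken as a proxy of the quality of $\hat\Pi_\lambda$. We then select the best $\hat\lambda$ according to this quality measure and use $\hat\Pi_{\hat\lambda}$ afterward.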
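The grid search over λ can be sketched as follows. This is an illustrative sketch under assumptions, not the MARIO package code: interpolate_matching and mean_canonical_correlation are hypothetical names, the two distance matrices are taken as given inputs, and canonical correlations are computed by a simple QR and SVD routine that assumes more cells than features.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def mean_canonical_correlation(x_aligned, y_aligned, top_k=10):
    """Mean of the top_k sample canonical correlations of two row-aligned matrices."""
    x_c = x_aligned - x_aligned.mean(axis=0)
    y_c = y_aligned - y_aligned.mean(axis=0)
    q_x, _ = np.linalg.qr(x_c)
    q_y, _ = np.linalg.qr(y_c)
    s = np.linalg.svd(q_x.T @ q_y, compute_uv=False)
    return float(np.mean(s[:top_k]))


def interpolate_matching(x, y, dist_init, dist_refined, grid=None, top_k=10):
    """Pick the interpolation weight whose matching gives the best-correlated alignment."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 11)
    best_lam, best_cols, best_quality = None, None, -np.inf
    for lam in grid:
        # Convex combination of the shared-feature and latent-space distances.
        dist = (1.0 - lam) * dist_init + lam * dist_refined
        _, cols = linear_sum_assignment(dist)
        quality = mean_canonical_correlation(x, y[cols], top_k)
        if quality > best_quality:
            best_lam, best_cols, best_quality = float(lam), cols, quality
    return best_lam, best_cols, best_quality
```

With the strict comparison used here, ties in the quality measure are resolved in favor of the smallest λ, that is, toward the matching that relies more on the shared features; this tie-breaking rule is a choice of the sketch.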
Quality control
Test of matchability
In extreme cases, the two datasets X and Y may not have any correlation at all, and thus any attempt to integrate both datasets would give unreliable results. For example, some methods, when applied to uncorrelated datasets, would pick up spurious correlations, resulting in over-integration. A robust procedure should be able to warn users when the resulting matching estimator might be of low quality. We developed a rigorous hypothesis test, termed the matchability test, for this purpose.
The matchability test starts by repeatedly drawing B independent and identically distributed copies of $n_x$-dimensional (potentially asymmetric) Rademacher random vectors $\varepsilon_x^{(1)}, \dots, \varepsilon_x^{(B)}$ and another B independent and identically distributed copies of $n_y$-dimensional Rademacher random vectors $\varepsilon_y^{(1)}, \dots, \varepsilon_y^{(B)}$. That is, for each 1 ≤ b ≤ B, we have $\varepsilon_*^{(b)} \in \{\pm 1\}^{n_*}$, and $\varepsilon_{*,i}^{(b)}$ is +1 with probability 1 − pflip and is −1 otherwise for any 1 ≤ i ≤ n*, where * is the placeholder for either x or y. The parameter pflip (denoted as flip_prob in the MARIO package and defaulted to 0.2) controls the ‘sensitivity’ of the test: a lower value of pflip means that a more accurate matching is needed to pass the matchability test. For every b, we generate a fake pair of datasets by flipping the signs of each row of X and Y:
$$X^{(b)} = \mathrm{diag}\!\left(\varepsilon_x^{(b)}\right) X, \qquad Y^{(b)} = \mathrm{diag}\!\left(\varepsilon_y^{(b)}\right) Y.$$
After such a sign-flipping procedure, most of the correlation between X and Y (if it exists) is destroyed, but the intra-dataset covariance structures of both X and Y are preserved. As a result, if we run any matching algorithm with $X^{(b)}$ and $Y^{(b)}$ as the input, the resulting estimator $\hat\Pi^{(b)}$ would be of low quality, in the sense that if we align $X^{(b)}$, $Y^{(b)}$ using $\hat\Pi^{(b)}$ and run CCA, the resulting sample canonical correlations will be small. In our implementation, we calculate the mean of the top_k sample canonical correlations (top_k defaults to 10), which we denote as $c^{(b)}$.
The matchability test proceeds by running the same algorithm on the real datasets X, Y, aligning them using the estimator $\hat\Pi$, and calculating the mean of the top_k sample canonical correlations, which we denote as $c^{\mathrm{obs}}$. The final P value for testing the null that X and Y are uncorrelated is given by the proportion of $c^{(1)}, \dots, c^{(B)}$ that are larger than the observed $c^{\mathrm{obs}}$.
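The sign-flipping procedure can be sketched in a few lines of Python. The sketch below is illustrative only and makes several assumptions: match_and_score is a hypothetical stand-in for the full matching pipeline (here it simply matches on row-wise Pearson correlation distance, which requires X and Y to have the same features), and the number of draws and all function names are choices of the sketch rather than of the MARIO package.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def mean_canonical_correlation(x_aligned, y_aligned, top_k=10):
    """Mean of the top_k sample canonical correlations of two row-aligned matrices."""
    x_c = x_aligned - x_aligned.mean(axis=0)
    y_c = y_aligned - y_aligned.mean(axis=0)
    q_x, _ = np.linalg.qr(x_c)
    q_y, _ = np.linalg.qr(y_c)
    s = np.linalg.svd(q_x.T @ q_y, compute_uv=False)
    return float(np.mean(s[:top_k]))


def match_and_score(x, y, top_k=10):
    """Stand-in pipeline: match on row-wise correlation distance, then score the alignment."""
    x_c = x - x.mean(axis=1, keepdims=True)
    y_c = y - y.mean(axis=1, keepdims=True)
    x_c = x_c / np.linalg.norm(x_c, axis=1, keepdims=True)
    y_c = y_c / np.linalg.norm(y_c, axis=1, keepdims=True)
    _, cols = linear_sum_assignment(1.0 - x_c @ y_c.T)
    return mean_canonical_correlation(x, y[cols], top_k)


def flip_signs(data, flip_prob, rng):
    """Multiply each row by -1 with probability flip_prob and by +1 otherwise."""
    signs = np.where(rng.random(data.shape[0]) < flip_prob, -1.0, 1.0)
    return data * signs[:, None]


def matchability_p_value(x, y, n_draws=20, flip_prob=0.2, seed=0):
    """Proportion of sign-flipped scores that exceed the score on the real data."""
    rng = np.random.default_rng(seed)
    observed = match_and_score(x, y)
    null = np.array([
        match_and_score(flip_signs(x, flip_prob, rng), flip_signs(y, flip_prob, rng))
        for _ in range(n_draws)
    ])
    return float(np.mean(null > observed)), observed, null
```

A small P value indicates that the alignment obtained on the real data is better correlated than alignments obtained after the cross-dataset correlation has been largely destroyed.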
Jointly regularized filtering of low-quality matched pairs
Even if the two datasets X and Y are highly correlated (and thus the matchability test gives a small P value), the estimated matching might still be error-prone. For example, consider the case where some cell type exists in X but is completely absent in Y. We developed an algorithm that automatically filters out the low-quality matched pairs in the estimated matching $\hat\Pi$.
Assume there are K cell types present in either X or Y. In the MARIO package we default K to 10. Let $z_x, z_y \in \{1, \dots, K\}^{n_x}$ be the unknown ground-truth cell-type labels of X and $\hat\Pi Y$, respectively. The fact that X and Y have passed the matchability test tells us that zx and zy should agree on most coordinates. However, it is possible that there exists a sparse subset of {1, …, nx} on which zx and zy disagree, and our goal is to detect this sparse subset and disregard it in downstream analyses. To achieve this goal, we consider the following regularized k-means clustering objective:
$$\min_{z^\star,\, z_x,\, z_y,\, \{\mu_k\},\, \{\nu_k\}} \; \sum_{i=1}^{n_x} \left[ \left\| X_{i\cdot} - \mu_{z_{x,i}} \right\|_2^2 + \left\| (\hat\Pi Y)_{i\cdot} - \nu_{z_{y,i}} \right\|_2^2 \right] + w_\rho \sum_{i=1}^{n_x} \left[ \mathbb{1}\{ z_{x,i} \neq z^\star_i \} + \mathbb{1}\{ z_{y,i} \neq z^\star_i \} \right], \qquad w_\rho = \log\frac{(1-\rho)(K-1)}{\rho},$$
where ∥ ⋅ ∥2 is the ℓ2 norm and $\mathbb{1}\{\cdot\}$ is the indicator function. The above objective function is composed of two parts. The first part is the classical k-means objective for X and Y, and the second part is a regularization term that penalizes when the estimated X label and Y label are too far away from a global label $z^\star$.
After solving the above objective function, if $\hat z_{x,i} \neq \hat z_{y,i}$, then there is evidence that the ith matched pair is spurious, and it is thus disregarded in the downstream analyses. The parameter ρ controls the strength of regularization: if ρ = 1 − 1/K, then there is no regularization at all, whereas if ρ = 0, we effectively require $z_x = z_y = z^\star$. Thus, we can naturally control the ‘intensity’ of such a filtering procedure by choosing a suitable ρ. Under a hierarchical Bayesian model, the parameter ρ has a rather intuitive interpretation as the probability of disagreement between individual labels and global labels60.
We solve the regularized k-means clustering objective via a warm-started block coordinate descent algorithm. The algorithm starts by computing initial estimators of zx, zy via spectral clustering63: we compute the sample canonical scores of X and $\hat\Pi Y$, average them, and apply the classical k-means clustering on the top K eigenvectors of the averaged score to get $z^{(0)}$. We then let $z_x^{(0)} = z_y^{(0)} = z^{(0)}$.
Suppose at iteration t, the current estimators of zx, zy are given by $z_x^{(t)}, z_y^{(t)}$, respectively. We run block coordinate descent as follows:
- Given $z_x^{(t)}, z_y^{(t)}$, the current estimators of {μk}, {νk} are given by
$$\mu_k^{(t)} = \frac{\sum_{i} \mathbb{1}\{ z_{x,i}^{(t)} = k \}\, X_{i\cdot}}{\sum_{i} \mathbb{1}\{ z_{x,i}^{(t)} = k \}}, \qquad \nu_k^{(t)} = \frac{\sum_{i} \mathbb{1}\{ z_{y,i}^{(t)} = k \}\, (\hat\Pi Y)_{i\cdot}}{\sum_{i} \mathbb{1}\{ z_{y,i}^{(t)} = k \}}$$
for any 1 ≤ k ≤ K.
- Given $\{\mu_k^{(t)}\}, \{\nu_k^{(t)}\}$, the next estimators of z⋆, zx, zy are given by
$$\left( z_i^{\star(t+1)},\, z_{x,i}^{(t+1)},\, z_{y,i}^{(t+1)} \right) \in \operatorname*{arg\,min}_{k^\star,\, k_x,\, k_y \in \{1, \dots, K\}} \; \left\| X_{i\cdot} - \mu_{k_x}^{(t)} \right\|_2^2 + \left\| (\hat\Pi Y)_{i\cdot} - \nu_{k_y}^{(t)} \right\|_2^2 + w_\rho \left[ \mathbb{1}\{ k_x \neq k^\star \} + \mathbb{1}\{ k_y \neq k^\star \} \right]$$
for any 1 ≤ i ≤ nx. The above problem is solved via an enumeration procedure. We first hypothesize that $z_i^\star = k$ for some 1 ≤ k ≤ K. We then solve for the best $z_{x,i}$ by enumerating all K possible choices of labels. The same thing can be done to solve for the best $z_{y,i}$. Hence, we can compute the best value of the above objective function under the hypothesis that $z_i^\star = k$. We can then solve for the global optimal $z_i^\star$ by enumerating and comparing the objective values under every possible hypothesized value of $z_i^\star$. Given the global optimal $z_i^\star$, the global optimal $z_{x,i}$ and $z_{y,i}$ can be extracted.
In our implementation, we run the above procedure for 20 iterations.
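The block coordinate descent described above can be sketched as follows. This is a minimal sketch under assumptions, not the MARIO implementation: the function names are hypothetical, the regularization weight is passed in as a plain number (penalty) standing in for the ρ-dependent weight, the warm start is supplied by the caller rather than computed by spectral clustering, and empty clusters simply keep their previous centroid.

```python
import numpy as np


def update_centroids(data, labels, centroids):
    """Replace each centroid by the mean of its current members (empty clusters keep the old value)."""
    new = centroids.copy()
    for k in range(centroids.shape[0]):
        members = labels == k
        if members.any():
            new[k] = data[members].mean(axis=0)
    return new


def update_labels(x, y_aligned, mu, nu, penalty):
    """Enumerate the hypothesised global label per pair and keep the cheapest label triple."""
    n_clusters = mu.shape[0]
    # Squared distances from every cell to every centroid, shape (n, K).
    cost_x = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    cost_y = ((y_aligned[:, None, :] - nu[None, :, :]) ** 2).sum(axis=2)
    # disagree[k_star, k] is the penalty paid when an individual label k differs from k_star.
    disagree = penalty * (1.0 - np.eye(n_clusters))
    total_x = cost_x[:, None, :] + disagree[None, :, :]  # (n, k_star, k)
    total_y = cost_y[:, None, :] + disagree[None, :, :]
    value = total_x.min(axis=2) + total_y.min(axis=2)  # best value under each hypothesis
    z_star = value.argmin(axis=1)
    rows = np.arange(x.shape[0])
    z_x = total_x.argmin(axis=2)[rows, z_star]
    z_y = total_y.argmin(axis=2)[rows, z_star]
    return z_star, z_x, z_y


def jointly_regularized_kmeans(x, y_aligned, init_labels, n_clusters, penalty, n_iters=20):
    """Alternate centroid and label updates; flag pairs whose two labels disagree."""
    z_x = np.asarray(init_labels).copy()
    z_y = z_x.copy()
    mu = np.zeros((n_clusters, x.shape[1]))
    nu = np.zeros((n_clusters, y_aligned.shape[1]))
    for _ in range(n_iters):
        mu = update_centroids(x, z_x, mu)
        nu = update_centroids(y_aligned, z_y, nu)
        z_star, z_x, z_y = update_labels(x, y_aligned, mu, nu, penalty)
    keep = z_x == z_y  # pairs with disagreeing labels are treated as spurious
    return z_star, z_x, z_y, keep
```

Matched pairs for which keep is False would be the ones dropped before downstream analysis.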
The objective function of MARIO
In this subsection, we formulate the whole MARIO pipeline as a single optimization problem. Let X and Y be the two data matrices (rows as cells and columns as features). Without loss of generality, we assume that X has at most as many rows as Y: if there are n rows in X and m rows in Y, then n ≤ m. Suppose the two linear transformations A and B introduced below are both of rank k. MARIO is an algorithm aimed at solving the following optimization problem:
$$\max_{\Pi \in S(n,m),\; A,\; B} \; \mathrm{Tr}\!\left( A^\top X^\top H\, \Pi Y B \right) \quad \text{subject to} \quad A^\top X^\top H X A = B^\top Y^\top \Pi^\top H\, \Pi Y B = I_k.$$
Here $H = I_n - \tfrac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$ is the centering matrix, and S(n, m) is the collection of all binary n-by-m matrices such that there are (m − n) zero columns, each of the remaining n columns has one and only one entry equal to one, and each row has one and only one entry equal to one. That is, MARIO aims at simultaneously finding the cell–cell correspondence matrix Π and two linear transformations A and B such that, after projecting the data matrices X and Y to a common latent space using A and B, and selecting a subset of rows of YB and matching them to the rows of XA in a one-to-one fashion, the trace inner product between XA and ΠYB is maximized. By the definition of S(n, m), the matrix Π selects n rows of YB and then finds a bijection between the selected rows of YB and rows of XA.
The objective function of the optimization problem is a combination of the top k CCA objective function and the (unbalanced) linear assignment problem objective function:
- When Π is given, solving for the optimal A and B means simultaneously solving for the top k canonical correlation loading vectors for the pair (X, ΠY);
- When A and B are given, solving for Π means exactly solving a linear assignment problem.
Downstream analysis after cell matching
Joint embedding
After running jointly regularized filtering on the best interpolated estimator $\hat\Pi_{\hat\lambda}$, we get a pair of aligned datasets $X^\star$ and $Y^\star$, whose rows correspond to cells of similar types and n is the number of remaining cell–cell pairs after filtering. Then, we run CCA on X⋆, Y⋆ and collect the first n pairs of sample canonical scores (scaled within dataset) as the final embeddings. Note that other standard methods for joint embedding that take row-wise aligned datasets can also be applied.
Label transfer via k-nearest-neighbors matching
The interpolated distance matrix $D_{\hat\lambda}$ can be used to do label transfer via k-nearest-neighbors. Suppose we know the cell types for all cells in Y but the corresponding labels in X are missing. Then for the ith cell in X, we can predict its label by finding the k-nearest cells in Y according to the ith row of $D_{\hat\lambda}$ and taking the majority vote.
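A minimal Python sketch of this label transfer is given below. The function name knn_label_transfer is hypothetical, the cross-data distance matrix is assumed to be precomputed with one row per cell of X and one column per cell of Y, and ties in the vote are broken in favor of the label encountered first among the nearest neighbors, which is a choice of the sketch.

```python
from collections import Counter

import numpy as np


def knn_label_transfer(dist, y_labels, k=5):
    """Predict a label for each X cell by majority vote over its k nearest Y cells."""
    y_labels = np.asarray(y_labels)
    # Indices of the k smallest distances in each row, nearest first.
    nearest = np.argsort(dist, axis=1)[:, :k]
    predicted = []
    for idx in nearest:
        votes = Counter(y_labels[idx].tolist())
        predicted.append(votes.most_common(1)[0][0])
    return np.array(predicted)
```

For example, with a 2-by-4 distance matrix in which the first X cell is closest to two Y cells labeled "B" and the second is closest to two Y cells labeled "T", the function returns those two labels in order.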
Systematic benchmarks
Benchmarking on the matching quality
Three scenarios were tested during the benchmarking process:
Sequentially dropping shared features between datasets, to test the robustness of the algorithm regardless of the antibody panel design. Two dropping sequences were used: the first scheme drops antibodies based on their names, in alphabetical order (for example, CD1c is dropped before CD3); the second scheme drops them by importance score, with less important antibodies dropped first, mimicking real-world antibody panel design. To roughly assess the importance of each antibody for distinguishing cell states, a random forest model was trained for each dataset to predict cell types from marker expression, using the function randomForest in the R package randomForest with default parameters. A permutation feature importance test was then performed to determine the effects of the variables in the model, using the function varImp in the R package caret with default parameters, yielding a score for each protein. The importance scores of the shared proteins were then averaged between the datasets being matched and ranked from low to high. For the cross-species and murine spleen datasets, only the top 50% most important shared markers were considered.
Simulating poor-quality data by adding increasing levels of random noise to both datasets, to test the robustness of the algorithm in terms of over-integration. Gaussian random noise with mean 0 and standard deviations of 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3 and 1.5 was added to the normalized values of all protein channels.
Intentionally dropping cell types in the dataset being matched against, to test the robustness of the algorithm regardless of the cell-type composition difference between datasets.
In all three scenarios described above, all other compared methods used the exact same set of cells tested by MARIO. For cross-species data (related to Fig. 3 and Supplementary Fig. 7) only H1N1 challenged human and X-species cynomolgus monkey were benchmarked.
The following metrics were used in the benchmarking process:
Matching accuracy: this was calculated by the percentage of cells in X that have paired correctly with the same cell type in Y, based on the individual dataset’s cell-type annotations.
Matching proportion: this was calculated by the percentage of cells in X that had a match in Y after quality control steps.
Structure alignment score: this measures how much structural information is preserved after data integration. Let Dfull be the matrix whose (i, j)th entry is the Euclidean distance between the ith row and the jth row of X. Similarly, let Dpartial be the matrix whose (i, j)th entry is the Euclidean distance between the ith row and the jth row of the embedding of X. The structure alignment score for the ith cell in X is defined as the Pearson correlation between the ith row of Dfull and the ith row of Dpartial. The structure alignment score for X is then defined as the average of the scores over all cells in X. The structure alignment score for Y can be similarly obtained. The final structure alignment score is the average of the scores for X and Y.
Silhouette F1 score: this has been described in ref. 64 and is an integrated measure of the quality of dataset mixing and information preservation. In brief, two preliminary scores slt_mix and slt_clust were obtained, and the Silhouette F1 score was calculated as 2 ⋅ slt_mix ⋅ slt_clust/(slt_mix + slt_clust). Here, slt_mix is a measure of dataset mixing and is defined as one minus the normalized Silhouette width with the label being the dataset index; slt_clust is a measure of information preservation and is defined as the normalized Silhouette width with the label being the cell-type annotations. All Silhouette widths were computed using the silhouette() function from the R package cluster.
ARI F1 score: this is an integrated measure of the quality of dataset mixing and information preservation64. The definition is similar to that of Silhouette F1 score, except that we compute ARI instead of the Silhouette width. All ARI scores were computed using the function adjustedRandIndex() in R package mclust.
Average mixing score: this is a measure of dataset mixing based on the Kolmogorov–Smirnov statistic. For each cluster, the subsets of cells corresponding to that cluster were extracted from the embeddings of X and Y, respectively. For each coordinate of the embeddings, one minus the Kolmogorov–Smirnov statistic was computed. The mixing score for that cluster was then computed by taking the median of one minus the Kolmogorov–Smirnov statistic for each coordinate. The average mixing score is defined as the average of mixing scores over all clusters.
Error avoidance score: this measures the performance of the quality control process and is specific to the benchmarking scenario 3 (intentionally dropping cell types). For each cell type dropped, the corresponding error avoidance score is defined as 1 − a/b, where a is the number of cells in X that are of that type and have survived the quality control process (that is, a match involving that cell type has occurred), and b is the total number of cells of that type in X. A higher value of this score indicates that erroneous matching toward deleted cell types has been better avoided.
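Three of the metrics above (structure alignment score, average mixing score and error avoidance score) can be sketched in Python as follows. The benchmarking in this work was carried out with the deposited code; the sketch below is only an illustration, with hypothetical function names, and it takes the error avoidance score to be one minus the surviving fraction of cells of the dropped type.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import ks_2samp


def structure_alignment_score(full, embedding):
    """Average row-wise Pearson correlation between two pairwise-distance matrices."""
    d_full = cdist(full, full)
    d_part = cdist(embedding, embedding)
    a = d_full - d_full.mean(axis=1, keepdims=True)
    b = d_part - d_part.mean(axis=1, keepdims=True)
    corr = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(corr.mean())


def average_mixing_score(emb_x, emb_y, clusters_x, clusters_y):
    """Mean over clusters of the median (1 - KS statistic) across embedding coordinates."""
    clusters_x = np.asarray(clusters_x)
    clusters_y = np.asarray(clusters_y)
    scores = []
    for c in np.intersect1d(clusters_x, clusters_y):
        sub_x = emb_x[clusters_x == c]
        sub_y = emb_y[clusters_y == c]
        per_coord = [
            1.0 - ks_2samp(sub_x[:, j], sub_y[:, j]).statistic
            for j in range(emb_x.shape[1])
        ]
        scores.append(np.median(per_coord))
    return float(np.mean(scores))


def error_avoidance_score(n_survived, n_total):
    """One minus the fraction of cells of a dropped type that survived quality control."""
    return 1.0 - n_survived / n_total
```

The structure alignment score would be computed once for X and once for Y and the two values averaged, as described above.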
During benchmarking, all datasets were downsampled. The bone marrow dataset (Fig. 2) was downsampled to 40,000 cells (8,000 and 32,000 for X and Y); the PBMC dataset (Supplementary Fig. 3) was downsampled to 25,000 cells (5,000 and 20,000 for X and Y); the X-Species H1N1/IFN-gamma dataset (Fig. 3) was downsampled to 40,000 cells (8,000 and 32,000 for X and Y); the X-Species H1N1/IL-4 dataset (Supplementary Fig. 7) was downsampled to 40,000 cells (8,000 and 32,000 for X and Y) and the murine spleen dataset (Fig. 4) was downsampled to 25,000 cells (5,000 and 20,000 for X and Y). All methods used the same set of cells.
Parameters used for benchmarking are as follows. For benchmarking of MARIO, we used a consistent set of parameters across all datasets: n_components_ovlp = 10 (or the maximum number available), n_components_all = 20 (or the maximum available), sparsity = 5,000, bad_prop = 0.1 or 0.2, n_batch = 1. For other methods, the input data were values normalized per feature within each dataset (the same as the MARIO input data, except for Liger/UINMF, which require their own custom normalization). Only mNN-based methods (Scanorama, Seurat, fastMNN) were included in the comparison of matching accuracy and matching proportion. For Seurat, three versions were compared (principal components analysis (PCA), CCA and reciprocal PCA). For computation of SAM, ASW, ARI and avgMix, the first 20 (or maximum available) components of MARIO CCA scores or reduced values from other methods were used. For visualization, t-SNE plots were produced using the first ten components for all methods. In some rare cases, certain methods produced NAs (not available) in the integrated values for a limited number of cells; these were replaced with 0 for downstream analysis. Detailed information on the benchmarking process can be retrieved from the deposited code in our GitHub repository.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41592-022-01709-7.
Acknowledgements
We thank S. Bendall, S. Rodig and members of the Nolan and Jiang laboratories for helpful discussions. B.Z. is supported by a Stanford Graduate Fellowship. This work was funded in part by grants from the National Institutes of Health nos. R01AI149672 (S.J. and G.P.N.), the Bill & Melinda Gates Foundation grant no. INV-002704 (S.J. and G.P.N.), grant no. OPP1113682 (G.P.N.), COVID-19 Pilot Award (S.J., D.R.M. and G.P.N.), the Fast Grant Funding for COVID-19 Science (S.J., D.R.M. and G.P.N.), the Botnar Research Centre for Child Health Emergency Response to COVID-19 grant (S.J., D.R.M., G.P.N., M.S.M. and A.T.), the Hope Foundation to G.P.N., the US Food and Drug Administration Medical Countermeasures Initiative contract nos. HHSF223201610018C and 75F40120C00176 (G.P.N.), the Parker Institute for Cancer Immunotherapy (G.P.N.), the Rachford and Carlota A. Harris Endowed Professorship (G.P.N.), the National Institute Of Allergy And Infectious Diseases of the National Institutes of Health under award number DP2AI171139 (S.J.) and the Gilead Research Scholar in Hematologic Malignancies (S.J.). This article reflects the views of the authors and should not be construed as representing the views or policies of the FDA, NIH, BMGF, Botnar Foundation or other institutions that provided funding.
Author contributions
Conceptualization was done by B.Z., S.C., Z.M., G.P.N. and S.J. Algorithm development and implementation were carried out by S.C., B.Z. and Z.M. Analysis was done by B.Z., S.C., Y.B., H.C., G.L., I.T.L., Y.G. and S.J. Contribution of key reagents and tools was from N.M., G.V., D.R.M., A.T. and M.S.M. The project was supervised by S.J., G.P.N. and Z.M. Both B.Z. and S.C. contributed equally and have the right to list their name first in their CV.
Peer review
Peer review information
Nature Methods thanks Xiaohui Xie and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Madhura Mukhopadhyay, in collaboration with the Nature Methods team. Peer reviewer reports are available.
Data availability
Publicly available datasets used were: Levine et al. Human BMC CyTOF at: https://github.com/lmweber/benchmark-data-Levine-32-dim; Stuart et al. Human BMC CITE-seq (from the R package SeuratData, ‘bmcite’) at https://satijalab.org/seurat/articles/weighted_nearest_neighbor_analysis.html; Rahil et al. Human H1N1 challenged whole blood CyTOF at flow repository ‘FR-FCM-Z2NZ’; Bjornson et al. Human and NHP whole blood CyTOF at flow repository ‘FR-FCM-Z2ZY’; Goltsev et al. Murine Spleen CODEX at https://data.mendeley.com/datasets/zjnpwh8m5b/1 (raw images available on reasonable request from the Nolan Laboratory); Gayoso et al. Murine Spleen CITE-seq at https://github.com/YosefLab/totalVI_reproducibility/tree/master/data; COVID-19 Cell Atlas. COVID-19 patient BALF CITE-seq (VIB/Ghent) at https://www.covid19cellatlas.org/index.patient.html; Hartmann et al. Human PBMC CyTOF at flow repository ‘FR-FCM-Z249’, HD06_run1; 10X Genomics. Human PBMC CITE-seq at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3?. Newly generated data used came from COVID-Lung CODEX imaging expression files (macrophage related) at: https://github.com/shuxiaoc/mario-py/tree/main/Manuscript_Archive_Code/data/COVID-19. Full dataset information, including raw images of the CODEX and PANINI validation experiments, is available on reasonable request. All data mentioned above are also summarized and deposited (with related preprocessing scripts) at https://github.com/shuxiaoc/mario-py.
Code availability
MARIO and related tutorials are freely available to the public on GitHub at https://github.com/shuxiaoc/mario-py. For reproducibility, code to regenerate the main and supplementary figures has also been deposited in the GitHub repository.
Competing interests
G.P.N. received research grants from Pfizer, Inc., Vaxart, Inc., Celgene, Inc. and Juno Therapeutics, Inc. during the course of this work. G.P.N. and Y.G. have equity in Akoya Biosciences, Inc. G.P.N. is a scientific advisory board member of Akoya Biosciences, Inc.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Bokai Zhu, Shuxiao Chen.
These authors jointly supervised this work: Zongming Ma, Garry P. Nolan, Sizun Jiang.
Contributor Information
Zongming Ma, Email: zongming@wharton.upenn.edu.
Garry P. Nolan, Email: gnolan@stanford.edu
Sizun Jiang, Email: sjiang3@bidmc.harvard.edu.
Extended data
is available for this paper at 10.1038/s41592-022-01709-7.
Supplementary information
The online version contains supplementary material available at 10.1038/s41592-022-01709-7.
References
- 1.Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat. Rev. Genet. 2016;17:175–188. doi: 10.1038/nrg.2015.16. [DOI] [PubMed] [Google Scholar]
- 2.Schwartzman O, Tanay A. Single-cell epigenomics: techniques and emerging applications. Nat. Rev. Genet. 2015;16:716–726. doi: 10.1038/nrg3980. [DOI] [PubMed] [Google Scholar]
- 3.Papalexi E, Satija R. Single-cell rna sequencing to explore immune cell heterogeneity. Nat. Rev. Immunol. 2018;18:35–45. doi: 10.1038/nri.2017.76. [DOI] [PubMed] [Google Scholar]
- 4.Vistain, L. F. & Tay, S. Single-cell proteomics. Trends Biochem. Sci.46, 661–672 (2021). [DOI] [PMC free article] [PubMed]
- 5.Fulwyler MJ. Electronic separation of biological cells by volume. Science. 1965;150:910–911. doi: 10.1126/science.150.3698.910. [DOI] [PubMed] [Google Scholar]
- 6.Baumgarth N, Roederer M. A practical approach to multicolor flow cytometry for immunophenotyping. J. Immunol. Meth. 2000;243:77–97. doi: 10.1016/s0022-1759(00)00229-5. [DOI] [PubMed] [Google Scholar]
- 7.Bendall SC, et al. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science. 2011;332:687–696. doi: 10.1126/science.1198704. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Stoeckius M, et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods. 2017;14:865–868. doi: 10.1038/nmeth.4380. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Peterson VM, et al. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. 2017;35:936–939. doi: 10.1038/nbt.3973. [DOI] [PubMed] [Google Scholar]
- 10.Giesen C, et al. Highly multiplexed imaging of tumor tissues with subcellular resolution by mass cytometry. Nat. Methods. 2014;11:417–422. doi: 10.1038/nmeth.2869. [DOI] [PubMed] [Google Scholar]
- 11.Lin J-R, Fallahi-Sichani M, Sorger PK. Highly multiplexed imaging of single cells using a high-throughput cyclic immunofluorescence method. Nat. Commun. 2015;6:8390. doi: 10.1038/ncomms9390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Keren L, et al. A structured tumor-immune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Cell. 2018;174:1373–1387. doi: 10.1016/j.cell.2018.08.039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Goltsev Y, et al. Deep profiling of mouse splenic architecture with CODEX multiplexed imaging. Cell. 2018;174:968–981. doi: 10.1016/j.cell.2018.07.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Haghverdi L, Lun AT, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 2018;36:421–427. doi: 10.1038/nbt.4091. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Hie B, Bryson B, Berger B. Efficient integration of heterogeneous single-cell transcriptomes using scanorama. Nat. Biotechnol. 2019;37:685–691. doi: 10.1038/s41587-019-0113-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Barkas N, et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods. 2019;16:695–698. doi: 10.1038/s41592-019-0466-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Stuart T, et al. Comprehensive integration of single-cell data. Cell. 2019;177:1888–1902. doi: 10.1016/j.cell.2019.05.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Welch J, et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell. 2019;177:1873–1887. doi: 10.1016/j.cell.2019.05.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Hotelling H. Relations between two sets of variates. Biometrika. 1936;28:321–377. [Google Scholar]
- 20.Zhang X, Xu C, Yosef N. Simulating multiple faceted variability in single cell RNA sequencing. Nat. Commun. 2019;10:2611. doi: 10.1038/s41467-019-10500-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Kim HJ, Lin Y, Geddes TA, Yang JYH, Yang P. Citefuse enables multi-modal analysis of Cite-seq data. Bioinformatics. 2020;36:4137–4143. doi: 10.1093/bioinformatics/btaa282. [DOI] [PubMed] [Google Scholar]
- 22.Levine JH, et al. Data-driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosis. Cell. 2015;162:184–197. doi: 10.1016/j.cell.2015.05.047. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Rahil Z, et al. Landscape of coordinated immune responses to H1N1 challenge in humans. J. Clin. Invest. 2020;130:5800–5816.
- 24. Bjornson-Hooper ZB, et al. A comprehensive atlas of immunological differences between humans, mice and non-human primates. Front. Immunol. 2022;13:867015.
- 25. Gotthardt D, et al. Loss of STAT3 in murine NK cells enhances NK cell–dependent tumor surveillance. Blood. 2014;124:2370–2379. doi: 10.1182/blood-2014-03-564450.
- 26. Rauch I, Müller M, Decker T. The regulation of inflammation by interferons and their STATs. JAK-STAT. 2013;2:e23820. doi: 10.4161/jkst.23820.
- 27. Dallagi A, et al. The activating effect of IFN-γ on monocytes/macrophages is regulated by the LIF–trophoblast–IL-10 axis via Stat1 inhibition and Stat3 activation. Cell. Mol. Immunol. 2015;12:326–341. doi: 10.1038/cmi.2014.50.
- 28. Zarubin T, Han J. Activation and signaling of the p38 MAP kinase pathway. Cell Res. 2005;15:11–18. doi: 10.1038/sj.cr.7290257.
- 29. Chaudhary O, et al. Inhibition of p38 MAPK in combination with ART reduces SIV-induced immune activation and provides additional protection from immune system deterioration. PLoS Pathog. 2018;14:e1007268. doi: 10.1371/journal.ppat.1007268.
- 30. Govek KW, et al. Single-cell transcriptomic analysis of mIHC images via antigen mapping. Sci. Adv. 2021;7:eabc5464. doi: 10.1126/sciadv.abc5464.
- 31. Gayoso A, et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat. Methods. 2021;18:272–282. doi: 10.1038/s41592-020-01050-x.
- 32. Mazzucchelli R, Durum SK. Interleukin-7 receptor expression: intelligent design. Nat. Rev. Immunol. 2007;7:144–154. doi: 10.1038/nri2023.
- 33. Grumont R, et al. The mitogen-induced increase in T cell size involves PKC and NFAT activation of Rel/NF-κB-dependent c-myc expression. Immunity. 2004;21:19–30. doi: 10.1016/j.immuni.2004.06.004.
- 34. Kleiman E, et al. Distinct transcriptomic features are associated with transitional and mature B-cell populations in the mouse spleen. Front. Immunol. 2015;6:30. doi: 10.3389/fimmu.2015.00030.
- 35. Wen L, Shinton SA, Hardy RR, Hayakawa K. Association of B-1 B cells with follicular dendritic cells in spleen. J. Immunol. 2005;174:6918–6926. doi: 10.4049/jimmunol.174.11.6918.
- 36. Hardtke S, Ohl L, Förster R. Balanced expression of CXCR5 and CCR7 on follicular T helper cells determines their transient positioning to lymph node follicles and is essential for efficient B-cell help. Blood. 2005;106:1924–1931. doi: 10.1182/blood-2004-11-4494.
- 37. Kreslavsky T, et al. Essential role for the transcription factor BHLHE41 in regulating the development, self-renewal and BCR repertoire of B-1a cells. Nat. Immunol. 2017;18:442–455. doi: 10.1038/ni.3694.
- 38. Pavlasova G, Mraz M. The regulation and function of CD20: an ‘enigma’ of B-cell biology and targeted therapy. Haematologica. 2020;105:1494. doi: 10.3324/haematol.2019.243543.
- 39. Netea MG, et al. IL-1β processing in host defense: beyond the inflammasomes. PLoS Pathog. 2010;6:e1000661. doi: 10.1371/journal.ppat.1000661.
- 40. Carsetti R, Rosado MM, Wardmann H. Peripheral development of B cells in mouse and man. Immunological Rev. 2004;197:179–191. doi: 10.1111/j.0105-2896.2004.0109.x.
- 41. Arnon TI, Horton RM, Grigorova IL, Cyster JG. Visualization of splenic marginal zone B-cell shuttling and follicular B-cell egress. Nature. 2013;493:684–688. doi: 10.1038/nature11738.
- 42. Chung JB, Sater RA, Fields ML, Erikson J, Monroe JG. CD23 defines two distinct subsets of immature B cells which differ in their responses to T cell help signals. Int. Immunol. 2002;14:157–166. doi: 10.1093/intimm/14.2.157.
- 43. Stolp J, et al. Intrinsic molecular factors cause aberrant expansion of the splenic marginal zone B cell population in nonobese diabetic mice. J. Immunol. 2013;191:97–109. doi: 10.4049/jimmunol.1203252.
- 44. Uhlén M, et al. Tissue-based map of the human proteome. Science. 2015;347:1260419.
- 45. Chan Zuckerberg Initiative Single-Cell. Single cell profiling of COVID-19 patients: an international data resource from multiple tissues. Preprint at medRxiv (2020).
- 46. Michalec L, et al. CCL7 and CXCL10 orchestrate oxidative stress-induced neutrophilic lung inflammation. J. Immunol. 2002;168:846–852. doi: 10.4049/jimmunol.168.2.846.
- 47. Russo RC, Garcia CC, Teixeira MM, Amaral FA. The CXCL8/IL-8 chemokine family and its receptors in inflammatory diseases. Expert Rev. Clin. Immunol. 2014;10:593–619. doi: 10.1586/1744666X.2014.894886.
- 48. Segovia M, et al. Targeting TMEM176b enhances antitumor immunity and augments the efficacy of immune checkpoint blockers by unleashing inflammasome activation. Cancer Cell. 2019;35:767–781. doi: 10.1016/j.ccell.2019.04.003.
- 49. Guo Q, et al. Induction of alarmin S100A8/A9 mediates activation of aberrant neutrophils in the pathogenesis of COVID-19. Cell Host Microbe. 2021;29:222–235. doi: 10.1016/j.chom.2020.12.016.
- 50. Mostafavi S, et al. Parsing the interferon transcriptional network and its disease associations. Cell. 2016;164:564–578. doi: 10.1016/j.cell.2015.12.032.
- 51. Zhou Z, et al. Heightened innate immune responses in the respiratory tract of COVID-19 patients. Cell Host Microbe. 2020;27:883–890. doi: 10.1016/j.chom.2020.04.017.
- 52. Nguyen-Robertson C, Haque A, Mintern J, La Flamme AC. COVID-19: searching for clues among other respiratory viruses. Immunol. Cell Biol. 2020;98:247–250. doi: 10.1111/imcb.12336.
- 53. Merad M, Martin JC. Pathological inflammation in patients with COVID-19: a key role for monocytes and macrophages. Nat. Rev. Immunol. 2020;20:355–362. doi: 10.1038/s41577-020-0331-4.
- 54. Patel H, et al. Proteomic blood profiling in mild, severe and critical COVID-19 patients. Sci. Rep. 2021;11:6357. doi: 10.1038/s41598-021-85877-0.
- 55. Khan SA, Goliwas KF, Deshane JS. Sphingolipids in lung pathology in the coronavirus disease era: a review of sphingolipid involvement in the pathogenesis of lung damage. Front. Physiol. 2021;12:760638.
- 56. Gutmann C, et al. SARS-CoV-2 RNAemia and proteomic trajectories inform prognostication in COVID-19 patients admitted to intensive care. Nat. Commun. 2021;12:3406. doi: 10.1038/s41467-021-23494-1.
- 57. Doehn J-M, et al. CD169/SIGLEC1 is expressed on circulating monocytes in COVID-19 and expression levels are associated with disease severity. Infection. 2021;49:757–762. doi: 10.1007/s15010-021-01606-9.
- 58. Jiang S, et al. Combined protein and nucleic acid imaging reveals virus-dependent B cell and macrophage immunosuppression of tissue microenvironments. Immunity. 2022;55:1118–1134.e8.
- 59. Cang Z, Nie Q. Inferring spatial and signaling relationships between cells from single cell transcriptomic data. Nat. Commun. 2020;11:2084. doi: 10.1038/s41467-020-15968-5.
- 60. Chen S, Liu S, Ma Z. Global and individualized community detection in inhomogeneous multilayer networks. Ann. Stat. 2022;50:2664–2693.
- 61. Burkard R, Dell’Amico M, Martello S. Assignment problems: revised reprint. SIAM, 35–71 (2012).
- 62. Chen S, Jiang S, Ma Z, Nolan G, Zhu B. One-way matching of datasets with low rank signals. Preprint at arXiv:2204.13858 (2022).
- 63. Löffler M, Zhang AY, Zhou HH. Optimality of spectral clustering in the Gaussian mixture model. Ann. Stat. 2021;49:2506–2530.
- 64. Tran H, et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 2020;21:12. doi: 10.1186/s13059-019-1850-9.
Data Availability Statement
The publicly available datasets used were: Levine et al., human BMC CyTOF, at https://github.com/lmweber/benchmark-data-Levine-32-dim; Stuart et al., human BMC CITE-seq (from the R package SeuratData, ‘bmcite’), at https://satijalab.org/seurat/articles/weighted_nearest_neighbor_analysis.html; Rahil et al., human H1N1-challenged whole blood CyTOF, at Flow Repository ‘FR-FCM-Z2NZ’; Bjornson-Hooper et al., human and NHP whole blood CyTOF, at Flow Repository ‘FR-FCM-Z2ZY’; Goltsev et al., murine spleen CODEX, at https://data.mendeley.com/datasets/zjnpwh8m5b/1 (raw images available on reasonable request from the Nolan Laboratory); Gayoso et al., murine spleen CITE-seq, at https://github.com/YosefLab/totalVI_reproducibility/tree/master/data; COVID-19 Cell Atlas, COVID-19 patient BALF CITE-seq (VIB/Ghent), at https://www.covid19cellatlas.org/index.patient.html; Hartmann et al., human PBMC CyTOF, at Flow Repository ‘FR-FCM-Z249’, HD06_run1; and 10X Genomics, human PBMC CITE-seq, at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3?. The newly generated data are the COVID-19 lung CODEX imaging expression files (macrophage related), available at https://github.com/shuxiaoc/mario-py/tree/main/Manuscript_Archive_Code/data/COVID-19. Full dataset information, including raw images of the CODEX and PANINI validation experiments, is available on reasonable request. All data mentioned above are also summarized and deposited, with the related preprocessing scripts, at https://github.com/shuxiaoc/mario-py.
MARIO and related tutorials are freely available to the public on GitHub at https://github.com/shuxiaoc/mario-py. For reproducibility, the code to regenerate the main and supplementary figures has also been deposited in the same GitHub repository.