Author manuscript; available in PMC: 2024 Jul 17.
Published in final edited form as: Nat Biotechnol. 2023 Sep 21;42(7):1084–1095. doi: 10.1038/s41587-023-01940-3

Supervised discovery of interpretable gene programs from single-cell data

Russell Kunes 1,2,*, Thomas Walle 1,3,4,5,*, Max Land 1, Tal Nawy 1, Dana Pe’er 1,6
PMCID: PMC10958532  NIHMSID: NIHMS1954057  PMID: 37735262

Abstract

Factor analysis can drive biological discovery by decomposing single-cell gene expression data into a minimal set of gene programs that correspond to processes executed by cells in a sample. However, matrix factorization methods are prone to technical artifacts and poor factor interpretability. We have developed Spectra, an algorithm that identifies user-provided gene programs, modifies them to dataset context as needed, and detects novel programs that together best explain expression covariation. Spectra overcomes the dominance of cell-type signals by modeling cell-type-specific programs and can characterize interpretable cell states along a continuum. We show that it outperforms existing approaches in challenging tumor immune contexts; Spectra finds factors that change under immune checkpoint therapy, disentangles the highly correlated features of CD8+ T-cell tumor reactivity and exhaustion, finds a novel program that explains continuous macrophage state changes under therapy, and identifies cell-type-specific immune metabolic programs.

Introduction

Deciphering the mechanisms that underlie cellular function is a central goal in biology. We frequently wish to know, for instance, how different cell types respond to external stimuli and how this alters processes within the cell. Although clustering analysis can delineate cell types in single-cell RNA sequencing (scRNA-seq) data, it is difficult to retrieve coherent, interpretable gene programs representing these cellular processes, and to quantify them in response to perturbation.

Gene programs are sets of genes defined by common tasks, such as metabolic pathways and responses to inflammatory cues or growth signals. Gene set scoring (e.g., scanpy score_genes1,2) is a simple and widely used approach to query which known gene programs are active in which cells, but it is often confounded by gene set overlap and technical factors. Single-cell sequencing is particularly well suited to identifying gene programs, since programs tend to be regulated by mechanisms that are shared across cell subpopulations. Coregulation creates collinearity in gene expression levels, lending low-dimensional structure to high-dimensional cell-by-gene count matrices. Matrix factorization is a means of mining this structure to identify candidate gene programs3,4, and it has become a core tool in single-cell analysis; for example, factorization by principal component analysis (PCA) appears early in most scRNA-seq analysis pipelines. In principle, the power of factorization lies in summarizing biological activity as a compact set of cellular building blocks: it can provide a minimal vector representing the degree to which a cell activates each gene program, rather than a noisy vector of all observed genes or a single label denoting cell type. Yet matrix factorization is an under-constrained problem (there are many ways to decompose a matrix), and unsupervised approaches such as PCA and non-negative matrix factorization (NMF) produce factors that are often difficult to interpret or driven by technical artifacts such as batch effects, ambient RNA, or gene expression scale differences4,5. Some methods take a supervised approach, applying known gene sets as prior knowledge to make detected factors more interpretable6,7. However, pre-existing gene sets are typically defined in different biological contexts than those under study. In addition, cell-type factors tend to prevail in factor analysis because expression differences between cells are dominated by cell type5.
The popular practice of partitioning data by cell type and factoring each subset separately mitigates this issue, but makes it impossible to find shared programs. A successful factor analysis method should identify all active gene programs in a dataset, including variations specific to biological context as well as novel factors, and it should quantify the degree to which each gene program is executed by each cell type. We have developed Spectra (supervised pathway deconvolution of interpretable gene programs) to provide meaningful annotations of cell function by balancing prior knowledge with data-driven discovery (https://github.com/dpeerlab/spectra). Spectra incorporates existing gene sets and cell type labels as prior biological information. It explicitly models cell type and represents input gene sets as a gene-gene knowledge graph, using a penalty function to guide factorization towards the input graph. The graph representation enables data-driven modification of the input gene graph to reflect biological context, and the identification of novel gene programs from residual unexplained variation. The degree of reliance on prior knowledge can be tuned with a global parameter. Spectra’s ability to minimize the influence of cell type allows it to identify factors that are shared across cell types. We show that it outperforms existing approaches by solving longstanding challenges in tumor immune contexts; Spectra finds factors that change under immune checkpoint therapy (ICT), disentangles the highly correlated features of CD8+ T cell tumor reactivity and exhaustion under ICT, learns a novel gene program that explains the continuum of macrophage state changes under therapy, and identifies metabolic programs specific to different immune cell types. 
The open-source software scales to large atlases and overcomes batch effects to find factors that are stable across cohorts and even tumor types, and that are robust enough to be associated with patient-level clinical variables.

Results

Spectra factor analysis identifies interpretable gene programs from single-cell data

To model gene expression, we assume that each cell executes a small number of gene programs and that the observed expression in a cell is determined by the sum of its active programs. Spectra decomposes the cell-by-gene expression matrix into a cell-by-factor matrix that identifies and quantifies the programs executed by each cell and a factor-by-gene matrix representing the genes in each gene program (Fig. 1a and Methods). As input, the algorithm receives a normalized cell-by-gene count matrix, a cell-type annotation for each cell, and either a list of gene sets, or gene-gene relationships in the form of knowledge graphs. As output, Spectra provides a set of normalized global and cell-type-specific factor matrices, representing the gene loadings for each identified factor (gene scores); a sparse matrix of normalized factor loadings for each cell (cell scores); and the modified gene knowledge graph representing factors inferred from the data (see Methods for a technical description of Spectra and parameter settings). Spectra attempts to balance prior knowledge and interpretability with faithfulness to the data. Two key features distinguish it from other factorization methods. First, Spectra uses known cell type information and allows for cell-type-specific factors; by incorporating cell-type-specific gene weights that explain away constitutively expressed cell-type marker genes, it mitigates their influence on the factors. Second, Spectra uses existing gene sets as prior knowledge, which it represents as an input gene-gene knowledge graph, enabling their data-driven modification as well as the derivation of entirely new factors. Together, these features allow Spectra to identify more interpretable factors and to discover new biology. 
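The additive assumption described above can be illustrated with a toy example. The sketch below is not Spectra's actual model, which adds cell-type-specific terms and a graph penalty (see Methods); it only shows how a cell-by-factor matrix and a factor-by-gene matrix combine to reconstruct expression. The function name `reconstruct` and all numbers are hypothetical.

```python
# Toy illustration: expression as a sum of active gene programs.
# All names and values are made up for illustration; this is not Spectra's model.

def reconstruct(cell_scores, gene_scores):
    """Multiply a cell-by-factor matrix by a factor-by-gene matrix."""
    n_genes = len(gene_scores[0])
    return [
        [sum(a * program[g] for a, program in zip(cell, gene_scores))
         for g in range(n_genes)]
        for cell in cell_scores
    ]

# Two programs over three genes (factor-by-gene "gene scores").
gene_scores = [
    [1.0, 0.5, 0.0],  # program 1
    [0.0, 0.2, 2.0],  # program 2
]

# Three cells (cell-by-factor "cell scores").
cell_scores = [
    [2.0, 0.0],  # cell running only program 1
    [0.0, 1.0],  # cell running only program 2
    [1.0, 1.0],  # cell running both programs
]

expected_expression = reconstruct(cell_scores, gene_scores)
```

In this toy setting, the third cell's expression is simply the sum of the two programs, which is the intuition behind reading cell scores as program activity.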
Given that clustering on principal components is usually superior to factor analysis alone for identifying cell types, we provide cell-type labels as input to Spectra, which models the influence of a factor on gene expression relative to baseline expression per cell type. Modeling cell types also enables the incorporation of both global and cell-type-specific factors for improved inference. For example, while the T-cell receptor (TCR) activation program should be limited to T cells, many of its genes are also activated by different programs in other cell types, which confuses traditional factor analysis. Cell type information enables Spectra to decompose gene expression in a cell-type-specific manner rather than assuming that all cell types use identical gene programs. Spectra’s likelihood function ensures that after decomposition, the reconstituted matrix closely matches the input matrix, and its penalty function guides gene factorization towards the gene-gene knowledge graph (Methods). To better capture prior knowledge, we use binary gene-gene relationships and encourage these gene pairs to share similar factors. Spectra takes input gene sets and turns each into a fully connected clique in the input graph, indicating that each gene pair in the set is related. Factors are thus scored by how well they match the data, as well as how many edges in the gene-gene graph support them. Most established gene sets are derived from bulk sequencing data and multiple biological contexts. In contrast, analysis typically involves a dataset from a specific biological context that only utilizes subsets or variations of these gene programs. Spectra can take a broad compilation of input gene sets and determine which are supported by the data, ignoring those that are dissimilar to its identified factors. 
Encoding prior knowledge as a graph facilitates computational efficiency, and more importantly, it allows Spectra to adapt gene programs by adding or removing edges in the input graph to generate the modified graph. The algorithm incorporates background edge and non-edge rates, provided as input parameters or learned from the data, to determine edge addition and removal rates. A critical feature of Spectra is that it can detach factors from graph penalization to learn entirely new factors. Effectively, Spectra attempts to explain as many of the input gene counts as possible by adapting the input gene graph (providing highly interpretable factors), and uses the residual unexplained counts to identify non-penalized factors that can capture entirely novel biology. By design, Spectra is thus empowered to reduce the dominance of cell type while detecting more subtle global and cell-type-specific gene programs; it uniquely balances prior information for maximal interpretability with the data-driven modification of existing gene sets and discovery of new programs.
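The conversion of gene sets into a gene-gene graph can be sketched as follows. This is a minimal illustration that assumes an undirected, unweighted graph stored as a set of gene pairs; the helper `gene_sets_to_edges` and the toy gene sets are hypothetical and do not reflect the Spectra package's internal representation.

```python
from itertools import combinations

def gene_sets_to_edges(gene_sets):
    """Turn each gene set into a fully connected clique.

    Returns undirected edges as alphabetically sorted (gene_a, gene_b) tuples,
    so an edge shared by several gene sets is stored once.
    """
    edges = set()
    for genes in gene_sets.values():
        for gene_a, gene_b in combinations(sorted(set(genes)), 2):
            edges.add((gene_a, gene_b))
    return edges

# Hypothetical toy gene sets (not taken from the knowledge base).
gene_sets = {
    "toy_set_A": ["CD3E", "CD3D", "ZAP70"],
    "toy_set_B": ["ZAP70", "LCK"],
}
edges = gene_sets_to_edges(gene_sets)
```

A gene set with n genes contributes n(n-1)/2 edges, so the three-gene set yields three edges and the two-gene set yields one.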

Figure 1 — Spectra uses gene sets and cell types to guide gene program discovery from scRNA-seq data.

a, As input, Spectra receives a gene expression count matrix with cell type labels for each cell, as well as pre-defined gene sets, which it converts to a gene-gene graph. The algorithm fits a factor analysis model using a loss function that optimizes reconstruction of the count matrix and guides factors to support the input gene-gene graph. As output, Spectra provides factor loadings (cell scores) and gene programs corresponding to cell types and cellular processes (factors). b, Gene set categories in the immunology knowledge base. c, Experimental design of PBMCs (n=23,754) from healthy human donors (n=3) incubated for 6h with LPS, phorbol 12-myristate 13-acetate (PMA) or recombinant human IFN-γ. d, Ability of different algorithms to identify gene programs associated with biological perturbations in the PBMC dataset. For select factors, mean per-donor cell scores are provided for T or innate lymphoid cells (T/ILC), B cells (B), and myeloid cells (M) (n = 3 donors). Boxes and lines represent interquartile range (IQR) and median, respectively; whiskers represent range.

Spectra factors predict ground truth signaling perturbations

We first curated a general resource of immunology gene sets that can be input to Spectra for analyzing any immune-related dataset. Our knowledge base contains 231 relevant gene sets, including 50 ‘cellular identity’ gene sets to define input cell types and 181 ‘cellular process’ gene sets (Fig. 1b, Supplementary Table 1 and Methods). To generate the resource, we developed 97 new gene sets, including 14 from perturbation experiments, and added these to 134 gene sets from publications and external databases, some of which we modified. Of the cellular processes, 150 apply to most cell types in the data (e.g. leukocytes) and are designated as global, and 31 apply to individual cell types. We designed the cellular process gene sets to have comparable size (median n = 20 genes per gene set) and relatively little overlap (median pairwise overlap coefficient 40%) to enable dissection of a large number of processes and to avoid size-driven effects.
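For readers unfamiliar with the overlap coefficient, the sketch below assumes the standard definition, |A ∩ B| / min(|A|, |B|), and computes the median over all pairs of gene sets. The text above does not spell out the formula, so the definition is an assumption, and the function names and toy sets are hypothetical.

```python
from itertools import combinations
from statistics import median

def overlap_coefficient(set_a, set_b):
    """Shared genes divided by the size of the smaller set (assumed definition)."""
    set_a, set_b = set(set_a), set(set_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))

def median_pairwise_overlap(gene_sets):
    """Median overlap coefficient across all unordered pairs of gene sets."""
    values = [
        overlap_coefficient(x, y)
        for x, y in combinations(gene_sets.values(), 2)
    ]
    return median(values)

# Hypothetical toy gene sets with placeholder gene names.
toy_sets = {
    "A": ["g1", "g2", "g3", "g4"],
    "B": ["g3", "g4", "g5"],
    "C": ["g6", "g7"],
}
toy_median = median_pairwise_overlap(toy_sets)
```

Because the denominator is the smaller set, a small gene set fully contained in a larger one scores 1 regardless of the size difference.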

We then used our immunology knowledge base to infer gene programs in a well-controlled experimental system with ground truth from Kartha and colleagues8, consisting of scRNA-seq data from human peripheral blood mononuclear cells (PBMCs) after in vitro stimulation with interferon-γ (IFN-γ), lipopolysaccharide (LPS), or phorbol 12-myristate 13-acetate (PMA), a protein kinase C activator used to mimic T-cell receptor activation (Fig. 1c). We ran Spectra in addition to expiMap9 and Slalom6, factorization methods that also incorporate prior gene sets, and tested the association of factor cell scores with their corresponding perturbations.

Only Spectra identified gene programs associated with all three perturbations in the correct condition and cell type (Fig. 1d). Specifically, Spectra detected 18.8-fold overexpression of an LPS pathway activation factor under LPS stimulation in the LPS-responsive monocyte population across donors, whereas expiMap9 incorrectly found lower cell scores under stimulation in monocytes for one donor, and Slalom6 found a reduction in all three donors. Furthermore, Spectra detected 15.5-fold overexpression of a TCR activation factor in T cells stimulated with PMA. While both Slalom6 and expiMap9 correctly detected an increase in T cells, they also found inappropriate responses in monocytes and B cells. Finally, Spectra detected 4.0- and 1.5-fold overexpression of an IFN type 2 (IFN-γ) pathway activation factor in IFN-γ-stimulated myeloid and T cells, respectively. expiMap9 found lower cell scores in myeloid cells of one donor, and Slalom6 IFN factor loadings were strongly negative, despite the strong expression of IFN-γ receptors in myeloid cells.

Application of Spectra to an immuno-oncology dataset

To test Spectra in a more challenging and potentially impactful context for factorization, we applied it to scRNA-seq data from non-metastatic breast cancer patients before and after pembrolizumab (anti-PD-1) treatment (‘Bassez dataset’; Fig. 2a)10. The original study used clustering and gene set analysis to identify therapy-induced changes, and employed TCR sequencing to define patients’ clonal T-cell expansion status under anti-PD-1 as a surrogate for ICT response10.

Figure 2 — Evaluation of Spectra performance on simulated data and an immuno-oncology dataset.

a, Treatment and scRNA-seq sampling regime of breast cancer patients in the Bassez dataset10. We used Spectra to analyze tumor infiltrating leukocytes from these data. b, t-SNE embedding of tumor-infiltrating leukocytes (n=97,863 cells) from the Bassez dataset, colored by cell type. B, B cell; DC, dendritic cell; gdT, γδT cell; GC, germinal center; mac, macrophage; mast, mast cell; mono, monocyte; NK, natural killer cell; pDC, plasmacytoid dendritic cell; plasma, plasma cell; T, T cell; TCR, T-cell receptor; Treg, regulatory T cell. c, Maximum overlap coefficient of every global factor generated by Spectra (n=152), expiMap9 (n=197, soft_mask = True), Slalom6 (n=20), NMF (n=100) and scHPF4 (n=100) with every input gene set. Boxes and lines represent interquartile range (IQR) and median, respectively; whiskers represent ±1.5×IQR. d, Cell scores for Spectra and scanpy.score_genes1, 2 factors plotted against MAGIC-imputed (t=3) IFNGR1 expression for each cell, colored by cell type (n=97,863 cells). e, Proportion of held-out genes recovered by Spectra or Slalom6 from the Bassez dataset7, for each input gene set tested. Lines connect identical input gene sets. f, Coherence (mean pairwise log-normalized co-occurrence rate among top 50 markers) of factors generated by various factor analysis methods, using a random sample of 10,000 cells from the Bassez dataset, with 14 cell types and 181 input gene sets. The experiment was repeated n=5 times. Boxes and lines represent interquartile range (IQR) and median, respectively; whiskers represent ±1.5×IQR. g, Runtime dependence on cell number (left panel) and gene set number (right panel). The experiment was repeated n=3 times; shading indicates 95% CI.

To ensure consistent cell typing at a resolution best suited for Spectra, we independently annotated the dataset into 14 broad cell types of established discrete lineages (including CD8+ T cells, macrophages, dendritic cells and plasma cells), leaving Spectra to infer factors associated with finer cell type distinctions such as T-cell activation or macrophage polarization (Fig. 2b, Extended Data Fig. 1, Supplementary Table 2 and Methods). We provided the broad cell type labels and our immunology knowledge base as input, and fit the Spectra model using default parameters (Methods), resulting in 152 global and 45 cell-type-specific factors. Most cell-type-specific factors correspond to CD4+ T cells (n = 12), CD8+ T cells (n = 7) or myeloid cells (n = 6).

We first assessed whether Spectra can identify biologically interpretable gene programs. For every factor, Spectra estimates a dependence parameter (η) between zero and one that quantifies reliance on the gene-gene graph. Most factors (171, or 86.8%) are strongly constrained by the gene-gene graph (η ≥ 0.25), whereas 26 (13.2%) are novel (Extended Data Fig. 2). We found that factors with η ≥ 0.25 overlap substantially (≥ 0.5 of genes) with an input gene set, enabling their interpretation. In contrast, the widely used unbiased factorization approaches NMF and scHPF4,5 typically produce factors that do not agree well with annotated gene sets (Fig. 2c), underscoring the difficulty of interpreting gene programs derived by these approaches.

We next assessed whether Spectra can provide more biologically sensible factor loadings (assignments of gene programs to cells) than other supervised methods when using identical input gene sets. Spectra uses cell-type labels and cell-type-specific input gene sets to restrict factors to their appropriate cell type; for example, it limits CD8-specific TCR signaling, tumor reactivity and exhaustion factors to CD8+ T cells (Extended Data Fig. 3a). In contrast, the gene-set-based factorization method Slalom6 and autoencoder-based method expiMap9 misassign some TCR activity, CD8+ T cell exhaustion, and tumor reactivity to the myeloid, NK cell and plasma cell lineages (Extended Data Fig. 3), likely because many genes in these factors participate in multiple programs. For example, CISH is a member of the suppressor of cytokine signaling family that is induced by TCR activation11,12, but also by IL-15 in NK cells13, and is a critical regulator of dendritic cell differentiation14 (Extended Data Fig. 3c).

Pleiotropy similarly confounds the most widely used single-cell gene-set annotation tool, score_genes1,2. For example, Spectra’s interferon gamma (IFN-γ) response factor is well correlated with the IFN-γ receptor upstream of this gene program and correctly captures it across all cell types, whereas score_genes1,2 IFN-γ response is detected almost exclusively in the myeloid population (Fig. 2d). This myeloid bias is due to differences in baseline expression across cell types, especially HLA-II genes, which are preferentially expressed by myeloid antigen-presenting cells (Extended Data Fig. 4). Spectra overcomes pleiotropy by implicitly downweighting the influence of genes whose expression could be explained by multiple factors: it decomposes gene expression using the factors best supported by total expression in a given cell. Spectra is able to identify IFN-γ activity and its previously reported activation by ICT15,16 across expected immune cell types17 because it learns these factors in a cell-type-specific manner, accounting for differences in baseline expression across cell types. Thus, in addition to yielding more interpretable gene programs than other supervised methods, Spectra is better at inferring which cells these programs are active in, enabling it to detect subtle effects of ICT on multiple cell types that are missed by score_genes1,2.

Spectra outperforms other methods in delineating and assigning gene programs

We systematically benchmarked Spectra against other widely used methods by measuring how well they identify coherent gene programs and assign activity to cells. A key feature of Spectra is that it can modify input gene sets in a data-driven manner. We first evaluated the quality of output gene programs by holding out 30% of genes from 20 input gene sets and tracking whether these genes are identified in the resulting factors (Methods). Spectra factors recover many more genes than Slalom6 (Fig. 2e) and expiMap9 (Extended Data Fig. 5a–c); for example, among the 50 genes with highest gene scores (top 50 marker genes) for the MYC factor, Spectra identified 7 of 33 held-out genes (GLN3, NOP16, PAICS, APEX1, PA2G4, TSR1 and TRAP1) relevant to MYC signaling, and likewise performed well for other cellular processes, while MYC signaling was not retained by Slalom6 (Fig. 2e, Extended Data Fig. 5a and Methods). Among the genes with highest scores, Spectra also recovers known MYC targets DKC1 (ref. 18) and TOMM40 (ref. 19), which are absent from the training and hold-out sets.
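The hold-out evaluation can be sketched as follows. This is a minimal illustration under stated assumptions: genes are held out uniformly at random, and recovery is measured as the fraction of held-out genes found among a factor's top-scoring genes. The functions `split_gene_set` and `recovered_fraction`, the seed, and the placeholder gene names are hypothetical; the exact procedure is described in Methods.

```python
import random

def split_gene_set(genes, holdout_fraction=0.3, seed=0):
    """Randomly hold out a fraction of a gene set before model fitting."""
    rng = random.Random(seed)
    genes = sorted(set(genes))
    n_holdout = round(len(genes) * holdout_fraction)
    held_out = set(rng.sample(genes, n_holdout))
    kept = [g for g in genes if g not in held_out]
    return kept, held_out

def recovered_fraction(gene_scores, held_out, top_n=50):
    """Fraction of held-out genes found among the top_n genes of a factor.

    gene_scores maps gene name -> gene score for one fitted factor.
    """
    top_genes = sorted(gene_scores, key=gene_scores.get, reverse=True)[:top_n]
    hits = held_out.intersection(top_genes)
    return len(hits) / len(held_out) if held_out else 0.0

# Placeholder gene names; only `kept` would be passed to the model as input.
kept, held_out = split_gene_set([f"GENE{i}" for i in range(10)])
```

Under this measure, the MYC example in the text (7 of 33 held-out genes among the top 50) corresponds to a recovered fraction of about 0.21.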

To provide a more systematic evaluation of new genes, we reasoned that genes belonging to a shared program should appear together in the same cells. Spectra explicitly uses this coherence signal to add new genes; to ensure that the data is not overfit, we applied factor analysis with held-out cells (not used in training), and evaluated the coherence of inferred factors in the test set (Methods). Spectra and other methods that take the sparsity of scRNA-seq data into account (Slalom6, scHPF4) perform well, while generic models (NMF) do not (Fig. 2f). The key advantage of supervised approaches is that by seeding inference with a known gene set, coherent genes are more likely to be biologically meaningful. Matrix factorization methods rely on objective functions that implicitly encourage the estimation of diverse gene programs, such that programs expressed in similar contexts (e.g. T-cell activation and exhaustion) are often recovered as a single merged factor. To test the ability to robustly distinguish correlated programs, we simulated expression data from a generic factor analysis model with both correlated and uncorrelated factors (Methods). Spectra’s use of prior knowledge allowed it to separate highly correlated factors, unlike other methods (Extended Data Fig. 5d). We next asked whether Spectra can accurately quantify the activity of inferred factors across cells, which is particularly challenging because pleiotropy creates correlation between gene programs. Given the lack of ground truth, we synthesized data with features similar to real data and benchmarked the assignment of factor loadings to cells (Methods). As gene set overlap increases, score_genes1,2 surges in false positive score estimates, while Spectra is able to correctly assign expressed factors to cells (Extended Data Fig. 5e). 
Due to their multivariate nature and sparsity, factorization methods select the factors that best explain the data globally, such that each factor accounts for expression not already explained by other factors. Factor analysis is thus superior to score_genes1,2 even for the simple task of scoring gene sets.
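The coherence metric referred to above is defined in Methods, which are not reproduced here. As a rough illustration only, the sketch below assumes a topic-model-style score: for each pair of marker genes, the log of the number of cells expressing both genes (plus one) divided by the number of cells expressing the second gene, averaged over pairs. This formula is an assumption and may differ from the one used in the benchmark; the function name and toy data are hypothetical.

```python
from itertools import combinations
from math import log

def coherence(expressing_cells, markers):
    """Mean pairwise log co-occurrence rate among marker genes (assumed form).

    expressing_cells maps gene name -> set of cell IDs in which the gene is
    detected (ideally held-out cells not used for fitting).
    """
    pair_scores = []
    for gene_i, gene_j in combinations(markers, 2):
        n_both = len(expressing_cells[gene_i] & expressing_cells[gene_j])
        n_j = len(expressing_cells[gene_j])
        if n_j == 0:
            continue  # skip genes never detected in the test cells
        pair_scores.append(log((n_both + 1) / n_j))
    return sum(pair_scores) / len(pair_scores) if pair_scores else float("nan")

# Toy data: genes A and B are detected in the same cells, gene C in others.
toy_cells = {"A": {1, 2, 3}, "B": {1, 2, 3}, "C": {4, 5, 6}}
coherent_score = coherence(toy_cells, ["A", "B"])
incoherent_score = coherence(toy_cells, ["A", "C"])
```

Genes that are detected together score higher than genes detected in disjoint sets of cells, which is the behavior a coherence measure is meant to capture.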

In contrast to Spectra, Slalom6 suffers a substantial drop in accuracy as the number of active gene sets increases (Extended Data Fig. 5f). Moreover, Slalom6 can only assess a few dozen gene sets before runtime becomes prohibitive, whereas Spectra’s graph-based representation allows it to scale to hundreds of thousands of cells and hundreds of gene programs, with shorter runtimes than most other methods. When run on a graphics processing unit (GPU), Spectra outperformed all methods, including NMF and the GPU-based expiMap9 (Fig. 2g). Similarly, Spectra’s peak memory usage remained low with increasing gene set numbers, while Slalom’s6 rose sharply (Extended Data Fig. 5g). Spectra runtime and memory needs increased proportionally with the number of cell types, but remained low for typical cell type numbers (<30 min, <10 GB memory for 30 cell types, 25,000 cells, Extended Data Fig. 5g,h). Our benchmarking demonstrates that Spectra is faster and infers programs with superior interpretability and coherence, while retrieving more ground truth factors.

Spectra disentangles correlated features of CD8+ T cell tumor reactivity and exhaustion

To understand how tissues respond to cancer treatments and to ultimately improve therapeutic efficacy, we seek to quantify gene program changes under therapy on a per-cell-type basis. One population that is particularly important to track under ICT consists of non-dysfunctional tumor-reactive CD8+ T cells, a subset of T cells that recognize tumor-associated antigen20 and are also cytotoxic21, 22. These cells express clonal TCRs and specific markers (e.g. ENTPD1, CXCL13)23, 24, and can accumulate upon PD-1/PD-L1 checkpoint blockade in a process called clonal expansion 25, 26. Conversely, T cells that expand clonally under ICT are likely to be tumor-reactive22, 25. These cells may also gradually become exhausted (lose effector capacity) upon prolonged antigen exposure in the tumor microenvironment27, 28. Although exhaustion and tumor reactivity lead to different cellular behaviors with highly consequential phenotypes, their gene programs are correlated and challenging to discriminate computationally; clustering approaches, for example, typically group exhaustion, tumor reactivity and cytotoxicity features together10, 29.

We evaluated Spectra’s ability to deconvolve tumor reactivity and exhaustion programs and to quantify therapy-induced changes in each, focusing on CD8+ T cells in the Bassez dataset (Fig. 3a). The exhaustion and tumor reactivity factors scored high in Spectra’s information and importance scores (see Methods), suggesting that they explain relevant gene expression in CD8+ T cells (Extended Data Fig. 6a). Genes from these two programs are correlated in this data (Extended Data Fig. 6b), likely explaining why they were not distinguished in prior work10, 29. Visually, score_genes1, 2 analysis of the input gene sets showed similar distributions in responders and non-responders (Fig. 3b). Yet the absence of tumor-reactive, non-terminally exhausted states in responders is inconsistent with the treatment-induced clonal expansion of these states observed in mouse models30, 31 and longitudinal phenotyping of cancer patients25, 26, and it conflicts with the proven efficacy of ICT in this clinical setting32.

Figure 3 — Spectra deconvolves the highly correlated features of tumor reactivity and exhaustion in CD8+ T cells.

Analysis of breast cancer infiltrating leukocytes from the Bassez scRNA-seq dataset 10 (n=42 patients)7. a, t-SNE map of the entire dataset highlighting CD8+ T cells (left), and force-directed layout (FDL) of CD8+ T cells (n=31,925 cells) labeled by evidence of clonal T-cell expansion (responder) in the donor (right). T cells that expand clonally under ICT are likely to be tumor-reactive 17, 20. b,c, FDL of CD8+ T cells colored by tumor reactivity (left) or exhaustion (right) cell scores, and contour plots depicting cell score density distribution from patients with (responder) or without (non-responder) clonal expansion of T cells under therapy. Cell scores were obtained using scanpy.score_genes1,2 (b) or Spectra (c). Only Spectra disentangles processes with correlated gene expression. d, Pearson correlation coefficients of factor cell scores (n=31,925 cells). Tumor reactivity is highly correlated with factors related to CD8+ T cell effector function and ICT response. e, New genes identified by Spectra (n=38) among the 50 genes with highest scores in the tumor reactivity factor, highlighting processes known to be involved in tumor reactivity in different colors (see Supplementary Table 3 for full list of genes). f, Per-sample mean cell scores for the Spectra tumor reactivity factor in positive cells (loading > 0.001). Boxes and lines represent interquartile range (IQR) and median, respectively; whiskers represent ±1.5×IQR. Two-sided p values calculated using Mann-Whitney U tests; pre-anti-PD-1 (n=40): p=3.84×10⁻⁵, statistic=308, Cohen’s d=1.51; on anti-PD-1 (n=40): p=2.00×10⁻⁵, statistic=313, Cohen’s d=1.49. g, Caushi study design (n=251,777 CD8+ T cells). PBMCs and tumor infiltrating T cells (TIL) were isolated from non-small cell lung cancer (NSCLC) patients.
PBMCs were pulsed by peptide pools, expanding TCR clones were identified by sequencing, and the TCR complementarity-determining regions (CDR) were compared to single-cell TCR sequences of tumor infiltrating T cells, showing their functionally validated antigen specificity. h, Cell scores in tumor infiltrating mutation-associated neoantigen specific (MANA), Epstein-Barr virus specific (EBV) and Influenza A specific CD8+ T cells (n=1151). p values calculated using Mann-Whitney U tests.

Whereas gene set scores are markedly correlated and fail to distinguish expanding from non-expanding clones (Fig. 3b), Spectra clearly disentangles them (Extended Data Fig. 6c), identifying a substantial tumor-reactive population, with varying degrees of exhaustion, that is almost exclusive to responders (Fig. 3c). In support of these observations, incompletely exhausted ‘pre-dysfunctional’ T cells are known to be critical for ICT efficacy20. Importantly, Spectra extracts gene programs directly from the unlabeled data and does not need response status to successfully dissect these features. Spectra’s likelihood function discourages overlap between gene programs when a single program is sufficient to explain the observed count matrix, harnessing unique features of each gene-set to associate cells with the best fit program. We calculated the covariance of individual genes with their respective gene set scores, and identified CXCL13 as having highest covariance with both tumor reactivity and exhaustion, indicating that it drives overlap in these scores (Extended Data Fig. 6b). Spectra assigns a high weight for CXCL13 in the tumor reactivity but not the exhaustion factor, and strongly weights genes related to TCR signaling, T cell activation and cytotoxicity in the tumor reactivity factor. We find that only 4 genes overlap among the 50 highest scoring genes for each factor, whereas the exhaustion factor mostly includes exhaustion-inducing transcription factors (TOX, NR4A1) and PDCD1 (PD-1). After establishing that the tumor reactivity factor describes biologically relevant features, we explored its correlation with other factors in CD8+ T cells. Reactivity correlates with proliferative programs, as expected for cells that expand under ICT, as well as oxidative phosphorylation and glycolysis, processes associated with enhanced CD8+ T cell effector function33, and IFN–γ signaling, a key mediator of ICT efficacy15 (Fig. 3d). 
Of the 50 genes in tumor-reactive CD8+ T cells with highest gene scores, 38 are outside the input gene set, but recent studies support their roles in tumor reactivity (Fig. 3e and Supplementary Table 3) 34–39. Separating tumor reactivity also allowed us to assess its independent contribution to clonal expansion and therapeutic response 37–41. We found that expression of this factor is higher in responders (patients with clonal T cell expansion) at baseline than non-responders, and in responders it increases further under therapy (Fig. 3f), consistent with the reported association between tumor-reactive cell clusters and therapeutic response42, 43. Spectra thus cleanly disentangles a CD8+ T cell tumor reactivity program that is associated with response to ICT at the cell and patient levels.
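The responder versus non-responder comparison (Fig. 3f) reports a Mann-Whitney U statistic and Cohen's d on per-sample mean cell scores. The sketch below shows how these two quantities can be computed, assuming the pairwise-count form of U and the pooled-standard-deviation form of Cohen's d; the text does not state which variants were used, and the function names and scores are hypothetical. It does not compute p values, for which a statistics library such as SciPy (`scipy.stats.mannwhitneyu`) would typically be used.

```python
from math import sqrt
from statistics import mean, variance

def mann_whitney_u(x, y):
    """U statistic for x: number of (x_i, y_j) pairs with x_i > y_j.

    Ties contribute 0.5 each.
    """
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in x
        for b in y
    )

def cohens_d(x, y):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)

# Hypothetical per-sample mean cell scores (not taken from the study).
responder_scores = [0.4, 0.5, 0.6]
non_responder_scores = [0.1, 0.2, 0.3]

u_statistic = mann_whitney_u(responder_scores, non_responder_scores)
effect_size = cohens_d(responder_scores, non_responder_scores)
```

Reporting an effect size alongside the rank-based test conveys the magnitude of the difference as well as its statistical significance.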

T cells selectively kill cancer cells via the binding of their TCR to tumor antigens presented on the cell surface. Tumor-reactive T cells recognize antigens resulting from unique mutations in cancer cells, so-called mutation-associated neoantigens (MANA). To test whether our tumor reactivity gene program identified T cells with MANA-specific TCRs at the single-cell level, we leveraged the lung cancer cell atlas of Caushi and colleagues34 (‘Caushi dataset’), consisting of tumor-infiltrating T cells with paired single-cell TCR sequencing and functional validation of TCR antigen specificity (Fig. 3g). Spectra detected 173 factors in the Caushi dataset, including one tumor reactivity factor. Despite the entirely different context and tumor type, 13 genes overlap among the 50 with highest gene scores in both the Caushi and Bassez reactivity factors (Extended Data Fig. 6d). These include key markers ENTPD1 and CXCL13, as well as markers learned from the data such as CLL337, 38 and LAG336. Moreover, the Caushi reactivity factor is almost exclusively expressed in T cells with a MANA-specific TCR, but not in ‘bystander’ T cells that cannot recognize cancer cells and instead possess TCRs against unrelated antigens such as Influenza A and Epstein-Barr virus (Fig. 3h). This independent, functionally validated dataset thus provides strong support for the tumor reactivity factors recovered by Spectra, and suggests that transcriptional features of tumor-reactive T cells are shared across tumor types.

In contrast to Spectra, Slalom6 only found a factor highly enriched for exhaustion genes in the Bassez dataset, which overlaps with the highest scoring factor for tumor reactivity by 35 genes (Extended Data Fig. 6e). scHPF4 factors are not enriched for either reactivity or exhaustion gene sets, whereas expiMap9 identified and successfully deconvolved the two factors (Extended Data Fig. 6e). However, only Spectra was able to distinguish a clonally expanding tumor-reactive T cell population that is specific to responders (Extended Data Fig. 6f). Moreover, Slalom6, scHPF4 and expiMap9 tumor reactivity and Spectra exhaustion factors failed to associate with patient-level response, defined as a significant difference between expression in responders and non-responders before or under ICT (Extended Data Fig. 6g).

Spectra is thus unique in its ability to disentangle tumor reactivity and exhaustion programs in CD8+ T cells, making it possible to identify tumor-reactive populations across cancer types, quantify their level of exhaustion, and find novel mediators of tumor reactivity that can be associated with patient-level therapeutic responses and nominated as candidate targets for enhancing ICT efficacy.

Uncovering metabolic pathway utilization patterns across tumor infiltrating leukocytes

Metabolic processes are fundamental to cancer progression and therapeutic response, in part because cancer and immune cells compete for scarce nutrients such as essential amino acids44, 45. Metabolic rates critically depend on nutrient availability and therefore care must be applied when inferring metabolic activity from gene expression; however, metabolic gene programs are regulated by metabolic need and availability and are therefore informative46. Analyzing cellular metabolism has nevertheless proven very challenging, as participating genes are involved in multiple pathways45. We asked whether Spectra’s ability to deconvolve overlapping gene programs (Extended Data Fig. 5d,e) can empower the inference of metabolic processes from tumor immune cells in the Bassez breast cancer dataset10. Spectra identified gene programs related to all 89 metabolic input gene sets (overlap coefficient >0.25) and determined their expression across cell types, recapitulating known macrophage metabolic characteristics such as iron uptake, iron storage47, 48 and cholesterol synthesis by cytochrome P450 enzymes49, 50, as well as DNA synthesis in cycling germinal center B cells (Fig. 4a).
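The matching criterion used above (overlap coefficient >0.25 between a factor and an input gene set) can be illustrated with a minimal sketch. It assumes the standard Szymkiewicz-Simpson definition of the overlap coefficient, which this section does not spell out; the function name and the toy gene identifiers are hypothetical.

```python
def overlap_coefficient(set_a, set_b):
    """Szymkiewicz-Simpson overlap: |A intersect B| / min(|A|, |B|)."""
    a, b = set(set_a), set(set_b)
    if not a or not b:
        return 0.0  # convention chosen here for empty inputs (assumption)
    return len(a & b) / min(len(a), len(b))

# Hypothetical toy data: placeholder gene identifiers, not real annotations.
input_gene_set = {"G1", "G2", "G3", "G4"}
factor_top_genes = {"G1", "G2", "G3", "G7", "G8", "G9"}

score = overlap_coefficient(input_gene_set, factor_top_genes)
matched = score > 0.25  # threshold quoted in the text
```

Because the denominator is the size of the smaller set, a short input gene set that is fully contained in a factor's marker genes scores 1 regardless of how many additional genes the factor includes.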

Figure 4 — Spectra reveals cell-type-specific metabolic profiles in breast cancer data.

a, Mean cell score among positive cells (score > 0.01) per cell type for each Spectra metabolic factor identified in the Bassez dataset7 (n=97,863 leukocytes). b, Input genes and new genes inferred by Spectra in the lysine metabolism pathway. c, Overlap of genes from the input lysine metabolism gene set with the top 50 marker genes from lysine metabolism factors identified in the Bassez10 and Zhang29 datasets. d, t-SNE embeddings of tumor infiltrating leukocytes, colored by Spectra factor cell scores in the Bassez (n=97,863 leukocytes) and Zhang (n=150,985 leukocytes) datasets. ILC3, innate lymphoid cell type 3; T, T cell; gdT, γδ T cell; pDC, plasmacytoid dendritic cell; mac, macrophage; mono, monocyte; NK, natural killer cell; B, B cell; mast, mast cell; Treg, regulatory T cell; DC, dendritic cell; GC B, germinal center B cell; plasma, plasma cell.

Spectra also uncovered novel cell-type-specific expression of amino acid metabolic factors, such as cysteine metabolism in γδ-T cells and lysine metabolism in plasma cells (Fig. 4a). The lysine metabolism factor also scored high in Spectra’s information and importance scores, suggesting that it explains relevant gene expression across cells (Extended Data Fig. 7a). Lysine is an essential amino acid found at lower concentrations in malignant lesions than adjacent normal breast tissue, likely due to the high metabolic demands of the tumor51. Among the top 50 marker genes of the lysine factor, Spectra retained 72% of the input gene set, including all key metabolic enzymes; it removed redundant amino acid transporters (SLC25A21, SLC38A4, SLC6A14, SLC7A3); and it retrieved degradation genes (PYCR2, PYCR3, SLC25A15) not found in the input set (Fig. 4b). Moreover, Spectra added genes involved in the unfolded protein response, including the two pivotal initiators XBP1 and ATF6 and their downstream targets (ERLEC152, SDF2L153, HERPUD154, PDIA655). These genes are expressed more coherently and at higher levels in plasma cells than other cells, suggesting coordinated expression as a gene program (Extended Data Fig. 7b). ER stress regulates the capacity of plasma cells to produce immunoglobulins52, likely because large quantities of misfolded antibodies52 must be degraded, generating significant lysine53.

Slalom6 identified a factor in plasma cells with poorer resemblance to lysine metabolism (overlap coefficient 0.17 vs 0.72) and less biological coherence (Extended Data Fig. 7c), possibly because Slalom6, unlike Spectra, does not use cell-type-specific gene scalings. The Slalom6 lysine factor misses ER stress genes and is contaminated with cell type markers (e.g. SDC1, MZB1), likely for the same reason (Extended Data Fig. 7d). expiMap9 identified a lysine metabolism factor which was homogeneously expressed across cells (Extended Data Fig. 7ce). scHPF4 lacks gene set inputs and failed to identify a factor enriched for lysine metabolism genes; moreover, the scHPF4 factor with the highest number of lysine genes (2 of 18) is not expressed in plasma cells (Extended Data Fig. 7ce).

To gauge Spectra’s stability and the reproducibility of its lysine factor, we fitted an independent Spectra model onto data from patients with metastatic breast cancer biopsied before and during paclitaxel chemotherapy with or without anti-PD-L1 (Zhang dataset)29, using identical parameters (Extended Data Fig. 1b). Of the top 50 marker genes identified in the Bassez dataset10, 28 were also identified in the Zhang dataset (Fig. 4c). This includes 17 of the 37 new genes learned directly from both datasets, and encompassed ER stress genes. The lysine metabolism factors from both datasets were specifically expressed in plasma cells, supporting the reproducible identification of both marker genes and cell scores by Spectra (Fig. 4d). Our results link lysine metabolism and ER stress as features of tumor infiltrating plasma cells in breast cancer.

A novel gene program describes continuous macrophage state changes under therapy

Macrophages critically shape anti-tumor immunity and mediate resistance to immune checkpoint therapy by adopting immunosuppressive phenotypes under therapy (adaptive resistance); however, the effect of ICT on macrophage gene programs and their association with response remains unclear54, 55. Bassez and colleagues10 linked a macrophage cluster expressing the complement gene C3 to therapy resistance (Extended Data Fig. 8a,b). While marker genes can help to identify cell populations that change under ICT, they do not necessarily represent biological processes. For example, complement genes such as CFB (which activates C3 (ref. 56)) exhibit opposite trends to C3 and are more highly expressed in responders (Extended Data Fig. 8b,c).

We evaluated whether Spectra can identify more interpretable gene programs underlying macrophage cell states and adaptive resistance mechanisms, using diffusion components to effectively visualize continuous states57. Specifically, we found that diffusion components 2 and 4 (DC2 and DC4) are most relevant for capturing maturation from monocyte-like to macrophage states, and for separating responders from non-responders, respectively (Fig. 5a and Extended Data Fig. 8d). Cell scores for Spectra factors form gradients along DC2 with successive peaks from monocyte to macrophage states, beginning with CYP enzyme activity and TNF–α signaling (required for monocyte survival58), followed by glycolytic activity (likely required for monocyte activation59), a novel factor containing invasive and angiogenic mediators (‘invasion program’), and finally complement production, a key feature of mature macrophages60. Along DC4, Spectra identified programs for high type-II IFN signaling and MHC-II antigen presentation at one extreme, followed by IL-4/IL-13 response, and hypoxia signaling and the invasion program at the other (Fig. 5a; see Supplementary Table 4 for factors associated with each DC). Spectra can thus characterize continuous macrophage states in tumor tissue, including hypoxia and related invasion programs, which may represent critical states along the macrophage spectrum.

Figure 5 — Spectra reveals therapy-induced macrophage gene expression programs.

a, Macrophage cells plotted along diffusion components DC2 and DC4, colored by patient-level T cell expansion status (responder, non-responder) in the Bassez dataset7 (n=12,132 cells). Heatmaps indicate z-scored gene program cell scores, smoothed by fitting a generalized additive model (Methods). b, Graph with nodes representing cellular neighborhoods (n=858) plotted along DC2 and DC4, and edges representing overlap, colored by log2(fold-change) under anti-PD-1 as estimated with Milo (Methods). The log2(fold-change) of non-significant (FDR > 0.05) neighborhoods is set to 0. c, Average cell scores of macrophage neighborhoods (n=858) enriched in non-responders under therapy, and cell scores for all other macrophage neighborhoods in the independent Bassez and Zhang breast cancer datasets. Cell scores were calculated using the Spectra invasion factor (factor 182 from Bassez) or by using scanpy.score_genes1,2 on the top 50 marker genes of factor 182 in Zhang. P values (two-sided) were calculated using Mann-Whitney U tests. Bassez: p-value = 4.96 × 10^-5, statistic = 1060, Cohen’s d = 1.49; Zhang: p-value = 3.74 × 10^-12, statistic = 600886, Cohen’s d = 1.03. Boxes and lines represent interquartile range (IQR) and median, respectively; whiskers represent ±1.5 × IQR. d, Mean expression, z-scored across cells (n=12,132 cells) in color code below and dot size indicating the percentage of cells with at least one detected copy of the indicated genes with legend below, of factor genes in non-responder macrophage populations and other macrophage populations in the Bassez (n=12,132 cells) and Zhang (n=3,206 cells) datasets.

To find macrophage states that only change in non-responders under ICT, and could therefore confer adaptive resistance in patients, we used the Milo algorithm61, which avoids discretizing the macrophage phenotypic continuum. Milo revealed overlapping cellular neighborhoods (states) that only expand under anti-PD-1 therapy in non-responders (Fig. 5b) and are high in the novel invasion program (Fig. 5c). This invasion program exhibits low graph dependence (η=0.24), indicating that it does not correspond to any input gene set; moreover, Slalom6 and scHPF4 do not identify a similar program (Extended Data Fig. 8e,f). Its high importance and information scores also suggest that it explains relevant macrophage gene expression (Extended Data Fig. 8g). The individual invasion program genes are coherently expressed in macrophages, only increase in non-responders, and include known invasion and metastasis mediators (CTSL62, CTSD63, CTSB64, CHI3L165, SPP166, PLIN267). Furthermore, the invasion program includes inflammation modulators (TREM168, TREM269, GPNMB70) and cholesterol metabolism genes (APOE71, 72, APOC173, CYP27A174), some of which have been linked to the suppression of inflammatory cytokine release (IL-6, TNF-α)70. Our results suggest that, in patients who do not respond to ICT, macrophages may upregulate these genes coordinately, constituting a new macrophage program (Fig. 5d). By focusing on residual expression that is not well explained by the gene knowledge graph, Spectra can thus find a novel gene program that is both interpretable and related to ICT response, unlike unsupervised approaches such as scHPF4, or fully supervised approaches such as Slalom6.

To test for replication on independent data, we scored expression of the top 50 marker genes of the Spectra invasion factor in the Zhang dataset29. We ran Milo and identified macrophage populations that expand in radiological non-responders. Despite the different clinical setting of metastatic tumors, we observed high expression in macrophage populations enriched in non-responders (Fig. 5c). The same invasion and cholesterol metabolism genes as identified in the Bassez dataset also showed higher expression in the Zhang dataset, validating our invasion program (Fig. 5d). Spectra thus identifies a novel pro-metastatic gene program that is upregulated under anti-PD-1/PD-L1 in therapy-resistant breast cancer patients, with implications for understanding adaptive resistance mechanisms and macrophage polarization.
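The group comparisons reported for Fig. 5c (Mann-Whitney U statistic and Cohen's d between neighborhoods enriched in non-responders and all other neighborhoods) can be sketched as follows. This is an illustration under stated assumptions: the manuscript does not say which variant of Cohen's d was used, so the pooled-standard-deviation form is assumed here, the U statistic is computed by direct pair counting without a p value, and the score values are made up.

```python
from statistics import mean, variance

def cohens_d(x, y):
    """Cohen's d using a pooled standard deviation (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5

def mann_whitney_u(x, y):
    """U statistic for x: count of pairs with x_i > y_j, ties counted as 0.5."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

# Hypothetical average cell scores per neighborhood (illustrative values only).
enriched_in_nonresponders = [0.9, 1.1, 1.3, 1.0]
other_neighborhoods = [0.4, 0.6, 0.5, 0.7, 0.5]

u_stat = mann_whitney_u(enriched_in_nonresponders, other_neighborhoods)
effect = cohens_d(enriched_in_nonresponders, other_neighborhoods)
```

In practice a library routine such as SciPy's `scipy.stats.mannwhitneyu` would also supply the two-sided p value; the pair-counting version is shown only to make the statistic explicit.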

Spectra factors generalize to hundreds of patients without explicit batch correction

Technical differences between patient samples and cohorts (batch effects) tend to be confounded with biological differences, making them difficult to distinguish. While batch correction approaches can help, they often remove subtle, yet important, biological signals75. We therefore asked whether Spectra can find shared features without explicit batch correction.

The scRNA-seq lung cancer atlas from Salcher and colleagues76 comprises 1.28 million cells from 19 studies and 318 patients (‘Salcher atlas’), including a single study that uses cryopreserved cells and exhibits a strong batch effect involving all cell types (Fig. 6a). We applied Spectra with default parameters and our immunology knowledge base (with 20 appended gene sets for epithelial and stromal cells), and found 11 global factors that are batch-specific based on their cross-study entropy (Methods); 10 of these are specific to the cryopreserved cell study and account for its batch-driven variation (Fig. 6b), suggesting that Spectra discovers and assigns batch-specific variation to distinct factors. As it did for the Bassez and Zhang breast cancer datasets, Spectra identified lysine metabolism, CD8+ T-cell-specific tumor reactivity, and macrophage-specific invasion factors in the Salcher atlas without explicit batch correction. Despite the differences in tumor type and clinical cohort, 10 to 13 of the top 50 marker genes in the lysine factor were shared among the three datasets, and of these, 6 to 12 genes were absent from the input (Fig. 6c). Newly discovered shared genes include ER stress transcription factors XBP1/ATF6 targets SDF2L177 and PDIA678 (in the lysine metabolism factor); TCR signaling target BATF35, 39 and the clinically targetable immune checkpoint gene LAG336, 79 (tumor reactivity factor); and invasion mediators CTSL/CTSD62, 63 and inflammatory mediators TREM168 and GPNMB70 (macrophage invasion factor). The identified factors showed high stability across datasets in the Salcher atlas, and the lysine metabolism factor was expressed at much higher levels in plasma cells, as we observed in breast cancer, in most Salcher datasets (13 of 19 studies, p < 10^-12 compared to each cell type across all patients) (Fig. 6d).
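One way to read the cross-study entropy criterion is sketched below. The exact definition is given in the Methods and is not reproduced in this section, so the form used here (Shannon entropy of a factor's mean cell score per study, normalized to sum to 1) is an assumption, as are the function name and the toy values.

```python
import math

def cross_study_entropy(mean_scores):
    """Shannon entropy (in nats) of normalized per-study mean factor scores."""
    total = sum(mean_scores)
    if total <= 0:
        return 0.0
    probs = [s / total for s in mean_scores if s > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical mean cell scores of two factors across four studies.
shared_factor = [0.2, 0.2, 0.2, 0.2]     # expressed evenly across studies
batch_factor = [0.97, 0.01, 0.01, 0.01]  # dominated by a single study

e_shared = cross_study_entropy(shared_factor)
e_batch = cross_study_entropy(batch_factor)
```

Under this reading, a factor confined to one study has entropy near 0, while a factor expressed evenly across S studies approaches log(S), so ranking factors by ascending entropy surfaces the batch-specific ones first.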

Figure 6 — Spectra gene programs are reproducible across multiple studies.

a, UMAPs of whole lung tumor single cell suspensions from 19 studies comprising 318 patients and 1.28 million cells, calculated on PCs from non-batch-corrected median library size normalized log1p-transformed data. UMAPs colored by study of origin (left) or cell type annotation used for fitting Spectra (right). b, Global factors showing the smallest entropy across studies. Several factors are predominantly expressed in the Adams et al. studies which uniquely used frozen instead of fresh cells. Haber., Habermann; Lambr., Lambrechts; Laugh., Laughney; Madis., Madisson; Mayn., Maynard; Reyf., Reyfman; Travag., Travaglini; met., metabolism; demethyl., demethylation; glycerophosph., glycerophospholipid metabolism. c, Overlap between input gene set and top 50 marker genes from the Bassez, Zhang, and Salcher datasets for the lysine metabolism factor (left), CD8+ T cell tumor reactivity factor (middle), and macrophage invasion factor (right, no input gene set because this is a new factor). d, z-scored mean cell scores of the lysine metabolism factor (number 56) per study and cell type in the Salcher atlas. Side bars below indicate the mean z-score per column and on the right show the patient numbers per study. Two-sided p values calculated using Wilcoxon matched-pairs signed rank tests comparing the mean cell scores per patient of plasma cells with the indicated cell type. 
Dendritic cells (DC): statistic =3152, Cohen’s d=0.7562, macrophages (Mac): statistic =2350, Cohen’s d=0.8648, granulocytes (gran): statistic =516, Cohen’s d=0.9140, regulatory T cells (Treg): statistic =4441, Cohen’s d=0.6077, fibroblasts (fibro): statistic =6782, Cohen’s d=0.5093, mast cells (mast): statistic =5348, Cohen’s d=0.5209, NK cells: statistic =3883, Cohen’s d=0.7010, epithelial cells (epi): statistic =3214, Cohen’s d=0.7340, CD8 T cells: statistic =4555, Cohen’s d=0.7035, T cells (T): statistic =3345, Cohen’s d=0.5601, B cells (B): statistic =2903, Cohen’s d=0.7725, CD4 T cells (CD4): statistic =2385, Cohen’s d=0.8790, endothelial cells (endo): statistic =5978, Cohen’s d=0.5014. e,f, Mean cell scores in positive (score > 0.001) CD8+ T cells (e) or macrophages (f) for the tumor reactivity factor number 174 (e) and the macrophage invasion factor number 193 (f) in ever smokers and never smokers (upper panels), or in EGFR-mutated and EGFR wild type patients (lower panels). p values were calculated using Mann-Whitney U tests (two-sided). Tumor reactivity smoking: n=153, p=0.0022, statistic =3500, Cohen’s d=0.45; tumor reactivity EGFR: n=30, p=0.18, statistic =78, Cohen’s d=0.52; invasion smoking: n=147, p=0.051, statistic =2928, Cohen’s d=0.30; invasion EGFR: n=32, p=0.010, statistic =59, Cohen’s d=1.17.

The high stability of factors across datasets prompted us to examine whether Spectra can discover meaningful associations with two clinically important variables, EGFR mutation and smoking status. While EGFR-mutated tumors are resistant to immune checkpoint blockade80, smokers respond more frequently81. Tumor reactivity cell scores are higher in CD8+ T cells from tumors of smokers than non-smokers (p=0.002), and are higher in EGFR wild type than mutated tumors (p=0.180) (Fig. 6e). This is consistent with a retrospective cohort study linking infiltration of CD39+ tumor reactive T cells to tobacco exposure in lung cancer patients43 (CD39 is a key marker of the Salcher tumor reactivity program). The macrophage invasion factor similarly showed higher cell scores in macrophages from smokers (p=0.051) and EGFR wild type tumors (p=0.010) (Fig. 6f). In the breast cancer datasets, this factor is associated with resistance to ICT (Fig. 5c), and independent studies suggest that its marker genes are involved in suppressing anti-tumor immunity (FABP582, TREM168). The macrophage invasion factor may therefore constitute an adaptive resistance mechanism which counters anti-tumor immunity in smokers and EGFR-mutated tumors. Our observations demonstrate that Spectra finds subtle programs across batches and patients without requiring explicit batch correction. Although patient or sample-level phenotypic association has been attempted with cell type fractions, Spectra factors make it possible to associate clinical phenotypes with cell-type-specific gene programs—a promising strategy for biomarker discovery and gaining insights into cancer biology.

Discussion

Spectra anchors data-driven factorization with prior biological knowledge to infer factors that are coherently expressed, interpretable, and not polluted by cell type markers. The algorithm modifies each factor to the dataset’s biological context by upweighting novel genes that are tightly expressed with the bulk of factor genes and downweighting genes that are not. This enables the dissection of programs that are highly correlated, such as T-cell exhaustion and tumor reactivity in a tumor context. We demonstrate that expression of this T cell tumor reactivity program separates breast cancer patients by their clonal expansion status after anti-PD1 treatment (while less coherent factors estimated by other methods fail to do so), and is replicated in an independent study of lung cancer patients with extensive functional validation of T-cell specificity.

A unique feature of Spectra is its unified approach for dealing with multiscale expression variance—the fact that differences related to cell type dominate the marginal gene-gene covariance matrix, obscuring higher-resolution cell-type-conditional covariance structure. Spectra addresses this by accepting cell type labels as input and explicitly modeling cell-type-specific factors that can account for local correlation patterns. As a result, it reliably identifies programs that are conserved across multiple cell types relating to metabolism, response to cytokine signaling, differentiation and growth, while separately estimating the cell-type-specific components of these programs.

Spectra’s interpretability depends on both its probabilistic model and the quality of prior knowledge used as input. We compiled an immunology knowledge base of high confidence gene sets for 50 cell types and 181 cellular processes, which can improve the analysis of immune scRNA-seq data using Spectra or other supervised methods. However, Spectra does not require input gene sets to be of high quality, nor very relevant to the dataset being analyzed. Its biologically grounded prediction model for gene-gene relationships flexibly accounts for potentially noisy edges in the input graph. The algorithm adaptively tunes its reliance on prior information based on concordance of the input graph with observed data, and it allocates novel factors when prior information is insufficient to explain observed expression. This property enabled the discovery of a novel cancer invasion gene program describing an axis of variation in tumor-associated macrophages that is strongly related to anti-PD-1 therapy resistance, which replicated upon transfer to two independent datasets.

The main simplifying assumption made by Spectra and other factor analysis methods is that factors combine linearly to drive gene expression. Though occasionally violated, this approximation establishes a direct relationship between estimated model parameters and each factor’s contribution to gene expression; moreover, it avoids severe issues of non-identifiability and indeterminacy during optimization. The tradeoff is that nonlinear cooperative effects of multiple factors on gene expression cannot be determined via matrix decomposition. T cell priming, for example, represents a nonlinear interaction between TCR stimulation, coactivation and cytokine signaling, that initiates differentiation along a terminal effector memory precursor trajectory. Uncovering nonlinear relationships in an interpretable manner is a future goal of factorization.

We anticipate that Spectra will be useful in assessing the determinants of heterogeneity in large scale scRNA-seq studies. Spectra factors are stable across two breast cancer datasets and a very large lung cancer atlas, totaling over 1.5 million cells from 375 patients and 21 studies, demonstrating that the method finds robust biological signal while overcoming batch effects (without explicit batch correction) and scaling well for atlas interrogation. Importantly, Spectra factors make it possible to associate clinical covariates with cell-type-specific gene programs. The strong association of some factors with immunotherapy response and patient phenotypes suggests that Spectra can nominate novel clinical biomarkers and therapeutic targets. In addition, the ability to transfer factors learned from one dataset to another can advance our ability to iteratively transfer and refine knowledge across scRNA-seq studies without requiring data integration.

Methods

Overview of Spectra

Spectra (https://github.com/dpeerlab/spectra) addresses these issues by grounding data-driven factors with prior biological knowledge (Supplementary Fig. 1). First, Spectra takes in biological prior information in the form of cell type labels and explicitly models separate cell type specific factors that can account for local correlation patterns. This explicit separation of cell type specific and global factors enables the estimation of factors at multiple scales of resolution. Secondly, Spectra resolves indeterminacy of the reconstruction loss function via a penalty derived from a gene-gene knowledge graph that encourages solutions that assign similar latent representations to genes with edges between them. To account for prior information of variable relevance and quality, Spectra adaptively tunes its reliance on prior information based on concordance of the prior and observed expression data. Finally, novel factors are adaptively allocated when prior information is insufficient to explain the observed expression data.

In the first step of Spectra, a set of gene-gene similarity graphs is built by aggregating information across gene sets and/or other sources. This graph representation is flexible and can accommodate various types of prior knowledge: gene sets can be incorporated into graphs by including edges between genes that are annotated to the same pathway, while existing datasets can be used to generate annotations by thresholding partial correlations or factor similarity scores. This representation lends itself to computational convenience, as the graph dimensions are fixed regardless of the size of the input annotations. The annotations are either labeled as cell type specific or have global scope. A separate graph is thus built for each cell type alongside a global graph.
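The graph construction described above can be sketched in a few lines. This is a minimal illustration rather than the Spectra implementation: the function name, the dictionary layout of the annotations, and the toy gene sets are hypothetical, and only the rule stated in the text (an edge between two genes annotated to the same gene set, with one graph per scope) is encoded.

```python
from itertools import combinations

def build_gene_graph(gene_sets, genes):
    """Symmetric 0/1 adjacency matrix over `genes` (as a list of lists).

    Two genes are joined by an edge if they appear in the same gene set.
    """
    index = {g: i for i, g in enumerate(genes)}
    p = len(genes)
    adj = [[0] * p for _ in range(p)]
    for members in gene_sets.values():
        present = sorted(g for g in set(members) if g in index)
        for a, b in combinations(present, 2):
            adj[index[a]][index[b]] = 1
            adj[index[b]][index[a]] = 1
    return adj

# Toy annotations for illustration: one global scope and one cell type scope.
genes = ["XBP1", "ATF6", "PDIA6", "TOX", "PDCD1"]
annotations = {
    "global": {"toy_er_stress": ["XBP1", "ATF6", "PDIA6"]},
    "CD8_T": {"toy_exhaustion": ["TOX", "PDCD1"]},
}
graphs = {scope: build_gene_graph(sets, genes)
          for scope, sets in annotations.items()}
```

Each graph is p × p for p genes no matter how many gene sets are supplied, which is the fixed-dimension property noted in the text.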

In the second step, Spectra learns a multidimensional parameter for each cell and each gene, representing each cell and each gene’s distribution over gene expression programs. Similarity of the parameters between genes indicates that these genes are likely to have an edge joining them, while similarity of the parameters between a cell and a gene indicates that the cell is likely to express that gene. Hence, the graph encodes the prior that genes with edges between them are likely to be expressed by the same set of cells. In practice, we take a number of additional steps to fulfill the following desiderata: (1) factors not represented in the annotations can be discovered; (2) low quality annotations can be removed; and (3) discrete cell types are assumed to be fixed and known and therefore not captured as factors by the model.

In order to avoid penalizing novel factors that have no relation to the annotations, we introduce a weighting matrix that scales the computation of gene-gene similarity scores by factor specific weights that are learned from the data. Factors that have low weight are not used in computing edge probabilities while factors with high weights influence the edge probabilities directly. Hence Spectra can estimate similar parameters for two genes without forcing a high edge probability between them, so long as the factors corresponding to these genes also have low weight. These weights allow the addition of new, unbiased factors that are not influenced by the input annotations. Importantly, weights are estimated from the data allowing for an adaptive determination of the relative number of unbiased and biased factors. An estimated background rate of edges in the graph allows for the removal of annotations with little supporting evidence from gene expression data. Finally, Spectra explicitly separates global and cell type specific factors by enforcing a cell type determined block sparsity pattern in the cell loading matrix. Cell type specific factors capture within cell type variation while global factors capture any variation that is shared across multiple cell types. To reduce the burden of modeling constitutively expressed cell type marker genes, each factor’s contribution to gene expression is multiplied by a cell type specific gene weight. These cell type specific gene weights explain away the influence of cell type marker genes and hence mitigate the tendency of these marker genes to influence the factors themselves.
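The block sparsity pattern in the cell loading matrix can be pictured with a small mask. This is a sketch under assumptions: the column ordering (global factors first, then one block per cell type), the function name, and the toy sizes are illustrative choices, not taken from the Spectra code.

```python
import numpy as np

def loading_mask(cell_types, n_global, n_specific):
    """Boolean mask (cells x factors): True where a cell may load on a factor.

    Columns are ordered [global | block of type 0 | block of type 1 | ...].
    `n_specific` maps each cell type to its number of type-specific factors.
    """
    offsets = {}
    start = n_global
    for cell_type, count in n_specific.items():
        offsets[cell_type] = (start, start + count)
        start += count
    mask = np.zeros((len(cell_types), start), dtype=bool)
    mask[:, :n_global] = True            # global factors are open to every cell
    for i, cell_type in enumerate(cell_types):
        lo, hi = offsets[cell_type]
        mask[i, lo:hi] = True            # only the cell's own type-specific block
    return mask

cell_types = ["T", "T", "mac"]
mask = loading_mask(cell_types, n_global=2, n_specific={"T": 1, "mac": 2})

# Masked cell loadings: entries outside the allowed blocks are exactly zero.
alpha = np.random.default_rng(0).random(mask.shape) * mask
```

With this pattern a macrophage-specific factor can only explain variation among macrophages, while the global columns remain available to capture variation shared across cell types.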

Components of the Spectra objective function

Broadly speaking, Spectra fits a set of factors and cell scores by minimizing an objective function with two components. The first component of the objective function, Reconstruction, measures how well the estimated model parameters can reconstruct (or predict) the observed expression data using the set of all model parameters, Θ. We write Reconstruction(Θ) to emphasize that Reconstruction is a function that maps a set of model parameters to a corresponding objective value. The second component of the objective function measures how well the set of model parameters Θ corresponds to our biological prior information. This second component is denoted Graph(Θ). A user-defined hyperparameter λ sets the relative weight of the two components, which allows a user to control the level of confidence placed in the given biological prior information. The general form of the Spectra objective function is:

$$\mathcal{L}(\Theta) = \lambda\,\mathrm{Reconstruction}(\Theta) + \mathrm{Graph}(\Theta)$$

Below we describe the precise functional forms of each of the objective function components.

Reconstruction(Θ): Modeling gene expression as a low rank product

We assume that the expression variation observed in the count matrix is driven by variation in the activity of different biologically meaningful gene programs, as well as technical variation which often involves highly expressed genes. Therefore, our model of gene expression needs to account for both components. In more detail, interpretation of factors estimated from single cell RNA sequencing data is often hindered by highly expressed genes, which factor analysis methods based on reconstruction loss functions must account for. Housekeeping genes required for basal cellular function such as GAPDH, ACTB, and ribosomal genes are expressed at high levels, and hence unduly influence the reconstruction loss function despite the fact that their expression variance is explained in large part by overall levels of transcription. As a result, existing matrix decomposition methods tend to put high weight on such non-specifically expressed genes, though post hoc corrections can be applied for the interpretation of individual factors. On the other hand, certain important cytokines and chemokines (e.g. IL4, IL6, IL2, IL10), receptors (CXCR1, CXCR2), and transcription factors (RORC, BATF3) are expressed in low mRNA copy number. Normalization strategies that rescale features empirically tend to amplify measurement uncertainty associated with lowly expressed genes, leading matrix factorization methods to overfit and return low quality gene expression programs. To address this, we introduce gene scale factors $g_j$ that are estimated from the data and allow the model to explain high expression and variability of certain genes without increasing the magnitude of these genes’ factor weights. Since lowly expressed genes are correspondingly noisier, we bound the gene scale factors from below by a tuning parameter $\delta$.

By way of notation, $X$ refers to the processed gene expression matrix, with entry $X_{ij}$ containing the gene expression value for cell $i$ and gene $j$. The matrix $X$ has $n$ rows (the number of cells) and $p$ columns (the number of genes). $K$ refers to the number of gene expression programs unless otherwise specified. Additionally, for a given cell indexed by $i$, the cell loading, a set of weights across the set of factors, is denoted by $\alpha_i$. The distribution across factors for gene $j$ is denoted as $\theta_j$, which sums to 1 over the $K$ gene expression programs: $\sum_{k=1}^{K}\theta_{jk} = 1$. Unsubscripted variables refer to the collection containing all possible subscripts, e.g. $\theta$ refers to the collection of all $\theta_j$. The base expression model describing the gene expression measurement for cell $i$ and gene $j$ is:

$$\mathbb{E}[X_{ij}] = (g_j + \delta)\,\alpha_i^{\top}\theta_j$$

with $g_j \in [0,1]$ a gene scaling parameter, $\alpha_i \in \mathbb{R}_+^{K}$ and $\theta_j \in \Delta^{K-1}$ (where $\Delta^{K-1}$ is the set of positive $K$-vectors that sum to 1). The low rank decomposition of this expression model can be visualized in Supplementary Fig. 2.
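The base model above can be sketched numerically as follows. This is a minimal illustration rather than the Spectra implementation: the function name `expected_expression`, the toy dimensions and the random parameter draws are all assumptions made for the example.

```python
import numpy as np

def expected_expression(alpha, theta, g, delta):
    """Return the (n, p) matrix with entries (g_j + delta) * <alpha_i, theta_j>.

    alpha: (n, K) non-negative cell loadings
    theta: (p, K) gene representations, each row on the simplex
    g:     (p,) gene scale factors in [0, 1]
    delta: scalar lower bound added to every gene scale
    """
    return (alpha @ theta.T) * (g + delta)[None, :]

# Toy parameters (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n, p, K, delta = 5, 4, 3, 0.1
alpha = rng.gamma(2.0, 1.0, size=(n, K))     # non-negative cell loadings
theta = rng.dirichlet(np.ones(K), size=p)    # each gene's row sums to 1
g = rng.uniform(0.0, 1.0, size=p)            # gene scale factors in [0, 1]
mean_X = expected_expression(alpha, theta, g, delta)
```

The unscaled product `alpha @ theta.T` has rank at most $K$, which is the low rank structure referred to in the text; the gene scale only rescales columns.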

Incorporating cell types into modeling expression variation

Because expression variation is dominated by cell types, existing methods generally fit factors that are polluted with cell type markers or alternatively must be run on a subset of the data. For example, T cell receptor activation programs - consisting of markers such as NFATC1 and NFATC2 - are confounded with T cell identity and existing factor analysis methods tend to return identity markers such as CD3, CD4 and CD8. Similarly, programs representing metabolic pathways are often confounded with plasmacytoid dendritic cell (IL3R, BDCA2) or B cell identity markers (CD19, CD79A). While it is challenging to fit a biologically meaningful factor model, successful cell-typing of scRNA-seq data using clustering approaches is a solved problem for discrete cell types but not for intermediate states. Therefore, to mitigate this issue, Spectra assumes that discrete cell types are known and therefore not captured as factors by the model, instead Spectra explicitly fits cell type specific and global factors - allowing Spectra to effectively deal with expression variance at multiple scales. To perform this cell type integrative factor analysis, for cell type c and cell i the model is extended to:

$$\mathbb{E}[X_{cij}] = (g_j+\delta)\,\alpha_{c,i,:K}^{\top}\theta_j + (g_{cj}+\delta)\,\alpha_{c,i,K+1:}^{\top}\theta_{cj}$$

where $c$ is the cell type label for cell $i$, $g_{cj}$ is a cell-type specific gene scaling, $\theta_{cj} \in \Delta^{K_c-1}$ is a cell type specific gene representation, and $\alpha_{c,i} \in \mathbb{R}_+^{K+K_c}$. Single subscript variables such as $g_j$ and $\theta_j$ denote global parameters, while the notation $\alpha_{:K}$ indicates the first $K$ elements of a vector (typically denoting global elements) and $\alpha_{K+1:}$ indicates the tail of the vector from the $(K+1)$’st element (typically denoting cell type specific elements). The threshold $\delta$ restricts the maximum ratio of gene scaling factors to $\frac{1+\delta}{\delta}$.
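A small sketch of the cell type extended model for a single cell may help fix the indexing. The function name `expected_expression_one_cell` and all toy values are assumptions for illustration; only the formula itself comes from the text.

```python
import numpy as np

def expected_expression_one_cell(alpha_ci, theta, theta_c, g, g_c, delta):
    """Expected expression of one cell of type c across all p genes.

    alpha_ci: (K + K_c,) loadings; the first K entries are global
    theta:    (p, K) global gene representations
    theta_c:  (p, K_c) cell type specific gene representations
    g, g_c:   (p,) global and cell type specific gene scalings
    """
    K = theta.shape[1]
    global_part = (g + delta) * (theta @ alpha_ci[:K])        # alpha_{:K}
    specific_part = (g_c + delta) * (theta_c @ alpha_ci[K:])  # alpha_{K+1:}
    return global_part + specific_part

# Toy parameters (hypothetical values).
rng = np.random.default_rng(1)
p, K, K_c, delta = 6, 3, 2, 0.05
theta = rng.dirichlet(np.ones(K), size=p)
theta_c = rng.dirichlet(np.ones(K_c), size=p)
g, g_c = rng.uniform(size=p), rng.uniform(size=p)
alpha_ci = rng.gamma(2.0, 1.0, size=K + K_c)
mean_x = expected_expression_one_cell(alpha_ci, theta, theta_c, g, g_c, delta)

# Because g lies in [0, 1], effective scales (g + delta) differ by at most this ratio.
max_scale_ratio = (1 + delta) / delta
```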

Spectra models the presence of gene programs with highly limited scope in that they can only be activated by a specific cell type, which can be represented by a hard coded sparsity pattern in the cell loading matrix (Supplementary Fig. 3). The cell type specific gene scalings $g_{cj}$ associated with these programs are encouraged to capture cell type identity markers and constitutively active genes, enabling factors themselves to capture variation across cell types and within cell types (Supplementary Fig. 4). Spectra tends to assign constitutive genes such as EEF1A1 and ACTB as well as identity markers such as CD4 and CD3 high values of $g_j$. Lowly expressed genes important for CD4 T cell specific gene programs such as IL21, IL13, and IL6 are often assigned small values of $g_j$, which allows Spectra to attend to gene expression differences that occur on a smaller scale (Supplementary Fig. 4). By default, Spectra runs with at least one cell type specific factor per cell type so that global factors do not capture cell type identities.

Determining cell type granularity

Spectra can accommodate cell type labels at any level of granularity, subject to a linear increase in computational burden with the number of cell types in the dataset. Additionally, as the granularity increases the effective sample size for estimating cell type specific factors decreases, leading to potentially lower quality cell type specific factors. The correct cell type granularity depends on the dataset and the specific scientific questions at hand. First, the analyst should incorporate cell types that are known to be discrete and easily identifiable in the dataset via standard clustering analysis (e.g. T cells, B cells, myeloid cells, and epithelial cells). If cell subtypes exist that are not included as input to the model, Spectra devotes factors to describing variation across these subtypes. Moreover, if intermediate differentiation states between subtypes exist in the data, these subtypes should generally not be included as input to the model because (1) coarser cell type specific factors can describe these intermediate states and (2) delineating between subtypes via clustering may be inaccurate.

$\mathcal{L}_{\text{Graph}}(\Theta)$: Modeling gene-gene relationships in relation to expression data

In addition to faithful approximation of the input count matrix, we would also like interpretable factors that correspond to known gene programs and biological processes (the prior). Therefore, the second component of our likelihood function is a penalty term that guides the solution towards this prior. A key novelty of Spectra is that it models this prior knowledge as a gene-gene community graph, which provides both computational efficiency and flexibility to adapt the graph structure to the data.

In this graph nodes represent individual genes and edges between genes occur when each gene has a similar distribution over factors. Then communities within the graph, or densely connected subsets, represent gene programs while edges between communities contain information about genes that participate in multiple gene programs. Providing an imperfect, partially known graph structure as input, we can constrain our matrix factorization solution to respect the structure to yield interpretable gene programs. A main advantage of this approach is its flexibility. Gene sets are naturally incorporated into a graph by forming fully connected cliques among members of each set.
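The clique construction described above can be sketched as follows. The helper `gene_sets_to_adjacency` and the two gene set names are hypothetical, and the gene memberships are illustrative rather than taken from any curated database.

```python
import numpy as np
from itertools import combinations

def gene_sets_to_adjacency(gene_sets, genes):
    """Build a binary gene-gene adjacency matrix with one clique per gene set."""
    index = {gene: i for i, gene in enumerate(genes)}
    A = np.zeros((len(genes), len(genes)), dtype=int)
    for members in gene_sets.values():
        # Keep only genes present in the expression matrix, without duplicates.
        present = sorted({index[m] for m in members if m in index})
        for i, j in combinations(present, 2):
            A[i, j] = A[j, i] = 1
    return A

genes = ["CD3E", "CD4", "NFATC1", "NFATC2", "GAPDH"]
gene_sets = {
    "tcr_activation": ["NFATC1", "NFATC2", "CD3E"],  # hypothetical set
    "t_cell_identity": ["CD3E", "CD4"],              # hypothetical set
}
A = gene_sets_to_adjacency(gene_sets, genes)
```

A gene shared by two sets (here CD3E) ends up connected to both communities, which is how edges between communities carry information about genes that participate in multiple programs.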

Further, more complex prior knowledge graph structures can be used as input, for example, arising from gene programs estimated from a separate dataset or cell atlas. Most importantly, the structure of this input gene-gene graph can be improved by fitting it to the data and learning gene programs that are more faithful to the data.

A second advantage of the graph prior is its scalability. While gene sets may be highly overlapping, especially when curated from several separate databases, this redundancy is eliminated when storing information at the level of gene-gene relationships. Redundant gene sets will be merged into highly overlapping communities, and so two redundant gene sets can be approximately described by a single factor. A further computational advantage over gene set priors is that the dimensions of the graph stay fixed as the size of the gene set database increases, with only the number of edges increasing, which eliminates the need for iterating over the gene set dimension. Finally, operations involving the graph are implemented via efficient and parallelizable matrix multiplications with the graph adjacency matrix, thus allowing Spectra to efficiently scale to a large number of gene sets and cells (Fig. 2g).

In order to encourage factors to capture our prior knowledge of gene programs, we assume that binary gene-gene relationships are evidence of a pair of genes having similar latent profiles. This assumption could be incorporated by assuming a model for edge probabilities depending on the similarity scores $\langle\theta_i,\theta_j\rangle$ for genes $i$ and $j$. However, the naive inner product does not explicitly account for the fact that prior information is invariably imperfect in systematic ways. First, at the level of entire gene programs: not all gene programs are active in all datasets and therefore entire graph communities may be unnecessary for describing the observed expression data, while there are likely novel gene programs observed in the expression data that are not represented by communities in the graph. Second, gene programs are imperfect, both due to inaccuracy of annotation and, more frequently, because gene programs differ across biological contexts and our prior information is typically derived from a different biological context. Therefore, genes may be misclassified into gene sets to which they do not belong (corresponding to noisy edge observations) or gene sets may be incomplete (corresponding to missing edges). Spectra addresses these issues in two ways: (1) adaptively modeling background noise in the graph, allowing for the addition and removal of edges (Section ’Background edge rates’) and (2) tuning the weight of the prior gene-gene matrix through the incorporation of a weight matrix, termed the factor interaction matrix, into the inner product between gene representations $\theta_i$ and $\theta_j$ (Section ’The factor interaction matrix tunes the weight of the gene-gene prior’).

The factor interaction matrix tunes the weight of the gene-gene prior

To understand the purpose of the factor interaction matrix, let’s first consider the ordinary inner product, measuring gene-gene similarity in terms of gene program representations:

$$\langle\theta_i,\theta_j\rangle = \theta_{i1}\theta_{j1} + \cdots + \theta_{iK}\theta_{jK}$$

The maximum value of this product is 1, achieved only when gene $i$ and gene $j$ put all their weight into a single gene program. Consider what happens if genes $i$ and $j$ are important components of a gene program that exists only in the expression data and not in our prior information. Then $i$ and $j$ are not connected in the graph and so the inner product model encourages $\langle\theta_i,\theta_j\rangle \approx 0$. When $\langle\theta_i,\theta_j\rangle \approx 0$, genes $i$ and $j$ must be components of entirely separate programs. In this way, we see that the naive inner product discourages new factors from being estimated from the expression data. Such an inner product model estimates novel factors that are heavily biased by the graph.

Now instead of the naive inner product consider a weighted product, weighted by scalar values $b_1, b_2, \ldots, b_K$ that are between 0 and 1:

$$\langle\theta_i,\theta_j\rangle_b = b_1\theta_{i1}\theta_{j1} + \cdots + b_K\theta_{iK}\theta_{jK}$$

In order to model the data we can adjust the values of $b_1, \ldots, b_K$ to achieve the best fit. Consider the same situation as above, where $i$ and $j$ are not connected in the graph but they are components of a gene program supported by expression data alone. The product model again encourages $\langle\theta_i,\theta_j\rangle_b \approx 0$; however, now this constraint does not necessarily encourage $\theta_i$ and $\theta_j$ to be dissimilar. To see this, suppose that $\theta_i = [1,0,0]$ and $\theta_j = [1,0,0]$. If $b_1 = 0$, then:

$$\langle\theta_i,\theta_j\rangle_b = b_1 \cdot 1 \cdot 1 + b_2 \cdot 0 \cdot 0 + b_3 \cdot 0 \cdot 0 = 0$$

Hence, novel gene programs can be estimated so long as the value of $b_k$ corresponding to that program is pushed towards 0. We can interpret gene programs corresponding to low values of $b_k$ as novel and gene programs corresponding to high values of $b_k$ as supported by prior information. We could equivalently write each weight $b_k$ as one of the non-zero elements of a diagonal matrix

$$B = \begin{bmatrix} b_1 & & \\ & \ddots & \\ & & b_K \end{bmatrix}$$

so that

$$\langle\theta_i, B\theta_j\rangle = \langle\theta_i,\theta_j\rangle_b = b_1\theta_{i1}\theta_{j1} + \cdots + b_K\theta_{iK}\theta_{jK}$$

In practice, we allow the off diagonals of this matrix B to be estimated as non-zero (Supplementary Fig. 5). The resulting matrix is termed the factor interaction matrix.
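The worked example in this section can be checked numerically. The sketch below uses a hypothetical helper `graph_similarity`, and the values 0.9 for $b_2$ and $b_3$ are arbitrary choices for illustration.

```python
import numpy as np

def graph_similarity(theta_i, theta_j, B):
    """Weighted inner product <theta_i, B theta_j>."""
    return float(theta_i @ B @ theta_j)

# Two genes that both belong entirely to factor 1.
theta_i = np.array([1.0, 0.0, 0.0])
theta_j = np.array([1.0, 0.0, 0.0])

# With B equal to the identity we recover the naive inner product.
naive = graph_similarity(theta_i, theta_j, np.eye(3))

# Setting b_1 = 0 marks factor 1 as novel (not supported by the prior graph).
B = np.diag([0.0, 0.9, 0.9])
weighted = graph_similarity(theta_i, theta_j, B)
```

Here `naive` equals 1 while `weighted` equals 0, so the two genes can share a factor without the model predicting an edge between them.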

Allowing off diagonals of the factor interaction matrix to be non-zero serves two purposes. First, it allows the model to explain overlapping gene sets without forcing shared genes to have partial membership. For example, if two gene sets overlap but in reality represent two distinct biological processes that can be separated in the gene expression data, the model is not forced to assign partial membership to overlapping genes but can fully assign genes to one of two programs. To account for this, the off diagonal element corresponding to this pair of gene programs ($B_{k,l}$ for programs $k$ and $l$) can be estimated as greater than 0. On real data, we see this occur for β-alanine metabolism and fatty acid metabolism (Supplementary Fig. 6). Second, non-zero off diagonal elements of the factor interaction matrix serve to mitigate the effect of low quality edges in the prior graph by allowing edges between genes that are in separate gene expression programs to arise with non-zero probability.

Full Spectra model

As notation we refer to the adjacency matrix of an input graph as the $p \times p$ matrix $A$, with element $A_{ij} = 1$ if an edge exists between $i$ and $j$ and $A_{ij} = 0$ otherwise. Following the discussion above, the Spectra generative model states (Supplementary Fig. 5):

$$\mathbb{P}[A_{ij}=1] = \langle\theta_i, B\theta_j\rangle$$

In the full Spectra model, each gene has a separate representation per cell type (in addition to its global representation), $\theta_{ci}$, where $c$ indexes into the possible cell types. In order to supervise these representations in a cell type specific manner, the user (optionally) provides one graph for each cell type and a graph representing global gene-gene relationships (Supplementary Fig. 6, Supplementary Fig. 7). These graphs are modeled separately, so that each graph’s edges can only be predicted using factor representations specific to that cell type. The cell type specific graphs are denoted $A_c$ for cell type $c$, with $A_{c,ij} = 1$ if there is a cell type specific annotation between genes $i$ and $j$ for cell type $c$. The cell type specific graphs can only influence cell type specific factors and vice versa:

$$\mathbb{P}[A_{c,ij}=1] = \langle\theta_{ci}, B_c\theta_{cj}\rangle$$

as diagrammed in Supplementary Fig. 7. Importantly, a separate factor interaction matrix, $B_c$, is learned for each cell type with a prior graph provided.

The computational cost of including granular cell type specific prior information can be large, as each cell type requires its own graph.

Background edge rates

Realistic annotation graphs have a number of edges that are not supported by expression data, and the model should be allowed the flexibility to attribute edges (or the lack thereof) in annotations to a background rate of noise. To allow flexibility in modifying the original graph we incorporate background edge and non-edge rates κ and ρ that reflect noise rates in the observed graph. These parameters serve two separate purposes: first, they deal with numerical stability issues by moving probabilities away from 0 and 1, and second they control the rate that edges are added and removed from the original graph. Intuitively, our inference procedure examines whether a relationship (or lack of a relationship) in the prior knowledge graph is consistent with expression data, and if not can ascribe this relationship to random noise.

The generative process of our model is that with some probability $\rho$, edges between genes $i$ and $j$ are blocked out and cannot occur irrespective of the corresponding factor values $\theta_i$ and $\theta_j$. If this doesn’t occur, an edge will be generated by random chance with probability $\kappa$. Finally, if neither of these events occurs, an edge is generated according to the factor similarity score $\langle\theta_i, B\theta_j\rangle$. This yields the following distribution for the adjacency matrix:

$$\begin{aligned}
\mathbb{P}[A_{ij}=1] &= (1-\kappa)(1-\rho)\,\theta_i^{\top}B\theta_j + \kappa(1-\rho)\\
\mathbb{P}[A_{ij}=0] &= (1-\kappa)(1-\rho)\,(1-\theta_i^{\top}B\theta_j) + \rho
\end{aligned}$$

where κ and ρ are (cell type specific) background rates of 1 and 0 in the adjacency matrix respectively. κ and ρ can be estimated from the data or fixed to constants and treated as tunable hyperparameters.
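As a quick sanity check on the two expressions above, the sketch below evaluates both probabilities for random parameters. The helper `edge_probabilities` and the particular values of $\kappa$ and $\rho$ are assumptions for illustration.

```python
import numpy as np

def edge_probabilities(theta, B, kappa, rho):
    """Return P[A_ij = 1] and P[A_ij = 0] for all gene pairs.

    theta: (p, K) gene representations on the simplex
    B:     (K, K) factor interaction matrix with entries in [0, 1]
    """
    similarity = theta @ B @ theta.T
    p_edge = (1 - kappa) * (1 - rho) * similarity + kappa * (1 - rho)
    p_no_edge = (1 - kappa) * (1 - rho) * (1 - similarity) + rho
    return p_edge, p_no_edge

rng = np.random.default_rng(2)
theta = rng.dirichlet(np.ones(3), size=5)
B = rng.uniform(size=(3, 3))
B = (B + B.T) / 2                      # symmetric toy interaction matrix
p_edge, p_no_edge = edge_probabilities(theta, B, kappa=0.01, rho=0.05)
```

The two probabilities sum to 1 for every pair, and the edge probability is confined to the interval $[\kappa(1-\rho),\, 1-\rho]$, which is how the background rates keep probabilities away from 0 and 1.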

Constructing the gene-gene prior graph

In most applications, Spectra receives a set of gene sets rather than a gene-gene graph as input, and the gene-gene graph is constructed from these gene sets. Large gene sets generally provide lower evidence that any given gene is crucial to the process that the gene set represents. For example, hallmark gene sets often contain hundreds of genes 83, some of which are upregulated as distant downstream targets. Additionally, larger cliques represent a larger component of the likelihood function, potentially biasing Spectra solutions towards attending to the largest gene communities. Therefore, by default, when Spectra takes in gene sets as input, edge weights are used to down-weight the contribution of any individual graph edge in proportion to the size of the gene set that it is derived from. The default weighting scheme is to weight edges by the inverse of the total number of edges in the clique. For a given gene set $G_k$, this involves downweighting by $1/\binom{|G_k|}{2}$:

$$w_{ij} \propto \frac{1}{\binom{|G_k|}{2}}$$

where $|G_k|$ is the size of a gene set $G_k$ containing genes $i$ and $j$. The weights are rescaled so that the median weight across gene sets is 1. When a pair of genes exists in multiple gene sets, the weights accumulate additively. Another reasonable choice is $w_{ij} = \frac{1}{\max(d(i), d(j))}$, where $d(i)$ is the degree of node $i$.
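One way to read the default weighting scheme is sketched below. The helper `clique_edge_weights` is hypothetical, and two details are assumptions on our part: the median is taken over the per-gene-set weights, and every gene set is assumed to contain at least two distinct genes that are all present in `genes`.

```python
import numpy as np
from itertools import combinations
from math import comb

def clique_edge_weights(gene_sets, genes):
    """Weighted adjacency: each edge of gene set G_k contributes 1 / C(|G_k|, 2)."""
    index = {gene: i for i, gene in enumerate(genes)}
    # One weight per gene set: the reciprocal of its number of clique edges.
    set_weight = {name: 1.0 / comb(len(members), 2)
                  for name, members in gene_sets.items()}
    # Rescale so that the median per-set weight is 1 (assumed interpretation).
    median = float(np.median(list(set_weight.values())))
    W = np.zeros((len(genes), len(genes)))
    for name, members in gene_sets.items():
        w = set_weight[name] / median
        for a, b in combinations(members, 2):
            # Weights accumulate additively when a pair occurs in several sets.
            W[index[a], index[b]] += w
            W[index[b], index[a]] += w
    return W

genes = ["A", "B", "C", "D"]                       # placeholder gene names
gene_sets = {"s2": ["A", "B"], "s3": ["A", "B", "C"], "s4": ["A", "B", "C", "D"]}
W = clique_edge_weights(gene_sets, genes)
```

In this toy case the per-set weights 1, 1/3 and 1/6 are rescaled to 3, 1 and 0.5, so the pair (A, B), which appears in all three sets, accumulates a weight of 4.5.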

As an alternative weighting scheme, Spectra accommodates weighted graphs by scaling edges by edge specific weights. This feature allows users to annotate the prior information graphs with additional quantitative information representing relative confidence in each individual annotation.

Pseudo-likelihood function

The model components described so far specify the expected values of the expression data matrix, $X$, and the prior knowledge graph, $A$, under the Spectra generative process. Together with specific observation distributions, this would specify a likelihood function that serves as the maximization objective of Spectra, fit via either first order methods or expectation maximization. The loss function described below is the negative value of a proper likelihood function in the case where the weights $w_{ij}$ are equal to 1, $\lambda = 1$ and the expression data $X$ follow a Poisson distribution. In practice, these conditions are not satisfied so we adopt the terminology pseudo-likelihood function to describe the negative loss function. For ease of exposition, we first describe the pseudo-likelihood function assuming a single cell type. Recall the general form of the Spectra objective, consisting of a term that measures the ability of Spectra factors to recapitulate expression data and a term that measures the concordance of Spectra factors with the prior knowledge database:

$$\mathcal{L}(\Theta) = \lambda\,\mathcal{L}_{\text{Reconstruction}}(\Theta) + \mathcal{L}_{\text{Graph}}(\Theta)$$

As edges are binary, combined with the assumption of independence, the log likelihood of $A_{ij}$ given the edge probability $p_{ij} := \mathbb{P}[A_{ij}=1]$ is:

$$\log\mathbb{P}(A_{ij}) = A_{ij}\log p_{ij} + (1-A_{ij})\log(1-p_{ij})$$

With $p_{ij}$ as described in Section ’Modeling gene-gene relationships in relation to expression data’ and ’The factor interaction matrix tunes the weight of the gene-gene prior’,

$$\log\mathbb{P}(A_{ij}) = A_{ij}\log\big((1-\kappa)(1-\rho)\,\theta_i^{\top}B\theta_j + \kappa(1-\rho)\big) + (1-A_{ij})\log\big((1-\kappa)(1-\rho)(1-\theta_i^{\top}B\theta_j) + \rho\big)$$

To incorporate weights (following Section ’Constructing the gene-gene prior graph’), we weight likelihood terms corresponding to each edge in the graph by an edge specific weight $w_{ij}$:

$$\log\mathbb{P}(A_{ij}) = w_{ij}A_{ij}\log\big((1-\kappa)(1-\rho)\,\theta_i^{\top}B\theta_j + \kappa(1-\rho)\big) + (1-A_{ij})\log\big((1-\kappa)(1-\rho)(1-\theta_i^{\top}B\theta_j) + \rho\big)$$

Combining across all observations $(i,j)$, this leads to the expression for $\mathcal{L}_{\text{Graph}}(\Theta)$:

$$\mathcal{L}_{\text{Graph}}(\Theta) = \sum_{i=1}^{p}\sum_{j=1,\,j\neq i}^{p}\Big[w_{ij}A_{ij}\log\big((1-\kappa)(1-\rho)\,\theta_i^{\top}B\theta_j + \kappa(1-\rho)\big) + (1-A_{ij})\log\big((1-\kappa)(1-\rho)(1-\theta_i^{\top}B\theta_j) + \rho\big)\Big]$$
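The graph term can be evaluated directly with matrix operations, as in the following sketch. The function `graph_log_likelihood` is a hypothetical name, and the toy graph, the identity interaction matrix and the values of $\kappa$ and $\rho$ are chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def graph_log_likelihood(theta, B, A, W, kappa, rho):
    """Weighted Bernoulli log likelihood of the graph, summed over pairs i != j."""
    similarity = theta @ B @ theta.T
    p_edge = (1 - kappa) * (1 - rho) * similarity + kappa * (1 - rho)
    p_no_edge = (1 - kappa) * (1 - rho) * (1 - similarity) + rho
    terms = W * A * np.log(p_edge) + (1 - A) * np.log(p_no_edge)
    off_diagonal = ~np.eye(A.shape[0], dtype=bool)   # exclude j == i
    return float(terms[off_diagonal].sum())

# Toy graph with two communities: genes {0, 1} and genes {2, 3}.
A = np.zeros((4, 4))
A[0, 1] = A[1, 0] = A[2, 3] = A[3, 2] = 1
W = np.ones((4, 4))
B = np.eye(2)
kappa = rho = 0.05

theta_match = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
theta_mismatch = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
ll_match = graph_log_likelihood(theta_match, B, A, W, kappa, rho)
ll_mismatch = graph_log_likelihood(theta_mismatch, B, A, W, kappa, rho)
```

Gene representations that agree with the graph communities (`theta_match`) receive a higher value than representations that cut across them (`theta_mismatch`), and both values stay finite because $\kappa$ and $\rho$ are positive.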

The loss function derived from the Poisson distribution has been used widely for modeling single cell RNAseq counts 4, 61, 84. Though processed data may not necessarily be well described by the Poisson observation model (e.g. scran processed data is on a log scale), the resulting log likelihood strikes a practical balance in scaling with gene expression magnitude. The resulting loss function has been used in contexts other than modeling count data as the KL divergence loss 85. Here, we are primarily concerned with how the loss function behaves under changes in scale. For example, suppose we have an estimated expression value $\hat{X}_{ij}$. We can write $\hat{X}_{ij}(\Theta)$ as our predicted gene expression as a function of the model parameters. The least squares loss $\mathcal{L}_2(\Theta) := \|X_{ij} - \hat{X}_{ij}(\Theta)\|^2$ is quadratically dependent on the scale of $X_{ij}$, since replacing both ground truth and estimate by scaled versions $\varphi X_{ij}$ and $\varphi\hat{X}_{ij}$ leads to a loss of $\varphi^2\mathcal{L}_2(\Theta)$. Similar to the issues addressed in 2.2.1, the squared loss function encourages factors to attend to highly expressed genes, since scale differences amplify the loss quadratically. At the other extreme, consider the Itakura-Saito loss (IS loss), given by (we briefly assume both ground truth and estimate are not 0):

$$\mathcal{L}_{IS}(\Theta) := \frac{X_{ij}}{\hat{X}_{ij}(\Theta)} - \log\frac{X_{ij}}{\hat{X}_{ij}(\Theta)} - 1$$

If we scale observed counts and prediction to $\varphi X_{ij}$ and $\varphi\hat{X}_{ij}$, then the IS loss does not change. So, matrix factorization with the IS loss does not suffer from a bias towards highly expressed genes. However, forcing the model to predict all lowly expressed genes is not desirable, often leading to low quality factors. The Poisson log likelihood exhibits a practically convenient balance between these two extremes:

$$\mathcal{L}_{Pois}(\Theta) := -X_{ij}\log\hat{X}_{ij}(\Theta) + \hat{X}_{ij}(\Theta)$$

When $X_{ij}$ and $\hat{X}_{ij}$ are scaled by $\varphi$, $\mathcal{L}_{Pois}(\Theta)$ is scaled by $\varphi$, up to an additive term that does not depend on $\Theta$. This linear dependence on expression scale achieves a good balance in the relative weighting between highly expressed and lowly expressed genes.
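The three scaling behaviours can be verified on a single entry. The sketch below is illustrative only: the function names and numeric values are assumptions, and for the Poisson loss we compare the gap between two candidate predictions so that the parameter-independent term $-\varphi X_{ij}\log\varphi$ cancels.

```python
import numpy as np

def squared_loss(x, x_hat):
    return (x - x_hat) ** 2

def itakura_saito_loss(x, x_hat):
    ratio = x / x_hat
    return ratio - np.log(ratio) - 1.0

def poisson_loss(x, x_hat):
    return -x * np.log(x_hat) + x_hat

# One observed value, two candidate predictions, and a scale factor phi.
x, x_hat_a, x_hat_b, phi = 2.0, 1.5, 3.0, 10.0

sq_ratio = squared_loss(phi * x, phi * x_hat_a) / squared_loss(x, x_hat_a)
is_ratio = itakura_saito_loss(phi * x, phi * x_hat_a) / itakura_saito_loss(x, x_hat_a)

# Difference in Poisson loss between the two predictions, before and after scaling.
pois_gap = poisson_loss(x, x_hat_a) - poisson_loss(x, x_hat_b)
pois_gap_scaled = (poisson_loss(phi * x, phi * x_hat_a)
                   - poisson_loss(phi * x, phi * x_hat_b))
pois_ratio = pois_gap_scaled / pois_gap
```

The ratios come out as $\varphi^2$ for the squared loss, 1 for the IS loss and $\varphi$ for the Poisson loss.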

An additional advantage of this loss function is that the second term behaves as a lasso penalty 86, inducing sparsity in the resulting estimates of $\hat{X}_{ij}$ for sparse data $X$, as noted by 87. This sparsity allows for a parsimonious explanation of a cell’s gene expression using as few factors as possible. In Spectra, we have $\hat{X}_{ij} := \alpha_i^{\top}\theta_j(g_j+\delta)$, yielding the expression:

$$\mathcal{L}_{\text{Reconstruction}}(\Theta) = \sum_{i=1}^{n}\sum_{j=1}^{p} X_{ij}\log\big(\alpha_i^{\top}\theta_j(g_j+\delta)\big) - \alpha_i^{\top}\theta_j(g_j+\delta)$$
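A direct evaluation of the reconstruction term is sketched below, assuming a single cell type and strictly positive predictions. The function name `reconstruction_term` and the noiseless toy data are assumptions for illustration.

```python
import numpy as np

def reconstruction_term(X, alpha, theta, g, delta):
    """Sum over cells and genes of X_ij * log(X_hat_ij) - X_hat_ij."""
    X_hat = (alpha @ theta.T) * (g + delta)
    return float(np.sum(X * np.log(X_hat) - X_hat))

rng = np.random.default_rng(3)
n, p, K, delta = 6, 5, 2, 0.1
alpha = rng.gamma(2.0, 1.0, size=(n, K))
theta = rng.dirichlet(np.ones(K), size=p)
g = rng.uniform(size=p)

# Noiseless toy "data" equal to the model mean, so the true parameters are optimal.
X = (alpha @ theta.T) * (g + delta)
at_truth = reconstruction_term(X, alpha, theta, g, delta)
inflated = reconstruction_term(X, 1.5 * alpha, theta, g, delta)
```

Because each summand $x\log\hat{x} - \hat{x}$ is maximized at $\hat{x} = x$, inflating the loadings lowers the value of the term.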

Combining the components, the pseudo-log likelihood function is:

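$$\mathcal{L}(\alpha,\theta,g,B) = \underbrace{\lambda\sum_{i=1}^{n}\sum_{j=1}^{p} X_{ij}\log\big(\alpha_i^{\top}\theta_j(g_j+\delta)\big) - \alpha_i^{\top}\theta_j(g_j+\delta)}_{\text{Reconstruction}} + \underbrace{\sum_{i=1}^{p}\sum_{j=1,\,j\neq i}^{p}\Big[w_{ij}A_{ij}\log\big((1-\kappa)(1-\rho)\,\theta_i^{\top}B\theta_j + \kappa(1-\rho)\big) + (1-A_{ij})\log\big((1-\kappa)(1-\rho)(1-\theta_i^{\top}B\theta_j) + \rho\big)\Big]}_{\text{Graph}}$$

Again, $X$ is the data matrix after processing, whereas $(\alpha,\theta,g,B)$ are the four model parameters that need to be estimated. The first term in the pseudo-log likelihood function comes from the log likelihood of the Poisson distribution (also referred to as the KL Divergence loss function when multiplied by −1) while the second term is the log likelihood of a Bernoulli distribution with positive observations scaled by $w_{ij}$. The pseudo-likelihood function optimized by Spectra includes an optimization over cell type specific and global parameters and so an additional sum over cell types is included in the pseudo-log likelihood (Supplementary Fig. 7).

$$\begin{aligned}
\mathcal{L}(\alpha,\theta,g,B) = {} & \underbrace{\sum_{c=1}^{C}\sum_{i=1}^{n_c}\lambda_c\sum_{j=1}^{p} X_{cij}\log\big((g_j+\delta)\,\alpha_{c,i,:K}^{\top}\theta_j + (g_{cj}+\delta)\,\alpha_{c,i,K+1:}^{\top}\theta_{cj}\big) - \sum_{c=1}^{C}\sum_{i=1}^{n_c}\sum_{j=1}^{p}\big((g_j+\delta)\,\alpha_{c,i,:K}^{\top}\theta_j + (g_{cj}+\delta)\,\alpha_{c,i,K+1:}^{\top}\theta_{cj}\big)}_{\text{Reconstruction}}\\
& + \underbrace{\sum_{c=1}^{C+1}\sum_{i=1}^{p}\sum_{j=1,\,j\neq i}^{p}\Big[w_{c,ij}A_{c,ij}\log\big((1-\kappa_c)(1-\rho_c)\,\theta_{ci}^{\top}B_c\theta_{cj} + \kappa_c(1-\rho_c)\big) + (1-A_{c,ij})\log\big((1-\kappa_c)(1-\rho_c)(1-\theta_{ci}^{\top}B_c\theta_{cj}) + \rho_c\big)\Big]}_{\text{Graph}}
\end{aligned}$$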

As all discrete parameters have been integrated out, this pseudo-log likelihood can be directly maximized via first order methods such as gradient descent. Approximate second order methods are not ideal, due to the high dimension of the parameter space for practical problem sizes. However, for smaller sized problems (in terms of number of genes and factors) we develop an expectation maximization approach that yields intuitive coordinate ascent updates of model parameters.

Spectra’s output

To describe the activity level of factor $k$ in cell $i$ we compute cell scores as $\text{cell\_score}_{ik} = q_k\alpha_{ik}$ where $q_k = \frac{1}{p}\sum_{j=1}^{p}\theta_{jk}$. In other words, the cell scores are the loadings weighted by the total factor usage across all genes. This allows us to circumvent the non-identifiability of scale associated with factor analysis approaches. Regarding terminology, we will always refer to the unnormalized loadings $\alpha_{ik}$ as “loadings” and the normalized loadings as cell scores. Additionally the cell specific parameters of other matrix factorization methods are described as “loadings”. The ground truth parameters in our simulations are also described as “loadings”.

To describe the relevance of gene $j$ for factor $k$ we compute gene scores for gene $j$ and factor $k$ as $\frac{g_j+\delta}{g_j+\delta+\text{offset}}\,\theta_{jk}$. The first term is near 0 when $g_j$ is very small and increases as $g_j$ becomes large. This allows us to remove very lowly expressed genes from the factors while maintaining coherence. By default, the offset term is set to 1; it can be tuned and in some cases set to 0, which yields the factors $\theta_{jk}$ themselves. Each $\theta_{jk}$ is more directly influenced by the prior than $g_j\theta_{jk}$ and so setting offset to 0 tends to yield marker lists closely resembling input gene sets.
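Both scores are simple rescalings of the fitted parameters, as the sketch below shows. The function names `cell_scores` and `gene_scores` and the toy parameter values are assumptions and are not meant to mirror the package's interface.

```python
import numpy as np

def cell_scores(alpha, theta):
    """cell_score_ik = q_k * alpha_ik with q_k = (1/p) * sum_j theta_jk."""
    q = theta.mean(axis=0)
    return alpha * q[None, :]

def gene_scores(theta, g, delta, offset=1.0):
    """gene_score_jk = (g_j + delta) / (g_j + delta + offset) * theta_jk."""
    scale = (g + delta) / (g + delta + offset)
    return theta * scale[:, None]

rng = np.random.default_rng(4)
n, p, K, delta = 5, 6, 3, 0.001
alpha = rng.gamma(2.0, 1.0, size=(n, K))
theta = rng.dirichlet(np.ones(K), size=p)
g = rng.uniform(size=p)

scores_cells = cell_scores(alpha, theta)
scores_genes = gene_scores(theta, g, delta)                       # default offset of 1
scores_genes_no_offset = gene_scores(theta, g, delta, offset=0.0)  # returns theta itself
```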

Users can access additional parameters that facilitate interpretation of the gene scores and cell scores. The factor interaction matrix per cell type ($B$, Supplementary Fig. 6) contains entries in the range $[0,1]$, where diagonal entries can be interpreted as the relevance of a given factor to the prior graph. Off diagonal entries can be interpreted as a background rate of edges between genes that are expressed in separate factors. For each cell type, users can access a posterior graph that is denoised using information from the expression data. The posterior graph is computed by the inner product $\langle\theta_i, B\theta_j\rangle$ for each pair of genes $i$ and $j$ after estimating $\theta_i$, $\theta_j$ and $B$ from the data.

Of importance are the diagonal elements of the interaction matrices $B$, which contain information about the dependence of the factor on the input graph. We term these diagonal elements $\eta$ (“eta”), specifically $\eta_c := \operatorname{diag}(B_c)$.

Factor importance and Information Scores

We adopt two metrics to prioritize factors in the output of Spectra, factor importance and factor information scores, each measuring a different property of the factor. Both metrics are computed per cell type for all of the factors that are potentially relevant to that cell type. In other words, to prioritize the relevant factors for a cell type, the metrics are computed for each cell type specific factor and each global factor, resulting in $2K+K_c$ scores for cell type $c$. The factor importance score measures the overall contribution of a factor to explaining the observed expression data (as measured by the reconstruction component of the loss function), regardless of whether this factor explains within-cell-type variation. The factor information score, complementary to the factor importance score, measures whether the gene set associated with a factor captures meaningful within-cell-type variation. Factors with high scores in either of these categories are potentially of interest for post hoc analysis. The factor importance score is a relative change in reconstruction error for a specific cell type when a certain factor is masked out. Let $\bar{L}_c(\theta,\theta_j)$ be the reconstruction error for cell type $c$:

$$\begin{aligned}
\bar{L}_c(\theta,\theta_j) := {} & \sum_{i=1}^{n_c}\lambda_c\sum_{j=1}^{p} X_{cij}\log\big((g_j+\delta)\,\alpha_{c,i,:K}^{\top}\theta_j + (g_{cj}+\delta)\,\alpha_{c,i,K+1:}^{\top}\theta_{cj}\big) && (1)\\
& - \sum_{i=1}^{n_c}\sum_{j=1}^{p}\big((g_j+\delta)\,\alpha_{c,i,:K}^{\top}\theta_j + (g_{cj}+\delta)\,\alpha_{c,i,K+1:}^{\top}\theta_{cj}\big) && (2)
\end{aligned}$$

Here, it is understood that all parameters except $\theta$ and $\theta_j$ are fixed to their fitted values. Further, let $\epsilon_k$ denote a vector of all 1s of dimension equal to the number of factors, except at $k$ where it is 0: $\epsilon_k = \mathbf{1} - e_k$. The importance score for cell type specific factor $k$ is then $\mathcal{I}_{c,k} = \frac{\bar{L}_c(\theta,\,\epsilon_k\odot\theta_j) - \bar{L}_c(\theta,\theta_j)}{\bar{L}_c(\theta,\theta_j)}$, where $\odot$ represents the elementwise product. Similarly the importance score for a global factor $k$ is given by $\mathcal{I}^{(g)}_{c,k} = \frac{\bar{L}_c(\epsilon_k\odot\theta,\,\theta_j) - \bar{L}_c(\theta,\theta_j)}{\bar{L}_c(\theta,\theta_j)}$.
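The masking idea can be illustrated in a simplified setting. The sketch below assumes a single cell type with no cell type specific factors (so only one $\theta$ is masked) and adds a small constant `eps` inside the logarithm as a numerical guard; both simplifications, along with the function names, are our assumptions rather than part of the definition above.

```python
import numpy as np

def reconstruction(X, alpha, theta, g, delta, eps=1e-12):
    """Reconstruction term; eps is an assumed guard against log(0) after masking."""
    X_hat = (alpha @ theta.T) * (g + delta)
    return float(np.sum(X * np.log(X_hat + eps) - X_hat))

def factor_importance(X, alpha, theta, g, delta, k):
    """Relative change in the reconstruction term when factor k is masked out."""
    mask = np.ones(theta.shape[1])
    mask[k] = 0.0                                  # epsilon_k = 1 - e_k
    full = reconstruction(X, alpha, theta, g, delta)
    masked = reconstruction(X, alpha, theta * mask, g, delta)
    return (masked - full) / full

rng = np.random.default_rng(6)
n, p, delta = 20, 10, 0.1
theta = rng.dirichlet(np.ones(2), size=p)
g = rng.uniform(size=p)
# Factor 0 is strongly used by every cell, factor 1 is almost unused.
alpha = np.column_stack([np.full(n, 5.0), np.full(n, 0.01)])
X = (alpha @ theta.T) * (g + delta)                # noiseless toy data

importance = [factor_importance(X, alpha, theta, g, delta, k) for k in range(2)]
```

In this toy example masking the heavily used factor changes the reconstruction term far more than masking the nearly unused one, so its importance score is larger in magnitude.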

Information scores are given by Definition (1) in 88, but computed per cell type to represent cell type specific information content. Specifically, given a marker list associated with a factor and with $M$ set to 30 we have:

$$C_{c,k} = \sum_{m=2}^{M}\sum_{l=1}^{m-1}\log\frac{D_c\big(g_m^{(k)}, g_l^{(k)}\big) + 1}{D_c\big(g_l^{(k)}\big)} \qquad (3)$$

where now $g_m^{(k)}$ is the $m$’th top gene for factor $k$ and $D_c(\cdot,\cdot)$ and $D_c(\cdot)$ are the co-occurrence frequency and frequency respectively within cell type $c$. We plot $\exp C_{c,k}$ as the information scores.
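Equation (3) can be computed with a double loop over the marker list, as sketched below. We assume here that a gene "occurs" in a cell when its expression is greater than zero and that frequencies are cell counts; the function name `information_score` and the toy matrix are illustrative.

```python
import numpy as np

def information_score(X_c, top_genes):
    """Coherence-style score for an ordered marker list within one cell type.

    X_c:       (cells of type c, genes) expression matrix
    top_genes: column indices of the top genes for a factor, best first
    Assumes every listed gene is expressed in at least one cell.
    """
    expressed = X_c > 0
    score = 0.0
    for m in range(1, len(top_genes)):
        for l in range(m):
            both = np.sum(expressed[:, top_genes[m]] & expressed[:, top_genes[l]])
            single = np.sum(expressed[:, top_genes[l]])
            score += np.log((both + 1) / single)
    return float(score)

# Genes 0 and 1 are expressed in the same two cells; gene 2 in the other two.
X_c = np.array([
    [3.0, 1.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 5.0],
    [0.0, 0.0, 1.0],
])
coherent = information_score(X_c, [0, 1])     # co-expressed markers
incoherent = information_score(X_c, [0, 2])   # mutually exclusive markers
```

Markers that co-occur across cells give the larger score ($\log 1.5$ here) and mutually exclusive markers give the smaller one ($\log 0.5$).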

Optimization

We develop two optimization schemes: an auxiliary latent variable expectation maximization (EM) approach and gradient descent based optimization via Adam 89. EM converges quickly in many situations; however, the memory requirements are substantially larger than the gradient descent based optimization. Specifically the memory requirement of EM parameter storage is $O(npK + p^2K^2)$ due to auxiliary parameter storage while the memory requirement of gradient descent is substantially lower: $O(nK + pK + K^2)$.

Though memory intensive, the EM solution is valuable for two reasons: (1) for problems with a small number of factors (<20) and genes (<2500), EM is fast and less sensitive to initialization than gradient descent 90, and (2) the EM updates are intuitive and give us understanding of how our algorithm balances evidence from the graph and expression data.

On the other hand, optimization with Adam can handle a large number of factors (>200) and genes (> 10000), and can exhibit stability with the appropriate initialization. By default Spectra uses Adam for optimization.

Expectation Maximization

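For ease of exposition we describe the EM (expectation maximization) routine for the non-integrative model; the updates are easily extendable to incorporate cell type labels. Additionally we write the pseudo-likelihood function equivalently (up to a scale factor) in terms of $\tilde{\lambda} := \frac{1}{\lambda}$. To make expectation maximization possible, we exploit two facts about the distribution of $(X,A)$ 91.

The first is that if $z_{ijk} \sim \text{Pois}\big((g_j+\delta)\alpha_{ik}\theta_{jk}\big)$ and we define $X_{ij} = \sum_{k=1}^{K} z_{ijk}$, then $X_{ij}$ still has the correct marginal distribution, due to standard properties of the Poisson distribution 87. Secondly, if we define $\tilde{z}_{ij} \sim \text{Categorical}(\theta_i)$, $\tilde{z}_{ji} \sim \text{Categorical}(\theta_j)$ and define a conditional distribution for $A_{ij}$ as:

$$\mathbb{P}(A_{ij}=1 \mid \tilde{z}_{ijk}=1, \tilde{z}_{jil}=1) = B_{kl}$$

then $A_{ij}$ still has the correct marginal distribution 92. As a result, we can optimize the marginal log likelihood via optimization of the expected complete data log likelihood $\mathbb{E}_{z,\tilde{z}}[\log p(X,A,z,\tilde{z})]$ where the expectation is taken over the posterior $p(z,\tilde{z} \mid A,X)$. The expected complete data log likelihood is given by:

$$\begin{aligned}
\tilde{\mathcal{L}}(\alpha,B,\theta,g) = {} & \sum_{i=1}^{n}\sum_{j=1}^{p}\sum_{k=1}^{K}\phi_{ijk}\log\big((g_j+\delta)\alpha_{ik}\theta_{jk}\big) - (g_j+\delta)\alpha_{ik}\theta_{jk}\\
& + \tilde{\lambda}\sum_{i=1}^{p}\sum_{j=1,\,j\neq i}^{p}\sum_{k=1}^{K}\sum_{l=1}^{K}\tilde{\phi}_{ijkl}\Big(w_{ij}A_{ij}\log\big((1-\kappa)B_{kl}+\kappa\big) + (1-A_{ij})\log\big((1-\kappa)(1-\rho)(1-B_{kl})+\rho\big) + \log\theta_{ik} + \log\theta_{jl}\Big)
\end{aligned}$$

where $\phi_{ijk} := \mathbb{E}(z_{ijk} \mid X) = X_{ij}\frac{\alpha_{ik}\theta_{jk}}{\sum_{k'=1}^{K}\alpha_{ik'}\theta_{jk'}}$ and

$$\tilde{\phi}_{ijkl} = \mathbb{P}(\tilde{z}_{ijk}=1, \tilde{z}_{jil}=1 \mid A) \propto \theta_{ik}\theta_{jl}\big((1-\kappa)B_{kl}+\kappa\big)^{w_{ij}A_{ij}}\big((1-\kappa)(1-\rho)(1-B_{kl})+\rho\big)^{1-A_{ij}}$$

Importantly, this manipulation moves summations outside of the logs which permits analytic EM updates for $B$, $\alpha$ and $g$ given by:

$$\begin{aligned}
\alpha_{ik} &\leftarrow \frac{\sum_{j=1}^{p}\phi_{ijk}}{\sum_{j=1}^{p}\theta_{jk}(g_j+\delta)}\\
g_j &\leftarrow \text{proj}_{[0,1]}\left(\frac{\sum_{i=1}^{n}\sum_{k=1}^{K}\phi_{ijk}}{\sum_{i=1}^{n}\sum_{k=1}^{K}\alpha_{ik}\theta_{jk}} - \delta\right)\\
B_{kl} &\leftarrow \text{proj}_{[0,1]}\left(\frac{\left(\frac{\rho}{1-\rho}+(1-\kappa)\right)\Xi_{kl} - \kappa}{(1-\kappa)(1+\Xi_{kl})}\right)
\end{aligned}$$

where $\Xi_{kl} := \frac{\sum_{i=1}^{p}\sum_{j=1}^{p}\tilde{\phi}_{ijkl}\,w_{ij}A_{ij}}{\sum_{i=1}^{p}\sum_{j=1}^{p}\tilde{\phi}_{ijkl}\,(1-A_{ij})}$, representing an odds ratio between Bernoulli outcomes. Further, the complete data log likelihood has diagonal Hessian when viewed as a function of $\theta$ only, $\tilde{L}(\theta)$, permitting linear time Newton Raphson updates:

$$\begin{aligned}
\gamma_{jl} &\leftarrow \frac{1}{\theta_{jl}}\left[\sum_{i=1}^{n}\sum_{j=1}^{p}\phi_{ijl} + \tilde{\lambda}\sum_{i=1}^{p}\sum_{k=1}^{K}\tilde{\phi}_{ijkl} + \tilde{\lambda}\sum_{i=1}^{p}\sum_{k=1}^{K}\tilde{\phi}_{jilk}\right] - (g_j+\delta)\sum_{i=1}^{n}\alpha_{il}\\
H_{jl} &\leftarrow -\frac{1}{\theta_{jl}^{2}}\left[\sum_{i=1}^{n}\sum_{j=1}^{p}\phi_{ijl} + \tilde{\lambda}\sum_{i=1}^{p}\sum_{k=1}^{K}\tilde{\phi}_{ijkl} + \tilde{\lambda}\sum_{i=1}^{p}\sum_{k=1}^{K}\tilde{\phi}_{jilk}\right]\\
\Delta_j &\leftarrow -\frac{\sum_{k=1}^{K}\gamma_{jk}H_{jk}^{-1}}{\sum_{k=1}^{K}H_{jk}^{-1}}\\
\theta_{jk} &\leftarrow \theta_{jk} - H_{jk}^{-1}(\Delta_j + \gamma_{jk})
\end{aligned}$$

The integrative version of Spectra uses analogous updates, with the bounds of summations appropriately modified; specifically the E step updates are $\phi_{ijk} = X_{ij}\frac{g_{c,j}\alpha_{ik}\theta_{jk}}{g_j\sum_{k'=1}^{K}\alpha_{ik'}\theta_{jk'} + g_{c,j}\sum_{k'=K+1}^{K+K_c}\alpha_{ik'}\theta_{jk'}}$ for cell type specific factors and $\phi_{ijk} = X_{ij}\frac{g_j\alpha_{ik}\theta_{jk}}{g_j\sum_{k'=1}^{K}\alpha_{ik'}\theta_{jk'} + g_{c,j}\sum_{k'=K+1}^{K+K_c}\alpha_{ik'}\theta_{jk'}}$ for global factors.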

Adam

For large single cell RNA-seq datasets, the memory requirement of EM for fitting a large number of factors is prohibitive. We therefore directly optimize the marginal log likelihood with Adam 89, a momentum based gradient descent optimizer implemented in PyTorch. In detail, the Adam hyperparameters $\beta_1$ and $\beta_2$ are set to their default values of 0.9 and 0.999 respectively. We use a learning rate schedule of [1.0, 0.5, 0.1, 0.01, 0.001, 0.0001], where training at each subsequent learning rate begins after convergence at the previous, higher learning rate. The maximum number of iterations is fixed to 10,000. This default training scheme can be modified by the user. In particular, for faster convergence either the maximum number of iterations can be made smaller or the smallest learning rates can be removed, allowing for solutions that are not as fine tuned.

Algorithm 1.

EM-Spectra routine

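Require: $X \geq 0$, $A \in \{0,1\}^{p\times p}$, $T \in \mathbb{Z}_+$, $\kappa \in \mathbb{R}_+$, $\rho \in \mathbb{R}_+$, $\tilde{\lambda} \in \mathbb{R}_+$
 initialize $B$, $\alpha$, $\theta$, $g$
while $\mathcal{L}_n - \mathcal{L}_{n-1} > \epsilon$ do
  $\phi_{ijk} \leftarrow X_{ij}\frac{\alpha_{ik}\theta_{jk}}{\sum_{k'=1}^{K}\alpha_{ik'}\theta_{jk'}}$
  $\tilde{\phi}_{ijkl} \leftarrow \theta_{ik}\theta_{jl}\big((1-\kappa)B_{kl}+\kappa\big)^{w_{ij}A_{ij}}\big((1-\kappa)(1-\rho)(1-B_{kl})+\rho\big)^{1-A_{ij}}$
  $\tilde{\phi}_{ijkl} \leftarrow \tilde{\phi}_{ijkl} / \sum_{kl}\tilde{\phi}_{ijkl}$
  while $t < T$ do
   $\gamma_{jl} \leftarrow \frac{1}{\theta_{jl}}\left[\sum_{i=1}^{n}\sum_{j=1}^{p}\phi_{ijl} + \tilde{\lambda}\sum_{i=1}^{p}\sum_{k=1}^{K}\tilde{\phi}_{ijkl} + \tilde{\lambda}\sum_{i=1}^{p}\sum_{k=1}^{K}\tilde{\phi}_{jilk}\right] - (g_j+\delta)\sum_{i=1}^{n}\alpha_{il}$
   $H_{jl} \leftarrow -\frac{1}{\theta_{jl}^{2}}\left[\sum_{i=1}^{n}\sum_{j=1}^{p}\phi_{ijl} + \tilde{\lambda}\sum_{i=1}^{p}\sum_{k=1}^{K}\tilde{\phi}_{ijkl} + \tilde{\lambda}\sum_{i=1}^{p}\sum_{k=1}^{K}\tilde{\phi}_{jilk}\right]$
   $\Delta_j \leftarrow -\frac{\sum_{k=1}^{K}\gamma_{jk}H_{jk}^{-1}}{\sum_{k=1}^{K}H_{jk}^{-1}}$
   $\theta_{jk} \leftarrow \theta_{jk} - H_{jk}^{-1}(\Delta_j + \gamma_{jk})$
  $\alpha_{ik} \leftarrow \frac{\sum_{j=1}^{p}\phi_{ijk}}{\sum_{j=1}^{p}\theta_{jk}(g_j+\delta)}$
  $g_j \leftarrow \text{proj}_{[0,1]}\left(\frac{\sum_{i=1}^{n}\sum_{k=1}^{K}\phi_{ijk}}{\sum_{i=1}^{n}\sum_{k=1}^{K}\alpha_{ik}\theta_{jk}} - \delta\right)$
  $B_{kl} \leftarrow \text{proj}_{[0,1]}\left(\frac{\left(\frac{\rho}{1-\rho}+(1-\kappa)\right)\Xi_{kl} - \kappa}{(1-\kappa)(1+\Xi_{kl})}\right)$
  $\mathcal{L}_n \leftarrow \mathcal{L}(\alpha,\theta,B,g)$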

Initialization

Since the Spectra objective function is non-convex and susceptible to suboptimal local maxima, initialization plays an important role in the quality of the eventual solutions. When Spectra is provided with gene sets as input, our strategy is to initialize factors as close to the gene sets as possible. Whenever the number of factors is greater than the number of gene sets, we resort to a gene set based initialization procedure:

First, a hyperparameter $t$ controls the strength of the initialization. By default, $t$ is set to 25. For a given cell type, whenever the number of factors is at least as large as the number of gene sets, we initialize $\log\theta_{ij} \leftarrow t$ when gene $i$ belongs to gene set $j$, for each gene set $j = 1, \ldots, N_{gs}$, where $N_{gs}$ is the number of gene sets. Further, the factor interaction matrix is initialized with $\operatorname{logit} B_{jj} \leftarrow t$ to encode the knowledge that this factor corresponds to a gene set. To encourage the last factor to capture genes that have no edges in the prior graph, we initialize the last row and column to small values, $\operatorname{logit} B_{K,j} \leftarrow -t$ and $\operatorname{logit} B_{j,K} \leftarrow -t$ for all $j = 1, \ldots, K$. Corollary 1 explains why this leads to extremely fast convergence when $\lambda$ is small.

For a given cell type, when the number of factors is not greater than the number of gene sets we resort to initialization with non-negative matrix factorization 93.

GPU acceleration

For all results involving GPU acceleration, we provide a wrapper around the original model implementation that loads models onto the GPU via the PyTorch syntax device = torch.device('cuda:0'); model.to(device) when CUDA is available. Data (including adjacency matrices and expression data) are similarly loaded onto the GPU. All GPU methods were run on an NVIDIA A100 Tensor Core GPU.

Determining the number of factors

We adopt two approaches to determining the number of factors. The first is to set the number of factors for each cell type equal to the number of gene sets available for that cell type +1 (similar to the approach taken by slalom), and the second is to estimate the number of factors from the data via bulk eigenvalue matching analysis 94. Fitting a large number of factors is possible: in our experiments we fit a set of 197 factors.

The second approach involves three steps. In the first step, we estimate a null distribution of eigenvalues by sampling variances $\sigma_j^2 \sim \mathrm{Gamma}(\theta, 1/\theta)$ and subsequently an $n \times p$ Gaussian matrix with entries $Z_{ij} \sim N(0, \sigma_j^2)$. We sample $B$ of these matrices and take the average of the sorted eigenvalues of $\frac{1}{n} Z Z^{t}$ over the $B$ samples. Typically $B$ is set to 100. Given the average sorted eigenvalues, we compute a regression coefficient without intercept between the “bulk” of these eigenvalues and the bulk of the observed eigenvalues of the data covariance matrix. The bulk of the distribution consists of the values between a lower and an upper quantile, which are hyperparameters of the method. We perform a line search on $\theta$ to find the value that minimizes the sum of squared residuals of this regression. Denoting this regression coefficient as $\beta$, in the second step we simulate a background distribution by sampling variance terms $\sigma_j^2 \sim \mathrm{Gamma}(\theta, \beta/\theta)$, data from $Z_{ij} \sim N(0, \sigma_j^2)$, and eigenvalues from $\frac{1}{n} Z Z^{t}$. Finally, $K$ is estimated as the $(1-\alpha)$ quantile of the simulated distribution of leading eigenvalues. We apply this process to every cell type separately.

Determining Spectra’s input parameters

In Table 1, we summarize all user defined inputs to the Spectra algorithm. The data matrix and regularization strength λ must be provided by the user, while prior information can be provided in the form of a dictionary of cell type specific and global gene sets (note that Spectra can also be run by providing graph adjacency matrices directly). Optionally, cell type labels that align with keys of the gene set dictionary can be provided. The lower bound for gene scale factors, δ, controls the extent that gene expression is normalized and is set to a default value of 0.001. This translates to a maximum ratio of gene scale factors of 1000. By default the graph edge weights are set to be inversely proportional to the total number of edges induced by the gene set leading to a given edge, and accumulate additively for genes in multiple sets. The background rate of noise edges, κ, and the rate at which edges are randomly removed from the graph, ρ, can be provided as fixed parameters that provide users with an extra degree of control over the extent that the graph is modified. If they are set to None (default), they are estimated during the training process in the same manner that other model parameters are estimated.
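As a quick check of the ratio quoted above, the following sketch assumes (as in Algorithm 1) that each gene scale factor $g_j$ is constrained to $[0,1]$ and enters the model as $g_j + \delta$; with the default $\delta = 0.001$ the largest possible ratio between two effective scale factors is then:

```latex
\frac{\max_j\,(g_j + \delta)}{\min_j\,(g_j + \delta)}
\;\le\; \frac{1 + \delta}{0 + \delta}
\;=\; \frac{1.001}{0.001}
\;=\; 1001 \;\approx\; 1000 .
```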

Table 1:

Spectra model inputs

| Input | Description |
| --- | --- |
| X | Expression matrix with n cells and p genes (Required). |
| λ | Regularization strength of prior graph (Required). |
| Gene set dictionary | Dictionary with cell types as keys, gene sets as values (Optional). |
| Cell type labels | List of cell types corresponding to expression matrix (Optional). |
| δ | Parameter that bounds minimum gene scale factor (Optional). |
| w | Graph edge weights (Optional). |
| κ | Background rate of edges (Optional). |
| ρ | Background rate of edge deletion (Optional). |

For typically sized scRNAseq datasets, as a rule of thumb we recommend λ=0.01 for studies in which factors should closely resemble the input gene sets and λ=0.1 for studies where the factors should be allowed to deviate significantly from the gene sets. Values of δ ranging from 0.0001 to 0.01 yield similar results with δ>0.01 providing solutions with typical highly expressed genes observed from NMF.
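The default edge weighting described above (weights inversely proportional to the number of edges a gene set induces, accumulating additively across sets) can be illustrated with a short sketch. It is not the Spectra implementation: the function name `gene_set_edge_weights`, the proportionality constant of 1, and the choice to count each unordered gene pair once without self-loops are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def gene_set_edge_weights(gene_sets):
    """Map each gene pair to a weight accumulated over all gene sets containing it."""
    weights = defaultdict(float)
    for genes in gene_sets.values():
        genes = sorted(set(genes))
        n_edges = len(genes) * (len(genes) - 1) // 2  # edges induced by this set
        if n_edges == 0:
            continue
        for a, b in combinations(genes, 2):
            # Each edge gets 1 / n_edges from this set; contributions add up.
            weights[(a, b)] += 1.0 / n_edges
    return dict(weights)

# Hypothetical toy gene sets.
example = {
    "set_A": ["CD8A", "GZMB", "PRF1"],  # 3 edges, each weighted 1/3
    "set_B": ["GZMB", "PRF1"],          # 1 edge, weighted 1
}
w = gene_set_edge_weights(example)
```

Under these assumptions the pair (GZMB, PRF1), which appears in both sets, accumulates 1/3 + 1, while pairs found only in the larger set keep a weight of 1/3, so small gene sets are not swamped by large ones.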

Validation metrics

Marker list coherence metrics

To evaluate the quality of factors computed from data, we follow previous work 95, 96 and use coherence, the co-occurrence of factor genes in held-out data. For a given factor, we consider the 50 top marker genes with the highest gene scores for that factor. Between every pair of genes in the top 50 markers, we compute the pointwise mutual information as:

$$\mathrm{PMI}(g_i, g_j) = \log\frac{p(g_i, g_j)}{p(g_i)\,p(g_j)}$$

where probabilities denote the empirical occurrences in the held-out data. Coherence is defined as the average of this quantity across all gene pairs in the marker gene list. This metric is used in Fig. 2f. To assess the coherence of Spectra and other methods, we allocated 9,787 cells as a hold-out set for computing coherence scores at evaluation time. The remaining 88,076 cells were used to fit the model. For each experiment, we subsampled the 88,076 cells in the training set to 10,000 cells without replacement (repeating this process 5 times to recapitulate the underlying data distribution). This number was chosen to be sufficiently large subject to the constraint that each of the methods under evaluation could run in a reasonable amount of time (< 2 days). For each subsampled dataset we computed the coherence score described above with the top 50 markers, where marker lists were determined via the method suggested by the individual papers. For scHPF we used the gene_scores function from the scHPF package to get the top 50 markers 4. For slalom, we multiplied the estimated parameter matrices, i.e. the continuous posterior mean 𝔼W and the Bernoulli posterior mean 𝔼Z, as in Buettner et al. 6. To evaluate non-negative matrix factorization, we derived marker lists based on the absolute values of the estimated factor matrix, as is standard practice 97.
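A minimal sketch of the coherence score may help make the definition concrete. It assumes that "occurrence" means a gene being detected in a held-out cell, represents each cell as a set of detected genes, and adds a small `eps` so that gene pairs that never co-occur do not produce log(0); the function name, the data layout and the smoothing are illustrative assumptions, not the evaluation code used in the paper.

```python
import math
from itertools import combinations

def coherence(markers, heldout_cells, eps=1e-12):
    """Average pairwise PMI of `markers` over held-out cells (sets of detected genes)."""
    n = len(heldout_cells)

    def p(*genes):
        # Empirical fraction of held-out cells in which all `genes` are detected.
        return sum(all(g in cell for g in genes) for cell in heldout_cells) / n

    scores = [
        math.log((p(gi, gj) + eps) / (p(gi) * p(gj) + eps))
        for gi, gj in combinations(markers, 2)
    ]
    return sum(scores) / len(scores)

# Hypothetical held-out data: genes A and B always co-occur, C never occurs with them.
cells = [{"A", "B"}, {"A", "B"}, {"C"}, {"C"}]
coherent = coherence(["A", "B"], cells)    # p(A,B)=0.5, p(A)=p(B)=0.5, PMI = log 2
incoherent = coherence(["A", "C"], cells)  # never co-detected, strongly negative
```

In the paper's setting `markers` would be the top 50 genes of a factor and `heldout_cells` the 9,787 held-out cells.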

Reconstruction of held out genes

To quantify the ability of methods to impute missing genes from gene sets, we ran Spectra and slalom (Section ’Single Cell RNAseq Data Preprocessing and Analysis’) on the full Bassez dataset but with randomly truncated gene sets. Due to slalom’s computational demands and size of the dataset, we choose a small set of 24 gene sets to evaluate for both methods, which are chosen a priori and held fixed throughout the experiment. We hold out 40 percent of genes (selected randomly) from the original set and measure the fraction of these genes recovered in the top 200 genes according to slalom and Spectra’s gene scores. In order to match factors to gene sets, for both methods we find the gene set (in our full database) with highest Szymkiewicz–Simpson overlap coefficient (overlap coefficient) to the given factor and label the factor as corresponding to that gene set. The overlap coefficient for the sets X and Y is defined as the size of the intersection divided by the size of the smaller set:

$$\mathrm{overlap}(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}$$

If two factors both have highest overlap coefficient to the same gene set, we take the one with higher overlap coefficient. The accuracy reported is the fraction of held out markers recovered in the top 200 highest gene scores (See Fig. 2e and 5a,b,c).
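The matching and scoring steps above can be sketched in a few lines. The helper names (`overlap_coefficient`, `match_factor`, `heldout_recovery`) and the toy gene names are hypothetical; the sketch only assumes that each factor provides a ranked gene list and that gene sets are given as a dictionary.

```python
def overlap_coefficient(x, y):
    """Szymkiewicz-Simpson overlap: |X ∩ Y| / min(|X|, |Y|)."""
    x, y = set(x), set(y)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))

def match_factor(top_markers, gene_sets):
    """Label a factor with the gene set that overlaps its top markers most."""
    return max(gene_sets, key=lambda name: overlap_coefficient(top_markers, gene_sets[name]))

def heldout_recovery(heldout_genes, ranked_genes, top_n=200):
    """Fraction of held-out genes found among the top_n genes of a factor."""
    top = set(ranked_genes[:top_n])
    return sum(g in top for g in heldout_genes) / len(heldout_genes)

# Hypothetical toy example.
gene_sets = {"set_1": {"B", "C", "D", "E"}, "set_2": {"X", "Y"}}
ranked = ["A", "B", "C", "Z"]                          # factor genes, best first
label = match_factor(ranked[:3], gene_sets)            # compares {A, B, C} to each set
oc = overlap_coefficient(ranked[:3], gene_sets["set_1"])
acc = heldout_recovery(["A", "Q"], ranked, top_n=2)    # only "A" is in the top 2
```

In the experiment described above, `top_n` is 200 and the held-out genes are the randomly removed 40 percent of each gene set.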

Simulation Experiments

Robustness to correlated factors

Matrix factorization methods rely on reconstruction based objective functions that implicitly encourage the estimation of a diverse set of gene programs. As a result, when gene programs are expressed in similar contexts (e.g. CD8 T cell activation, exhaustion, and tumor reactivity or TNF and IFN type II responses) matrix factorization methods often return a single program representing the combined set of correlated programs. Further, as the correlation between gene programs increases, the effective sample size of the estimation problem decreases, as most cells do not provide information to separate the gene programs. To illustrate that Spectra can incorporate prior information to maintain robust estimation in the presence of highly correlated gene programs, we simulated gene expression data from a generic factor analysis model where the cell loadings corresponding to factors 1 and 2 are simulated from a joint log Normal distribution with non-zero correlation terms ranging from 0.25 to 0.99 (Extended Data Fig. 5d). Factors themselves were simulated from a half-Cauchy distribution to achieve realistic levels of sparsity and extreme values. Conditional on simulated factors and loadings, gene expression was simulated from Poisson distribution with mean given by the matrix product of loadings and factors. A noisy prior knowledge graph was simulated by sampling the adjacency matrix from a Bernoulli distribution with parameters given by inner products between factors (as in the Spectra model) and used as input to Spectra. For each value of the correlation, we simulated 10 datasets and ran Spectra (lambda = 0.1), NMF, scHPF and Slalom (20 top genes per factor as input). We quantified estimation accuracy by the mean Pearson correlation of ground truth factors with estimated factors across genes, both for the two correlated factors and for a third factor uncorrelated with the first two. 
While the unbiased methods, NMF and scHPF, correctly recover the factors when factors are weakly correlated, estimation accuracy deteriorates as the correlation increases (though the inaccurate estimation of the correlated factors does not hurt performance on the uncorrelated factors). Spectra’s utilization of prior knowledge allows it to separate highly correlated factors.

In more detail, in our comparative simulation study factors are correlated in the sense that they tend to be expressed by the same cells (Extended Data Fig. 5d). We simulate ground truth factor matrices with p features and K factors with each entry independently distributed according to a half-Cauchy distribution (chosen to obtain realistically sparse factor matrices). In order to obtain correlated factors, the factor loadings, α, are independently drawn from a correlated LogNormal distribution:

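$$
\begin{bmatrix}\alpha_1\\ \alpha_2\\ \vdots\\ \alpha_K\end{bmatrix}
\sim \mathrm{LogNormal}\left(
\begin{bmatrix}0\\ 0\\ \vdots\\ 0\end{bmatrix},
\begin{bmatrix}
1 & \rho & 0 & \cdots & 0\\
\rho & 1 & 0 & \cdots & 0\\
0 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}
\right)
$$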

If we denote the $N \times K$ loading matrix by $\alpha$ and the $K \times p$ factor matrix by $\theta$, the count data are simulated as $X \sim \mathrm{Pois}(\alpha\theta + \epsilon)$, where $\epsilon$ is a random noise term with variance $\sigma^2$. An adjacency matrix is sampled coordinate-wise as $A \sim \mathrm{Bern}(\tilde{\theta}^{\top}\tilde{\theta})$, where $\tilde{\theta}_j = \theta_j / \sum_{k=1}^{K}\theta_{jk}$. We run 10 independent trials for 7 different levels of correlation, $\rho \in \{0.25, 0.5, 0.7, 0.85, 0.9, 0.95, 0.99\}$, totaling 70 simulated datasets. Since each of NMF, scHPF, slalom and Spectra estimates a factor matrix, we compared the estimated (normalized) factor matrices to $\tilde{\theta}$ via Pearson correlation (y-axis of Extended Data Fig. 5d) after resolving the permutation of estimated factors that is closest to ground truth. The permutation for each estimate is resolved by finding the permutation that maximizes the average Spearman correlation between ground truth factors and estimates. Since we are interested in performance on correlated factors, we report the average correlation between estimate and ground truth for the first two correlated factors, across the 10 independent trials.
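The permutation-resolution step can be sketched as a brute-force search, which is feasible for the K=3 factors used here. This is an illustrative stand-in, not the authors' code: the helper names are hypothetical, Spearman correlation is computed from ordinal ranks without tie averaging, and for larger K an assignment solver would be the more practical choice.

```python
from itertools import permutations

def ranks(values):
    """Ordinal ranks (0-based); ties are broken by position, an assumption."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    return pearson(ranks(a), ranks(b))

def best_permutation(truth, estimate):
    """perm[i] is the index of the estimated factor matched to true factor i."""
    k = len(truth)
    return max(
        permutations(range(k)),
        key=lambda perm: sum(spearman(truth[i], estimate[perm[i]]) for i in range(k)) / k,
    )

# Hypothetical toy factors over 4 genes; `estimate` is a shuffled, rescaled copy.
truth = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]]
estimate = [[0.1, 0.3, 0.2, 0.9], [10, 20, 30, 40], [8, 6, 4, 2]]
perm = best_permutation(truth, estimate)
```

Once `perm` is known, the reported accuracy is the Pearson correlation between `truth[i]` and `estimate[perm[i]]`.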

In our experiment, $N=20$, $p=500$, $K=3$, $\sigma^2=4$ (a setting with low signal-to-noise ratio). Spectra is provided with the simulated matrix $A$, while slalom is provided with feature sets containing the correct top 20 features of each factor. Spectra uses a $\lambda$ value of 0.1 and a $\delta$ value of 0. All methods are run with the correctly specified number of factors and with default parameters.

Biasedness of gene set averaging for overlapping gene sets

When gene sets corresponding to gene programs overlap, simple gene set averaging approaches produce false positive program activity calls. To illustrate this phenomenon, we simulated gene expression data driven by sets of overlapping gene programs with varying degrees of gene set overlap and showed that gene set averages are increasingly biased proxies for program activity as the degree of gene set overlap increases (Extended Data Fig. 5e). Specifically, we simulated factor matrices with known sparsity pattern determined by a set of gene sets (each non-zero entry independently Exponential(16)). Each gene set is designed to have overlap coefficient ρ with at least one other gene set, with ρ ranging from 0 to 0.75. Loadings are generated by first sampling each coordinate logNormal(0, 1) independently, and then zeroing out components that are <1 to induce sparsity. Simulated expression data is from a Poisson distribution with mean given by the matrix product of simulated loadings and factors.

For each possible value of the overlap coefficient ρ in {0.0, 0.3, 0.5, 0.75}, we created 3 simulated datasets and ran both Spectra and score_genes with the ground truth gene sets. Accuracy is measured by the Pearson correlation of estimated cell scores (or score_genes estimates) and the ground truth factor loadings from the data generation process (y-axis of Extended Data Fig. 5e). In this experiment, the gene set size is fixed to 20, the number of gene sets is 10, the number of features (genes) is 500, and the number of observations is 1000.

Recovery of active gene sets

We compared Spectra to slalom, another factor analysis method that uses prior information in the form of gene sets, in a simulation experiment where we measured the ability of each method to recover the gene sets involved in the true data generating process. Here, we closely followed the simulation settings of Buettner et al. 6. First, background factors are generated from an Exponential distribution, as before. To simulate sparsity, entries smaller than 2 are zeroed out. Next, loadings are drawn independently from a LogNormal distribution and entries <1 are zeroed out. We then generate both active and control gene set based factors as in Buettner et al. 6 and Section ’Biasedness of gene set averaging for overlapping gene sets’, where gene sets overlap with overlap coefficient 0.3. Loadings corresponding to active gene set based factors are also drawn from a standard LogNormal and zeroed out if less than 1. Next, genes are randomly added to and removed from gene sets to achieve a False Positive Rate (FPR) of 0.2 and a False Negative Rate (FNR) of 0.2. As a measure of success we use the area under the ROC curve (AUC) based on slalom’s relevance score and Spectra’s average cell score for a given factor. Spectra was robust to increasing the number of gene sets, while slalom suffered a drop in AUC as the number of active gene sets increased (Extended Data Fig. 5f).

In our experiments, the number of active pathways varies along the x-axis of Extended Data Fig. 5f, the number of control pathways is 5, the gene set size is 20, the number of genes is 300, the number of cells is 300, the number of unbiased factors is 5, and the gene set overlap coefficient is 0.3.

Runtime and memory benchmarks

All memory and runtime benchmarks were performed on simulated data to allow for precise control over the number of cells, genes, factors and cell types. Data was simulated as in Section ’Simulation Experiments’, closely following the settings described in 6. To benchmark run time and memory with respect to the number of cells, we varied the number of cells in our simulated data over {300, 1k, 5k, 10k, 25k, 75k, 100k, 200k}. The number of genes was held constant at 2,000. To benchmark the methods on the number of gene sets, we varied the n_active_pathways parameter in our simulation over {10, 20, 50, 100, 200} gene sets. To keep the gene set size constant at 20 with an overlap coefficient ρ=0.3 between gene sets, we increased the number of genes to 3,000. We used 25K cells for all of these experiments. To benchmark Spectra GPU and CPU on the number of cell types, we varied the number of cell types over {2, 4, 8, 16, 32, 64}. All of these experiments were run using 25K cells and 3K genes. Because the number of epochs until convergence varies, we forced both Spectra CPU and GPU to run for the default 10K epochs to study a pessimistic but low-variance run time quantity, although convergence was generally achieved between 2K and 7K epochs. All CPU methods were run on 5 CPU cores (Intel Xeon Gold 6230 at 2.10 GHz), while all GPU methods were run on an NVIDIA A100 Tensor Core GPU.

A human immunology knowledge base

Databases such as the Gene Ontology Resource (GO) 98, the Molecular Signatures Database (MSigDB) 99, the Kyoto Encyclopedia of Genes and Genomes 100 and the Reactome database 101 contain thousands of gene sets and their relationships, but they are noisy and often do not distinguish whether or not genes are transcriptionally regulated. For example, many genes with signaling pathway annotations are regulated at the post-translational level by phosphorylation or subcellular localization. Expression signatures in these databases are often derived from bulk sequencing data, which may not represent responses in individual cells. Moreover, the databases do not have a framework for distinguishing which gene sets are cell-type specific. To address these issues, we created an immunology knowledge base with the following criteria:

  1. Genes within a gene set define a cellular process at the transcript level

  2. Gene sets represent cellular processes at the single cell level

  3. Gene sets can be specific to a defined cell type

Our knowledge base includes 181 gene sets representing ‘cellular processes’ to be queried by Spectra. Like Spectra, the knowledge base models gene sets as a graph wherein every gene set is a node connected to all individual gene nodes within the set, as well as to a cell type node. Cell type nodes (currently 50) are connected to ’cellular identity’ gene sets, one for each cell type, which contain marker genes for their connected cell type. Metadata such as the scientific publication from which the gene set was derived, the gene set version, and the original gene set authors are stored as node properties. Cell type nodes are organized in a hierarchy, reflecting that cell types are frequently subsets of other cell types. This hierarchy starts with a cell type node labeled ’all-cells’, to which gene sets for ’cellular processes’ occurring in all cell types are connected. Thus, the knowledge base can be queried for a cellular process that can be found in all cell types (e.g. glycolysis) or a cellular process specific to a cell type, such as T cell receptor signaling, which is only present in T cells. It also allows retrieving ’cellular identity’ marker gene sets which define the queried cell type.

Within this resource, 150 cellular processes apply to all leukocytes and 31 apply to individual cell types. Of all 231 gene sets, 97 were gene sets newly curated from the literature, 14 used data from perturbation experiments, 11 were adopted from literature with modifications and 123 were taken from the literature and external databases without changes. Gene sets correspond to diverse cellular identities (n=50) and cellular processes such as homeostasis (n=9), stress response (n=3), cell death and autophagy (n=18), proliferation (n=6), signaling (n=12), metabolism (n=90), immune function (n=22), immune cell responses to external stimuli (n=18) and hemostasis/coagulation (n=3) (Fig. 1b). We designed the gene sets for cellular processes to have comparable size (median n=20 genes per gene set) and relatively little overlap (median pairwise overlap coefficient 40%) to enable dissection of a large number of cellular processes and avoid gene set size-driven effects.

To specify Spectra input, the user first defines cell types at a granularity of interest in their single-cell expression data, and retrieves the cell-type-specific cellular process gene sets and gene sets applying to all cell types from the knowledge base. Next, they can select cellular process gene sets pertaining to all cell types in the dataset, which should be set as ‘global’ in the Spectra model.

The user indicates which cellular processes can be considered global based on which cell types are present in the dataset under study. For example, if a dataset only contains T cells, all cellular processes pertaining to leukocytes and T cells should be considered global. If cellular processes apply to more than one but not all cell types in the data there are two options:

  1. The gene set can be multiplied and one copy can be assigned to each cell type. This will ensure that the cell scores of the resulting factors will be specific to those cell types but may result in separate factors for the same cellular process in each cell type.

  2. The gene set can be set as global which will generally result in one factor. However, cell scores for this factor may be detected in other cell types also.

Users can take advantage of the hierarchical organization of cell types in the knowledge base by adding the children or parent classes of selected cell types. For example, cellular processes for both ‘CD4 T cells’ and its parent, ‘T cells’ can be retrieved and assigned to ‘CD4 T cells’, making it possible to find broader processes (e.g. T-cell receptor signaling) that are specific to CD4 T cells. Alternatively, cellular processes for CD4 T cells and for CD4 subtypes ‘TH1’, ‘TH2’ and ‘TH17’ can be retrieved and assigned to ‘CD4 T cells’, thereby pooling rare cell types which may not contain enough training data for the Spectra model to converge to a generalizable solution. Moreover, hierarchical classification is advantageous when cellular processes are ambiguously or incorrectly assigned. For example, CD4 T cell subtypes are often presented as distinct lineages with distinct cellular processes. However, mixed CD4 T cell subtypes have been reported, such as cells possessing both TH1 and TH2 polarization cellular processes 102, suggesting that CD4 T cells can be described using combinations of purportedly subtype-specific processes.

Our human immunology knowledge base is available on GitHub as the Cytopus python package https://github.com/wallet-maker/cytopus 103. Users can load our default or a custom knowledge base using the KnowledgeBase class, which is built on a NetworkX object. Cytopus includes methods to retrieve gene sets and corresponding cell types, visualize them as a graph and convert them into a Spectra-compatible dictionary. The celltypes method retrieves a list of available cell types; processes generates a dictionary of all ’cellular processes’ gene sets; and identities generates a dictionary of all ’cellular identity’ gene sets. Gene set metadata (e.g. author, topic, date of generation, version) can be accessed as node properties of the gene sets. The get_celltype_processes method retrieves cell-type-specific ’cellular processes’ based on a user-provided list of cell types at the desired granularity (generally all cell types contained in the data).

Full Cytopus documentation can be found at https://github.com/wallet-maker/cytopus/. The tumor infiltrating leukocyte gene sets used in the paper are included in the Spectra package (https://github.com/dpeerlab/Spectra) as well as in Supplementary Table 1.

Single Cell RNAseq Data Preprocessing and Analysis

PBMC dataset

The authors of the original study 8 performed scRNAseq on peripheral blood mononuclear cells (PBMC) from 4 healthy donors after incubation (16h) in interferon gamma (IFNg), lipopolysaccharide (LPS), or the protein kinase C and TCR stimulation mimetic phorbolmyristate (PMA). For 3 donors and the 6h timepoint they added Golgi inhibitors (GI), which prevent exocytosis of cytokines from PBMC upon perturbation and thus secondary paracrine signaling events, and compared them to controls treated with GI alone. The gene expression changes in the GI treated perturbations compared to the GI treated controls can therefore be attributed to direct signaling of the applied perturbations alone. We used the GI treated conditions as a ground truth to benchmark factorization methods.

We obtained preprocessed count matrices (23,754 cells, 4 patients) from the Gene Expression Omnibus (accession number: GSE178431) for the Kartha et al. peripheral blood mononuclear cell (PBMC) perturbation scRNAseq dataset. We normalized gene expression counts to median library size and log1p transformed the data. We selected the 10,000 most highly variable genes using scanpy’s pp.highly_variable_genes function with the seurat_v3 method on raw counts. To avoid discarding genes relevant to cell typing, we added a manually curated list of cell typing markers to the HVGs (Supplementary Table 5). We then calculated the neighborhood graph on the first 50 principal components computed from these highly variable genes, which explained 27.87% of the total variance, and calculated a UMAP embedding on this neighborhood graph.
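The normalization step (scaling each cell to the median library size, then applying log1p) can be written out in plain Python as a sketch. The function name and the list-of-lists data layout are assumptions for illustration; in a scanpy workflow the same operation is usually expressed with `sc.pp.normalize_total` (whose default target is the median total count) followed by `sc.pp.log1p`.

```python
import math
from statistics import median

def normalize_log1p(counts):
    """Scale each cell to the median library size, then apply log(1 + x)."""
    lib_sizes = [sum(cell) for cell in counts]
    target = median(lib_sizes)
    out = []
    for cell, size in zip(counts, lib_sizes):
        scale = target / size if size > 0 else 0.0  # leave empty cells at zero
        out.append([math.log1p(c * scale) for c in cell])
    return out

# Hypothetical toy matrix: 3 cells x 2 genes with library sizes 2, 8 and 20.
raw = [[1, 1], [2, 6], [10, 10]]
norm = normalize_log1p(raw)
```

After this step every non-empty cell has the same total on the un-logged scale (here the median library size of 8), so differences in sequencing depth no longer dominate downstream variance.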

To get coarse immune cell types, we clustered the data in principal component space using the scanpy implementation of PhenoGraph. We chose k=40 for PhenoGraph because of its ability to delineate immune cell from non-immune cell populations while showing stable clustering in a window of adjacent k values (pairwise Rand indices >0.7). We then annotated the clusters into coarse immune cell types (monocyte, T/innate lymphoid cells, B cells/plasma blasts) by assessing the mean marker gene expression per cluster (Supplementary Table 2).

Running Spectra

To run Spectra, we retrieved 188 input gene sets pertaining to PBMC data from the newest version of our Cytopus knowledge base (10.5281/zenodo.7306238). This included gene sets for signaling/response programs to the ground truth perturbations (IFN-γ response, LPS signaling in monocytes/macrophages, TCR activation in T cells). We fitted Spectra on the union of the 10,000 most highly variable genes and the input gene sets, for a total of 11,840 genes, using the following parameters: lambda 0.01, delta 0.001, rho 0.001. We obtained a total of 196 factors and found one factor for each of the input gene sets related to each perturbation according to our criteria below (overlap coefficient of top 50 marker genes with input gene set >0.2). We calculated the average cell score per cell type and sample and compared the perturbed and unperturbed conditions. Spectra was run on a compute cluster with 64 CPU cores (Intel Xeon Gold 6230 CPU at 2.10 GHz) and 512 GB RAM.

Running Slalom

For Slalom, expression data was preprocessed identically to Spectra. Because Slalom’s runtime scales linearly with the number of gene sets (Fig. 2g), we had to subset the number of gene sets used in order to run Slalom on our dataset of interest. We provided Slalom only with the three gene sets corresponding to the investigated perturbations (lipopolysaccharide, interferon-gamma, phorbolmyristate) plus 10 additional factors. These gene sets were used to determine the I parameter of the slalom initFA() function. The following additional input parameters were used: nHidden=0, nHiddenSparse=0, do_preTrain=False, pruneGenes=False, with all other options set to default values.

Running expiMap

For expiMap expression data was preprocessed identically to Spectra. We used expiMap’s default parameters as shown in the tutorials of the scArches github repository (github.com/theislab/scarches, version of March 20th, 2023). We provided expiMap with the same gene sets as Spectra. When using the default parameters, expiMap cannot learn new genes involved in gene programs related to the input gene sets, nor can it learn new factors in the reference data, but only in the mapped query.

Immunooncology datasets

To study Spectra in an immuno-oncology context we used two published scRNA-seq datasets of tumor-infiltrating leukocytes from female breast cancer patients treated with immunotherapy. We chose this immuno-oncology context for multiple reasons:

  1. The abundant prior knowledge of cellular processes and well-characterized cell types in tumor infiltrating leukocytes enabled us to leverage the full power of gene set and cell type priors.

  2. The availability of before- and on-treatment samples to test the sensitivity of factor cell scores to environmental perturbation with anti-PD-1/PD-L1 therapy

  3. The clinical need for detecting cellular processes affected by anti-PD-1 in humans to improve current immunotherapy strategies.

  4. The availability of two studies in similar biological settings to enable validating findings in an independent dataset.

Bassez dataset

The study by Bassez et al. 10 was a prospective window-of-opportunity study reporting scRNAseq as an exploratory endpoint (’Bassez dataset’). The authors analyzed scRNAseq data from whole tumor single-cell suspensions from 42 operable breast cancer patients before and after anti-PD-1 immunotherapy (Pembrolizumab, NCT03197389). Patients received neoadjuvant chemotherapy (CTX) as per standard of care (CTX n=11, no CTX n=31) followed by a single dose of anti-PD-1. Breast resections were performed 7-14 days after anti-PD-1. Tissue from pre-anti-PD-1 biopsies (7-15 days before surgery) and from surgical resections was processed for scRNAseq. As a surrogate for response to therapy, the authors of the original study quantified the clonal expansion of T-cells under therapy on the patient level using paired single cell T-cell receptor sequencing (scTCR-seq); we used these annotations to find response-associated cellular processes in the data. The authors categorized patients as either exhibiting (responders) or lacking (non-responders) T-cell clonal expansion under therapy. To classify patients, they quantified the number of expanding T cell clones (T-cells with identical T-cell receptor (TCR) sequences) per patient, labeling patients with > 30 expanding clones as responders and those with ≤ 30 expanding clones as non-responders. A T-cell clone had to fulfill two criteria to be labeled as expanded:

  1. Detected at least twice in the patient’s on-treatment sample

  2. More frequent in the patient’s on-treatment as compared to the pre-treatment sample either by the absolute cell number in that clone or by the cell number in that clone relative to the number of cells with a TCR detected.
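The two criteria and the 30-clone threshold can be expressed as a short sketch. It is an illustrative reading of the rules as stated above, not the original study's code: the function names, the dictionary layout (clone identifier mapped to cell count) and the use of the summed clone counts as "number of cells with a TCR detected" are assumptions.

```python
def is_expanded(pre_count, on_count, pre_total, on_total):
    """Apply the two expansion criteria to one T-cell clone."""
    if on_count < 2:                      # criterion 1: seen at least twice on treatment
        return False
    more_absolute = on_count > pre_count  # criterion 2a: more cells in absolute terms
    pre_freq = pre_count / pre_total if pre_total else 0.0
    on_freq = on_count / on_total if on_total else 0.0
    return more_absolute or on_freq > pre_freq  # criterion 2b: higher relative frequency

def classify_patient(pre_clones, on_clones, threshold=30):
    """Return (label, number of expanded clones) for one patient."""
    pre_total = sum(pre_clones.values())  # cells with a TCR detected (assumption)
    on_total = sum(on_clones.values())
    n_expanded = sum(
        is_expanded(pre_clones.get(clone, 0), n, pre_total, on_total)
        for clone, n in on_clones.items()
    )
    label = "responder" if n_expanded > threshold else "non-responder"
    return label, n_expanded

# Hypothetical toy patient: only clone c1 satisfies both criteria.
pre = {"c1": 1, "c2": 5}
on = {"c1": 3, "c2": 2, "c3": 1}
label, n_expanded = classify_patient(pre, on)
```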

Zhang dataset

The study by Zhang et al. 29 was a retrospective clinical study analyzing tumor-infiltrating leukocyte scRNAseq data of pre-, on- and post-therapy samples from 15 advanced female breast cancer patients receiving either anti-PD-L1 (atezolizumab) combined with chemotherapy (paclitaxel, n=8) or chemotherapy alone (paclitaxel, n=7) (’Zhang dataset’). Notably, patients received corticosteroid pre-medication for paclitaxel. The authors assessed patient response to immunotherapy using radiological response according to RECIST v1.1 criteria 104. RECIST v1.1 are the standard criteria used for drug-approval relevant clinical trials and standard clinical management of patients with metastatic solid tumors. The RECIST criteria classify patients into responders (combining partial and complete response labels) and non-responders (combining progressive and stable disease labels) based on the change in the sum of tumor lesion diameters under therapy. We used this classification to identify response-associated cellular processes in the Zhang dataset.

Processing strategy

To minimize systematic differences in cell type annotations and normalization of gene expression counts, we performed the same pre-processing for the Bassez 10 and Zhang 29 data. After basic filtering, removing residual low-quality cells and doublets and subsetting to leukocytes with scanpy 2, we normalized the data using scran 84. We hierarchically annotated cell types in the data, first labeling major immune subsets (T/ILC cells, B/plasma cells, myeloid cells) by clustering on the most dominant principal components (PCs) only. We then partitioned the data into these major immune subsets, re-normalized the data within every subset using scran, and clustered on more PCs to annotate granular cell types. We then combined the annotated data from major immune subsets for joint analysis using Spectra. We have outlined the details of the analysis strategy below.

Retrieving single cell gene expression data

Count matrices of the Bassez 10 study were kindly provided by the authors and are also available at http://biokey.lambrechtslab.org (226,635 cells). Raw read counts are available in the European Genome-phenome Archive (EGA, https://ega-archive.org/) under accession numbers EGAS00001004809 and EGAD00001006608. The count matrices for the Zhang 29 study were downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) using the accession number GSE169246 (489,490 cells).

Removing low quality cells

To prepare the data for clustering, we removed cells with fewer than 200 genes per cell and genes observed in fewer than 20 cells, as well as mitochondrial and ribosomal genes. This filtering procedure removed 2971 and 203 genes, resulting in a total of 22,639 and 20,898 genes in the Bassez 10 and Zhang 29 data, respectively. We defined doublets in the data by running DoubletDetection 105 for each sample individually using standard parameters (clustering algorithm: PhenoGraph, p-value threshold: 1e-16, voter threshold: 0.5). DoubletDetection detected 3270 (1.4%) and 12,760 (2.6%) doublets as well as 27 (0.01%) and 8 (0.001%) ambiguous doublets in the Bassez 10 and Zhang 29 data, respectively, which we removed from the data.

Retrieving tumor-infiltrating leukocytes for downstream annotation

While the Zhang 29 data contained sorted tumor infiltrating leukocytes, the Bassez 10 data contained unsorted whole tumor single cell suspensions. To retrieve immune cells from the Bassez 10 data for downstream annotation, we first performed standard median library size normalization and log1p-transformed the data so that the normalized expression of every gene j in cell i is $x_{ij}$ and the median of the sum of gene expression counts per cell is $\mathrm{med}\left(\sum_{j=1}^{n} x_{j}\right)$:

$$x_{ij} = \ln\left(\mathrm{med}\left(\sum_{j=1}^{n} x_{j}\right) \cdot \frac{x_{ij}}{\sum_{j=1}^{n} x_{ij}} + 1\right)$$
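The normalization above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the pipeline code used in the study: the function name `median_log1p_normalize` and the toy count matrix are hypothetical, and cells with zero total counts are assumed to have been filtered out beforehand.

```python
import math
from statistics import median


def median_log1p_normalize(counts):
    """Median library size normalization followed by ln(x + 1).

    counts: list of cells, each a list of raw gene counts (cells x genes).
    ASSUMPTION: every cell has a non-zero total count.
    """
    totals = [sum(cell) for cell in counts]
    med = median(totals)  # median library size across cells
    normalized = []
    for cell, total in zip(counts, totals):
        # Scale the cell to the median library size, then log-transform.
        normalized.append([math.log(med * x / total + 1.0) for x in cell])
    return normalized


# Toy example: three cells, three genes (hypothetical values).
toy = [[1, 1, 2], [2, 2, 4], [10, 0, 10]]
norm = median_log1p_normalize(toy)
print(norm)
```

In the toy matrix the first two cells have the same composition but different library sizes, so they receive identical normalized profiles.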

We then clustered the data using PhenoGraph 106 on the most dominant principal components which we selected using the knee point of the PC vs. explained variance curve (calculated using the kneed package v.0.7.0 107) or the lowest number of PCs explaining ≥ 20% of the total variance whichever was higher. Using this procedure we clustered the data with PhenoGraph on the first 26 PCs explaining 20.1% of total variance. We chose k=80 parameter for PhenoGraph because of its ability to delineate immune cell from non-immune cell populations while showing stable clustering in a window of adjacent k parameters (pairwise rand indices >0.7). We then subsetted leukocytes for further analysis by their marker gene expression (myeloid cells, T cells, innate lymphoid cells = ILC, B cells and plasma cells) per PhenoGraph cluster (Supplementary Table 2).
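The rule for choosing the number of PCs (the knee point or the smallest number of PCs explaining at least 20% of the variance, whichever is higher) can be sketched as follows. The function names and the toy variance ratios are hypothetical, and the knee point is passed in as a plain integer because the study obtained it from the kneed package; this sketch does not reimplement knee detection.

```python
def n_pcs_for_variance(variance_ratio, threshold=0.20):
    """Smallest number of leading PCs whose cumulative explained
    variance ratio reaches the threshold."""
    cumulative = 0.0
    for i, ratio in enumerate(variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return i
    return len(variance_ratio)  # threshold never reached: use all PCs


def select_n_pcs(variance_ratio, knee, threshold=0.20):
    """Take whichever is higher: the externally computed knee point or the
    number of PCs needed to reach the variance threshold."""
    return max(knee, n_pcs_for_variance(variance_ratio, threshold))


# Hypothetical explained variance ratios, sorted in decreasing order.
ratios = [0.08, 0.06, 0.04, 0.03, 0.02, 0.01, 0.01]
print(select_n_pcs(ratios, knee=2))  # variance rule dominates
print(select_n_pcs(ratios, knee=6))  # knee point dominates
```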

Annotating tumor-infiltrating leukocytes

We re-normalized leukocytes in the Bassez 10 and Zhang 29 data using scran because median library size normalization can generate artificial differential gene expression between cells of different library size such as leukocytes 84. After testing all genes and a range between 5,000 and 15,000 HVG, we selected the top 15,000 highly variable genes (HVG) for the Bassez 10 and all genes for the Zhang 29 data, which led to the best separation of major immune cell subtypes, using scanpy’s pp.highly_variable_genes function with the seurat_v3 method on raw counts. To avoid discarding genes relevant to cell typing we added a manually curated list of 458 cell typing markers to HVGs (Supplementary Table 5). We then repeated the clustering procedure outlined above (Bassez 10: 24 PCs explaining 20.1% of total variance; Zhang 29: 52 PCs explaining 20% of total variance, k=50) and annotated major immune cell subsets (T/ILC cells, B/plasma cells, myeloid cells) by assessing their mean marker gene expression per cluster (Supplementary Table 2). To obtain more granular annotations, we partitioned the data into major immune subtypes (T, ILC, B/plasma, myeloid), re-normalized each subtype using scran, re-calculated HVGs and PCs and clustered as described above. The processing parameters for each subtype are indicated in Table 3 below:

Table 3:

Clustering parameters for immune subtypes

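| | Bassez TNK | Bassez B | Bassez M | Zhang TNK | Zhang ILC | Zhang B | Zhang M |
|---|---|---|---|---|---|---|---|
| HVGs | 7,500 | 7,500 | 15,000 | 19,379 | 19,379 | 10,000 | 18,888 |
| PCs | 24 | 10 | 16 | 100 | 100 | 17 | 23 |
| variance [%] | 20.3 | 20.1 | 20.1 | 17.9 | 17.7 | 20.4 | 20.1 |
| k | 30 | 40 | 20 | 40 | 40 | 60 | 30 |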

We then annotated granular immune cell types by assessing the mean marker gene expression per cluster (Supplementary Table 2). In the Bassez 10 data, we detected clusters with low library size and lower complexity of gene-gene correlation patterns at this step (5,509 cells) which we removed from the data. Finally, we combined the annotated major immune subtypes for downstream joint analysis.

Running Spectra

After the filtering and preprocessing steps above, the Bassez 10 data had 97,863 cells and the Zhang 29 data 150,985 cells. To run Spectra, we restricted the number of genes using scanpy’s highly_variable_genes function with the cell_ranger method, selecting the 3,000 most highly variable genes. We removed several genes which are highly abundant and may originate from ambient RNA in many cell types, thus adding noise to the analysis. This included mitochondrial, ribosomal, immunoglobulin (genes starting with IGHM, IGLC, IGHG, IGHA, IGHV, IGLV, IGKV), T cell receptor variable domains (genes starting with TRBV, TRAV, TRGV, TRDV), and hemoglobin genes (genes starting with HB). The total number of genes used (the union of genes included in a gene set and highly variable genes) was 6397 for Bassez 10 and 6398 for Zhang 29.
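The prefix-based gene removal can be illustrated with a short sketch. The immunoglobulin, T cell receptor and hemoglobin prefixes are those listed in the text; the mitochondrial and ribosomal prefixes (`MT-`, `RPS`, `RPL`) are an assumption based on common gene symbol conventions, since the text does not state how those genes were matched. The function name and example symbols are hypothetical.

```python
# Prefixes listed in the text for immunoglobulin, TCR variable-domain
# and hemoglobin genes.
LISTED_PREFIXES = (
    "IGHM", "IGLC", "IGHG", "IGHA", "IGHV", "IGLV", "IGKV",
    "TRBV", "TRAV", "TRGV", "TRDV",
    "HB",
)
# ASSUMPTION: conventional symbol prefixes for mitochondrial and ribosomal
# genes; the text does not specify how these were identified.
ASSUMED_PREFIXES = ("MT-", "RPS", "RPL")


def filter_genes(genes, prefixes=LISTED_PREFIXES + ASSUMED_PREFIXES):
    """Return the genes whose symbol does not start with an excluded prefix."""
    return [g for g in genes if not g.startswith(prefixes)]


example = ["CD3E", "IGHG1", "TRBV7-2", "HBB", "MT-CO1", "RPL13", "TOX"]
kept = filter_genes(example)
print(kept)
```

Note that a literal `HB` prefix match is broad: it would also drop symbols such as HBEGF that are not hemoglobin genes, so in practice the matched genes are worth inspecting.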

181 gene sets from our knowledge base were then converted into weighted adjacency matrices. One of Spectra’s strongest features is its ability to meaningfully modify the input gene-gene knowledge graph (gene sets) in a data-driven manner: with the influence parameter λ set to 0.01, the median overlap coefficient across all factors in the Bassez 10 data was 88%, with 25% of factors relevantly deviating from the gene sets (overlap < 70%), and 7% of factors bearing little resemblance to the input gene sets (overlap < 20%). With the influence parameter set to 0.1, the median overlap coefficient across all factors was 82% with 42% of factors relevantly deviating from the gene sets (overlap < 70%) and 12% of factors with overlap less than 20%. In terms of graph edit distance to the input graph (defined as the mean absolute difference between input and output graphs), we obtained 0.011 at λ=0.1 and 0.0095 at λ=0.01, with diminishing returns in graph edit distance for lower λ (0.0095 again for λ=1e-4 and 0.0094 for λ=1e-5). For the analyses described below, we used an influence parameter λ between 0.1 and 0.01 depending on whether we wanted more (0.01) or less (0.1) adherence to the input gene sets. Because we obtained very similar results with these parameters in two independent datasets, it is likely that this also constitutes a good default for other datasets. Spectra was run on a compute cluster with 64 CPU cores (Intel Xeon Gold 6230 CPU at 2.10 GHz) with 512 GB RAM.
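The graph edit distance used here is described as the mean absolute difference between input and output graphs. A minimal sketch of that quantity, assuming both graphs are dense weighted adjacency matrices of equal size over the same ordered genes (the function name and toy matrices are hypothetical):

```python
def mean_absolute_difference(graph_in, graph_out):
    """Mean absolute entrywise difference between two equally sized
    adjacency matrices given as lists of rows."""
    total = 0.0
    n_entries = 0
    for row_in, row_out in zip(graph_in, graph_out):
        for w_in, w_out in zip(row_in, row_out):
            total += abs(w_in - w_out)
            n_entries += 1
    return total / n_entries


# Toy 2 x 2 example: an input gene-gene graph and a learned graph.
input_graph = [[0.0, 1.0], [1.0, 0.0]]
learned_graph = [[0.0, 0.5], [0.5, 0.25]]
distance = mean_absolute_difference(input_graph, learned_graph)
print(distance)
```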

Running Slalom

Because Slalom’s runtime scales linearly with the number of gene sets (Fig. 2g), we had to subset the number of gene sets used in order to run Slalom on our datasets of interest (n=20 gene sets, runtime 63.49 CPU hours, 40 GB memory on the Bassez 10 dataset). Expression data was preprocessed identically to Spectra. To compare results of Spectra and Slalom we chose a subset of 20 gene sets of scientific relevance to the immune microenvironment under immune checkpoint blockade for the Bassez 10 and Zhang 29 datasets: CD8 T cell tumor reactivity, Type II interferon response, Myeloid angiogenic effectors, Post-translational modification, MHC class I presentation, G2M transition, Oxidative phosphorylation, Type I interferon response, Macrophage IL4/IL13 response, Glycolysis, DNA synthesis, G1S transition, Lysine metabolism, MHC class II presentation, Hypoxia response, Pentose phosphate pathway, CD8 terminal exhaustion, PD-1 signaling, TCR activation, and cytotoxicity effectors. These gene sets were used to determine the I parameter of the slalom initfa() function. The following additional input parameters were used: nHidden=0, nHiddenSparse=0, do_preTrain=False, mingenes = 1, pruneGenes = False, with all other options set to default values.

Running scHPF

scHPF was run with the following commands, following the defaults in the class constructor of the scHPF package. from scHPF import *; model = scHPF(nfactors = K) ; model.fit(X)

Running expiMap

When using the default parameters, expiMap cannot learn new genes involved in gene programs related to the input gene sets, nor can it learn new factors in the reference data, but only in the mapped query. We note that most demonstrations in the expiMap manuscript are based on these default parameters, and do not involve adaptation to the data.

For the immunology datasets, where the specific task evaluated involved learning new genes and new factors, we modified the default parameters according to an expiMap author’s recommendations (personal communication) 9. We refer to this mode as soft mode. These parameters include setting soft_mask = True in the expiMap model scarches.models.EXPIMAP as well as setting an L1 penalty using the alpha_l1 parameter of the model.train method, which enables the latent nodes to use genes absent from the input gene sets with an L1 regularization. The alpha_l1 parameter was increased in steps of 0.1 starting from 0.5 until the share of inactive genes exceeded 0.95 (this information can be visualized by the print_stats parameter in the .train method). Using this strategy we selected an alpha_l1 of 0.8. For the reconstruction experiment in Extended Data Fig. 5b,c, where the task involved recovering held-out genes, we used an alpha_l1 parameter of 0.4 because expiMap showed the strongest performance with this parameter in a similar experiment performed by the expiMap authors (Extended Data Fig. 7 in 9). However, for most of our gene sets and in contrast to Spectra, expiMap did not recover a meaningful proportion of our input gene sets (Extended Data Fig. 5b,c). This is despite the fact that the Spectra model was complicated by including 16 new factors while we did not add any new factors for expiMap 9.

To learn new factors one has to provide additional parameters, which we did when analyzing the Bassez 10 dataset, except for the reconstruction experiment in Extended Data Fig. 5b,c. We added 16 new factors, the same number as for Spectra (n_ext, set to 16). According to an expiMap author’s instructions, we set the L1 regularization coefficient for these nodes gamma_ext to 0.6, enabled the Hilbert-Schmidt independence criterion regularization (HSIC) (use_hsic=True, hsic_one_vs_all=True), and provided the HSIC coefficient beta=3. Because expiMap removed relevant input gene sets, we had to perform two additional modifications of the steps provided in the tutorial for the soft mode. We increased the number of highly variable genes (4000 instead of 2000) to retain the input lysine metabolism gene set, and decreased the minimum gene set size from 13 to 8 to retain our tumor reactivity gene set. We provided expiMap with the identical 181 gene sets used for Spectra. Because expiMap removes smaller gene sets in preprocessing steps, the final number of gene sets used for the model fit was 142, which resulted in 158 factors including 16 new factors and 3 inactive factors.

Running NMF

Non-negative matrix factorization was run using the scikit-learn package (sklearn.decomposition.NMF) with default parameters, specifically nmf = NMF(n_components=k); nmf.fit(X.astype(float)) where k is the number of factors and X is the processed expression matrix.

Running net-NMFsc

netNMFsc was run with default parameters; however, max_iters was set to 100K as convergence was never achieved at the default tolerance level (1e–2) at the default 20K iterations. Specifically the following operations were used: operator = netNMFsc.netNMFGD(d=k, max_iter=max_iters); operator.N = adj_matrix ; operator.X = X.T ; W = operator.fit_transform() where adj_matrix is the global graph provided to Spectra, X is the processed expression matrix, and k is the number of factors.

Assigning factor labels

Factor labels were assigned using the overlap coefficient of the top 50 marker genes (genes with the highest gene scores) with each gene set. We observed a bimodal distribution of overlap coefficients with one group of factors centered close to 0 and one group of factors centered close to 1 (Extended Data Fig. 2). We therefore chose a threshold of 0.2 to separate high overlap from low overlap factors. For every factor, if the maximum overlap coefficient was >0.2 we assigned the gene set label with the maximal overlap coefficient to that factor; if the maximum overlap coefficient was ≤0.2 we did not assign a label to that factor.
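The labeling rule can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the overlap coefficient is taken in its usual form, the intersection size divided by the size of the smaller set, and the function names, gene symbols and gene set names are hypothetical.

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|); defined as 0 if either set is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def assign_label(marker_genes, gene_sets, threshold=0.2):
    """Return (label, score) for the best-overlapping gene set, or
    (None, score) if the maximum overlap does not exceed the threshold."""
    best_label, best_score = None, 0.0
    for label, genes in gene_sets.items():
        score = overlap_coefficient(marker_genes, genes)
        if score > best_score:
            best_label, best_score = label, score
    if best_score > threshold:
        return best_label, best_score
    return None, best_score


# Toy example with 5 marker genes standing in for the top 50.
markers = ["A", "B", "C", "D", "E"]
gene_sets = {
    "set1": ["A", "B", "C", "X"],
    "set2": ["E", "Y", "Z", "W", "V", "U"],
}
print(assign_label(markers, gene_sets))
print(assign_label(markers, {"set2": gene_sets["set2"]}))
```

The second call shows the boundary case: an overlap of exactly 0.2 does not exceed the threshold, so the factor stays unlabeled.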

Aggregating cell scores at the sample level

To aggregate cell scores at the sample level we calculated either the mean or the mean of the positive cells. The latter was chosen for Spectra and scHPF, which show bimodal cell score distributions with one mode centered around zero. The mean will be skewed towards the more frequent zero mode and is therefore inappropriate to estimate the central tendency of the distribution. Positive cells were defined as cells with a cell score >0.001. This threshold was defined empirically to separate the positive and zero mode by inspecting the distributions of multiple factors. Because expiMap and Slalom can take negative values, we used the mean value for these methods.
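A small sketch contrasts the two aggregation choices on a zero-inflated score vector. The function name and toy scores are hypothetical. Returning 0 when no cell is positive follows the convention stated for patient-level aggregation elsewhere in these Methods; applying it here is an assumption of this sketch.

```python
def mean_of_positive_cells(scores, threshold=0.001):
    """Mean cell score over cells whose score exceeds the threshold.
    Returns 0.0 if no cell is positive (assumed convention)."""
    positive = [s for s in scores if s > threshold]
    return sum(positive) / len(positive) if positive else 0.0


# Hypothetical sample: mostly zero-mode cells plus two positive cells.
scores = [0.0, 0.0, 0.0, 0.0005, 0.2, 0.4]
plain_mean = sum(scores) / len(scores)
positive_mean = mean_of_positive_cells(scores)
print(plain_mean, positive_mean)
```

The plain mean is pulled towards the zero mode, while the mean of the positive cells reflects the level of the program in cells that express it.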

Lung cancer datasets

Caushi dataset

The Caushi et al. 34 study performed paired single cell RNA sequencing (scRNAseq) and single cell T cell receptor sequencing (scTCRseq) of 16 primary non-small cell lung cancer patients (560,916 cells, 16 patients). Moreover, peripheral blood mononuclear cells were pulsed with different peptides (specific for viral or tumor neoantigens) and reactive, expanding T cell clones and their T cell receptor sequences were identified using the Mutation-Associated Neoantigen Functional Expansion of Specific T Cells (MANAFEST) assay 108. The authors thereby identified T cell receptor sequences of mutation-associated neoantigen (MANA), Epstein-Barr, and influenza reactive T cells. They used these TCR sequences to identify tumor (MANA) and virus reactive T cells in the lung cancer tissue.

Preprocessed data was obtained from the original study’s authors 34. The processed data can also be obtained from the Gene Expression Omnibus under accession number GSE173351. Because cell type annotations were not available, we obtained the original study authors’ cluster labels (details on preprocessing and clustering in 34). We reannotated the original authors’ 15 clusters using marker genes (Supplementary Table 2).

To preprocess the data for Spectra, we normalized raw counts to median library size and applied log1p-transformation. We restricted the number of genes using scanpy’s highly_variable_genes function with the cell_ranger method to the 3,000 most highly variable genes. We retrieved a total of 168 input gene sets, global (n=152) and T cell subtype-specific (CD4-T, CD8-T, Treg, n=12), from the newest version of our Cytopus knowledge base (10.5281/zenodo.7306238). We took the union of the highly variable genes and the genes included in these input gene sets for a total of 6838 genes 34 used for fitting the Spectra model using the following parameters: λ=0.1, δ=0.001, ρ=0.001. We ran Spectra on these data and obtained 173 factors, one of which matched the CD8 T cell tumor reactivity gene set according to the criteria above in “Assigning factor labels”. Spectra was run on a compute cluster with 64 CPU cores (Intel Xeon Gold 6230 CPU at 2.10 GHz) with 512 GB RAM. We plotted and compared the tumor reactivity factor cell scores in 1,151 CD8 T cells with available TCR specificity information grouped by their target antigen (Epstein-Barr virus, influenza virus, mutation-associated neoantigen). We found that this tumor reactivity factor was almost exclusively expressed in mutation-associated neoantigen specific T cells.

Salcher atlas

The Salcher non-small cell lung cancer atlas combined single cell RNA sequencing (scRNAseq) data of whole tumor single cell suspensions or tumor infiltrating leukocytes from 19 independent studies (1,283,972 cells from 318 patients). The authors also homogenized cell type annotations and metadata between datasets.

Preprocessed data including unnormalized gene expression counts were obtained from zenodo (https://doi.org/10.5281/zenodo.6411867) for the Salcher et al. lung cancer scRNAseq atlas 76. The authors’ cell type annotations were summarized after vetting them for relevant marker expression profiles (Supplementary Table 2).

To run Spectra, we restricted the number of genes using scanpy’s highly_variable_genes function with the cell_ranger method, selecting the 3,000 most highly variable genes with the batch_key option, which calculates highly variable genes in each dataset in the atlas separately and then merges them based on how many datasets they are captured in. We removed several genes which are highly abundant and may originate from ambient RNA in many cell types, thus adding noise to the analysis. This included mitochondrial, ribosomal, immunoglobulin (genes starting with IGHM, IGLC, IGHG, IGHA, IGHV, IGLV, IGKV), T cell receptor variable domains (genes starting with TRBV, TRAV, TRGV, TRDV), and hemoglobin genes (genes starting with HB). We retrieved a total of 198 input gene sets from the newest version of our Cytopus knowledge base (10.5281/zenodo.7306238). The total number of genes used (the union of genes included in a gene set and highly variable genes) was 7322. We normalized and log1p transformed the gene expression counts and ran Spectra with the following parameters: λ=0.01, δ=0.001, ρ=0.001 and obtained one factor each for CD8 T cell tumor reactivity and lysine metabolism. We also obtained one factor which shared 20 of the top 50 marker genes with the Macrophage invasion factor from the Bassez 10 dataset (overlap coefficient =0.4). We then calculated the overlap of these factors with the factors obtained from the Bassez 10 and Zhang 29 datasets (Figure 6c). Spectra was run on a compute cluster with 128 CPU cores (Intel Xeon Gold 6230 CPU at 2.10 GHz) with 1,024 GB RAM.

To calculate embeddings for plotting UMAPs, we calculated the neighborhood graph (k=10) on the first 50 principal components using the top 3000 highly variable genes which explained 45.92% of the total variance and calculated a UMAP embedding on this neighborhood graph using the scanpy implementation (Figure 6a).

We obtained study-specific factors by their cross-study entropy. We first removed spuriously expressed factors with <=100 positive (cell score >0.001) cells. For each remaining factor, we then calculated the entropy of study label proportions in factor positive cells. We selected the factors which showed a cross-study entropy higher than 2.0794 which is the entropy for a hypothetical factor where positive cells are absent in 11 of the 19 studies analyzed and where they are equally distributed among the remaining 8 studies. This resulted in 11 global (Figure 6b) and 3 cell type specific factors.
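The entropy threshold can be checked with a short calculation. The value 2.0794 corresponds to the natural-log Shannon entropy of a uniform distribution over 8 studies, ln(8). The sketch below assumes natural logarithms for that reason; the function name and toy labels are hypothetical.

```python
import math
from collections import Counter


def cross_study_entropy(study_labels):
    """Shannon entropy (natural log) of the study label proportions among
    the factor-positive cells."""
    counts = Counter(study_labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical factor: positive cells spread equally over 8 of 19 studies.
labels_uniform_8 = [f"study_{i}" for i in range(8) for _ in range(50)]
threshold = cross_study_entropy(labels_uniform_8)
print(round(threshold, 4))

# A factor whose positive cells all come from one study has entropy 0.
single_study = cross_study_entropy(["study_0"] * 200)
print(single_study)
```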

To assess the stability of our lysine factor across studies we plot its z-scored (across cell types) mean expression per cell type (Figure 6d). We also calculated its mean expression per patient sample and cell type and compared the expression in plasma cells with other cell types using Wilcoxon matched pairs signed-rank tests (Figure 6d). To compare cell scores of the CD8 T cell tumor reactivity and macrophage invasion factors in clinically relevant patient subgroups, we calculated their mean cell scores in positive (cell score >0.001) CD8 T cells and macrophages, respectively. We then compared these aggregated cell scores in ever smokers vs never smokers and EGFR wild type vs mutated tumors. We excluded patients with other driver mutations from EGFR wild type tumors because these tumors have different clinical and biological behaviour. Correction for covariates was not performed because many covariates (sex, age) were only available for a small fraction of patients with available smoking and EGFR status.

Classifying new and modified factors

We classified all factors as new, modified or unspecified based on their input gene-gene knowledge graph dependency parameter η. The dependency parameter is a scalar value between zero and one that quantifies a factor’s reliance on the input gene set graph. We observed a bimodal distribution of η with one group of factors centered close to 0 and another group of factors centered close to 1 (Extended Data Fig. 2a). We therefore chose a threshold of 0.25 to separate high dependence from low dependence factors. We defined new factors as factors with a graph dependency parameter η<0.25 and modified factors as factors with a graph dependency parameter η≥0.25.

Analyzing breast cancer infiltrating leukocytes

We compared Spectra, Slalom and scHPF’s capacity to retrieve features of tumor infiltrating immune cells. We ran the three algorithms on all leukocytes in the Bassez dataset as described above using a λ parameter of 0.01 for Spectra. We also ran Spectra on the Zhang dataset using a λ parameter of 0.01. For cells with high library size, such as macrophages (Bassez dataset median library size = 8038), we calculated gene scores for Spectra factors using an offset of 1 which retrieved more stably expressed genes (e.g. mean scran normalized expression of top 50 marker genes of the macrophage factor 182:1.15 with offset vs 0.41 with no offset). For remaining analyses with lower library size, such as T-cells (Bassez dataset median library size = 3127) or B cells (Bassez dataset median library size = 3954), we calculated gene scores for Spectra using an offset of 0 which allowed for more sensitive retrieval of lowly-expressed genes such as transcription factors involved in tumor reactivity and exhaustion (e.g. EOMES, TOX), as well as metabolic processes (e.g. PIPOX, BBOX1).

Visualizing scRNAseq data

To visualize individual genes in embeddings and account for sparsity in scRNA-seq data, we imputed gene expression using scanpy’s implementation of MAGIC 109 with a t parameter of 3 and the exact solver (Fig. 2d, Extended Data Fig. 3c, 6b, 7b, 8b). For visualizing all leukocytes, we calculated t-SNE embeddings on 57 PCs explaining 25.0% (Bassez dataset) or 55 PCs explaining 20% of variance (Zhang dataset) with standard parameters including a learning rate of 1000 using the scanpy implementation (Fig. 2b, 4d; Extended Data Fig. 3a, 4a, 7e, 8a).

Aggregating factor cell scores at the patient level

Spectra factors are sparse and bimodally distributed, with one mode centered around zero and one more positive mode. To aggregate factor cell scores at the sample level we calculated the average cell score of the positive fraction (Fig. 3f, 6e,f, Extended Data Fig. 4b, 6g). The positive fraction was defined as cells with cell score >0.01, selected empirically as the threshold which best separated the two cell score modes. If all cells showed a cell score ≤0.01 for a given gene program, the mean of the positive fraction was set to 0 for that gene program.

Gene set enrichment analysis

To find the most representative factors for a gene set in Spectra, Slalom and scHPF, we performed gene set enrichment analysis for the exhaustion and tumor reactivity input gene sets in the top 50 marker genes (genes with highest gene scores) of every factor using gseapy’s enrichr function 110 (Extended Data Fig. 6e, 7c). The enrichr function calculates enrichment using a hypergeometric test to calculate the probability of drawing the observed number of genes belonging to a gene set of interest when sampling from a pool of all genes without replacement (here: the union of the 3000 most highly variable genes plus the genes contained in the gene sets, see ‘Running Spectra’). We calculated enrichment of gene sets in the top 50 marker genes (genes with highest gene scores) for each factor:

$$p(k, M, n, N) = \frac{\binom{n}{k}\binom{N-n}{M-k}}{\binom{N}{M}}$$

where n is the number of factor marker genes (here: 50), k is the number of genes from the gene set in the top 50 marker genes in the factor, N is the total number of genes contained in the data, M is the number of genes from the gene set contained in the data. From this, we calculated an FDR using the Benjamini-Hochberg correction. We assumed the factors with the lowest FDR for enrichment were representative for the respective gene sets if FDR <0.05.
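The enrichment calculation can be sketched with the standard library alone. The first function implements the hypergeometric point probability with the symbols defined above (k, M, n, N). The upper-tail sum and the Benjamini-Hochberg step are written in their textbook forms and are an assumption about how the probability is turned into a test and an FDR; they are not taken from gseapy's source. All function names and the toy values are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, M, n, N):
    """Probability of exactly k gene-set genes among the n marker genes,
    with M gene-set genes among N genes in total."""
    return comb(n, k) * comb(N - n, M - k) / comb(N, M)


def hypergeom_upper_tail(k, M, n, N):
    """Probability of observing k or more overlapping genes (assumed
    one-sided enrichment test)."""
    return sum(hypergeom_pmf(i, M, n, N) for i in range(k, min(n, M) + 1))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy universe: N=10 genes, gene set of M=4, n=3 marker genes, k=1 overlap.
print(hypergeom_pmf(1, 4, 3, 10))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```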

CD8 T cell analysis

We took the subset of CD8 T cells (11 clusters) from the Bassez dataset to explore CD8 T cell tumor reactivity (tumor reactivity) and CD8 T cell exhaustion (exhaustion). The most representative factors for the tumor reactivity and exhaustion gene sets were retrieved using the gene set enrichment procedure described above (‘Gene set enrichment analysis’) for each factor analysis method (Extended Data Fig. 6e). Spectra factors were also compared to expression scores for the tumor reactivity and exhaustion gene sets using scanpy’s score_genes function (Fig. 3b, Extended Data Fig. 6b). To find genes driving score_genes expression we calculated the covariance of all genes within the tumor reactivity or exhaustion gene sets with the tumor reactivity and exhaustion gene scores (Extended Data Fig. 6b). To visualize force-directed layouts we used scanpy’s tl.draw_graph function and the ForceAtlas2 method on a nearest neighbors graph computed on CD8 T cells using scanpy’s pp.neighbors function with n=10 nearest neighbors (Fig. 3a, 3b, 3c, Extended Data Fig. 3b,c, 6b). Contour plots were created using seaborn’s jointplot kernel density estimation with standard parameters (Fig. 3b, 3c, Extended Data Fig. 6f).

Metabolism analysis

We assessed the expression pattern of metabolic factors across cell types and found highly specific expression of the lysine metabolism program in plasma cells. In the Bassez dataset we noticed a small (n=114 cells, 0.1% of all cells, 3% of all plasma/B cells) group of heterotypic doublets expressing plasma cell and T cell markers (CD3E, CD3D, CD3G, IGHG4, IGHG1) which was not apparent in previous analyses and not detected by DoubletDetection. We removed these cells from further analyses involving plasma cells (Fig. 4). We also inspected the mean expression per cell type of the MAGIC 109 imputed (scanpy implementation, t=3, exact solver) top 50 individual marker genes of the lysine factor (genes with highest gene scores) (Extended Data Fig. 7b).

Macrophage analysis

To analyze differentiation gradients in macrophages and to capture all possible maturation stages, we retained the subset of 18 clusters (12,132 cells) annotated as mature macrophages (12 clusters) or more immature myeloid derived (suppressor) cells/monocytic cells (6 clusters) in the Bassez dataset for further analysis (Supplementary Table 2). We embedded the data using diffusion components (DCs) which preserve differentiation trajectories better than many common linear and non-linear dimensionality reduction techniques 57. Using a classification strategy, we selected the DCs which best captured the differentiation from more monocytic states to macrophages while separating patients with (responders) and without clonal T cell expansion (non-responders, Fig. 5a, 5b, Extended Data Fig. 8a, b,d). For every pair of the first 20 DCs we performed 1) a linear regression with the DCs as the independent variables and the scran-normalized expression of the monocyte marker S100A8 as the dependent variable and 2) a logistic regression with the DCs as the independent variables and response status as the dependent variable. We chose the DC pair with the highest sum of the coefficient of determination R² (linear regression) and the mean accuracy (logistic regression). We calculated Spectra cell score trends over the DCs by fitting a generalized additive model as implemented in Palantir’s _gam_fit_predict and calculate_gene_trends functions (Setty et al., 2019), using cell scores instead of gene expression and DCs instead of pseudotime (Fig. 5a):

$$y_{i,j,k} = \beta_0 + f(DC_{i,k})$$

where $y_{i,j,k}$ is the cell score of cell i, factor j and the kth diffusion component and $DC_{i,k}$ the kth diffusion component for cell i. We then visualized cell score trends using the plot_gene_trend_heatmap function from Palantir.

To calculate compositional changes under immune checkpoint therapy we used the Milo package 61. Milo is analogous to differential gene expression analysis, but instead of identifying genes that differ between two groups of cells, it tests for differential cell density in (possibly overlapping) neighborhoods of a k-nearest neighbors (KNN) cell-cell similarity graph, across different conditions. We chose the default fraction of 0.1 to be sampled as index cells from the KNN graph, such that representative cellular neighborhoods were only constructed for those index cells. Milo counts the number of cells per sample in each neighborhood and uses a generalized linear model with a negative binomial distribution to test for differences in abundance. Milo also accounts for multiple comparison testing by computing a spatial false discovery rate (FDR).

For the Bassez dataset we constructed a KNN graph on macrophages and monocytic cells. The Milo paper gives the following heuristic to estimate an optimal k parameter 61:

$$k \geq S \times a$$

where S is the number of samples (here: 79) and a is an arbitrary scaling parameter. Following the authors’ suggestion of 3 ≤ a ≤ 5 resulted in an overly large k parameter (237 ≤ k ≤ 395). We therefore chose k to be smaller than the smallest population of cells identified by clustering (58 cells) but close to the k parameter obtained by the heuristic above, resulting in k=50 to construct the KNN graph and to identify the nearest 50 neighbors of the index cells.

We then assessed the fold change of cell states under PD-1 blockade using the following regression formula:

$$y_{ns} \sim \text{response} + \text{timepoint} + \text{timepoint} * \text{response}$$

where $y_{ns}$ is the number of cells from sample s in neighborhood n and response status is defined as 0 for non-responders and 1 for responders. We defined the timepoint as 0 for pre-therapy and 1 for on-therapy. timepoint * response indicates the interaction between the timepoint and response variables. We then identified the neighborhoods specifically enriched for non-responders under therapy by taking the subset of neighborhoods based on the estimated regression coefficients. First we identified the neighborhoods specifically enriched under therapy by retaining the subset of all neighborhoods with an FDR <0.05 and a coefficient (log fold change) >0 for the timepoint parameter for further analysis. From these, we took the subset of neighborhoods enriched for non-responders as compared to responders under therapy by selecting neighborhoods with an interaction FDR <0.05 and an interaction coefficient (log fold change) <0 for further analysis. We then compared the mean factor cell scores for these neighborhoods with all remaining neighborhoods.
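The two-step neighborhood selection can be written as a simple filter over per-neighborhood test results. The sketch assumes the results have already been exported from the differential abundance test into plain dictionaries; the field names (`timepoint_fdr`, `timepoint_logfc`, `interaction_fdr`, `interaction_logfc`) are hypothetical and do not correspond to Milo's output columns.

```python
def select_nonresponder_enriched(neighborhoods, fdr_cutoff=0.05):
    """Return ids of neighborhoods enriched under therapy (timepoint term)
    and enriched for non-responders relative to responders under therapy
    (negative interaction term)."""
    selected = []
    for nh in neighborhoods:
        enriched_on_therapy = (
            nh["timepoint_fdr"] < fdr_cutoff and nh["timepoint_logfc"] > 0
        )
        nonresponder_biased = (
            nh["interaction_fdr"] < fdr_cutoff and nh["interaction_logfc"] < 0
        )
        if enriched_on_therapy and nonresponder_biased:
            selected.append(nh["id"])
    return selected


# Hypothetical test results for three neighborhoods.
results = [
    {"id": 0, "timepoint_fdr": 0.01, "timepoint_logfc": 1.2,
     "interaction_fdr": 0.02, "interaction_logfc": -0.8},
    {"id": 1, "timepoint_fdr": 0.01, "timepoint_logfc": 0.9,
     "interaction_fdr": 0.30, "interaction_logfc": -0.5},
    {"id": 2, "timepoint_fdr": 0.20, "timepoint_logfc": 1.5,
     "interaction_fdr": 0.01, "interaction_logfc": -1.1},
]
selected_ids = select_nonresponder_enriched(results)
print(selected_ids)
```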

The Zhang dataset contained fewer immunotherapy-treated patients than the Bassez dataset (n=8 vs n=42) and therefore did not allow for testing as many covariates. We thus chose a slightly different strategy to find macrophage neighborhoods enriched for non-responders under therapy. As for the Bassez dataset, we took a subset of 16 clusters (11,466 cells) annotated as mature macrophages (12 clusters, 9,385 cells) or more immature myeloid derived (suppressor) cells/monocytic cells (4 clusters, 2,081 cells) (Supplementary Table 2), and then selected samples from patients classified as non-responders treated with anti-PD-L1 (see 6.2.2) for a total of 4,318 cells and 5 samples. Analogously to the k parameter selection strategy above, we constructed a KNN graph using a k parameter of 20, which was smaller than the smallest cell population detected by clustering (22 cells). We then defined Milo neighborhoods as the 30 nearest neighbors of the index cells. We fitted the Milo model using the following regression formula:

$$y_{ns} \sim \text{timepoint}$$

where y is the number of cells from sample s in neighborhood n and timepoint is defined as 0 for pre-therapy and 1 for on-therapy. We retained the neighborhoods with FDR<0.2 and a coefficient (log fold change) >0 for the timepoint parameter for further analysis. As for the Bassez dataset, we then compared the factor cell scores for this group with those of all remaining neighborhoods.

Statistical analysis and visualization

P values were calculated as indicated above using the Milo, scipy and statsmodels Python packages. No normality assumption was made. We used a Mann-Whitney U test for independent samples and a Wilcoxon matched-pairs signed rank test for paired samples. Unless indicated otherwise, all p values are two-sided and a multiple-comparisons corrected (Benjamini-Hochberg method) p value of 0.05 was considered statistically significant. Cohen’s d was calculated according to the following formula:
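As an illustration of how these tests can be run with the named packages, the following sketch applies them to simulated toy data. The data, variable names and random seed are assumptions made for the example; it is not the analysis code used for the figures.

```python
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Toy data: two independent groups and one paired pre/on-therapy comparison.
group_a = rng.normal(0.0, 1.0, size=20)
group_b = rng.normal(1.0, 1.0, size=20)
pre = rng.normal(0.0, 1.0, size=15)
on = pre + rng.normal(0.5, 0.5, size=15)

# Two-sided Mann-Whitney U test for independent samples.
p_independent = mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue

# Two-sided Wilcoxon matched-pairs signed rank test for paired samples.
p_paired = wilcoxon(pre, on, alternative="two-sided").pvalue

# Benjamini-Hochberg correction across the tests, at a 0.05 threshold.
raw_p = np.array([p_independent, p_paired])
reject, p_adjusted, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
print(p_adjusted, reject)
```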

$$d = \left| \frac{\mathrm{mean}(a) - \mathrm{mean}(b)}{s} \right|$$

where s is the pooled standard deviation, and mean(a) and mean(b) are the means of groups a and b, respectively.
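A minimal sketch of this calculation is given below. It assumes the conventional definition of the pooled standard deviation (sample variances weighted by their degrees of freedom), which is not spelled out above; the function name `cohens_d` and the toy values are illustrative.

```python
import math
import statistics


def cohens_d(a, b):
    """Absolute difference in group means divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    # Assumed pooled SD: variances weighted by degrees of freedom.
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return abs((statistics.mean(a) - statistics.mean(b)) / pooled_sd)


# Toy example: means of 3 and 4, each group with sample variance 2.5.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
print(round(cohens_d(group_a, group_b), 4))  # 0.6325
```

Because of the absolute value, the result does not depend on the order in which the two groups are passed.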

Data were visualized using the matplotlib and seaborn Python packages and edited in Adobe Illustrator Creative Cloud (v27.0).

Extended Data

Extended Data Fig. 1 — Cell type annotations in breast cancer datasets. [Related to Fig. 2.].

Marker gene expression for 14 and 17 broad cell-type annotations in the Bassez10 (n=97,863 cells) and Zhang29 (n=150,985 cells) datasets, respectively. Each gene is normalized to its maximal expression, and the percentage of cells that express the gene within a cluster is indicated (% positive cells). ILC3, innate lymphoid cell type 3; T, T cell; gdT, gamma-delta T cell; pDC, plasmacytoid dendritic cell; NK, natural killer cell; B, B cell; mac, macrophage; mast, mast cell; mem, memory; mono, monocyte; prol, proliferating; Treg, regulatory T cell; DC, dendritic cell; GC-B, germinal center B cell; plasma, plasma cell.

Extended Data Fig. 2 — Dependence of Spectra factors on input gene sets. [Related to Fig. 2.].

a, Maximum overlap coefficient for each factor (n=197), denoting overlap with input gene sets, plotted against Spectra’s graph dependency parameter (eta, η). Factors are colored by cell type. Most factors are highly dependent on the gene-gene graph and exhibit correspondingly high overlap coefficients. b, Maximum overlap coefficient of each factor (n=197) with input gene sets and a high (η ≥ 0.25) or low (η < 0.25) dependence parameter. Boxes and lines represent interquartile range (IQR) and median, respectively; whiskers represent ±1.5xIQR. ILC3, innate lymphoid cell type 3; T, T cell; gdT, gamma-delta T cell; pDC, plasmacytoid dendritic cell; NK, natural killer cell; B, B cell; mast, mast cell; Treg, regulatory T cell; DC, dendritic cell; GC B, germinal center B cell; plasma, plasma cell.

Extended Data Fig. 3 — Spectra detects cell-type-specific gene programs. [Related to Fig. 2.].

a, t-SNE projection colored by cell scores for the Spectra, expiMap or Slalom factors best representing CD8+ T-cell tumor reactivity and CD8+ TCR signaling (n=97,863 cells). Black contours highlight aberrant expression in populations other than T cells. b, Force-directed layout (FDL) of CD8+ T cells (n=31,925) colored by the cell scores of indicated factors. c, FDL of CD8+ T cells (n=31,925) colored by imputed expression (MAGIC t=3) of relevant marker genes with the rank according to their gene scores.

Extended Data Fig. 4 — Spectra discerns the effects of immune checkpoint therapy on interferon signaling. [Related to Fig. 2.].

a, t-SNE projections of tumor-infiltrating leukocytes (n=97,863) from the Bassez dataset10, colored by cell scores for Spectra or scanpy.score_genes interferon gamma (IFNg) response, or by expression of selected human leukocyte antigen (HLA) class II genes. HLA expression is scran-normalized and not imputed. b, Mean cell score of the positive fraction (cell score > 0.01) per sample and cell type before (blue, n=40) and after (red, n=40) anti-PD-1 immune checkpoint blockade in breast tumor infiltrating leukocyte data from the Bassez dataset10. Two-sided p values were calculated using Wilcoxon matched-pairs signed rank tests. Test statistics, left panel: 282 (Treg), 184 (memory B), 75 (mast), 82 (pDC), 237 (CD4 T), 222 (DC), 129 (naive B), 289 (NK), 293 (CD8 T), 76 (gdT), 59 (GC-B), 339 (macrophage), 24 (ILC3), 220 (plasma). Test statistics, right panel: 340 (Treg), 241 (memory B), 168 (mast), 131 (pDC), 362 (CD4 T), 305 (DC), 121 (naive B), 349 (NK), 427 (CD8 T), 71 (gdT), 104 (GC-B), 356 (macrophage), 125 (ILC3), 262 (plasma). Cohen’s d, left panel: 0.049 (NK), 0.320 (CD4 T), 0.228 (GC B), 0.105 (CD8 T), 0.075 (naïve B), 0.064 (Treg), 0.285 (DC), 0.142 (mast), 0.093 (Mac), 0.131 (plasma), 0.171 (ILC3), 0.341 (pDC), 0.067 (gdT), 0.197 (memory B). Cohen’s d, right panel: 0.168 (NK), 0.142 (CD4 T), 0.039 (GC-B), 0.040 (CD8 T), 0.187 (naïve B), 0.008 (Treg), 0.166 (DC), 0.003 (mast), 0.064 (mac), 0.139 (plasma), 0.008 (ILC3), 0.202 (pDC), 0.170 (gdT), 0.146 (memory B). Boxes and lines represent interquartile range (IQR) and median, respectively; whiskers represent ±1.5 x IQR. ILC3, innate lymphoid cell type 3; T, T cell; gdT, gamma-delta T cell; pDC, plasmacytoid dendritic cell; Mac, macrophage; NK, natural killer cell; B, B cell; mast, mast cell; Treg, regulatory T cell; DC, dendritic cell; GC-B, germinal center B cell; plasma, plasma cell.

Extended Data Fig. 5 — Spectra recovers genes involved in diverse cellular processes. [Related to Fig. 2.].

a, Top 50 marker genes for three cellular process factors identified by Spectra in the Bassez dataset10, after holding out a randomly selected 40% subset from each corresponding input gene set. Marker genes are ranked by Spectra factor score. New genes (absent from the original input gene set) and held-out genes recovered by Spectra are highlighted in red and blue, respectively. b, Proportion of held-out genes (40% of original gene set) recovered by Spectra or expiMap from the Bassez dataset10, for each input gene set tested. Lines connect identical input gene sets. c, Reconstruction performance on individual gene sets (n=23). d, Spectra retrieves highly correlated simulated gene programs more accurately than other methods. Synthetic data was generated by sampling random ground truth loadings and factors from log-normal and half-Cauchy distributions, respectively, and introducing correlations in two factors via off-diagonal entries in the log-normal covariance matrix. After multiplying loadings and factor matrices and introducing noise, models were fit to the data and output factors were correlated with ground truth factors. e, Correlation between ground truth and inferred loadings (cell scores) for score genes and Spectra, as a function of gene set overlap in simulated data. Data consisted of overlapping synthetic gene sets and a random cell loading vector representing the expression of each gene set in a cell (Methods). The sum of gene sets weighted by the individual cell loadings for each cell was used to represent a mean for sampling Poisson gene expression counts. f, Slalom and Spectra are robust to highly overlapping gene sets, but Slalom suffers worse performance when the number of active gene sets is large. Gene expression data was simulated from a factor analysis model in which only a subset of gene sets are active in the data, similar to d and the original Slalom publication5. AUC, area under the ROC curve. 
Intervals and lines represent 95% CI and mean, respectively, across n=10 (d), n=3 (e) and n=5 (f) independent simulations. g, Memory dependence on cell (left panel), gene set (middle panel), and cell type number (right panel). h, Runtime dependence on cell type number. Each experiment (g,h) was repeated n=3 times; shading indicates 95% CI.

Extended Data Fig. 6 — Existing methods fail to separate highly correlated features in CD8+ T cell data. [Related to Fig. 3.].

Analysis of breast cancer infiltrating leukocytes from the Bassez scRNA-seq dataset10 (n = 42 patients). a, Spectra information and importance scores (Methods) in CD8+ T cells, colored by η parameters. b, CXCL13 expression is most highly correlated with both tumor reactivity and exhaustion and maps to similar cells as both gene set scores in FDL of CD8+ T cells (n=31,925). Cov., covariance of each gene-set with CXCL13. c, Overlap between the top 50 marker genes of CD8+ T cell exhaustion and tumor reactivity factors. d, Overlap between the top 50 marker genes of the CD8+ T cell tumor reactivity factors in the Bassez10 and Caushi34 datasets. e, Significance (−log10(FDR)) of CD8+ T cell exhaustion or tumor reactivity factors by gene set enrichment analysis (Spectra, n=159; Slalom, n=20; scHPF, n=100 factors). f, Contour plots indicating density of Spectra, Slalom or scHPF loading scores for CD8+ T cell exhaustion and tumor reactivity factors grouped by clonal T cell expansion status (n=31,925 cells). g, Per-sample mean cell scores for the tumor reactivity factor. Boxes and lines represent interquartile range (IQR) and median, respectively; whiskers represent ±1.5xIQR. P values (two-sided) were calculated using Mann-Whitney U tests (n=40 pre anti-PD-1, n=40 on anti-PD-1 samples): Spectra pre anti-PD-1: p=3.835×10⁻⁵, statistic: 308, Cohen’s d: 1.510; Spectra on anti-PD-1: p=2.001×10⁻⁵, statistic: 313, Cohen’s d: 1.491; expiMap pre anti-PD-1: p=0.98823, statistic: 167, Cohen’s d: 0.194; expiMap on anti-PD-1: p=0.80191, statistic: 159, Cohen’s d: 0.150; Slalom pre anti-PD-1: p=0.57, statistic: 198, Cohen’s d: 0.249; Slalom on anti-PD-1: p=0.94, statistic: 155, Cohen’s d: 0.059; scHPF pre anti-PD-1: p=0.34, statistic: 216, Cohen’s d: 0.337; scHPF on anti-PD-1: p=0.16, statistic: 201, Cohen’s d: 0.592.

Extended Data Fig. 7 — Spectra finds more specific and biologically coherent lysine metabolism factors. [Related to Fig. 4.].

a, Spectra importance and information scores for factors in plasma cells, colored by η parameter. b, z-scored average MAGIC imputed (t=3) cellular expression (per cell type) of lysine factor genes (n=97,863 leukocytes). c, Significance (−log10(FDR)) and fold enrichment (odds ratio) of the lysine metabolism input gene set, among the 50 genes with highest gene scores, as calculated by gene set enrichment analysis (Spectra: n=152; Slalom: n=20; scHPF: n=100 factors). d, Functional categories of the top 50 marker genes of lysine metabolism factors identified by different factorization methods. ER, endoplasmic reticulum; metab, metabolism; prot synth, protein synthesis; reg, regulation; TF, transcription factor. e, t-SNE embeddings colored by cell scores for top-performing lysine metabolism factors (labeled in c) from different factorization methods (n=97,863 leukocytes). Plasma, macrophage (Mac) and dendritic cell (DC) populations are outlined. ILC3, innate lymphoid cell type 3; T, T cell; gdT, γδ T cell; pDC, plasmacytoid dendritic cell; mac, macrophage; mono, monocyte; NK, natural killer cell; B, B cell; mast, mast cell; Treg, regulatory T cell; DC, dendritic cell; GC B, germinal center B cell; plasma, plasma cell.

Extended Data Fig. 8 — Tumor-infiltrating macrophage cell states exist along continua that change under therapy. [Related to Fig. 5.].

a, t-SNE embedding of all leukocytes (left) and distribution of macrophages/monocytes (n=12,132) along diffusion components 2 and 4 (DC2 and DC4, right), highlighting cells of C3-positive macrophage cluster C7 from the Bassez dataset10. b, Macrophages along DC2 and DC4, colored by MAGIC-imputed (t=3) complement gene expression. c, Mean MAGIC-imputed (t=3) complement gene expression per sample in macrophages from responsive (E) or non-responsive (NE) patients sampled before (pre, n=40) or during (on, n=40) anti-PD-1 therapy. Boxes and lines represent interquartile range (IQR) and median, respectively; whiskers represent ±1.5xIQR. d, Scran-normalized expression of macrophage and monocyte marker genes in cells sorted along DC2, showing diverging expression along this gradient. e, Overlap coefficient and graph dependency for Spectra factors (n=197). f, scHPF (n=100) and Slalom (n=20) factors do not resemble the Spectra invasion factor (factor 182). Each factor is plotted by fold change of its cell score in macrophage neighborhoods enriched in non-responders under therapy, compared to its cell score in all remaining macrophages, and by the coefficient of overlap between the top 50 marker genes of each factor and of the Spectra invasion factor (analogous to Fig. 5c,b). g, Spectra importance and information scores for factors in macrophages, colored by η parameter.

Supplementary Material

Supplementary Note
Supplementary Table 1
Supplementary Table 2
Supplementary Table 3
Supplementary Table 4
Supplementary Table 5

Table 2:

Clinical variables of the utilized scRNAseq datasets

| Dataset | operable | cohort 1 (n) | cohort 2 (n) | cortison |
|---|---|---|---|---|
| Bassez | Yes | anti-PD-1 (n=31) | CTX, anti-PD-1 (n=11) | |
| Zhang | No | anti-PD-L1 + CTX (n=8) | CTX (n=7) | + |

Acknowledgements

We thank Andrew E. Cornish and Samuel A. Rose for critically reading this manuscript, and Diether Lambrechts and Ayse Bassez for providing access to their single-cell RNA-sequencing data. This work was funded by NCI Human Tumor Atlas Network grant U2C CA233284 (D.P.), U54 CA209975 (D.P.) and Cancer Center Support Grant P30 CA08748 as well as the Functional Genomics Initiative, Center for Epigenetics Research and Alan and Sandra Gerry Metastasis and Tumor Ecosystems Center at Memorial Sloan Kettering (D.P.). R.K. was supported by NSF Graduate Research Fellowship 2020297401. T.W. has been funded by a fellowship of the DKFZ Clinician Scientist Program, supported by the Dieter Morszeck Foundation.

Declaration of interests

D.P. reports equity interests and provision of services for Insitro, Inc. T.W. reports stock ownership for Roche, Bayer, Innate Pharma, Illumina and 10x Genomics as well as research funding (not related to this study) from CanVirex AG, Basel Switzerland and Institut für Klinische Krebsforschung GmbH, Frankfurt, Germany. R.K. and T.N. have no competing interests.

Data availability

Spectra is available as an open-source Python package at https://github.com/dpeerlab/spectra and the immune knowledge base at https://github.com/wallet-maker/cytopus (ref. 86). Notebooks to reproduce figures are available at https://github.com/dpeerlab/SpectraReproducibility.

References

  • 1. Satija R, Farrell JA, Gennert D, Schier AF, and Regev A. Spatial reconstruction of single-cell gene expression data. Nature biotechnology 2015;33:495–502.
  • 2. Wolf FA, Angerer P, and Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome biology 2018;19:1–5.
  • 3. Bielecki P et al. Skin-resident innate lymphoid cells converge on a pathogenic effector state. Nature 2021;592:128–32.
  • 4. Levitin HM, Yuan J, Cheng YL, et al. De novo gene signature identification from single-cell RNA-seq with hierarchical Poisson factorization. Molecular systems biology 2019;15:e8557.
  • 5. Pelka K et al. Spatially organized multicellular immune hubs in human colorectal cancer. Cell 2021.
  • 6. Buettner F, Pratanwanich N, McCarthy DJ, Marioni JC, and Stegle O. f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. Genome biology 2017;18:1–13.
  • 7. Elyanow R, Dumitrascu B, Engelhardt BE, and Raphael BJ. netNMF-sc: leveraging gene–gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome research 2020;30:195–204.
  • 8. Kartha VK, Duarte FM, Hu Y, et al. Functional inference of gene regulation using single-cell multi-omics. Cell genomics 2022;2:100166.
  • 9. Lotfollahi M, Rybakov S, Hrovatin K, et al. Biologically informed deep learning to query gene programs in single-cell atlases. Nature Cell Biology 2023;25:337–50.
  • 10. Bassez A, Vos H, Van Dyck L, et al. A single-cell map of intratumoral changes during anti-PD1 treatment of patients with breast cancer. Nature Medicine 2021;27:820–32.
  • 11. Shimizu K et al. PD-1 imposes qualitative control of cellular transcriptomes in response to T cell activation. Molecular Cell 2020;77:937–950.e6.
  • 12. Gallagher MP et al. Hierarchy of signaling thresholds downstream of the T cell receptor and the Tec kinase ITK. Proceedings of the National Academy of Sciences 2021;118:e2025825118.
  • 13. Delconte RB, Kolesnik TB, Dagley LF, et al. CIS is a potent checkpoint in NK cell-mediated tumor immunity. Nature immunology 2016;17:816–24.
  • 14. Miah MA, Yoon CH, Kim J, Jang J, Seong YR, and Bae YS. CISH is induced during DC development and regulates DC-mediated CTL activation. European journal of immunology 2012;42:58–68.
  • 15. Grasso CS, Tsoi J, Onyshchenko M, et al. Conserved interferon-γ signaling drives clinical response to immune checkpoint blockade therapy in melanoma. Cancer cell 2020;38:500–15.
  • 16. Goswami S, Walle T, Cornish AE, et al. Immune profiling of human tumors identifies CD73 as a combinatorial target in glioblastoma. Nature medicine 2020;26:39–46.
  • 17. Jorgovanovic D, Song M, Wang L, and Zhang Y. Roles of IFN-γ in tumor progression and regression: a review. Biomarker research 2020;8:1–16.
  • 18. Alawi F and Lee MN. DKC1 is a direct and conserved transcriptional target of c-MYC. Biochemical and biophysical research communications 2007;362:893–8.
  • 19. Marinkovic D, Marinkovic T, Kokai E, Barth T, Möller P, and Wirth T. Identification of novel Myc target genes with a potential role in lymphomagenesis. Nucleic acids research 2004;32:5368–78.
  • 20. Van der Leun AM, Thommen DS, and Schumacher TN. CD8+ T cell states in human cancer: insights from single-cell analysis. Nature Reviews Cancer 2020;20:218–32.
  • 21. Duhen T, Duhen R, Montler R, et al. Co-expression of CD39 and CD103 identifies tumor-reactive CD8 T cells in human solid tumors. Nature communications 2018;9:2724.
  • 22. Li H, Leun AM van der, Yofe I, et al. Dysfunctional CD8 T cells form a proliferative, dynamically regulated compartment within human melanoma. Cell 2019;176:775–89.
  • 23. Thommen DS, Koelzer VH, Herzig P, et al. A transcriptionally and functionally distinct PD-1+ CD8+ T cell pool with predictive potential in non-small-cell lung cancer treated with PD-1 blockade. Nature medicine 2018;24:994–1004.
  • 24. Liu B, Zhang Y, Wang D, Hu X, and Zhang Z. Single-cell meta-analyses reveal responses of tumor-reactive CXCL13+ T cells to immune-checkpoint blockade. Nature Cancer 2022;3:1123–36.
  • 25. Lee YJ, Kim JY, Jeon SH, et al. CD39+ tissue-resident memory CD8+ T cells with a clonal overlap across compartments mediate antitumor immunity in breast cancer. Science immunology 2022;7:eabn8390.
  • 26. Yost KE, Satpathy AT, Wells DK, et al. Clonal replacement of tumor-specific T cells following PD-1 blockade. Nature medicine 2019;25:1251–9.
  • 27. Scott AC, Dündar F, Zumbo P, et al. TOX is a critical regulator of tumour-specific T cell differentiation. Nature 2019;571:270–4.
  • 28. Khan O, Giles JR, McDonald S, et al. TOX transcriptionally and epigenetically programs CD8+ T cell exhaustion. Nature 2019;571:211–8.
  • 29. Zhang Y, Chen H, Mo H, et al. Single-cell analyses reveal key immune cell subsets associated with response to PD-L1 blockade in triple-negative breast cancer. Cancer Cell 2021;39:1578–93.
  • 30. Miller BC, Sen DR, Al Abosy R, et al. Subsets of exhausted CD8+ T cells differentially mediate tumor control and respond to checkpoint blockade. Nature immunology 2019;20:326–36.
  • 31. Siddiqui I, Schaeuble K, Chennupati V, et al. Intratumoral Tcf1+ PD-1+ CD8+ T cells with stem-like properties promote tumor control in response to vaccination and checkpoint blockade immunotherapy. Immunity 2019;50:195–211.
  • 32. Schmid P, Cortes J, Pusztai L, et al. Pembrolizumab for early triple-negative breast cancer. New England Journal of Medicine 2020;382:810–21.
  • 33. Chowdhury PS, Chamoto K, Kumar A, and Honjo T. PPAR-induced fatty acid oxidation in T cells increases the number of tumor-reactive CD8+ T cells and facilitates anti–PD-1 therapy. Cancer immunology research 2018;6:1375–87.
  • 34. Caushi JX, Zhang J, Ji Z, et al. Transcriptional programs of neoantigen-specific TIL in anti-PD-1-treated lung cancers. Nature 2021;596:126–32.
  • 35. Seo H, González-Avalos E, Zhang W, et al. BATF and IRF4 cooperate to counter exhaustion in tumor-infiltrating CAR T cells. Nature immunology 2021;22:983–95.
  • 36. Gros A, Robbins PF, Yao X, et al. PD-1 identifies the patient-specific CD8+ tumor-reactive repertoire infiltrating human tumors. The Journal of clinical investigation 2014;124:2246–59.
  • 37. Boutet M, Benet Z, Guillen E, et al. Memory CD8+ T cells mediate early pathogen-specific protection via localized delivery of chemokines and IFNγ to clusters of monocytes. Science advances 2021;7:eabf9975.
  • 38. Shanker A, Verdeil G, Buferne M, et al. CD8 T cell help for innate antitumor immunity. The Journal of Immunology 2007;179:6651–62.
  • 39. Chen Y, Zander RA, Wu X, et al. BATF regulates progenitor to cytolytic effector CD8+ T cell transition during chronic viral infection. Nature immunology 2021;22:996–1007.
  • 40. Hsieh RCE, Krishnan S, Wu RC, et al. ATR-mediated CD47 and PD-L1 up-regulation restricts radiotherapy-induced immune priming and abscopal responses in colorectal cancer. Science immunology 2022;7:eabl9330.
  • 41. Franciszkiewicz K, Boissonnas A, Boutet M, Combadiere C, and Mami-Chouaib F. Role of chemokines and chemokine receptors in shaping the effector phase of the antitumor immune response. Cancer research 2012;72:6325–32.
  • 42. Yeong J, Suteja L, Simoni Y, et al. Intratumoral CD39+ CD8+ T cells predict response to programmed cell death protein-1 or programmed death ligand-1 blockade in patients with NSCLC. Journal of Thoracic Oncology 2021;16:1349–58.
  • 43. Chow A, Uddin FZ, Liu M, et al. The ectonucleotidase CD39 identifies tumor-reactive CD8+ T cells predictive of immune checkpoint blockade efficacy in human lung cancer. Immunity 2023;56:93–106.
  • 44. Leone RD and Powell JD. Metabolism of immune cells in cancer. Nature reviews cancer 2020;20:516–31.
  • 45. Artyomov MN and Van den Bossche J. Immunometabolism in the single-cell era. Cell metabolism 2020;32:710–25.
  • 46. Wegner A, Meiser J, Weindl D, and Hiller K. How metabolites modulate metabolic flux. Current opinion in biotechnology 2015;34:16–22.
  • 47. Costa da Silva M, Breckwoldt MO, Vinchi F, et al. Iron induces anti-tumor activity in tumor-associated macrophages. Frontiers in immunology 2017;8:1479.
  • 48. Sun JL, Zhang NP, Xu RC, et al. Tumor cell-imposed iron restriction drives immunosuppressive polarization of tumor-associated macrophages. Journal of Translational Medicine 2021;19:347.
  • 49. Lee MS and Bensinger SJ. Reprogramming cholesterol metabolism in macrophages and its role in host defense against cholesterol-dependent cytolysins. Cellular & molecular immunology 2022;19:327–36.
  • 50. Behmoaras J, Diaz AG, Venda L, et al. Macrophage epoxygenase determines a profibrotic transcriptome signature. The Journal of Immunology 2015;194:4705–16.
  • 51. Vazquez Rodriguez G, Abrahamsson A, Turkina MV, and Dabrosin C. Lysine in Combination with Estradiol promote dissemination of estrogen receptor positive breast Cancer via Upregulation of U2AF1 and RPN2 Proteins. Frontiers in Oncology 2020:2650.
  • 52. Ricci D, Gidalevitz T, and Argon Y. The special unfolded protein response in plasma cells. Immunological reviews 2021;303:35–51.
  • 53. Dennler P, Fischer E, and Schibli R. Antibody conjugates: from heterogeneous populations to defined reagents. Antibodies 2015;4:197–224.
  • 54. Wang L, Sfakianos JP, Beaumont KG, et al. Myeloid cell–associated resistance to PD-1/PD-L1 blockade in urothelial cancer revealed through bulk and single-cell RNA sequencing. Clinical Cancer Research 2021;27:4287–300.
  • 55. DeNardo DG and Ruffell B. Macrophages as regulators of tumour immunity and immunotherapy. Nature Reviews Immunology 2019;19:369–82.
  • 56. Riihilä P, Nissinen L, Farshchian M, et al. Complement component C3 and complement factor B promote growth of cutaneous squamous cell carcinoma. The American Journal of Pathology 2017;187:1186–97.
  • 57. Haghverdi L, Buettner F, and Theis FJ. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics 2015;31:2989–98.
  • 58. Wolf Y, Shemer A, Polonsky M, et al. Autonomous TNF is critical for in vivo monocyte survival in steady state and inflammation. Journal of Experimental Medicine 2017;214:905–17.
  • 59. Lee MK, Al-Sharea A, Shihata WA, et al. Glycolysis is required for LPS-induced activation and adhesion of human CD14+ CD16− monocytes. Frontiers in immunology 2019;10:2054.
  • 60. Lubbers R, Van Essen M, Van Kooten C, and Trouw L. Production of complement components by cells of the immune system. Clinical & Experimental Immunology 2017;188:183–94.
  • 61. Dann E, Henderson NC, Teichmann SA, Morgan MD, and Marioni JC. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nature Biotechnology 2022;40:245–53.
  • 62. Dykes SS, Fasanya HO, and Siemann DW. Cathepsin L secretion by host and neoplastic cells potentiates invasion. Oncotarget 2019;10:5560.
  • 63. Rochefort H and Liaudet-Coopman E. Cathepsin D in cancer metastasis: a protease and a ligand. Apmis 1999;107:86–95.
  • 64. Vasiljeva O, Papazoglou A, Krüger A, et al. Tumor cell–derived and macrophage-derived cathepsin B promotes progression and lung metastasis of mammary cancer. Cancer research 2006;66:5242–50.
  • 65. Lee YS, Yu JE, Kim KC, et al. A small molecule targeting CHI3L1 inhibits lung metastasis by blocking IL-13Rα2-mediated JNK-AP-1 signals. Molecular Oncology 2022;16:508–26.
  • 66. Huang Rh, Quan Yj, Chen Jh, et al. Osteopontin promotes cell migration and invasion, and inhibits apoptosis and autophagy in colorectal cancer by activating the p38 MAPK signaling pathway. Cellular Physiology and Biochemistry 2017;41:1851–64.
  • 67. He Y, Dong Y, Zhang X, et al. Lipid droplet-related PLIN2 in CD68+ tumor-associated macrophage of oral squamous cell carcinoma: implications for cancer prognosis and immunotherapy. Frontiers in Oncology 2022;12:824235.
  • 68. Yuan Z, Mehta HJ, Mohammed K, et al. TREM-1 is induced in tumor associated macrophages by cyclo-oxygenase pathway in human non-small cell lung cancer. PloS one 2014;9:e94241.
  • 69. Park MD, Reyes-Torres I, LeBerichel J, et al. TREM2 macrophages drive NK cell paucity and dysfunction in lung cancer. Nature immunology 2023:1–10.
  • 70. Liguori M, Digifico E, Vacchini A, et al. The soluble glycoprotein NMB (GPNMB) produced by macrophages induces cancer stemness and metastasis via CD44 and IL-33. Cellular & molecular immunology 2021;18:711–22.
  • 71. Baitsch D, Bock HH, Engel T, et al. Apolipoprotein E induces antiinflammatory phenotype in macrophages. Arteriosclerosis, thrombosis, and vascular biology 2011;31:1160–8.
  • 72. Kemp SB, Carpenter ES, Steele NG, et al. Apolipoprotein E promotes immune suppression in pancreatic cancer through NF-κB–mediated production of CXCL1. Cancer research 2021;81:4305–18.
  • 73. Fuior EV and Gafencu AV. Apolipoprotein C1: its pleiotropic effects in lipid metabolism and beyond. International journal of molecular sciences 2019;20:5939.
  • 74. Li T, Chen W, and Chiang JY. PXR induces CYP27A1 and regulates cholesterol metabolism in the intestine. Journal of lipid research 2007;48:373–84.
  • 75. Persad S, Choo ZN, Dien C, et al. SEACells infers transcriptional and epigenomic cellular states from single-cell genomics data. Nature Biotechnology 2023:1–12.
  • 76. Salcher S, Sturm G, Horvath L, et al. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer Cell 2022;40:1503–20.
  • 77. Sasako T, Ohsugi M, Kubota N, et al. Hepatic Sdf2l1 controls feeding-induced ER stress and regulates metabolism. Nature communications 2019;10:947.
  • 78. Vekich JA, Belmont PJ, Thuerauf DJ, and Glembotski CC. Protein disulfide isomerase-associated 6 is an ATF6-inducible ER stress response protein that protects cardiac myocytes from ischemia/reperfusion-mediated cell death. Journal of molecular and cellular cardiology 2012;53:259–67.
  • 79. Tawbi HA, Schadendorf D, Lipson EJ, et al. Relatlimab and nivolumab versus nivolumab in untreated advanced melanoma. New England Journal of Medicine 2022;386:24–34.
  • 80. Hastings K, Yu H, Wei W, et al. EGFR mutation subtypes and response to immune checkpoint blockade treatment in non-small-cell lung cancer. Annals of Oncology 2019;30:1311–20.
  • 81. Dai L, Jin B, Liu T, Chen J, Li G, and Dang J. The effect of smoking status on efficacy of immune checkpoint inhibitors in metastatic non-small cell lung cancer: A systematic review and meta-analysis. EClinicalMedicine 2021;38.
  • 82. Liu J, Sun B, Guo K, et al. Lipid-related FABP5 activation of tumor-associated monocytes fosters immune privilege via PD-L1 expression on Treg cells in hepatocellular carcinoma. Cancer Gene Therapy 2022;29:1951–60.
  • 83. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, and Tamayo P. The molecular signatures database hallmark gene set collection. Cell systems 2015;1:417–25.
  • 84.Lun ATL, Bach K, and Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome biology 2016;17:1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Févotte C and Idier J. Algorithms for nonnegative matrix factorization with the β-divergence. Neural computation 2011;23:2421–56. [Google Scholar]
  • 86.Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 1996;58:267–88. [Google Scholar]
  • 87.Gopalan P, Hofman JM, and Blei DM. Scalable Recommendation with Hierarchical Poisson Factorization. In: UAI. 2015:326–35. [Google Scholar]
  • 88.Mimno D, Wallach H, Talley E, Leenders M, and McCallum A. Optimizing semantic coherence in topic models. In: Proceedings of the 2011 conference on empirical methods in natural language processing. 2011:262–72. [Google Scholar]
  • 89.Kingma D and Ba J. Adam: A Method for Stochastic Optimization. In: 3rd International Conference on Learning Representations (ICLR 2015), Conference Track Proceedings. 2015. [Google Scholar]
  • 90.Salakhutdinov R, Roweis ST, and Ghahramani Z. Optimization with EM and expectation-conjugate-gradient. In: Proceedings of the 20th International Conference on Machine Learning (ICML-03). 2003:672–9. [Google Scholar]
  • 91.Liu JS and Wu YN. Parameter expansion for data augmentation. Journal of the American Statistical Association 1999;94:1264–74. [Google Scholar]
  • 92.Airoldi EM, Blei D, Fienberg S, and Xing E. Mixed membership stochastic block-models. Advances in neural information processing systems 2008;21. [Google Scholar]
  • 93.Lee DD and Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788–91. [DOI] [PubMed] [Google Scholar]
  • 94.Ke ZT, Ma Y, and Lin X. Estimation of the number of spiked eigenvalues in a covariance matrix by bulk eigenvalue matching analysis. Journal of the American Statistical Association 2021:1–19. [Google Scholar]
  • 95.Stevens K, Kegelmeyer P, Andrzejewski D, and Buttler D. Exploring topic coherence over many models and many topics. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning. 2012:952–61. [Google Scholar]
  • 96.González-Blas CB, De Winter S, Hulselmans G, et al. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. bioRxiv 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97.Barkley D, Moncada R, Pour M, et al. Cancer cell states recur across tumor types and form specific interactions with the tumor microenvironment. Nature Genetics 2022;54:1192–201. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98.Consortium TGO. The Gene Ontology Resource: 20 years and still Going strong. Nucleic Acids Research 2018;47:D330–D338. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, and Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics 2011;27:1739–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Kanehisa M, Furumichi M, Tanabe M, Sato Y, and Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research 2016;45:D353–D361. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Croft D, O’Kelly G, Wu G, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Research 2010;39:D691–D697. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Eizenberg-Magar I, Rimer J, Zaretsky I, Lara-Astiaso D, Reich-Zeliger S, and Friedman N. Diverse continuum of CD4+ T-cell states is determined by hierarchical additive integration of cytokine signals. Proceedings of the National Academy of Sciences 2017;114:E6447–E6456. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103.Walle T. wallet-maker/cytopus: Cytopus v1.21. Version v1.21. 2022. DOI: 10.5281/zenodo.7306238. [DOI]
  • 104.Eisenhauer EA, Therasse P, Bogaerts J, et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). European journal of cancer 2009;45:228–47. [DOI] [PubMed] [Google Scholar]
  • 105.Gayoso A, Shor J, Carr AJ, Sharma R, and Pe’er D. DoubletDetection (Version v3.0). Zenodo; 2020. [Google Scholar]
  • 106.Levine JH, Simonds EF, Bendall SC, et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 2015;162:184–97. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.Arvai K. kneed. Version 0.7.0. 2020. DOI: 10.5281/zenodo.6944485. [DOI] [Google Scholar]
  • 108.Danilova L, Anagnostou V, Caushi JX, et al. The Mutation-Associated Neoantigen Functional Expansion of Specific T Cells (MANAFEST) Assay: A Sensitive Platform for Monitoring Antitumor Immunity. Cancer immunology research 2018;6:888–99. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Van Dijk D, Sharma R, Nainys J, et al. Recovering gene interactions from single-cell data using data diffusion. Cell 2018;174:716–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Fang Z, Liu X, and Peltz G. GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics 2022. btac757. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Supplementary Note
Supplementary Table 1
Supplementary Table 2
Supplementary Table 3
Supplementary Table 4
Supplementary Table 5

Data Availability Statement

Spectra is available as an open-source Python package at https://github.com/dpeerlab/spectra, and the immune knowledge base is available at https://github.com/wallet-maker/cytopus (ref. 86). Notebooks to reproduce the figures are available at https://github.com/dpeerlab/SpectraReproducibility.
