Abstract
Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure from single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership to multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13059-023-03067-9.
Keywords: Gene expression, Single-cell RNA-seq, Single-cell ATAC-seq, Differential expression analysis, Dimensionality reduction, Parts-based representations, Matrix factorization, Topic modeling
Background
A key methodological aim in single-cell genomics is to learn structure from single-cell sequencing data in a systematic, data-driven way [1–3]. Clustering [4–7] and dimensionality reduction techniques such as such as PCA [8–10], t-SNE [11], or UMAP [12] are commonly used for this aim. Despite the fact that many of these techniques have been applied “out-of-the-box” (with some caveats [13–18]), they have been remarkably successful in revealing and visualizing biologically interesting substructures from single-cell data [7, 19–29].
Another class of dimensionality reduction approaches that have been used to identify structure from single-cell data are what are sometimes called parts-based representations—these approaches include non-negative matrix factorization (NMF) [30–44] and topic modeling [45–56], which also have formal connections [48, 57, 58]. Parts-based representations share some of the features of both a clustering and a dimensionality reduction: on the one hand, they learn a lower dimensional representation of the cells; on the other hand, the individual dimensions (the “parts”) of the reduced representation can identify discrete clusters or discrete subpopulations [59, 60]. However, parts-based representations are more flexible than clustering—the dimensions can also capture other features such as continuously varying cell states.
In this paper, we investigate the question of how to interpret the individual dimensions of a parts-based representation learned by fitting a topic model (in the topic model, the dimensions are also called “topics”). For topics that assign observations to discrete clusters, one could apply a standard method for differential expression analysis [61, 62] to compare expression between topics, then annotate these topics by the genes that are differentially expressed. The question, therefore, is what to do with topics that do not assign observations to discrete clusters. To tackle this question, we extend models that compare expression between groups by allowing observations to have partial membership in multiple groups. This more flexible differential expression analysis is implemented by taking an existing model and modifying it to allow for partial memberships to groups or topics. This modified model is a “grade of membership” model [63], so we call our new method grade of membership differential expression (GoM DE). The idea is that, by generalizing existing methods, we can continue to take advantage of existing elements of differential expression analysis but now apply them to learn about different types of cell features beyond discrete cell populations.
We describe the GoM DE approach more formally in the next section. Then, we evaluate the GoM DE approach in simulations, showing, in particular, that it recovers the same results as existing differential expression analysis methods when the cells can be grouped into discrete clusters. In case studies, we demonstrate how the GoM DE analysis analysis can be used to uncover and interpret a variety of cell features from single-cell RNA-seq and ATAC-seq data sets.
Results
Methods overview and illustration
We begin by giving a brief overview of the topic model; then, we describe the new methods for annotating topics. To illustrate key concepts, we analyze a single-cell RNA-seq (scRNA-seq) data set obtained from peripheral blood mononuclear cells (PBMCs) [29] that has been used in several benchmarking studies (e.g., [4, 7, 8, 64, 65]). We refer to these data as the “PBMC data.”
Learning expression topics from single-cell RNA-seq data
The original aim of the topic model was to discover patterns from collections of text documents, in which text documents were represented as word counts [45, 50, 66–68]. By substituting genes for words and cells for documents, topic models can also be used to learn a reduced representation of cells by their membership in multiple “topics” [47].
When applied to scRNA-seq data generated using UMIs, the topic model assumes a multinomial distribution of the RNA molecule counts in a cell,
1 |
where , and m is the number of genes, that is, the number of RNA molecules observed for gene j in cell i is a noisy observation of an underlying true expression level, [8, 69].
For n cells, the topic model is a reduced representation of the underlying expression,
2 |
where are , matrices, respectively, with entries . Each cell i is represented by its “grade of membership” in K topics, a vector of proportions , such that , , and each “expression topic” is represented by a vector of (relative) expression levels , (these are also constrained to sum to 1, which ensures that the s are multinomial probabilities). To efficiently fit the topic model to large single-cell data sets, we exploit the fact that the topic model is closely related to the Poisson NMF model [48].
The matrix in (2), which contains the membership proportions for all cells and topics, can be visualized using a “Structure plot.” Structure plots have been used to visualize the results of population genetics analyses (e.g., [70–72]) and, more recently, to visualize the topics learned from bulk and single-cell RNA-seq data [47].
A Structure plot visualizing the topic model fit to the PBMC data, with topics, is given in Fig. 1. In this data set, the cells have been “sorted” into different cell types which provides a cell labeling to compare against. From the Structure plot, it is apparent that a subset of topics—topics 1, 2, and 3—correspond closely to the sorted subpopulations (B cells, CD14+ monocytes, CD34+ cells) (indeed, distinctive genes and enriched gene sets identified by the methods described below suggest these same subpopulations; Fig. 1C). Topics 4 and 5, on the other hand, are not confined to a single sorted cell type and instead appear to capture biological processes common to T cells and natural killer (NK) cells. CD8+ cytotoxic T cells have characteristics of both NK cells and T cells—these are T cells that sometimes become “NK-like” [73]—and this is captured in the topic model by assigning membership to both topics. Topic 6 also captures continuous structure, but, unlike topics 4 and 5, it is present in almost all cells, and therefore, its biological interpretation is not at all clear from the cell labeling. More generally, the topics, whether they capture largely discrete structure (topics 1–3) or more continuous structure (topics 4–6), can be thought of as a “soft” clustering [47].
Learning chromatin accessibility topics from single-cell ATAC-seq data
For single-cell ATAC-seq data, the observations denote the number of reads mapping to region j in cell i. However, it is common to “binarize” the read counts such that when at least one fragment in cell i maps to region j and otherwise.
Using the topic model to analyze (binarized) single-cell ATAC-seq data was first suggested by [49]. Therefore, they implicitly assumed a multinomial model (1) in which the s are binarized accessibility values instead of UMI counts. A binomial model for binarized accessibility data was proposed in [74]. As we explain in the “Methods” section, we view both models as approximations, and under reasonable assumptions the models are similar.
Differential expression analysis allowing for grades of membership
Having learned the topics, our aim now is to identify genes that are distinctive to each topic. In the simplest case, the topic is a distinct or nearly distinct cluster of cells, such as topic 1 or topic 2 in Fig. 1.
In the following, we describe methods for analyzing differences in expression, but they can also be understood as methods for analyzing differences in chromatin accessibility. Therefore, “expression,” “expressed,” and “gene” in the descriptions below may be substituted with “accessibility,” “accessible,” and “peak” (or “region”).
Consider a single gene, j. Provided unmodeled sources of variation are negligible relative to measurement error, a simple Poisson model of expression should suffice:
3 |
In this model, for gene j in cell i is controlled by the cell’s membership in the cluster: when cell i belongs to the cluster, ; otherwise, . Under this model, differential expression (DE) analysis proceeds by estimating the log-fold change (LFC) in expression for each gene j,
4 |
Although simple, this Poisson model forms the basis for many DE analysis methods [75–80].
We now modify the Poisson model (3) in a simple way to analyze differential expression among topics. In a clustering, each cell belongs to a single cluster, whereas in the topic model, cells have grades of membership to the clusters [63] in which is the membership proportion for cluster or topic k. Therefore, we extend the model to allow for partial membership in the K topics:
5 |
in which the membership proportions are treated as known, and the unknowns represent relative expression levels (a related model is used in C-SIDE [80] to model cell-type mixtures in DE analysis of spatial transcriptomics data). Note that will be similar to, but not the same as, in the topic model because the DE analysis is a gene-by-gene analysis, whereas the topic model considers all genes at once. The standard Poisson model (3) is recovered as a special case of (5) when and all membership proportions are 0 or 1.
Recall, our aim is to identify genes that are distinctive to each topic. To this end, we estimate the least extreme LFC (l.e. LFC), which we define as
6 |
in which is the pairwise LFC,
7 |
In words, the l.e. LFC for topic k is the LFC comparing topics k and l, in which l is chosen to be topic that results in the smallest (“least extreme”) change. By this definition, a “distinctive gene” is one in which its expression is significantly different from its expression in all other topics (note the l.e. LFC reduces to the standard LFC (4) when ). We then annotate topics by the distinctive genes. The estimation of l.e. LFCs and computation of related posterior statistics is described in the “Methods” section.
To illustrate what the least extreme LFC does and does not do, consider the following toy example with topics (Fig. 2). Gene 1 has high expression in topic 1 and low expression in the other topics. Therefore, all the pairwise LFCs for topic 1 are large, , , and this results in an l.e. LFC for topic 1 of . So gene 1 is a distinctive gene for topic 1. Next consider gene 2, which has high expression in topics 1 and 2 and low expression in the other topics. For gene 2, the pairwise LFCs for topic 1 are mostly large, , , except for . So, the l.e. LFC for topic 1 is zero and, as a result, gene 2, although potentially helpful for interpreting topic 1, is not a distinctive gene for topic 1.
Illustration of GoM DE analysis in PBMC data set
To illustrate, we applied the GoM DE analysis to the topic model shown in Fig. 1 and visualized the results in “volcano plots” (Fig. 3). We then used the GoM DE results (Additional file 2: Table S1) to perform gene set enrichment analysis (Additional file 3: Tables S3, S4).
For the topics that closely correspond to cell types, the GoM DE analysis, as expected, identified genes and gene sets reflecting these cell types. For example, topic 1 corresponds to FACS B cells and is characterized by overexpression of CD79A (posterior mean l.e. LFC = 13.05) and enrichment of B cell receptor signaling genes (enrichment coefficient = 0.72). Topic 2 corresponds to myeloid cells and is characterized by overexpression of S100A9 (l.e. LFC = 15.45) and enrichment of genes down-regulated in hematopoietic stem cells (enrichment coefficient = 0.90).
The close correspondence between topics 1 and 2 and FACS cell types (B cells, myeloid cells) provides an opportunity to contrast the GoM DE analysis with a standard DE analysis of the FACS cell types (Fig. 4). This is not a perfect comparison because the topics and FACS cell populations are not exactly the same, but the LFC estimates correlate well (Fig. 4A, B). This comparison illustrates to two key differences:
Many more l.e. LFCs are driven toward zero in the GoM DE analysis (Fig. 4C), so the l.e. LFCs more effectively draw attention to the “distinctive genes” (Fig. 4A, B). This includes genes that are distinctively underexpressed such as ID2 in B cells [81].
The GoM DE analysis yields much larger LFC estimates of the cell-type-specific genes. This is because the topic model isolates the biological processes (topics 1 and 2) related to cell type while removing background biological processes (topic 6) that do not relate to cell type.
Other topics capture more continuous structure, such as topics 4 and 5 (Fig. 1). Although the GoM DE analysis of these topics is not comparable to a standard DE analysis, many of the the distinctive genes and gene sets suggest NK and T cells, which are precisely the FACS-labeled cells with greatest membership to these topics: for example, for topic 4, overexpression of NKG7 (posterior mean l.e. LFC = 14.09), enrichment of cytolysis genes (enrichment coefficient = 2.22); for topic 5, overexpression of CD3D (l.e. LFC = 12.01), enrichment co-stimulatory signaling during T-cell activation (enrichment coefficient = 1.58).
Topic 6 captures continuous structure and is present in almost all cells, so knowledge of the FACS cell types is not helpful for understanding this topic. Still, the GoM DE results for topic 6 show a striking enrichment of ribosome-associated genes (Fig. 3, Additional file 3: Tables S3, S4) (these ribosomal protein genes also account for a large fraction of the total expression in the cells [5]). This ability to annotate distinctly non-discrete structure is a distinguishing feature of the grade-of-membership approach, and below we will show more examples where this feature contributes to understanding of the cell populations.
Evaluation of DE analysis methods using simulated data
Having illustrated the features of this approach, we now evaluate the methods more systematically in simulated expression data sets. We began our evaluation by first considering the case of two groups in which there is no partial membership to these groups, that is, when the cells can be separated into two cell types. The GoM DE analysis should accommodate this special case and should compare well with existing DE analysis methods. We compared with DESeq2 [79] and MAST [83], both popular methods that have been shown to be competitive in benchmarking studies [61, 62, 85] (and are included in Seurat [25]).
To compare the ability of these methods to discover differentially expressed genes, we simulated RNA molecule count data for 10,000 genes and 200 cells in which 98% of cells were attributed to a single topic, with roughly the same number of cells assigned to each of the two topics (with membership proportions of 99% or greater). Note that although half the simulated genes had different expression levels in the two topics, most of these expression differences were small, and therefore the methods were not expected to identify most expression differences. This mimics the typical situation in gene expression studies whereby most expression differences are small. Molecule counts were simulated using a Poisson measurement model so that variation in expression across cells was due to either measurement error or true differences in expression levels between the two groups. For all DE analyses, we took group/topic assignments to be known so that incorrect assignment of cells to topics was not a source of error. Other aspects of the simulations were chosen to emulate molecule count data from scRNA-seq studies (see the “Methods” section). We repeated the simulations 20 times, and summarized the results of the DE analyses in Fig. 5 (also Additional file 1: Figs. S2, S3).
DESeq2 and the GoM DE analysis have several features in common: both are based on a Poisson model, and both use adaptive shrinkage [82, 84] to improve accuracy of the LFC estimates and test statistics. Therefore, we expected the GoM DE results to closely resemble DESeq2 in these simulations. Indeed, both methods produced nearly identical posterior mean LFC estimates, posterior z-scores (Fig. 5A, B), and s-values (Additional file 1: Fig. S3) and achieved very similar performance (Fig. 5C). Although DESeq2 additionally estimates an overdispersion level for each gene, in these simulations, DESeq2 correctly determined that the level of overdispersion was small for genes with large expression differences, which explains the strong similarity of the LFC estimates and posterior z-scores. MAST, owing to an approach that is very different from DESeq2 and the GoM DE analysis, yielded estimates that were less similar (Additional file 1, Fig. S3), yet achieved comparable performance (Fig. 5C).
Next, we evaluated the GoM DE analysis methods in data sets in which the cells had varying degrees of membership to multiple topics. Since existing DE methods cannot handle the situation in which there are partial memberships to groups, we mainly sought to verify that the method behaves as expected in the ideal setting when data sets are simulated from the topic model (2). To provide some baseline for comparison, we also applied the method of Dey et al. [47], which is not strictly a DE analysis method but does provide a ranking of genes by their “distinctiveness” in each topic. This ranking is based on a simple Kullback-Leibler (K-L) divergence measure; large K-L divergences should signal large differences in expression, as well as high overall levels of expression, so large K-L divergences should correspond to small DE p-values. Since the K-L divergence is not a signed measure, we omitted tests for negative expression differences from the evaluations, which was roughly half of the total number of possible tests for differential expression.
We performed 20 simulations with topics and cells and another 20 simulations with topics and cells. To simplify evaluation, all genes either had the same rate of expression in all topics, or the rate was different in exactly one topic. As a result, the total number of expression differences in each data set was roughly the same regardless of the number of simulated topics. Other aspects of the simulations were kept the same as the first set of simulations (see Methods). Similar to before, we took the membership proportions to be known so that mis-estimation of the membership proportions would not be source of error in the GoM DE analysis and in calculation of the K-L divergence scores.
The largest K-L divergence scores in the simulated data sets reliably recovered true expression differences (Fig. 6A, E). Therefore, the K-L divergence scores achieved good true positives rates (i.e., good power) at low false positive rates, (see Fig. 5 for notation). However, for DE analysis, a more relevant performance measure is the false discovery rate, . Because the K-L divergence score does not fully account for uncertainty in the unknown gene expression differences, many genes with no expression differences among topics were also highly ranked, leading to poor FDR control (Fig. 6D, H). By contrast, the GoM DE analysis better accounted for uncertainty in the unknown expression levels. The GoM DE analysis also more accurately recovered true expression differences at small p-values or s-values (Fig. 6B, C, F, G) and therefore obtained much lower false discovery rates at corresponding levels of power (Fig. 6D, H). Comparing the GoM DE analysis with and without adaptive shrinkage, the adaptive shrinkage did not necessarily lead to better performance (Fig. 6D, H) but did provide more directly interpretable measures of significance (s-values or local false sign rates) by shrinking the LFC estimates and adapting the rate of shrinkage to the data; for example, the expression differences were shrunk more strongly in the data sets, correctly reflecting the much smaller proportion of true expression differences (compare Fig. 6C and G).
Case study: scRNA-seq epithelial airway data from Montoro et al. (2018)
We reanalyzed scRNA-seq data for single cells sampled from the tracheal epithelium in wild-type mice [86]. The original analysis [86] used a combination of methods, including t-SNE, community detection [87], diffusion maps [88], and partitioning around medoids (PAM) to identify 7 epithelial cell types: abundant basal and secretory (club) cells; rare, specialized epithelial cell types, including ciliated, neuroendocrine and tuft cells; a novel subpopulation of “ionocytes”; and a novel basal-to-club transitional cell type, “hillock” cells. Although not large in comparison to other modern single-cell data sets, this data set is challenging to analyze, with complex structure, and a mixture of abundant and rare cell types. In contrast to the PBMC data set, there are no existing cell annotations to interpret the topics, so we must rely on inferences made from the expression data alone to make sense of the results.
The topic model fit to the UMI counts with topics is shown in Fig. 7A, and the results of the GoM DE analysis and subsequent GSEA are summarized in Fig. 7. Although we do not have cell labels to compare with, distinctive genes emerging from the GoM DE analysis help connect some of the topics to known cell types. For example, the most abundant topics correspond well with predominant epithelial cell types in the lung: topic 1 shows strong overexpression of basal cell marker gene Krt5 [89] (posterior mean l.e. LFC = 4.62) and distinctive genes in topics 2 and 3 include key secretory genes in club cells such as Bpifa1/Splunc1 [90] (l.e. LFC = 4.93) and Scgb1a1 [91] (l.e. LFC = 5.90).
The “hillock” transitional cells, which were originally identified via a diffusion maps analysis [86], emerge as a single topic (topic 4, cyan), with Krt13 (l.e. LFC = 8.04) and Krt4 (l.e. LFC = 5.46) being among the most distinctive genes. The transitional nature of these cells is evoked by their mixed membership; only 237 out of the 7193 cells have > 90% membership to this topic.
Other less abundant epithelial cell types emerge as separate topics once a topic model is fit separately to the subpopulation of these rare cell types (Fig. 7B). These topics recover ciliated cells (topics 8, 9; Ccdc153, posterior l.e. LFC = 5.39), neuroendocrine cells (topic 10; Chga, l.e. LFC = 6.92), and tuft cells (topic 11; Trpm5, l.e. LFC = 6.94). Note that Foxi1+ ionocytes were previously identified as a novel cell type from a small cluster of 26 cells [86], but our analysis failed to distinguish this very rare cell type from the neuroendocrine cells (Additional file 1: Figs. S4, S5).
The topics also capture biologically relevant continuous substructure in club cells (topics 2 and 3) and ciliated cells (topics 8 and 9) that was not discovered in the original analysis [86]. This continuous substructure may be reflective of finer scale cell differentiation or specialization of function. In particular, we interpret topic 3 as capturing “canonical” or “mature” (Scgb1a1+, l.e. LFC = 5.90) club cells [90], with negative regulation of inflammation, whereas cells with greater membership to topic 2 are “club-like” (Bpifa1/Splunc1+, l.e. LFC = 3.94) [89, 91]. Topic 9, similarly, appears to represent “canonical” ciliated cells, featuring upregulated genes such as such as Ccdc67/Deup1 () and Ccdc34 (3.29) [89, 92, 93], and enrichment of Gene Ontology terms [94] such as cilium organization (GO:0044782) and axonemal dynein inner arm assembly (GO:0036159).
In summary, by taking a topic-model-based approach we identified and annotated well-characterized cell types such as basal cells, as well less distinct but potentially interesting substructures such as “Hillock” cells and club cell subtypes.
Case study: Mouse sci-ATAC-seq Atlas data from Cusanovich et al. (2018)
We reanalyzed data from the Mouse sci-ATAC-seq Atlas [97], comprising 81,173 single cells in 13 tissues. First, to provide an overview of the primary structure in the whole data set, we fit a topic model with topics to these data. The topics correspond closely to the clusters identified in [97] (Additional file 1: Fig. S8), and several different tissues are distinguished by different topics (Fig. 8A). For the 4 tissues that have replicates, the replicates show a similar composition of the topics (Fig. 8A).
Next, we performed a more detailed analysis of just the kidney (6431 cells), fitting a topic model with to just these cells. We focussed on the kidney cells because, as noted previously [97, 98], both expression and chromatin accessibility vary in relation to the spatial organization of the renal tubular cells, and we predicted that this spatial structure could be better captured by topics rather than by traditional clustering methods. To interpret these topics obtained from chromatin accessibility data, we first used the GoM DE analysis to identify differentially accessible peaks for each topic; then, we used “co-accessibility” as predicted by Cicero [95, 97] to connect genes to peaks representing distal regulated sites. Finally, we performed a simple enrichment analysis to identify the “distinctive genes” for each topic, which we defined as the genes with many distal regulatory sites that were differentially accessible.
The results of these analyses are shown in Fig. 8. Many of the distinctive genes (Fig. 8, Additional file 1: Fig. S9, Additional file 6: Table S7) clearly relate topics to known kidney cell types. For example, topic 1 is enriched for genes Klf5 and Elf5 which relate to the collecting duct [98, 99]; topic 3 is enriched for genes Umod and Slc12a1 associated with the loop of Henle [98, 100]); and topics 2, 6, and 7 are respectively enriched for genes related to the distal convoluted tubule (Wnk1), podocytes (Col1a2) and glomerular endothelial cells (Ptprb).
Most interestingly, spatial organization of the proximal tubule is captured by two topics; topic 4 is enriched for Slc5a2 (also known as Sglt2) and Slc2a2 (also known as Glut2), associated with the S1 segment of the proximal tube [96, 101, 102], and topic 5 is enriched for Slc5a8 (Smct1) and Atp11a, related to the S3 segment [96, 103]. This result illustrates the ability of the topic model to capture continuous variation in membership of two somewhat complementary processes, which traditional clustering methods are not designed for.
Case study: chromatin accessibility profiles of the hematopoietic system from Buenrostro et al. (2018)
Buenrostro et al. [104] studied 2034 single-cell ATAC-seq profiles of 10 cell populations isolated by FACS to characterize regulation of the human hematopoietic system. Both PCA and t-SNE showed, visually, the expected structure into the main developmental branches (Fig. 2 in [104]). However, neither PCA nor t-SNE isolated these branches as individual dimensions of the embedding. Identifying these branches may allow for more precise characterization of the underlying regulatory patterns. Here, by fitting a topic model to the data, the main developmental branches are identified as individual topics (Fig. 9A): topic 3, pDC; topic 4, erythroid (MEP); topic 5, lymphoid (CLP); and topic 6, myeloid (GMP and monocytes). Another topic captures the cells at the top of the developmental path (topic 1; HSC and MPP). Other cells at intermediate points in the developmental trajectory, such as CMP, GMP and LMPP cells, are more heterogeneous, and this is reflected by their high variation in topic membership.
To better interpret the regulatory patterns behind each topic, we identified transcription factor (TF) motifs that were enriched for differentially accessible regions in each topic (Fig. 9B, Additional file 7: Table S8). Many of the top TF motifs (as ranked by HOMER p-values [105]) point toward regulation of the main developmental trajectories, such as EBF motifs in topic 5 (lymphoid), CEBP motifs in topics 6 and 7 (myeloid), and Hox motifs in topic 1 (HSC and MPP cells). A few topics (topics 8–10) are much less abundant and do not align well with the FACS cell types, and their motif enrichment results were correspondingly more difficult to interpret.
A complication that arose in analyzing these data, which was also noted in [104], is that the cells were obtained from different sources, and this shows up as systematic variation in the chromatin accessibility. This donor effect is captured by topics 1 and 2 in HSC and MPP cells and, to a lesser extent, in CMP and LMPP cells (Additional file 1: Fig. S11). Topic 1 is enriched for Jun and Fos TF motifs, similar to what was found in [104].
Discussion
The GoM DE analysis is part of a topic-model-based pipeline for analysis of single-cell RNA-seq [47] or ATAC-seq data [49]. This pipeline includes the following steps: (1) fit a topic model to the data; (2) visualize the structure inferred by the topic model; (3) run the GoM DE analysis with the estimated topics; and, optionally, (4) perform other downstream analyses using the results of the GoM DE analysis, e.g., gene set enrichment analysis (for RNA-seq data) or motif enrichment analysis (for ATAC-seq data). Unlike most analysis pipelines for clustering and dimensionality reduction (e.g., [4, 19, 23, 26, 27]), the topic-model-based pipeline is directly applied to the “raw” count data and therefore does not require an initial step to transform and normalize the data which can lead to downstream issues in the statistical analysis [8, 106–108]. We presented several case studies illustrating the use of the topic-model-based pipeline to analyze single-cell RNA-seq and ATAC-seq data sets. From these case studies, we have drawn a few lessons on the practical challenges that may arise in applying topic modeling approaches to single-cell data, and we share these lessons here (see also [47, 49] for related discussion).
One practical question is how to choose K, the number of topics. Many papers have suggested different criteria for determining K. Our view, following [47], is that there is no single “best” K, and we recognize the advantages of learning topics at multiple settings of K; in some data sets, different Ks can reveal structure at different levels of granularity (for example, increasing the number of topics in the Mouse sci-ATAC-seq Atlas data revealed more structure within tissues; see https://tinyurl.com/2p99swdk). We have found that it is often helpful to start with a smaller K to elucidate the less granular structure, which is often easier to interpret, then rerun the topic modeling with larger K to identify finer structure.
We proposed annotating topics by distinctive genes identified using the l.e. LFC. One drawback is that this does not reveal the commonalities that may exist among multiple topics, for example, topics corresponding to subpopulations within a common class of cells. A simple alternative to the l.e. LFC, which is also implemented in the fastTopics R package, is to compare against expression under the “null model” (see the “Methods”). We view this as a complementary LFC metric that may reveal additional insights into the topics.
Donor, batch or other technical effects in the single-cell RNA-seq or ATAC-seq data can complicate the analysis and interpretation of the topics if these effects are not small. Since these effects are usually not known, usually we must assess their impact indirectly [109]. For example, the Mouse sci-ATAC-seq Atlas data included several replicates, but the replicate effects appeared to be small judging by the fact that the replicates showed a similar composition of topics. By contrast, the donor effects in the human hematopoietic system data were much larger, and in the topic model, these donor effects were at least partially captured by individual topics. The broader question of how to deal with non-ignorable donor or batch effects—in particular, how to separate technical effects from biological effects of interest—remains a question of considerable debate and continued investigation [25, 39, 109–116]. In particular, it has been noted that attempting to “correct” for effects can sometimes remove differences that we would like to learn about such as differences in cell-type proportions among the batches.
For modeling UMI counts, an open question is whether the Poisson or multinomial model (1) is sufficient or whether more flexible models are needed (this question was investigated in [69] for single-gene models, but not for multi-gene models). Alternative models such as the negative binomial [117] or Poisson log-normal [80, 118], which can capture additional random variation (“overdispersion”) in underlying expression or measurement error, may result in more robust estimation of the topics.
In single-cell ATAC-seq data, the GoM DE analysis identifies differentially accessible peaks or regions. Usually, these peak-level results need to be translated into biological units that are more useful for annotating the topics (e.g., genes, gene sets, transcription factors). In the analysis of the hematopoietic system single-cell ATAC-seq data, we used HOMER [105] to identify TF motifs enriched for differentially accessible peaks. In the analysis of the Mouse sci-ATAC-seq Atlas data, we identified genes enriched for differentially accessible distal regularity sites. Clearly, the quality of the gene enrichment results will depend on our ability to accurately associate peaks with genes. For this, we used the scores computed in [97] using Cicero [95]. However, there are now several alternatives to Cicero that may be preferred [19, 27, 28, 119–122], and in principle any of these approaches could be combined with the peak-level GoM DE results to identify relevant genes.
Recently developed technologies profile both transcription and chromatin accessibility in single cells [123, 124]. For such data, one could fit two topic models, one to the RNA-seq data and another to the ATAC-seq data. With a careful initialization of the topic model fitting algorithm, the topics may be more consistent across the two modalities. But it would be preferrable to analyze the multimodal data jointly for improved accuracy [125–130]. Potentially, the strategy used in MOFA [131, 132] could be adapted for topic modeling—that is, the transcripts and accessibility profiles would share the same membership proportions, , but each modality would have a different . However, it remains to be seen how well this strategy works in practice.
Conclusions
To summarize, we have described a new method that aids in annotating and interpretating the “parts” of cells learned by fitting a topic model to scRNA-seq data or single-cell ATAC-seq data. Our method, GoM DE (differential expression analysis allowing for grades of membership), can be viewed as an extension of existing differential expression methods that allows for mixed membership to multiple groups or topics.
Methods
Models for single-cell ATAC-seq data
In single-cell ATAC-seq data, is the number of unique reads mapping to peak or region j in cell i. Although can take non-negative integer values, it is common to “binarize” the accessibility data (e.g., [19, 74, 133–135]), meaning that when at least one read in cell i maps to region j and otherwise. For this reason, one might prefer to model the binarized accessibility values as binomial (Bernoulli) random variables. A multinomial model, on the other hand, should better capture the sampling process for reads mapping to regions but does not account for the truncation of read counts above 1. Therefore, we view both the binomial and multinomial models as approximations. As we explain next, under reasonable assumptions the binomial and multinomial models are similar to each other so it may not matter which model one chooses.
The multinomial topic model for analyzing single-cell ATAC-seq data was suggested by [49]. They assumed the multinomial model (1) in which the s are binarized accessibility values instead of UMI counts.
A binomial model was proposed in [74],
8 |
where is a cell-specific factor that depends on sequencing coverage and other properties (e.g., amplification, read post-processing [136]), is a region-specific factor (say, proportional to the size of the region), and the s capture additional variation in accessibility across cells and regions. Moving forward, we make the simplifying assumption that the regions are all approximately the same size; that is, for all .
The binomial model (8) is closely related to a multinomial model. To make the connection, we first note that the binomial model with for all j can be approximated by a Poisson model,
9 |
This will be a good approximation when the s are small and the cell-specific factors are large, which is usually the case in single-cell ATAC-seq data. Next, we note that the Poisson model (9) and multinomial model (1) are closely related if we choose the size factors to be [69, 137]; this implies , where is the matrix with entries . By these arguments, the binomial model (8) (also the model used in [74]) and the multinomial model (1) (also the model used in [49]) are similar, and connecting the two models clarifies the assumptions made by each of the models. In particular, the multinomial topic model (1–2) used here and in [49] assumes a low-rank structure in the s across cells and regions; i.e., .
Differential expression analysis allowing for grades of membership
Derivation of GoM DE model
In the “Methods overview,” we motivated the GoM DE model (3) as extending a basic Poisson model expression to allow for partial membership to K groups or topics. The GoM DE model can also be motivated from an approximation to the topic model. Recall, the topic model, is a multinomial model (1) in which the multinomial probabilities are given by affine combinations of the expression levels in the K topics, . The non-negativity constraints and sum-to-one constraints ensure that the s are multinomial probabilities. From a basic identity relating the multinomial and Poisson distributions [138, 139], the multinomial likelihood for the topic model can be replaced with a likelihood formed by a simple product of independent Poissons, that is,
10 |
where and . The approximation then comes from no longer requiring the s to be multinomial probabilities by removing the constraint that . This allows us to analyze the genes independently. This is a good approximation so long as is large and the s are small (a similar approximation was used for GLM-PCA [8]). To be explicit about this approximation, we say (which are no longer guaranteed to be multinomial probabilities) and (which are no longer guaranteed to sum to one), resulting in the GoM DE model, which for convenience we restate here:
11 |
“Null” model
The simplest Poisson model of the form (3) is one in which is the same across all cells i, that is, for all . We treat this a “null” model, which can be used to make certain comparisons, e.g., to estimate changes in expression in relative to expression in all cells. The maximum-likelihood estimate (MLE) of under the null model is
12 |
Estimation of log-fold change
In practice, we have found the l.e. LFC to work well, so in our results we use the l.e. LFC. But the l.e. LFC may not be appropriate in all circumstances, and for this reason ,we note that the GoM DE analysis framework is quite general and accommodates alternatives to the l.e. LFC. Two alternatives are implemented in the software. One alternative is to compare with the “null” model,
13 |
Another treats one topic l as a “reference topic”, and compares all other topics to l using (4).
Maximum-likelihood estimation
A convenience of the Poisson model allowing for grades of membership is that we can reuse Poisson NMF computations (described below and in more detail in [48]) to compute MLEs of the unknowns : if we consider all genes simultaneously, we recover a Poisson NMF model, , by setting , . Therefore, we can reuse the Poisson NMF algorithms to compute MLEs of the unknowns .
Maximum a posteriori estimation
To improve numerical stability in the parameter estimation, we compute maximum a posteriori (MAP) estimates of in which each is assigned a gamma prior, , with , , and . Typically, will be some small, positive number, e.g., . Here, we use the parameterization of the gamma distribution from [140] in which is the shape parameter and is the inverse scale parameter; under this parameterization, the mean is and the variance is . The maximum-likelihood computations can be reused for MAP estimation with this gamma prior by adding “pseudocounts” to the data; specifically, MAP estimation of given counts and membership proportions and is equivalent to maximum-likelihood estimation of given counts and membership proportions matrix , where is the identity matrix. Unless otherwise stated, we added pseudocounts to the data.
Quantifying uncertainty and stabilizing LFC estimates
We implemented a simple Markov chain Monte Carlo (MCMC) algorithm [141, 142] to quantify uncertainty in the LFC estimates. Although normal approximations to likelihoods are typically used by DE methods to quickly obtain analytical measures of uncertainty (e.g., standard errors, confidence intervals) for LFCs, we found that normal approximations to the likelihoods from (5) were sometimes poorly behaved, particularly for lowly expressed genes. Another consideration was that the analytical solutions provide confidence intervals for the unknowns , but ultimately we are interested in quantifying uncertainty in the l.e. LFCs (6) which do not have a simple linear relationship to the s. Therefore, it is unclear whether the standard analytical solutions can be applied to the l.e. LFCs without making further approximations or simplifications.
MCMC is typically computationally intensive, but with careful implementation (e.g., use of sparse matrix operations and multithreaded computations) the MCMC algorithm is quite fast. Other benefits of using MCMC is that the algorithm can straightforwardly accommodate different choices of LFC statistics and no normality assumptions are needed.
The basic idea behind the MCMC algorithm is as follows: for a given gene j, simulate the posterior distribution of the LFC statistic by performing a “random walk” on , where , . The random walk generates a sequence of states , in which denotes the pre-specified length of the simulated Markov chain. After choosing an initial state , each new state is generated from the current state by the following procedure: first, a topic is chosen uniformly at random; next, a proposed state is generated as , , with for all . Assuming an (improper) uniform prior for the unknowns, , the proposed state is accepted into the Markov chain with probability
14 |
in which is the jth column of the counts matrix , , and is the likelihood at , (note that may include pseudocounts). The standard deviation of the Gaussian proposal distribution, , is a tuning parameter (unless otherwise stated, we used ) The additional term in the acceptance probability is needed to account for the fact that we are simulating the log-transformed parameters , not ([143], p. 11). When the proposal is not accepted, the new state is simply copied from the previous state, .
Most of the effort in running the MCMC goes into computing the acceptance probabilities (14), so we have carefully optimized these computations. For example, we have taken advantage of the fact that the count vectors are typically very sparse. Additionally, these computations can be performed in parallel since the Markov chains are simulated independently for each gene j.
Once Monte Carlo samples , for , have been simulated by this random-walk MCMC, we compute posterior mean LFC estimates and quantify uncertainty in the LFC estimates. For example, expressing the l.e. LFC for gene j and topic k as a function of the unknowns, , the posterior mean l.e. LFCs are calculated as .
The final step in the GoM DE analysis is to perform adaptive shrinkage [82] to stabilize the posterior mean estimates. To implement this step, we used the ash function from the ashr R package [144]. We used the same settings as DESeq2 to replicate as closely as possible the performance of DESeq2 with adaptive shrinkage. DESeq2 calls ash with method = “shrink”, which sets the prior to be a mixture of uniforms without a point-mass at zero.
The adaptive shrinkage method takes as input a collection of effect estimates and associated standard errors . In this setting, it is not immediately obvious what are the standard errors, in part because the posterior distribution of the unknowns is not always symmetric about the mean or median. To provide a reasonable substitute summarizing uncertainty in the estimates, we computed Monte Carlo estimates of highest posterior density (HPD) intervals. A HPD interval is the smallest interval that contains of the probability mass [145, 146]. Specifically, let denote the HPD interval for the LFC estimate of gene j in topic k, and let denote the posterior mean. We defined the standard error as when ; otherwise, . Defining the standard errors in this way prevented overshrinking of estimates that were uncertain but had little overlap with zero. We set the size of the HPD intervals to so that the would recover conventional standard error calculations when the posterior distribubtion is well approximated by the normal distribution. The revised posterior means and standard errors returned by the adaptive shrinkage method were then used by ashr to calculate test statistics including posterior z-scores (defined as the posterior mean divided by the posterior standard error [147]), local false sign rates (lfsr), and s-values.
An important question is the choice of . One heuristic way to assess whether is large enough is to perform two independent MCMC runs initialized with different pseudorandom number generator states (“seeds”) and check consistency of the posterior estimates from the two runs (we checked consistency of the posterior estimates after stabilizing the estimates using adaptive shrinkage, as described above). In simulated data sets (below), comparison of two independent MCMC runs suggested that was sufficient to obtain reasonably accurate estimates of posterior means and posterior z-scores for all genes (Additional file 1: Fig. S2). Therefore, we performed initial MCMC simulations for all single-cell data sets using . The runtimes for performing these MCMC simulations on the single-cell data sets (described below), with , are given in Table 1.
Table 1.
Although this consistency check suggested that running a simulation with would be “good enough,” to provide additional assurance we performed another consistency assessment in the PBMC data set. We found that even better consistency was achieved with (Additional file 1: Fig. S12). Therefore, to provide more reliable results, the final GoM DE results were generated with .
The GoM DE analysis methods are implemented in the de_analysis function in the fastTopics package [148].
Single-cell data sets
All data sets analyzed were stored as sparse matrices , where n was the number of cells and m was the number of genes or regions. The data sets are summarized in Table 1.
Preparation of scRNA-seq data
Since the topic model is a multinomial model of count data, no log-normalization or other transformation of the scRNA-seq molecule counts was needed. Furthermore, we kept all genes other than those with no variation in the data set (this is done in part to demonstrate that our methods are robust to including genes with little variation). Also note that due to the use of sparse matrix techniques in our software implementations, including genes with low variation did not greatly increase computational effort.
Preparation of single-cell ATAC-seq data
As previously suggested [19, 133–135]), we “binarized” the single-cell ATAC-seq data, that is, we assigned (“accessible”) when least one fragment in cell i mapped to peak or region j; otherwise, (“inaccessible”). There are at least a couple reasons for doing this. For small peaks (say, kb), read counts do not provide a reliable quantitative measure of accessibility in single cells. This is because the (random) first insertion restricts the space for subsequent insertions. Additionally, insertions could occur within the same site on the same allele or on each of the two alleles, complicating interpretation of the read counts.
Like the RNA molecule count data (see above), we kept all regions except those that showed no variation.
PBMC data from Zheng et al. (2017)
We combined reference transcriptome profiles generated from 10 bead-enriched subpopulations of PBMCs (donor A) processed using Cell Ranger 1.1.0 [29, 149]. We downloaded the “Gene/cell matrix (filtered)” tar.gz file from the 10x Genomics website for each of the following 10 FACS-purified data sets: CD14+ monocytes, CD19+ B cells, CD34+ cells, CD4+ helper T Cells, CD4+/CD25+ regulatory T Cells, CD4+/CD45RA+/CD25- naive T cells, CD4+/CD45RO+ memory T Cells, CD56+ natural killer cells, CD8+ cytotoxic T cells, and CD8+/CD45RA+ naive cytotoxic T cells. After combining these 10 data sets, then filtering out unexpressed genes, the combined data set contained molecule counts for 94,655 cells and 21,952 genes; 97.1% of the molecule counts were zero.
In Fig. 1, the 54,132 cells from these data sets were labeled as “T cells”: CD4+ helper T Cells, CD4+/CD25+ regulatory T Cells, CD4+/CD45RA+/CD25- naive T cells, CD4+/CD45RO+ memory T Cells, and CD8+/CD45RA+ naive cytotoxic T cells.
Epithelial airway data from Montoro et al. (2018)
We analyzed a mouse epithelial airway data set from [86, 150]. These were gene expression profiles of trachea epithelial cells in C57BL/6 mice obtained using droplet-based 3′ scRNA-seq, processed using the GemCode Single Cell Platform. We downloaded file GSE103354_Trachea_droplet_UMIcounts.txt.gz. This file also contained the cluster assignments that we compared with (in [86], the samples were subdivided into 7 clusters using a community detection algorithm). After removing genes that were not expressed in any of the cells, the data set contained molecule counts for 7193 cells and 18,388 genes (90.7% of counts were zero).
Mouse Atlas data from Cusanovich et al. (2018)
Cusanovich et al. [97] profiled chromatin accessibility by single-cell combinatorial indexing ATAC-seq (sci-ATAC-seq) [151, 152] in nuclei from 13 distinct tissues of a 8-week-old male C57BL/6J mouse. Replicates for 4 of the 13 tissues were obtained by profiling chromatin accessibility in a second mouse. We downloaded the (sparse) binarized peak cell matrix in RDS format, atac_matrix.binary.qc_filtered.rds, from the Mouse sci-ATAC-seq Atlas website [153]. We also downloaded cell_metadata.txt which included cell types estimated by a clustering of the cells (see Table S1 in [97]). The full data set used in our analysis (13 tissues, including 4 replicated tissues) consisted of the binary accessibility values for 81,173 cells and 436,206 peaks (1.2% overall rate of accessibility). Note that all peaks had fragments mapping to at least 40 cells, so no extra step was taken to filter out peaks.
Separately, we analyzed the sci-ATAC-seq data from kidney only, in which peaks with fragments mapping to fewer than 20 kidney cells were removed, resulting in data set containing binary accessibility values for 6431 cells and 270,864 peaks. Base-pair positions of the peaks were based on Mouse Genome Assembly mm9 (NCBI and Mouse Genome Sequencing Consortium, Build 37, July 2007).
From the Mouse sci-ATAC-seq website, we also downloaded the file master_cicero_conns.rds containing the Cicero co-accessibility predictions [95, 153], which we used to link chromatin accessibility peaks to genes. For the kidney data, we connected a peak given in the “Peak2” column of the Cicero co-accessibility data table to a gene given in the “peak1.tss.gene_id” column if the “cluster” column was 11, 18, 22, or 25 (these four clusters were the main kidney-related clusters identified in [97]). This extracted, for each gene, the distal and proximal sites connected to the gene associated with Peak1 (specifically, a gene in which the transcription start site overlaps with Peak1). Among the 22,194 genes associated with at least one peak, the median number of peaks connected to a gene was 19, and the largest number of peaks was 179 (for Bahcc1 on chromosome 11). Among the 270,864 peaks included in the topic modeling analysis, 113,489 (42%) were connected to at least one gene, 95% of peaks were connected to 10 genes or fewer, and the largest number of connected genes was 60.
Human hematopoietic system data from Buenrostro et al. (2018)
Buenrostro et al. [104] used FACS to isolate 10 hematopoietic cell populations from human bone marrow and blood; then, the cells were assayed using single-cell ATAC-seq. The processed single-cell ATAC-seq data were downloaded from [154], specifically file GSE96769_scATACseq_counts.txt.gz containing the fragment counts and file GSE96769_PeakFile_20160207.bed.gz containing peaks obtained from bulk ATAC-seq data [104]. Although there may be benefits to calling peaks using aggregated single-cell data instead [155], we used the original accessibility data based on the bulk ATAC-seq peaks so that our analysis was more directly comparable to the analysis of [104].
Following [104, 155], we extracted the 2034 samples passing quality control filters; then, we “binarized” the counts. The list of 2034 cells considered “high quality” was obtained from file metadata.tsv included in the online benchmarking repository [155]. After removing peaks with fragments mapping to fewer than 20 cells, the final data set used in our analysis consisted of binary accessibility values for 2034 cells and 126,719 peaks (4.6% overall rate of accessibility). Base-pair positions of the peaks were based on human genome assembly 19 (Genome Reference Consortium Human Build 37, February 2009).
In [104], a large, patient-specific batch effect was identified in the accessibility profiles for the HSC cells, and therefore, steps were taken in [104] to normalize the accessibility data before performing PCA. We instead fit the topic model to the unnormalized binary accessibility values, in part to find out how well the topic model can cope with the complication of a batch effect. In agreement with [104], this batch effect is at least partly captured by the topics, although in our analysis, the batch effect also appeared in MPP cells and, to a lesser extent, in CMP cells (Additional file 1: Fig. S11).
Fitting the topic models
In brief, we took the following steps to fit a topic model. All these steps are implemented in the R package fastTopics.
First, we fit a Poisson NMF model [37, 156],
15 |
where is a matrix of the same dimension as with entries giving the Poisson rates for the counts . The parameters of the Poisson NMF model are stored as two matrices, , , with non-negative entries . fastTopics has efficient implementations of algorithms for computing maximum-likelihood estimates (MLEs) of [48].
Second, we recovered MLEs of from MLEs of by a simple reparameterization [48].
In an empirical comparison of Poisson NMF algorithms with count data sets, including scRNA-seq data sets [48], we found that a simple co-ordinate descent (CD) algorithm [157, 158], when accelerated with the extrapolation method of Ang and Gillis [159], almost always produced the best Poisson NMF (and topic model) fits, and in the least amount of time. To confirm this, we compared topic model fits obtained by running the same four algorithms that were compared in [48]—EM and CD, with and without extrapolation—on the PBMC data set and assessed the quality of the fits. We evaluated the model fits in two ways: using the likelihood and using the residuals of the Karush-Kuhn-Tucker (KKT) first-order conditions (the residuals of the KKT system should vanish as the algorithm approaches maximum-likelihood estimates of ). Following [48], to reduce the possibility that multiple optimizations converge to different local maxima of the likelihood, which could complicate these comparisons, we first ran 1000 EM updates; then, we examined the performance of the algorithms after this initialization phase (Additional file 1: Figs. S13, S14). Consistent with [48], the extrapolated CD updates always produced the best fit, or at the very least a fit that was no worse than the other algorithms, and almost always converged on a solution more quickly than the other algorithms. Therefore, subsequently we used the extrapolated CD updates to fit the topic models.
In more detail, the pipeline for fitting topic models consisted of the following steps: (1) initialize using Topic-SCORE [160], (2) perform 10 CD updates of , with fixed, (3) perform 1000 EM updates (without extrapolation) to get close to a solution (“prefitting phase”), (4) run an additional 1000 extrapolated CD updates to improve the fit (“refinement phase”), and (5) recover from by a simple transformation. The prefitting phase was implemented by calling fit_poisson_nmf from fastTopics with these settings: numiter = 1000, method = “em”, control = list(numiter = 4). The refinement phase was implemented with a second call to fit_poisson_nmf, with numiter = 1000, method = “scd”, control = list(numiter = 4,extrapolate = TRUE), in which the model fit was initialized using the fit from the prefitting phase. The topic model fit was recovered by calling poisson2multinom in fastTopics. Note that only the estimates of were used in the GoM DE analysis.
For each data set, we fit topic models with different choices of K and compared the fits for each K by comparing their likelihoods (Additional file 1: Fig. S5).
Visualizing the membership proportions
The membership proportions matrix can be viewed as an embedding of the cells in a continuous space with dimensions [50] (it is dimensions because of the constraint that the membership proportions for each cell must add up to 1). A simple way to visualize this embedding in 2-d is to apply a nonlinear dimensionality reduction technique such as t-SNE [11, 161] or UMAP [12] to ([49] used t-SNE). We have also found that plotting principal components (PCs) of the membership proportions can be an effective way to explore the structure inferred by the topic model (Additional file 1: Figs. S1, S4). However, we view these visualization techniques as primarily for exploration, and a more powerful approach is to visualize all dimensions simultaneously using a Structure plot [70, 71]. Here, we describe some improvements to the Structure plot for better visualization. These improvements are implemented in the structure_plot function in fastTopics.
When cells were labeled, we compared topics against labels by grouping the cells by these labels in the Structure plot. We then applied t-SNE to the matrix, separately for each group, to arrange the cells on a line within each group. For this, we used the R package Rtsne [162] (in fastTopics, we also implemented options to arrange the cells in each group using UMAP or PCA, but in our experience we found that UMAP and PCA produced “noisier” visualizations).
Arranging the cells by 1-d t-SNE worked best for smaller groups of cells with less complex structure. For large groups of cells, or for unlabeled single-cell data sets, we randomly subsampled the cells to reduce t-SNE runtime (when cells number in the thousands, it is nearly impossible to distinguish individual cells in the Structure plot anyhow). Even with this subsampling, the Structure plot sometimes did not show fine-scale substructures or rare cell types. Therefore, in more complex cases, we first subdivided the cells into smaller groups based on the membership proportions, then ran t-SNE on these smaller groups. These groups were either identified visually from PCs of or in a more automated way by running k-means on PCs of (see [163]).
Gene enrichment analysis based on differential accessibility of peaks connected to genes
Here, we describe a simple approach to obtain gene-level statistics from the results of a differential accessibility analysis. This approach was applied in the topic modeling analysis of the Mouse Atlas kidney cells.
Cusanovich et al. [97] used the Cicero co-accessibility predictions and the binarized single-cell ATAC-seq data to compute a “gene activity score” for each gene k and cell i. Here, we have a related but different goal: we would like to use the results of the differential accessibility analysis, which generates differential accessibility estimates and related statistics for each peak and each topic, to rank genes according to their importance to a given topic. A difficulty, however, with ranking the genes is that the Cicero co-accessibility predictions are uncertain, and they are only partially informative about which peaks are relevant to a gene. In aggregate, however, the expectation is that the “most interesting” genes will be genes that are (a) predicted by Cicero to be connected many peaks that are differentially accessible and (b) the differences in accessibility are mainly in the same direction. This suggests an enrichment analysis in which, for each gene, we test for enrichment of differential acccessibility among the peaks connected to that gene. Here, we describe a simple enrichment analysis for (a) and (b).
For (a), we computed a Bayes factor [164] measuring the support for the hypothesis that at least one of the peaks is differentially accessible (the LFC is not zero) against the null hypothesis that none of the peaks are differentially accessible. For (b), we computed the average LFC among all differentially accessible peaks (that is, peaks with nonzero LFC according to some significance criterion).
We implemented this gene enrichment analysis by running adaptive shrinkage [82] separately for each gene and topic. This had the benefit of adapting the shrinkage separately to each gene in each topic. In particular, in comparison to the usual adaptive shrinkage step for a GoM DE analysis (see above), it avoided overshrinking differences for genes exhibiting strong patterns of differential accessibility. We took the following steps to implement this adaptive shrinkage analysis. First, we ran function ash from the ashr package [144] once on the posterior mean l.e. LFC estimates and their standard errors for all topics k and all peaks j, with settings mixcompdist = “normal”, method = “shrink”. This was done only to determine the variances in the mixture prior and to get a “default” model fit to be used in the subsequent adaptive shrinkage analyses.
Next, we ran ash separately for gene and each topic k using the l.e. LFC estimates and standard errors from the peaks j connected to that gene. We set the variances in the mixture prior to the variances determined from all the l.e. LFC estimates, and used ash settings mixcompdist = “normal” and pointmass = FALSE. One issue with running adaptive shrinkage using only the l.e. LFC estimates for the peaks connected to a gene is that some genes have few Cicero connections, leading to potentially unstable fits and unreliable posterior estimates. We addressed this issue by encouraging the fits toward the “default model” that was fitted to all genes and all topics; specifically, we set the Dirichlet prior on the mixture proportions to be with prior sample sizes , where here K denotes the number of components of the prior mixture (not the number of topics), denotes the kth mixture proportion in the adaptive shrinkage prior for the fitted “default” model, and . This stabilized the fits for genes with few Cicero connections while still allowing some ability to adapt to genes with many connections.
Finally, we used the logLR output from ash as a measure of support for enrichment (this is the Bayes factor on the log-scale), and we computed the mean l.e. LFC as the average of the posterior mean estimates of the l.e. LFCs taken over all peaks j connected to the gene and with posterior .
Motif enrichment analysis for differentially accessible regions
We used HOMER [105] to identify transcription factor (TF) motifs enriched for differentially accessible regions, separately for each topic estimated from the single-cell ATAC-seq data. For each topic , we applied the HOMER Motif Analysis tool findMotifsGenome.pl to estimate motif enrichment in differentially accessible regions; specifically, we took “differentially accessible regions” to be those with p-value less than 0.05 in the GoM DE analysis (Additional file 1: Fig. S16). These differentially accessible regions were stored in a BED file positions.bed. The exact call from the command-line shell was findMotifsGenome.pl positions.bed hg19 homer -len 8,10,12 -size 200 -mis 2 -S 25 -p 4 -h.
Note that the adaptive shrinkage step was skipped in the GoM DE analysis, so these are the p-values for the unmoderated l.e. LFC estimates. The reason for skipping the adaptive shrinkage step is that the shrinkage is performed uniformly for the LFC estimates for all regions, and since the vast majority of regions have l.e. LFC estimates are that are indistinguishable from zero, the result is that very few differentially accessible regions remain shrinkage.
Gene sets
Human and mouse gene sets for the gene set enrichment analyses (GSEA) were compiled from the following gene set databases: NCBI BioSystems [165], Pathway Commons [166, 167], and MSigDB [168–170], which includes Gene Ontology (GO) gene sets [94, 171]. Specifically, we downloaded bsid2info.gz and biosystems_gene.gz from the NCBI FTP site (https://ftp.ncbi.nih.gov/gene) on March 22, 2020; PathwayCommons12.All.hgnc.gmt.gz from the Pathway Commons website (https://www.pathwaycommons.org) on March 20, 2020; and msigdb_v7.2.xml.gz from the MSigDB website (https://www.gsea-msigdb.org) on October 15, 2020. For the gene set enrichment analyses, we also downloaded human and mouse gene information (“gene info”) files Homo_sapiens.gene_info.gz and Mus_musculus.gene_info.gz from the NCBI FTP site on October 15, 2020. Put together, we obtained 37,856 human gene sets and 33,380 mouse gene sets. In practice, we filtered gene sets based on certain criteria before running the GSEA. To facilitate integration of these gene sets into our analyses, we have compiled these gene sets into an R package [172].
Gene set enrichment analysis
We took a simple multiple linear regression approach to the gene set enrichment analysis (GSEA), in which we modeled the l.e. LFC estimate for gene i in a given topic, here denoted by , as , , in which indicates gene set membership; if gene i belongs to gene set j, otherwise (we represented the gene-set membership as a sparse matrix since most s are zero). Here, n denotes the number of candidate gene sets, and is the residual variance to be estimated. The idea behind this simple approach was that the most relevant gene sets are those that best explain the log-fold changes , and therefore, in the multiple regression, we sought to identify these gene sets by finding coefficients that were nonzero with high probability. See [173, 174] for similar ideas using logistic regression. Additionally, since many genes were typically differentially expressed in a given topic, modeling LFCs helped distinguish among DE genes that showed only a slight increase in expression versus those that were highly overexpressed [175, 176]. Of course, this simple multiple linear approach ignores uncertainty in the LFC estimates , which is accounted for in most gene set enrichment analyses. We addressed this issue by shrinking the l.e. LFC estimates prior to running the GSEA, that is, we took to be the the posterior mean LFC estimate after applying adaptive shrinkage, as described above (see the “Quantifying uncertainty and stabilizing LFC estimates” section). The result was that genes that we were more uncertain about had have an l.e. LFC estimate that was zero or near zero.
We implemented this multiple linear regression approach using SuSiE (susieR version 0.12.10) [177]. A benefit to using SuSiE is that it automatically organized similar or redundant gene sets into “credible sets” (CSs), making it easier to quickly recognize complementary gene sets; see [178–183] for related ideas.
In detail, the GSEA was performed as follows. We performed a separate GSEA for each topic, . Specifically, for each topic k, we ran the susieR function susie with the following options: L = 10, intercept = TRUE, standardize = FALSE, estimate_residual_variance = TRUE, refine = FALSE, compute_univariate_zscore = FALSE and min_abs_corr = 0. We set L = 10 so that SuSiE returned at most 10 credible sets. For a given topic k, we reported a gene set as being enriched if it was included in at least one CS. We organized the enriched gene sets by (95%) credible sets. We also recorded the Bayes factor for each CS, which gives a measure of the level of support for that CS. For each gene set included in a CS, we reported the posterior inclusion probability (PIP) and the posterior mean estimate of the regression coefficient . In the results, we refer to as the “enrichment coefficient” for gene set j since it is an estimate of the expected increase in the l.e. LFC for genes that belong to gene set j relative to genes that do not belong to the gene set.
Often, a CS contained only one gene set, in which case the PIP for that gene set was close to 1. In several other cases, the CS contained multiple similar gene sets; in these cases, the smaller PIPs indicated that it was difficult to choose among the gene sets because they are similar to each other (note that the sum of the PIPs in a 95% CS should always be above 0.95 and less than 1). Occasionally, SuSiE returned a CS with a small Bayes factor containing a very large number of gene sets. We excluded such CSs from the results.
We repeated these gene set enrichment analyses with two collections of gene sets: (1) all gene sets other than the MSigDB collections C1, C3, C4, and C6 and “archived” MSigDB gene sets and (2) only gene sets from curated pathway databases, specifically Pathway Commons, NCBI BioSystems and “canonical pathways” (CP) in the MSigDB C2 collection, and Gene Ontology (GO) gene sets in the MSigDB C5 collection. In all cases, we removed gene sets with fewer than 10 genes and with more than 400 genes. Table 2 gives the exact number of gene sets included in each GSEA.
Table 2.
Data set | All gene sets | Curated only |
---|---|---|
PBMC | 23,193 | 12,225 |
Mouse epithelial airway | 20,917 | 9946 |
—rare epithelial cell types only | 20,288 | 9450 |
Simulations
For evaluating the DE analysis methods, we generated matrices of UMI counts for genes and or cells. We simulated the UMI counts from a Poisson NMF model (15) in which and were chosen to emulate UMI counts from scRNA-seq experiments.
The matrices and were generated as follows. First, for each cell i, we generated membership proportions then set , for , where is the total UMI count. To simulate the wide range of total UMI counts often seen in scRNA-seq data sets, total UMI counts were normally distributed on the log-scale, , where denotes the univariate normal distribution with mean and standard deviation .
Membership proportions for each cell i were generated so as to obtain a wide range of mixed memberships, according to the following procedure: the number of nonzero proportions was set to with probability ; the selected topics were drawn uniformly at random (without replacement) from ; then, the membership proportions for the selected topics were set to 1 when , or, when , they were drawn from the Dirichlet distribution with shape parameters .
Expression rates were generated so as to emulate the wide distribution of gene expression levels observed in single-cell data sets and to allow for differences in expression rates among topics. The procedure for generating the expression rates for each gene j was as follows: with probability 0.5, the expression rates were the same across all topics and were generated as , . Otherwise, with probability 0.5, the expression rates were the same in all topics except for one topic. The differing topic was chosen uniformly at random from ; then, the expression rate for topic was set to , . As a result, the expression rates were roughly normally distributed on the log-scale, and the expression differences were also normally distributed on the log-scale. About half of genes had an expression difference among the topics.
Using this simulation procedure, we generated three collections of data sets. The simulation settings were altered slightly for each collection. In the first, data sets were simulated with , , so that most membership proportions were equal or very close to 0 or 1. In the second, we used , , to allow for a range of mixed memberships. In the third, we generated data sets with , , .
For the data sets simulated with , , the cells could essentially be subdivided into two groups. Therefore, we ran MAST [83, 184] and DESeq2 [78, 84] to test for genes that were differentially expressed between the two groups. MAST (R package version 1.20.0) was called via the FindMarkers interface in Seurat [25] (Seurat 4.0.3, SeuratObject 4.0.2) with the following settings: ident.1 = “2” ,ident.2 = NULL, test.use = “MAST”, logfc.threshold = 0, min.pct = 0. DESeq was called from the DESeq2 R package (version 1.34.0) using settings recommended in the package vignette: test = “LRT”, reduced = 1, useT = TRUE, minmu = 1e-6, minReplicatesForReplace = Inf. Size factors were calculated using the calculateSumFactors method from scran version 1.22.1 [23]. The LFC estimates returned by DESeq were subsequently revised using adaptive shrinkage [82] by calling lfcShrink in DESeq2 with type = “ashr”, svalue = TRUE (as in the GoM DE analysis, the DESeq2 posterior z-scores were defined as the posterior means divided by the posterior standard errors returned by the adaptive shrinkage).
To perform the GoM DE analysis in each of the simulations, we first fit a Poisson NMF model to the simulated counts using fit_poisson_nmf from the fastTopics R package [48, 148] (version 0.6-97). The loadings matrix was fixed to the matrix used to simulate the data, and W was estimated by running 40 co-ordinate ascent updates on alone (update.loadings = NULL, method = “scd”, numiter = 40). The equivalent topic model fit was then recovered. Three GoM DE analyses were performed using the de_analysis function from the fastTopics R package, with the topic model fit provided as input: one analysis without adaptive shrinkage (shrink.method = “none”), and two analyses with adaptive shrinkage (shrink.method = “ash”, ashr version 2.2-51 [144]) in which the MCMC was initialized with different pseudorandom number generator states. In all three runs, posterior calculations were performed with . Comparison of the two MCMC runs (with adaptive shrinkage) suggested that was sufficient to obtain reasonably accurate posterior estimates in these simulations (Additional file 1: Fig. S2).
Computing environment
Most computations on real data sets were run in R 3.5.1 [185], linked to the OpenBLAS 0.2.19 optimized numerical libraries, on Linux machines (Scientific Linux 7.4) with Intel Xeon E5-2680v4 (“Broadwell”) processors. For performing the Poisson NMF optimization, which included some multithreaded computations, as many as 8 CPUs and 16 GB of memory were used. The DESeq2 analysis of the PBMC data was performed in R 4.1.0, using 4 CPUs and 264 GB of memory. The evaluation of the DE analysis methods in simulated data sets was performed in R 4.1.0, using as many and 8 CPUs as 24 GB of memory. More details about the computing environment, including the R packages used, are recorded in the workflowr pages in the companion code repositories [186, 187].
Supplementary information
Acknowledgements
We thank Mihai Anitescu, Nicolas Chevrier, Michihiro Takahama, Kushal Dey, Adam Gruenbaum, Youngseok Kim, John Novembre, Alan Selewa, Katie Rhodes, Yusha Liu, Dongyue Xie, Zepeng Mu, and Jason Willwerscheid for their valuable input. We also thank the staff at the Research Computing Center for providing the high-performance computing resources used to implement the numerical experiments.
Peer review information
Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Review history
The review history is available as Additional file 8.
Authors’ contributions
PC implemented the methods based on mathematical derivations by PC and MS, with contributions from KL, AS, and SP. PC performed empirical evaluations of the methods using simulated data sets. PC and KL analyzed the single-cell data sets. PC wrote the draft manuscript, with contributions, revisions, and discussion by all authors. All authors approved the final manuscript.
Funding
This work was supported by the NHGRI at the National Institutes of Health under award number 5R01HG002585.
Availability of data and materials
The fastTopics R package is available on GitHub (https://github.com/stephenslab/fastTopics) and CRAN (https://cran.r-project.org/package=fastTopics). A Seurat wrapper for fastTopics is available from https://github.com/stephenslab/seurat-wrappers. The data sets supporting the conclusions of this article are available in Zenodo repositories [186, 187]. These Zenodo repositories also include the source code implementing the analyses and workflowr websites [188] for browsing the code and results. Permission to use the source code in these repositories is granted under the MIT license. Numerical implementations of the contributed statistical methods, including tools for visualizing the results generated by these methods, are available from the fastTopics R package [48, 148] under the MIT license. The gene sets used in the GSEA were compiled into an R package [172], also distributed under the MIT license. All data sets used in the study were obtained from public sources [149, 150, 153, 154]. A description of how these data sets were used is provided in the “Methods” section.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Kharchenko PV. The triumphs and limitations of computational methods for scRNA-seq. Nat Methods. 2021;18(7):723–32. doi: 10.1038/s41592-021-01171-x. [DOI] [PubMed] [Google Scholar]
- 2.Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16(3):133–45. doi: 10.1038/nrg3833. [DOI] [PubMed] [Google Scholar]
- 3.Wagner A, Regev A, Yosef N. Revealing the vectors of cellular identity with single-cell genomics. Nat Biotechnol. 2016;34(11):1145–60. doi: 10.1038/nbt.3711. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Duò A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research. 2018;7:1141. [DOI] [PMC free article] [PubMed]
- 5.Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Research. 2018;7:1297. [DOI] [PMC free article] [PubMed]
- 6.Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 2019;20(5):273–82. doi: 10.1038/s41576-018-0088-9. [DOI] [PubMed] [Google Scholar]
- 7.Sun S, Zhu J, Ma Y, Zhou X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 2019;20:269. doi: 10.1186/s13059-019-1898-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol. 2019;20:295. doi: 10.1186/s13059-019-1861-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933;24(6):417–41. [Google Scholar]
- 10.Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert JP. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun. 2018;9:284. doi: 10.1038/s41467-017-02554-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(86):2579–605. [Google Scholar]
- 12.McInnes L, Healy J, Saul N, Großberger L. UMAP: uniform manifold approximation and projection. J Open Source Softw. 2018;3(29):861. [Google Scholar]
- 13.Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2019;37:38–44. doi: 10.1038/nbt.4314. [DOI] [PubMed] [Google Scholar]
- 14.Cooley SM, Hamilton T, Aragones SD, Ray JCJ, Deeds EJ. A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-seq data. bioRxiv. 2022. 10.1101/689851.
- 15.Chari T, Pachter L. The specious art of single-cell genomics. PLoS Comput Biol. 2023;19(8):1011288. doi: 10.1371/journal.pcbi.1011288. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nat Commun. 2019;10(1):5416. doi: 10.1038/s41467-019-13056-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Kobak D, Linderman GC. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nat Biotechnol. 2021;39(2):156–7. doi: 10.1038/s41587-020-00809-z. [DOI] [PubMed] [Google Scholar]
- 18.Wattenberg M, Viégas F, Johnson I. How to use t-SNE effectively. Distill. 2016. 10.23915/distill.00002.
- 19.Granja JM, Corces MR, Pierce SE, Bagdatli ST, Choudhry H, Chang HY, et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet. 2021;53(3):403–11. doi: 10.1038/s41588-021-00790-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Heiser CN, Lau KS. A Quantitative Framework for Evaluating Single-Cell Data Structure Preservation by Dimensionality Reduction Techniques. Cell Rep. 2020;31(5):107576. doi: 10.1016/j.celrep.2020.107576. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat Methods. 2019;16(3):243–5. doi: 10.1038/s41592-018-0308-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Linderman GC, Steinerberger S. Clustering with t-SNE. Provably SIAM J Math Data Sci. 2019;1(2):313–32. doi: 10.1137/18m1216134. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Lun ATL, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research. 2016;5:2122. [DOI] [PMC free article] [PubMed]
- 24.McCarthy DJ, Campbell KR, Lun ATL, Wills QF. Scater: Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics. 2017;33(8):1179–86. doi: 10.1093/bioinformatics/btw777. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, et al. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888–902. doi: 10.1016/j.cell.2019.05.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36(5):411–20. doi: 10.1038/nbt.4096. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Stuart T, Srivastava A, Madad S, Lareau CA, Satija R. Single-cell chromatin state analysis with Signac. Nat Methods. 2021;18(11):1333–41. doi: 10.1038/s41592-021-01282-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Fang R, Preissl S, Li Y, Hou X, Lucero J, Wang X, et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun. 2021;12:1337. doi: 10.1038/s41467-021-21583-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049. doi: 10.1038/ncomms14049. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Brunet JP, Tamayo P, Golub TR, Mesirov JP. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci. 2004;101(12):4164–9. doi: 10.1073/pnas.0308531101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Donoho D, Stodden V. When does non-negative matrix factorization give a correct decomposition into parts? In: Proceedings of the 16th International Conference on Neural Information Processing Systems. Cambridge: MIT Press; 2003. p. 1141–1148.
- 32.Durif G, Modolo L, Mold JE, Lambert-Lacroix S, Picard F. Probabilistic count matrix factorization for single cell expression data analysis. Bioinformatics. 2019;35(20):4011–9. doi: 10.1093/bioinformatics/btz177. [DOI] [PubMed] [Google Scholar]
- 33.Gong W, Rasmussen TL, Singh BN, Koyano-Nakagawa N, Pan W, Garry DJ. Dpath software reveals hierarchical haemato-endothelial lineages of Etv2 progenitors based on single-cell transcriptome analysis. Nat Commun. 2017;8:14362. doi: 10.1038/ncomms14362. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Elyanow R, Dumitrascu B, Engelhardt BE, Raphael BJ. netNMF-sc: leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome Res. 2020;30(2):195–204. doi: 10.1101/gr.251603.119. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Ho YJ, Anaparthy N, Molik D, Mathew G, Aicher T, Patel A, et al. Single-cell RNA-seq analysis identifies markers of resistance to targeted BRAF inhibitors in melanoma cell populations. Genome Res. 2018;28(9):1353–63. doi: 10.1101/gr.234062.117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Kotliar D, Veres A, Nagy MA, Tabrizi S, Hodis E, Melton DA, et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-seq. eLife. 2019;8:43803. [DOI] [PMC free article] [PubMed]
- 37.Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401(6755):788–91. doi: 10.1038/44565. [DOI] [PubMed] [Google Scholar]
- 38.Levitin HM, Yuan J, Cheng YL, Ruiz FJ, Bush EC, Bruce JN, et al. De novo gene signature identification from single-cell RNA-seq with hierarchical Poisson factorization. Mol Syst Biol. 2019;15:8557. doi: 10.15252/msb.20188557. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Welch JD, Kozareva V, Ferreira A, Vanderburg C, Martin C, Macosko EZ. Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell. 2019;177(7):1873–1887.e17. doi: 10.1016/j.cell.2019.05.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Shao C, Höfer T. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics. 2016;33(2):235–42. doi: 10.1093/bioinformatics/btw607. [DOI] [PubMed] [Google Scholar]
- 41.Sun S, Chen Y, Liu Y, Shang X. A fast and efficient count-based matrix factorization method for detecting cell types from single-cell RNAseq data. BMC Syst Biol. 2019;13:28. doi: 10.1186/s12918-019-0699-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Venkatasubramanian M, Chetal K, Schnell DJ, Atluri G, Salomonis N. Resolving single-cell heterogeneity from hundreds of thousands of cells through sequential hybrid clustering and NMF. Bioinformatics. 2020;36(12):3773–80. doi: 10.1093/bioinformatics/btaa201. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Zhang S, Yang L, Yang J, Lin Z, Ng MK. Dimensionality reduction for single cell RNA sequencing data using constrained robust non-negative matrix factorization. NAR Genomics Bioinforma. 2020;2(3):lqaa064. 10.1093/nargab/lqaa064. [DOI] [PMC free article] [PubMed]
- 44.Pierson E, Yau C. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 2015;16:241. doi: 10.1186/s13059-015-0805-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3:993–1022. [Google Scholar]
- 46.DuVerle DA, Yotsukura S, Nomura S, Aburatani H, Tsuda K. Cell Tree: an R/bioconductor package to infer the hierarchical structure of cell populations from single-cell RNA-seq data. BMC Bioinformatics. 2016;17:363. doi: 10.1186/s12859-016-1175-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Dey KK, Hsiao CJ, Stephens M. Visualizing the structure of RNA-seq expression data using grade of membership models. PLoS Genet. 2017;13(3):1006599. [DOI] [PMC free article] [PubMed]
- 48.Carbonetto P, Sarkar A, Wang Z, Stephens M. Non-negative matrix factorization algorithms greatly improve topic model fits. 2021. arXiv preprint arXiv:2105.13440.
- 49.González-Blas C, Minnoye L, Papasokrati D, Aibar S, Hulselmans G, Christiaens V, et al. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat Methods. 2019;16(5):397–400. doi: 10.1038/s41592-019-0367-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Hofmann T. Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference. New York: Association for Computing Machinery; 1999. p. 50–57.
- 51.Bielecki P, Riesenfeld SJ, Hütter JC, Torlai Triglia E, Kowalczyk MS, Ricardo-Gonzalez RR, et al. Skin-resident innate lymphoid cells converge on a pathogenic effector state. Nature. 2021;592:128–32. doi: 10.1038/s41586-021-03188-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Housman G, Briscoe E, Gilad Y. Evolutionary insights into primate skeletal gene regulation using a comparative cell culture model. PLoS Genet. 2022;18(3):1010073. doi: 10.1371/journal.pgen.1010073. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Hung A, Housman G, Briscoe EA, Cuevas C, Gilad Y. Characterizing gene expression in an in vitro biomechanical strain model of joint health. F1000Research. 2022;11:296.
- 54.Rhodes K, Barr KA, Popp JM, Strober BJ, Battle A, Gilad Y. Human embryoid bodies as a novel system for genomic studies of functionally diverse cell types. eLife. 2022;11:71361. [DOI] [PMC free article] [PubMed]
- 55.Schenkel JM, Herbst RH, Canner D, Li A, Hillman M, Shanahan SL, et al. Conventional type I dendritic cells maintain a reservoir of proliferative tumor-antigen specific TCF-1+ CD8+ T cells in tumor-draining lymph nodes. Immunity. 2021;54(10):2338–2353.e6. doi: 10.1016/j.immuni.2021.08.026. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Xu H, Ding J, Porter CBM, Wallrapp A, Tabaka M, Ma S, et al. Transcriptional atlas of intestinal immune cells reveals that neuropeptide -CGRP modulates group 2 innate lymphoid cell responses. Immunity. 2019;51(4):696–708. [DOI] [PMC free article] [PubMed]
- 57.Ding C, Li T, Peng W. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Comput Stat Data Anal. 2008;52(8):3913–27. [Google Scholar]
- 58.Gaussier E, Goutte C. Relation between PLSA and NMF and implications. In: Proceedings of the 28th Annual International ACM SIGIR Conference. New York: Association for Computing Machinery; 2005. p. 601–602.
- 59.Gillis N. Nonnegative matrix factorization. Philadelphia: Society for Industrial and Applied Mathematics; 2021. [Google Scholar]
- 60.Kim J, Park H. Sparse nonnegative matrix factorization for clustering. Georgia Institute of Technology; 2008.
- 61.Soneson C, Robinson MD. Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods. 2018;15(4):255–61. doi: 10.1038/nmeth.4612. [DOI] [PubMed] [Google Scholar]
- 62.Wang T, Li B, Nelson CE, Nabavi S. Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC Bioinformatics. 2019;20:40. doi: 10.1186/s12859-019-2599-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Erosheva E, Fienberg S, Lafferty J. Mixed-membership models of scientific publications. Proc Natl Acad Sci. 2004;101(Supplement 1):5220–7. doi: 10.1073/pnas.0307760101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders MJT, et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 2019;20:194. doi: 10.1186/s13059-019-1795-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Diaz-Mejia JJ, Meng EC, Pico AR, MacParland SA, Ketela T, Pugh TJ, et al. Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data. F1000Research. 2019;8:296. [DOI] [PMC free article] [PubMed]
- 66.Blei DM, Lafferty JD. Topic models. In: Srivastava AN, Sahami M, editors. Text mining: classification, clustering, and applications. Boca Raton: Chapman and Hall/CRC; 2009. pp. 71–94. [Google Scholar]
- 67.Griffiths TL, Steyvers M. Finding scientific topics. Proc Natl Acad Sci. 2004;101(Supplement 1):5228–35. doi: 10.1073/pnas.0307752101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Hofmann T. Unsupervised learning by probabilistic latent semantic analysis. Mach Learn. 2001;42(1):177–96. [Google Scholar]
- 69.Sarkar A, Stephens M. Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet. 2021;53(6):770–7. doi: 10.1038/s41588-021-00873-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Rosenberg NA. Genetic Structure of Human Populations. Science. 2002;298(5602):2381–5. doi: 10.1126/science.1078311. [DOI] [PubMed] [Google Scholar]
- 71.Rosenberg NA. distruct: a program for the graphical display of population structure. Mol Ecol Notes. 2004;4(1):137–8. [Google Scholar]
- 72.Haak W, Lazaridis I, Patterson N, Rohland N, Mallick S, Llamas B, et al. Massive migration from the steppe was a source for Indo-European languages in Europe. Nature. 2015;522(7555):207–11. doi: 10.1038/nature14317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Pereira BI, De Maeyer RPH, Covre LP, Nehar-Belaid D, Lanna A, Ward S, et al. Sestrins induce natural killer function in senescent-like CD8+ T cells. Nat Immunol. 2020;21(6):684–94. doi: 10.1038/s41590-020-0643-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Ashuach T, Reidenbach DA, Gayoso A, Yosef N. PeakVI: a deep generative model for single-cell chromatin accessibility analysis. Cell Rep Methods. 2022;2(3):100182. doi: 10.1016/j.crmeth.2022.100182. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17. doi: 10.1101/gr.079558.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–7. doi: 10.1093/bioinformatics/btm453. [DOI] [PubMed] [Google Scholar]
- 77.Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009;26(1):139–40. doi: 10.1093/bioinformatics/btp616. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:106. doi: 10.1186/gb-2010-11-10-r106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Cable DM, Murray E, Shanmugam V, Zhang S, Zou LS, Diao M, et al. Cell type-specific inference of differential expression in spatial transcriptomics. Nat Methods. 2022;19(9):1076–87. doi: 10.1038/s41592-022-01575-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Becker-Herman S, Lantner F, Shachar I. Id2 Negatively Regulates B cell Differentiation in the Spleen. J Immunol. 2002;168(11):5507–13. doi: 10.4049/jimmunol.168.11.5507. [DOI] [PubMed] [Google Scholar]
- 82.Stephens M. False discovery rates: a new deal. Biostatistics. 2016;18(2):275–94. doi: 10.1093/biostatistics/kxw041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 2015;16:278. doi: 10.1186/s13059-015-0844-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Zhu A, Ibrahim JG, Love MI. Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics. 2019;35(12):2084–92. doi: 10.1093/bioinformatics/bty895. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Squair JW, Gautier M, Kathe C, Anderson MA, James ND, Hutson TH, et al. Confronting false discoveries in single-cell differential expression. Nat Commun. 2021;12:5692. doi: 10.1038/s41467-021-25960-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Montoro DT, Haber AL, Biton M, Vinarsky V, Lin B, Birket SE, et al. A revised airway epithelial hierarchy includes CFTR-expressing ionocytes. Nature. 2018;560(7718):319–24. doi: 10.1038/s41586-018-0393-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.Rosvall M, Bergstrom CT. Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci. 2008;105(4):1118–23. doi: 10.1073/pnas.0706851105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Coifman RR, Lafon S, Lee AB, Maggioni M, Nadler B, Warner F, et al. Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc Natl Acad Sci. 2005;102(21):7426–31. doi: 10.1073/pnas.0500334102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.Ruiz García S, Deprez M, Lebrigand K, Cavard A, Paquet A, Arguel MJ, et al. Novel dynamics of human mucociliary differentiation revealed by single-cell RNA sequencing of nasal epithelial cultures. Development. 2019;146(20):dev177428. 10.1242/dev.177428. [DOI] [PMC free article] [PubMed]
- 90.Barkauskas CE, Chung MI, Fioret B, Gao X, Katsura H, Hogan BLM. Lung organoids: current uses and future promise. Development. 2017;144(6):986–97. doi: 10.1242/dev.140103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91.Rawlins EL, Okubo T, Xue Y, Brass DM, Auten RL, Hasegawa H, et al. The role of Scgb1a1+ clara cells in the long-term maintenance and repair of lung airway, but not elveolar. Epithelium Cell Stem Cell. 2009;4(6):525–34. doi: 10.1016/j.stem.2009.04.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Spassky N, Meunier A. The development and functions of multiciliated epithelia. Nat Rev Mol Cell Biol. 2017;18(7):423–36. doi: 10.1038/nrm.2017.21. [DOI] [PubMed] [Google Scholar]
- 93.Zhao H, Zhu L, Zhu Y, Cao J, Li S, Huang Q, et al. The cep63 paralogue deup1 enables massive de novo centriole biogenesis for vertebrate multiciliogenesis. Nat Cell Biol. 2013;15(12):1434–44. doi: 10.1038/ncb2880. [DOI] [PubMed] [Google Scholar]
- 94.The Gene Ontology Consortium. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 2020 12;49(D1):325–34. [DOI] [PMC free article] [PubMed]
- 95.Pliner HA, Packer JS, McFaline-Figueroa JL, Cusanovich DA, Daza RM, Aghamirzaie D, et al. Cicero predicts cis-regulatory DNA interactions from single-cell chromatin accessibility data. Mol Cell. 2018;71(5):858–8718. doi: 10.1016/j.molcel.2018.06.044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Wu C, Tao Y, Li N, Fei J, Wang Y, Wu J, et al. Prediction of cellular targets in diabetic kidney diseases with single-cell transcriptomic analysis of db/db mouse kidneys. J Cell Commun Signal. 2023;17:169–88. [DOI] [PMC free article] [PubMed]
- 97.Cusanovich DA, Hill AJ, Aghamirzaie D, Daza RM, Pliner HA, Berletch JB, et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell. 2018;174(5):1309–24. doi: 10.1016/j.cell.2018.06.052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98.Der E, Ranabothu S, Suryawanshi H, Akat KM, Clancy R, Morozov P, et al. Single cell RNA sequencing to dissect the molecular heterogeneity in lupus nephritis. JCI Insight. 2017;2(9):93009. doi: 10.1172/jci.insight.93009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99.Grassmeyer J, Mukherjee M, DeRiso J, Hettinger C, Bailey M, Sinha S, et al. Elf5 is a principal cell lineage specific transcription factor in the kidney that contributes to Aqp 2 and Avpr 2 gene expression. Dev Biol. 2017;424(1):77–89. doi: 10.1016/j.ydbio.2017.02.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 100.Park J, Shrestha R, Qiu C, Kondo A, Huang S, Werth M, et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science. 2018;360(6390):758–63. doi: 10.1126/science.aar2131. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 101.Ghezzi C, Loo DDF, Wright EM. Physiology of renal glucose handling via SGLT1, SGLT2 and GLUT2. Diabetologia. 2018;61(10):2087–97. doi: 10.1007/s00125-018-4656-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.Thiagarajan RD, Georgas KM, Rumballe BA, Lesieur E, Chiu HS, Taylor D, et al. Identification of Anchor Genes during Kidney Development Defines Ontological Relationships, Molecular Subcompartments and Regulatory Pathways. PLoS ONE. 2011;6(2):17286. doi: 10.1371/journal.pone.0017286. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.Gopal E, Umapathy NS, Martin PM, Ananth S, Gnana-Prakasam JP, Becker H, et al. Cloning and functional characterization of human SMCT2 (SLC5A12) and expression pattern of the transporter in kidney. Biochim Biophys Acta Biomembr. 2007;1768(11):2690–7. doi: 10.1016/j.bbamem.2007.06.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104.Buenrostro JD, Corces MR, Lareau CA, Wu B, Schep AN, Aryee MJ, et al. Integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell. 2018;173(6):1535–48. doi: 10.1016/j.cell.2018.03.074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 105.Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell. 2010;38(4):576–89. doi: 10.1016/j.molcel.2010.05.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106.Booeshaghi AS, Pachter L. Normalization of single-cell RNA-seq counts by log(x + 1) or log(1 + x) Bioinformatics. 2021;37(15):2223–4. doi: 10.1093/bioinformatics/btab085. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 107.Lun A. Overcoming systematic errors caused by log-transformation of normalized single-cell RNA sequencing data. bioRxiv. 2018. 10.1101/404962.
- 108.Warton DI. Why you cannot transform your way out of trouble for small counts. Biometrics. 2018;74:362–8. doi: 10.1111/biom.12728. [DOI] [PubMed] [Google Scholar]
- 109.Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN, Geistlinger L, et al. Orchestrating single-cell analysis with Bioconductor. Nat Methods. 2020;17(2):137–45. doi: 10.1038/s41592-019-0654-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 110.Barkas N, Petukhov V, Nikolaeva D, Lozinsky Y, Demharter S, Khodosevich K, et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat Methods. 2019;16(8):695–8. doi: 10.1038/s41592-019-0466-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 111.Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019;16(12):1289–96. doi: 10.1038/s41592-019-0619-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 112.Haghverdi L, Lun AT, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol. 2018;36(5):421–7. doi: 10.1038/nbt.4091. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113.Parker HS, Leek JT, Favorov AV, Considine M, Xia X, Chavan S, et al. Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction. Bioinformatics. 2014;30(19):2757–63. doi: 10.1093/bioinformatics/btu375. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114.Tran HTN, Ang KS, Chevrier M, Zhang X, Lee NYS, Goh M, et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 2020;21:12. doi: 10.1186/s13059-019-1850-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 115.Richards LM, Riverin M, Mohanraj S, Ayyadhury S, Croucher DC, Díaz-Mejía JJ, et al. A comparison of data integration methods for single-cell RNA sequencing of cancer samples. bioRxiv. 2021. 10.1101/2021.08.04.453579.
- 116.Fan J, Slowikowski K, Zhang F. Single-cell transcriptomics in cancer: computational challenges and opportunities. Exp Mol Med. 2020;52(9):1452–65. doi: 10.1038/s12276-020-0422-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 117.Gouvert O, Oberlin T, Févotte C. Negative binomial matrix factorization for recommender systems. 2018. arXiv preprint arXiv:1801.01708.
- 118.Gu J, Wang X, Halakivi-Clarke L, Clarke R, Xuan J. BADGE: A novel Bayesian model for accurate abundance quantification and differential analysis of RNA-Seq data. BMC Bioinformatics. 2014;15(S9):S6. doi: 10.1186/1471-2105-15-S9-S6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 119.Wang C, Sun D, Huang X, Wan C, Li Z, Han Y, et al. Integrative analyses of single-cell transcriptome and regulome using MAESTRO. Genome Biol. 2020;21:198. doi: 10.1186/s13059-020-02116-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 120.Lareau CA, Duarte FM, Chew JG, Kartha VK, Burkett ZD, Kohlway AS, et al. Droplet-based combinatorial indexing for massive-scale single-cell chromatin accessibility. Nat Biotechnol. 2019;37(8):916–24. doi: 10.1038/s41587-019-0147-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 121.Kartha VK, Duarte FM, Hu Y, Ma S, Chew JG, Lareau CA, et al. Functional inference of gene regulation using single-cell multi-omics. Cell Genomics. 2022;2(9):100166. doi: 10.1016/j.xgen.2022.100166. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 122.Bravo González-Blas C, De Winter S, Hulselmans G, Hecker N, Matetovici I, Christiaens V, et al. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nat Methods. 2023;20(9):1355–67. doi: 10.1038/s41592-023-01938-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 123.Ma S, Zhang B, LaFave LM, Earl AS, Chiang Z, Hu Y, et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell. 2020;183(4):1103–1116.e20. doi: 10.1016/j.cell.2020.09.056. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 124.Cao J, Cusanovich DA, Ramani V, Aghamirzaie D, Pliner HA, Hill AJ, et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science. 2018;361(6409):1380–5. doi: 10.1126/science.aau0730. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 125.Stuart T, Satija R. Integrative single-cell analysis. Nat Rev Genet. 2019;20(5):257–72. doi: 10.1038/s41576-019-0093-7. [DOI] [PubMed] [Google Scholar]
- 126.Shiga M, Seno S, Onizuka M, Matsuda H. SC-JNMF: single-cell clustering integrating multiple quantification methods based on joint non-negative matrix factorization. PeerJ. 2021;9:12087. doi: 10.7717/peerj.12087. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 127.Zhang S, Liu CC, Li W, Shen H, Laird PW, Zhou XJ. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res. 2012;40(19):9379–91. doi: 10.1093/nar/gks725. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 128.Yang Z, Michailidis G. A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data. Bioinformatics. 2015;32(1):1–8. doi: 10.1093/bioinformatics/btv544. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 129.Jin S, Zhang L, Nie Q. scAI: an unsupervised approach for the integrative analysis of parallel single-cell transcriptomic and epigenomic profiles. Genome Biol. 2020;21:25. doi: 10.1186/s13059-020-1932-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 130.Hao Y, Stuart T, Kowalski MH, Choudhary S, Hoffman P, Hartman A, et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol. 2023 doi: 10.1038/s41587-023-01767-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 131.Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, et al. Multi-omics factor analysis-a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 2018;14:8124. doi: 10.15252/msb.20178124. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 132.Argelaguet R, Arnol D, Bredikhin D, Deloro Y, Velten B, Marioni JC, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020;21:111. doi: 10.1186/s13059-020-02015-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 133.Baker SM, Rogerson C, Hayes A, Sharrocks AD, Rattray M. Classifying cells with Scasat, a single-cell ATAC-seq analysis tool. Nucleic Acids Res. 2019;47(2):10. doi: 10.1093/nar/gky950. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 134.Nair S, Kim DS, Perricone J, Kundaje A. Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics. 2019;35:108–16. doi: 10.1093/bioinformatics/btz352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 135.Pott S, Lieb JD. Single-cell ATAC-seq: strength in numbers. Genome Biol. 2015;16:172. doi: 10.1186/s13059-015-0737-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 136.Yan F, Powell DR, Curtis DJ, Wong NC. From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis. Genome Biol. 2020;21:22. doi: 10.1186/s13059-020-1929-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 137.Taddy M. Distributed multinomial regression. Ann Appl Stat. 2015;9(3):1394–414. [Google Scholar]
- 138.Fisher RA. On the interpretation of from contingency tables, and the calculation of P. J R Stat Soc. 1922;85(1):87–94.
- 139.Good IJ. Some Statistical Applications of Poisson’s Work. Stat Sci. 1986;1(2):157–70. [Google Scholar]
- 140.Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian data analysis. 3. Boca Raton: CRC Press; 2013. [Google Scholar]
- 141.Andrieu C, de Freitas N, Doucet A, Jordan MI. An introduction to MCMC for machine learning. Mach Learn. 2003;50:5–43. [Google Scholar]
- 142.Robert CP. Monte Carlo statistical methods. 2. New York: Springer; 2004. [Google Scholar]
- 143.Devroye L. Non-uniform random variate generation. New York: Springer-Verlag; 1986. [Google Scholar]
- 144.Stephens M, Carbonetto P, Gerard D, Lu M, Sun L, Willwerscheid J, et al. ashr: methods for adaptive shrinkage, using empirical Bayes. 2020. R package version 2.2-51. https://github.com/stephens999/ashr. Accessed 5 Mar 2023.
- 145.Chen MH, Shao QM. Monte Carlo estimation of Bayesian credible and HPD intervals. J Comput Graph Stat. 1999;8(1):69–92. [Google Scholar]
- 146.Box GEP, Tiao GC. Bayesian inference in statistical analysis. Reading: Addison-Wesley; 1992.
- 147.Gelman A, Hill J, Yajima M. Why we (usually) don’t have to worry about multiple comparisons. J Res Educ Eff. 2012;5(2):189–211. [Google Scholar]
- 148.Carbonetto P, Luo K, Dey K, Hsiao J, Sarkar A, Hung A, et al. fastTopics: fast algorithms for fitting topic models and non-negative matrix factorizations to count data. 2022. R package version 0.6-142. https://github.com/stephenslab/fastTopics.
- 149.Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Data sets. 10x Genomics. 2017. https://www.10xgenomics.com/support/single-cell-gene-expression. Accessed 5 Mar 2023. [DOI] [PMC free article] [PubMed]
- 150.Montoro DT, Haber AL, Biton M, Vinarsky V, Chen S, Villoria J, et al. A revised airway epithelial hierarchy includes CFTR-expressing ionocytes. Data sets. Gene Expression Omnibus; 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103354. Accessed 5 Mar 2023. [DOI] [PMC free article] [PubMed]
- 151.Cusanovich DA, Daza R, Adey A, Pliner HA, Christiansen L, Gunderson KL, et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science. 2015;348(6237):910–4. doi: 10.1126/science.aab1601. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 152.Cusanovich DA, Reddington JP, Garfield DA, Daza RM, Aghamirzaie D, Marco-Ferreres R, et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature. 2018;555(7697):538–42. doi: 10.1038/nature25981. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 153.Cusanovich DA, Hill AJ, Aghamirzaie D, Daza RM, Pliner HA, Berletch JB, et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Data sets. Mouse sci-ATAC-seq Atlas; 2018. https://shendurelab.github.io/mouse-atac/. Accessed 5 Mar 2023. [DOI] [PMC free article] [PubMed]
- 154.Buenrostro JD, Corces MR, Lareau CA, Wu B, Schep AN, Aryee MJ, et al. Single-cell epigenomics maps the continuous regulatory landscape of human hematopoietic differentiation. Data sets. Gene Expression Omnibus; 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96772. Accessed 5 Mar 2023.
- 155.Chen H, Lareau C, Andreani T, Vinyard ME, Garcia SP, Clement K, et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 2019;20:241. doi: 10.1186/s13059-019-1854-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 156.Hien LTK, Gillis N. Algorithms for nonnegative matrix factorization with the Kullback-Leibler divergence. J Sci Comput. 2021;87(3):93. [Google Scholar]
- 157.Hsieh CJ, Dhillon IS. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In: Proceedings of the 17th ACM SIGKDD International Conference. New York: Association for Computing Machinery; 2011. p. 1064–1072.
- 158.Lin X, Boutros PC. Optimization and expansion of non-negative matrix factorization. BMC Bioinformatics. 2020;21:7. doi: 10.1186/s12859-019-3312-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 159.Ang AMS, Gillis N. Accelerating nonnegative matrix factorization algorithms using extrapolation. Neural Comput. 2019;31(2):417–39. doi: 10.1162/neco_a_01157. [DOI] [PubMed] [Google Scholar]
- 160.Ke ZT, Wang M. A new SVD approach to optimal topic estimation. 2019. arXiv preprint arXiv:1704.07016.
- 161.van der Maaten L. Accelerating t-SNE using tree-based algorithms. J Mach Learn Res. 2014;15(93):3221–45. [Google Scholar]
- 162.Krijthe JH. Rtsne: t-distributed stochastic neighbor embedding using Barnes-Hut implementation. 2015. R package version 0.15. https://github.com/jkrijthe/Rtsne. Accessed 5 Mar 2023.
- 163.Ding C, He X. K-means clustering via principal component analysis. In: 21st International Conference on Machine Learning. New York: Association for Computing Machinery; 2004.
- 164.Kass RE, Raftery AE. Bayes Factors. J Am Stat Assoc. 1995;90(430):773–95. [Google Scholar]
- 165.Geer LY, Marchler-Bauer A, Geer RC, Han L, He J, He S, et al. The NCBI BioSystems database. Nucleic Acids Res. 2009;38(supplement-1):492–6. [DOI] [PMC free article] [PubMed]
- 166.Cerami EG, Gross BE, Demir E, Rodchenkov I, Babur Ö, Anwar N, et al. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 2010;39(supplement 1):685–90. doi: 10.1093/nar/gkq1039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 167.Rodchenkov I, Babur Ö, Luna A, Aksoy BA, Wong JV, Fong D, et al. Pathway Commons 2019 Update: integration, analysis and exploration of pathway data. Nucleic Acids Res. 2019;48(D1):489–97. doi: 10.1093/nar/gkz946. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 168.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102(43):15545–50. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 169.Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst. 2015;1(6):417–25. doi: 10.1016/j.cels.2015.12.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 170.Liberzon A, Subramanian A, Pinchback R, Thorvaldsdottir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27(12):1739–40. [DOI] [PMC free article] [PubMed]
- 171.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–9. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 172.Carbonetto P, Stephens M. pathways: gene set enrichment analysis using human and mouse gene sets. 2021. R package version 0.1-20. https://github.com/stephenslab/pathways. Accessed 5 Mar 2023.
- 173.Fang T, Davydov I, Marbach D, Zhang JD. Gene-set enrichment with regularized regression bioRxiv. 2019 doi: 10.1101/659920. [DOI] [Google Scholar]
- 174.Sartor MA, Leikauf GD, Medvedovic M. LRpath: A Logistic Regression Approach for Identifying Enriched Biological Groups in Gene Expression Data. Bioinformatics. 2009;25(2):211–7. doi: 10.1093/bioinformatics/btn592. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 175.Harrison PF, Pattison AD, Powell DR, Beilharz TH. Topconfects: a package for confident effect sizes in differential expression analysis provides a more biologically useful ranked gene list. Genome Biol. 2019;20:67. doi: 10.1186/s13059-019-1674-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 176.McCarthy DJ, Smyth GK. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics. 2009;25(6):765–71. doi: 10.1093/bioinformatics/btp053. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 177.Wang G, Sarkar A, Carbonetto P, Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J R Stat Soc Ser B. 2020;82(5):1273–300. doi: 10.1111/rssb.12388. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 178.Bauer S, Gagneur J, Robinson PN. GOing Bayesian: model-based gene set Analysis of Genome-Scale Data. Nucleic Acids Res. 2020;38(11):3523–32. doi: 10.1093/nar/gkq045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 179.Ebrahimpoor M, Spitali P, Hettne K, Tsonaka R, Goeman J. Simultaneous Enrichment Analysis of All Possible Gene-sets: unifying Self-Contained and Competitive Methods. Brief Bioinform. 2019;21(4):1302–12. doi: 10.1093/bib/bbz074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 180.Fontanillo C, Nogales-Cadenas R, Pascual-Montano A, Rivas JDL. Functional analysis beyond enrichment: non-redundant reciprocal linkage of genes and biological terms. PLoS ONE. 2011;6(9):24289. doi: 10.1371/journal.pone.0024289. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 181.Lu Y, Rosenfeld R, Simon I, Nau GJ, Bar-Joseph Z. A Probabilistic Generative Model for GO Enrichment Analysis. Nucleic Acids Res. 2008;36(17):109. doi: 10.1093/nar/gkn434. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 182.Simillion C, Liechti R, Lischer HEL, Ioannidis V, Bruggmann R. Avoiding the Pitfalls of Gene Set Enrichment Analysis with SetRank. BMC Bioinformatics. 2017;18:151. doi: 10.1186/s12859-017-1571-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 183.Vivar JC, Pemu P, McPherson R, Ghosh S. Redundancy control in pathway databases (ReCiPa): an application for improving gene-set enrichment analysis in omics studies and “big data” biology. Omics. 2013;17(8):414–22. [DOI] [PMC free article] [PubMed]
- 184.McDavid A, Finak G, Yajima M. MAST: model-based analysis of single cell transcriptomics. 2021. R package version 1.20.0. https://github.com/RGLab/MAST. Accessed 5 Mar 2023.
- 185.R Core Team. R: a language and environment for statistical computing. Vienna; 2018. R Found Stat Comput. https://www.R-project.org. Accessed 5 Mar 2023.
- 186.Carbonetto P, Lao K, Sarkar A, Hung A, Tayeb K, Pott S, et al. Analysis of single-cell RNA-seq data sets for this manuscript. Zenodo. 2023 doi: 10.5281/zenodo.7962782. [DOI] [Google Scholar]
- 187.Carbonetto P, Lao K, Sarkar A, Hung A, Tayeb K, Pott S, et al. Analysis of single-cell ATAC-seq data sets for this manuscript. Zenodo. 2023 doi: 10.5281/zenodo.7962831. [DOI] [Google Scholar]
- 188.Blischak JD, Carbonetto P, Stephens M. Creating and sharing reproducible research code the workflowr way. F1000Research. 2019;8:1749. [DOI] [PMC free article] [PubMed]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The fastTopics R package is available on GitHub (https://github.com/stephenslab/fastTopics) and CRAN (https://cran.r-project.org/package=fastTopics). A Seurat wrapper for fastTopics is available from https://github.com/stephenslab/seurat-wrappers. The data sets supporting the conclusions of this article are available in Zenodo repositories [186, 187]. These Zenodo repositories also include the source code implementing the analyses and workflowr websites [188] for browsing the code and results. Permission to use the source code in these repositories is granted under the MIT license. Numerical implementations of the contributed statistical methods, including tools for visualizing the results generated by these methods, are available from the fastTopics R package [48, 148] under the MIT license. The gene sets used in the GSEA were compiled into an R package [172], also distributed under the MIT license. All data sets used in the study were obtained from public sources [149, 150, 153, 154]. A description of how these data sets were used is provided in the “Methods” section.