Summary
Recent advances in CRISPR-Cas9 engineering and single-cell assays have enabled the simultaneous measurement of single-cell transcriptomic and phylogenetic profiles. However, there are few computational tools enabling users to integrate and derive insight from a joint analysis of these two modalities. Here, we describe “PhyloVision”: an open-source software for interactively exploring data from both modalities and for identifying and interpreting heritable gene modules whose concerted expression are associated with phylogenetic relationships. PhyloVision provides a feature-rich, interactive, and shareable web-based report for investigating these modules while also supporting several other data and meta-data exploration capabilities. We demonstrate the utility of PhyloVision using a published dataset of metastatic lung adenocarcinoma cells, whose phylogeny was resolved using a CRISPR-Cas9-based lineage-tracing system. Together, we anticipate that PhyloVision and the methods it implements will be a useful resource for scalable and intuitive data exploration for any assay that simultaneously measures cell state and lineage.
Keywords: single-cell RNA-seq, lineage tracing, phylogenetics, web-based tools, software, cancer, metastasis
Graphical abstract
Highlights
-
•
PhyloVision quantifies phylogenetic autocorrelation of gene expression signatures
-
•
PhyloVision identifies de novo heritable gene expression modules
-
•
PhyloVision creates a shareable, web-based report for exploring datasets
-
•
Identification of heritable gene modules associated with metastatic behavior
Motivation
Technologies enabling investigators to simultaneously profile the lineage and transcriptional states of single cells hold great promise for profiling dynamic processes such as tumor growth or organ development. Yet the analysis of these data requires the deployment of sophisticated computational tools that might preclude users without computational training from deriving insights. Here, we have developed an integrated analysis pipeline and user interface, PhyloVision, that performs fundamental analyses for these multimodal lineage-tracing assays and creates a web-based report to explore the cells’ lineage relationships, transcriptomes, and association. The report can be shared online to better enable collaborative projects.
Jones et al. present “PhyloVision,” a computational resource for analyzing single-cell lineage-tracing data. PhyloVision implements an integrated framework for identifying gene expression modules associated with lineage relationships, as well as enabling collaborative projects by generating an interactive, web-based report for data exploration that can be shared among colleagues.
Introduction
Cellular lineages underlie several important biological phenomena—from embryogenesis to differentiation to cancer progression—and understanding the nature and dynamics of these lineages remains a central focus of research. Indeed, the piecing together of the developmental lineage of Caenorhabditis elegans by Sulston and colleagues via visual observations (Sulston et al., 1983) has facilitated decades of critical work using the deterministic development of C. elegans as a model system to study development (Packer, 2019), aging (Kenyon, 2010), and even human diseases like neurodegeneration (Lu et al., 2014). Yet, many higher-order organisms cannot be studied by visual observation alone, and thus, a robust understanding of cell lineages underlying these organisms remains elusive. To this end, several technologies have emerged to track cellular lineages over varying timescales, as reviewed in previous work (Kester and van Oudenaarden, 2018; McKenna and Gagnon, 2019).
Recently, the integration of CRISPR-Cas9-based engineering and single-cell sequencing has enabled the synthetic tracing of cellular lineages at unprecedented resolution (Frieda et al., 2017; McKenna et al., 2016; Kalhor et al., 2017; Raj et al., 2018; Chan et al., 2019; Bowling et al., 2020). Several of these technologies enable the simultaneous measurement of cell lineage and transcriptomic state via single-cell RNA sequencing (scRNA-seq), thus creating exciting opportunities to study the transcriptional evolution of dynamic processes and motivating innovative approaches for integrating these two critical modalities (Wagner and Klein, 2020). As with high-dimensional measurements like those from scRNA-seq, it is clear that specialized, interactive tools for data exploration, visualization, and analysis are necessary for realizing the full potential of these lineage-tracing assays.
There exist several useful software tools for visualization of phylogenetic or lineage-tracing data. For example, the interactive Tree of Life (iTOL; Letunic and Bork, 2021) is a scalable web-server-based tool that allows users to upload tree structures and various annotation files for interactive viewing. However, to indefinitely host and share these reports requires a paid subscription. More recently, CeLaVi was introduced as a publicly available software tool for generating interactive, web-based reports expressly for cell-lineage viewing (Salvador-Martínez et al., 2021). Although both tools are scalable to up to thousands of cells and are versatile for integrating various data modalities (e.g., gene expression measurements and spatial location) with phylogenies for visualization, they do not offer capabilities for joint analysis and automated interpretation of information on lineages and gene expression.
Here, we introduce PhyloVision: an open-source, interactive analysis and visualization tool that is expressly built for integrating single-cell gene expression and lineage data. PhyloVision builds on useful existing work, like iTOL and CeLaVi, for interactively visualizing phylogenies while possibly overlaying the expression of individual genes. In addition to this, however, PhyloVision also employs other analysis frameworks developed by our group for automated interpretation of the variation in gene expression across the lineage structure. Specifically, PhyloVision supports features that identify heritable gene expression programs and interprets these programs using gene signature enrichment analysis.
To demonstrate the utility of PhyloVision, we apply it to a clone of 1,127 cells from a recent CRISPR-Cas9 lineage-tracing dataset investigating metastatic spread in a mouse model of lung adenocarcinoma. In doing so, we show that the derived statistics and web-based user interface can be used to effectively characterize subpopulations within this aggressive tumor population. These molecular characterizations, not discussed in the original study, can be used to generate hypotheses about how metastatic ability evolves within a tumor subpopulation.
PhyloVision is distributed publicly on Github at https://github.com/YosefLab/VISION. Along with the software, we include several tutorials and example reports of published datasets allowing users to explore the user-interface. Additionally, we include a detailed manual and description of the user-interface in the supplemental information.
Results
PhyloVision is an integrated pipeline for interactive analysis of single-cell expression and lineage profiles
PhyloVision is simultaneously a tool for interactive exploration of multimodal single-cell lineage-tracing data using our web-based front end and for analysis of the evolutionary dynamics of expression data. Our interactive web-based report is built on our VISION front-end (DeTomaso et al., 2019). Here, we have developed an interactive phylogeny viewer and have integrated it into the default interface, enabling the user to select cells (for visualization on a low-dimensional embedding of the respective scRNA-seq data or for differential expression analysis), perform various manipulations on the observed tree (e.g., node collapsing), and overlay gene expression, signature scores, or other data onto the leaves of the tree (Figures 1 and S1–S4; Video S1). The dynamic phylogeny viewer is scalable, allowing low-latency selections and subtree collapsing for large trees (we tested trees of up to ∼4,000 leaves). Importantly, a web report generated by a user can be viewed locally, shared privately among colleagues, or staged publicly on a web server; moreover, users can download the state of any report for reproducing visualizations separately.
While the interactive web report is a useful tool for data exploration, the PhyloVision pipeline additionally supports statistical analysis for deriving joint insight from the expression and lineage data (Figure 2). In this, PhyloVision takes as input (at a minimum) an expression data matrix, a phylogeny, and a collection of gene signatures, each representing a certain pathway or a transcriptional response to a certain change in conditions (as publicly available from resources like MSigDB; Subramanian et al., 2005). In one mode of analysis, PhyloVision will conduct a phylogenetic autocorrelation analysis with the user-defined gene signatures. Here, for a given signature, a score will be computed for each cell with VISION (DeTomaso et al., 2019), indicating the cumulative activity of the respective genes (note that the score also accounts for “signed” transcriptional response signatures, in which one subset of genes is marked as up-regulated and another as down-regulated). An autocorrelation statistic will be then computed by evaluating the consistency of the signature scores between nearby cells on the phylogeny, using Geary’s C statistic (Geary, 1954) as in DeTomaso et al., 2019 (STAR Methods). With this analysis, a user can identify gene signatures that are significantly associated with the tree structure, suggesting evolutionary patterns of interest. Our PhyloVision pipeline also includes modules to analyze user-provided meta-data, (e.g., the tissue of origin, extent of somatic mutations, or cell-level fitnesses inferred with external models; Neher et al., 2014), quantify cell-level plasticities with respect to categorical data (Yang et al., 2021), and interactively visualize these cell-level data (STAR Methods). Together, these “signature-level” analyses enable users to identify cell-level properties whose variation is consistent with the tree structure and to highlight phenotypes that represent subclonal, heritable phenotypes.
In another mode of analysis, users can identify de novo gene modules (i.e., not guided by pre-defined signatures) that are learned from the intersection of phylogeny and expression data using our Hotspot algorithm (DeTomaso and Yosef, 2021). Briefly, this analysis uses autocorrelation to identify individual genes whose expression is consistent with the phylogeny—namely, genes that are expressed at a more similar level in phylogenetically adjacent cells than in cells that are distant. It then uses a pairwise extension of the autocorrelation statistic to arrange these genes into modules whose expression patterns on the phylogeny are similar, thus representing conserved transcriptomic modules that each operate in a concerted fashion. The PhyloVision pipeline adds interpretability to the Hotspot modules by assessing the overlap between their respective gene sets and the user-provided gene signatures. PhyloVision provides a quantification of this overlap with an enrichment statistic and an assessment of statistical significance. Together, these two analyses enable a user to identify important sources of transcriptomic variation on the phylogeny, as well as discover new and interpretable gene sets.
Case study: Analysis of a metastatic lung adenocarcinoma tumor with PhyloVision
To demonstrate PhyloVision’s usefulness in interrogating data from multimodal single-cell lineage-tracing technologies, we applied the pipeline to a clone from our recently published dataset in which we studied the metastatic behavior of an aggressive human lung adenocarcinoma cell line in a xenograft mouse model (Quinn et al., 2021).
As previously described, we used our CRISPR-Cas9 lineage-tracing technology (Chan et al., 2019) to trace approximately 100 clones over the course of 2.5 months as each clone metastasized between tissues in a mouse model of lung adenocarcinoma. In this analysis, we used the reconstructed single-cell phylogenies from the original study, which were inferred with the Cassiopeia package (Jones et al., 2020). In these phylogenies, each leaf corresponds to a cell, with data corresponding to the single-cell expression profile and the tissue from which it was sampled. In the original study, we described how metastatic rates of single cells could be inferred directly from these phylogenies and combined with expression profiles to identify transcriptional regulators of this process.
In the present analysis, we evaluated a clone of 1,127 cells using the PhyloVision pipeline (see Data and code availability). In this case study, we used signatures downloaded from MSigDB (Subramanian et al., 2005) and focused on the results from the Hotspot analysis. Hotspot identifies three non-overlapping modules of genes (Figure 3A). To evaluate the cumulative activity of each module at each cell in the dataset, we computed module scores for every cell using the signature-scoring procedure in VISION (DeTomaso et al., 2019; STAR Methods). Interestingly, when compared with the metastatic rate inferred from the phylogeny (which is provided as meta-data to this session), we observed that one module is negatively correlated (module 1; Pearson’s ρ = −0.30; Figure 3B [left]), whereas one is positively correlated with this metastatic rate (module 2; Pearson’s ρ = 0.26; Figure 3B [middle]); module 3 does not correlate with the metastatic rate in either direction (Pearson’s ρ = 0.07; Figure 3B [right]). These results, therefore, point to candidate transcriptional programs (each represented by a module) that are heritable and are associated with different metastatic abilities.
To interpret the biological signal intrinsic to these Hotspot modules, we first projected the single-cell transcriptomic profiles onto two dimensions using Uni-form Mani-fold Approximation and Projection (UMAP; McInnes et al., 2018) and overlaid the Hotspot modules scores (Figures 3C–3E). Upon inspection, we observed that the Hotspot modules localized to distinct regions of the two-dimensional projection and therefore marked different cellular states. We observed a similar pattern, on the phylogenies themselves, with each Hotspot module marking a specific set of subclades.
Next, we examined which gene signatures had significant overlap with each module. In the module negatively correlated with metastatic rate (module 1), we found significant enrichments corresponding to the innate immune response, maintenance of the gastrointestinal epithelium, and the endosomal vacuolar pathway, among others (all FDR < 1e−3, hypergeometric test; Figure 3C). Together, these gene sets indicate that fewer metastatic cells in this clone are characterized by antigen presentation and maintenance of an epithelial-like state, supporting the hypothesis that tumor progression is required for metastatic competency in lung adenocarcinoma (Caswell et al., 2014).
On the other hand, the module positively correlated with metastatic rate (module 2) had significant enrichments in gene signatures associated with fibrinolysis, UV response, smooth-muscle adhesion, inflammatory response, and other metastatic signatures (e.g., liver cancer metastasis; all FDR <0.05, hypergeometric test; Figure 3D). These gene signatures therefore point to several mechanisms of enhancing metastatic rates in this clone, such as fibrinolysis and up-regulation of adhesion molecules. The diversity of these signatures underscores the importance of inflammatory signaling, ECM remodeling, and cell adhesion discussed in our previous work (Quinn et al., 2021).
Module 3 was characterized by a set of gene signatures associated with tumor progression but distinct from the metastatic rate. Intriguingly, the gene sets that significantly overlapped with module 3 included those that regulated apoptosis and WNT-signaling, as well as those indicating sensitivity to serum and rapamycin treatment (all FDR < 1e−3, hypergeometric test; Figure 3E). These observations suggest that module 3 distinguishes cells in an altogether different metabolic state uncorrelated with metastatic ability but appearing to be associated with increased survival ability and cell proliferation, perhaps due to relatively high amounts of KRAS signaling. These results also support the finding that combination therapies involving inhibitors of the mammalian target of rapamycin (mTOR), which sits downstream of RAS signaling, is a viable therapeutic opportunity in KRAS-driven non-small-cell lung cancer (Vasan et al., 2014). Future work may investigate whether this therapy-sensitive population is mainly characterized by differential metabolic signaling, perhaps leveraging recent work modeling metabolic profiles from scRNA-seq data (Wagner et al., 2021).
Overall, these results indicate that the PhyloVision joint analysis of single-cell expression and lineage provides an efficient approach for dissecting phylogeny-based transcriptional heterogeneity. In the case study above, we demonstrate that the analysis pipeline and web-based user interface were able to identify gene programs associated with increased or decreased metastatic ability and altogether new programs not previously described in the original study. As a whole, this analysis provides testable hypotheses and intricate molecular characterizations for subpopulations in a single clone.
Discussion
We introduced PhyloVision, a tool for the integrated analysis of scRNA-seq and single-cell lineage-tracing data. PhyloVision offers a feature-rich, user-friendly, and interactive web report for exploring the evolutionary underpinnings of scRNA-seq profiles. Moreover, PhyloVision is embedded within useful analysis pipelines, thus enabling rapid characterization of interesting structure in the high-dimensional data. In this way, to the best of our knowledge, PhyloVision is the first interactive tool of its kind to provide a bridge between single-cell analysis tools and single-cell lineage-tracing data. We show the effectiveness of this approach in a case study of a single clone from our recent work, illustrating rich heterogeneity in this dataset previously unexplored.
In addition to the Hotspot analysis presented here, PhyloVision includes other analysis features such as identification of heritable meta-data (e.g., metastatic rate) and its transcriptional correlates (using gene signatures and Hotspot modules), identification of differentially expressed genes (e.g., by manually choosing subclones to be compared), and visualization of gene expression and meta-data while stratifying the cells according to the phylogeny (e.g., with histograms corresponding to different clades).
While the case study presented in this work focused on CRISPR-Cas9-based lineage tracing, we expect that this tool will also be useful for any type of data that jointly measure cell lineage and transcriptomic state such as those from B cell phylogenies (inferred from their antigen receptor) or single-cell phylogenies built from whole-genome sequencing of tumor samples. Moreover, the shareability of generated web-based reports allows collaborators without computational experience to smoothly explore high-dimensional datasets. All of that taken together, we anticipate that PhyloVision will be useful across applications and technologies and will provide critical support for the interpretation of these multimodal datasets.
Limitations of the study
Our implementation of PhyloVision provides several opportunities for future development. First, there are many useful approaches for quantifying additional cellular characteristics from phylogenies, such as relative fitness, that are not directly supported within our software. Future efforts to incorporate these approaches into the PhyloVision pipeline would be useful. Second, our analysis of the metastasis dataset presented in this study still lacks experimental validation to support the generated hypotheses. Finally, our implementation of PhyloVision in the R programming language may provide challenges for utilizing separate software available in Python, like Cassiopeia (Jones et al., 2020), and future efforts might focus on facilitating interoperability across these ecosystems.
STAR★Methods
Key resources table
REAGENT or RESOURCE | SOURCE | IDENTIFIER |
---|---|---|
Deposited data | ||
Embyrogenesis PhyloVision report | Chan et al. 2019 (GEO Accession GSE117542) | Zenodo: https://doi.org/10.5281/zenodo.6354746 |
Metastasis PhyloVision report | Quinn et al. 2021 (GEO Accession GSE161363) | Zenodo: https://doi.org/10.5281/zenodo.6354746 |
Software and algorithms | ||
PhyloVision | This study | Zenodo: https://doi.org/10.5281/zenodo.3345984 |
Resource availability
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact Nir Yosef (niryosef@berkeley.edu).
Materials availability
This study did not generate new unique reagents.
Method details
The PhyloVision pipeline
PhyloVision builds on the VISION analysis toolkit for signature autocorrelation analysis (DeTomaso et al., 2019). As input, PhyloVision requires a gene expression matrix (typically count-normalized, but not log-normalized), a set of signature gene sets (e.g., publicly available from sources like MSigDB), and a phylogeny (stored as a tree structure in the ape R package (Paradis and Schliep, 2019)). Amongst other data that can be optionally passed into PhyloVision are numerical and categorical metadata, as well as a two-dimensional projections of the cells for visualization purposes (e.g. from t-distributed stochastic neighbor embedding [tSNE] (van der Maaten, 2008) of the main principal components or of an embedding learned by methods such as scVI (Lopez et al., 2018; Gayoso et al., 2022). In the original VISION pipeline, cell-level clustering and consistency evaluation was performed on a user-specified “latent space” (a low-dimensional embedding such as the top principal components or an embedding inferred with tools like scVI). In the PhyloVision pipeline the phylogeny over all cells is treated as this latent space, and all clustering and consistency analysis is performed using the cell-cell similarities encoded by the phylogeny (to see how cell clusters are utilized, see section “Analysis of metadata”). Optionally, a user can still specify an additional latent space that will be used for inferring additional single-cell visualizations using algorithms like tSNE and UMAP. Finally, users can also specify additional pre-computed Hotspot objects that can be used for the Hotspot analysis (as described below).
The signature-centric (VISION) part of the analysis pipeline in PhyloVision (invoked by default when running PhyloVision) begins by clustering cells according to substructure on the tree (see below, section “Generating stratification (clustering) of the cells according to the phylogeny”) and computing scores for each user-provided signature and for every cell (computed using cell-level Z-scores as in (DeTomaso et al., 2019)). It then assesses the agreement between the signature scores and the phylogeny with a phylogenetic autocorrelation statistic (see section entitled “Phylogenetic autocorrelation analysis”). Signatures that are significantly auto-correlated (i.e., cells that are nearby in the phylogeny have more similar scores than expected by chance) are clustered using the previously described VISION pipeline (DeTomaso et al., 2019) and displayed in the web-based report in a collapsable table for each cluster (Figure 1, inset 3). The user can explore the scores of these gene signatures by overlaying their scores on the two-dimensional representation of the gene expression data (Figure 1, inset 1) and the phylogeny (Figure 1 inset 2), viewing histograms of these scores possibly subsetted according to the phylogeny (Figure 1, inset 4), and browsing through the respective genes (Figure 1 panel 4; where genes are ranked their covariance with the signature score).
As with VISION, cell-level metadata is handled similarly to signature scores computed within PhyloVision. Specifically, separate methodology is used to infer autocorrelation of numerical and categorical metadata. However, PhyloVision builds on this autocorrelation analysis by providing utilities for quantifying cell-level plasticity scores for each categorical metadata item. Intuitively, these scores represent how often a categorical variable transitioned between states along the phylogeny (for example, the scMetRate from Quinn et al., 2021). For more details, refer to the sections below “Analysis of metadata” and “Single-cell plasticity analysis”.
PhyloVision additionally supports a phylogenetic analysis with Hotspot, which can be invoked with the runHotspot function. This function interfaces with the Hotspot tool, implemented in Python, using the reticulate R package. Using the previously described -Hotspot pipeline (DeTomaso and Yosef, 2021), PhyloVision first identifies individual genes whose expression is coherent with the tree structure (i.e., nearby cells in the tree express the gene at a similar level, as compared to random chance). It then performs a two-dimensional autocorrelation analysis to group the identified genes into modules. Finally, PhyloVision computes an enrichment statistic between user-defined gene signatures and the identified Hotspot modules (see section entitled “Assessment of statistical significance of the overlap between Hotspot modules and user-provided gene signature”). Gene signatures with significant overlap are included in the web-based report in a collapsable table for each Hotspot module (Figure 1, inset 3). The user can explore the scores of these gene signatures or the score of entire hotspot modules by overlaying their cumulative expression (computed identically to user-specified signatures, as in (DeTomaso et al., 2019)) on the 2-dimensional representation of the gene expression data (Figure 1, inset 1) and the phylogeny (Figure 1, inset 2), viewing histograms of these scores possibly subsetted according to the phylogeny (Figure 1, inset 4), and browsing through the respective genes (Figure 1, panel 4; where genes are ranked by covariance as with user-specified signatures).
Visualization of phylogenies
PhyloVision’s web-based user interface displays the provided phylogeny using a custom Plotly Javascript package (PhyloPlot.js) in either a radial or linear layout. Users are able to select individual cells and clades on the phylogeny for use in Differential Expression and to view on the UMAP. Leaves reflect the same cell-coloring as the UMAP. Users are able to collapse clades to summary nodes, using the mode, arithmetic mean, geometric mean or median of the numerical values selected for the node’s leaves. Users can also collapse nodes by depth from the root of the tree. The phylogeny in both radial and linear layouts is converted to ultrametric edge lengths using the following formula:
Generating stratification (clustering) of the cells according to the phylogeny
In many cases, a phylogeny can be used to stratify groups of cells into specific groups by “cutting” the tree at a specific depth and assigning cells to subclonal lineages. For example, if a phylogeny describes a differentiation process between several cell types, a clustering of cells into subclades might yield cell-type specific clusters. PhyloVision performs not only a clustering of the phylogeny but also an assessment of how meaningful the clustering is on the tree via an autocorrelation statistic (representing how similar the cluster assignment is of neighboring cells).
PhyloVision clusters cells on a phylogeny by performing a breadth-first search over internal nodes. Specifically, the algorithm maintains a queue of internal nodes and updates the queue by popping off the internal node with the largest child clade size, and adding it’s immediate children to the queue. This algorithm begins with the root node, and terminates once the queue has reached a target length (defaulted to 10) at which point PhyloVision merges the smallest clade with its neighbor until the exact target number is reached. The algorithmic pseudocode is detailed as follows:
Cluster-Phylogeny (target := 10)
Queue := Priority Queue {Node : Maximum size of Node’s child clades}
Insert the root into the queue
While length(queue) <= target do:
Node := queue.pop(1)
Insert Node’s children to Queue
Clusters := {[Children]} Nodes Queue
While length(Clusters) > target do:
Select smallest cluster from Clusters
Merge that cluster with its phylogenetically nearest cluster
Return Clusters
While these clusters can be accessed via the metadata variable “VISION_Clusters_Tree”, additional clusterings derived from the phylogeny may be computed and stored as metadata (see PhyloVision vignette available with the package (https://yoseflab.github.io/VISION/articles/phyloVision.html)).
Quantification and statistical analysis
Phylogenetic autocorrelation analysis for gene sets (signature-centric analysis)
To compute the extent to which a value (e.g., signature score, module score, or continuous covariate) can explain the cellular relationships on the phylogeny, we make use of the Geary’s C statistic for local autocorrelation. This statistic is defined as
where wij represents the cophenetic distance (i.e., distance between cells using the branch lengths of the user-specified phylogeny) between cells i and j, xi is a value of interest, N is the total number of cells, and W is the sum of all weights. (A small amount of random noise is introduced to the cophenetic distances to break ties.) In our case, the value of interest (i.e. x) are the ranks of the normalized signature score in each cell, as defined previously (DeTomaso et al., 2019). We report C’ = 1 - C such that a score of 1 indicates perfect autocorrelation and 0 means no autocorrelation. While the C′ statistic provides an effect size, we evaluate the significance of gene signature scores with an empirical p-value (FDR corrected with the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995)), comparing the signature score to a background of randomly generated signatures as described in our previous work (DeTomaso et al., 2019). In the collapsable tables displayed on the UI (Figure 1, inset 3), we retain only gene signatures with an FDR <0.05 and report both C′ statistics and the FDR.
In the phylogenetic autocorrelation analysis, we utilize a K-nearest neighbor (KNN) graph where weights are only non-zero between a cell i and it’s closest k neighbors. Specifically, if cell j is not a k-nearest neighbor of cell i, then wij is taken to be 0. K-nearest neighbors of cell i are found using the distances on the phylogeny, where the distance between cells i and j is defined as the sum of the edge lengths on the path between the two cells. If edge lengths for the dendrogram are not provided, every edge is defaulted to length 1. Ties are randomly broken.
Phylogenetic autocorrelation analysis for individual genes and the identification of modules (HotSpot analysis)
PhyloVision employs a gene-level clustering into modules using the Hotspot autocorrelation analysis (DeTomaso and Yosef, 2021) on the user-defined phylogeny. (Here we describe how the algorithm is applied in the PhyloVision pipeline, for mathematical details please refer to our previous work (DeTomaso and Yosef, 2021).) First, using the cophenetic distances on the tree (defined as the phylogenetic distance separating cells), a K-nearest neighbor (KNN) graph is constructed (using a default K = sqrt(N), where N is the number of cells). Then, genes are selected that are significantly autocorrelated with the phylogenetic KNN graph using the “compute_autocorrelations” function in Hotspot. By default, the depth-adjusted negative binomial (“danb”) model is used and the top 1000 genes are selected (as measured by Hotspot’s Z-transformed Geary’s C) that pass an 0.05 FDR threshold, though both these parameters can be controlled by the user. Then, genes are grouped into modules by clustering the pairwise autocorrelation matrix computed with Hotspot’s “compute_local_correlations” function. By default, we use a minimum gene threshold of 20 and a clustering FDR of 0.5, though both parameters are controllable by the user. Gene signatures corresponding to the genes in a module are added to the PhyloVision object. Additionally, for each module and user-specified gene-signature pair, a new signature is created by computing the gene overlap and added to the PhyloVision object.
Assessment of statistical significance of the overlap between Hotspot modules and user-provided gene signature
Given a full set of N genes, PhyloVision’s compares the genes in module set M identified by Hotspot and the genes in an existing signature set S by first computing an enrichment statistic:
Then, we assess significance using a hypergeometric test with R:
where P(M,S) indicates the p-value of the overlap. P-values are then FDR-corrected using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995).
Analysis of metadata
PhyloVision utilizes the tools in VISION to conduct autocorrelation analysis on metadata. In this, PhyloVision analyzes metadata differently depending on if is numerical (i.e., continuous covariates like the number of genes detected in a cell) or categorical (i.e., discrete covariates like the batch in which a cell profile was sampled). While most metadata is specified by the user, PhyloVision additionally clusters the cells into subclades using a tree-based clustering method (see section below entitled “Generating stratification (clustering) of the cells according to the phylogeny”).
In the context of numerical metadata, PhyloVision utilizes an approach to assess autocorrelation and significance identical to that of gene-signatures (see above, “Phylogenetic autocorrelation analysis for gene sets (signature-centric analysis), with the exception that scores need to be computed as they are provided as precomputed scores by the user.
In the context of categorical data, PhyloVision uses the Cramer’s V statistic as an autocorrelation statistic. As described in our previous work (DeTomaso et al., 2019), the Cramer’s V statistic is a transformation of a chi-squared test statistic, computed on the local neighborhood of each cell. Specifically, for each cell i, we compute a local proportion of each variable value m across its K phylogenetic neighbors (indexed by j):
where wij are computed as above in the gene-centric analysis (see section “Phylogenetic autocorrelation analysis for gene sets (signature-centric analysis)”), cj represents the value of the discrete variable of interest in cell j and Im(x) is an indicator function that takes on the value of 1 if x == m and 0 otherwise. From these values, a contingency table X is computed as
The chi-squared test is then performed on this contingency table X to estimate a p-value. From the chi-squared test statistic, t, the Cramer’s V statistic is computed as
where n is the sum of all the values in the contingency table X, N is the number of rows in the contingency table and M is the number of columns.
Single-cell plasticity analysis
PhyloVision enables users to quantify the plasticity of a particular categorical metadata for each leaf on the tree using previously described methodology (Quinn et al., 2021; Yang et al., 2021). In this, plasticity indicates how often a given cell’s ancestors changed between categories (e.g., cellular states). Specifically, this plasticity analysis begins with computing the small parsimony of a particular categorical metadata using the Fitch-Hartigan algorithm (Fitch, 1971; Hartigan, 1973) for each subtree contained within a tree (where a subtree is defined as the set of nodes downstream from a particular internal node). Each of these parsimony scores is normalized to the number of edges in the subtree. Then, the single-cell plasticity score for each cell is defined as the average of these normalized parsimony scores for all the subtrees that contain this particular leaf. This is accomplished by performing a depth-first-traversal from the root to a leaf, averaging the normalized parsimony scores of each internal node along the way.
Differential expression analysis
Differential Expression is performed using one of several tests on the gene-level expression between two groups of cells chosen by the user. By default, a Wilcoxon Rank Sums test is used, though PhyloVision also supports other tests implemented in the Seurat package such t-test and logistic regression (Satija et al., 2015). The cell groups can be chosen from user-provided metadata factors such as sample, tissue or cluster, or UI selections on the UMAP or phylogeny. Users can select cells on the UMAP using a box select or lasso select. Users can select cells on the phylogeny using a lasso select, or by choosing the parent internal node of the cells they wish to include. Users can compare metadata groups or selections to the remaining unselected cells, or to other groups or selections. The log fold change, AUC and FDR-adjusted p value are reported for each gene in the dataset.
Analysis of the lung cancer data
The tumor phylogeny for CP004 was reconstructed using the Cassiopeia-Hybrid (Jones et al., 2020) algorithm from processed single-cell lineage tracing data, as described in the original study (Quinn et al., 2021). Cells present in both the expression matrix and the lineage for CP004 were used; otherwise, cells were pruned from the lineage using the ape R package (Paradis and Schliep, 2019) or removed from the expression matrix. All unifurcations (i.e., nodes containing exactly one child) were collapsed using the collapse.singles function in the R package ape. Before analysis with PhyloVision, the cells of expression matrix of raw UMI counts were library-size-normalized to the median number of UMIs in CP004. Informative genes were found using the filterGenesFano function in VISION with default parameters and passed to PhyloVision via the projection_genes parameter. Signatures were downloaded from the MSigDB database (https://www.gsea-msigdb.org/gsea/msigdb/) and the Hallmark, C2, and C5 (BP) collections were used for analysis. Meta data corresponding to each cell was downloaded from NCBI GEO, series GSE161363 and dataset GSM4905334. PhyloVision was run with default parameters, except for setting num_neighbors = 30 and projection_methods = c(“UMAP”, “tSNE30”). After PhyloVision analysis, Hotspot was invoked using the runHotspot command with the following parameters: model = “normal”, tree=TRUE, min_gene_threshold=70, n_neighbors = 30, number_top_genes = 1000. Reported pairwise gene autocorrelations, gene modules, and enrichment results were generated from the PhyloVision pipeline as described above.
Acknowledgments
The authors would like to thank Alex Khodaverdian, Richard Zhang, Sebastian Prillo, Joseph Min, David DeTomaso, and the members of the Yosef lab for their helpful discussions in the development of this project. The authors would also like to thank Galen Xing, Liron Grossman, and Shirley Hui for helpful testing and feedback of the software. This work was supported by UCSF Discovery Fellowship (to M.G.J.) and NIH-NIAID U19AI090023 (to N.Y.). N.Y. is also a member of the Chan Zuckerberg Biohub investigator program.
Author contributions
All authors contributed to the design of software and analysis. N.Y. and M.G.J. conceived of this project, and N.Y. directed the work. Y.R. and M.G.J. wrote the software. M.G.J. performed the analysis on the xenograft cancer dataset. All authors wrote and approved the final manuscript.
Declaration of interests
N.Y. is an adviser and/or has equity in Cellarity, Celsius Therapeutics, and Rheos Medicines.
Published: April 25, 2022
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.crmeth.2022.100200.
Supplemental information
Data and code availability
-
•
This paper analyzes existing, publicly available data. These accession numbers for the datasets are listed in the key resources table. Interactive PhyloVision reports are deposited on Zenodo and are publicly available as the date of publication. DOIs are listed in the key resources table.
-
•
All original code is available on Github (https://github.com/Yoseflab/VISION) and on Zenodo and is publicly available as of date of the publication. DOIs are listed in the key resources table.
-
•
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
References
- Benjamini Y., Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. 1995;57:289–300. [Google Scholar]
- Bowling S., Sritharan D., Osorio F.G., Nguyen M., Cheung P., Rodriguez-Fraticelli A., Patel S., Yuan W.-C., Fujiwara Y., Li B.E., et al. An engineered CRISPR-Cas9 mouse line for simultaneous readout of lineage histories and gene expression profiles in single cells. Cell. 2020;181:1410–1422.e27. doi: 10.1016/j.cell.2020.04.048. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Caswell D.R., Chuang C.-H., Yang D., Chiou S.-H., Cheemalavagu S., Kim-Kiselak C., Connolly A., Winslow M.M. Obligate progression precedes lung adenocarcinoma dissemination. Cancer Discov. 2014;4:781–789. doi: 10.1158/2159-8290.CD-13-0862. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chan M.M., Smith Z.D., Grosswendt S., Kretzmer H., Norman T.M., Adamson B., Jost M., Quinn J.J., Dian Y., Jones M.G., et al. Molecular recording of mammalian embryogenesis. Nature. 2019;570:77–82. doi: 10.1038/s41586-019-1184-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- DeTomaso D., Jones M.G., Subramaniam M., Ashuach T., Ye C.J., Yosef N. Functional interpretation of single-cell similarity maps. Nat. Commun. 2019;10:4376. doi: 10.1101/403055. [DOI] [PMC free article] [PubMed] [Google Scholar]
- DeTomaso D., Yosef N. Hotspot identifies informative gene modules across modalities of single-cell genomics. Cell Syst. 2021;12:446–456.e9. doi: 10.1016/j.cels.2021.04.005. [DOI] [PubMed] [Google Scholar]
- Fitch W.M. Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Biology. 1971;20:406–416. [Google Scholar]
- Frieda K.L., Linton J.M., Hormoz S., Choi J., Chow K.H.K., Singer Z.S., Budde M.W., Elowitz M.B., Cai L. Synthetic recording and in situ readout of lineage information in single cells. Nature. 2017;541:107–111. doi: 10.1038/nature20777. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gayoso A., Lopez R., Xing G., Boyeau P., Valiollah Pour Amiri V., Hong J., Wu K., Jayasuriya M., Mehlman E., Langevin M., Liu Y. A Python library for probabilistic analysis of single-cell omics data. Nature Biotechnology. 2022;40:163–166. doi: 10.1038/s41587-021-01206-w. [DOI] [PubMed] [Google Scholar]
- Geary R.C. The Contiguity ratio and statistical mapping. The Incorporated Statistician. 1954;5:115–146. [Google Scholar]
- Hartigan J.A. Minimum mutation fits to a given tree. Biometrics. 1973;29:53–65. [Google Scholar]
- Jones M.G., Khodaverdian A., Quinn J.J., Chan M.M., Hussmann J.A., Wang R., Xu C., Weissman J.S., Yosef N. Inference of single-cell phylogenies from lineage tracing data using Cassiopeia. Genome Biol. 2020;21:92. doi: 10.1186/s13059-020-02000-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kalhor R., Mali P., Church G.M. Rapidly evolving homing CRISPR barcodes. Nat. Methods. 2017;14:195–200. doi: 10.1038/nmeth.4108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kenyon C.J. The genetics of ageing. Nature. 2010;464:504–512. doi: 10.1038/nature08980. [DOI] [PubMed] [Google Scholar]
- Kester L., van Oudenaarden A. Single-cell transcriptomics meets lineage tracing. Cell Stem Cell. 2018;23:166–179. doi: 10.1016/j.stem.2018.04.014. [DOI] [PubMed] [Google Scholar]
- Letunic I., Bork P. Interactive Tree of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021;49:W293–W296. doi: 10.1093/nar/gkab301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lopez R., Regier J., Cole M.B., Jorden M.I., Yosef N. Deep generative modeling for single-cell transcriptomics. Nat. Methods. 2018;15:1053–1058. doi: 10.1038/s41592-018-0229-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lu T., Aron L., Zullo J., Pan Y., Kim H., Chen Y., Yang T.-H., Kim H.-M., Drake D., Liu X.S., et al. REST and stress resistance in ageing and Alzheimer’s disease. Nature. 2014;507:448–454. doi: 10.1038/nature13163. [DOI] [PMC free article] [PubMed] [Google Scholar]
- van der Maaten L. Visualizing Data using t-SNE. J. Machine Learn. Research. 2008;9:2579–2605. [Google Scholar]
- McInnes L., Healy J., Melville J. UMAP: Uniform Manifold approximation and projection for dimension reduction. arXiv. 2018 Preprint at. 1802.03426. [Google Scholar]
- McKenna A., Findlay G.M., Gagnon J.A., Horwitz M.S., Schier A.F., Shendure J. Whole organism lineage tracing by combinatorial and cumulative genome editing. Science. 2016;353:aaf7907. doi: 10.1126/science.aaf7907. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McKenna A., Gagnon J.A. Recording development with single cell dynamic lineage tracing. Development. 2019;146:dev169730. doi: 10.1242/dev.169730. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Neher R.A., Russell C.A., Shraiman B.I. Predicting evolution from the shape of genealogical trees. Elife. 2014;3:e03568. doi: 10.7554/eLife.03568. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Packer J.S. A lineage resolved molecular atlas of C. elegans embryogenesis at single-cell resolution. Science. 2019;365:eaax1971. doi: 10.1126/science.aax1971. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Paradis E., Schliep K. Ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35:526–528. doi: 10.1093/bioinformatics/bty633. [DOI] [PubMed] [Google Scholar]
- Quinn J.J., Jones M.G., Okimoto R.A., Nanjo S., Chan M.M., Yosef N., Bivona T.G., Weissman J.S. Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts. Science. 2021;371:eabc1944. doi: 10.1126/science.abc1944. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raj B., Wagner D.E., McKenna A., Pandey S., Klein A.M., Shendure J., Gagnon J.A., Schier A.F. Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat. Biotechnol. 2018;36:442–450. doi: 10.1038/nbt.4103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Salvador-Martínez I., Grillo M., Averof M., Telford M.J. CeLaVi: an interactive cell lineage visualization tool. Nucleic Acids Res. 2021;49:W80–W85. doi: 10.1093/nar/gkab325. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Satija R., Farrell J.A., Gennert D., Schier A.F., Regev A. Spatial reconstruction of single-cell gene expression data. Nature biotechnology. 2015;33:495–502. doi: 10.1038/nbt.3192. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Subramanian A., Tamayo P., Mootha V.K., Mukherjee S., Ebert B.L., Gillette M.A., Paulovich A., Pomeroy S.L., Golub T.R., Lander E.S., Mesirov J.P. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U S A. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sulston J.E., Schierenberg E., White J.G., Thomson J.N. The embryonic cell lineage of the nematode Caenorhabditis elegans. Dev. Biol. 1983;100:64–119. doi: 10.1016/0012-1606(83)90201-4. [DOI] [PubMed] [Google Scholar]
- Vasan N., Boyer J.L., Herbst R.S. A RAS renaissance: emerging targeted therapies for KRAS-mutated non-small cell lung cancer. Clin. Cancer Res. 2014;20:3921–3930. doi: 10.1158/1078-0432.CCR-13-1762. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wagner A., Wang C., Fessler J., DeTomaso D., Avilo-Pacheco J., Kaminski J., Zaghouani S., Christian E., Thakore P., Schellhaass B., et al. Metabolic modeling of single Th17 cells reveals regulators of autoimmunity. Cell. 2021;184:4168–4185. doi: 10.1016/j.cell.2021.05.045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wagner D.E., Klein A.M. Lineage tracing meets single-cell omics: opportunities and challenges. Nat. Rev. Genet. 2020;21:410–427. doi: 10.1038/s41576-020-0223-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang D., Jones M.G., Naranjo S., Rideout W.M., Min K.H.J., Ho R., Wu W., Replogle J.M., Page J.L., Quinn J.J., Horns F. Lineage Recording Reveals the Phylodynamics, Plasticity and Paths of Tumor Evolution. Preprint at bioRxiv. 2021 doi: 10.1101/2021.10.12.464111. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
-
•
This paper analyzes existing, publicly available data. These accession numbers for the datasets are listed in the key resources table. Interactive PhyloVision reports are deposited on Zenodo and are publicly available as the date of publication. DOIs are listed in the key resources table.
-
•
All original code is available on Github (https://github.com/Yoseflab/VISION) and on Zenodo and is publicly available as of date of the publication. DOIs are listed in the key resources table.
-
•
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.