Abstract
Understanding cellular birth rate differences is crucial for predicting cancer progression and interpreting tumor-derived genetic data. Lineage tracing experiments enable detailed reconstruction of cellular genealogies, offering new opportunities to measure branching rate heterogeneity. However, the lineage tracing process can introduce complex tree features that complicate this effort. Here, we examine tree characteristics in lineage tracing-derived genealogies and find that editing window placement leads to multifurcations at a tree’s root or tips. We propose several ways in which existing tree topology-based metrics can be extended to test for rate heterogeneity on trees even in the presence of lineage tracing-associated distortions. Although these methods vary in power and robustness, a test based on the $J^1$ statistic effectively detects branching rate heterogeneity in simulated lineage tracing data. Tests based on other common statistics ($\hat{s}$ and the Sackin index) show inferior performance to $J^1$. We apply our validated methods to xenograft experimental data and find widespread rate heterogeneity across multiple study systems. Our results demonstrate the potential of tree topology statistics in analyzing lineage tracing data, and highlight the challenges associated with adapting phylogenetic methods to these systems.
Introduction
Differences in cancer cell division rates drive the expansion of specific cell populations and determine the genetic patterns observed in cancer sequencing efforts. The extent and timing of asymmetric growth in tumor development is an active area of study, examined via cell staining for division markers (1–3), transcriptomics (4–6), and phylogenetic (7–9) and population genetic analysis (10–12) on naturally occurring genetic variation. Nearly all tree-based analyses focus on the relationships between cancer subpopulations or “clones”, with limited opportunities to reconstruct higher resolution genealogical relationships from most sequencing data.
Evolvable barcode-tracked xenografts permit higher resolution reconstruction of tumor growth and spread dynamics compared to reconstructions from naturally occurring mutations (13–15). In these experiments, modified cancer cells contain specific genomic regions engineered to rapidly mutate via CRISPR-based targeting. When inserted into mice, these cells record their division history as they grow, allowing lineage reconstruction through barcode sequencing. While these experiments have yielded insights into cancer plasticity (14–17), metastasis (18,19) and progression (1,5,20), the branching structures of lineage tracing trees are underexplored, despite their potential to reveal cell division heterogeneity.
While existing phylogenetic approaches such as tree balance statistics (8,9,21–26) and multi-state birth-death models (7) have previously detected rate heterogeneity in cancer trees, their application to higher resolution lineage tracing data has been limited (27). One potential reason is that lineage tracing introduces multiple new complexities into tree structures. First, lineage tracing systems typically permit only single, permanent edits on a small number of editing regions. These regions often saturate by the time of barcode sequencing, generating multifurcations near tree leaves where no additional edits can be recorded. Second, lineage tracing systems are often designed to be inducible to capture more divisions in recent genealogical history, leading to multifurcations at the tree root. These issues pose challenges in applying rate heterogeneity tests requiring branch lengths. While there are important advances in designing higher resolution editing systems (28–30) and modeling more complex clocks to incorporate branch lengths (27,31,32), opportunities remain to quantify cellular division heterogeneity in these systems using existing tree topologies.
As recognized since Yule 1925 (33), tree topology and balance encode important information about branching process dynamics. However, many tree balance statistics cannot be applied to multifurcating trees, and the impact of lineage tracing on the statistics that are applicable is unknown. In particular, because both rapid lineage expansion and lineage tracing technologies generate multifurcations (34), disentangling these signatures will be necessary to detect branching rate heterogeneity.
In this paper, we investigate how tree balance statistics enable the detection of branching rate heterogeneity in lineage tracing data. We first demonstrate how lineage tracing distorts tree features predictably across multiple branching rate models but retains information about the underlying branching model. Second, we propose how new and existing tree balance statistics can be employed to test for deviations from branching rate uniformity and evaluate their power and robustness. We identify a test based on the $J^1$ statistic as most robust across sample sizes, branch rate heterogeneity types and lineage tracing systems, while the widely used Sackin index underperforms in lineage tracing settings. Finally, we apply these tests to existing lineage tracing datasets, finding widespread but not ubiquitous evidence of branching rate heterogeneity in vivo.
Results
First, we introduce two models of branching rate heterogeneity that produce distinct signatures in bifurcating genealogical trees. Second, we describe how lineage tracing systems impact the shapes of lineage trees under both heterogeneous and equally branching rates. Third, we evaluate the power and type I error rate of existing and modified tree shape statistics to determine how sample size and branching rate heterogeneity strength impact our ability to reject rate uniformity. Fourth, we apply these methods to intra-host lineage tracing data to assess branching rate heterogeneity in vivo in the presence of lineage tracing distortions.
1. Models of branching rate heterogeneity in cancer data
We consider two ways in which cellular branching rates could diverge from an equal branching rate (EBR) Yule process. The models mimic different biological processes through which the branching rate heterogeneity can emerge in cancers. The first model assumes a continuous, gradual change of branching rate over time, referred to here as continuous rate heterogeneity (CRH). For simplicity, we assume that the log-transformed branching rate follows Brownian motion without directional drift over time. This model mimics branching rate’s dependence on gradual expression changes with small effects. The second model assumes discrete transitions between branching rates following a Poisson process, referred to here as discrete rate heterogeneity (DRH). The discrete rate model mimics branching rate’s dependence on mutations or transitions between stably inherited cell states with large effects. In both models, past branching events have no direct impact on branching rate and we consider ultrametric trees in which all extant descendants are sampled.
2. Lineage tracing distorts tree shapes
In both models described above and in cellular division in the body, a single lineage splits into exactly two descendant lineages. However, lineage tracing often incompletely recovers those bifurcations, leaving unresolved polytomies in which the branching event order is unknown. We examine how lineage tracing distorts bifurcating trees into those with multifurcations and other complexities. To simulate lineage tracing, we generate bifurcating trees and stochastically introduce permanent, unique edits into a simulated genetic barcode over time. Edits begin a set period of time after the start of population growth and then occur at a constant rate until the barcode saturates, creating a window in which edits are possible. Once the population reaches the desired size, we collapse adjacent nodes with identical barcodes into unresolved multifurcations.
For our three models of branching rate (EBR, CRH, DRH), we simulated multiple lineage tracing conditions on true binary trees (Figure 1, Figure S1) which created visually distinct patterns according to the barcode editing window size and placement. We summarized these patterns with four tree characteristics: normalized maximum pendant clade size , normalized outdegree at root , fraction of binary nodes and degree of multifurcation (i.e., the proportion of nodes as compared to a fully bifurcating tree). We found that early editing windows generated large pendant clades near the leaves of the tree (high ) but preserved early binary splits (low ) while late editing windows generated the opposite pattern (small pendant clades with large multifurcating root). We also found that the length of the editing window matters (Figure S1, S2): a long editing window retained the most internal nodes in the multifurcating tree and produced a tree shape that is most similar to a bifurcating tree, while a short editing window distorted tree shape significantly. In general, we found that the lineage tracing parameters altered the four summary statistics to a greater degree than the mode of branching rate heterogeneity. Nevertheless, differences in tree statistics between models of rate heterogeneity remained across lineage tracing settings, suggesting EBR and rate heterogeneity can potentially be disentangled.
The lineage tracing settings create trees that match empirical lineage tracing trees from (15) (Figure S2). We therefore use these parameters as defaults for translating binary trees into lineage tracing trees in the next section unless otherwise marked. We note that the trees from (14) showed slightly shorter latency periods (average ). We explore robustness to lineage tracing editing parameters in section 3.5.
3. Evaluating the power to reject EBR in simulated lineage tracing trees
3.1. Brief descriptions of tested approaches
Above, we demonstrate that lineage tracing distorts tree structures but lineage tracing trees with and without rate heterogeneity retain distinct features. We now evaluate if these features permit rejection of an EBR null hypothesis. We considered four approaches leveraging tree topologies without branch lengths (schematic representations of the four tests can be found in Figure S3). In all cases, we propose specific ways to deconvolute the signal of branching rate heterogeneity from that of lineage tracing so tree shape statistics, either new or pre-existing, can be used to reject EBR.
Three of our tested approaches involved computing a scalar statistic and comparing it against a null:
1. $\hat{s}$, the sum of log daughter clade sizes (each minus 1) (35).
2. $J^1$, the clade-size weighted average Shannon equitability of daughter clade sizes across all internal nodes (36).
3. Sackin index, the sum of the number of ancestral nodes across all tips in a tree (37).
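As a minimal sketch of how the three scalar statistics listed above can be computed from topology alone, the Python snippet below represents a tree as nested tuples with string tip labels. The tree representation, the function name `tree_statistics`, and the choice to sum log(n − 1) over internal nodes with n descendant tips are illustrative assumptions, not the implementation used in this study.

```python
import math

def tree_statistics(tree):
    """Compute three topology statistics for a nested-tuple tree.

    Assumptions (hypothetical representation): tips are strings, every
    internal node is a tuple of two or more children, and the tree has
    at least one internal node.
    """
    totals = {"sackin": 0, "log_clade_sum": 0.0, "weighted_eq": 0.0, "weight": 0}

    def visit(node, depth):
        if isinstance(node, str):
            # A tip at depth d has d ancestral nodes (Sackin contribution).
            totals["sackin"] += depth
            return 1
        sizes = [visit(child, depth + 1) for child in node]
        n = sum(sizes)
        # Log of clade size minus one, summed over internal nodes.
        totals["log_clade_sum"] += math.log(n - 1)
        # Shannon equitability of the daughter clade sizes: entropy of the
        # size proportions divided by its maximum, log(outdegree).
        entropy = -sum((k / n) * math.log(k / n) for k in sizes)
        equitability = entropy / math.log(len(sizes))
        # Weight each node by its clade size (number of descendant tips).
        totals["weighted_eq"] += n * equitability
        totals["weight"] += n
        return n

    visit(tree, 0)
    return {
        "sackin": totals["sackin"],
        "log_clade_sum": totals["log_clade_sum"],
        "equitability": totals["weighted_eq"] / totals["weight"],
    }

balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
star = ("a", "b", "c", "d")
```

In this sketch the balanced four-tip tree has a weighted equitability of 1, while the caterpillar tree scores lower and has a larger Sackin index. Because the equitability is normalized by the log of each node's outdegree, the same function accepts multifurcating nodes such as the star tree.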
In all three cases, tree statistics are compared to an EBR null. In the case of $\hat{s}$, this null is normal with an analytically described mean and variance (38). In the case of $J^1$ and the Sackin index, an empirical null distribution is generated from simulated EBR trees of matching sizes. However, these nulls only represent EBR expectations for the statistic in fully bifurcating trees: multifurcations generated via lineage tracing alter tree shape (Figure 1, S1, S2) and therefore the expected EBR null. These null distributions therefore require translation into lineage tracing space, a task complicated in practice by unknown lineage tracing parameters. We apply two strategies. For $\hat{s}$, the focal multifurcating lineage tracing tree is randomly resolved into a binary tree for comparison against the binary null, as recommended by an implementation in the R package apTreeshape (39). For $J^1$ and the Sackin index, we first infer the lineage tracing parameters from the focal tree and transform the EBR bifurcating trees with these parameters before calculating the statistics to generate the null distribution. As we demonstrate below, this inference step introduces additional error that affects the power of the tests, so we also include the nulls generated under the true (but generally unknown) lineage tracing parameters for comparison.
The previous methods summarize the tree into a scalar statistic for comparison against an EBR null. Our final test instead summarizes the tree into a distribution over its internal nodes:
4. Uniform Nodal Probability (UnifNP) summarizes, for each node, the probability that EBR generates a less balanced daughter clade split than observed, controlling for outdegree and the total number of tips.
We define the nodal probability for a multifurcating node of a given outdegree as the probability that, under EBR, the clade sizes of the lineages descending from the focal node are less evenly distributed than observed. We expect a uniform distribution of the nodal probabilities under EBR, and we expect the nodal probabilities to skew towards lower values (i.e., an excess of asymmetrical node splits) under branching rate heterogeneity. We test the distribution of nodal probabilities against the uniform distribution using a Kolmogorov-Smirnov test (see Methods for full details).
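To make the nodal probability idea concrete, the sketch below covers only the simplest case of binary nodes, where the null is easy to write down: under a Yule process, the size of one daughter clade of a node with n tips is uniform on 1, ..., n − 1. Everything else here is an assumption for illustration and not the procedure described in Methods: the nested-tuple tree representation, the function names, the randomized tie-breaking (used so that the values are exactly uniform under the null), and the exclusion of nodes with fewer than four tips, which admit only one possible split.

```python
import random

def nodal_probabilities(tree, rng, min_size=4):
    """Nodal probabilities for the binary nodes of a nested-tuple tree.

    Tips are strings and internal nodes are tuples (hypothetical format).
    Multifurcating nodes are skipped in this simplified sketch.
    """
    probs = []

    def visit(node):
        if isinstance(node, str):
            return 1
        sizes = [visit(child) for child in node]
        n = sum(sizes)
        if len(sizes) == 2 and n >= min_size:
            m = min(sizes)
            # Under a Yule process each unordered split {k, n - k} has
            # probability 2 / (n - 1), except the even split k = n / 2,
            # which has probability 1 / (n - 1).
            p_less_balanced = 2 * (m - 1) / (n - 1)
            p_same = (1 if 2 * m == n else 2) / (n - 1)
            # Randomized tie-breaking (an assumption of this sketch).
            probs.append(p_less_balanced + rng.random() * p_same)
        return n

    visit(tree)
    return probs

def ks_distance_to_uniform(values):
    """One-sample Kolmogorov-Smirnov distance to Uniform(0, 1)."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

# Usage: a maximally unbalanced (caterpillar) tree with 64 tips.
rng = random.Random(1)
caterpillar = "t0"
for i in range(1, 64):
    caterpillar = (caterpillar, f"t{i}")
probs = nodal_probabilities(caterpillar, rng)
d_stat = ks_distance_to_uniform(probs)
```

For the caterpillar tree every split is as asymmetric as possible, so the nodal probabilities pile up near zero and the distance to the uniform distribution is large. A p-value could then be obtained from a standard one-sample Kolmogorov-Smirnov routine such as `scipy.stats.kstest(probs, "uniform")`.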
3.2. Power and type I error rate
We evaluated the performance of the four tests described above in trees with either CRH or DRH of various strengths and sizes ranging from 50 to 6250 tips, intended to span current lineage tracing experiments. For each set of conditions, we generated 100 binary trees which were then transformed with the lineage tracing system described above (relative latency and relative editing rate both equal to 0.5). We evaluated the power of the tests on both the transformed lineage tracing trees and the untransformed binary trees for ease of comparison, and report the proportion of tests detecting significant deviation from EBR at the 5% significance level. We also applied each test to 100 EBR trees transformed by the same lineage tracing procedure to evaluate the type I error rate.
Tests showed good power and type I error rate on binary trees
On binary trees, we found that all tests yielded a type I error rate aligning with the alpha level across modes of branching rate heterogeneity (Figure 2, dashed lines). All tests had increasing power to detect branching rate heterogeneity at more extreme rate heterogeneity and larger tree sizes, although there were a small number of exceptions (Figure 2, Figure S2, solid lines). Overall, branching rate heterogeneity, if present, can be reliably detected by some, if not all, tests in binary trees with sufficiently large tree size (usually N≥250) and sufficiently strong BRH, regardless of its mode.
Tests varied in robustness to multifurcations introduced by lineage tracing systems
We found that $J^1$ and UnifNP were relatively robust to multifurcations introduced by lineage tracing in the range of experimental lineage tracing systems. UnifNP showed very similar power between binary trees and multifurcating trees across all modes of branching rate heterogeneity and tree sizes (Figure 2), and had a type I error rate somewhat above expectations (on average 8% as opposed to 5% in larger trees, but rising to 16% for trees with N=50). $J^1$ showed the expected type I error rate across conditions, and had nearly identical power between binary and multifurcating trees generated under CRH (Figure 2A). Surprisingly, the $J^1$ test gained power on multifurcating trees generated under DRH (Figure 2B), especially when the true lineage tracing parameters were known. This increased power possibly derives from the uneven distribution of branching rate heterogeneity signals across the phylogeny: recent, small clades contain weaker rate heterogeneity signal compared to those in older, larger clades, and the lineage tracing process merges these recent clades into pendants when barcodes saturate, reducing their relative contribution to $J^1$.
In contrast, the Sackin index and $\hat{s}$ had poorer performance in multifurcating trees relative to binary trees across a range of CRH and DRH conditions. The Sackin index with estimated lineage tracing parameters showed decreased power across most CRH and DRH conditions relative to the binary trees (Figure 2) and an elevation of type I error rate in very large trees (Figure S4, N = 6250 leaves). Using the true rather than inferred lineage tracing parameters to generate EBR null distributions restored the type I error rate to the expected level and increased test power in the case of DRH, demonstrating that lineage tracing parameter inference substantially harmed test performance. The $\hat{s}$ test consistently suffered from an inflated type I error rate across all tree sizes for multifurcating lineage tracing trees (Figure 2, Figure S4, dashed lines). This inflated type I error rate led to falsely increased power in CRH and DRH conditions with weak rate heterogeneity, although power was not especially high at strong rate heterogeneity, especially for CRH. This inflated type I error rate suggests that the random resolution of multifurcating trees into binary trees before the computation of $\hat{s}$ is inappropriate for evaluating branch rate heterogeneity in lineage tracing trees.
Tree size and branching rate heterogeneity strength determine test power
Tree size and the strength of branching rate heterogeneity (effect size) are the two primary determinants of the tests’ power to detect deviations from EBR. In general, tests possessed greater power to detect branch rate heterogeneity in larger trees (Figure S5) and in trees with stronger rate heterogeneity (Figure 3). However, responses to stronger rate heterogeneity were not uniform across tests: the Sackin index and the $\hat{s}$ test did not show increasing power as heterogeneity strength increased in larger trees (N≥1250) under CRH (Figure 3A), while the $J^1$ and UnifNP tests were insensitive to rate heterogeneity effect size in DRH trees with low transition rates (Figure S4).
All methods performed best with strong rate heterogeneity in very large trees (N = 6250 leaves), although power varied between tests and across different modes of BRH. For CRH with a scaling coefficient of 10, all tests had power >70%, with $J^1$ and UnifNP approaching a power of 100% (Figure 3A). For DRH with a fold change of 10 and a transition rate of 1, all tests had power >95% with the exception of the Sackin index (power <50%, Figure 3B).
As effect sizes became weaker, large trees were required to maintain good power. For example, under CRH (Figure 3A), power dropped below 0.5 only when the scaling coefficient became less than 0.5 (for N = 6250) or 1 (for N = 1250). Similar trends were observed in DRH (Figure 3B). Even the largest set of examined trees (N = 6250) had very limited power with weak effect sizes (CRH 0.1 and DRH 1.1 fold-change). This effect was present in both binary and multifurcating trees, suggesting these topological tree statistics are in general not powerful enough to pick up small effect sizes, and the impact of lineage tracing-associated uncertainty does not drive test power in these instances.
Similarly, only very strong rate heterogeneity was detectable in small trees. Under CRH, only the strongest effect sizes () were detectable in trees with 250 nodes or fewer, apart from those EBR rejections likely associated with an elevated false discovery rate. Under DRH, even strong effect sizes (fold change of 10) were not often detectable in trees with 250 nodes or fewer.
Power generally stronger in CRH versus DRH
Throughout our testing, we noted that our power to detect deviations from EBR was higher in trees generated by CRH than in trees generated by DRH (Figure 3 and S4). This is perhaps unsurprising given that the summary statistics examined in Figure 1 showed less departure from EBR under DRH than under CRH. Specifically, we observed that the power of tests was more sensitive to tree size under DRH than under CRH. In addition, when the power plateaued over effect sizes, this plateau was usually higher in CRH trees than DRH trees. We further note the power in DRH trees became weaker if the transition rate among DRH states increased or decreased (Figure S4). These trends were consistent in both binary and multifurcating lineage tracing trees, suggesting that they stem from the intrinsic properties of the rate heterogeneity model, and not lineage tracing distortions. One possible explanation is that the finite states in the DRH model defined in this study effectively constrained the variation in the average branching rate between large sister clades, while the continuous spectrum in the CRH model did not.
Advantages and disadvantages of using $J^1$ versus UnifNP
In multifurcating trees, $J^1$ and UnifNP clearly outperformed $\hat{s}$ and the Sackin index for detecting deviations from EBR, although we note that $\hat{s}$ had among the best power in bifurcating trees across BRH modes and the fastest computational time (Figure S6). Between $J^1$ and UnifNP, each test had specific situations in which it outperformed the other. $J^1$ had the best power under CRH across sample sizes and scaling coefficients, and weakly outperformed UnifNP under DRH when the transition rates between branching rates were low (Figure 3, Figure S4). On the other hand, UnifNP weakly outperformed $J^1$ under DRH with faster switching rates, with this effect being slightly more pronounced in larger trees (Figure S4D). In aggregate, $J^1$ has the most robust power across the greatest number of scenarios. UnifNP outperforms $J^1$ in computational time (Figure S6), given that it does not require whole-tree simulations to build an empirical null distribution and estimate lineage tracing parameters. UnifNP has the potential to be more useful than $J^1$ at some scales of data (e.g., hundreds of trees with thousands of tips), if the lineage tracing settings are in a regime with a reasonable type I error rate.
3.3. Robustness across different lineage tracing types
We repeated the analysis on trees transformed by different relative latency and editing rate settings with a fixed tree size of N = 1250 tips (Figure S7). We found that certain lineage tracing settings substantially impacted the performance of all four tests. All tests were particularly sensitive to the latency timing, where edits permitted only in the last 10% of the population growth resulted in non-linear power with respect to strength of rate heterogeneity under CRH and >90% type I error rates in the case of both UnifNP and . Short latency periods and very low editing rates also resulted in elevated type I error rates for both UnifNP and in both CRH and DRH. In general, although UnifNP had generally strong performance in the lineage tracing settings explored above, its performance suffered considerably under more varied settings, especially when those settings led to strong reductions in the number of internal nodes (Figures S1, S2). Because UnifNP relies on the aggregated signal across internal nodes on a tree, elimination of these nodes strongly affected test performance. The statistic was consistently the most powerful test and with the best calibrated type I error rate across lineage tracing settings, but also showed nonlinearities with respect to branch rate heterogeneity under some conditions. We note that the trees resulting from these extreme lineage tracing settings generally did not resemble the actual trees derived from xenograft data, but an important conclusion of our work is that depending on the specific lineage tracing-specific settings, topology distortion is extreme enough so as to render these tests unreliable.
4. Applications to in vivo data reveal branching rate heterogeneity is widespread but not ubiquitous in cancer genealogical trees
To evaluate the presence of rate heterogeneity in cancer proliferation in vivo, we applied the $J^1$ statistic and UnifNP to two sets of cancer genealogical trees derived from lineage tracing experiments: a murine xenograft study of pancreatic ductal adenocarcinoma (PDAC (15)) and a murine xenograft study of lung cancer (14). We found widespread evidence of proliferation rate heterogeneity among tumor clones in the lung cancer xenografts but less evidence in the PDAC xenografts (Table 1 and S1). Specifically, the $J^1$ test indicated that all but one of the lung cancer clones (69/70; the exception was 3520_NT_T1) exhibited significant signatures of rate heterogeneity. UnifNP showed more conservative but confirmatory results, estimating that 93% (65 out of 70) of trees showed rate heterogeneity. These discovery rates approximately mirror the power of $J^1$ and UnifNP on trees of these approximate sizes (~250–1250 tips) with CRH and 1 < c < 5, suggesting widespread rate heterogeneity in these lung cancer samples.
Table 1. The prevalence of branching rate heterogeneity in tumor genealogical trees.
Reference (number of trees) | Dataset subgroup (number of trees) | # significant by $J^1$ (proportion) | # significant by UnifNP (proportion) |
---|---|---|---|
Simeonov et al 2021 (12) | M1 (6) | 3 (50%) | 2 (33%) |
Simeonov et al 2021 (12) | M2 (6) | 2 (33%) | 2 (33%) |
Yang et al 2022 (70) | Apc (22) | 22 (100%) | 21 (95%) |
Yang et al 2022 (70) | Lkb1 (23) | 23 (100%) | 22 (96%) |
Yang et al 2022 (70) | NT (25) | 24 (96%) | 22 (88%) |
In the PDAC xenografts, only a small proportion of trees exhibited significant signals of branching rate heterogeneity. We observed significant branching rate heterogeneity via both tests in the composite tree M1 but not in M2, reflecting divergent evolutionary trajectories of the same cell line in two distinct mice. The M1 rate heterogeneity signal is driven by the largest clone (M1_Clone_1) but some smaller clones also showed significant signals. In M2, the $J^1$ statistic showed significant signal in two small clones (N = 70 and 209) while the UnifNP statistic showed significant signal in the composite tree and a different small clone (N = 121). However, the estimated lineage tracing parameters suggest that UnifNP’s significant signal in that small clone tree might be an artifact of an increased type I error rate.
The contrast in the prevalence of branching rate heterogeneity between the two studies and between the PDAC biological replicates suggests that proliferation rate heterogeneity is widespread but not ubiquitous. We recommend formal testing for branching rate heterogeneity when analyzing tree topologies emerging from lineage tracing experiments.
Discussion
Tree-based cancer data from patient samples and experiments have already yielded extensive insights into cancer evolutionary dynamics (7,40–43). Borrowing from a rich literature preceding mass cancer sequencing, researchers have used tree balance to suggest departures from uniform growth in cancer data via well-established statistics (8,9,21,22,24,25,44). However, the application of these methods to lineage-tracing trees has been limited. Major obstacles include a lack of understanding about how the lineage tracing process itself distorts tree shape and the degree to which tree balance signatures are retained.
In this paper, we find that although lineage tracing distorts genealogy shapes, these trees can retain substantial signals of growth rate heterogeneity. We demonstrate that several statistics leveraging tree balance are well-powered and well-calibrated to detect deviations from an EBR null, particularly $J^1$ and UnifNP (under lineage tracing settings matching empirical data). Although an a priori understanding of lineage tracing parameters improves test performance, these tests can perform well even when this information is unknown. Despite the Sackin index being the most widely used tree balance statistic among cancer trees with naturally occurring variation (8,9,22–24), we found it to be relatively underpowered when testing for EBR departures in lineage tracing data. Application of the $J^1$ and UnifNP tests to evolving barcode lineage tracing experiments reveals that branching rate variation is widespread but not ubiquitous in xenograft experiments, and its presence can vary between systems and biological replicates. In future analyses, tests for rate heterogeneity applied to tree subclades with specific characteristics may permit the robust identification of conditions that associate with branch rate heterogeneity.
We note several caveats with our analyses. 1) We consider only ultrametric trees with all extant taxa sampled. Incomplete and especially biased sampling stands to distort the proposed statistics and tests. We likely encountered incomplete but relatively balanced sampling in the lineage tracing data we analyzed. New approaches to account for biased sampling would extend the applicability of these tests. 2) We do not consider tree uncertainty from the reconstruction process beyond the collapsing of nodes with identical barcodes in simulated data. We note that tree reconstruction from lineage barcodes is an active area of research (27,31,32,45) with additional technical challenges that were not directly modeled here, including barcode deletions, and late edits that mask earlier ones. While we considered a general model of evolvable barcode lineage tracing, extensive consideration of system-specific features could further improve our understanding of method applicability. 3) Despite our ability to detect deviations from EBR, our tests cannot distinguish if these deviations derive from CRH or DRH (Supplemental Figure 8) or if their basis is genetic, transcriptional, environmental or combination thereof. These tests also will not detect temporal branch rate heterogeneity affecting all lineages of the tree simultaneously (for example, branching rate deceleration as the population reaches carrying capacity (46)). 4) While we examined two models of rate heterogeneity with a range of rate strengths, rate heterogeneity can emerge via a variety of complex modes not considered here, and the model specifics likely alter the power of the tests reported.
Although here we consider the specific problems and perform benchmarking specifically in the case of lineage tracing-based tumor xenograft data, the method modifications we propose are general and could potentially be applied in other instances where trees have widespread multifurcations. While the future of lineage tracing trees will likely involve experimental and inference systems that tightly interweave longer division tracking periods, system-specific reconstructions and branching rate modeling, existing topological data continue to represent a rich source of biological information about underlying evolutionary processes, albeit with their own complexities.
Materials and Methods
Tree simulations
We simulated binary trees under three different branching rate models: equal branching rate (EBR), continuous branching rate heterogeneity (CRH) and discrete branching rate heterogeneity (DRH). In all models, branching events occur according to a Poisson process with rate that can vary over time. For EBR, we used a constant rate Yule model with branching events per unit time. For CRH, we modeled a trait that varies according to Brownian motion over time and determines the branching rate via an scaled exponential transformation where the scaling coefficient . This is similar to ESsim (47), but we here employ an exponential rather than bounded linear function to examine an idealized scenario with no branching rate constraints. The Brownian diffusion coefficient D is set to 0.1 and the initial cell starts with . Because we constrained tree size, variation in and have the same effect on tree shape, so we did not vary across trials. For DRH, we modeled branching rate transition between two discrete states following another Poisson process with rate transitions per unit time, mirroring the structure of multi-state speciation extinction models (48). One branching rate is fixed at and we examine tree patterns emerging when the other is , 2 or 10. The common ancestor starts at , leading to identical initial conditions across the three models. For all models, we simulated trees of size 50, 250, 1250 and 6250 with 100 replicates each using the R package ‘diversitree’ (49).
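The simulations in this study were run with the R package named above. Purely as an illustration of the EBR case, the following Python sketch grows a Yule tree by giving every extant lineage the same branching rate. The dictionary-based tree format, the function names, and the choice to stop sampling just before the next branching event would occur are assumptions of this sketch; the CRH and DRH models would additionally require lineage-specific rates that change over time and are not shown.

```python
import random

def simulate_ebr_tree(n_tips, rate, rng):
    """Simulate an ultrametric equal-branching-rate (Yule) tree.

    Each node is a dict with a branch length "len" and a list of
    "children" (hypothetical format). Tips get a "name" entry.
    """
    root = {"len": 0.0, "children": []}
    active = [root]
    while True:
        # With k extant lineages, the waiting time to the next branching
        # event is exponential with rate k * rate.
        wait = rng.expovariate(rate * len(active))
        for lineage in active:
            lineage["len"] += wait
        if len(active) == n_tips:
            break  # sample just before the next split (sketch assumption)
        # Every lineage is equally likely to be the one that branches.
        parent = active.pop(rng.randrange(len(active)))
        parent["children"] = [
            {"len": 0.0, "children": []},
            {"len": 0.0, "children": []},
        ]
        active.extend(parent["children"])
    for i, tip in enumerate(active):
        tip["name"] = f"t{i}"
    return root

def tip_depths(node, start=0.0):
    """Root-to-tip distance of every tip below the given node."""
    end = start + node["len"]
    if not node["children"]:
        return [end]
    return [d for child in node["children"] for d in tip_depths(child, end)]

rng = random.Random(0)
tree = simulate_ebr_tree(50, 1.0, rng)
depths = tip_depths(tree)
```

All tips end at the same distance from the root, matching the ultrametric, fully sampled setting considered here.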
Simulating tree structures under lineage tracing
To evaluate our ability to reject the EBR hypothesis in lineage tracing data, we transformed our binary genealogical trees into simulated lineage tracing trees using three parameters: the number of editing sites $k$, the latency of editing , and the rate of editing . Lineage tracing only begins after time , representing the non-editing time preceding lineage tracing induction, and branching events preceding are collapsed into a multifurcating node at the root. After lineage tracing induction, barcode editing occurs according to an independent Poisson process in each cell lineage at a constant rate of mean events per cell per unit time. Each editing event permanently and uniquely marks one of the cell’s edit sites, which cannot then be edited again. All barcodes are inherited by both daughter cells. If all of a cell’s edit sites are edited, no more edits are permitted and that node and its descendants form a pendant clade. Because the absolute rate of cell proliferation and the barcode editing rate are often unknown, we normalize the stochasticity in the root-to-tip distance among simulated trees of the same size by parameterizing these rates in time relative to root-to-tip distance. Specifically,
We simulated tree structures with ranging from 0 to 0.9 and ranging from −0.5 to 0.9, both at parameter intervals of 0.1, and $k$ equal to 20. To generate lineage tracing trees for power and type I error rate estimation, we simulated tree structures with , and $k$ equal to 20. We present the results with and both equal to 0.5, as this combination of parameters produced EBR trees that best matched the proportion of multifurcating nodes (see detailed definition below) in Simeonov et al’s M2 tree (Figure S2). To generate lineage tracing trees for the empirical null distribution of each given tree, we estimated the parameters as described below.
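A rough Python sketch of the transformation described above follows. It rests on one simplifying assumption that is not stated in the text: a branch that accumulates no edits cannot be resolved, so its node is merged into the nearest marked ancestor. Under that assumption, branching before the latency time collapses into the root and lineages with no unedited sites left form pendant clades. The function name, the `span` input and the use of absolute rather than root-to-tip-relative time are illustrative choices only.

```python
import random

def lineage_tracing_transform(children, span, root, k, latency, edit_rate, seed=None):
    """Collapse branches that carry no barcode edits (illustrative sketch).

    children:  dict mapping each internal node to a list of daughter nodes
    span:      dict mapping each non-root node to the (start, end) times of
               the branch leading to it
    k:         number of editable sites per cell
    latency:   time before which no editing occurs
    edit_rate: mean edits per cell per unit time once editing has started
    """
    rng = random.Random(seed)

    def count_edits(start, end, sites_left):
        # Only the part of the branch after `latency` can be edited.
        window = end - max(start, latency)
        if window <= 0 or sites_left <= 0 or edit_rate <= 0:
            return 0
        count = 0
        t = rng.expovariate(edit_rate)  # time of the first edit on this branch
        while t < window and count < sites_left:
            count += 1                   # each edit uses up one site for good
            t += rng.expovariate(edit_rate)
        return count

    new_children = {root: []}
    # Stack entries: (node, nearest retained ancestor, unedited sites left).
    stack = [(c, root, k) for c in children.get(root, [])]
    while stack:
        node, anchor, sites_left = stack.pop()
        if node not in children:
            new_children[anchor].append(node)   # sampled tips are always kept
            continue
        start, end = span[node]
        edits = count_edits(start, end, sites_left)
        if edits > 0:
            # The branch is marked, so this node stays resolved in the tree.
            new_children[anchor].append(node)
            new_children[node] = []
            anchor = node
        # An unmarked node is skipped and its daughters attach to `anchor`.
        for child in children[node]:
            stack.append((child, anchor, sites_left - edits))
    return new_children
```

Setting `edit_rate` to zero, or `latency` beyond the sampling time, yields a star tree, while a small `k` with a high edit rate produces early saturation and large pendant clades.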
Estimating latency and rate parameters from a lineage tracing tree
For both the $J^1$ and Sackin index tests, we compare an observed statistic to a distribution of statistics generated under EBR to test for divergence from EBR. The trees forming this empirical null also require lineage tracing transformation to be comparable to the observed tree. We infer the parameters of the lineage tracing system from the focal tree and the experimental design as follows. First, we take the number of edit sites, $k$, as a known parameter from the construction of the lineage tracing system. For the simulated data, we set $k$ = 20. For the empirical trees, we matched $k$ to the number of editing sites from the raw data, which ranged from 5 to 78 with a median of 30 (see Table S1).
Second, although in principle the latency and rate parameters can also be determined from the experimental design, we treated them as unknown and estimated them from the observed lineage tracing trees. We matched lineage tracing transformed EBR binary trees and empirical trees according to four statistics: 1) the outdegree at the root node , 2) the maximum pendant clade size , 3) the proportion of multifurcating nodes , calculated as the ratio of the number of multifurcating nodes to the total number of nodes in the tree, and 4) the degree of multifurcation , calculated as the ratio of the number of nodes in the observed tree to the number of nodes in a binary tree of the same size. We normalized the root outdegree and the maximum pendant clade size by dividing them by the tree size. We simulated 20 binary trees of the same size as the lineage tracing tree and transformed them combinatorially with ranging from 0 to 0.9 and ranging from −0.5 to 0.9, both at parameter intervals of 0.1. We summarized the transformed trees with the same four summary statistics, and selected the combination of latency and saturation parameters that yielded the smallest unweighted average sum-of-squares error relative to the observed lineage tracing tree. We then used that parameter combination to transform EBR binary trees into multifurcating trees from which we calculated the empirical null distributions for the $J^1$ and Sackin index tests. Note that we also attempted to use tree distance metrics for parameter matching but found few options that could compute distances in multifurcating trees of these sizes, and the best candidate metric (50) failed to produce trees that matched most summary statistics (results not shown).
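The parameter-matching step can be sketched in Python as follows. Two definitions are assumptions made for illustration, since the text does not spell them out: a pendant clade is taken to be an internal node whose daughters are all tips, and "total number of nodes" is taken to include tips. The function names and the tree representation (a dictionary from internal node to list of daughters) are likewise hypothetical.

```python
def clade_sizes(children, root):
    """Number of tips below every node, computed without recursion."""
    sizes = {}
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        kids = children.get(node, [])
        if not kids:
            sizes[node] = 1
        elif visited:
            sizes[node] = sum(sizes[c] for c in kids)
        else:
            stack.append((node, True))
            stack.extend((c, False) for c in kids)
    return sizes

def summary_statistics(children, root):
    """Return the four matching statistics as a tuple.

    Assumptions: a pendant clade is an internal node whose daughters are all
    tips, and node counts include both tips and internal nodes.
    """
    sizes = clade_sizes(children, root)
    n_tips = sizes[root]
    internal = [v for v in sizes if children.get(v)]
    pendant = [len(children[v]) for v in internal
               if all(not children.get(c) for c in children[v])]
    n_multifurcating = sum(1 for v in internal if len(children[v]) > 2)
    n_nodes = len(sizes)
    return (
        len(children[root]) / n_tips,   # normalized root outdegree
        max(pendant) / n_tips,          # normalized maximum pendant clade size
        n_multifurcating / n_nodes,     # proportion of multifurcating nodes
        n_nodes / (2 * n_tips - 1),     # nodes relative to a binary tree
    )

def best_parameters(observed, simulated):
    """Pick the parameter pair with the smallest mean sum-of-squares error.

    `simulated` maps a (latency, rate) pair to the list of statistic tuples
    obtained from the transformed EBR trees for that pair.
    """
    def mean_sse(stat_sets):
        return sum(
            sum((s - o) ** 2 for s, o in zip(stats, observed))
            for stats in stat_sets
        ) / len(stat_sets)
    return min(simulated, key=lambda params: mean_sse(simulated[params]))
```

In use, `simulated` would be filled by transforming 20 size-matched EBR trees at every grid point and summarizing each with `summary_statistics`.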
Tests for departures from EBR
$J^1$ statistic
The $J^1$ statistic is the clade-size weighted average Shannon equitability of daughter clade sizes across all internal nodes (36). In the context of the genealogical trees in this study, $J^1$ can be computed as
$$J^1 = \frac{\sum_i S_i W_i}{\sum_i S_i}$$
where the sums run over all internal nodes $i$, $S_i$ is the number of tips descending from the focal node and $W_i$ is the Shannon equitability of the daughter clade sizes at the focal node. In a perfectly balanced tree, $J^1$ = 1 and in a maximally imbalanced tree, $J^1$ = 0. We compute a null distribution for $J^1$ by simulating 1000 size-matched EBR trees, applying the lineage tracing settings inferred for the focal tree, and then computing $J^1$ on the resulting lineage-tracing transformed trees. We then test the one-sided alternative hypothesis that the observed statistic is smaller than expected under EBR (i.e., more unbalanced due to rate heterogeneity).
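As a minimal Python sketch of this test, the index can be computed from the tip counts below each internal node and compared against simulated null values. The equitability at a node is taken as the Shannon entropy of the daughter clade proportions with the logarithm base set to the node's outdegree, following the definition in (36). The Monte Carlo p-value formula with a +1 correction is an assumption, as the text does not state how the p-value is formed, and all function names are hypothetical.

```python
import math

def clade_sizes(children, root):
    """Number of tips below every node, computed without recursion."""
    sizes = {}
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        kids = children.get(node, [])
        if not kids:
            sizes[node] = 1
        elif visited:
            sizes[node] = sum(sizes[c] for c in kids)
        else:
            stack.append((node, True))
            stack.extend((c, False) for c in kids)
    return sizes

def j1_index(children, root):
    """Clade-size weighted mean Shannon equitability over internal nodes."""
    sizes = clade_sizes(children, root)
    weighted_sum, total_size = 0.0, 0
    for node in sizes:
        kids = children.get(node, [])
        if len(kids) < 2:
            continue  # tips carry no split information
        proportions = [sizes[c] / sizes[node] for c in kids]
        entropy = -sum(p * math.log(p) for p in proportions)
        equitability = entropy / math.log(len(kids))  # log base = outdegree
        weighted_sum += sizes[node] * equitability
        total_size += sizes[node]
    return weighted_sum / total_size

def lower_tail_p(observed, null_values):
    """One-sided Monte Carlo p-value for 'observed is smaller than the null'.

    The +1 correction is an assumed convention, not taken from the text.
    """
    at_or_below = sum(1 for v in null_values if v <= observed)
    return (1 + at_or_below) / (1 + len(null_values))
```

For a four-tip caterpillar tree the index evaluates to 8/9, while a fully balanced four-tip tree and a four-tip star both give 1.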
UnifNP
In a bifurcating EBR tree, the size of a randomly selected daughter clade follows a discrete uniform distribution between 1 and $n - 1$, where $n$ is the combined size of the clade and its sister, providing a null expectation from which trees with branching rate heterogeneity depart (51). The corresponding statistic, nodal probability, is the probability that a node splits into daughters in a way that is less balanced than expected under EBR. We refer to this method as UnifNP (Uniform Nodal Probability). Here, we generalize the standard bifurcating nodal probability to multifurcating trees. To evaluate the balance of the split at a multifurcating node with an outdegree of $m$, we calculate an intermediate statistic as half the sum of the absolute differences between descending clade sizes and their rounded average :
where the rounded average is rounded up for the first $n \bmod m$ clades and rounded down for the remaining clades.
Slowinski and Guyer provide an analytic solution for the distribution of this statistic in bifurcating trees (51). For multifurcating trees, we generate node-specific empirical distributions of the statistic through 999 Monte Carlo simulations of clade size splits expected under EBR. For a node with $n$ descendants separated into $m$ clades, we repeatedly and randomly partition the $n$ tips into $m$ groups, sort the groups in ascending order of size and compute the statistic. The resulting distribution forms the node’s empirical null distribution (for extended justification that this represents the null clade size distributions under EBR, see Supplemental Note). We then compute , the count of times that the observed statistic for a given node is less than its empirical null, with a greater count reflecting a lesser divergence from the most balanced split than expected under random chance. Because the maximum of the statistic’s possible values, , is constrained by $n$ and $m$, the statistic can only take on at most + 1 possible values. If is small across many nodes, this presents a problem when aggregating quantile values across nodes. We resolve this by redistributing p into , a random uniform quantile position within its bin of tied statistic quantiles. Specifically, if the observed statistic is tied with values from the empirical null, we set , where 𝜀 is a random uniform draw between 0 and 1. In the case of no tie, . We then compute the nodal probability as . We exclude small clades from the distribution of nodal probabilities, as branching rate heterogeneity leaves little imbalance signature in them. Under EBR, the distribution of nodal probabilities across all internal nodes will be uniform. We test against this null hypothesis via a Kolmogorov-Smirnov test.
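The procedure can be sketched in Python under several stated assumptions, since the exact formulas are not recoverable from the text above. Clade sizes are sorted in descending order so that the most balanced split scores zero; random splits are drawn by placing $m - 1$ cut points uniformly among the $n - 1$ gaps between tips, which reduces to the uniform distribution on 1 to $n - 1$ when $m = 2$; and ties are broken with a randomized rank. All function names are hypothetical.

```python
import random

def balance_statistic(sizes):
    """Half the summed absolute deviation of clade sizes from an even split."""
    n, m = sum(sizes), len(sizes)
    base, extra = divmod(n, m)
    # The most even split gives `extra` clades base + 1 tips and the rest base.
    targets = [base + 1] * extra + [base] * (m - extra)
    ordered = sorted(sizes, reverse=True)  # assumed ordering convention
    return sum(abs(s - t) for s, t in zip(ordered, targets)) // 2

def random_split(n, m, rng):
    """Split n tips into m non-empty groups (assumed null model)."""
    cuts = sorted(rng.sample(range(1, n), m - 1))
    bounds = [0] + cuts + [n]
    return [b - a for a, b in zip(bounds, bounds[1:])]

def nodal_probability(sizes, rng, n_sim=999):
    """Randomized Monte Carlo quantile of the observed split at one node."""
    n, m = sum(sizes), len(sizes)
    observed = balance_statistic(sizes)
    null = [balance_statistic(random_split(n, m, rng)) for _ in range(n_sim)]
    greater = sum(1 for v in null if v > observed)
    ties = sum(1 for v in null if v == observed)
    # Place the observed value at a uniform random position within its block
    # of ties so that the result is continuous under the null (assumed form).
    return (greater + rng.random() * (ties + 1)) / (n_sim + 1)

def ks_uniform_statistic(values):
    """Kolmogorov-Smirnov distance between `values` and Uniform(0, 1)."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
```

A p-value for the final step could be obtained from an existing routine such as `scipy.stats.kstest(values, "uniform")`. The minimum clade size for inclusion is not given in the text, so no filtering is applied here.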
Sackin index
The Sackin index is the sum of the number of ancestors (internal nodes) across all tips in a tree (37). For size-matched trees, larger Sackin indices indicate less balanced trees. We used the same procedure to generate null distributions for the Sackin index as for $J^1$. The one-sided alternative hypothesis is that the observed Sackin index is greater than expected under EBR.
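A short Python sketch of the index and its one-sided comparison is given below. The tree is assumed to be stored as a dictionary from internal node to list of daughters, and the +1 correction in the Monte Carlo p-value is an assumed convention rather than one stated in the text.

```python
def sackin_index(children, root):
    """Sum over all tips of the number of internal nodes above each tip."""
    total = 0
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        kids = children.get(node, [])
        if not kids:
            total += depth  # a tip at this depth has `depth` ancestors
        else:
            stack.extend((c, depth + 1) for c in kids)
    return total

def upper_tail_p(observed, null_values):
    """One-sided Monte Carlo p-value for 'observed is larger than the null'."""
    at_or_above = sum(1 for v in null_values if v >= observed)
    return (1 + at_or_above) / (1 + len(null_values))
```

With four tips, the balanced tree scores 8 and the caterpillar tree scores 9, in line with larger values indicating less balance.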
statistic
The statistic is calculated as the sum of log(clade size − 1) across all internal nodes of a fully bifurcating tree (35), and we use its implementation in the R package apTreeshape (39). The statistic asymptotically follows a normal distribution with mean 1.204$n$ and variance 0.168$n$, where $n$ is the size of the tree (38). We randomly resolved multifurcating nodes under EBR following apTreeshape’s documentation before testing for departures from EBR. We test the one-sided alternative hypothesis that the observed statistic is greater than expected under EBR.
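The calculation can be sketched in Python as follows. This is not the apTreeshape implementation: it uses natural logarithms and only the leading-order moments quoted above, both of which are assumptions, and it does not perform the random resolution of multifurcations. The function names are hypothetical.

```python
import math

def clade_sizes(children, root):
    """Number of tips below every node, computed without recursion."""
    sizes = {}
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        kids = children.get(node, [])
        if not kids:
            sizes[node] = 1
        elif visited:
            sizes[node] = sum(sizes[c] for c in kids)
        else:
            stack.append((node, True))
            stack.extend((c, False) for c in kids)
    return sizes

def shape_statistic(children, root):
    """Sum of log(clade size - 1) over internal nodes of a binary tree."""
    sizes = clade_sizes(children, root)
    total = 0.0
    for node in sizes:
        kids = children.get(node, [])
        if not kids:
            continue
        if len(kids) != 2:
            raise ValueError("resolve multifurcations before computing")
        total += math.log(sizes[node] - 1)
    return total

def upper_tail_normal_p(statistic, n_tips):
    """One-sided p-value from the asymptotic normal approximation.

    Uses mean 1.204 * n and variance 0.168 * n as quoted in the text.
    """
    z = (statistic - 1.204 * n_tips) / math.sqrt(0.168 * n_tips)
    return 0.5 * math.erfc(z / math.sqrt(2))
```

Cherries contribute log(1) = 0, so the statistic is driven by how tips are distributed among the larger clades.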
Estimating power and type I error rates
The power of each method to detect branching rate heterogeneity was estimated as the proportion of tests that yielded a significant p-value, rejecting the null hypothesis, when the focal tree was generated under a non-EBR model. The type I error rate was estimated as the proportion of significant test results when the focal tree was generated under the EBR model.
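Both quantities reduce to the same calculation applied to different sets of focal trees, as the following Python sketch shows. The significance threshold of 0.05 is an assumption for illustration, since the text does not state the level used, and the function name is hypothetical.

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of tests with a p-value below the significance level.

    Applied to trees simulated under CRH or DRH this estimates power, and
    applied to trees simulated under EBR it estimates the type I error rate.
    The default alpha is an illustrative assumption.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    return sum(1 for p in p_values if p < alpha) / len(p_values)
```

For example, `rejection_rate([0.01, 0.2, 0.04, 0.5])` gives 0.5, because two of the four p-values fall below 0.05.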
Benchmarking the computation time of various methods
The computation time of each method was estimated as the average time needed to finish one test with no precomputed Monte Carlo trees. For null distributions, we generated 1000 trees in each set of simulations, and for lineage tracing parameter estimation, we generated 20 trees in each set. We estimated the computation time for binary and multifurcating trees (with lineage tracing parameters , and $k = 20$) of size 50, 250, 1250, and 6250 with 10 replicates (half from EBR and half from CRH with $c = 1$) in each combination. The benchmarking was done on a Mac mini (macOS Ventura 13.2, Apple M2) using a single core.
Empirical trees
We examined lineage tracing trees from two murine xenograft studies of tumor proliferation and metastasis (14,15). In Simeonov et al’s data (15), we subsetted the data to only examine subtrees from the primary tumor site (both individual clone trees and the composite tree that aggregates the clone trees together). For Yang et al’s data (14), we used the lineage tracing trees with no metastases. We excluded trees with fewer than 50 tips. We obtained 12 trees from Simeonov et al and 70 trees from Yang et al. For both studies, we used the tree structures originally inferred by the authors. Simeonov et al used TreeUtils (52) to infer the maximum parsimony phylogeny via PHYLIP (53). Yang et al reconstructed tumor phylogeny using Cassiopeia (45). We used customized scripts and the R package “ape” to parse the phylogenies deposited by the authors into “phylo” class objects in R (54).
Funding information
This study is funded by the NIH grant #1DP2CA280623-01.
Footnotes
Competing interests
All authors declare no competing interests.
Data availability
Full data and scripts used in this study will be publicly available on a repository TBD upon paper acceptance. Key data and scripts to reproduce the analyses, figures and tables are included in the supplementary material.
Bibliography
- 1. Reeves MQ, Kandyba E, Harris S, Del Rosario R, Balmain A. Multicolour lineage tracing reveals clonal dynamics of squamous carcinoma evolution from initiation to metastasis. Nat Cell Biol. 2018 Jun;20(6):699–709.
- 2. Gaglia G, Kabraji S, Rammos D, Dai Y, Verma A, Wang S, et al. Temporal and spatial topography of cell proliferation in cancer. Nat Cell Biol. 2022 Mar;24(3):316–26.
- 3. Zheng X, Weigert A, Reu S, Guenther S, Mansouri S, Bassaly B, et al. Spatial Density and Distribution of Tumor-Associated Macrophages Predict Survival in Non–Small Cell Lung Carcinoma. Cancer Res. 2020 Oct 15;80(20):4414–25.
- 4. Martínez-Ruiz C, Black JRM, Puttick C, Hill MS, Demeulemeester J, Larose Cadieux E, et al. Genomic–transcriptomic evolution in lung cancer and metastasis. Nature. 2023 Apr;616(7957):543–52.
- 5. Fennell KA, Vassiliadis D, Lam EYN, Martelotto LG, Balic JJ, Hollizeck S, et al. Non-genetic determinants of malignant clonal fitness at single-cell resolution. Nature. 2022 Jan;601(7891):125–31.
- 6. Househam J, Heide T, Cresswell GD, Spiteri I, Kimberley C, Zapata L, et al. Phenotypic plasticity and genetic control in colorectal cancer evolution. Nature. 2022 Nov;611(7937):744–53.
- 7. Lewinsohn MA, Bedford T, Müller NF, Feder AF. State-dependent evolutionary models reveal modes of solid tumour growth. Nat Ecol Evol. 2023 Apr;7(4):581–96.
- 8. Salehi S, Dorri F, Chern K, Kabeer F, Rusk N, Funnell T, et al. Cancer phylogenetic tree inference at scale from 1000s of single cell genomes. Peer Community J. 2023 Jul 21;3:e63.
- 9. Chkhaidze K, Heide T, Werner B, Williams MJ, Huang W, Caravagna G, et al. Spatially constrained tumour growth affects the patterns of clonal selection and neutral drift in cancer genomic data. PLOS Comput Biol. 2019 Jul 29;15(7):e1007243.
- 10. Williams MJ, Werner B, Heide T, Curtis C, Barnes CP, Sottoriva A, et al. Quantification of subclonal selection in cancer from bulk sequencing data. Nat Genet. 2018 Jun;50(6):895–903.
- 11. Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, Chen S, et al. Accumulation of driver and passenger mutations during tumor progression. Proc Natl Acad Sci. 2010 Oct 26;107(43):18545–50.
- 12. Tilk S, Tkachenko S, Curtis C, Petrov DA, McFarland CD. Most cancers carry a substantial deleterious load due to Hill-Robertson interference. Taylor M, Przeworski M, Kuzmin E, editors. eLife. 2022 Sep 1;11:e67790.
- 13. Jones MG, Yang D, Weissman JS. New Tools for Lineage Tracing in Cancer In Vivo. Annu Rev Cancer Biol. 2023 Apr 11;7:111–29.
- 14. Yang D, Jones MG, Naranjo S, Rideout WM, Min KH (Joseph), Ho R, et al. Lineage tracing reveals the phylodynamics, plasticity, and paths of tumor evolution. Cell. 2022 May 26;185(11):1905–1923.e25.
- 15. Simeonov KP, Byrns CN, Clark ML, Norgard RJ, Martin B, Stanger BZ, et al. Single-cell lineage tracing of metastatic cancer reveals selection of hybrid EMT states. Cancer Cell. 2021 Aug 9;39(8):1150–1162.e9.
- 16. Eyler CE, Matsunaga H, Hovestadt V, Vantine SJ, van Galen P, Bernstein BE. Single-cell lineage analysis reveals genetic and epigenetic interplay in glioblastoma drug resistance. Genome Biol. 2020 Jul 15;21(1):174.
- 17. Schiffman JS, D’Avino AR, Prieto T, Pang Y, Fan Y, Rajagopalan S, et al. Defining ancestry, heritability and plasticity of cellular phenotypes in somatic evolution [Internet]. bioRxiv; 2023 [cited 2024 Jun 21]. p. 2022.12.28.522128. Available from: 10.1101/2022.12.28.522128v2
- 18. Quinn JJ, Jones MG, Okimoto RA, Nanjo S, Chan MM, Yosef N, et al. Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts. Science. 2021 Feb 26;371(6532):eabc1944.
- 19. Zhang W, Bado IL, Hu J, Wan YW, Wu L, Wang H, et al. The bone microenvironment invigorates metastatic seeds for further dissemination. Cell. 2021 Apr 29;184(9):2471–2486.e20.
- 20. Lamprecht S, Schmidt EM, Blaj C, Hermeking H, Jung A, Kirchner T, et al. Multicolor lineage tracing reveals clonal architecture and dynamics in colon cancer. Nat Commun. 2017 Nov 10;8(1):1406.
- 21. Werner B, Traulsen A, Sottoriva A, Dingli D. Detecting truly clonal alterations from multi-region profiling of tumours. Sci Rep. 2017 Mar 27;7(1):44991.
- 22. Liu X, Zhang K, Kaya NA, Jia Z, Wu D, Chen T, et al. Tumor phylogeography reveals block-shaped spatial heterogeneity and the mode of evolution in Hepatocellular Carcinoma. Nat Commun. 2024 Apr 12;15:3169.
- 23. Jiang X, Tomlinson IPM. Why is cancer not more common? A changing microenvironment may help to explain why, and suggests strategies for anti-cancer therapy. Open Biol. 2020 Apr 15;10(4):190297.
- 24. Scott JG, Maini PK, Anderson ARA, Fletcher AG. Inferring Tumor Proliferative Organization from Phylogenetic Tree Measures in a Computational Model. Syst Biol. 2020 Jul;69(4):623–37.
- 25. Schwarz RF, Trinh A, Sipos B, Brenton JD, Goldman N, Markowetz F. Phylogenetic Quantification of Intra-tumour Heterogeneity. PLOS Comput Biol. 2014 Apr 17;10(4):e1003535.
- 26. Lynch AR, Arp NL, Zhou AS, Weaver BA, Burkard ME. Quantifying chromosomal instability from intratumoral karyotype diversity using agent-based modeling and Bayesian inference. Marston AL, Akhmanova A, Graham TA, editors. eLife. 2022 Apr 5;11:e69799.
- 27. Seidel S, Stadler T. TiDeTree: a Bayesian phylogenetic framework to estimate single-cell trees and population dynamic parameters from genetic lineage tracing data. Proc R Soc B Biol Sci. 2022 Nov 9;289(1986):20221844.
- 28. Lin D, Li X, Moult E, Park P, Tang B, Shen H, et al. Time-tagged ticker tapes for intracellular recordings. Nat Biotechnol. 2023 May;41(5):631–9.
- 29. Choi J, Chen W, Minkina A, Chardon FM, Suiter CC, Regalado SG, et al. A time-resolved, multi-symbol molecular recorder via sequential genome editing. Nature. 2022 Aug;608(7921):98–107.
- 30. Liu K, Deng S, Ye C, Yao Z, Wang J, Gong H, et al. Mapping single-cell-resolution cell phylogeny reveals cell population dynamics during organ development. Nat Methods. 2021 Dec;18(12):1506–14.
- 31. Feng J, DeWitt WS III, McKenna A, Simon N, Willis AD, Matsen FA IV. Estimation of cell lineage trees by maximum-likelihood phylogenetics. Ann Appl Stat. 2021 Mar;15(1):343–62.
- 32. Prillo S, Ravoor A, Yosef N, Song YS. ConvexML: Scalable and accurate inference of single-cell chronograms from CRISPR/Cas9 lineage tracing data. bioRxiv. 2023 Dec 3;2023.12.03.569785.
- 33. Yule GU. II.—A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F. R. S. Philos Trans R Soc Lond Ser B Contain Pap Biol Character. 1925 Jan;213(402–410):21–87.
- 34. Neher RA, Russell CA, Shraiman BI. Predicting evolution from the shape of genealogical trees. McVean G, editor. eLife. 2014 Nov 11;3:e03568.
- 35. Blum MGB, François O. Which Random Processes Describe the Tree of Life? A Large-Scale Study of Phylogenetic Tree Imbalance. Syst Biol. 2006 Aug 1;55(4):685–91.
- 36. Lemant J, Le Sueur C, Manojlović V, Noble R. Robust, Universal Tree Balance Indices. Syst Biol. 2022 Sep 1;71(5):1210–24.
- 37. Sackin MJ. “Good” and “Bad” Phenograms. Syst Biol. 1972 Jul 1;21(2):225–6.
- 38. Fill JA. On the distribution of binary search trees under the random permutation model. Random Struct Algorithms. 1996;8(1):1–25.
- 39. Bortolussi N, Durand E, Blum M, François O. apTreeshape: statistical analysis of phylogenetic tree shape. Bioinformatics. 2006 Feb 1;22(3):363–4.
- 40. Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, Gronroos E, et al. Intratumor Heterogeneity and Branched Evolution Revealed by Multiregion Sequencing. N Engl J Med. 2012;366(10):883–92.
- 41. Casasent AK, Schalck A, Gao R, Sei E, Long A, Pangburn W, et al. Multiclonal Invasion in Breast Tumors Identified by Topographic Single Cell Sequencing. Cell. 2018 Jan 11;172(1):205–217.e12.
- 42. Alves JM, Prado-López S, Cameselle-Teijeiro JM, Posada D. Rapid evolution and biogeographic spread in a colorectal cancer. Nat Commun. 2019 Nov 13;10(1):5139.
- 43. Schwartz R, Schäffer AA. The evolution of tumour phylogenetics: principles and practice. Nat Rev Genet. 2017 Apr;18(4):213–29.
- 44. Noble R, Burri D, Le Sueur C, Lemant J, Viossat Y, Kather JN, et al. Spatial structure governs the mode of tumour evolution. Nat Ecol Evol. 2022 Feb;6(2):207–17.
- 45. Jones MG, Khodaverdian A, Quinn JJ, Chan MM, Hussmann JA, Wang R, et al. Inference of single-cell phylogenies from lineage tracing data using Cassiopeia. Genome Biol. 2020 Apr 14;21(1):92.
- 46. Vaghi C, Rodallec A, Fanciullino R, Ciccolini J, Mochel JP, Mastri M, et al. Population modeling of tumor growth curves and the reduced Gompertz model improve prediction of the age of experimental tumors. PLOS Comput Biol. 2020 Feb 25;16(2):e1007178.
- 47. Harvey MG, Rabosky DL. Continuous traits and speciation rates: Alternatives to state-dependent diversification models. Methods Ecol Evol. 2018;9(4):984–93.
- 48. Maddison WP, Midford PE, Otto SP. Estimating a Binary Character’s Effect on Speciation and Extinction. Syst Biol. 2007 Oct 1;56(5):701–10.
- 49. FitzJohn RG. Diversitree: comparative phylogenetic analyses of diversification in R. Methods Ecol Evol. 2012;3(6):1084–92.
- 50. Lewitus E, Morlon H. Characterizing and Comparing Phylogenies from their Laplacian Spectrum. Syst Biol. 2016 May 1;65(3):495–507.
- 51. Slowinski JB, Guyer C. Testing whether certain traits have caused amplified diversification: an improved method based on a model of random speciation and extinction. Am Nat. 1993 Dec;142(6):1019–24.
- 52. McKenna A, Findlay GM, Gagnon JA, Horwitz MS, Schier AF, Shendure J. Whole-organism lineage tracing by combinatorial and cumulative genome editing. Science. 2016 Jul 29;353(6298):aaf7907.
- 53. Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.6. Department of Genome Sciences, University of Washington, Seattle; 2004.
- 54. Paradis E, Claude J, Strimmer K. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004 Jan 22;20(2):289–90.