Abstract
Motivation
Copy number alterations are driving forces of tumour development and the emergence of intra-tumour heterogeneity. A comprehensive picture of these genomic aberrations is therefore essential for the development of personalised and precise cancer diagnostics and therapies. Single-cell sequencing offers the highest resolution for copy number profiling down to the level of individual cells. Recent high-throughput protocols allow for the processing of hundreds of cells through shallow whole-genome DNA sequencing. The resulting low read-depth data poses substantial statistical and computational challenges to the identification of copy number alterations.
Results
We developed SCICoNE, a statistical model and MCMC algorithm tailored to single-cell copy number profiling from shallow whole-genome DNA sequencing data. SCICoNE reconstructs the history of copy number events in the tumour and uses these evolutionary relationships to identify the copy number profiles of the individual cells. We show the accuracy of this approach in evaluations on simulated data and demonstrate its practicability in applications to two breast cancer samples from different sequencing protocols.
Availability and implementation
SCICoNE is available at https://github.com/cbg-ethz/SCICoNE.
1 Introduction
During tumour progression, cancer cells undergo complex and diverse genomic aberrations leading to heterogeneous cell populations and multiple evolving subclones (Greaves and Maley 2012, Nik-Zainal et al. 2012, Yates and Campbell 2012). Genomic sequencing has been powerfully used to uncover the heterogeneity across cancer types (Burrell and Swanton 2016) and to examine the link between tumour diversity and progression (Maley et al. 2006, Marusyk and Polyak 2010, Merlo et al. 2010). Heterogeneity may also allow the tumour additional ways to evolve resistance under treatment, such that intra-tumour genomic diversity is a cause of relapse and treatment failure (Burrell et al. 2013, McGranahan and Swanton 2015).
The need for a comprehensive understanding of the composition of each tumour for more precise and effective cancer therapies (Allison and Sledge 2014) can be addressed by sequencing tumours at the resolution of individual cells (Wang and Navin 2015, Gawad et al. 2016). The small amount of DNA in single cells has to be amplified before sequencing, which leads to characteristic and pronounced noise in the read count data. Specialized phylogenetic methods for single-cell sequencing data accounting for these noise patterns have been developed to detect point mutations and reconstruct the evolutionary history of tumours (Kuipers et al. 2017a; Zafar et al. 2018).
In addition to point mutations, cancerous cells often undergo more complex genomic rearrangements, including copy number alterations (CNAs) such as amplifications and deletions. Since the first single-cell DNA sequencing (Navin et al. 2011), which allowed for the identification of CNAs, rapid progress has led to high-throughput methods (Zahn et al. 2017, Laks et al. 2019) that can profile the whole genome of hundreds of single cells. By removing the need for pre-amplification, the protocols of Zahn et al. (2017) and Laks et al. (2019) have particularly uniform coverage, allowing a resolution of CNAs down to the megabase scale. A commercial solution from 10x Genomics was also available for a time, likewise enabling the processing of hundreds of single cells, and was used in a large-scale clinical project (Irmisch et al. 2021). The resolution and depth of the data depend heavily on the technology used and on the amount of sequencing subsequently performed.
A number of computational tools have been applied or specifically developed for copy number calling in single-cell DNA sequencing data (Garvin et al. 2015, Lai et al. 2016, Wang et al. 2018, Dong et al. 2020, Wang et al. 2020, Zaccaria and Raphael 2021, Ruohan et al. 2022) with reviews and comparisons in overview papers (Mallory et al. 2020a,b). The output of these methods is a copy number profile, a partition of the genome into segments, and a sequence of integers indicating the copy number state of each segment. While some of these tools pool information across cells (Wang et al. 2020, Zaccaria and Raphael 2021), the tools above do not take into account the shared evolutionary history of cells originating from the same individual or tumour. As previously shown for point mutations, the evolutionary relationships of tumour cells can be used to boost and correct the weak and noisy signal provided by single-cell sequencing reads (Singer et al. 2018). However, modelling and inference of evolutionary histories from single-cell data is notably more involved in the case of CNAs, as such events may physically overlap so that treating the events as independent may no longer be appropriate (Mallory et al. 2020b).
Methods for jointly calling copy numbers and reconstructing event histories are therefore currently under active development. Previous methods developed for inferring evolutionary histories based on copy numbers assume that copy number profiles of the individual cells are predefined. The methods can be separated into two categories, distance-based approaches (Navin et al. 2011, Schwarz et al. 2014, Kaufmann et al. 2022) and methods that reconstruct the evolutionary history based on a maximum parsimony heuristic (Chowdhury et al. 2014, El-Kebir et al. 2017, Gertz et al. 2016, Wang et al. 2021). Further distinction can be made by the way identical copy number changes of neighbouring segments are interpreted, as separate events (McPherson et al. 2016), a single event (Schwarz et al. 2014, El-Kebir et al. 2017, Kaufmann et al. 2022), or a hybrid thereof in which deletions/amplifications can either occur separately on individual segments or affect whole chromosomes or genomes (Chowdhury et al. 2014, Gertz et al. 2016). A limitation shared by all these methods is that errors in the copy number profiles will propagate into the inference of the copy-number trees.
Our approach, SCICoNE (https://github.com/cbg-ethz/SCICoNE), directly integrates the inference of copy number profiles with the reconstruction of copy number event histories tailored to the shallow read-depth of whole-genome single-cell sequencing data. Other tree-based methods have also been developed (Markowska et al. 2022, Liu et al. 2024): CONET (Markowska et al. 2022) builds the tree based on the likelihood of the presence or absence of breakpoints (locations in the genome marking the start or end of a CNA) and calls copy numbers in post-processing using a discrepancy penalty, although this does not directly model the evolution of copy numbers. NestedBD (Liu et al. 2024) instead works in the cell lineage space, treating copy numbers in bins independently, and also calls copy numbers in a later inference step. We therefore benchmark SCICoNE against a suite of eight alternatives, including tree-based methods, and show strong improvements in performance, especially in the low read-depth and high noise regime typical of very shallow whole-genome single-cell sequencing, used for example in Irmisch et al. (2021). We further use SCICoNE to examine the copy number history of a xenograft breast cancer sample, and a breast cancer tissue sample processed with the 10x Genomics platform.
2 Materials and methods
2.1 Model overview
During tumour evolution, CNAs can occur and accumulate in cancer cell subpopulations. Since all cells in a tumour are related through sharing a common ancestor, CNAs occur on a cell lineage tree and encode the copy number profiles of each clone or cell (Supplementary Fig. S1a). For clinical applications, the history of copy number events is often more pertinent than the fully resolved cell genealogy, as the former is sufficient to determine which CNAs are mutually exclusive or co-occurring in the same tumour subclone, and to infer the order of CNA events. We therefore move from the cell or clone lineage tree to the CNA tree (Supplementary Fig. S1b) to model the evolutionary history of the tumour and call copy numbers.
With the whole-genome sequencing of single cells (10x Genomics, Zahn et al. 2017), we have read depth data for each cell and each bin along the genome (Fig. 1a). After correction for confounders, including read mappability and genomic GC content, for each bin (Baslan et al. 2012, Garvin et al. 2015), the corrected counts can be assumed to be proportional to the underlying copy number. Genome replication during cell division does not happen everywhere along the genome at the same time, so cells captured during division may have parts of the genome amplified because of replication rather than underlying CNAs. To prevent these effects from impeding phylogenetic reconstruction, cycling cells should also be filtered out (Salehi et al. 2023).
Figure 1.
Overview of CNA calling and tree inference with SCICoNE. (a) From single-cell DNA sequencing we obtain noisy read count data reflecting the underlying copy number profiles. (b) As a first step, we detect breakpoints to partition the genome into segments (each comprising several bins) that may experience CNAs. In this case, four breakpoints define the five segments . (c) From the data, we then infer the evolutionary history of copy number changes which accumulate (plus signifies an amplification, minus a deletion) at the nodes of an event tree. By attaching the single-cells to the event tree we obtain their CNAs by tracing the path from the root to call the copy numbers of each clone, or group of cells with the same profile (d). For example, Clone 4 has experienced two CN events after the diploid root, namely Event 1, a gain in a region spanning segments and , and Event 5, a further gain in , such that the copy number profile of Clone 4 is across the five segments.
To partition the genome into segments of consecutive bins which may have experienced CNAs, we develop a dynamic programming approach to detect breakpoints by combining evidence across the individual cells (Fig. 1b; Section 2.3). For each potential breakpoint, we compare the likelihood of a step change in copy number state to that of a constant model, and then collate the information across cells to obtain the total signal of a copy number change at that genomic position.
To account for the noise in the sequencing data, we develop a probabilistic model and an MCMC inference scheme for single-cell read counts. A major difference to MCMC schemes used for reconstructing trees of point mutations (Jahn et al. 2016) is that the infinite sites assumption, which excludes the possibility of multiple mutational hits at the same genomic site, can no longer be made (Kuipers et al. 2017b). CNAs may overlap and nest inside each other, so the model developed here allows for arbitrary violations of the infinite sites assumption and arbitrary reoccurrences of amplifications and deletions across different genomic regions. Reoccurrences are only resolved and modelled at the level of the bins of the genomic data, so that the individual break points may still occur at different genomic positions in the same bin. Along with allowing violations of the infinite sites assumption at the bin level, we explicitly model dependencies between bins as they are tied together within copy number events according to the CNA tree model (Fig. 1c).
By jointly estimating the copy number profiles of all cells and their underlying CNA tree, we leverage the evolutionary dependencies among cells for improved copy number calls. The output of SCICoNE comprises both (i) the reconstructed tree representing the evolutionary history of all cells (Fig. 1c) and (ii) the inferred copy number profile for each individual cell (Fig. 1d).
2.2 Binning and read count correction
Current protocols for copy number detection at single-cell resolution typically use shallow whole-genome sequencing (≤0.1x coverage per cell) (10x Genomics, Zahn et al. 2017), which prohibits coverage-based copy number calling at the level of individual loci. Instead, one partitions the reference genome into equal-sized bins (20–100 kb) and counts the reads per bin rather than per locus. The raw read count of a bin is determined not only by the bin’s underlying copy number state, but also by its mappability and GC content. To reduce the bias introduced by these confounders, SCICoNE uses read counts (per bin and per cell) that have been corrected for both effects (Baslan et al. 2012, Garvin et al. 2015). With these confounders removed, we now assume that the probability of reads falling into each bin is proportional to the bin’s copy number state,
| $p_{ji} \propto c_{ji}$ | (1) |
where $p_{ji}$ is the probability of a read from cell $j$ falling into bin $i$, and $c_{ji}$ is the copy number state for that bin in that cell.
2.3 Breakpoint detection to define copy number segments
As copy number changes often affect regions much larger than a typical bin size, we collate neighbouring bins into genomic segments with the same copy number state. To collate the bins into genomic segments, we first detect potential breakpoints as bin boundaries where the read depth changes across subsets of cells. For this purpose, we developed a dynamic programming approach that combines evidence across cells to call the breakpoints (Supplementary Appendix A). The detection compares a likelihood-based model of a step change in copy number at each bin to a constant copy number model for each cell and then combines the signal over all cells. Bins with the strongest combined signal relative to a noise threshold are classified as potential breakpoints.
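The idea of comparing a step-change model with a constant model per cell, and then pooling the evidence over cells, can be illustrated with a deliberately simplified sketch. It is not the dynamic programming procedure of Supplementary Appendix A: the Poisson read model, the fixed window size, and all function names are assumptions made for illustration only.

```python
from math import log


def poisson_loglik(counts, rate):
    """Poisson log-likelihood at a common rate, dropping terms free of the rate."""
    total = sum(counts)
    if rate <= 0:
        return 0.0 if total == 0 else float("-inf")
    return total * log(rate) - len(counts) * rate


def mean(values):
    return sum(values) / len(values)


def step_signal(counts, pos, window):
    """Log-likelihood ratio of a step change at `pos` versus a constant rate (one cell)."""
    left = counts[pos - window:pos]
    right = counts[pos:pos + window]
    both = left + right
    step = poisson_loglik(left, mean(left)) + poisson_loglik(right, mean(right))
    constant = poisson_loglik(both, mean(both))
    return step - constant


def combined_signal(bin_counts, window):
    """Sum the per-cell signals at every position with a full window on both sides."""
    n_bins = len(bin_counts[0])
    return [
        sum(step_signal(row, pos, window) for row in bin_counts)
        for pos in range(window, n_bins - window + 1)
    ]


# Toy data (hypothetical): two cells whose read depth doubles at bin 10.
cells = [[5] * 10 + [10] * 10, [5] * 10 + [10] * 10]
signal = combined_signal(cells, window=5)
best_position = signal.index(max(signal)) + 5  # shift by the window offset
```

In this toy example the pooled signal peaks at the bin where the depth changes; in practice the signal would be compared with a noise threshold, as described above.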
Once the breakpoints have been determined, we collate bins between consecutive breakpoints into segments. For each cell, we sum the counts in all bins belonging to each segment to arrive at a count matrix with entries for each cell and each segment .
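A minimal sketch of this collation step is given below. The convention that each breakpoint is the index of the first bin of a new segment, and the function name, are assumptions for illustration.

```python
def segment_counts(bin_counts, breakpoints):
    """Sum per-bin counts into per-segment counts for every cell.

    bin_counts: one row per cell, each a list of corrected counts per bin.
    breakpoints: sorted, distinct bin indices at which a new segment starts
                 (assumed convention).
    Returns the cell-by-segment count matrix and the segment sizes in bins.
    """
    n_bins = len(bin_counts[0])
    starts = [0] + [b for b in breakpoints if 0 < b < n_bins]
    ends = starts[1:] + [n_bins]
    sizes = [end - start for start, end in zip(starts, ends)]
    counts = [
        [sum(row[start:end]) for start, end in zip(starts, ends)]
        for row in bin_counts
    ]
    return counts, sizes


# Toy data (hypothetical): two cells, six bins, breakpoints before bins 2 and 5.
example_counts, example_sizes = segment_counts(
    [[1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]], [2, 5]
)
```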
The probability of a read falling into segment $k$ is proportional to the copy number state and the size of the segment
| $p_{jk} \propto c_{jk}\, s_k$ | (2) |
where $p_{jk}$ is the probability of a read from cell $j$ falling into segment $k$ with size $s_k$, and $c_{jk}$ is the copy number state for that segment in that cell.
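As a small worked example of this proportionality, the following sketch normalizes the product of copy number state and segment size so that the probabilities sum to one over segments. The function name and the toy values are illustrative assumptions.

```python
def segment_probabilities(copy_numbers, sizes):
    """Read probabilities per segment, proportional to copy number times segment size."""
    weights = [c * s for c, s in zip(copy_numbers, sizes)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one segment must have a positive copy number")
    return [w / total for w in weights]


# Toy cell (hypothetical): three segments of 2, 3 and 1 bins with states 2, 3 and 2.
probs = segment_probabilities([2, 3, 2], [2, 3, 1])
```

Here the middle segment receives a probability of 9/15 because it is both the largest and the most amplified.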
2.4 CNA trees
For clinical applications and copy number calling, we work with the history of copy number events represented as a CNA tree (Fig. 1c; Supplementary Fig. S1). This is in analogy to the mutation trees of SCITE (Jahn et al. 2016), where the tree nodes now encapsulate events corresponding to amplifications or deletions of the segments. The nodes in the tree can have arbitrary degree. We index the event nodes and additionally label them with the corresponding CNAs. All CNA events are stored in the vector , such that is the collection of CNAs of vertex . For the example of Fig. 1c, we have . The tree structure we can store as a parent vector where 0 represents the root.
If cell is attached to event node we can read off the expected copy number state for each segment by tracing all events back to the root for a given tree structure and event vector. In the example of Fig. 1c, the attachment vector is . We denote the expected copy number state of cell for a given attachment point as . Then we have for the probability of the reads of cell falling into segment
| (3) |
If a segment is entirely deleted so that the copy number state and probability is 0, then we would not expect to see any reads in that segment. However, due to mapping errors there may still be some residual reads in that segment. To account for this possibility, we instead set the minimum copy number state to .
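The following sketch illustrates how a copy number profile can be read off a CNA tree stored as a parent vector with events at the nodes. The dictionary-based data layout, the function names, and the floor value `min_state=1e-3` are assumptions for illustration; the minimum state actually used by SCICoNE is not reproduced here.

```python
def node_profile(node, parent, events, root_state):
    """Accumulate the events on the path from `node` to the root (node 0)."""
    profile = list(root_state)
    while node != 0:
        for segment, change in events[node].items():
            profile[segment] += change
        node = parent[node]
    return profile


def read_probabilities(profile, sizes, min_state=1e-3):
    """Segment read probabilities with fully deleted segments floored at `min_state`.

    The floor stands in for residual reads caused by mapping errors; its value
    here is a hypothetical placeholder.
    """
    weights = [max(c, min_state) * s for c, s in zip(profile, sizes)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical tree: node 1 and node 3 descend from the root, node 2 from node 1.
parent = {1: 0, 2: 1, 3: 0}
events = {1: {0: +1, 1: +1}, 2: {1: +1}, 3: {2: -2}}
root_state = [2, 2, 2]

profile_node_2 = node_profile(2, parent, events, root_state)
profile_node_3 = node_profile(3, parent, events, root_state)
probs_node_3 = read_probabilities(profile_node_3, sizes=[1, 1, 1])
```

A cell attached to node 2 inherits the gains of node 1 as well as its own, while a cell attached to node 3 has lost the third segment entirely and keeps only a small residual read probability there.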
2.5 Likelihood
To assess how well a CNA tree with its events, , fits our read count data, we define a likelihood model as follows. The data matrix entry stores the number of corrected counts cell has in segment , with total reads for that cell of . Given the probabilities of reads landing in each segment, we model the data with a Dirichlet-Multinomial distribution to account for overdispersion
| (4) |
where is the concentration parameter (inverse of the overdispersion). We scale by to aid the computational implementation since, using Equation (3), the likelihood contribution from cell is therefore
In the large limit, the model simplifies to the multinomial distribution
| (5) |
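The relationship between the two likelihoods can be checked numerically with the sketch below. It uses the standard Dirichlet-multinomial probability mass function with parameters set to a concentration `nu` times the segment probabilities; this parameterization, the variable names, and the toy numbers are assumptions and need not match the scaling used in Equation (4).

```python
from math import exp, lgamma, log


def log_multinomial(counts, probs):
    """Log probability of `counts` under a multinomial with probabilities `probs`."""
    out = lgamma(sum(counts) + 1)
    for d, p in zip(counts, probs):
        out += d * log(p) - lgamma(d + 1)
    return out


def log_dirichlet_multinomial(counts, probs, nu):
    """Log probability under a Dirichlet-multinomial with parameters nu * probs."""
    n = sum(counts)
    alphas = [nu * p for p in probs]  # assumed parameterization
    alpha_0 = sum(alphas)
    out = lgamma(alpha_0) + lgamma(n + 1) - lgamma(n + alpha_0)
    for d, a in zip(counts, alphas):
        out += lgamma(d + a) - lgamma(a) - lgamma(d + 1)
    return out


# Toy cell (hypothetical): segment counts and read probabilities.
counts = [3, 12, 6]
probs = [4 / 15, 9 / 15, 2 / 15]

overdispersed = log_dirichlet_multinomial(counts, probs, nu=10.0)
nearly_multinomial = log_dirichlet_multinomial(counts, probs, nu=1e8)
multinomial = log_multinomial(counts, probs)
```

For a large concentration the two log-likelihoods agree closely, which reflects the limiting behaviour described above.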
We can compute this likelihood for all cells and all possible attachment points efficiently in total time where is the number of cells and the number of event nodes. This time complexity is achieved with a tree traversal. During the tree traversal, at each step only the segments in one event node change their state and we only update a limited number of terms in the product. All possible attachment points of a cell can therefore be computed in linear time.
For numerical accuracy, we compute the log-likelihood of all attachment points for each cell relative to the root. When cell is attached at the root, which is node 0 so that , with a given constant ploidy , we have the score for that cell of
| (6) |
whereas for large the ploidy constant cancels and we simply have
| (7) |
Since the root is common across all tree models, this contribution only varies if is changed and is only recomputed in that case. The root does not need to have constant ploidy; e.g. the sex chromosomes can be set accordingly.
2.6 Posterior tree probability
With the likelihoods of each cell for each attachment point, we can directly compute the marginal likelihood when we average over all possible attachment points of each cell in the tree
| (8) |
with a uniform prior on the attachment points. With a noninformative prior on the tree size, and then a uniform prior also on the trees, the posterior becomes
| (9) |
where is a penalization term needed to account for combinatorial effects possible for larger trees which we detail in Supplementary Appendix B. We define and discuss the prior on the event vector in Supplementary Appendix C.
Since we marginalize out the cell attachments, we focus on the CNA tree and build a scheme to find the tree and vector with the highest posterior score. Not all combinations of trees and event vectors are meaningful. After a copy number state of 0 is reached at any point in the tree, the affected segment cannot be regained in any descendant nodes. We then assign a posterior score of 0 to any tree and event vector combination where this occurs and set .
Also, a tree with a placement of events such that it recreates the exact same genotypes repeatedly in different parts of the tree is redundant, as a simpler model using a smaller tree would fit the data equally well. Biologically such convergent evolution may occur, but as it cannot be distinguished by the data itself, we make the assumption of the more parsimonious model. In particular, if we denote to be the copy number state at node computed by collating the events in along the path to the root in the tree , we forbid the case for any different nodes and . To exclude this possibility we assign it a posterior score of 0 and set if .
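Both exclusion rules can be expressed as simple checks on the copy number states of the nodes, as in the following sketch. The data layout (a parent dictionary with root 0 and per-node event dictionaries) and the function names are assumptions for illustration; the check for negative states is included as an additional sanity condition.

```python
def node_profiles(parent, events, root_state):
    """Copy number state of every node, obtained by accumulating events from the root."""
    profiles = {0: list(root_state)}

    def resolve(node):
        if node not in profiles:
            profile = list(resolve(parent[node]))
            for segment, change in events[node].items():
                profile[segment] += change
            profiles[node] = profile
        return profiles[node]

    for node in parent:
        resolve(node)
    return profiles


def is_valid(parent, events, root_state):
    """Return False for trees that regain lost segments or repeat a genotype."""
    profiles = node_profiles(parent, events, root_state)
    for node, par in parent.items():
        for segment, change in events[node].items():
            if profiles[node][segment] < 0:
                return False  # negative copy number state
            if profiles[par][segment] == 0 and change != 0:
                return False  # a fully deleted segment cannot change again
    seen = set()
    for profile in profiles.values():
        key = tuple(profile)
        if key in seen:
            return False  # the same genotype appears at two different nodes
        seen.add(key)
    return True


# Hypothetical two-segment examples on small trees.
valid_tree = is_valid({1: 0, 2: 1}, {1: {0: +1}, 2: {1: -1}}, [2, 2])
regained = is_valid({1: 0, 2: 1}, {1: {0: -2}, 2: {0: +1}}, [2, 2])
duplicated = is_valid({1: 0, 2: 0}, {1: {0: +1}, 2: {0: +1}}, [2, 2])
```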
As an alternative to marginalization, we can also place cells at their maximal attachment position and define the score
| (10) |
which removes the need for correcting for combinatorial effects via . To distinguish between these two alternatives, we denote inference using the marginalized score, Equation (9), with the term ‘sum’ and inference using the maximized score, Equation (10), with the term ‘max’.
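The difference between the two scores can be seen in a short sketch that takes a matrix of log-likelihoods, one row per cell and one column per attachment point. Only the data term is computed; the priors and the penalization term of Equation (9) are omitted, and the function names are illustrative assumptions.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    highest = max(values)
    return highest + math.log(sum(math.exp(v - highest) for v in values))


def sum_score(loglik):
    """'sum': marginalize each cell over all attachment points with a uniform prior."""
    return sum(log_sum_exp(row) - math.log(len(row)) for row in loglik)


def max_score(loglik):
    """'max': place each cell at its best attachment point."""
    return sum(max(row) for row in loglik)


# Toy log-likelihoods (hypothetical): two cells, two attachment points.
loglik = [[0.0, math.log(3.0)], [math.log(2.0), math.log(2.0)]]
marginal = sum_score(loglik)
maximal = max_score(loglik)
```

In this toy case the marginalized score equals log 4 and the maximized score equals log 6.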
2.7 MCMC scheme
To sample trees proportionally to we build an MCMC scheme to move in the space of trees and event vectors. We detail the moves in Supplementary Appendix D starting with the simple moves of prune and reattach and label swap used in SCITE (Jahn et al. 2016) which work for a fixed tree size and fixed set of events at the nodes. We additionally weight the move proposals to account for their computational costs. To change the event vector, we introduce an add or remove events move, while to change the tree size we developed the add or remove node and condense or split nodes moves to cover the full discrete space. To aid moving in such a complex space and finding simpler trees which generate the same set of possible genotypes we also introduce a genotype-preserving prune and reattach move. Since the set of genotypes is preserved, only the event vector prior needs to be recomputed making this move computationally very efficient, so we score and sample from the whole neighbourhood. For the overdispersion determined by the concentration parameter we run a random walk (in log space). For the complete MCMC scheme, at each iteration we first sample a move type with a fixed probability for the six different moves. Each move is equally likely to be chosen apart from the genotype-preserving prune and reattach which has a relative weight of 0.4.
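The outer loop of such a scheme can be sketched as follows. The move names are shorthand for the six moves listed above, the step size of the log-space random walk is a hypothetical value, and the acceptance rule is the generic Metropolis-Hastings form; none of this is taken from the SCICoNE code, and the moves themselves are not implemented.

```python
import math
import random

# Relative weights of the six move types: equal, except for the
# genotype-preserving prune and reattach move with weight 0.4.
MOVE_WEIGHTS = {
    "prune_and_reattach": 1.0,
    "label_swap": 1.0,
    "add_or_remove_events": 1.0,
    "add_or_remove_node": 1.0,
    "condense_or_split_nodes": 1.0,
    "genotype_preserving_prune_and_reattach": 0.4,
}


def sample_move(rng):
    """Pick a move type according to the fixed relative weights."""
    names = list(MOVE_WEIGHTS)
    weights = [MOVE_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def propose_concentration(nu, rng, step=0.1):
    """Random walk in log space; `step` is a hypothetical step size."""
    return nu * math.exp(rng.gauss(0.0, step))


def accept(log_score_new, log_score_old, rng, log_hastings=0.0):
    """Generic Metropolis-Hastings acceptance on the log scale.

    `log_hastings` carries any proposal-ratio correction; it defaults to zero
    for symmetric proposals.
    """
    log_ratio = log_score_new - log_score_old + log_hastings
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)


rng = random.Random(0)
move = sample_move(rng)
new_nu = propose_concentration(10.0, rng)
```

With these weights the genotype-preserving move is proposed with probability 0.4/5.4, roughly 7% of iterations.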
2.8 Maximal search
We utilize the MCMC scheme to search for the tree with the highest posterior score, keeping track of the maximally scoring tree encountered. From the best scoring tree discovered, we can compute the best attachment of each cell and hence their predicted copy number profiles.
To help find the best scoring tree and event vectors, we work stepwise. First we cluster the data [using PhenoGraph (Levine et al. 2015)] and construct the average read counts per cluster. The tree likelihood computation then only involves running over each cluster (weighted by the cluster size) rather than each cell, giving a large computational speed-up. We initialize the states as the normalized read counts rounded to integers, and the tree by performing hierarchical clustering with Ward’s method and filling the events at the internal nodes with our extract common ancestor procedure. We run 10 different chains and, as a robustness check, require that at least half the chains return trees with similarly high scores (log posterior within 5 of the maximum). Nonrobust runs are repeated starting from the highest scoring tree found so far. Once the tree on the clustered data has been determined, we run the scheme on the full data, starting from the cluster tree and with the overdispersion learned from the full data with the cluster tree. We compute the distances between the trees returned by the half of the chains that attain the best scores in the full data, and require for robustness that the average distance is <0.02. We stop the scheme on the full data once either of the robustness criteria is met.
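The two robustness criteria can be written as short checks, sketched below. The function names and the toy scores are illustrative assumptions; the thresholds of 5 log-posterior units and an average distance of 0.02 are those stated above.

```python
def scores_are_robust(log_posteriors, tolerance=5.0, min_fraction=0.5):
    """True if at least `min_fraction` of chains score within `tolerance` of the best."""
    best = max(log_posteriors)
    close = sum(1 for score in log_posteriors if best - score <= tolerance)
    return close >= min_fraction * len(log_posteriors)


def trees_are_robust(pairwise_distances, threshold=0.02):
    """True if the average pairwise tree distance among the best chains is below threshold."""
    return sum(pairwise_distances) / len(pairwise_distances) < threshold


# Toy log posteriors (hypothetical) from 10 chains: five lie within 5 of the maximum.
chain_scores = [-100.0, -101.0, -102.5, -104.0, -105.0, -120.0, -130.0, -140.0, -150.0, -160.0]
robust = scores_are_robust(chain_scores)
```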
2.9 Root state
Instead of assuming a diploid state at the root, it is easy to start with any other state, e.g. tetraploid corresponding to a whole genome duplication. Although the average ploidy is not well defined for read depth data, because CNAs are integer, the differences induced can help determine the underlying states. Under the presence of further CNAs, the model comparison of the tree with a diploid root against that with a tetraploid root potentially allows identification of genome doubling relevant for tumour progression and prognosis (Bielski et al. 2018). We identify such a doubling in the 10x Genomics breast cancer dataset.
2.10 Simulation settings
For the simulation we consider a genome consisting of 10 000 bins and different numbers of reads per cell with an average of 2, 4, or 8 reads per bin. This is a relatively low read depth per bin, chosen to make the inference more challenging and comparable to the data from the 10x Genomics protocol with their default bin size of 20 kb.
For trees with size , we partitioned the bins into 40 or 80 segments. We sampled the tree structure uniformly and then, at each node in the tree, we sampled the change in copy number by first sampling the number of segments involved from a Poisson distribution with parameter and adding 1, while for each segment we sampled the number of copies from a Poisson distribution with parameter , again adding 1, and sampled the sign uniformly. Trees that violate biological constraints (e.g. having negative copy number states) were rejected and resampled. For each setting, we sampled 40 trees.
For each tree, we sample 400 cells, in line with the typical number of cells targeted with the 10x Genomics protocol for a single sequencing run, and sample their attachment point in the tree uniformly. This provides the true copy number profile of each cell, from which the reads per bin were sampled with a Dirichlet-multinomial with concentration .
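The event sampling described above can be sketched as follows. The Poisson parameters passed in the example are hypothetical, and several details are assumptions made for illustration: affected segments are drawn at random without replacement, the sign is drawn per segment, and the Poisson sampler is Knuth's multiplication method.

```python
import math
import random


def poisson(lam, rng):
    """Knuth's multiplication method for a Poisson draw; adequate for small `lam`."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_event(n_segments, lam_segments, lam_copies, rng):
    """Sample one node's event: affected segments with signed copy number changes."""
    n_affected = min(poisson(lam_segments, rng) + 1, n_segments)
    affected = rng.sample(range(n_segments), n_affected)  # assumed: random segments
    event = {}
    for segment in affected:
        copies = poisson(lam_copies, rng) + 1
        sign = rng.choice((-1, 1))  # assumed: sign drawn per segment
        event[segment] = sign * copies
    return event


rng = random.Random(1)
event = sample_event(n_segments=40, lam_segments=1.0, lam_copies=1.0, rng=rng)
```

Events sampled in this way would then be placed on a tree and rejected if they lead to invalid states such as negative copy numbers.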
For inferring the copy number profiles with our MCMC scheme, we first detected the breakpoints automatically (Supplementary Appendix A with ) to generate the segments. For our MCMC scheme we set and ran chains of length 4000, repeatedly until robust results were obtained, to find first the trees on the clustered data and then on the full data. We find the best trees both for the posterior score from the summation of Equation (9) and for the maximization of Equation (10).
For comparison we include HMMcopy (Ha et al. 2012, Lai et al. 2016) which runs a hidden Markov model on each cell to call the breakpoints and the copy number states which are inferred per bin for each cell individually. For single-cell low-read count data, however, default parameters of HMMcopy should be adjusted and we followed the recommendations of Mallory et al. (2020a) (log-transformed counts, , , strength of ). We also include Ginkgo (Garvin et al. 2015), which offers single-cell copy number calling, adapted to deal with cell-by-bin read count matrices and with the baseline, maximum and minimum ploidy set to 2, 6, and 0, respectively, and SCOPE (Wang et al. 2020), which we initialize by setting the top 10% cells with lowest Gini coefficient as diploid, and allow at most 20 copy number clones.
We further consider CONET (Markowska et al. 2022), a tree-based method where the main difference between SCICoNE and CONET is that CONET does not explicitly model integer copy number changes in their tree nodes, but rather models breakpoint presences or absences as the events in the tree. Copy numbers are called in a post-processing step of mapping to the nearest integer, and while the breakpoints share an evolutionary history, the copy number states are not assumed to, unlike in SCICoNE. The input data for CONET is correspondingly not a cells by regions matrix, but rather the difference in normalized read counts between subsequent bins. For the simulations we followed the recommendations in Markowska et al. (2022) and set , , s1=100 000, s2=100 000, and , and set the number of parallel tempering iterations to 200 000. In case the method took >24 h to run, we reran with their default setting of 100 000 iterations. CONET also requires candidate breakpoints as input for which we used the scheme provided by their software which relies on CBS (Olshen et al. 2004) and MergeLevels (Willenbrock and Fridlyand 2005) for genome segmentation.
We also include NestedBD (Liu et al. 2024), which takes as input the single-cell copy number profiles learned by Ginkgo and finds a lineage tree relating them. To enable inference on the 400 simulated single-cell profiles, we first apply hierarchical clustering on the Ginkgo copy number profiles to obtain 20 copy number clusters. After the NestedBD tree inference on the cluster medians, we use the corrected copy numbers per cluster provided by the method to obtain the single-cell copy number profiles, and compare with the ground truth. We used 100 000 MCMC iterations to fit our computational budget and discarded the default first 20% as burn-in. We obtained the maximum clade credibility tree using BEAST (Bouckaert et al. 2014) and extracted the variance for the normal error model by taking the mean of the posterior samples of the parameter ‘VR’ squared. The NestedBD code additionally required an ‘orig_dist’ variable which we set as the posterior mean of ‘OrigTime’ from BEAST.
We additionally include a diploid profile and two clustering methods. For the clustering we use both PhenoGraph (Levine et al. 2015) and hierarchical clustering (with the number of clusters chosen according to the Calinski-Harabasz index) and assigned all cells in each cluster the average read counts of that cluster. For each cluster and its associated cells, we then set the median counts per cell to 2 (by doubling and dividing by the median) assuming that diploid is the most common state and rounded the copy number states to the nearest integer. In the comparison, we compute as the root mean squared difference over all cells and bins, between the true copy number profiles and the inferred values.
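The normalization used for the clustering baselines and the error measure can be sketched as below. The function names and toy numbers are illustrative assumptions; note that Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from other rounding conventions.

```python
from math import sqrt
from statistics import median


def call_cluster_copy_numbers(mean_counts):
    """Scale a cluster's average counts so that the median maps to 2, then round."""
    centre = median(mean_counts)
    return [round(2 * x / centre) for x in mean_counts]


def rmse(true_profiles, inferred_profiles):
    """Root mean squared difference over all cells and bins."""
    squared, n = 0.0, 0
    for true_row, inferred_row in zip(true_profiles, inferred_profiles):
        for t, i in zip(true_row, inferred_row):
            squared += (t - i) ** 2
            n += 1
    return sqrt(squared / n)


# Toy cluster (hypothetical): the median depth of 10 reads corresponds to two copies.
calls = call_cluster_copy_numbers([10, 10, 15, 10, 5])
error = rmse([[2, 2], [3, 1]], [[2, 2], [2, 1]])
```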
For the comparison, we used the distance between the inferred and simulated copy number states, rather than a distance between the structures of the inferred and simulated trees since none of the alternative methods generates a CNA tree. The distance works on the set of predicted genotypes, so that errors in the tree structure which generate incorrect genotypes for cells will be picked up by this distance measure.
As a further measure we adapt the path difference metric (Steel and Penny 1993), previously used for clone and mutation trees (Yuan et al. 2015, Jahn et al. 2016), to CNA trees. Specifically we define a distance between a pair of cells on tree along the shortest path between the cells through their most recent common ancestor. Along the path we count the total number of events encountered, with each event weighted by the number of bins it affects. Equivalently for each bin we count how many times it changes state, and by how much, along the path and sum over all bins:
| (11) |
where is the number of bins, the number of segments and the shortest path between cells and on tree . We normalize the distance to count in units of ‘typical’ events by scaling by the expected size of an event (). To compare an inferred tree to the true generating tree we compute the root mean square difference between the (triangular) distance matrices for the two trees
| (12) |
When log-transforming for plots, to avoid numerical issues we add an offset of corresponding to a difference of a single event between one pair of cells.
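A sketch of this distance is given below for cells attached to the nodes of a CNA tree. The data layout, the function names, and the optional `scale` argument standing in for the expected event size are assumptions for illustration; the log-transform offset is not included.

```python
from math import sqrt


def path_to_root(node, parent):
    """Nodes from `node` up to and including the root (node 0)."""
    path = [node]
    while node != 0:
        node = parent[node]
        path.append(node)
    return path


def node_distance(a, b, parent, events, sizes, scale=1.0):
    """Size-weighted number of copy number changes along the path between two nodes.

    Events sit on the nodes, so the path through the most recent common
    ancestor consists of the nodes that are ancestors of exactly one of a and b.
    """
    on_path = set(path_to_root(a, parent)) ^ set(path_to_root(b, parent))
    total = 0
    for node in on_path:
        for segment, change in events[node].items():
            total += abs(change) * sizes[segment]
    return total / scale


def tree_distance_rms(attach_true, tree_true, attach_inferred, tree_inferred, sizes):
    """Root mean square difference between the pairwise cell distance matrices."""
    n = len(attach_true)
    squared, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d_true = node_distance(attach_true[i], attach_true[j], *tree_true, sizes)
            d_inf = node_distance(attach_inferred[i], attach_inferred[j], *tree_inferred, sizes)
            squared += (d_true - d_inf) ** 2
            pairs += 1
    return sqrt(squared / pairs)


# Hypothetical tree with three segments spanning 10, 20 and 5 bins.
parent = {1: 0, 2: 1, 3: 0}
events = {1: {0: +1, 1: +1}, 2: {1: +1}, 3: {2: -2}}
sizes = [10, 20, 5]
d_23 = node_distance(2, 3, parent, events, sizes)
```

In the example, the path between nodes 2 and 3 passes through node 1 and the root, so the distance sums the size-weighted events of nodes 1, 2 and 3.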
2.11 Convergence analysis
Since searching and sampling in the space of event trees can be challenging, we examine the convergence of SCICoNE by running multiple chains and comparing their outputs. Across 10 chains we retain the half with the highest posterior or maximized likelihood score and compare their similarity with two metrics: (i) the average pairwise tree distances among the five trees, and (ii) their average difference in log score from the best tree. We ran several rounds of shorter chains (10 000 iterations) for the simulation setting of 20 nodes and 4 reads per bin. As the base case, we generated the data from uniformly sampled trees and started the tree inference from the tree learned from the clustered data. This process is repeated 40 times. The separate runs converge to similar solutions after several rounds (Supplementary Fig. S8, middle column), both in terms of tree distances and their scores.
If we generate data from linear trees instead of uniformly sampled random trees, the convergence is very similar, though there are a few simulations that require some additional rounds (Supplementary Fig. S8, left column). If we start the inference from a random tree instead of using the result from the clustered data, we observe very similar convergence, though with slightly larger score differences in the early rounds (Supplementary Fig. S8, right column). These observations suggest the scheme is robust to different types of evolutionary trees and the initialization of the inference. Finally, we also considered smaller trees with 10 nodes, where convergence is notably faster (Supplementary Fig. S8, bottom row), as would be expected given the smaller search space for trees with fewer nodes.
3 Results
3.1 Reconstructing copy number trees from tumour data
As a first demonstration of SCICoNE on whole-genome single-cell sequencing data, we considered a dataset of 260 cells of a triple negative breast cancer xenografted into a mouse (SA501X3F) (Zahn et al. 2017). The read-count data consists of 18 175 bins of 150 kb size, with a mean coverage of 153 reads per bin per cell (standard deviation of 82). The segmentation algorithm of SCICoNE (Section 2.3, with default parameters) revealed 316 candidate breakpoints.
The reconstructed tree (Fig. 2) starts with a large number of CNAs and then branches into two main populations which further differentiate into different clones. The CNA events are learned at the segment, and hence bin, level, onto which we then map genes through their genomic coordinates to describe CNAs at the gene level. The first node includes a deletion in TP53, which is a common early driver of triple negative breast cancer (Koboldt et al. 2012). The first node also includes deletions in the MAPK genes (which are on different chromosomes) along with an amplification in PIK3CA and AKT1. Aberrations of the MAPK pathway have been linked to tumourigenesis, for which crosstalk with an activated PI3K/AKT pathway has been further implicated (Hashimoto et al. 2014), aligning with the copy number state of the ancestor node in the tree. Notably, AKT1 undergoes repeated amplifications, while ARID1B and ESR1 (both on chromosome arm 6q) are first amplified at the root and later undergo a deletion in one sub-branch. The genes NTRK3 and TBX3 instead each experience deletions in two parallel lineages. This may indicate chromosomal instability leading to repeated CNAs and underlines the importance of allowing such recurrence in the evolutionary modelling. Overall, the tree highlights the complex evolutionary history of the sample.
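The mapping from bin-level events to genes can be illustrated with a short sketch. It assumes fixed-width bins numbered consecutively along each chromosome; the function names, the chromosome label, the gene coordinates and the event values are hypothetical and chosen only to show the index arithmetic, and averaging over the overlapped bins is one possible summary rather than the one necessarily used for Fig. 2.

```python
def bins_overlapping(start, end, bin_size):
    """Indices of fixed-width bins overlapped by the half-open interval [start, end)."""
    first = start // bin_size
    last = (end - 1) // bin_size
    return range(first, last + 1)


def gene_copy_number_change(gene, chrom_offsets, bin_changes, bin_size=150_000):
    """Average copy number change over the bins that a gene overlaps.

    gene          : (chromosome, start, end) in base pairs
    chrom_offsets : index of the first bin of each chromosome in bin_changes
    bin_changes   : copy number change per bin at one node of the tree
    """
    chrom, start, end = gene
    offset = chrom_offsets[chrom]
    idx = [offset + b for b in bins_overlapping(start, end, bin_size)]
    return sum(bin_changes[i] for i in idx) / len(idx)


# Toy example: a made-up gene spanning bins 1 to 3 of a made-up chromosome,
# all of which carry a single-copy deletion at this node.
toy_gene = ("chrA", 290_000, 460_000)
toy_change = gene_copy_number_change(toy_gene, {"chrA": 0}, [0, -1, -1, -1, 0])
```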
Figure 2.
Inferred tree for 260 cells from a breast xenograft (Zahn et al. 2017). Inside the nodes of the CNA tree we highlight the total number of amplification or deletion events (in parentheses), the genes which are affected [amongst all genes from the COSMIC Cancer Gene Census (Sondka et al. 2018), with the 33 associated with breast cancer highlighted], including how much they are amplified or deleted, and the number of cells that best attach to each node. The CNAs are not displayed at the grey (leaf) nodes where only the number of cells attached is indicated. The number of cells attached to the leaf and internal nodes combine to the 260 cells in total. Example profiles of cells attaching to the two nodes with coloured (thick) borders (one from the largest subclone and one from the other main lineage) are displayed in Fig. 3c.
The inferred copy number profiles (Fig. 3b) are in good agreement with the normalized counts per bin (Fig. 3a), recapitulating the CNAs across cells while accounting for the noise in the raw data and the phylogenetic relationships between the cells. Joint inference of copy number profiles and their evolutionary history provides increased power to separate signal from noise, as emphasized by comparing the raw count and inferred copy number profiles of example cells (Fig. 3c). While some small changes for individual cells visible in the normalized counts (Fig. 3a) are filtered out, most changes occurring even in a small number of cells are detected (Fig. 3b). Overall, we obtain the main CNAs across cells as well as their phylogeny (Fig. 2).
Figure 3.
Inferred copy number profiles for 260 cells from a breast xenograft (Zahn et al. 2017). (a) Normalized counts per bin, ordered according to the tree in Fig. 2. (b) Copy number profiles estimated jointly with the CNA tree of Fig. 2. (c) Two examples of raw count data (black dots) and inferred copy number profiles (coloured lines) of the two cells indicated by arrows in the heatmaps.
To examine the relationship between CNAs and RNA expression changes at the single-cell level, we analysed the xenograft passage SA501X2B, which underwent single-cell RNA sequencing (Campbell et al. 2019), and compared the two molecular profiles. By smoothing the RNA signal (Tirosh et al. 2016), we observe common strips of over- and underexpression (Supplementary Fig. S2). Comparing to the DNA summarized at the gene level (by averaging over the bins in each of the 6000 expressed genes), we find some agreement between genomic copy number changes and expression levels (Supplementary Fig. S2). However, quite a few of the signals visible in the RNA data, for example on chromosome 19, have no basis in the DNA. The concordance and discrepancies between RNA expression and DNA count levels are further emphasized if we cluster the cells (Supplementary Fig. S3). The finer structure and clear breakpoints visible in the DNA, which are reflected in the inferred evolutionary history (Fig. 2) and copy number profiles (Fig. 3), are not well reflected in the RNA (Supplementary Fig. S2). Moreover, the correlation between RNA expression profiles and DNA read counts, while present, is fairly weak (Supplementary Fig. S4). The results confirm at the level of individual cells that expression profiles are not in a strict one-to-one correspondence with copy number profiles and that the latter can be inferred with higher accuracy and in more detail from DNA data.
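As an illustration of this kind of comparison, the sketch below smooths a per-gene expression signal along the genome and correlates it with a gene-level DNA signal. The centred moving average is an assumption standing in for the smoothing step, the window size and all values are invented, and the helper names `moving_average` and `pearson` are hypothetical.

```python
def moving_average(values, window):
    """Centred moving average; the window shrinks at the ends of the list."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Toy data: genes ordered by genomic position, with a copy number gain
# from the fifth gene onwards and noisy expression values.
dna = [2, 2, 2, 2, 3, 3, 3, 3]
expression = [1.0, 3.0, 1.0, 3.0, 5.0, 7.0, 5.0, 7.0]
smoothed = moving_average(expression, window=3)

r_raw = pearson(expression, dna)
r_smoothed = pearson(smoothed, dna)
```

In this toy example the smoothing increases the correlation with the DNA signal, because it removes gene-to-gene fluctuations while keeping the shift at the copy number change.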
Next, we examined 2053 single cells from a triple negative breast cancer available as an example dataset from 10x Genomics (Breast Tissue nuclei section E). The 10x Genomics CellRanger pipeline filtered out 45 cells as low quality and performed GC correction, while we removed 57 outlier bins with >3 times the median counts. To robustly detect breakpoints from the 20 kb-sized bins, we used a window size of 100 bins, allowing for the detection of copy number events that span at least 2 Mb. Already from the read counts per cell (Fig. 4b), we can see a difference between the clusters based on the read-count profiles (Fig. 4c), with higher levels in the tumour cells (clones B–D) pointing to a possible whole-genome duplication. Comparing the results of SCICoNE with a diploid root against a run with a tetraploid root including a genome duplication, we observe much higher likelihoods in the latter case. We therefore proceed with a tetraploid root, for which the resulting cluster tree and copy number calls (Fig. 4a and d) recapitulate the main clonal architecture and the evolutionary relationships between the clones. Though the tree inference is run with a tetraploid root, cells that attach to the root do not display CNAs (Fig. 4, clone A) and can be reassigned as diploid nonmalignant cells in post-processing. The whole-genome duplication then becomes the initial event of the tumour clones.
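The bin filter and the resulting minimal event size can be sketched as follows. This is an illustration under stated assumptions: we assume that a bin is flagged when its total count over all cells exceeds three times the median of the bin totals, and the function name `filter_outlier_bins` and the toy counts are hypothetical.

```python
from statistics import median


def filter_outlier_bins(counts, factor=3.0):
    """Drop bins whose total count over cells exceeds factor times the median bin total.

    counts : list of per-cell lists of read counts (cells x bins)
    Returns the filtered matrix and the indices of the retained bins.
    """
    n_bins = len(counts[0])
    totals = [sum(cell[b] for cell in counts) for b in range(n_bins)]
    cutoff = factor * median(totals)
    keep = [b for b in range(n_bins) if totals[b] <= cutoff]
    return [[cell[b] for b in keep] for cell in counts], keep


# Toy example: two cells and four bins, where the third bin is an outlier.
toy_counts = [[10, 12, 95, 11], [9, 11, 80, 10]]
filtered, kept_bins = filter_outlier_bins(toy_counts)

# Smallest detectable event for a window of 100 bins of 20 kb each, in Mb.
bin_size_kb = 20
window_bins = 100
min_event_mb = bin_size_kb * window_bins / 1000
```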
Figure 4.
Inferred copy number profiles for 2053 cells from a breast cancer (10x Genomics). (a) Inferred tree on the clustered data, with the genes which are affected [among the 33 associated with breast cancer in the COSMIC Cancer Gene Census (Sondka et al. 2018)] displayed at each node. (b) Counts per cell for the different clusters. (c) Normalized counts per bin. (d) Copy number profiles estimated jointly with the CNA tree.
3.2 Benchmarking on simulated data
To benchmark SCICoNE we conducted a simulation study (Section 2.10), mimicking the very shallow coverage of a recent large clinical project (Irmisch et al. 2021), and compared the performance of SCICoNE to a suite of alternatives. For simulated trees with 20 nodes (Fig. 5 and Supplementary Fig. S5), we observe a steady increase in accuracy as we model our data in more detail. We first consider methods that cluster the data, and as a baseline, we cluster the normalized count data using hierarchical clustering or PhenoGraph (Levine et al. 2015) (Fig. 5) and assign cells the averaged profile of their clusters (rounded to integers and centred at diploid). Then we build trees on PhenoGraph clustered data using SCICoNE to leverage information across the clusters and their shared evolutionary history to improve the accuracy of learning the copy number profiles (Fig. 5). Next, we compare to methods that work at the single-cell level, and we perform the full tree inference with SCICoNE (Fig. 5), observing a strong improvement over the alternatives. On the full data, we find that the ‘max’ setting, which finds the best placement of each cell in the CNA tree, offers higher reconstruction quality relative to ‘sum’, which averages over their placements [Section 2, Equations (9) and (10)]. We therefore used ‘max’ for the breast cancer data above.
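The clustering baseline can be illustrated with a short sketch. Given cluster labels from any clustering method, each cell is assigned the averaged profile of its cluster, rounded to integers. How the profiles are centred at diploid is an assumption here: we rescale each cluster mean so that its genome-wide average equals 2. The function name `cluster_baseline` and the toy data are hypothetical.

```python
def cluster_baseline(norm_counts, labels, ploidy=2):
    """Assign each cell the rounded, diploid-centred mean profile of its cluster.

    norm_counts : list of per-cell lists of normalized counts (cells x bins)
    labels      : cluster label of each cell
    """
    n_bins = len(norm_counts[0])
    profiles = {}
    for lab in set(labels):
        members = [row for row, l in zip(norm_counts, labels) if l == lab]
        mean = [sum(r[b] for r in members) / len(members) for b in range(n_bins)]
        # Assumption: centre by rescaling the genome-wide average to the ploidy.
        scale = ploidy / (sum(mean) / n_bins)
        profiles[lab] = [round(v * scale) for v in mean]
    return [profiles[l] for l in labels]


# Toy example: two cells sharing a gain and a loss, and one neutral cell.
toy_norm = [
    [1.0, 1.0, 1.5, 0.5],
    [1.1, 0.9, 1.6, 0.4],
    [1.0, 1.0, 1.0, 1.0],
]
toy_labels = [0, 0, 1]
toy_calls = cluster_baseline(toy_norm, toy_labels)
```

Averaging within clusters reduces the noise of individual cells, but every cell in a cluster receives the same profile, so differences within a cluster are lost.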
Figure 5.
Comparison of copy number calling for simulated data. For uniform random trees with 20 nodes, we attached 400 cells and simulated overdispersed read data according to each cell’s copy number profile over 10 000 bins. The total number of reads was 20k, 40k, and 80k for an average read depth of 2X, 4X, and 8X (colour intensity, left to right) per bin for each cell. The maximal number of segments affected by copy number changes was 40 and 80 (panels). The root mean squared difference between the true simulated copy number profiles and the corresponding inferred profiles over all bins and cells is summarized in each box plot (generated with ggplot2 default settings), for a neutral diploid profile, for profiles inferred by hierarchical and PhenoGraph clustering as well as SCICoNE on PhenoGraph clustered data, followed by CONET (Markowska et al. 2022), HMMcopy (Lai et al. 2016), Ginkgo (Garvin et al. 2015), SCOPE (Wang et al. 2020), NestedBD (Liu et al. 2024), and SCICoNE on the full single-cell sequencing data. The comparison with a logarithmic transform is displayed in Supplementary Fig. S5.
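The accuracy measure of this comparison, the root mean squared difference over all bins and cells, can be written as a short sketch, together with the neutral diploid profile used as a reference. The function name `rmse` and the toy profiles are hypothetical.

```python
from math import sqrt


def rmse(true_profiles, inferred_profiles):
    """Root mean squared difference over all cells and bins."""
    total, n = 0.0, 0
    for true_row, inferred_row in zip(true_profiles, inferred_profiles):
        for t, i in zip(true_row, inferred_row):
            total += (t - i) ** 2
            n += 1
    return sqrt(total / n)


# Toy example with two cells and four bins.
true_cn = [[2, 2, 3, 3], [2, 1, 1, 2]]
diploid = [[2] * 4 for _ in true_cn]      # neutral diploid reference
inferred = [[2, 2, 3, 3], [2, 1, 2, 2]]   # one bin called incorrectly

rmse_diploid = rmse(true_cn, diploid)
rmse_inferred = rmse(true_cn, inferred)

# Average read depth per bin for 20k reads spread over 10 000 bins.
reads_per_bin = 20_000 / 10_000
```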
SCICoNE performs well in comparison to HMMcopy (Lai et al. 2016) (Fig. 5, bronze), which performs copy number calling per cell with a hidden Markov model and has been found to have good overall performance in single-cell copy number calling (Mallory et al. 2020a). However, for the simulated data with an average read depth of 2–8 (comparable to 10x Genomics data), HMMcopy can have quite variable performance and is generally worse than the PhenoGraph clustering (Fig. 5, silver). Instead, we found Ginkgo (Garvin et al. 2015) (Fig. 5, olive) to perform similarly to PhenoGraph clustering at lower depths and with more segments (and smaller copy number events), but better with fewer segments and especially at the highest depth. NestedBD (Liu et al. 2024) aims to improve the output of Ginkgo by enforcing that the cells are organized in a lineage tree, and we accordingly observe a slight improvement. SCOPE (Wang et al. 2020) (Fig. 5, green) is better still at the higher depths and a little worse than Ginkgo at the lowest depth, although its performance is quite variable across the repetitions. In contrast, SCICoNE on the clustered data is very robust to the read depth and performs better than the other clustering approaches (hierarchical and PhenoGraph), while on the full (un-clustered) data SCICoNE allows us to extract much more accurate copy number profiles than the alternative methods (HMMcopy, Ginkgo and SCOPE), along with the evolutionary history encoded in the tree itself. We observe similar performances in the easier setting of higher coverage [more akin to the coverage of the protocols of Zahn et al. (2017) and Laks et al. (2019)] and no overdispersion (Supplementary Fig. S6). This setting leads to a clearer improvement of NestedBD over Ginkgo and SCOPE, while SCICoNE still offers the best performance. HMMcopy also performs notably worse, as it often calls the reference level incorrectly, while using SCICoNE to build a tree on the clustered data does not offer an advantage in copy number calling compared to the clustering itself.
In terms of runtimes (Supplementary Fig. S7), methods that do not learn a tree and which cluster the data (Hclust and PhenoGraph) are very fast, followed by those which call copy numbers per cell (HMMcopy and Ginkgo), apart from SCOPE which is quite computationally intensive. Reconstructing a tree is however typically more than an order of magnitude more demanding in terms of runtime and the various tree inference algorithms have a similar order of magnitude to each other (Supplementary Fig. S7, right panel). Although the exact runtime may depend on factors like the parameter settings and how breakpoints are detected (Section 2), SCICoNE is among those with longer runtimes, though with the benefit of learning better trees and calling the copy numbers more accurately.
CONET (Markowska et al. 2022), despite using a phylogenetic tree, performs very poorly in the simulations, at odds with the results they report. We tracked this discrepancy down to their simulations implicitly providing CONET with effectively much higher read-depth data than they provided to SCICoNE, and perform a detailed analysis in Supplementary Appendix E. To benchmark the tree learning we take the output of Ginkgo and SCOPE as input for MEDALT (Wang et al. 2021) which reconstructs a phylogenetic tree from called copy numbers, and compare to the trees from CONET (Markowska et al. 2022), NestedBD (Liu et al. 2024), and SCICoNE (Fig. 6). Similarly to the copy number calling (Fig. 5), CONET is the worst performer, but close to the performance of Ginkgo with MEDALT. SCOPE offers a better input to MEDALT than Ginkgo at the higher depths, but worse at the lowest, with quite high variability as a function of depth. NestedBD clearly improves over Ginkgo with MEDALT, especially at the higher coverages where it is comparable to SCOPE with MEDALT and the ‘sum’ setting of SCICoNE. Overall, the best performance comes from modelling the data with SCICoNE with the ‘max’ setting, in line with the results for copy number calling.
Figure 6.
Comparison of copy number tree reconstruction for simulated data. For the simulated data of Fig. 5 and Supplementary Fig. S6 we compute the tree distance (Section 2.10) between the true tree and the trees inferred by CONET (Markowska et al. 2022), by MEDALT (Wang et al. 2021) run on the output of Ginkgo (Garvin et al. 2015) and of SCOPE (Wang et al. 2020), by NestedBD (Liu et al. 2024), and by SCICoNE. Since the tree distances cover several orders of magnitude we use a logarithmic axis.
4 Conclusions
For learning the copy number profiles of single cells, sharing information across cells and leveraging their shared evolutionary history boosts our ability to remove noise and accurately call copy numbers. Here we developed a novel phylogenetic framework for this purpose, enabling us to jointly infer the tree and copy number states and obtain better quality profiles. When learning the sequence of CNAs that occur in tumour samples from single-cell sequencing data, the possible overlap and reoccurrences of CNAs need to be accounted for. Our tree model allows regions of the genome to be arbitrarily amplified and deleted, while controlling for model complexity. Simulations demonstrate that our approach is accurate for the challenging task of reconstructing the evolutionary history of tumours and calling copy numbers in individual cells.
For a real data example of 260 xenograft breast cancer cells, we obtained the CNA tree and inferred copy number profiles of the cells and detected the main clonal structure as well as their phylogenetic relationship. Some smaller or weaker changes over small numbers of cells, visible in the normalized data (Fig. 3a) were not detected as copy number changes in the inference (Fig. 3b) as they were insufficient to justify a more complex model in our framework. Adjusting and relaxing the penalization for model complexity to detect finer changes may offer further improvements, though learning larger and more complex trees also increases the computation cost of the inference.
For the 2053 cells from a triple negative breast ductal carcinoma, sequenced with the 10x Genomics technology, we could detect a whole-genome amplification and use this to correct the ploidy for the tumour cells to accurately call their copy number states.
To learn the segments in the first place, we developed a dynamic programming approach to combine the evidence of breakpoints across all cells. Compared to methods which work on a per cell basis, our joint inference of breakpoints and then the full probabilistic model of the phylogeny provides a substantial improvement. The combination of bins into segments reduces the possible search space and speeds up the inference, but since the quality of the segments detected directly affects the downstream reconstruction, further improvements in this direction are important. In particular, once a phylogeny has been learnt based on the strongest breakpoints, the corresponding separation of cells into clones can help distinguish noise from signal in the breakpoint detection. Adding and removing breakpoints could also be incorporated as a move into the MCMC scheme itself.
For real data, where GC and mappability corrections are performed to normalize the bins, residual confounding effects remain which can complicate the breakpoint detection and phylogenetic inference. In particular, the bin corrections depend on the underlying copy number state, indicating that future directions could consider jointly inferring the corrections along with the segments and the phylogeny, although this would increase the complexity and computational cost of learning CNA trees.
Understanding and reconstructing the history of copy number events in a tumour could play a key role in predicting response to treatment, especially when resistance arises from adaptive selection of the existing clonal architecture. The combination with single-cell transcriptomics can uncover the interplay between evolutionary pressure and cellular reprogramming (Kim et al. 2018). The phylogenetic methods developed here for large-scale, complex and overlapping events could potentially also reconstruct trajectory structures reflected in the transcriptomic profiles of single-cell RNA sequencing, while the copy number trees reflect a common structure on which to analyse expression profiles (Campbell et al. 2019, Ferreira et al. 2021).
Although developed for shallow whole-genome single-cell sequencing protocols (10x Genomics, Zahn et al. 2017, Laks et al. 2019, Irmisch et al. 2021), our phylogenetic methods may also play a key role in copy number reconstruction for targeted single-cell sequencing (MissionBio), and offer relevant methodology for phylogenetics which integrate copy numbers and point mutations (Sollier et al. 2023) and for considering allele-specific events where methods currently look at the phasing through mutations but without jointly learning the phylogenetic history (Zaccaria and Raphael 2021, Ivanovic and El-Kebir 2024). Copy number reconstruction further complements multi-faceted single-cell profiling (Irmisch et al. 2021), e.g. to determine the downstream effects of tumour heterogeneity through evolutionary analyses across cohorts. Accurate copy number calling at the single-cell level, enabled through SCICoNE’s joint inference of the evolutionary history and copy number profiles, will enhance single-cell analysis for cancer biology.
Contributor Information
Jack Kuipers, Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland; SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland.
Mustafa Anıl Tuncel, Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland; SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland.
Pedro F Ferreira, Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland; SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland.
Katharina Jahn, Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland; SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland.
Niko Beerenwinkel, Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland; SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland.
Author contributions
Jack Kuipers (Conceptualization [lead], Formal analysis [equal], Funding acquisition [supporting], Methodology [lead], Software [supporting], Supervision [supporting], Visualization [equal]), Mustafa Anıl Tuncel (Conceptualization [supporting], Data curation [equal], Formal analysis [supporting], Methodology [supporting], Software [lead]), Pedro F. Ferreira (Conceptualization [supporting], Data curation [lead], Formal analysis [supporting], Methodology [supporting], Software [lead], Visualization [supporting]), Katharina Jahn (Conceptualization [supporting], Formal analysis [supporting], Methodology [supporting]), and Niko Beerenwinkel (Conceptualization [supporting], Funding acquisition [lead], Methodology [supporting], Supervision [lead])
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest: None declared.
Funding
This work was supported by European Research Council Synergy Grant [609883] (http://erc.europa.eu/), Swiss National Science Foundation Grant [310030_179518] (http://www.snf.ch) and SystemsX.ch RTD Grant [2013/150] (http://www.systemsx.ch/).
Data availability
The xenograft sequencing datasets utilized in this study were generated by Zahn et al. (2017) and Campbell et al. (2019) and are available at the European Genome-phenome Archive (http://www.ebi.ac.uk/ega/) under accession numbers EGAS00001002170 and EGAD00001004552. The dataset from 10x Genomics can be downloaded from https://support.10xgenomics.com/single-cell-dna/datasets as ‘Breast Tissue nuclei section E 2000 cells’.
Code availability
SCICoNE has been implemented in C++ and is freely available under the GNU General Public License v3.0 at https://github.com/cbg-ethz/SCICoNE. The code repository also provides a Python package to embed SCICoNE with data pre-processing steps and downstream analyses of the results, and includes an interface for data generated with CellRanger (10x Genomics). We additionally provide Snakemake (Mölder et al. 2021) workflows to reproduce all our results.
References
- 10x Genomics. https://www.10xgenomics.com/products/single-cell-cnv (25 February 2025, date last accessed).
- Allison KH, Sledge GW. Heterogeneity and cancer. Oncology (Williston Park) 2014;28:772–8.
- Baslan T, Kendall J, Rodgers L et al. Genome-wide copy number analysis of single cells. Nat Protoc 2012;7:1024–41.
- Bielski CM, Zehir A, Penson AV et al. Genome doubling shapes the evolution and prognosis of advanced cancers. Nat Genet 2018;50:1189–95.
- Bouckaert R, Heled J, Kühnert D et al. BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol 2014;10:e1003537.
- Burrell RA, McGranahan N, Bartek J et al. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 2013;501:338–45.
- Burrell RA, Swanton C. Re-evaluating clonal dominance in cancer evolution. Trends Cancer 2016;2:263–76.
- Campbell KR, Steif A, Laks E, IMAXT Consortium et al. clonealign: statistical integration of independent single-cell RNA and DNA sequencing data from human cancers. Genome Biol 2019;20:54.
- Chowdhury SA, Shackney SE, Heselmeyer-Haddad K et al. Algorithms to model single gene, single chromosome, and whole genome copy number changes jointly in tumor phylogenetics. PLoS Comput Biol 2014;10:e1003740.
- Dong X, Zhang L, Hao X et al. SCCNV: a software tool for identifying copy number variation from single-cell whole-genome sequencing. Front Genet 2020;11:505441.
- El-Kebir M, Raphael BJ, Shamir R et al. Complexity and algorithms for copy-number evolution problems. Algorithms Mol Biol 2017;12:1–11.
- Ferreira PF, Kuipers J, Beerenwinkel N. Mapping single-cell transcriptomes to copy number evolutionary trees. bioRxiv, 2021, preprint: not peer reviewed.
- Garvin T, Aboukhalil R, Kendall J et al. Interactive analysis and assessment of single-cell copy-number variations. Nat Methods 2015;12:1058–60.
- Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet 2016;17:175–88.
- Gertz EM, Chowdhury SA, Lee W-J et al. Fishtrees 3.0: tumor phylogenetics using a ploidy probe. PLoS One 2016;11:e0158569.
- Greaves M, Maley CC. Clonal evolution in cancer. Nature 2012;481:306–13.
- Ha G, Roth A, Lai D et al. Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome Res 2012;22:1995–2007.
- Hashimoto K, Tsuda H, Koizumi F et al. Activated PI3K/AKT and MAPK pathways are potential good prognostic markers in node-positive, triple-negative breast cancer. Ann Oncol 2014;25:1973–9.
- Irmisch A, Bonilla X, Chevrier S, Tumor Profiler Consortium et al. The tumor profiler study: integrated, multi-omic, functional tumor profiling for clinical decision support. Cancer Cell 2021;39:288–93.
- Ivanovic S, El-Kebir M. Evolution-aware deep reinforcement learning for single-cell DNA copy number calling. bioRxiv, 2024, preprint: not peer reviewed.
- Jahn K, Kuipers J, Beerenwinkel N. Tree inference for single-cell data. Genome Biol 2016;17:86.
- Kaufmann TL, Petkovic M, Watkins TBK et al. MEDICC2: whole-genome doubling aware copy-number phylogenies for cancer evolution. Genome Biol 2022;23:241.
- Kim C, Gao R, Sei E et al. Chemoresistance evolution in triple-negative breast cancer delineated by single-cell sequencing. Cell 2018;173:879–93.e13.
- Koboldt DC, Fulton RS, McLellan MD et al. Comprehensive molecular portraits of human breast tumours. Nature 2012;490:61–70.
- Kuipers J, Jahn K, Beerenwinkel N. Advances in understanding tumour evolution through single-cell sequencing. Biochim Biophys Acta Rev Cancer 2017a;1867:127–38.
- Kuipers J, Jahn K, Raphael BJ et al. Single-cell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors. Genome Res 2017b;27:1885–94.
- Lai D, Ha G, Shah S. HMMcopy: Copy Number Prediction with Correction for GC and Mappability Bias for HTS Data. R Package Version 1.22.0. 2016.
- Laks E, McPherson A, Zahn H et al.; CRUK IMAXT Grand Challenge Team. Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing. Cell 2019;179:1207–21.e22.
- Levine JH, Simonds EF, Bendall SC et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 2015;162:184–97.
- Liu Y, Edrisi M, Yan Z et al. NestedBD: Bayesian inference of phylogenetic trees from single-cell copy number profiles under a birth-death model. Algorithms Mol Biol 2024;19:18.
- Maley CC, Galipeau PC, Finley JC et al. Genetic clonal diversity predicts progression to esophageal adenocarcinoma. Nat Genet 2006;38:468–73.
- Mallory XF, Edrisi M, Navin N et al. Assessing the performance of methods for copy number aberration detection from single-cell DNA sequencing data. PLoS Comput Biol 2020a;16:e1008012.
- Mallory XF, Edrisi M, Navin N et al. Methods for copy number aberration detection from single-cell DNA-sequencing data. Genome Biol 2020b;21:208–22.
- Markowska M, Cąkała T, Miasojedow B et al. CONET: copy number event tree model of evolutionary tumor history for single-cell data. Genome Biol 2022;23:128–35.
- Marusyk A, Polyak K. Tumor heterogeneity: causes and consequences. Biochim Biophys Acta 2010;1805:105–17.
- McGranahan N, Swanton C. Biological and therapeutic impact of intratumor heterogeneity in cancer evolution. Cancer Cell 2015;27:15–26.
- McPherson A, Roth A, Laks E et al. Divergent modes of clonal spread and intraperitoneal mixing in high-grade serous ovarian cancer. Nat Genet 2016;48:758–67.
- Merlo LM, Shah NA, Li X et al. A comprehensive survey of clonal diversity measures in Barrett’s esophagus as biomarkers of progression to esophageal adenocarcinoma. Cancer Prev Res (Phila) 2010;3:1388–97.
- MissionBio. https://missionbio.com/capabilities/snv-cnv/ (25 February 2025, date last accessed).
- Mölder F, Jablonski KP, Letcher B et al. Sustainable data analysis with Snakemake. F1000Res 2021;10:33.
- Navin N, Kendall J, Troge J et al. Tumour evolution inferred by single-cell sequencing. Nature 2011;472:90–4.
- Nik-Zainal S, Van Loo P, Wedge DC et al.; Breast Cancer Working Group of the International Cancer Genome Consortium. The life history of 21 breast cancers. Cell 2012;149:994–1007.
- Olshen AB, Venkatraman ES, Lucito R et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5:557–72.
- Ruohan W, Yuwei Z, Mengbo W et al. Resolving single-cell copy number profiling for large datasets. Brief Bioinform 2022;23:bbac264.
- Salehi S, Dorri F, Chern K et al. Cancer phylogenetic tree inference at scale from 1000s of single cell genomes. Peer Commun J 2023;3:e63.
- Schwarz RF, Trinh A, Sipos B et al. Phylogenetic quantification of intra-tumour heterogeneity. PLoS Comput Biol 2014;10:e1003535.
- Singer J, Kuipers J, Jahn K et al. Single-cell mutation identification via phylogenetic inference. Nat Commun 2018;9:5144.
- Sollier E, Kuipers J, Takahashi K et al. COMPASS: joint copy number and mutation phylogeny reconstruction from amplicon single-cell sequencing data. Nat Commun 2023;14:4921.
- Sondka Z, Bamford S, Cole CG et al. The COSMIC cancer gene census: describing genetic dysfunction across all human cancers. Nat Rev Cancer 2018;18:696–705.
- Steel MA, Penny D. Distributions of tree comparison metrics–some new results. Syst Biol 1993;42:126–41.
- Tirosh I, Izar B, Prakadan SM et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 2016;352:189–96.
- Wang F, Wang Q, Mohanty V et al. MEDALT: single-cell copy number lineage tracing enabling gene discovery. Genome Biol 2021;22:70.
- Wang R, Lin D-Y, Jiang Y. SCOPE: a normalization and copy-number estimation method for single-cell DNA sequencing. Cell Syst 2020;10:445–52.e6.
- Wang X, Chen H, Zhang NR. DNA copy number profiling using single-cell sequencing. Brief Bioinform 2018;19:731–6.
- Wang Y, Navin NE. Advances and applications of single-cell sequencing technologies. Mol Cell 2015;58:598–609.
- Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 2005;21:4084–91.
- Yates LR, Campbell PJ. Evolution of the cancer genome. Nat Rev Genet 2012;13:795–806.
- Yuan K, Sakoparnig T, Markowetz F et al. BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies. Genome Biol 2015;16:36.
- Zaccaria S, Raphael BJ. Characterizing allele-and haplotype-specific copy numbers in single cells with CHISEL. Nat Biotechnol 2021;39:207–14.
- Zafar H, Navin N, Nakhleh L et al. Computational approaches for inferring tumor evolution from single-cell genomic data. Curr Opin Syst Biol 2018;7:16–25.
- Zahn H, Steif A, Laks E et al. Scalable whole-genome single-cell library preparation without preamplification. Nat Methods 2017;14:167–73.