Abstract
Understanding the clonal architecture and evolutionary history of a tumour is one of the key challenges in overcoming treatment failure due to resistant cell populations. Previously, studies on subclonal tumour evolution have been primarily based on bulk sequencing and, in some recent cases, on single-cell sequencing data. Either data type alone has shortcomings with regard to this task, but methods integrating both data types have been lacking. Here, we present B-SCITE, the first computational approach that infers tumour phylogenies from combined single-cell and bulk sequencing data. Using a comprehensive set of simulated data, we show that B-SCITE systematically outperforms existing methods with respect to tree reconstruction accuracy and subclone identification. B-SCITE provides high-fidelity reconstructions even with a modest number of single cells and in cases where bulk allele frequencies are affected by copy number changes. On real tumour data, B-SCITE-generated mutation histories show high concordance with expert-generated trees.
Subject terms: Cancer genetics, Tumour heterogeneity, Computational biology and bioinformatics, Genetics
Intra-tumour heterogeneity provides important information about subclonal tumour evolution. Here, the authors develop B-SCITE, a computational method for inferring tumour phylogenies from combined single-cell and bulk sequencing data.
Introduction
Cancer is a genetic disease that develops through a branched evolutionary process1. It is characterised by the emergence of genetically distinct subclones through the random acquisition of mutations at the level of single cells and shifting prevalences at the subclone level through selective advantages purveyed by driver mutations. This interplay creates complex mixtures of tumour-cell populations, which exhibit different susceptibilities to targeted cancer therapies and are suspected to be the cause of treatment failure2,3. Therefore, it is of great interest to obtain a better understanding of the evolutionary histories of individual tumours and their subclonal composition4.
Most genetic analyses of tumours are currently based on next-generation sequencing data of bulk tumour samples. Such data provide indirect measurements of the subclonal tumour composition in the form of aggregate total and variant read counts measured across hundreds of thousands or millions of cells. A large number of approaches have been published over the last few years that try to identify subclones, their frequencies and in some cases, their phylogenetic relationships by deconvolving these aggregate data5–23. However, the underlying statistical problem is underdetermined24,25, and single-nucleotide variants (SNVs) with similar variant allele frequencies (VAFs) are automatically clustered into a single subclone. This inevitably leads to incorrect phylogenies for tumours with multiple distinct subclones of similar prevalences as illustrated in Fig. 1. The aggregate sequencing data additionally pose a limitation to the achievable tree resolution, as mutational signals of smaller subclones cannot be distinguished from noise26 and therefore not be reliably represented in the tree. Sequencing multiple samples from the same tumour and increasing the coverage can to some extent mitigate these issues, but is not always practicable.
Another solution is the use of single-cell sequencing (SCS) data, which provide mutation profiles of individual cells, such that the phylogeny can be directly inferred without any form of deconvolution. The main challenge here is instead the high level of noise in SCS data, which is primarily introduced during DNA amplification, a necessary step to obtain sufficient DNA material for sequencing. False negatives due to allelic dropout are the most prevalent error type, but false positives also occur when an error is introduced early in the amplification. Further noise stems from doublet-mutation profiles, which occur when two cells are accidentally sequenced together27. Classic approaches for phylogeny reconstruction are not suitable for dealing with these SCS-specific noise profiles, and a number of probabilistic approaches have been developed to specifically account for the error types found in SCS data28–32.
A major difference between the evolutionary histories of tumours inferred from bulk and SCS data is that the former typically are clonal trees where mutations with similar frequencies are clustered together and represented in a single tree node (Fig. 1e, f), while trees derived from SCS data are fully resolved trees that can be either cell lineage trees, binary trees where the cells form the leaves and mutations occur along tree branches, or mutation trees (Fig. 1c) that depict the partial temporal order in which mutations were acquired33. For cell lineage trees, a heuristic has been proposed for clustering cells into clones in a post-processing step29, which results in trees that are closer to bulk clonal trees.
Another difference, as illustrated in Fig. 1, is that the VAFs obtained from a bulk sample are well suited for inferring the temporal order of mutations (by ordering mutations with respect to decreasing VAF), but of limited use for the identification of branching events. In contrast, single-cell genotypes obtained from SCS data are well suited for inferring branching events but, due to unobserved ancestral states and high noise levels, may give ambiguous or no signals on the temporal order of mutations in linear segments of the inferred mutation trees.
As the strengths and weaknesses of single-cell and bulk-sequencing data are to a large extent complementary with respect to phylogeny inference, using both data types for a joint inference should improve our understanding of subclonal tumour evolution over using each type of data alone. It has already been shown that clustering of mutations into subclones from bulk-sequencing data can be informed by single-cell genotypes to obtain more accurate results34. In this work, we present B-SCITE, a probabilistic approach for the inference of tumour phylogenies from combined single-cell and bulk-sequencing data. We show in a comprehensive simulation study that B-SCITE systematically outperforms competing approaches in terms of accurate tree reconstruction and mutation clustering. It performs particularly well in difficult cases, where only a small number of single cells is available, or when bulk read counts are affected by copy-number changes. We also show that for real tumour data, B-SCITE provides mutation histories that are in better congruence with expert-generated trees than tools relying only on a single data type.
Results
Method overview
We developed B-SCITE, a probabilistic approach for the inference of tumour mutation histories by the use of SNV data obtained from single-cell and bulk DNA sequencing (Fig. 2). Full technical details are given in the ‘Methods’ section. B-SCITE consists of a Markov chain Monte Carlo-based search scheme to traverse the space of possible tree topologies and a joint likelihood model for the evaluation of candidate trees based on their joint fit to the single-cell and bulk data. The single-cell data are given in the form of a mutation matrix, where each row represents a mutation, for which each column represents that mutation’s state in the corresponding single cell (‘0’ normal state, ‘1’ altered state and ‘NA’ missing data); on the other hand, the bulk data consist of the variant and total read counts of the mutated loci in each of the sequenced bulk samples. B-SCITE reports a single maximum likelihood mutation tree that can be condensed into a clonal tree by clustering linear tree segments based on VAF similarity.
Performance assessment on simulated data
We tested B-SCITE on a comprehensive set of simulated data. We focused initially on comparing with ddClone34, which is, to the best of our knowledge, the only existing method that performs analysis of intra-tumour heterogeneity by jointly using bulk and SCS data. ddClone was shown to significantly outperform methods only utilising the bulk-sequencing data. For ease of comparison, we followed the simulation strategy employed in the original ddClone publication (see Supplementary Methods for details).
Since B-SCITE infers tumour phylogenies, we additionally compared against the two tree inference methods, OncoNEM29 and SCITE28, which are working on single-cell data only.
All box plots used for presenting the performance of the methods on the simulated data were generated using ggplot2 (ref. 35), with its default settings and the data points overlaid.
Comparisons of clustering accuracy
Since ddClone does not provide any output related to the tree of tumour evolution, we used the V-measure of cluster assignments36 and adjusted Rand index of cluster assignments in our comparisons. For B-SCITE, the clonal tree derived from the fully resolved maximum likelihood tree provides the mutation clusters. Namely, each clone C defines a unique cluster consisting of all mutations appearing for the first time at C. OncoNEM also provides an option to cluster the data into subclones based on the inferred phylogeny, and is hence included in the comparison.
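The clustering comparison described above can be illustrated with a minimal Python sketch. It is not the evaluation code used in the study: it labels each mutation by the clone in which it first appears and scores two labellings with the adjusted Rand index. The helper names and the toy clone assignments are hypothetical, and the V-measure is omitted for brevity.

```python
from collections import Counter
from math import comb

def cluster_labels(node_of_mutation, mutations):
    """Label each mutation by the clone (tree node) where it first appears."""
    return [node_of_mutation[m] for m in mutations]

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two flat clusterings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # a single item can only be clustered one way
    # Pairs of items that share a cluster in both, in the first, in the second.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings are one big cluster
    return (sum_ij - expected) / (max_index - expected)

# Hypothetical toy example: five mutations, true versus inferred clones.
mutations = ["M1", "M2", "M3", "M4", "M5"]
true_nodes = {"M1": "C1", "M2": "C1", "M3": "C2", "M4": "C2", "M5": "C3"}
inferred_nodes = {"M1": "A", "M2": "A", "M3": "B", "M4": "C", "M5": "C"}
ari = adjusted_rand_index(cluster_labels(true_nodes, mutations),
                          cluster_labels(inferred_nodes, mutations))
```

The index is invariant to renaming of clusters, so the clone names of the true and inferred trees do not need to match.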
The results on simulations (Fig. 3 for 25 single cells and 10 clones) show that B-SCITE consistently outperforms both ddClone and OncoNEM. B-SCITE and ddClone are both robust to doublet contamination and distortion in the single-cell data sampling denoted by λ. The latter introduces a discrepancy between the genotype frequencies of the sequenced single cells and the subclone frequencies of the bulk tumour tissue, where larger values of λ indicate better agreement between the frequencies (for more details see Supplementary Methods).
OncoNEM, which only utilises single-cell data, improves as the sampling of single cells more closely reflects the bulk tumour composition (as λ increases). Even for highly distorted data, OncoNEM performs a little better than ddClone. However, when simulating a smaller number of clones (Supplementary Fig. 1 with six clones instead of ten), the gap between OncoNEM and ddClone increases while B-SCITE remains the best performer.
Increasing the number of cells from 25 to 50 and 100 has a marginal effect on the accuracy with 10 clones (Supplementary Fig. 2, although more of the simulated clones may also be observed with more cells). As the number of clones increases (Supplementary Figs. 3, 4), OncoNEM’s performance decreases although it is aided by larger cell numbers, while ddClone’s performance starts to degrade, as more cells allow more of the simulated clones to be observed. B-SCITE retains the best and most stable performance. A similar pattern is seen when computing the accuracy with the adjusted Rand index (Supplementary Figs. 5–7), which amplifies the differences between the methods.
The effect of allelic dropout and false negatives is relatively mild on B-SCITE (Supplementary Fig. 8) and has a more noticeable effect on ddClone and OncoNEM. A similar dependence on false negatives is seen with a highly elevated false-positive rate (Supplementary Fig. 9), and the false positives lead to a small but clear decrease in accuracy for B-SCITE. OncoNEM also suffers a slight loss in accuracy, while ddClone actually improves marginally with the higher error rate though still with the worst performance overall.
Accuracy in inferring phylogenetic order of mutations
In addition to clustering mutations into subclones, B-SCITE also infers the complete phylogenetic history of a tumour. We therefore compared B-SCITE with the single-cell phylogenetic methods OncoNEM and SCITE based on three different accuracy measures (the definition of each measure is available in the Supplementary Methods). Specifically, for SCITE, we chose the extended version with the doublet model27 to make sure that any change in performance can be fully attributed to the additional data available to B-SCITE.
B-SCITE again has the best and most robust performance over the range of λ (Fig. 4). The two single-cell methods improve as the single-cell sampling approaches a better representation of the bulk frequencies, but never reach the performance of B-SCITE. The apparent improvement for B-SCITE as λ decreases is due to the smaller number of observed clones being included in the calculation of the tree accuracy.
Changing the number of clones or cells (Supplementary Figs. 10–13), we observe the same pattern with B-SCITE on top, and SCITE performing slightly better than OncoNEM. Similar behaviour is also observed for inferring the correct ancestry relationships (Supplementary Fig. 14). For mutations in separate lineages, both SCITE and B-SCITE perform well in detecting the correct separations, while OncoNEM has a somewhat lower accuracy (Supplementary Fig. 15).
The effect of allelic dropout on the phylogenetic inference is again relatively mild on B-SCITE compared with SCITE and OncoNEM, while false positives play a more important role (Supplementary Figs. 16, 17) and reduce the accuracy of all tools by a small amount.
Performance in the presence of CNA events
Copy-number aberrations (CNAs) perturb the fraction of mutated reads in the affected regions, thereby shifting the observed VAFs. While the model underlying B-SCITE expects mutations to come from copy-number-neutral regions, it is not always possible to identify and discard all other mutations. VAFs from copy-number-altered regions are known to confound tree reconstruction and mutation ordering from bulk data only. By also utilising single-cell data, B-SCITE is quite robust to these effects, with only a small average decrease in the co-clustering accuracy, as the number of CNAs in the simulated data increases (Supplementary Fig. 18). For very-high-coverage data, where the bulk VAFs play a stronger role in the phylogenetic reconstruction, the effect is correspondingly more pronounced (Supplementary Fig. 19). For the other accuracy measures, we see the same pattern for the two coverage levels (Supplementary Figs. 20–23).
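To illustrate how a CNA shifts a bulk VAF, the following Python sketch uses a deliberately simplified model that is an assumption of this example, not the simulation procedure of the study: all cells carrying the mutation share the same copy-number state at the locus, and all other cells are diploid. The function name and parameters are hypothetical.

```python
def expected_vaf(cell_fraction, mutated_copies=1, total_copies=2):
    """Expected VAF of a mutation under a simplified copy-number model.

    cell_fraction  -- fraction y of cells in the sample carrying the mutation
    mutated_copies -- copies of the mutated allele per carrying cell
    total_copies   -- total copies of the locus per carrying cell
    Cells without the mutation are assumed to be diploid.
    """
    variant = cell_fraction * mutated_copies
    total = cell_fraction * total_copies + (1.0 - cell_fraction) * 2
    return variant / total

y = 0.6
neutral = expected_vaf(y)                                       # heterozygous, diploid: y / 2
gain = expected_vaf(y, mutated_copies=2, total_copies=3)        # mutated allele duplicated
loss = expected_vaf(y, mutated_copies=1, total_copies=1)        # wild-type allele lost
```

Under this model, both the gain and the loss raise the VAF above the copy-number-neutral value y/2, so a method that reads the VAF as y/2 would overestimate the prevalence of the mutation.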
Performance in the presence of ISA violations
In the tree inference (Methods), we employ the infinite-sites assumption (ISA), which states that each mutation occurs only once in the phylogeny and persists in all descendant cells. Violations of this assumption, however, have only a very weak effect on the structural accuracy of the other mutations (Supplementary Fig. 24).
Multiple bulk samples
To assess B-SCITE’s performance in settings where multiple bulk samples are available and to see whether additional bulk samples render single-cell data redundant, we simulated data with up to four bulk samples and compared B-SCITE with the bulk-only method PhyloWGS18 (Supplementary Figs. 25–27). As expected, the reconstruction quality of both tools improves as the number of bulk samples increases. While B-SCITE generally outperforms PhyloWGS, it is possible to create settings that benefit the bulk-only method, namely a combination of abundant, high-quality bulk data with single-cell data that poorly reflect the tumour composition (small λ). When looking at different settings for the bulk data, we observe more accurate reconstruction with multiple bulk samples at lower coverage than with a single bulk sample at high coverage. Finally, for any fixed number of bulk samples, we observe that the reconstruction quality improves with the number of available single cells.
Application to real data
To assess the performance of B-SCITE on real tumour data, we analysed the sequencing data of two patients with childhood leukaemia37, one triple-negative breast cancer patient38 and two colorectal cancer patients with matched liver metastases39. Discussion of the results is provided below and all details of data pre-processing can be found in Supplementary Methods.
Acute lymphoblastic leukaemia (ALL)
For both leukaemia patients, a bulk sample was sequenced together with a large number (>100) of single cells, for which targeted sequencing was performed using a personalised panel. This allows us to compare B-SCITE with methods relying solely on single-cell or bulk-sequencing data. For our comparison, we chose CTPsingle20 for the bulk-only approach and SCITE28 for single-cell-based inference.
For patient 1, 20 mutations were detected by sequencing one bulk sample and 111 single cells. The phylogenies inferred by B-SCITE and its two competitors are depicted in Fig. 5. CTPsingle, the approach based on bulk samples, finds two trees compatible with the observed frequencies. Both trees cluster the 20 mutations into five subclones and are either completely linear or have a single mutation in a separate branch. Using the SCS data of the same patient, SCITE detects an early branching event that splits up some of the subclones inferred by CTPsingle. Without knowledge of the ground truth tree, there is no certainty whether the branching reflects the true phylogeny. However, the large number of cells and the proximity of the branching to the root make it highly likely that the branching is genuine. The same branching is also inferred by B-SCITE, which finds generally the same topology as SCITE, but arranges mutations differently in linear segments of the tree. As expected, having the additional information of VAFs for the individual mutations allows B-SCITE to find an ordering that is in better congruence with the observed mutation frequencies, which should be in decreasing order. As a consequence, the mutation ordering inferred by B-SCITE is also closer to the subclone clustering of CTPsingle, in the sense that mutations from the same cluster that are in the same branch tend to be closer together than in the SCITE tree. For this data set, we estimate the relative weight of the bulk data to be 0.81 times that of the single-cell data.
For patient 2, 16 mutations were detected by sequencing one bulk sample and 115 single cells. CTPsingle reports a subclone clustering that is compatible with many phylogenies, some of which are depicted in Fig. 6a and Supplementary Fig. 28. SCITE and B-SCITE each infer a single tree (Fig. 6b, c). The general topologies of the two trees are again similar, with different arrangements of mutations on linear segments. In particular, B-SCITE puts the mutations in FAM105A and CMTM8 higher up in their branch, which better reflects their relatively high frequencies (44% and 42%). Notably, B-SCITE does not completely sort the mutations in this branch by frequency, as the mutation in RRP8 is placed between two mutations with higher frequencies. This indicates that a strong signal for this placement is coming from the single-cell data. Apart from issues with mutation calling, a possible explanation could be a copy-number change affecting RRP8, which decreases its observed VAF. The bulk data play a stronger role for this data set, with an estimated weight 1.55 times that of the single-cell data.
Triple-negative breast cancer (TNBC)
For the triple-negative breast cancer patient, in the original study38, copy-number profiling of 50 single cells was first performed. Next, 16 single tumour cells were whole-exome sequenced and hierarchical clustering of the cells was performed based on profiles of a large number (>500) of detected somatic SNVs, each present in more than one of the 16 cells (mutations detected in only one cell were discarded). A clonal tree was then manually reconstructed (Extended Data Fig. 6b in ref. 38). Based on this tree, we provide a clonal tree for the 18 mutations selected in Fig. 7a. For each of these 18 mutations, the original study obtained bulk data read counts from targeted ultra-deep sequencing, with an approximate depth of coverage > 100,000×. Hence, we were able to run both SCITE and B-SCITE on this data set as well. Trees obtained by these tools are shown in Fig. 7b, c. Notably, B-SCITE infers a mutation tree that is in strong concordance with the clonal tree and, at the same time, in agreement with the variant allele frequencies obtained from bulk data. On the other hand, the tree inferred by SCITE has several differences compared with the clonal tree, including the placement of the mutation in gene SYNE2 on a different branch than the PPP2R1A and AURKA mutations. Similarly, mutations reported to be clonal in ref. 38, namely mutations in genes MAP2K7 and NTRK1, are placed on different branches in the SCITE tree, and the placement of the branch containing mutations in genes CHRM5 and TGFB2 is closer to the root. In this tree, there are also several inconsistencies with variant allele frequencies of mutations obtained from bulk data. With the high bulk coverage and the limited number of single cells, the bulk data have a much higher relative weight (≈2800 times) than the single-cell data. However, this does not mean that the latter have no effect on the tree structure. In fact, the bulk data favour a completely linear tree topology.
Colorectal cancer
We reanalysed the bulk and single-cell sequencing data of two cases of colorectal cancer with liver metastases39. The interesting feature of these data sets is the availability of two bulk samples (primary tumour and metastasis) in addition to single-cell data. Unfortunately, the original study revealed in both patients the presence of aneuploid populations at both sites via flow sorting, which is not ideal for the use of B-SCITE due to CNA-based shifts of bulk VAFs.
In the study, flow-sorted cells were pooled and exome sequenced in bulk. We obtained the read counts from these experiments by processing the raw data from the SRA and used them as bulk data input. The original study also comprised targeted single-cell sequencing based on the T1000 cancer gene panel (mean depth 137× at average coverage of 0.92). To run B-SCITE, we focused on the mutations detected via single-cell panel sequencing, but discarded mutations with insufficient coverage in the bulk samples. In addition, we removed cells where no mutations were detected. These filtering steps left 12 mutations and 72 single cells for CRC1 (CO5), and 25 mutations and 86 single cells for CRC2 (CO8). The trees inferred by B-SCITE are displayed in Fig. 8.
For CO5, we observe an almost linear tree structure. For the primary tumour sample, VAFs are roughly decreasing, while we observe a bigger fluctuation for the metastasis sample. This may be a direct consequence of the higher aneuploidy observed in the liver metastasis compared with the primary site39. Compared with the targeted sequencing of the previous case, bulk-sequencing depth is relatively low here (with an average coverage depth of 69.25×). This is also reflected in a lower relative weight for the bulk data (≈1.75).
For CO8, we observe a more branched structure. Notably, there are two completely separate lineages, one of which is only detected in the single-cell data. In the other branch, early mutations all have VAFs >50%, indicating the presence of CNAs. Fluctuation in VAFs in this tree part may indicate complex copy-number events, but at the given sequencing depth (average of 105.76×) may also be attributed to read count variance. The relative weight on the bulk data is 2.96 compared with the single-cell data.
The lower branches are generally in good agreement with decreasing bulk VAFs and show a clear distinction between the primary and metastatic sample. Interestingly, the original study reported two separate metastatic seeding events based on a tree topology inferred from single-cell data alone39. In particular, it places FUS (VAF 29.1% in the metastasis sample) in the second metastatic lineage, while other mutations with similar VAFs in the metastasis sample (LAMB4 27.6%, F8 30.0% and SPEN 25.1%) are placed in the first metastatic branch. This separation strongly violates the ‘sum rule’ for bulk VAFs and is likely the reason why B-SCITE chooses a topology which places these mutations into a single lineage. To better understand this discrepancy, we checked the single-cell genotype matrices inferred in the original study (Supplementary Fig. 7)39 and found that the separate placement of FUS requires that FUS mutation calls are explained away as false positives in ten metastatic cells (MA_44, MA_42, MA_41, MA_48, MA_29, MA_45, MA_90, MA_91, MA_33 and MA_35). Another ambiguous mutation is ATP7B (3.1% metastasis VAF), which the single-cell tree places above the metastatic mutation FUS, while B-SCITE places it in the primary branch. While in better congruence with the bulk data, the placement chosen by B-SCITE comes at the cost of an increased number of false-positive calls for ATP7B.
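The ‘sum rule’ argument can be checked with a short Python sketch. It rests on the model assumption of heterozygous mutations at copy-number-neutral sites, under which the fraction of cells carrying a mutation is twice its VAF; as noted above, aneuploidy in these samples may shift the VAFs, so the numbers are indicative only. The function names are hypothetical.

```python
def vaf_to_cell_fraction(vaf):
    """Cell fraction of a heterozygous mutation at a copy-number-neutral site."""
    return 2.0 * vaf

def violates_sum_rule(branch_vafs, parent_fraction=1.0):
    """True if mutations in separate branches would need more cells than
    their common parent population contains (at most the whole sample)."""
    needed = sum(vaf_to_cell_fraction(v) for v in branch_vafs)
    return needed > parent_fraction

# Metastasis-sample VAFs quoted in the text.
fus, f8, atp7b = 0.291, 0.300, 0.031

# FUS and F8 in separate lineages would require 58.2% + 60.0% of the cells.
separate_lineages = violates_sum_rule([fus, f8])
# A low-frequency mutation such as ATP7B poses no such conflict with FUS.
low_frequency_pair = violates_sum_rule([fus, atp7b])
```

Placing FUS and F8 in one lineage avoids the conflict, because nested mutations share cells and their fractions are not added.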
In general, there appears to be no tree topology that jointly explains all aspects of the single-cell data, let alone both data types. A possible explanation could be the presence of complex events that are not covered by our tree model, such as mutation loss or recurrent mutations. The absence of a well-fitting tree topology leads to two unwanted effects, both of which we observe in this data set. First, the inferred tree topology can be strongly influenced by the mutation choice. In the present data, lowering the read count cut-off (to 5 instead of 20) and thereby discarding fewer mutations is sufficient to drive FUS into a separate metastatic branch. Second, for a fixed set of mutations, lower branches in the tree can become unstable in the sense that multiple branching variations obtain highly similar scores. Having such a ‘flat landscape’ in high-scoring parts of the tree search space makes it difficult to find a globally optimal tree. This matches our observation that this data set took much longer MCMC chains to consistently find the best tree than normally expected for data sets of this size.
Discussion
Recent advances in sequencing technologies allow large-scale bulk and single-cell sequencing of tumour samples. The resulting data are invaluable for understanding the evolutionary history and subclonal composition of individual tumours and addressing the issue of treatment failure due to resistant cell populations. The bottleneck to fully leveraging the joint strength of both data types is the lack of specialised software which integrates single-cell and bulk-sequencing data in a joint inference scheme. Prior to this work, only a single tool (ddClone) has been available, where subclone inference based on bulk-sequencing data is informed by single-cell genotypes, but no integrative tool has been published for phylogeny inference. To fill this gap, we have developed B-SCITE, the first approach for inferring tumour phylogenies and subclonal compositions from combined single-cell and bulk-sequencing data. B-SCITE uses a joint likelihood model to integrate both data types and performs a probabilistic search to find the best combination of a fully resolved mutation history and values for the model parameters. Extensive simulation studies show that B-SCITE systematically outperforms competing single-cell-based approaches, thereby indicating that bulk data make a valuable contribution to the inference. The quantity of the performance gain depends on the degree of sampling distortion between the single-cell samples and the composition of the bulk tumour. However, even in cases where the single-cell data very well reflect the tumour composition, B-SCITE outperforms its competitors.
Since the identification of the correct copy-number states of mutated sites can be a challenge, we performed additional tests with bulk VAFs that one could expect to arise from non-diploid regions. While B-SCITE is designed and works best for heterozygous mutations at copy-number-neutral sites, it still shows a robust performance when a subset of the mutation sites is affected by CNAs.
To compare B-SCITE with ddClone, the only other tool combining single-cell and bulk-sequencing data, we obtained subclones from B-SCITE’s fully resolved mutation histories by performing a local mutation clustering. Experiments on a comprehensive set of simulated data sets showed that B-SCITE systematically outperforms ddClone, suggesting that subclone inference benefits from taking the underlying phylogeny into account.
In addition, we explored the usefulness of combining bulk and single-cell data in the analysis of real tumours. Looking at the data from two patients with childhood leukaemia from ref. 37, where a large number (>100) of single cells were sequenced, we find that B-SCITE and the single-cell-only approach SCITE infer very similar branchings in the tumour phylogenies, while B-SCITE shows an improved ability in reconciling the temporal ordering of mutations with their VAFs. In cases where B-SCITE refrains from ordering mutations by decreasing VAFs, we suspect the presence of a copy-number change, whose signal was overridden by a strong contrary signal in the single-cell data. This gives further evidence that B-SCITE is to some extent robust to data deviating from the assumption that all mutations are heterozygous and at copy-number-neutral sites.
We also tested B-SCITE on a triple-negative breast cancer patient from ref. 38, from which 16 single cells were sequenced. In contrast to SCITE, B-SCITE infers a tree that is in high concordance with the expert-generated tree for the 18 mutations pre-selected in the original study. These results suggest a potential advantage of using B-SCITE over the existing methods in clinical settings, where a targeted subset of cancer-relevant genes is sequenced.
Methods
Tree models of tumour evolution
A clonal tree T, depicting subclonal tumour evolution in a cancer patient with a set of detected somatic mutations M = {M1, M2,…, Mn}, is a labelled rooted tree with node set V(T) = {v1, v2,…, vs, vs+1}. Its root vs+1 represents the population of healthy cells of the patient, and the other nodes represent s tumour subclonal populations (subclones) of the same individual, which are all genetically different. The mutational label associated with a node vi, denoted as LT(vi), is a subset of M and represents the set of mutations acquired at subclone vi. In other words, for each non-root node vi, LT(vi) represents the set of all mutations present at subclone vi, but absent from subclone vj, where vj is parent of vi in T. We assume that the population of healthy cells does not harbour any mutation and therefore set LT(vs+1) = ∅. As subclones are all genetically different, we have that LT(v) ≠ ∅ for each v ∈ {v1, … vs}. Furthermore, assuming that we have h different bulk samples of the analysed tumour, we also add a frequency label to each node vi ∈ V(T). The frequency label of node vi is represented as a vector (ϕi1, ϕi2,…, ϕih), where ϕij denotes the relative frequency of cell population vi in bulk sample j.
In this work, we assume that each mutation was acquired exactly once by one subclone and then passed on to all its descendants (infinite-sites assumption). Consequently, LT(vi) ∩ LT(vj) = ∅ for all pairs (vi, vj) with i ≠ j (mutations are acquired only once) and LT(v1) ∪ LT(v2) ∪ … ∪ LT(vs) = M (each of the observed mutations emerged in some of the subclones). As we assume that mutations are never lost, the set of mutations present in a subclone vi consists of all mutations acquired along the path from the root to vi.
The topology of clonal tree T can be represented by an s × s ancestor matrix AT defined as follows (we assume 1-based indexing of matrix entries):
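$$A_T[i, j] = \begin{cases} 1, & \text{if } v_i = v_j \text{ or } v_i \text{ is an ancestor of } v_j \text{ in } T \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$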
Below, we also allow the use of AT[vi, vj], which is defined to be equal to AT[i, j].
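As an illustration, the ancestor matrix can be built from a parent vector with the short Python sketch below. It assumes the reflexive convention implied by the later use of AT in the text (a node counts among its own ancestors) and uses 0-based indices, whereas the text uses 1-based indexing. The parent-vector encoding and the function name are hypothetical choices for this example.

```python
def ancestor_matrix(parent):
    """Build the s x s ancestor matrix of a rooted tree.

    parent[i] is the 0-based index of the parent of non-root node i, or
    None when the parent is the root (the healthy-cell population), which
    has no row or column in the matrix.
    A[i][j] == 1 if node i equals node j or is an ancestor of node j.
    """
    s = len(parent)
    A = [[0] * s for _ in range(s)]
    for j in range(s):
        k = j
        while k is not None:   # walk from node j up towards the root
            A[k][j] = 1
            k = parent[k]
    return A

# Hypothetical tree: root -> v1, with v2 and v3 as children of v1.
branching = ancestor_matrix([None, 0, 0])
# Hypothetical linear tree: root -> v1 -> v2 -> v3.
linear = ancestor_matrix([None, 0, 1])
```

In the branching tree, the entries linking v2 and v3 are 0 in both directions, which is how the matrix encodes that the two nodes lie in separate lineages.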
For convenience of notation, we also introduce function NT:M→V(T) with NT(Mi) defined as the node of the first occurrence of mutation Mi in clonal tree T. In other words, NT(Mi) = vj if and only if Mi ∈ LT(vj).
For tree inference based on single-cell data, the mutation tree model that does not cluster mutations into subclones is more natural. The mutation tree can be defined as a special case of a clonal tree where s = n. This obviously implies that |LT(vi)| = 1 for each non-root node vi from V(T). Furthermore, for each such node and without loss of generality, we may assume that LT(vi) = Mi.
Input data
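We assume that a set M = {M1, M2,…, Mn} of heterozygous somatic single-nucleotide variants from diploid regions of the genome is given. These mutations were observed in a tumour via sequencing of h bulk samples ℬ1, ℬ2,…, ℬh, and m single-cell samples c1, c2,…, cm.
For each mutation Mi and each bulk sample ℬj, the bulk sequencing provided the number of variant and total reads spanning the genomic position of Mi in ℬj, denoted, respectively, by rij and tij. Single-cell sequencing (SCS) provided the observed mutation profiles of cells as the column vectors of an n × m mutation matrix D defined as
$$D[i,j] = \begin{cases} 1, & \text{if mutation } M_i \text{ is observed to be present in cell } c_j,\\ 0, & \text{if mutation } M_i \text{ is observed to be absent in cell } c_j,\\ \text{NA}, & \text{if the observation is missing.} \end{cases} \tag{2}$$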
Tree scoring based on bulk-sequencing data
Assume that we are given a mutation tree T over n = s mutations, whose (s + 1) nodes represent the set of cellular subpopulations in the analysed tumour. For each bulk sample ℬj, our goal is to assign non-negative real values Φj = {ϕ1j, ϕ2j,…, ϕ(s+1)j} to the nodes of T such that
$$\sum_{i=1}^{s+1} \phi_{ij} = 1 \tag{3}$$
and the likelihood of the bulk-sequencing data (defined below) is maximised. Intuitively, ϕij represents the inferred fraction of the cellular population vi in the sample ℬj.
Consider an arbitrary mutation Mi. By the assumption made above, Mi occurs for the first time at node vi and is present at vi and all of its descendants (i.e., nodes vj such that AT[i, j] = 1). This implies that the inferred fraction yij of cells harbouring mutation Mi in sample ℬj is given by the formula
$$y_{ij} = \sum_{k=1}^{s} A_T[i,k]\,\phi_{kj} \tag{4}$$
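The relationship between subclone fractions and mutation fractions can be sketched in a few lines: the fraction of cells carrying Mi is the sum of the fractions of node vi and of all its descendants. The helper name `mutation_fractions`, the 0-based indexing and the example values are assumptions for illustration only.

```python
import math


def mutation_fractions(A, phi):
    """y[i] = sum over k of A[i][k] * phi[k] for one bulk sample.

    A:   s x s ancestor matrix (0-based, non-root nodes only).
    phi: fractions of the s non-root subclones in the sample; the remainder
         up to 1 is attributed to the healthy root.
    """
    s = len(A)
    return [sum(A[i][k] * phi[k] for k in range(s)) for i in range(s)]


# Tree: v1 below the root, v2 and v3 below v1, v4 below v2
A = [[1, 1, 1, 1],
     [0, 1, 0, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
phi = [0.1, 0.2, 0.3, 0.2]   # illustrative values; healthy cells make up 0.2
y = mutation_fractions(A, phi)
print([round(v, 3) for v in y])
```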
Alongside the tree constraints in Eqs. (3) and (4), we introduce a likelihood model for the bulk-sequencing data to allow us to combine it with the single-cell measurements.
Consider a genomic position P in a diploid genome region that has a (sub-)clonal heterozygous mutation M ∈ M in a tumour. Let y be the fraction of cells harbouring M in an arbitrary bulk sample. Assume that we sequence this bulk sample and obtain a total of t reads that span P, of which r support the variant M. Based on the composition of the considered bulk sample, the probability that a read supports M is y/2 (since M is heterozygous and from a diploid region). We assume that the sampled number of variant reads r follows a binomial distribution with parameters t (number of trials) and y/2 (success probability). For high coverage t, the binomial distribution can be approximated by the Gaussian distribution with mean ty/2 and standard deviation √(t(y/2)(1 − y/2)). The logarithm of the probability density is
$$\log P(r \mid t, y) \approx -\frac{(\hat{y} - y)^2}{2\sigma^2} - \log \sigma - \frac{1}{2}\log\frac{\pi t^2}{2}, \qquad \sigma = \sqrt{\frac{y(2-y)}{t}}, \tag{5}$$
where ŷ = 2r/t represents the bulk-sequencing data-based fraction of cells harbouring mutation M, based on the assumption that M is from a diploid region, and σ is the corresponding standard deviation on the scale of ŷ. Full details of derivation of Eq. (5) are provided in the Supplementary Methods.
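The quality of the Gaussian approximation can be checked numerically. The sketch below compares the exact binomial log-probability with the Gaussian log-density, written once in terms of r and once in terms of ŷ = 2r/t with σ² = y(2 − y)/t (a rescaling that follows from the mean and standard deviation stated above). Function names and the example numbers are illustrative assumptions.

```python
import math


def log_binom_pmf(r, t, p):
    """Exact log-probability of r successes in t trials with success rate p."""
    return (math.lgamma(t + 1) - math.lgamma(r + 1) - math.lgamma(t - r + 1)
            + r * math.log(p) + (t - r) * math.log(1 - p))


def log_gauss_approx(r, t, y):
    """Gaussian log-density with mean t*y/2 and sd sqrt(t*(y/2)*(1 - y/2))."""
    mean = t * y / 2
    sd = math.sqrt(t * (y / 2) * (1 - y / 2))
    return -((r - mean) ** 2) / (2 * sd ** 2) - math.log(sd) - 0.5 * math.log(2 * math.pi)


def log_gauss_yhat_form(r, t, y):
    """The same density rewritten with y_hat = 2r/t and sigma^2 = y(2 - y)/t."""
    y_hat = 2 * r / t
    sigma = math.sqrt(y * (2 - y) / t)
    return (-((y_hat - y) ** 2) / (2 * sigma ** 2) - math.log(sigma)
            - math.log(t / 2) - 0.5 * math.log(2 * math.pi))


r, t, y = 210, 1000, 0.4     # illustrative read counts and cell fraction
print(log_binom_pmf(r, t, y / 2))
print(log_gauss_approx(r, t, y))
print(log_gauss_yhat_form(r, t, y))
```

At this coverage the two Gaussian forms coincide and differ only slightly from the exact binomial value; the gap grows for low coverage or for y close to 0.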
For a given assignment of values yij, after discarding constant terms, the log-likelihood of the entire bulk data is then
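$$\sum_{i=1}^{n}\sum_{j=1}^{h}\left[-\frac{(\hat{y}_{ij} - y_{ij})^2}{2\sigma_{ij}^2} - \log \sigma_{ij}\right], \qquad \hat{y}_{ij} = \frac{2 r_{ij}}{t_{ij}}, \quad \sigma_{ij} = \sqrt{\frac{y_{ij}(2 - y_{ij})}{t_{ij}}}. \tag{6}$$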
Our goal is to maximise the likelihood over the latent variables yij, under the constraints Eqs. (3) and (4) imposed by the tree topology. To make this maximisation problem tractable by existing solvers, we use bulk data-derived frequencies to approximate the standard deviations
$$\sigma_{ij} \approx \hat{\sigma}_{ij} = \sqrt{\frac{\hat{y}_{ij}(2 - \hat{y}_{ij})}{t_{ij}}}, \qquad \hat{y}_{ij} = \frac{2 r_{ij}}{t_{ij}},$$
so that the log terms in Eq. (6) become constant and can be removed from the optimisation, while quadratic term coefficients become constants. Our problem then transforms to maximising the likelihood over the underlying frequencies. We therefore define the score of the bulk data to be
$$S_{\mathrm{bulk}}(T) = \max_{\{y_{ij}\}} \sum_{i=1}^{n}\sum_{j=1}^{h} -\frac{(\hat{y}_{ij} - y_{ij})^2}{2\hat{\sigma}_{ij}^2}$$
subject to the restrictions imposed in Eq. (3) and Eq. (4), where the sum involves the weighted quadratic terms from the Gaussian approximation. Obtaining the optimal values for yij represents an instance of a Quadratic Program (QP), which is readily solvable by existing commercial and free QP solvers. For our purposes, we have used IBM ILOG CPLEX Optimization Studio V12.5.
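To make the objective concrete, the sketch below evaluates the weighted quadratic terms for one bulk sample and one candidate assignment of subclone fractions. It does not solve the QP (a solver would maximise this value over all feasible fractions), and it assumes that the variance is fixed from the data-derived frequency ŷ = 2r/t with 0 < ŷ < 2. The function name and the numbers are illustrative.

```python
def bulk_objective(A, phi, reads):
    """Weighted quadratic objective for one bulk sample and one candidate phi.

    A:     s x s ancestor matrix (0-based, non-root nodes only).
    phi:   candidate non-negative fractions of the non-root subclones,
           with sum(phi) <= 1 (the remainder belongs to the healthy root).
    reads: list of (r_i, t_i) variant and total read counts per mutation.
    """
    s = len(A)
    total = 0.0
    for i, (r, t) in enumerate(reads):
        y = sum(A[i][k] * phi[k] for k in range(s))   # inferred cell fraction
        y_hat = 2 * r / t                              # data-derived cell fraction
        var_hat = y_hat * (2 - y_hat) / t              # assumes 0 < y_hat < 2
        total -= (y_hat - y) ** 2 / (2 * var_hat)
    return total


A = [[1, 1],
     [0, 1]]                                # linear tree: M1 above M2
reads = [(400, 1000), (100, 1000)]          # y_hat = 0.8 and 0.2
good = bulk_objective(A, [0.6, 0.2], reads)  # reproduces the observed fractions
bad = bulk_objective(A, [0.2, 0.6], reads)   # overestimates the fraction of M2
print(good, bad)
```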
Tree scoring based on single-cell data
For the tree scoring based on single-cell data, we need to assess how well the observed mutation states of the single cells match the subclones defined by T. Due to noise in the mutation matrix D, the single cells will likely fit to none of the s + 1 cell populations defined by T perfectly, even if the tree represents the true mutation history of the tumour. We account for this by using the probabilistic error model introduced in ref. 28: let the vector σ = (σ1, σ2,…, σm) define the attachments of the single cells to T, such that cj, the single cell corresponding to column j in D, attaches to node vσj. Then we expect cj to have the mutations assigned to the nodes vi that belong to the path from the root to vσj (i.e., all nodes vi such that AT[i, σj] = 1). The observational probabilities are then
$$\begin{aligned} P(D[i,j] = 0 \mid A_T[i,\sigma_j] = 0) &= 1 - \alpha, & P(D[i,j] = 1 \mid A_T[i,\sigma_j] = 0) &= \alpha,\\ P(D[i,j] = 0 \mid A_T[i,\sigma_j] = 1) &= \beta, & P(D[i,j] = 1 \mid A_T[i,\sigma_j] = 1) &= 1 - \beta, \end{aligned} \tag{7}$$
where α denotes the probability of observing a false positive and β denotes the probability of observing a false negative. In the following, the two error rates are summarised as θ = (α, β). Missing observations (D[i, j] = NA) are assigned probability 1 independent of the true state, so that they are neutral and do not contribute to the tree scoring:
$$P(D[i,j] = \text{NA} \mid A_T[i,\sigma_j] = 0) = P(D[i,j] = \text{NA} \mid A_T[i,\sigma_j] = 1) = 1. \tag{8}$$
Assuming that the observational errors are independent of each other, the likelihood of a given mutation tree T, sample attachment vector σ and θ is then
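$$P(D \mid T, \sigma, \theta) = \prod_{i=1}^{n}\prod_{j=1}^{m} P\big(D[i,j] \mid A_T[i,\sigma_j]\big). \tag{9}$$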
In the following, we marginalise out the sample attachments to focus on the mutation tree as the informative part of the mutation history, which is also more robust against noise than the location of individual samples in the tree. This gives us
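$$P(D \mid T, \theta) = \prod_{j=1}^{m}\sum_{\sigma_j = 1}^{n+1} P(\sigma_j \mid T, \theta)\prod_{i=1}^{n} P\big(D[i,j] \mid A_T[i,\sigma_j]\big), \tag{10}$$
where P(σj | T, θ) denotes the prior probability of the attachment point σj.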
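The marginalisation over attachment points can be sketched as follows. The code assumes a uniform prior over the n mutation nodes and the root (an assumption of this example), represents missing entries as `None`, and treats a cell attached to the root as carrying no mutations. It is a naive illustration rather than the B-SCITE implementation; for larger inputs one would work in log space.

```python
import math


def sc_log_likelihood(A, D, alpha, beta):
    """Log of the attachment-marginalised single-cell likelihood.

    A: n x n ancestor matrix (0-based, non-root nodes only).
    D: n x m matrix with entries 0, 1 or None (missing observation).
    alpha, beta: false-positive and false-negative rates.
    """
    n, m = len(A), len(D[0])

    def p_obs(d, e):
        if d is None:
            return 1.0                       # missing data are neutral
        if e == 1:
            return 1 - beta if d == 1 else beta
        return alpha if d == 1 else 1 - alpha

    total = 0.0
    for j in range(m):
        cell_sum = 0.0
        for sigma in range(n + 1):           # index n stands for the root
            prod = 1.0
            for i in range(n):
                expected = A[i][sigma] if sigma < n else 0
                prod *= p_obs(D[i][j], expected)
            cell_sum += prod
        total += math.log(cell_sum / (n + 1))   # uniform attachment prior
    return total


D = [[1, 1, 1, None],        # observations of M1 in four cells
     [0, 1, 0, 1]]           # observations of M2 in four cells
chain = [[1, 1], [0, 1]]     # M1 is the parent of M2
reverse = [[1, 0], [1, 1]]   # M2 is the parent of M1
print(sc_log_likelihood(chain, D, 0.01, 0.2))
print(sc_log_likelihood(reverse, D, 0.01, 0.2))
```

Cells with M1 but without M2 are well explained only when M1 precedes M2, so the first tree receives the higher score.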
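In practice, SCS data are often contaminated with doublet samples. Therefore, we treat each sample as a weighted mixture of a singlet and doublet sample (ref. 27). The marginalised likelihood then becomes
$$P(D \mid T, \theta, \delta) = \prod_{j=1}^{m}\left[(1-\delta)\sum_{\sigma_j} P(D_j \mid T, \sigma_j, \theta)\,P(\sigma_j \mid T, \theta) + \delta\sum_{\sigma_j}\sum_{\sigma_j'} P(D_j \mid T, \sigma_j, \sigma_j', \theta)\,P(\sigma_j, \sigma_j' \mid T, \theta)\right], \tag{11}$$
where δ is the probability of a sample being a doublet and Dj is the jth column of D, which represents the observed mutation states of cell j. The double sum over the attachment points σj and σj' creates all combinations of attachment pairs. This can be efficiently computed in time O(mn²) (ref. 27). Finally, to obtain a single-cell-based tree score that is on a comparable scale to the bulk score, we take the log of the marginalised likelihood
$$S_{\mathrm{SC}}(T, \theta) = \log P(D \mid T, \theta).$$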
Combined B-SCITE approach
To measure how well a candidate mutation tree T fits a combination of bulk and single-cell measurements, we use the joint log-likelihood score defined as
$$S_{\mathrm{joint}}(T, \theta) = S_{\mathrm{bulk}}(T) + S_{\mathrm{SC}}(T, \theta),$$
where Sbulk(T) is the bulk score and SSC(T, θ) is the single-cell score, i.e., the log of the marginalised single-cell likelihood.
Our goal is to find a combination (T,θ)* that maximises the above score:
$$(T, \theta)^{*} = \operatorname*{arg\,max}_{(T,\theta)} S_{\mathrm{joint}}(T, \theta).$$
The number of possible mutation trees is too large to allow for an exhaustive search. Therefore, we use the Markov chain Monte Carlo approach introduced in ref. 28 to search the joint (T, θ) space. In each step, it proposes a new state, which has either a new mutation tree T′ or a new parameter θ'. The proposal probability q(T′, θ'|T, θ) is determined by the neighbourhood size of the respective move type. A new tree T′ is obtained from T by (i) pruning and re-attaching a subtree, (ii) swapping two subtrees or (iii) exchanging the labels of two nodes (see ref. 28 for more details). The key difference to SCITE is that the probability of accepting a new state now depends not only on the observed single-cell mutation profiles but also on the bulk VAFs. Hence, the acceptance probability becomes
$$\rho = \min\left\{1,\ \frac{q(T, \theta \mid T', \theta')\,\exp\big(S_{\mathrm{joint}}(T', \theta')\big)}{q(T', \theta' \mid T, \theta)\,\exp\big(S_{\mathrm{joint}}(T, \theta)\big)}\right\}.$$
Note that Sbulk(T) does not depend on θ. Therefore, it need not be recomputed to obtain Sjoint(T, θ′) after a θ-move.
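A single acceptance decision of such a sampler can be sketched with the standard Metropolis-Hastings rule applied to log scores. The function below is a generic illustration, not the B-SCITE code: the name `accept`, its parameters and the numeric values are assumptions, with `q_forward` standing for q(T′, θ′|T, θ) and `q_backward` for q(T, θ|T′, θ′).

```python
import math
import random


def accept(score_new, score_old, q_forward, q_backward, rng=random):
    """Metropolis-Hastings acceptance decision on joint log scores.

    score_new, score_old: joint log scores of the proposed and current state.
    q_forward:  probability of proposing the new state from the current one.
    q_backward: probability of proposing the current state from the new one.
    """
    log_ratio = (score_new - score_old) + math.log(q_backward) - math.log(q_forward)
    if log_ratio >= 0:
        return True                       # improvements are always accepted
    return rng.random() < math.exp(log_ratio)


rng = random.Random(0)
print(accept(-10.0, -20.0, 0.5, 0.5, rng))    # better score: accepted
print(accept(-1000.0, 0.0, 0.5, 0.5, rng))    # far worse score: rejected
```

Working with the difference of log scores avoids exponentiating large negative values, and after a θ-move only the single-cell part of the score has to be recomputed.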
Compression of mutation trees into clonal trees
To compare the mutation tree T inferred by B-SCITE with clonal trees inferred in OncoNEM or mutational clusters inferred in ddClone, we first perform clustering of mutations in T in order to identify clones. Mutations placed on different branches in T are expected to evolve in divergent clones; hence, we assume that they are not clustered together into the same clone. Furthermore, each of the branching nodes of T (i.e., nodes having more than one child) represents the emergence of new clonal populations in the subtrees formed by its descendants. This motivated us to first identify all linear chains in T, which do not contain any branching node (except for the end nodes). Then, for each such chain, we discard its end node closest to the root and cluster the remaining mutations along the linear chain.
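The chain decomposition described above can be sketched as a simple tree traversal. The code below assumes the tree is given as a dictionary from each node to its list of children (a hypothetical representation) and returns every chain with its end node closest to the root already discarded; the subsequent clustering of mutations within each chain is not shown.

```python
def linear_chains(children, root):
    """Split a mutation tree into linear chains.

    children: dict mapping a node to the list of its child nodes.
    root:     the root node (healthy population).
    Each chain starts just below the root or a branching node and runs down
    until it reaches a leaf or the next branching node (which is included).
    """
    chains = []
    stack = [root]                    # chain tops: the root and branching nodes
    while stack:
        top = stack.pop()
        for child in children.get(top, []):
            chain = [child]
            node = child
            while len(children.get(node, [])) == 1:
                node = children[node][0]
                chain.append(node)
            chains.append(chain)
            if len(children.get(node, [])) > 1:
                stack.append(node)    # the chain ended at a branching node
    return chains


tree = {"root": ["M1"], "M1": ["M2"], "M2": ["M3", "M4"], "M3": ["M5"]}
print(linear_chains(tree, "root"))
```

In this example M2 is a branching node, so the tree splits into the chains M1, M2 (below the root), M3, M5 and M4 (both below M2), and every mutation belongs to exactly one chain.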
For the clustering, we employ a 1D Gaussian mixture model for the cellular frequencies yi of the mutations in the chain. The standard deviation of each component is fixed by its mean through the binomial approximation according to Eq. (5). Therefore, only the mean of each component and the assignment of mutations to components need to be inferred, for which we employ the EM algorithm. The optimal number of mixture components is selected according to the Akaike information criterion.
As clustering is performed based on the inferred mutation frequencies, which decrease down the chain, only subsets of consecutive mutations can get clustered together. For each cluster of the mutations obtained in the clustering step, we merge their corresponding nodes into a single node and assign all mutations from the cluster to this node. The whole procedure results in the clonal tree of tumour evolution.
With multiple bulk samples, the mixture components and the assignment of mutations to them are shared across the samples, and we simply allow each component to have a different mean for each sample.
Estimating the relative contribution of bulk and SCS data
To estimate the range of the single-cell and bulk-based scores, we define a baseline tree, , with a single clone containing all mutations, which are then present in all single cells. The range of the log scores of each data type is then
allowing us to define an estimate of the relative weight of the bulk data to the single cell as
This estimate can inform the weighting parameter ω to balance the two data types:
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Acknowledgements
S.M. was supported by a Vanier Canada Graduate Scholarship and NSERC CREATE (139277) fellowship. K.J. was supported by SystemsX.ch RTD Grant 2013/150 (http://www.systemsx.ch/). J.K. was supported by ERC Synergy Grant 609883 (http://erc.europa.eu/). S.C.S. was supported by NSF Grant CCF-1619081, NIH grant GM108348 and the Indiana University Grand Challenges Program, Precision Health Initiative. The authors would like to thank Jochen Singer and Farid Rashidi Mehrabadi for the help in mutation calling for TNBC data.
Author contributions
S.M., K.J. and J.K. designed and implemented the method. S.M., K.J., J.K., C.S. and N.B. conceived the project and wrote the paper. All authors read and approved the final paper.
Data availability
The human sequencing data sets utilised in this study are available for download from the Sequence Read Archive with the accession numbers: SRP044380 (for ALL patients), SRA053195 (for TNBC patients) and SRP074289 (for CRC patients).
Code availability
B-SCITE has been implemented in C++ and is freely available under a GNU General Public License v3.0 at https://github.com/smalikic/B-SCITE.
Competing interests
The authors declare no competing interests.
Footnotes
Peer review information: Nature Communications thanks Mohammed El-Kebir and Luay Nakhleh for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Salem Malikic, Katharina Jahn, Jack Kuipers.
These authors jointly supervised this work: S. Cenk Sahinalp, Niko Beerenwinkel.
Contributor Information
S. Cenk Sahinalp, Email: cenksahi@indiana.edu.
Niko Beerenwinkel, Email: niko.beerenwinkel@bsse.ethz.ch.
Supplementary information
Supplementary Information accompanies this paper at 10.1038/s41467-019-10737-5.
References
- 1.Nowell PC. The clonal evolution of tumor cell populations. Science. 1976;194:23–28. doi: 10.1126/science.959840.
- 2.Greaves M. Evolutionary determinants of cancer. Cancer Discov. 2015;5:806–820. doi: 10.1158/2159-8290.CD-15-0439.
- 3.Burrell RA, Swanton C. Tumour heterogeneity and the evolution of polyclonal drug resistance. Mol. Oncol. 2014;8:1095–1111. doi: 10.1016/j.molonc.2014.06.005.
- 4.Schwartz R, Schäffer AA. The evolution of tumour phylogenetics: principles and practice. Nat. Rev. Genet. 2017;18:213–229. doi: 10.1038/nrg.2016.170.
- 5.Strino F, Parisi F, Micsinai M, Kluger Y. Trap: a tree approach for fingerprinting subclonal tumor composition. Nucleic Acids Res. 2013;41:e165–e165. doi: 10.1093/nar/gkt641.
- 6.Oesper L, Mahmoody A, Raphael BJ. Theta: inferring intra-tumor heterogeneity from high-throughput dna sequencing data. Genome Biol. 2013;14:R80. doi: 10.1186/gb-2013-14-7-r80.
- 7.Oesper L, Satas G, Raphael BJ. Quantifying tumor heterogeneity in whole-genome and whole-exome sequencing data. Bioinformatics. 2014;30:3532–3540. doi: 10.1093/bioinformatics/btu651.
- 8.Roth A, et al. Pyclone: statistical inference of clonal population structure in cancer. Nat. Methods. 2014;11:396–398. doi: 10.1038/nmeth.2883.
- 9.Jiao W, Vembu S, Deshwar AG, Stein L, Morris Q. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinforma. 2014;15:35. doi: 10.1186/1471-2105-15-35.
- 10.Sengupta S, et al. Bayclone: Bayesian nonparametric inference of tumor subclones using ngs data. In Pacific Symposium on Biocomputing Co-Chairs 467–478 (World Scientific, Singapore, 2014).
- 11.Fischer A, Vázquez-García I, Illingworth CJR, Mustonen V. High-definition reconstruction of clonal composition in cancer. Cell Rep. 2014;7:1740–1752. doi: 10.1016/j.celrep.2014.04.055.
- 12.Ha G, et al. Titan: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data. Genome Res. 2014;24:1881–1893. doi: 10.1101/gr.180281.114.
- 13.Zare H, et al. Inferring clonal composition from multiple sections of a breast cancer. PLoS Comput. Biol. 2014;10:e1003703. doi: 10.1371/journal.pcbi.1003703.
- 14.Hajirasouliha I, Mahmoody A, Raphael BJ. A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data. Bioinformatics. 2014;30:i78–i86. doi: 10.1093/bioinformatics/btu284.
- 15.Miller CA, et al. Sciclone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS Comput. Biol. 2014;10:e1003665. doi: 10.1371/journal.pcbi.1003665.
- 16.Popic V, et al. Fast and scalable inference of multi-sample cancer lineages. Genome Biol. 2015;16:91. doi: 10.1186/s13059-015-0647-8.
- 17.El-Kebir M, Oesper L, Acheson-Field H, Raphael BJ. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics. 2015;31:i62–i70. doi: 10.1093/bioinformatics/btv261.
- 18.Deshwar AG, et al. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 2015;16:35. doi: 10.1186/s13059-015-0602-8.
- 19.Malikic S, McPherson AW, Donmez N, Sahinalp SC. Clonality inference in multiple tumor samples using phylogeny. Bioinformatics. 2015;31:1349–1356. doi: 10.1093/bioinformatics/btv003.
- 20.Donmez N, et al. Clonality inference from single tumor samples using low-coverage sequence data. J. Comput. Biol. 2017;24:515–523. doi: 10.1089/cmb.2016.0148.
- 21.Marass F, Mouliere F, Yuan K, Rosenfeld N, Markowetz F. A phylogenetic latent feature model for clonal deconvolution. Ann. Appl. Stat. 2016;10:2377–2404. doi: 10.1214/16-AOAS986.
- 22.El-Kebir M, Satas G, Oesper L, Raphael BJ. Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures. Cell Syst. 2016;3:43–53. doi: 10.1016/j.cels.2016.07.004.
- 23.Satas G, Raphael BJ. Tumor phylogeny inference using tree-constrained importance sampling. Bioinformatics. 2017;33:i152–i160. doi: 10.1093/bioinformatics/btx270.
- 24.Hu Z, Sun R, Curtis C. A population genetics perspective on the determinants of intra-tumor heterogeneity. BBA Rev. Cancer. 2017;1867:109–126. doi: 10.1016/j.bbcan.2017.03.001.
- 25.Kuipers J, Jahn K, Beerenwinkel N. Advances in understanding tumour evolution through single-cell sequencing. BBA Rev. Cancer. 2017;1867:127–138. doi: 10.1016/j.bbcan.2017.02.001.
- 26.Cibulskis K, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 2013;31:213–219. doi: 10.1038/nbt.2514.
- 27.Kuipers J, Jahn K, Raphael BJ, Beerenwinkel N. Single-cell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors. Genome Res. 2017;27:1885–1894.
- 28.Jahn K, Kuipers J, Beerenwinkel N. Tree inference for single-cell data. Genome Biol. 2016;17:86. doi: 10.1186/s13059-016-0936-x.
- 29.Ross E, Markowetz F. OncoNEM: inferring tumour evolution from single-cell sequencing data. Genome Biol. 2016;17:69. doi: 10.1186/s13059-016-0929-9.
- 30.Kim K, Simon R. Using single cell sequencing data to model the evolutionary history of a tumor. BMC Bioinforma. 2014;15:27. doi: 10.1186/1471-2105-15-27.
- 31.Yuan K, Sakoparnig T, Markowetz F, Beerenwinkel N. BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies. Genome Biol. 2015;16:36. doi: 10.1186/s13059-015-0592-6.
- 32.Zafar H, Tzen A, Navin N, Chen K, Nakhleh L. SiFit: a method for inferring tumor trees from single-cell sequencing data under finite-site models. Genome Biol. 2017;18:178. doi: 10.1186/s13059-017-1311-2.
- 33.Davis A, Navin NE. Computing tumor trees from single cells. Genome Biol. 2016;17:113. doi: 10.1186/s13059-016-0987-z.
- 34.Salehi S, et al. ddClone: joint statistical inference of clonal populations from single cell and bulk tumour sequencing data. Genome Biol. 2017;18:44. doi: 10.1186/s13059-017-1169-3.
- 35.Wickham H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, New York, 2016).
- 36.Rosenberg A, Hirschberg J. V-measure: a conditional entropy-based external cluster evaluation measure. EMNLP-CoNLL. 2007;7:410–420.
- 37.Gawad C, Koh W, Quake SR. Dissecting the clonal origins of childhood acute lymphoblastic leukemia by single-cell genomics. Proc. Natl Acad. Sci. USA. 2014;111:17947–17952. doi: 10.1073/pnas.1420822111.
- 38.Wang Y, et al. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature. 2014;512:155. doi: 10.1038/nature13600.
- 39.Leung ML, et al. Single-cell DNA sequencing reveals a late-dissemination model in metastatic colorectal cancer. Genome Res. 2017;27:1287–1299. doi: 10.1101/gr.209973.116.