Scuphr: A probabilistic framework for cell lineage tree reconstruction

Hazal Koptagel; Seong-Hwan Jun; Joanna Hård; Jens Lagergren

doi:10.1371/journal.pcbi.1012094

. 2024 May 9;20(5):e1012094. doi: 10.1371/journal.pcbi.1012094

Scuphr: A probabilistic framework for cell lineage tree reconstruction

Hazal Koptagel ^1,², Seong-Hwan Jun ³, Joanna Hård ⁴, Jens Lagergren ^1,^2,^*

Editor: Qing Nie⁵

PMCID: PMC11125557 PMID: 38723024

Abstract

Cell lineage tree reconstruction methods are developed for various tasks, such as investigating the development, differentiation, and cancer progression. Single-cell sequencing technologies enable more thorough analysis with higher resolution. We present Scuphr, a distance-based cell lineage tree reconstruction method using bulk and single-cell DNA sequencing data from healthy tissues. Common challenges of single-cell DNA sequencing, such as allelic dropouts and amplification errors, are included in Scuphr. Scuphr computes the distance between cell pairs and reconstructs the lineage tree using the neighbor-joining algorithm. With its embarrassingly parallel design, Scuphr can do faster analysis than the state-of-the-art methods while obtaining better accuracy. The method’s robustness is investigated using various synthetic datasets and a biological dataset of 18 cells.

Author summary

Cell lineage tree reconstruction carries a significant potential for studies of development and medicine. The lineage tree reconstruction task is especially challenging for cells taken from healthy tissue due to the scarcity of mutations. In addition, the single-cell whole-genome sequencing technology introduces artifacts such as amplification errors, allelic dropouts, and sequencing errors. We propose Scuphr, a probabilistic framework to reconstruct cell lineage trees. We designed Scuphr for single-cell DNA sequencing data; it accounts for technological artifacts in its graphical model and uses germline heterozygous sites to improve its accuracy. Scuphr is embarrassingly parallel; the speed of the computational analysis is inversely proportional to the number of available computational nodes. We demonstrated that Scuphr is fast, robust, and more accurate than the state-of-the-art method with the synthetic data experiments. Moreover, in the biological data experiment, we showed Scuphr successfully identifies different clones and further obtains more support on closely related cells within clones.

Introduction

Reconstructing cell lineage trees, from single-cell data, for healthy tissue is a fundamental computational problem with enormous potential for studies of development and differentiation [1–6]. There are two related reconstruction problems for cancer tumors: reconstruction of clonal trees and the reconstruction of tumor phylogenies from single cells. Several data types, e.g., bulk DNA, single-cell DNA, and single-cell RNA, have been used for the latter two problems [7–11]. All these reconstruction methods exploit mutations and attempt to reconstruct trees in which the proximity between a pair of cells, or clones, is correlated with the similarity between their patterns of mutations. The somatic mutation rate in humans is 10⁻⁹ per locus per cell division [12], and the copy number is not considered to carry substantial information regarding cell lineage membership in healthy tissue. Therefore, mutations are scarce when reconstructing lineage trees for healthy tissues, implying that more sophisticated models and computational methods are needed to capitalize fully on the existing mutations. This scarcity also highlights the need for single-cell DNA sequencing (scDNA-seq) data since it reveals more point mutations than any other current data type [13].

Regardless of its potential to reveal mutations, scDNA-seq data comes with its challenges [14–18]. Due to the small amount of genomic data available in a single cell, the genome needs to be amplified before sequencing [19]. Unfortunately, the whole-genome amplification methods, such as the multiple displacement amplification (MDA) method [20] and the multiple annealing and looping-based amplification cycles (MALBAC) method [21], introduce technical artifacts known as amplification errors (AEs) that are hard to distinguish from mutations. Moreover, so-called allelic dropout (ADO) events remain even after the amplification. In addition, the subsequent sequencing of the amplified materials introduces sequencing errors [22–25].

There have been several methods explicitly made for scDNA-seq data, both for identifying mutations (single nucleotide variant (SNV) callers) and for reconstructing cell lineage trees, although several of them are targeting cancer data. Monovar [26] is an SNV caller designed specifically for scDNA-seq data; for each position, it models the ADO with a Bernoulli distribution, the AEs with independent and identically distributed (i.i.d.) Bernoulli random variables and base-calling error probabilities depend on Phred quality scores, [27, 28], while utilizing dynamic programming. LiRA [29] and Conbase [30] are scDNA-seq SNV callers that leverage read-phasing, while the latter does variant calling based on the population of single cells.

There has been a sequence of single-cell tree reconstruction methods targeting cancer data, [31–36], leading up to the SCIΦ method [37]. Interestingly, for the cancer case, the infinite sites assumption (ISA, [38–40]) may be violated due to segmental deletions. However, for healthy tissue, the ISA is an appropriate assumption. So, since SCIΦ is based on ISA, it is also relevant to the analysis of healthy tissue. SCIΦ has a probabilistic model that allows joint SNV calling and tree reconstruction using the Markov chain Monte Carlo (MCMC) method. More recently, the Phylovar [41] method was shown to handle millions of loci and be faster than SCIΦ while having similar accuracy by taking advantage of the efficient vectorized computations.

Method overview

Scuphr is a distance-based phylogenetic inference method that reconstructs cell lineage trees from scDNA-seq data, produced by experimental procedures that amplify the cells’ genomes, using amplification methods such as the MDA method [20] and the MALBAC method [21]. Analyses of this data type need to distinguish somatic mutations from sequencing errors and nucleotide substitutions caused by amplification. Therefore, Scuphr relies on a probabilistic model of read-phasing and these two error sources. Read-phasing is a technique applied to identify which allele the read comes from, which is used to distinguish if the read could be from a mutated or a non-mutated segment. We first describe our model without read-phasing and then introduce the details of the read-phasing.

The amplification process is modeled as a generalized Pólya urn process, in which a drawn ball with an error probability is replaced by one of the same color and one of another, in contrast to the ubiquitous replacement by two of the same color in the Pólya urn. The observed Phred scores define the base-calling error probabilities. The model also contains a probability for the ADO events. Another vital part of Scuphr is a dynamic programming-based inference algorithm that, based on the error model, computes the probability that two cells have different genotypes at any investigated potential mutation site.

Scuphr processes the scDNA-seq data using a site selection method that identifies the candidate sites that will be analyzed using the probabilistic model and contribute to the distance. The distance is obtained by combining the probabilities of different genotypes across the selected sites for each pair of cells. Finally, this distance is used as an input to a distance-based phylogenetic method, the neighbor-joining (NJ) algorithm [42].

The summary of the Scuphr workflow is shown in Fig 1. The input to Scuphr consists of bulk and single-cell DNA reads. First, candidate mutation sites are detected using the site selection method, Fig 1a. These candidate mutation sites can consist of a single base pair, like in most state-of-the-art methods, or two base pairs where the candidate mutation site is accompanied by a nearby germline SNV (gSNV). We call these site types singleton sites and paired sites, respectively. These site types will be referred to as sites throughout the paper. Second, site-associated distance matrices are calculated in parallel for each selected site, Fig 1b. Third, a single distance matrix is obtained by combining the site-associated distance matrices, Fig 1c. Finally, the cell lineage tree is reconstructed by applying the NJ algorithm to the final distance matrix, Fig 1d. In addition, the site-associated distance matrices can be sampled with replacement several times to obtain bootstrap lineage tree samples, which could be used to get consensus trees and edge supports.

The read-phasing assists in identifying missing data and separating somatic mutations and errors using patterns of co-occurrence of the nucleotides at the gSNV loci and the candidate sites. Fig 1e shows an example of a non-mutated and a mutated cell’s genome. Each cell has two alleles; one maternal and one paternal. The first locus is a gSNV, where the nucleotides at the first and second allele differ. The second locus is the candidate site, where the non-mutated cell has the reference nucleotide in both alleles, and the mutated cell has a reference and an alternate nucleotide. The non-mutated and mutated cells have the same second allele, and their difference is due to the mutation located at the first allele. The mutation is associated with the blue gSNV nucleotide. In Fig 1f, a read with the candidate site is shown. One cannot decide if this read comes from a non-mutated or a mutated genome since no information is available for the gSNV locus. Therefore, it cannot be attributed to any of the alleles. In Fig 1g, a read from the site pair is observed. Since both nucleotides are observed, and the blue gSNV nucleotide accompanies the candidate reference nucleotide, one can conclude that the read comes from the non-mutated genome. However, in Fig 1h, the reference nucleotide is accompanied by the pink gSNV nucleotide, which could come from either of the genomes.

Results

The two state-of-the-art methods, SCIΦ [37] and Phylovar [41], can exploit amplified, diploid scDNA-seq data. We compared the performance of Scuphr with SCIΦ in terms of cell lineage tree reconstruction accuracy and runtime in synthetic data experiments. For the biological data experiment, we compared the performance of Scuphr to both SCIΦ and Phylovar.

Synthetic dataset experiments

In order to compare the performances of the methods in a controlled way, we synthetically generated datasets. We created the ground truth cell lineage trees, and assigned mutations to the edges of the tree; all cells under the edge inherit the mutation. We generated the genomes of the cells and simulated the amplification and read sequencing processes. We accounted for various amplification, allelic dropout, and sequencing errors, as well as different levels of gSNVs within the genome. The details of the synthetic data generation process are presented in Methods. We used the synthetic datasets to compare the methods in two ways; the lineage tree reconstruction accuracy and runtime.

Accuracy evaluation on synthetic datasets

To evaluate the performance of cell lineage tree reconstruction, we compared the topologies of the trees inferred by Scuphr and SCIΦ with the ground truth cell lineage tree. We use a similarity measure defined as one minus the normalized Robinson-Foulds (RF) distance. The similarity score is in [0, 1], where 1 means the tree topologies are the same. We investigated the accuracy across several combinations of AE rates, ADO rates, frequencies of loci with paired sites, and the number of cells. We investigated two AE rates: 10⁻⁵, see Fig 2 and 10⁻³, see Fig 3. First, for each choice of other parameters, the frequencies of loci with paired sites considered were 0.001, 0.01, 0.1, and 1. Second, for each choice of the other parameters, the ADO probabilities considered were 0, 0.1, and 0.2. Third, all parameter configurations were investigated for inputs having 10, 20, and 50 cells.

Both methods perform well for the lower error rate 10⁻⁵, Fig 2. However, with a few exceptions, Scuphr has a higher mean accuracy, and other quantiles are higher for Scuphr than for SCIΦ. This trend is accentuated for the case where the frequency of loci with paired sites is 1.

For the higher error rate of 10⁻³, Fig 3 shows a readily noticeable difference between the two methods. The average accuracy for Scuphr is always better than that of SCIΦ, and other quantiles are, in almost all cases, higher for Scuphr than for SCIΦ. In particular, when all of the loci are paired, Scuphr exploits the paired sites for read-phasing, but SCIΦ has more or less the same average accuracy for lower frequencies of paired sites. In this case, the average accuracy of Scuphr is almost twice as high as that of SCIΦ.

Interestingly, when SCIΦ is provided with the sites selected by our candidate site selection method, its accuracy is improved in several cases. However, its accuracy is also, in many instances, decreased. Moreover, for the biological data, it is, due to the number of sites, that it takes an even longer time to run SCIΦ with the sites selected by our site selection method. Therefore, in Figs 2 and 3, we presented the results of the entire SCIΦ method, as described in [37]. The accuracy obtained when SCIΦ is provided with the site selected by our candidate site selection method is described in the S1 Appendix.

Runtime analysis on synthetic datasets

In addition to the lineage tree reconstruction accuracy, we also compared the wall-clock runtime of Scuphr to that of SCIΦ. All runtime experiments were performed on a single cluster node with 32 CPU cores, and each configuration was repeated ten times. As Scuphr can be used with default and estimated parameters, the runtime analysis for the parameter estimation step was performed separately, and the results are presented in the S1 Appendix. We ran both methods with the same sites to compare the runtimes. The time used to obtain these sites is excluded from the runtimes reported here. We run SCIΦ with both single and multiple cores. The runtimes were very similar across the number of cores, and in this section, the single-core runtimes are presented for SCIΦ. For the multiple core runs, we direct the reader to the S1 Appendix.

Fig 4 shows how the runtime changes across fractions of singleton sites and the number of cores. The left, center, and right subfigures display the runtime analysis for 10, 20, and 50 cells, respectively. Since each distance matrix is calculated independently, the main part of our proposed algorithm is embarrassingly parallel, i.e., the wall-clock runtime scales linearly with the number of available cores, as seen in the figure. Additionally, the algorithm’s runtime is linear in the number of sites. For 10 and 20 cells, our software infers the lineage tree faster than SCIΦ. Our software is faster for singleton sites and 50 cells when at least two cores are used for computation.

The wall-clock runtime comparisons for paired sites are shown in Fig 5. Also, in this case, the left, center, and right subfigures correspond to 10, 20, and 50 cells, respectively. Our paired site analysis is slower than the singleton site analysis due to the number of fragment types considered. Nonetheless, the method has the same scalability trend as in the singleton site experiments. Our method is faster than SCIΦ for the 10 cells case when at least eight cores are used and for the 20 and 50 cells datasets when at least 16 cores are used.

These runtime analyses were performed on a single cluster node. Nevertheless, one can use multiple nodes to compute the distance matrices without communication overhead. Consequently, Scuphr achieves a linear speedup in the total number of available cores on a cluster. The final step of Scuphr, the lineage tree reconstruction, runs on a single core; however, this step is very efficient and does not change the overall asymptotic runtime.

Biological data analysis

In this section, the accuracy of Scuphr SCIΦ, and Phylovar are compared using a fibroblast dataset previously used in [30]. The dataset used in this study is a slightly modified version of the dataset in [30]. For details, see S1 Appendix. The dataset consists of scDNA-seq data for 18 cells with a recent common origin and a known lineage tree topology. This single-cell DNA data has been obtained by amplifying the DNA using MALBAC before the sequencing. These cells are so closely related that very few mutations distinguish them; hence, it is very hard to reconstruct the true lineage tree topology. The cells belong to two main monophyletic groups, one containing cells 0–11 and one containing cells 12–17. The dataset also includes a bulk DNA sample from the donor, which can be used as an outgroup.

The data were preprocessed using the pipeline described in S1 Appendix, and the sites of interest were identified as described in Methods. More than 3 million sites were selected for analysis; the details are presented in S1 Appendix.

Since the fibroblast dataset is so hard that a reconstruction method would at most identify the two main monophyletic groups correctly, we devised a test based on bootstrapping, using the transfer bootstrap expectation (TBE) [43] edge supports. 100 bootstrap lineage trees were constructed by bootstrapping sites (hence, bootstrapping the distance matrices). The TBE supports of the bootstrapped trees on the true lineage tree topology were computed with the Booster software [43]. The TBE support of a branch b is in the [0, 1] range where “0 means that the bootstrap trees contain the edge b in a random fashion and 1 means that b appears in all bootstrap trees” [43].

Scuphr separates the two main monophyletic groups with very high TBE support, 0.8, Fig 6a. Also, all branches within two smaller monophyletic groups, cells 4-5 and cells 10-11, respectively, are correctly inferred in all bootstrap rounds.

To compare accuracy, we also applied SCIΦ to the fibroblast data (Fig 6b). Due to the input format requirement (bulk and single-cell sequencing data in Mpileup format), we could not run SCIΦ on the whole genome in a single run. Instead, we applied SCIΦ independently to each chromosome, see S1 Appendix for details. We sampled 100 bootstrap trees from the lineage trees reported by SCIΦ for the chromosomes; the trees were weighted by the number of identified mutations per chromosome. The TBE supports are evaluated on the true lineage tree topology (Fig 6b). SCIΦ got the same TBE supports for a subset of branches (separation of two clones and the subclone consisting of cells 3, 9, 10, and 11). SCIΦ reported higher support for a single branch, 0.28, than Scuphr, 0.17. For the rest of the branches, Scuphr reported higher support or equal support to that reported by SCIΦ. In the monophyletic group consisting of cells 0, 1, 4, 5, and 6, in contrast to SCIΦ, Scuphr inferred substantial subclonal structure.

Finally, we compared the performance of Scuphr against Phylovar on the biological data in Fig 7. Phylovar uses the same candidate site selection scheme as SCIΦ, hence using the same sites for analysis. Following the software documentation, we extracted the sites picked by SCIΦ and created a new Mpileup file, which enabled us to run Phylovar on a cluster node. We ran Phylovar 5 times, each with 20 hill-climbing chains on a cluster node of 32 cores, and obtained 100 trees for analysis. For experimental details, see S1 Appendix.

As with the other methods, Phylovar successfully separated two clones. Although Phylovar was able to infer the relation of subclone consisting of cells 0, 1, 4, 5, and 6 better than SCIΦ; its support scores are lower than Scuphr, Fig 7b. Moreover, Phylovar was unable to identify the sibling relation of cells 4 and 5, which was supported by all Scuphr trees. Phylovar obtained similar support scores for the remainder of the blue clone as the other methods. In the beige clone, Phylovar had high support for a sibling relation (cells 12 and 13), which was not supported by either Scuphr or SCIΦ.

Materials and methods

In this section, we describe the proposed model step-by-step. First, we present the probabilistic graphical model of Scuphr in detail and outline important components and formulas. Second, we describe how to reconstruct the cell lineage tree from the outcome of the first part. Third, we describe the candidate loci selection criteria in detail. Fourth, we show how the model parameters are estimated. Fifth, the simulated data generation procedure is presented. Finally, two accuracy metrics, the similarity score, and the TBE support, used in the study are described.

The probabilistic model

First, we introduce some important concepts. Recall that, as most state-of-the-art methods do, a set of candidate mutation loci is used for analysis. We call a candidate locus that covers a single base pair a singleton site. Moreover, Scuphr can facilitate gSNVs near candidate loci and do read-phasing. A candidate mutation locus paired with a gSNV site is called a paired site. We will refer to all site types as sites throughout the Methods section for brevity. Unless otherwise stated, the model descriptions hold for all site types. Let Π be the set of sites selected for analysis and C be the number of single cells.

The probabilistic graphical model

Fig 8 shows the probabilistic graphical model of Scuphr at π ∈ Π. a, b, and α are the model hyperparameters. p_ado, p_ae, and p_m are the model’s parameters and correspond to the ADO, AE, and mutation probabilities. A set of reads, their corresponding base-calling error probabilities, and the coverage of each cell c are observed and represented by R_c, Q_c, and L_c. Moreover, the bulk genotype, B, is observed. The single-cell mutation status is represented with G_c; G_c, B, and the common mutation type random variable, Z, define the single-cell genotype, X_c. The aforementioned scDNA-seq specific challenges are modeled with $D_{c}^{1}, D_{c}^{2}$ , and A_c random variables; $D_{c}^{1}$ and $D_{c}^{2}$ model the ADO events of each allele, and A_c represents the number of errors that have happened during amplification. The fragment types generated at the end of the amplification process and their counts are expressed by F_c and N_c, respectively.

The graphical model consists of cell lineage, DNA amplification, and read sequencing.

Cell lineage: At π, all the non-mutated cells have the bulk genotype, and all the mutated cells must share the same mutation type under the ISA. This shared mutation type is modeled with a Dirichlet-Categorical distribution with the hyperparameter α. The cells’ mutation statuses are modeled with i.i.d. Bernoulli random variables with a mutation probability, p_m. The mutation probability has a Beta prior distribution with hyperparameters a and b. The mutation status random variable, bulk, and mutation type variables define the genotype of the cell, X_c.
DNA amplification: Here, we model the ADO events of each allele with Bernoulli random variables, with the same ADO probability p_ado. The number of AEs that have happened during the amplification is modeled with a Binomial random variable with probability p_ae, and its number of trials depends on the ADO random variables and the observed read coverage. These ADO and AE random variables, the observed read coverage, and the single-cell genotype form the amplified fragments, F_c and their corresponding counts, N_c.
Read sequencing: Finally, the amplified fragments are sequenced and create observed reads. Since the read sequencing is an erroneous process, the observed Phred scores are used to obtain the base-calling error probabilities, Q_c, and the uncertainty of read sequencing is modeled.

We briefly discussed the graphical model. More details of its components are presented in the following parts of this section. Fig 9 illustrates the cell lineage, DNA amplification, and sequencing steps described above. For simplicity, the site superscript π is omitted.

Distance matrix

For each selected site, π ∈ Π, we construct a symmetric nonnegative (C + 1) × (C + 1) distance matrix M^π. M_c,c′ is the distance between single-cells c and c′, computed by

M_{c, c^{'}} = \frac{\sum_{π} M_{c, c^{'}}^{π} \times 1 [L_{c}^{π} > 0, L_{c^{'}}^{π} > 0]}{\sum_{π} 1 [L_{c}^{π} > 0, L_{c^{'}}^{π} > 0]}, \forall (c, c^{'}) \in [C],

(1)

where $1$ is the indicator function, and $L_{c}^{π}$ is the coverage of c at π. Only the sites where both cells have coverage are considered during distance computation. The matrix’s last row and column are the distances between the cell and the non-mutated bulk, b, computed by

M_{c, b} = \frac{\sum_{π} M_{c, b}^{π} \times 1 [L_{c}^{π} > 0]}{\sum_{π} 1 [L_{c}^{π} > 0]}, \forall c \in [C] .

(2)

Distance between two single-cells at π

The binary random variable $G_{c}^{π}$ represents the mutation status of single-cell c at π: $G_{c}^{π} = 0$ if the cell is not mutated and $G_{c}^{π} = 1$ if the cell is mutated. The distance between two single-cells at π is

\begin{matrix} \begin{matrix} M_{c, c^{'}}^{π} & = P (G_{c} = 0, G_{c^{'}} = 1 | B, R_{1 : C}, Q_{1 : C}, L_{1 : C}, Θ) \\ + P (G_{c} = 1, G_{c^{'}} = 0 | B, R_{1 : C}, Q_{1 : C}, L_{1 : C}, Θ), \end{matrix} \end{matrix}

(3)

where B is the bulk genotype, Θ = {α, a, b, p_ado, p_ae} is the set of model parameters and hyperparameters, and R_1:C, Q_1:C, and L_1:C are the observed reads, base-calling error probabilities, and read coverages of single-cells. The distance to non-mutated bulk is simply

M_{c, b} = \sum_{i = 0}^{1} P (G_{c} = 1, G_{c^{'}} = i | B, R_{1 : C}, Q_{1 : C}, L_{1 : C}, Θ) .

On the random variables of the right-hand side of the above equation, we omit the π superscript and use the same convention in the following parts.

The mutation status probability of two cells at a site satisfies

P (G_{c}, G_{c^{'}} | B, R_{1 : C}, Q_{1 : C}, L_{1 : C}, Θ) \propto P (G_{c}, G_{c^{'}}, R_{1 : C} | B, Q_{1 : C}, L_{1 : C}, Θ) .

P(G_c, G_c′, R_1:C|B, Q_1:C, L_1:C, Θ) is computed through series of marginalizations done on the latent variables defined in the graphical model, shown in Eq 4. In order to describe the process clearly, we introduce the marginalizations one at a time in the following subsections.

\begin{matrix} \begin{matrix} P (G_{c}, G_{c^{'}}, R_{1 : C} | B, Q_{1 : C}, L_{1 : C}, Θ) \\ = \sum_{Z} \sum_{G_{1 : C \ {c, c^{'}}}} \int_{p_{m}} \sum_{F_{1 : C}, N_{1 : C}} \sum_{D_{1 : C}^{1}} \sum_{D_{1 : C}^{2}} \sum_{A_{1 : C}} \\ P (G_{c}, G_{c^{'}}, R_{1 : C}, Z, G_{1 : C \ {c, c^{'}}}, p_{m}, F_{1 : C}, N_{1 : C}, D_{1 : C}^{1}, D_{1 : C}^{2}, A_{1 : C} | B, Q_{1 : C}, L_{1 : C}, Θ) d p_{m} \end{matrix} \end{matrix}

(4)

Lineage model

The ISA implies that (i) a site can be mutated at most once, and (ii) all the mutated cells at the loci share the same mutation. We marginalize the mutation types

\begin{matrix} \begin{matrix} P (G_{c}, G_{c^{'}}, R_{1 : C} | B, Q_{1 : C}, L_{1 : C}, Θ) \\ = \sum_{z} P (Z = z | B, α) P (G_{c}, G_{c^{'}}, R_{1 : C} | Z = z, B, Q_{1 : C}, L_{1 : C}, Θ), \end{matrix} \end{matrix}

where Z is the mutation type random variable and follows the Dirichlet-Categorical distribution as described previously;

\begin{matrix} P (Z = z | B, α) = \frac{B (\sum_{k = 1}^{K} α_{k}, 1)}{B (α_{z}, 1)}, \end{matrix}

(5)

where K is the number of possible mutation genotypes, α is the concentration parameter, and B is the Beta function. B stands for the bulk genotype whereas B is the Beta function. Z differs from the non-mutated bulk genotype by a single nucleotide, e.g., we may have B = (A, A) and Z = (A, G) for a singleton site or B = (AA, AT) and Z = (AA, GT) for a paired site. See S1 Appendix for details.

Marginalization of other single-cells

Using the notation G_1:C\{c,c′} for the mutation status random variables of all cells except c and c′, the joint distribution of the reads and the mutation statuses of the single-cells c and c′ can be expressed as

\begin{matrix} \begin{matrix} P ( & G_{c}, G_{c^{'}}, R_{1 : C} | Z, B, Q_{1 : C}, L_{1 : C}, Θ) \\ = \sum_{G_{1 : C \ {c, c^{'}}}} P (G_{1 : C} | a, b) P (R_{1 : C} | G_{1 : C}, Z, B, Q_{1 : C}, L_{1 : C}, p_{a e}, p_{a d o}) \\ = \sum_{m = 0}^{C} \sum_{\begin{matrix} G_{1 : C \ {c, c^{'}}} : \\ \sum_{i} G_{i} = m \end{matrix}} P (G_{1 : C} | a, b) P (R_{1 : C} | G_{1 : C}, Z, B, Q_{1 : C}, L_{1 : C}, p_{a e}, p_{a d o}), \end{matrix} \end{matrix}

where $m = \sum_{i = 1}^{C} G_{i} \in [0, C]$ is the number of mutated cells at the site π. The summation over mutation counts and mutation statuses in the above equation can be computed efficiently using dynamic programming.

As mentioned earlier, we assign Beta prior distribution on the mutation probability, p_m, with the hyperparameters a and b. Given p_m, the mutation statuses of all single-cells are conditionally independent and are i.i.d. Bernoulli random variables. The joint distribution of single-cell mutation statuses is

\begin{matrix} \begin{matrix} P (G_{1 : C} | a, b) & = \int_{p_{m}} P (G_{1 : C} | p_{m}, a, b) P (p_{m} | a, b) d p_{m} \\ = \frac{B (m + a, C - m + b)}{B (a, b)} . \end{matrix} \end{matrix}

The derivation is presented in S1 Appendix.

Genotypes of single-cells

We define the auxiliary random variables, X_1:C, to denote single-cell genotypes. The genotype of a single-cell c is

\begin{matrix} X_{c} = {\begin{matrix} B, & if G_{c} = 0 \\ Z, & if G_{c} = 1 . \end{matrix} \end{matrix}

(10)

The values of X_1:C are deterministic functions of G_1:C, Z, and B. From now on, we will use X_1:C notation instead of {G_1:C, Z, B}, e.g.,

P (R_{1 : C} | G_{1 : C}, Z, B, Q_{1 : C}, L_{1 : C}, p_{a e}, p_{a d o}) = P (R_{1 : C} | X_{1 : C}, Q_{1 : C}, L_{1 : C}, p_{a e}, p_{a d o}) .

Conditional independence of single-cells

Given the genotypes of single cells, the amplification, and the allelic dropout probabilities, the likelihoods of reads are conditionally independent. The read likelihood is factorized by

P (R_{1 : C} | X_{1 : C}, Q_{1 : C}, L_{1 : C}, p_{a e}, p_{a d o}) = \prod_{c = 1}^{C} P (R_{c} | X_{c}, Q_{c}, L_{c}, p_{a e}, p_{a d o}) .

Introduction of fragments

We introduce the fragments created during the DNA amplification. The fragment types, F_c, and their counts, N_c, are marginalized as follows;

\begin{matrix} \begin{matrix} P (R_{c} | X_{c}, Q_{c}, L_{c}, p_{a e}, p_{a d o}) \\ = \sum_{F_{c}, N_{c}} P (R_{c} | F_{c}, N_{c}, Q_{c}) P (F_{c}, N_{c} | X_{c}, L_{c}, p_{a e}, p_{a d o}) . \end{matrix} \end{matrix}

Amplification model

The DNA amplification is modeled with a generalized Pólya urn model. Two ADO events determine the initial state of the urn. These ADO events are modeled by two Bernoulli random variables with the same ADO probability, p_ado;

\begin{matrix} \begin{matrix} P (F_{c}, N_{c} | X_{c}, L_{c}, p_{a e}, p_{a d o}) \\ = \sum_{D_{c}^{1} = 0}^{1} \sum_{D_{c}^{2} = 0}^{1} P (D_{c}^{1}, D_{c}^{2} | p_{a d o}) P (F_{c}, N_{c} | D_{c}^{1}, D_{c}^{2}, X_{c}, L_{c}, p_{a e}), \end{matrix} \end{matrix}

and

\begin{matrix} \begin{matrix} P (D_{c}^{1}, D_{c}^{2} | p_{a d o}) & = p_{a d o}^{D_{c}^{1} + D_{c}^{2}} {(1 - p_{a d o})}^{2 - (D_{c}^{1} + D_{c}^{2})} . \end{matrix} \end{matrix}

In the case of no ADO events, the process starts with one copy of each allele. The process starts with the other allele if there is one ADO event.

One can describe the urn process as follows; the urn is initialized with one or two colored balls. At each step, a ball is drawn from the urn, a copy of the ball is made, and both the original and the copy is put back into the urn. The outcome of this process can be represented by one or two lineage trees where the roots of the trees are the original copies of the alleles. We will refer to these trees as amplification trees. Given the initial state and the final number of balls in the urn, which is observed as the read coverages, the total number of edges in amplification trees is (an additional incoming edge to the root is introduced to account for subsampling)

\begin{matrix} \begin{matrix} E_{D_{c}^{1}, D_{c}^{2}}^{L_{c}} = {\begin{matrix} 2 L_{c} - 2, & if D_{c}^{1} = 0, D_{c}^{2} = 0 \\ 2 L_{c} - 1, & if D_{c}^{1} = 0, D_{c}^{2} = 1 \\ 2 L_{c} - 1, & if D_{c}^{1} = 1, D_{c}^{2} = 0 \\ 0, & if D_{c}^{1} = 1, D_{c}^{2} = 1 . \end{matrix} \end{matrix} \end{matrix}

DNA amplification sometimes replaces a nucleotide; therefore, occasionally, the copy of a ball has a different color than the original. Let A_c be the random variable describing the number of AEs has happened during the DNA amplification,

\begin{matrix} \begin{matrix} P (F_{c}, N_{c} | D_{c}^{1}, D_{c}^{2}, X_{c}, L_{c}, p_{a e}) \\ = \sum_{A_{c}} P (A_{c} | D_{c}^{1}, D_{c}^{2}, L_{c}, p_{a e}) \sum_{F_{c}, N_{c}} P (F_{c}, N_{c} | D_{c}^{1}, D_{c}^{2}, A_{c}, X_{c}, L_{c}) . \end{matrix} \end{matrix}

The probability of the number of AEs is a Binomial distribution over the edges of the amplification trees

P (A_{c} | D_{c}^{1}, D_{c}^{2}, L_{c}, p_{a e}) = (\binom{E_{D_{c}^{1}, D_{c}^{2}}^{L_{c}}}{A_{c}}) p_{a e}^{A_{c}} {(1 - p_{a e})}^{E_{D_{c}^{1}, D_{c}^{2}}^{L_{c}} - A_{c}},

where p_ae is the probability of an AE happening on an edge. In practice, p_ae is very small (e.g., [10⁻⁶, 3 × 10⁻⁴] [19, 30]); hence we neglect the cases where A_c > 1.

Marginalizing amplification trees

Let the fragment types and counts be $F_{c} = (F_{c}^{1}, F_{c}^{2}, F_{c}^{3})$ and $N_{c} = (N_{c}^{1}, N_{c}^{2}, N_{c}^{3})$ , respectively. Let d(F_i||F_j) be the function that computes the Hamming distance between two fragment types, F_i and F_j.

The first two elements of F_c are the cell genotype, $(F_{c}^{1}, F_{c}^{2}) = X_{c}$ , and the third element is the fragment type caused by an AE. In the case of no AE, $F_{c}^{3} = \emptyset$ . In the case of one AE, $F_{c}^{3}$ must differ a single nucleotide from its originating fragment, either $d (F_{c}^{3} | | F_{c}^{1}) = 1$ or $d (F_{c}^{3} | | F_{c}^{2}) = 1$ .

Similar to the F_c tuple, $N_{c}^{1}, N_{c}^{2}$ , and $N_{c}^{3}$ are the numbers of fragments of the first allele, second allele, and fragments carrying a nucleotide introduced by the AE. The total number of fragments is $L_{c} = N_{c}^{1} + N_{c}^{2} + N_{c}^{3}$ . In case of an AE event, $N_{c}^{3} > 0$ ; otherwise, $N_{c}^{3}$ is zero. Finally, the first two elements of fragment counts must satisfy the ADO events, i.e., $N_{c}^{1} = 0$ if $D_{c}^{1} = 1$ and $N_{c}^{2} = 0$ if $D_{c}^{2} = 1$ .

With the above conditions satisfied, given the cell genotype, read coverage, ADO, and AE events, the probability of a fragment type and count pair is a product that contains up to three important factors. The first factor regards dividing L_c fragments into one or two amplification trees. There is a single way to partition if there is an ADO event, e.g., (0, L_c) if the first allele is dropped out or (L_c, 0) if the second allele is dropped. Otherwise, due to the Pólya urn, the number of reads follows a Beta-Binomial distribution, and each partition, {(1, L_c − 1), (2, L_c − 2), …, (L_c − 1, 1)} has a 1/(L_c − 1) probability. See S1 Appendix for details. The second factor regards the AE event; if an AE has happened, how many ways are there to get $N_{c}^{3}$ erroneous fragments? Here we should note that, even though we know how L_c is partitioned into amplification trees, we do not know the internal structure of the trees. We need to consider all possible ways of forming the amplification trees (i.e., marginalization of the amplification tree topologies). We model an amplification tree as described in S1 Appendix. Assuming a count configuration of $N_{c} = (N_{c}^{1}, N_{c}^{2}, N_{c}^{3})$ where $N_{c}^{i} > 0$ for all i ∈ {1, 2, 3} and that the AE is happening strictly on the first amplification tree, there are $C (N_{c}^{1} + N_{c}^{3})$ possible tree topologies of the first amplification tree, each with the probability of $1 / C (N_{c}^{1} + N_{c}^{3})$ . The subscript c is the single-cell id, whereas C(.) or C(., .) is a function that returns the number of possible tree topologies given specific tree details in the parenthesis. Moreover, $C (N_{c}^{1} + N_{c}^{3}, N_{c}^{3})$ out of $E_{D_{c}^{1}, D_{c}^{2}}^{L_{c}}$ edges satisfy the specified count configuration in this marginalized amplification tree space. When all these are combined, the second component becomes $C (N_{c}^{1} + N_{c}^{3}, N_{c}^{3}) / (C (N_{c}^{1} + N_{c}^{3}) \times E_{D_{c}^{1}, D_{c}^{2}}^{L_{c}})$ . An example of tree topologies and edges supporting a count configuration is shown in Fig 10. We direct the reader to Tables A and B in S1 Appendix for all the count and fragment type configurations.

Fig 10 — The labeled nodes indicate the order of the amplification events. The dashed line represents an incoming edge to the root to account for subsampling during the sequencing of the fragments. The $red$ edges are all possible 3-edges in 4-trees, C(4, 3).

The final component is the probability of $F_{c}^{3}$ given AE; in the case of no AE, the probability is

\begin{matrix} \begin{matrix} P (F_{c}^{3} | A_{c} = 0) = {\begin{matrix} 1, & if F_{c}^{3} = \emptyset \\ 0, & otherwise, \end{matrix} \end{matrix} \end{matrix}

and in the case of AE, the probability is 1/3 for the singleton sites (1 × 3 = 3 different possible genotypes differ by 1 nucleotide), and 1/6 for paired sites (there are 2 × 3 = 6 possibilities).

The product of the three components leads to the fragment type and counts probability; $P (F_{c}, N_{c} | D_{c}^{1}, D_{c}^{2}, A_{c}, X_{c}, L_{c})$ , which is detailed in Tables A and B in S1 Appendix.

Read sequencing

Read sequencing is also erroneous and depends on sequencing technology [22–25]. We use the Phred quality scores (ρ) to compute the base-calling error probabilities, Q, [27, 28]; Q = 10^−0.1×ρ.

For a single read with a known originating fragment, the likelihood of the read at a singleton site is

\begin{matrix} P (R_{c}^{l} | F_{c}^{l}, Q_{c}^{l} = q) = {\begin{matrix} 1 - q, & if no error \\ \frac{q}{3}, & if error, \end{matrix} \end{matrix}

(6)

and the likelihood of the read at a paired site is

\begin{matrix} P (R_{c}^{l} | F_{c}^{l}, Q_{c}^{l} = (q^{s}, q^{s^{'}})) = {\begin{matrix} (1 - q^{s}) (1 - q^{s^{'}}), & if no errors \\ (1 - q^{s}) \frac{q^{s^{'}}}{3}, & if error only at locus s \\ \frac{q^{s}}{3} (1 - q^{s^{'}}), & if error only at locus s^{'} \\ \frac{q^{s}}{3} \frac{q^{s^{'}}}{3}, & if errors at both positions . \end{matrix} \end{matrix}

(7)

We use dynamic programming to efficiently compute the likelihood of multiple reads, $P (R_{c} = R_{c}^{1 : L_{c}} | F_{c}, N_{c}, Q_{c} = Q_{c}^{1 : L_{c}})$ , of cell c. In the dynamic programming algorithm, we introduce the reads of a cell one at a time, assign them to different fragments, and track how the likelihood of reads changes with the addition of the new read. For instance, when introducing the l’th read, we know what the likelihood of the previous reads ( $R_{c}^{1 : l - 1}$ ) is if they all originate from the first fragment, all from the second one, or any other partition (e.g., 2 reads from first fragment, no reads from second fragment, l − 3 from the third fragment). After adding the L_c’th read, we extract the likelihood of the corresponding N_c configuration from the dynamic programming. The pseudocode is shown in S1 Appendix.

Runtime analysis

The runtime complexity for the dynamic programming algorithm computing the probability of the observed reads for cell c, P(R_c|X_c, Q_c, L_c, p_ae, p_ado), is $L_{c}^{3}$ —resulting in O(|Π|CL³) where L is the maximum number of reads over cell and site. When we perform pairwise distance calculation for each site, we incur the cost of |Π|C² times the look up of of the dynamic programming table across possible mutations m = 0, …, C, yielding |Π|C³. This results in the total run time of O(|Π|(C³ + CL³)).

Lineage tree reconstruction

The standard NJ algorithm [42] and its variants, such as FastNJ [44], are commonly used for distance-based methods. In the final step of Scuphr, the standard NJ algorithm is applied to the distance matrix to reconstruct the cell lineage tree using the implementation in the Dendropy library [45]. The tree is re-rooted, so the bulk node becomes the root and indicates the non-mutated state.

Site selection

We use several heuristics to identify candidate loci for analysis. Even though Scuphr can run on the sites with no observed alternate nucleotides, these sites would not contribute information about the topology of the lineage tree and waste computational resources. Instead, we select a subset of the genome that could provide information regarding the topology.

Paired site selection

The main goal for the paired site selection is to find pairs of loci with a sufficient amount of alternative nucleotides and a heterozygous site nearby that can be used for read-phasing.

First, we identify the gSNVs using the unamplified bulk reads taken from another tissue. We run FreeBayes [46] software and set the read depth threshold to 10 and alternative nucleotide frequency to 0.2. The sites with heterozygous genotypes are considered the gSNV sites. Second, we check single-cell reads that cover the gSNV site and ensure at least two single-cells display the gSNV; that is, both nucleotides must be present in at least 20% of the reads from both cells. After this verification, we look for the candidate mutation sites around the gSNV. Both gSNV and the candidate sites must be covered with the same read to facilitate read-phasing. All nucleotides covered by a read come from the same allele. The reference nucleotide of the candidate site is determined from bulk reads; the site must have at least 10 reads in bulk data, and at least 80% of the reads are one nucleotide, which is referred to as the reference. For a site to be picked, at least 2 and at most C − 1 cells must agree on an alternative nucleotide (at least 20% of the reads should be different from the reference). The signal from a single-cell or all single-cells does not contribute to the information for lineage tree reconstruction. If multiple gSNVs are near the candidate site, the closest gSNV is used to form the pair. Finally, one last gSNV check is done to ensure the single-cell reads (that cover both the candidate and gSNV sites) meet the gSNV requirement described above. The further the candidate site is from the gSNV site, the fewer reads cover both sites.

Singleton site selection

During the singleton site selection, the gSNV heuristics are omitted. The candidate mutation site identification is performed using the same heuristics as the paired site selection.

Hybrid site selection

In the hybrid case, the algorithm works with both paired and singleton sites. A candidate mutation site is paired with a gSNV if the paired site criteria are met; otherwise, the site is picked as a singleton site.

Parameter estimation and hyperparameter settings

We run the Metropolis-Hastings algorithm for 5, 000 iterations with three different initial values to infer the parameters p_ado and p_ae. We discard the first 20% samples as burn-in. The acceptance ratio of our Metropolis-Hastings algorithm is

\begin{matrix} \begin{matrix} A (Θ^{*}, Θ) & = min (1, \frac{q (p_{a d o}, p_{a e} | p_{a d o}^{*}, p_{a e}^{*})}{q (p_{a d o}^{*}, p_{a e}^{*} | p_{a d o}, p_{a e})} \frac{P (p_{a d o}^{*}, p_{a e}^{*})}{P (p_{a d o}, p_{a e})} L) \\ = min (1, \frac{P (p_{a d o}^{*}) P (p_{a e}^{*})}{P (p_{a d o}) P (p_{a e})} L), \end{matrix} \end{matrix}

where

\begin{matrix} \begin{matrix} L & = \frac{\prod_{π} P (R_{1 : C}^{π} | p_{a d o}^{*}, p_{a e}^{*}, B^{π}, Q_{1 : C}^{π}, L_{1 : C}^{π}, α, a, b)}{\prod_{π} P (R_{1 : C}^{π} | p_{a d o}, p_{a e}, B^{π}, Q_{1 : C}^{π}, L_{1 : C}^{π}, α, a, b)} . \end{matrix} \end{matrix}

The likelihood is calculated similarly to the earlier derivations in Methods;

\begin{matrix} \begin{matrix} P (R_{1 : C}^{π} | p_{a d o}, p_{a e}, B^{π}, Q_{1 : C}^{π}, L_{1 : C}^{π}, α, a, b) \\ = \sum_{z} P (Z^{π} | B^{π}, α) \\ \sum_{G_{1 : C}^{π}} P (G_{1 : C}^{π} | a, b) P (R_{1 : C}^{π} | G_{1 : C}^{π}, Z^{π}, p_{a d o}, p_{a e}, B^{π}, Q_{1 : C}^{π}, L_{1 : C}^{π}) . \end{matrix} \end{matrix}

We use a Gaussian random walk proposal in which each parameter is treated independently, and samples are proposed using a Gaussian distribution with a 0.01 standard deviation. We set uniform prior probabilities for the parameters, calculate the likelihood of the observed reads $R_{1 : C}^{1 : Π}$ based on our model, and accept or reject the samples. The means of the samples are used as p_ado and p_ae parameters during the analysis.

Scuphr has three hyperparameters; α, a, and b. α is the concentration parameter of the Dirichlet-Categorical distribution used for the mutation type probabilities. We set α to all-ones vector. The a and b hyperparameters are for the Beta prior to mutation probability p_m. We assigned uniform prior to the mutation probability by setting a = 1 and b = 1. However, the user can set these parameters and, thereby, modify the algorithm’s tendency towards mutations.

For our experiments, we randomly picked 20 sites that are used to estimate the parameters. We sampled the initial value of p_ado from U[0, 1] and p_ae from U[0, 0.1]. We set the initial range of p_ae to [0, 0.1] because the AE probabilities are reported to be small [19, 30].

Simulation of synthetic datasets

We generated the synthetic dataset as follows. First, we generated random binary cell lineage trees with C leaves. We assigned 2 × (C − 1) × μ mutations, μ ∈ {10, 20}, to the tree’s edges and ensured each edge had at least one mutation. For each dataset, we generated a 1 million base pairs long diploid genome that was used for the bulk and the single cells. We randomly picked mutation loci from odd-numbered bases in the genome. For each of the phasing frequencies ρ ∈ {0.001, 0.01, 0.1, 1}, we picked ρ × 500, 000 base pairs as gSNV sites and placed them randomly in even indexed locations in the genome. So for ρ = 1, every second position in the genome is a gSNV site, and each read containing a mutation has an accompanying gSNV site. This construction is sufficient since the distance between sites does not affect the site selection or the subsequent analysis. The gSNV sites are shared by the single-cell genomes, and the mutation sites are shared according to their placement in the cell lineage tree. Further, we masked the single-cell genomes to account for cell-specific ADO events. For each pair of sites (which consists of consecutive positions, one even and one odd), we dropped the maternal and paternal alleles independently with p_ado ∈ {0, 0.1, 0.2}.

The fragments were generated using the Pólya urn process for each site. The masked single-cell genome determined the initial state of the urn. If both alleles are dropped out, no fragments were generated. In the case of a single ADO, all the fragments are generated from the non-dropped allele. If there is no dropout, the fragments are simulated from both alleles, i.e., the urn was initialized with two balls of different colors. The number of fragments per pair of sites was sampled from a Poisson distribution with rate parameter λ ∈ {10, 20}, i.e., an interval that contains the read depth found in our biological data. Whenever a fragment is copied, an AE occurs with p_ae independently. So, even though we base our inference on assuming that there is at most one AE per site, we allowed multiple errors during the data simulation. We use p_ae = 10⁻⁵ or p_ae = 10⁻³ for all cells of a dataset; throughout the paper, we refer to these datasets as a dataset with low and high AEs, respectively.

We simulated how the fragments are sequenced so that reads are obtained, as follows. Phred quality scores were sampled from a discrete Uniform distribution in the range [30, 42]. A sequencing error was introduced based on the corresponding base-calling error probability.

We used a straightforward approach for the bulk reads and replicated the bulk genome 15 times. One can view this step as the bulk data is generated using 15 single-cell genomes without any mutations. This replication does not contain a DNA amplification step since the bulk data consists of reads sequenced from unamplified fragments.

Accuracy metrics

This section describes the two accuracy metrics used for the analysis.

The similarity score

The Robinson-Foulds (RF) distance [47] is a symmetric difference metric commonly used for phylogenetic tree comparisons [36, 48]. The metric calculates the total number of bipartitions in either of the trees but not in both. We normalized the RF scores (nRF) by the total number of non-trivial bipartitions in both tree topologies (a leaf’s edge is considered trivial. If two tree topologies have the same leaf set, there will be edges that define the same bipartition. There will be C identical bipartitions, regardless of the internal structure of the trees) nRF = RF/I_B, where I_B is the total number of internal edges in the two tree topologies. Notice that the number of non-trivial edges of a tree depends on its topology, e.g., whether the tree is binary or not. We used a similarity score,

\begin{matrix} \begin{matrix} similarity score & = 1 - nRF, \end{matrix} \end{matrix}

in [0, 1]. If the trees have the same topology, the similarity score is 1. If the trees do not have any common non-trivial bipartition, the similarity score is 0.

Transfer bootstrap expectation

The TBE is introduced as an alternative metric to compute the support of bootstrap trees on a reference tree topology [43]. Compared to Felsenstein’s bootstrap proportions [49], which checks how frequently an edge appears in the bootstrap trees, the TBE metric penalizes the slight topological differences less. TBE metric computes the number of operations (e.g., taxa removal) required to match an edge in the bootstrap tree to the reference tree.

TBE score of an edge is in [0, 1], where 1 means the edge exists in all bootstrap trees and 0 means the edge appears at random. The higher the TBE score of an edge, the better.

Discussion

We evaluated the performance of Scuphr on a biological dataset with SCIΦ and Phylovar, and on several simulated datasets with SCIΦ. Our investigation focused on the algorithm’s robustness for varying numbers of single-cells, read coverages, and technological artifacts such as AEs, ADOs, and sequencing errors. We observed that for the low amplification error datasets, Scuphr performs on par or better than SCIΦ in most cases. For the high amplification error datasets, Scuphr consistently outperforms SCIΦ. When provided by the candidate mutation sites selected by our method, SCIΦ’s performance becomes similar to Scuphr in most low amplification error datasets; however, Scuphr continues to outperform SCIΦ in the high amplification error datasets.

In addition, we showed that the algorithm scales with the number of single cells and sites. Moreover, it scales inversely with the number of cores. For instance, using a single core, the main part of Scuphr takes approximately 1.6 hours for 20 cells and 1024 singleton sites. The time required for 100, 000 sites would be approximately 166.7 hours; however, with a modest infrastructure of five compute nodes with 32 CPU cores each, Scuphr’s runtime can be reduced to approximately 1 hour. This advantage makes it possible to analyze millions of sites in the genome, whereas most state-of-the-art methods can handle only a few thousand sites.

Finally, we evaluated the performance of Scuphr using a biological dataset of 18 single cells acquired from [30]. We selected approximately 3.4 million candidate sites for analysis and used bootstrapping to obtain edge supports on the reference tree topology. Although the biological dataset was challenging, Scuphr assigned high support values to the edges that separate the two main clones and some closely related cells. Scuphr outperformed both SCIΦ and Phylovar by obtaining higher edge supports for a subclone, but was unable to support a sibling relation that was successfully identified by Phylovar.

The graphical model of Scuphr that is tailored for the healthy, diploid scDNA-seq data, combined with the NJ algorithm yields higher or as good accuracy as the state-of-the-art methods. Scuphr’s computational complexity is affected quadratically by the number of single cells. However, this issue can be redeemed with the embarrassingly parallel nature of the method. In addition, the dynamic programming technique is leveraged to calculate the read likelihoods efficiently. In the experiments, we showed how the runtimes of singleton and paired sites differ. The computations of the paired sites are considerably slower due to the more common mutation types and amplification error types to marginalize over. Since there are far fewer paired sites selected from the biological dataset for analysis, this wouldn’t cause an issue. In addition, as a design choice, we used bulk data taken from another tissue to represent the unamplified, non-mutated state. Even though there are publicly available datasets (such as the fibroblast data used in this study) that contain both single-cell and bulk data, this might not always be the case. In the absence of the bulk data, publicly available known single nucleotide polymorphism datasets could be used for read-phasing. Alternatively, the analysis could be limited to singleton sites and the site selection could be done by comparing the reads of single-cells to a reference genome.

As our goal is to reconstruct the cell lineage tree topology, we have not placed emphasis on estimating the the branch lengths. With our experimental studies, we have shown that Scuphr has high accuracy in terms of tree topology reconstruction; however, we caution that the branch lengths estimates obtained in the tree reconstruction step needs further validation.

Conclusion

Single-cell DNA sequencing technologies enable detailed analyses of development and cell differentiation [1–3]. We presented Scuphr, a probabilistic framework that reconstructs cell lineage trees from healthy, diploid single-cells using whole-genome amplified DNA sequencing data. Scuphr is designed with the challenges of the scDNA-seq data in mind, it fits well with the biological findings, and in particular, it obtains better accuracy with leveraging read-phasing.

In addition to the distance-based and MCMC-based methods, various variational inference based methods have recently been developed for tree reconstruction tasks [50–53]. These methods typically operate in the standard phylogeny setting and require a good set of initial trees for their analysis. In the potential future development of such methods where the domain moves toward the single-cell setting, Scuphr could provide a good set of bootstrap trees as input quickly.

Scuphr is designed for healthy, diploid scDNA-seq data. However, it can be enhanced to handle cancer data by incorporating copy number variations into its model. We will investigate how the extended model handles the challenges of single-cell tumor data and compare its performance with state-of-the-art methods in our future work.

Supporting information

S1 Appendix. Supplementary information.

The file includes additional formulations, biological dataset, and benchmarking details.

(PDF)

pcbi.1012094.s001.pdf^{(1.9MB, pdf)}

Acknowledgments

The computations and data handling were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC).

Data Availability

The fibroblast dataset used in the biological data experiments is published and available at the European Genome-phenome Archive (EGA) with the accession number EGAS00001003108. The source code of Scuphr is available at https://github.com/Lagergren-Lab/scuphr.

Funding Statement

Funding is provided by the Swedish Foundation for Strategic Research Grant BD15-0043 and Swedish Research Council through the grants 2022-03516 and 2018-06217 to JL. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Lodato MA, Woodworth MB, Lee S, Evrony GD, Mehta BK, Karger A, et al. Somatic mutation in single human neurons tracks developmental and transcriptional history. Science. 2015;350(6256):94–98. doi: 10.1126/science.aab1785 [DOI] [PMC free article] [PubMed] [Google Scholar]
2. Marioni JC, Arendt D. How single-cell genomics is changing evolutionary and developmental biology. Annu Rev Cell Dev Biol. 2017;33:537–553. doi: 10.1146/annurev-cellbio-100616-060818 [DOI] [PubMed] [Google Scholar]
3. Lodato MA, Rodin RE, Bohrson CL, Coulter ME, Barton AR, Kwon M, et al. Aging and neurodegeneration are associated with increased mutations in single human neurons. Science. 2018;359(6375):555–559. doi: 10.1126/science.aao4426 [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Lee-Six H, Øbro NF, Shepherd MS, Grossmann S, Dawson K, Belmonte M, et al. Population dynamics of normal human blood inferred from somatic mutations. Nature. 2018;561(7724):473–478. doi: 10.1038/s41586-018-0497-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Bae T, Tomasini L, Mariani J, Zhou B, Roychowdhury T, Franjic D, et al. Different mutational rates and mechanisms in human cells at pregastrulation and neurogenesis. Science. 2018;359(6375):550–555. doi: 10.1126/science.aan8690 [DOI] [PMC free article] [PubMed] [Google Scholar]
6. Coorens THH, Moore L, Robinson PS, Sanghvi R, Christopher J, Hewinson J, et al. Extensive phylogenies of human development inferred from somatic mutations. Nature. 2021;597(7876):387–392. doi: 10.1038/s41586-021-03790-y [DOI] [PubMed] [Google Scholar]
7. Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, et al. Tumour evolution inferred by single-cell sequencing. Nature. 2011;472(7341):90–94. doi: 10.1038/nature09807 [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, et al. PyClone: statistical inference of clonal population structure in cancer. Nat Methods. 2014;11(4):396–398. doi: 10.1038/nmeth.2883 [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 2015;16:35. doi: 10.1186/s13059-015-0602-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Safinianaini N, de Souza CPE, Lagergren J. CopyMix: mixture model based single-cell Clustering and Copy Number Profiling using Variational Inference. bioRxiv. 2021;. [Google Scholar]
11. Jun SH, Toosi H, Mold J, Engblom C, Chen X, O’Flanagan C, et al. Reconstructing clonal tree for phylo-phenotypic characterization of cancer using single-cell transcriptomics. Nat Commun. 2023;14(1):982. doi: 10.1038/s41467-023-36202-y [DOI] [PMC free article] [PubMed] [Google Scholar]
12. Lynch M. Evolution of the mutation rate. Trends Genet. 2010;26(8):345–352. doi: 10.1016/j.tig.2010.05.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Belkadi A, Bolze A, Itan Y, Cobat A, Vincent QB, Antipenko A, et al. Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proc Natl Acad Sci U S A. 2015;112(17):5473–5478. doi: 10.1073/pnas.1418631112 [DOI] [PMC free article] [PubMed] [Google Scholar]
14. Navin NE. Cancer genomics: one cell at a time. Genome Biol. 2014;15(8):452. doi: 10.1186/s13059-014-0452-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
15. Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet. 2016;17(3):175. doi: 10.1038/nrg.2015.16 [DOI] [PubMed] [Google Scholar]
16. Dong X, Zhang L, Milholland B, Lee M, Maslov AY, Wang T, et al. Accurate identification of single-nucleotide variants in whole-genome-amplified single cells. Nat Methods. 2017;14(5):491–493. doi: 10.1038/nmeth.4227 [DOI] [PMC free article] [PubMed] [Google Scholar]
17. Zafar H, Navin N, Nakhleh L, Chen K. Computational approaches for inferring tumor evolution from single-cell genomic data. Curr Opin Syst Biol. 2018;7:16–25. doi: 10.1016/j.coisb.2017.11.008 [DOI] [Google Scholar]
18. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020;21(1):31. doi: 10.1186/s13059-020-1926-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
19. Bourcy CFAd, de Bourcy CFA, De Vlaminck I, Kanbar JN, Wang J, Gawad C, et al. A Quantitative Comparison of Single-Cell Whole Genome Amplification Methods. PLoS ONE. 2014;9(8):e105585. doi: 10.1371/journal.pone.0105585 [DOI] [PMC free article] [PubMed] [Google Scholar]
20. Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, et al. Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A. 2002;99(8):5261–5266. doi: 10.1073/pnas.082089499 [DOI] [PMC free article] [PubMed] [Google Scholar]
21. Zong C, Lu S, Chapman AR, Xie XS. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science. 2012;338(6114):1622–1626. doi: 10.1126/science.1229164 [DOI] [PMC free article] [PubMed] [Google Scholar]
22. Pfeiffer F, Gröber C, Blank M, Händler K, Beyer M, Schultze JL, et al. Systematic evaluation of error rates and causes in short samples in next-generation sequencing. Sci Rep. 2018;8(1):10950. doi: 10.1038/s41598-018-29325-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
23. Petrackova A, Vasinek M, Sedlarikova L, Dyskova T, Schneiderova P, Novosad T, et al. Standardization of Sequencing Coverage Depth in NGS: Recommendation for Detection of Clonal and Subclonal Mutations in Cancer Diagnostics. Front Oncol. 2019;9. doi: 10.3389/fonc.2019.00851 [DOI] [PMC free article] [PubMed] [Google Scholar]
24. Ma X, Shao Y, Tian L, Flasch DA, Mulder HL, Edmonson MN, et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biol. 2019;20(1):50. doi: 10.1186/s13059-019-1659-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
25. Stoler N, Nekrutenko A. Sequencing error profiles of Illumina sequencing instruments. NAR Genom Bioinform. 2021;3(1):lqab019. doi: 10.1093/nargab/lqab019 [DOI] [PMC free article] [PubMed] [Google Scholar]
26. Zafar H, Wang Y, Nakhleh L, Navin N, Chen K. Monovar: single-nucleotide variant detection in single cells. Nat Methods. 2016;13(6):505–507. doi: 10.1038/nmeth.3835 [DOI] [PMC free article] [PubMed] [Google Scholar]
27. Ewing B, Green P. Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Res. 1998;8(3):186–194. doi: 10.1101/gr.8.3.175 [DOI] [PubMed] [Google Scholar]
28. Ewing B, Hillier L, Wendl MC, Green P. Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment. Genome Res. 1998;8(3):175–185. doi: 10.1101/gr.8.3.175 [DOI] [PubMed] [Google Scholar]
29. Bohrson CL, Barton AR, Lodato MA, Rodin RE, Luquette LJ, Viswanadham VV, et al. Linked-read analysis identifies mutations in single-cell DNA-sequencing data. Nat Genet. 2019;51(4):749–754. doi: 10.1038/s41588-019-0366-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
30. Hård J, Al Hakim E, Kindblom M, Björklund ÅK, Sennblad B, Demirci I, et al. Conbase: a software for unsupervised discovery of clonal somatic mutations in single cells through read phasing. Genome Biol. 2019;20(1):1–18. doi: 10.1186/s13059-019-1673-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
31. Yuan K, Sakoparnig T, Markowetz F, Beerenwinkel N. BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies. Genome Biol. 2015;16:36. doi: 10.1186/s13059-015-0592-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
32. Ross EM, Markowetz F. OncoNEM: inferring tumor evolution from single-cell sequencing data. Genome Biol. 2016;17:69. doi: 10.1186/s13059-016-0929-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
33. Jahn K, Kuipers J, Beerenwinkel N. Tree inference for single-cell data. Genome Biol. 2016;17:86. doi: 10.1186/s13059-016-0936-x [DOI] [PMC free article] [PubMed] [Google Scholar]
34. Zafar H, Tzen A, Navin N, Chen K, Nakhleh L. SiFit: inferring tumor trees from single-cell sequencing data under finite-sites models. Genome Biol. 2017;18(1):178. doi: 10.1186/s13059-017-1311-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
35. Zafar H, Navin N, Chen K, Nakhleh L. SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data. Genome Res. 2019;29(11):1847–1859. doi: 10.1101/gr.243121.118 [DOI] [PMC free article] [PubMed] [Google Scholar]
36. Kozlov A, Alves JM, Stamatakis A, Posada D. CellPhy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data. Genome Biol. 2022;23(1):37. doi: 10.1186/s13059-021-02583-w [DOI] [PMC free article] [PubMed] [Google Scholar]
37. Singer J, Kuipers J, Jahn K, Beerenwinkel N. Single-cell mutation identification via phylogenetic inference. Nat Commun. 2018;9(1):1–8. doi: 10.1038/s41467-018-07627-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
38. Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61(4):893–903. doi: 10.1093/genetics/61.4.893 [DOI] [PMC free article] [PubMed] [Google Scholar]
39. Watterson GA. On the number of segregating sites in genetical models without recombination. Theor Popul Biol. 1975;7(2):256–276. doi: 10.1016/0040-5809(75)90020-9 [DOI] [PubMed] [Google Scholar]
40. Tajima F. Infinite-allele model and infinite-site model in population genetics. J Genet. 1996;75(1):27–31. doi: 10.1007/BF02931749 [DOI] [Google Scholar]
41. Edrisi M, Valecha MV, Chowdary SBV, Robledo S, Ogilvie HA, Posada D, et al. Phylovar: toward scalable phylogeny-aware inference of single-nucleotide variations from single-cell DNA sequencing data. Bioinformatics. 2022;38(Suppl 1):i195–i202. doi: 10.1093/bioinformatics/btac254 [DOI] [PMC free article] [PubMed] [Google Scholar]
42. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–425. [DOI] [PubMed] [Google Scholar]
43. Lemoine F, Entfellner JBD, Wilkinson E, Correia D, Felipe MD, De Oliveira T, et al. Renewing Felsenstein’s phylogenetic bootstrap in the era of big data. Nature. 2018;556(7702):452–456. doi: 10.1038/s41586-018-0043-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
44. Elias I, Lagergren J. Fast neighbor joining. Theor Comput Sci. 2009;410(21):1993–2000. doi: 10.1016/j.tcs.2008.12.040 [DOI] [Google Scholar]
45. Sukumaran J, Holder MT. DendroPy: a Python library for phylogenetic computing. Bioinformatics. 2010;26(12):1569–1571. doi: 10.1093/bioinformatics/btq228 [DOI] [PubMed] [Google Scholar]
46.Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv. 2012;.
47. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53(1-2):131–147. doi: 10.1016/0025-5564(81)90043-2 [DOI] [Google Scholar]
48. Wang L, Bouchard-Côté A, Doucet A. Bayesian phylogenetic inference using a combinatorial sequential Monte Carlo method. J Am Stat Assoc. 2015;110(512):1362–1374. doi: 10.1080/01621459.2015.1054487 [DOI] [Google Scholar]
49. Felsenstein J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985;39(4):783–791. doi: 10.1111/j.1558-5646.1985.tb00420.x [DOI] [PubMed] [Google Scholar]
50.Zhang C, Matsen IV FA. Variational Bayesian phylogenetic inference. In: Int. Conf. Learn. Represent.; 2018.
51. Zhang C. Improved variational bayesian phylogenetic inference with normalizing flows. “Adv Neural Inf Process Syst”. 2020;33:18760–18771. [Google Scholar]
52.Zhang C, Matsen IV FA. A Variational Approach to Bayesian Phylogenetic Inference. arXiv. 2022;.
53.Koptagel H, Kviman O, Melin H, Safinianaini N, Lagergren J. VaiPhy: A variational inference based algorithm for phylogeny. In: Advances in Neural Information Processing Systems 35; 2022.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Appendix. Supplementary information.

The file includes additional formulations, biological dataset, and benchmarking details.

(PDF)

pcbi.1012094.s001.pdf^{(1.9MB, pdf)}

Data Availability Statement

[pcbi.1012094.ref001] 1. Lodato MA, Woodworth MB, Lee S, Evrony GD, Mehta BK, Karger A, et al. Somatic mutation in single human neurons tracks developmental and transcriptional history. Science. 2015;350(6256):94–98. doi: 10.1126/science.aab1785 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref002] 2. Marioni JC, Arendt D. How single-cell genomics is changing evolutionary and developmental biology. Annu Rev Cell Dev Biol. 2017;33:537–553. doi: 10.1146/annurev-cellbio-100616-060818 [DOI] [PubMed] [Google Scholar]

[pcbi.1012094.ref003] 3. Lodato MA, Rodin RE, Bohrson CL, Coulter ME, Barton AR, Kwon M, et al. Aging and neurodegeneration are associated with increased mutations in single human neurons. Science. 2018;359(6375):555–559. doi: 10.1126/science.aao4426 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref004] 4. Lee-Six H, Øbro NF, Shepherd MS, Grossmann S, Dawson K, Belmonte M, et al. Population dynamics of normal human blood inferred from somatic mutations. Nature. 2018;561(7724):473–478. doi: 10.1038/s41586-018-0497-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref005] 5. Bae T, Tomasini L, Mariani J, Zhou B, Roychowdhury T, Franjic D, et al. Different mutational rates and mechanisms in human cells at pregastrulation and neurogenesis. Science. 2018;359(6375):550–555. doi: 10.1126/science.aan8690 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref006] 6. Coorens THH, Moore L, Robinson PS, Sanghvi R, Christopher J, Hewinson J, et al. Extensive phylogenies of human development inferred from somatic mutations. Nature. 2021;597(7876):387–392. doi: 10.1038/s41586-021-03790-y [DOI] [PubMed] [Google Scholar]

[pcbi.1012094.ref007] 7. Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, et al. Tumour evolution inferred by single-cell sequencing. Nature. 2011;472(7341):90–94. doi: 10.1038/nature09807 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref008] 8. Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, et al. PyClone: statistical inference of clonal population structure in cancer. Nat Methods. 2014;11(4):396–398. doi: 10.1038/nmeth.2883 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref009] 9. Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 2015;16:35. doi: 10.1186/s13059-015-0602-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref010] 10. Safinianaini N, de Souza CPE, Lagergren J. CopyMix: mixture model based single-cell Clustering and Copy Number Profiling using Variational Inference. bioRxiv. 2021;. [Google Scholar]

[pcbi.1012094.ref011] 11. Jun SH, Toosi H, Mold J, Engblom C, Chen X, O’Flanagan C, et al. Reconstructing clonal tree for phylo-phenotypic characterization of cancer using single-cell transcriptomics. Nat Commun. 2023;14(1):982. doi: 10.1038/s41467-023-36202-y [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref012] 12. Lynch M. Evolution of the mutation rate. Trends Genet. 2010;26(8):345–352. doi: 10.1016/j.tig.2010.05.003 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref013] 13. Belkadi A, Bolze A, Itan Y, Cobat A, Vincent QB, Antipenko A, et al. Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proc Natl Acad Sci U S A. 2015;112(17):5473–5478. doi: 10.1073/pnas.1418631112 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref014] 14. Navin NE. Cancer genomics: one cell at a time. Genome Biol. 2014;15(8):452. doi: 10.1186/s13059-014-0452-9 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref015] 15. Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet. 2016;17(3):175. doi: 10.1038/nrg.2015.16 [DOI] [PubMed] [Google Scholar]

[pcbi.1012094.ref016] 16. Dong X, Zhang L, Milholland B, Lee M, Maslov AY, Wang T, et al. Accurate identification of single-nucleotide variants in whole-genome-amplified single cells. Nat Methods. 2017;14(5):491–493. doi: 10.1038/nmeth.4227 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref017] 17. Zafar H, Navin N, Nakhleh L, Chen K. Computational approaches for inferring tumor evolution from single-cell genomic data. Curr Opin Syst Biol. 2018;7:16–25. doi: 10.1016/j.coisb.2017.11.008 [DOI] [Google Scholar]

[pcbi.1012094.ref018] 18. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020;21(1):31. doi: 10.1186/s13059-020-1926-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref019] 19. Bourcy CFAd, de Bourcy CFA, De Vlaminck I, Kanbar JN, Wang J, Gawad C, et al. A Quantitative Comparison of Single-Cell Whole Genome Amplification Methods. PLoS ONE. 2014;9(8):e105585. doi: 10.1371/journal.pone.0105585 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref020] 20. Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, et al. Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A. 2002;99(8):5261–5266. doi: 10.1073/pnas.082089499 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref021] 21. Zong C, Lu S, Chapman AR, Xie XS. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science. 2012;338(6114):1622–1626. doi: 10.1126/science.1229164 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref022] 22. Pfeiffer F, Gröber C, Blank M, Händler K, Beyer M, Schultze JL, et al. Systematic evaluation of error rates and causes in short samples in next-generation sequencing. Sci Rep. 2018;8(1):10950. doi: 10.1038/s41598-018-29325-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref023] 23. Petrackova A, Vasinek M, Sedlarikova L, Dyskova T, Schneiderova P, Novosad T, et al. Standardization of Sequencing Coverage Depth in NGS: Recommendation for Detection of Clonal and Subclonal Mutations in Cancer Diagnostics. Front Oncol. 2019;9. doi: 10.3389/fonc.2019.00851 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref024] 24. Ma X, Shao Y, Tian L, Flasch DA, Mulder HL, Edmonson MN, et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biol. 2019;20(1):50. doi: 10.1186/s13059-019-1659-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref025] 25. Stoler N, Nekrutenko A. Sequencing error profiles of Illumina sequencing instruments. NAR Genom Bioinform. 2021;3(1):lqab019. doi: 10.1093/nargab/lqab019 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref026] 26. Zafar H, Wang Y, Nakhleh L, Navin N, Chen K. Monovar: single-nucleotide variant detection in single cells. Nat Methods. 2016;13(6):505–507. doi: 10.1038/nmeth.3835 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref027] 27. Ewing B, Green P. Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Res. 1998;8(3):186–194. doi: 10.1101/gr.8.3.175 [DOI] [PubMed] [Google Scholar]

[pcbi.1012094.ref028] 28. Ewing B, Hillier L, Wendl MC, Green P. Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment. Genome Res. 1998;8(3):175–185. doi: 10.1101/gr.8.3.175 [DOI] [PubMed] [Google Scholar]

[pcbi.1012094.ref029] 29. Bohrson CL, Barton AR, Lodato MA, Rodin RE, Luquette LJ, Viswanadham VV, et al. Linked-read analysis identifies mutations in single-cell DNA-sequencing data. Nat Genet. 2019;51(4):749–754. doi: 10.1038/s41588-019-0366-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref030] 30. Hård J, Al Hakim E, Kindblom M, Björklund ÅK, Sennblad B, Demirci I, et al. Conbase: a software for unsupervised discovery of clonal somatic mutations in single cells through read phasing. Genome Biol. 2019;20(1):1–18. doi: 10.1186/s13059-019-1673-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref031] 31. Yuan K, Sakoparnig T, Markowetz F, Beerenwinkel N. BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies. Genome Biol. 2015;16:36. doi: 10.1186/s13059-015-0592-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref032] 32. Ross EM, Markowetz F. OncoNEM: inferring tumor evolution from single-cell sequencing data. Genome Biol. 2016;17:69. doi: 10.1186/s13059-016-0929-9 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref033] 33. Jahn K, Kuipers J, Beerenwinkel N. Tree inference for single-cell data. Genome Biol. 2016;17:86. doi: 10.1186/s13059-016-0936-x [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref034] 34. Zafar H, Tzen A, Navin N, Chen K, Nakhleh L. SiFit: inferring tumor trees from single-cell sequencing data under finite-sites models. Genome Biol. 2017;18(1):178. doi: 10.1186/s13059-017-1311-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref035] 35. Zafar H, Navin N, Chen K, Nakhleh L. SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data. Genome Res. 2019;29(11):1847–1859. doi: 10.1101/gr.243121.118 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref036] 36. Kozlov A, Alves JM, Stamatakis A, Posada D. CellPhy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data. Genome Biol. 2022;23(1):37. doi: 10.1186/s13059-021-02583-w [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref037] 37. Singer J, Kuipers J, Jahn K, Beerenwinkel N. Single-cell mutation identification via phylogenetic inference. Nat Commun. 2018;9(1):1–8. doi: 10.1038/s41467-018-07627-7 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref038] 38. Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61(4):893–903. doi: 10.1093/genetics/61.4.893 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref039] 39. Watterson GA. On the number of segregating sites in genetical models without recombination. Theor Popul Biol. 1975;7(2):256–276. doi: 10.1016/0040-5809(75)90020-9 [DOI] [PubMed] [Google Scholar]

[pcbi.1012094.ref040] 40. Tajima F. Infinite-allele model and infinite-site model in population genetics. J Genet. 1996;75(1):27–31. doi: 10.1007/BF02931749 [DOI] [Google Scholar]

[pcbi.1012094.ref041] 41. Edrisi M, Valecha MV, Chowdary SBV, Robledo S, Ogilvie HA, Posada D, et al. Phylovar: toward scalable phylogeny-aware inference of single-nucleotide variations from single-cell DNA sequencing data. Bioinformatics. 2022;38(Suppl 1):i195–i202. doi: 10.1093/bioinformatics/btac254 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref042] 42. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–425. [DOI] [PubMed] [Google Scholar]

[pcbi.1012094.ref043] 43. Lemoine F, Entfellner JBD, Wilkinson E, Correia D, Felipe MD, De Oliveira T, et al. Renewing Felsenstein’s phylogenetic bootstrap in the era of big data. Nature. 2018;556(7702):452–456. doi: 10.1038/s41586-018-0043-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012094.ref044] 44. Elias I, Lagergren J. Fast neighbor joining. Theor Comput Sci. 2009;410(21):1993–2000. doi: 10.1016/j.tcs.2008.12.040 [DOI] [Google Scholar]

[pcbi.1012094.ref045] 45. Sukumaran J, Holder MT. DendroPy: a Python library for phylogenetic computing. Bioinformatics. 2010;26(12):1569–1571. doi: 10.1093/bioinformatics/btq228 [DOI] [PubMed] [Google Scholar]

[pcbi.1012094.ref046] 46.Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv. 2012;.

[pcbi.1012094.ref047] 47. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53(1-2):131–147. doi: 10.1016/0025-5564(81)90043-2 [DOI] [Google Scholar]

[pcbi.1012094.ref048] 48. Wang L, Bouchard-Côté A, Doucet A. Bayesian phylogenetic inference using a combinatorial sequential Monte Carlo method. J Am Stat Assoc. 2015;110(512):1362–1374. doi: 10.1080/01621459.2015.1054487 [DOI] [Google Scholar]

[pcbi.1012094.ref049] 49. Felsenstein J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985;39(4):783–791. doi: 10.1111/j.1558-5646.1985.tb00420.x [DOI] [PubMed] [Google Scholar]

[pcbi.1012094.ref050] 50.Zhang C, Matsen IV FA. Variational Bayesian phylogenetic inference. In: Int. Conf. Learn. Represent.; 2018.

[pcbi.1012094.ref051] 51. Zhang C. Improved variational bayesian phylogenetic inference with normalizing flows. “Adv Neural Inf Process Syst”. 2020;33:18760–18771. [Google Scholar]

[pcbi.1012094.ref052] 52.Zhang C, Matsen IV FA. A Variational Approach to Bayesian Phylogenetic Inference. arXiv. 2022;.

[pcbi.1012094.ref053] 53.Koptagel H, Kviman O, Melin H, Safinianaini N, Lagergren J. VaiPhy: A variational inference based algorithm for phylogeny. In: Advances in Neural Information Processing Systems 35; 2022.

PERMALINK

Scuphr: A probabilistic framework for cell lineage tree reconstruction

Hazal Koptagel

Seong-Hwan Jun

Joanna Hård

Jens Lagergren

Roles

Abstract

Author summary

Introduction

Method overview

Fig 1. The workflow of Scuphr and the illustration of read-phasing.

Results

Synthetic dataset experiments

Accuracy evaluation on synthetic datasets

Fig 2. Similarities between the true and inferred trees for low amplification errors.

Fig 3. Similarities between the true and inferred trees for high amplification errors.

Runtime analysis on synthetic datasets

Fig 4. Runtime comparison for singleton sites.

Fig 5. Runtime comparison for paired sites.

Biological data analysis

Fig 6. The TBE supports of bootstrap trees are projected onto true lineage tree topology.

Fig 7. The TBE supports of bootstrap trees are projected onto true lineage tree topology.

Materials and methods

The probabilistic model

The probabilistic graphical model

Fig 8. The graphical model of Scuphr.

Fig 9. Illustration of the random variables for a singleton site on the left column and a paired site on the right column.

Distance matrix

Distance between two single-cells at π

Lineage model

Marginalization of other single-cells

Genotypes of single-cells

Conditional independence of single-cells

Introduction of fragments

Amplification model

Marginalizing amplification trees

Fig 10. Illustration of C(4) = 6 possible 4-trees.

Read sequencing

Runtime analysis

Lineage tree reconstruction

Site selection

Paired site selection

Singleton site selection

Hybrid site selection

Parameter estimation and hyperparameter settings

Simulation of synthetic datasets

Accuracy metrics

The similarity score

Transfer bootstrap expectation

Discussion

Conclusion

Supporting information

Acknowledgments

Data Availability

Funding Statement

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases