Quartets enable statistically consistent estimation of cell lineage trees under an unbiased error and missingness model

Yunheng Han; Erin K Molloy

doi:10.1186/s13015-023-00248-w

2023 Dec 1;18:19. doi: 10.1186/s13015-023-00248-w


PMCID: PMC10691101 PMID: 38041123

Abstract

Cancer progression and treatment can be informed by reconstructing its evolutionary history from tumor cells. Although many methods exist to estimate evolutionary trees (called phylogenies) from molecular sequences, traditional approaches assume the input data are error-free and the output tree is fully resolved. These assumptions are challenged in tumor phylogenetics because single-cell sequencing produces sparse, error-ridden data and because tumors evolve clonally. Here, we study the theoretical utility of methods based on quartets (four-leaf, unrooted phylogenetic trees) in light of these barriers. We consider a popular tumor phylogenetics model, in which mutations arise on a (highly unresolved) tree and then (unbiased) errors and missing values are introduced. Quartets are then implied by mutations present in two cells and absent from two cells. Our main result is that the most probable quartet identifies the unrooted model tree on four cells. This motivates seeking a tree such that the number of quartets shared between it and the input mutations is maximized. We prove an optimal solution to this problem is a consistent estimator of the unrooted cell lineage tree; this guarantee includes the case where the model tree is highly unresolved, with error defined as the number of false negative branches. Lastly, we outline how quartet-based methods might be employed when there are copy number aberrations and other challenges specific to tumor phylogenetics.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13015-023-00248-w.

Keywords: Tumor phylogenetics, Cell lineage trees, Quartets, Supertrees, ASTRAL

Introduction

Cancer progression and treatment can be informed by reconstructing the evolutionary history of tumor cells [1]. Although many methods exist to estimate evolutionary trees (called phylogenies) from molecular sequences, traditional approaches assume the input data are error-free and the output tree is fully resolved. These assumptions are challenged in tumor phylogenetics because single-cell sequencing produces sparse, error-ridden data and because tumors evolve clonally so the underlying tree is highly unresolved [2, 3]. Here, we study the theoretical utility of methods based on quartets (four-leaf, unrooted phylogenetic trees) and triplets (three-leaf rooted phylogenetic trees) in light of these barriers.

A quartet is an unrooted, phylogenetic tree with four leaves. Quartets have long been used as the building blocks for reconstructing the evolutionary history of species [4]. The reason quartet-based methods have garnered such success in species phylogenetics is their good statistical properties under the Multi-Species Coalescent ( $MSC$ ) model [5, 6]. An $MSC$ model species tree generates gene trees (note that a gene tree reflects the genealogical history of a gene, which is passed down from ancestor to descendant, whereas the species tree governs the pool of potential ancestors). Arguably, one of the most important theoretical results from the last decade of systematics is that the most probable unrooted gene tree under the $MSC$ is topologically equivalent to the unrooted model species tree when considering four species [7]. For trees with more than four leaves, the most probable unrooted gene tree can be topologically discordant with the unrooted model species tree [8]. In such situations, the model species tree is said to be in the anomaly zone or the offending gene tree is said to be anomalous. It is now widely recognized that anomalous gene trees can challenge traditional species tree estimation methods [9, 10].

The statistical theory described above has motivated the development of quartet-based methods (e.g., [11, 12]) and is central to their proofs of statistical consistency under the MSC. ASTRAL [12], in particular, has become a gold standard approach to multi-locus species tree estimation. Moreover, new and improved quartet-based methods are continually being developed [13–18]. Similar theory and methodology have been given for triplets: three-leaf, rooted, phylogenetic trees [19–21].

Inspired by these efforts, we study the utility of quartets and triplets for estimating cell lineage trees under a popular tumor phylogenetics model [2, 22–24], in which mutations arise on a (highly unresolved) cell lineage tree according to the infinite sites model and then errors and missing values are introduced to the resulting mutation data in an unbiased fashion. The idea is that deviations from a perfect phylogeny can be attributed to sequencing errors, as data produced by single-cell protocols are notoriously error-prone and sparse. Although the infinite sites plus unbiased error and missingness ( $IS + UEM$ ) model generates mutations rather than gene trees, quartets (or triplets) are implied by mutations that are present in two cells and absent from two cells (or one cell).

Our main result is that there are no anomalous quartets under the $IS + UEM$ model; this motivates seeking a cell lineage tree such that the number of quartets shared between it and the input mutations is maximized. We prove an optimal solution to this problem is a consistent estimator of the unrooted cell lineage tree; this guarantee extends to the case of highly unresolved model trees, with error defined as the number of false negative branches. Somewhat surprisingly, our positive finding for quartets does not extend to triplets, as there can be anomalous triplets under the $IS + UEM$ model under reasonable conditions. These results generalize to any model of 2-state character evolution for which there are no anomalous quartets or triplets. An example of such a model is the infinite sites plus neutral Wright-Fisher ( $IS + nWF$ ) model [25, 26] and its approximations [46]. Under $IS + nWF$ , mutations follow the IS assumption but evolve within a species tree, so deviations from a perfect phylogeny are due to genetic drift. Nevertheless, there are no anomalous triplets (see Additional file 1 of [47]) and no anomalous quartets (Theorem 1 in [48]; also see [49]), motivating the application of quartet-based methods to estimate species trees from low-homoplasy retroelement insertion presence/absence patterns [48, 50]. However, our work is largely motivated by tumor phylogenetics, so we conclude by outlining how quartet-based methods might be employed in this setting, given other important challenges like copy number aberrations (CNAs) and doublets.

Background

We now provide some background on phylogenetic trees, models of evolution, and statistical consistency.

Phylogenetic trees

A phylogenetic tree is defined by the triple $(g, X, ϕ)$, where g is a connected acyclic graph, X is a set of labels (often representing species or cells), and $ϕ$ is a bijection between the labels in X and leaves (i.e., vertices with degree 1) of g. Phylogenetic trees can be either unrooted or rooted, and we use u(T) to denote the unrooted version of a rooted tree T. Edges in an unrooted tree are undirected, whereas edges in a rooted tree are directed away from the root, a special vertex with in-degree 0 (all other vertices have in-degree 1). Vertices that are neither leaves nor the root are called internal vertices, and edges incident to only internal vertices are called internal edges (otherwise they are referred to as terminal edges). An internal vertex with degree greater than 3 (called a polytomy) can be introduced to a tree by contracting one of its edges (i.e., deleting the edge and identifying its endpoints). A refinement of a polytomy is the opposite of a contraction. If there are no polytomies in T, we say that T is binary or fully resolved; otherwise, we say that T is non-binary. We use the phrase highly unresolved to indicate that T contains many polytomies and/or that the polytomies in T have high degrees.

As previously mentioned, many methods for species tree estimation are based on quartets. A quartet is an unrooted, binary tree with four leaves. We denote the three possible quartets on $X = \{A, B, C, D\}$ as $q_{1} = A, B | C, D$, $q_{2} = A, C | B, D$, and $q_{3} = A, D | B, C$. A set of quartets can be created from an unrooted tree T by restricting T to every possible subset of four leaves (i.e., deleting the other leaves from T and then suppressing vertices of degree 2). The resulting set Q(T) is called the quartet encoding of T, and we say that T displays quartet q if $q \in Q (T)$. Importantly, if T contains polytomies, restricting T to some subsets of four labels will not produce a (binary) quartet. Some selections will produce star trees, which do not provide any topological information. We use $T|_{S}$ to denote T restricted to label set S (note that if branch parameters are associated with T, they are added together when suppressing vertices of degree 2). The concepts described above for quartets extend to triplets. A triplet is a rooted, binary tree with three leaves, and we denote the three possible triplets on $X = \{A, B, C\}$ as $t_{A} = A | B, C$, $t_{B} = B | A, C$, and $t_{C} = C | A, B$. Lastly, a bipartition or split of label set X partitions it into two disjoint subsets. It is easy to see that each edge in an unrooted tree induces a bipartition, and we use Bip(T) to denote the set of bipartitions induced by all edges in T.

Mutations and models of evolution

A mutation matrix M is an $n \times k$ matrix, where n is the number of rows (representing cells or species) and k is the number of columns (representing mutations). Columns are also referred to as characters or site patterns. Our focus here is on 2-state characters, with $M_{i, j} = 0$ indicating that mutation j is absent from cell i and $M_{i, j} = 1$ indicating that mutation j is present in cell i. In tumor phylogenetics, mutations are called in reference to a healthy cell, which is the root of the cell lineage tree; thus, 0 represents the ancestral state and 1 represents mutant/derived state (note that this interpretation of states 0 and 1 will only be important when looking at triplets and not quartets).

Throughout this paper, we assume the mutation matrix D is generated under a hierarchical model with two steps (Fig. 1).

A mutation matrix G is generated under some model $M$ , parameterized by a rooted phylogenetic tree topology $σ$ and a set $Θ$ of associated numeric parameters. Importantly, model $M$ given $(σ, Θ)$ defines a probability distribution on mutation patterns, and we assume mutations in G are independent and identically distributed (i.i.d.) according to this model. For simplicity of notation, we typically omit the dependence on the numeric parameters in $Θ$ .
Errors and/or missing values are introduced to the ground truth matrix G according to the UEM model (described below). The result of this process is the observed matrix D.

Hierarchical models of this form, denoted $M + UEM$ , define a probability distribution on mutation patterns given their parameters. Thus, if we say that D is generated under the $M + UEM$ model, then we assume the mutations in D are i.i.d. according to this model. We now describe the data generation steps in greater detail for a popular tumor phylogenetics model [2, 22–24].

Fig. 1 — The schematic shows a model cell lineage tree, where the dashed lines and circles are “fake” edges and vertices, respectively. If we assume a mutation occurs on any non-fake edges with equal probability (as in [24]), then the probability of a mutation on any solid edge will be 1/11. Mutations cannot occur on any of the dashed edges. Data are generated from this model cell lineage tree in two steps. First, mutations arise on the tree under the IS model, producing data matrix G. Second, false positives (0 flips to 1; shown in red), false negatives (1 flips to 0; shown in blue), and missing values (0/1 flips to ?; shown in grey) are introduced to G under the UEM model, producing data matrix D

Step 1: Infinite Sites (IS) model. For tumor phylogenetics, we take $M$ to be the infinite sites (IS) model, so the mutation matrix G is generated under the IS model given a rooted cell lineage tree $σ$ and a set $Θ$ of edge probabilities that sum to 1. Specifically, every edge e in $σ$ is associated with a numeric value $p (e) \in Θ$, indicating the probability that a mutation occurs on e. When a mutation occurs on e, all cells on a directed path from e to any of the leaves of $σ$ are set to state 1; all other cells are set to state 0. Thus, a mutation corresponds to the bipartition induced by the branch on which it occurred. Internal edges on which mutations cannot occur are contracted, so that the probability of a mutation on any edge in $σ$ is strictly greater than zero. Terminal edges on which mutations cannot occur are not contracted; however, we refer to these edges (and the leaves incident to them) as “fake”.
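Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_is`, the dictionary-based tree encoding, and the toy tree with its edge probabilities are all hypothetical and do not come from the paper.

```python
import random

def simulate_is(children, edge_prob, leaves, k, seed=0):
    """Draw k mutations under the IS model on a rooted tree.

    children:  dict mapping a vertex to the list of its child vertices.
    edge_prob: dict mapping a vertex v to p(e) for the edge entering v
               (the values are assumed to sum to 1).
    leaves:    ordered list of leaf labels (the rows of G).
    Returns G as a list of rows, one per leaf, each with k entries in {0, 1}.
    """
    rng = random.Random(seed)

    def leaves_below(v):
        kids = children.get(v, [])
        if not kids:
            return {v}
        below = set()
        for c in kids:
            below |= leaves_below(c)
        return below

    edges = list(edge_prob)
    weights = [edge_prob[v] for v in edges]
    G = [[0] * k for _ in leaves]
    for j in range(k):
        # Pick the edge on which mutation j occurs.
        v = rng.choices(edges, weights=weights)[0]
        mutated = leaves_below(v)
        # Every leaf below that edge carries the mutation.
        for i, leaf in enumerate(leaves):
            if leaf in mutated:
                G[i][j] = 1
    return G

# Hypothetical toy tree: root -> (u, C, D) and u -> (A, B).
children = {"root": ["u", "C", "D"], "u": ["A", "B"]}
edge_prob = {"u": 0.4, "A": 0.15, "B": 0.15, "C": 0.15, "D": 0.15}
G = simulate_is(children, edge_prob, ["A", "B", "C", "D"], k=20)
```

Because each column of G corresponds to a single edge, every column is either a singleton pattern or the pattern 1100 induced by the internal edge above A and B.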

Step 2: Unbiased Error and Missingness (UEM) model. If mutation matrix G is generated under the IS model given $(σ, Θ)$ , then reconstructing $σ$ is trivial. However, for tumor phylogenetics, false positives and false negatives are introduced to G, producing the observed matrix D. This is done according to Eq. 1:

\begin{aligned} P (D_{i, j} = x | α, β, G_{i, j} = y) = \begin{cases} (1 - α) & \text{if } D_{i, j} = 0 \text{ and } G_{i, j} = 0 \\ α & \text{if } D_{i, j} = 1 \text{ and } G_{i, j} = 0 \\ β & \text{if } D_{i, j} = 0 \text{ and } G_{i, j} = 1 \\ (1 - β) & \text{if } D_{i, j} = 1 \text{ and } G_{i, j} = 1 \end{cases} \qquad (1) \end{aligned}

where $0 \leq α < 1$ and $0 \leq β < 1$ are the probabilities of false positives and false negatives, respectively. Simultaneously, missing values are introduced to G with probability $0 \leq γ < 1$; this can be incorporated into the model by multiplying each of the cases in Eq. 1 by $(1 - γ)$.
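The error and missingness step can be sketched as follows, assuming entries are perturbed independently. The function name `apply_uem` and the use of `"?"` for a missing value are illustrative choices, not part of the original model description.

```python
import random

def apply_uem(G, alpha, beta, gamma, seed=0):
    """Return D obtained from G by unbiased errors and missing values.

    Each entry is independently masked with probability gamma ('?').
    Otherwise a 0 flips to 1 with probability alpha (false positive)
    and a 1 flips to 0 with probability beta (false negative), as in Eq. 1.
    """
    rng = random.Random(seed)
    D = []
    for row in G:
        new_row = []
        for y in row:
            if rng.random() < gamma:
                new_row.append("?")
            elif y == 0:
                new_row.append(1 if rng.random() < alpha else 0)
            else:
                new_row.append(0 if rng.random() < beta else 1)
        D.append(new_row)
    return D

# Hypothetical 4 x 3 ground truth matrix.
G = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
D = apply_uem(G, alpha=0.001, beta=0.2, gamma=0.05)
```

With $α = β = γ = 0$ the function returns a copy of G, which matches the observation that reconstruction is trivial without errors.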

Our goal is to estimate cell lineage trees under the $IS + UEM$ model. An important property for phylogeny estimation methods is whether they are statistically consistent under the model of interest.

Definition 1

(Statistical Consistency; see Section 1.1 of [27]) Let $A$ be some model that generates mutations, and let D be a mutation matrix, with n rows (cells or species) and k columns (mutations), generated under $A$ given rooted tree $σ$ and numerical parameters $Θ$ . We say that an estimation method is statistically consistent under $A$ if for any $ϵ > 0$ , there exists a constant $K > 0$ such that when D contains at least K mutations, the method given D returns (the unrooted version of) $σ$ with probability at least $1 - ϵ$ . Alternatively, we might say that the error in the tree estimated from D is zero with probability at least $1 - ϵ$ .

The idea is that as the number k of mutations goes towards infinity, the error in the estimated tree is zero with high probability. Tree error is typically defined as the number of false negative branches (i.e., branches in $σ$ that are missing from the estimated tree) plus the number of false positive branches (i.e., branches in the estimated tree that are missing from $σ$ ).
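The tree error just described can be computed from bipartition sets. The sketch below assumes each tree is supplied as a set of its internal bipartitions; the helper names `tree_error` and `bip` and the five-leaf example are hypothetical.

```python
def tree_error(true_bips, est_bips):
    """Count false negative and false positive branches.

    Each tree is given as a set of bipartitions, and each bipartition
    as a frozenset holding its two sides (themselves frozensets).
    """
    fn = len(true_bips - est_bips)  # branches of the model tree that are missed
    fp = len(est_bips - true_bips)  # branches of the estimate absent from the model tree
    return fn, fp

def bip(side_a, side_b):
    """Build an unordered bipartition from two iterables of leaf labels."""
    return frozenset([frozenset(side_a), frozenset(side_b)])

# Hypothetical example: the model tree has one internal branch (a polytomy
# on C, D, E), whereas the estimate is fully resolved.
true_tree = {bip("AB", "CDE")}
est_tree = {bip("AB", "CDE"), bip("ABC", "DE")}
fn, fp = tree_error(true_tree, est_tree)
print(fn, fp)  # 0 1
```

In this example the binary estimate refines the unresolved model tree, so it has no false negative branches but one false positive branch.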

No anomalous quartets under an unbiased error and missingness model

To begin, we assume that the rooted cell lineage tree $σ$ has four leaves; therefore, it must have one of five tree shapes shown in Fig. 2. Two of them display a star when unrooted, and the other three correspond to a quartet when unrooted. If mutations are generated i.i.d. under some model $A$ given $σ$, there are 16 possible patterns on four cells, denoted $\{A, B, C, D\}$. A quartet is implied by two cells being in state 1 and two cells being in state 0. Therefore, two patterns ($A B C D = 0011$ and 1100) support quartet $q_{1} = A, B | C, D$, two patterns (0101 and 1010) support quartet $q_{2} = A, C | B, D$, two patterns (0110 and 1001) support quartet $q_{3} = A, D | B, C$, and the other 10 patterns do not provide topological information. Henceforth, we denote the probability of quartets under model $A$ given $σ$ as $P_{A} (q_{1} | σ) = P_{A} (1100 | σ) + P_{A} (0011 | σ)$, $P_{A} (q_{2} | σ) = P_{A} (1010 | σ) + P_{A} (0101 | σ)$, and $P_{A} (q_{3} | σ) = P_{A} (1001 | σ) + P_{A} (0110 | σ)$. Now we consider quartet-informative patterns generated from a model tree with more than four leaves.
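The mapping from patterns to quartets can be written down directly. The following is a minimal sketch; the function names `quartet_of` and `quartet_counts` are hypothetical, and patterns are assumed to be strings over 0, 1, and ? in the cell order A, B, C, D.

```python
def quartet_of(pattern):
    """Map a 4-character pattern on cells (A, B, C, D) to the quartet it supports.

    Returns 'q1' (A,B|C,D), 'q2' (A,C|B,D), 'q3' (A,D|B,C), or None when the
    pattern is uninformative (not a 2-versus-2 split, or it has a missing value).
    """
    if len(pattern) != 4 or set(pattern) - {"0", "1"}:
        return None
    if pattern.count("1") != 2:
        return None
    a, b, c, d = pattern
    if a == b:
        return "q1"
    if a == c:
        return "q2"
    return "q3"

def quartet_counts(patterns):
    """Tally how many patterns support each of the three quartets."""
    counts = {"q1": 0, "q2": 0, "q3": 0}
    for p in patterns:
        q = quartet_of(p)
        if q is not None:
            counts[q] += 1
    return counts

print(quartet_counts(["1100", "0011", "1010", "1000", "1?00"]))
# {'q1': 2, 'q2': 1, 'q3': 0}
```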

Definition 2

(No anomalous quartets) We say that there are no anomalous quartets under model $A$ given rooted tree $σ$ if the following inequalities hold for every subset S of four species in $σ$. Let $q_{1}, q_{2}, q_{3}$ denote the three quartets on S, and let i index $\{1, 2, 3\}$.

If $u (σ)|_{S} = q_{i}$, $P_{A} (q_{i} | σ) > P_{A} (q_{j} | σ)$ for all $j \in \{1, 2, 3\}$ such that $i \neq j$.
If $u (σ)|_{S}$ is a star, $P_{A} (q_{1} | σ) = P_{A} (q_{2} | σ) = P_{A} (q_{3} | σ)$.

This brings us to the main result of this section.

Theorem 1

There are no anomalous quartets under the $IS + UEM$ model, assuming $α + β \neq 1$ .

The statement above directly follows from Lemma 1 and Corollary 1.

Lemma 1

There are no anomalous quartets under the $IS$ model. Moreover, all quartet-informative patterns have zero probability except for one or both of the patterns corresponding to $u (σ)$ when $u (σ)$ is not a star.

If $σ$ has more than four leaves, we can restrict $σ$ to any subset of four leaves and get a valid sub-model (i.e., a sub-model for which the probability of the mutation patterns on four cells is the same as under the larger model tree). For the IS model, the sub-model is formed by deleting the other leaves and adding branch parameters together when suppressing vertices of degree 2. The mutation pattern probabilities for the four cells under this sub-model will be the same as the larger tree because addition represents an or condition (i.e., a mutation occurring on this branch or on that branch will produce the same pattern when looking at only a subset of cells). Thus, it suffices to verify that there are no anomalous quartets for $σ$ with four leaves. This can be done by considering a mutation occurring on each of the internal branches of all possible rooted tree shapes with four leaves (Fig. 2) and comparing the resulting pattern to the unrooted tree shape; see Additional file 1 for details.

The following two lemmas will also be useful later.

Lemma 2

Let $0 \leq α < 1$ and $0 \leq β < 1$ . Then,

\begin{aligned} \left( (1 - β)^{2} (1 - α)^{2} + β^{2} α^{2} \right) - 2 β (1 - β) α (1 - α) = (1 - (α + β))^{2} > 0 \qquad (2) \end{aligned}

for $α + β \neq 1$ . If $α + β = 1$ , the inequality in Eq. 2 becomes an equality.

The statement above follows from expanding the polynomials; see Additional file 1 for details.
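One short way to see the identity is to recognize the left-hand side as a perfect square. The symbols $a$ and $b$ below are shorthand introduced only for this sketch and are not notation from the paper.

```latex
\begin{aligned}
a &:= (1 - α)(1 - β), \qquad b := α β, \\
a^{2} + b^{2} - 2ab &= (a - b)^{2}, \\
a - b &= 1 - α - β + α β - α β = 1 - (α + β), \\
\text{so}\quad (1 - β)^{2} (1 - α)^{2} + β^{2} α^{2} - 2 β (1 - β) α (1 - α) &= (1 - (α + β))^{2}.
\end{aligned}
```

The square is strictly positive unless $α + β = 1$, in which case it equals zero.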

Lemma 3

If there are no anomalous quartets under model $M$ , then there are no anomalous quartets under the $M + UE$ model, assuming $α + β \neq 1$ .

Proof

Taking any subset of four leaves, there are 16 possible mutation patterns that may occur under model $M$ . These are the two invariant patterns (0000 and 1111), the eight variant but quartet-uninformative patterns (1000, 0100, 0010, 0001, 0111, 1011, 1101, 1110), and the six quartet-informative patterns (1100, 0011, 0101, 1010, 0110, 1001). For each pattern g listed above, we enumerate all possible ways of introducing errors (false positives and false negatives); this gives us the probability of each of the 16 mutation patterns under the $UE$ model given $(α, β)$ . Now we need to put this information together to get the probability of quartets under the $M + UE$ model. First, we compute the probability of observing any quartet q from errors (false positives and false negatives) being introduced to the invariant and variant but quartet-uninformative characters; see Eq. 3.

\begin{aligned} f (α, β, σ) = {} & 2 α^{2} (1 - α)^{2} \cdot P_{M} (0000 | σ) + 2 β^{2} (1 - β)^{2} \cdot P_{M} (1111 | σ) \\ & + \left( α (1 - α)^{2} (1 - β) + α^{2} (1 - α) β \right) \cdot \left( P_{M} (0001 | σ) + P_{M} (0010 | σ) + P_{M} (0100 | σ) + P_{M} (1000 | σ) \right) \\ & + \left( β (1 - β)^{2} (1 - α) + β^{2} (1 - β) α \right) \cdot \left( P_{M} (1110 | σ) + P_{M} (1101 | σ) + P_{M} (1011 | σ) + P_{M} (0111 | σ) \right) ; \qquad (3) \end{aligned}

see Additional file 1: Tables S1–S4 for details. Second, we repeat this calculation for the quartet-informative patterns; see Table 1 and Additional file 1: Tables S5–S6 for details. Putting it all together gives us the probability of each quartet under the $M + UE$ model

\begin{aligned} P_{M + UE} (q_{i} | α, β, σ) = {} & f (α, β, σ) \\ & + \left( (1 - α)^{2} (1 - β)^{2} + α^{2} β^{2} \right) \cdot P_{M} (q_{i} | σ) \\ & + 2 α β (1 - α) (1 - β) \cdot \left( P_{M} (q_{j} | σ) + P_{M} (q_{k} | σ) \right) \qquad (4) \end{aligned}

for $i, j, k \in \{1, 2, 3\}$ such that $i \neq j \neq k$. Now we can compute the difference in probabilities between quartets $q_{i}$ and $q_{j}$ under the $M + UE$ model for any $i, j \in \{1, 2, 3\}$ such that $i \neq j$. By Lemma 2, we have

\begin{aligned} P_{M + UE} (q_{i} | α, β, σ) - P_{M + UE} (q_{j} | α, β, σ) = (1 - (α + β))^{2} \cdot \left( P_{M} (q_{i} | σ) - P_{M} (q_{j} | σ) \right) . \qquad (5) \end{aligned}

Assuming $α + β \neq 1$ , this quantity is zero if $P_{M} (q_{i} | σ) = P_{M} (q_{j} | σ)$ and greater than zero if $P_{M} (q_{i} | σ) > P_{M} (q_{j} | σ)$ . Because there are no anomalous quartets under model $M$ , the former will be the case if $u (σ)$ is a star; the latter will be the case if $u (σ) = q_{i}$ . It follows there are no anomalous quartets under the $M + UE$ model. $□$
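The difference in Eq. 5 can be checked by brute force, enumerating every way errors can turn a true pattern into a quartet-informative one. This is a sanity check rather than a proof. The pattern distribution `p_model` is a hypothetical example with $u (σ) = q_{1}$, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

PATTERNS = ["".join(bits) for bits in product("01", repeat=4)]
QUARTETS = {"q1": ("1100", "0011"), "q2": ("1010", "0101"), "q3": ("1001", "0110")}

def flip_prob(g, d, alpha, beta):
    """Probability that true pattern g is observed as pattern d under UE (Eq. 1)."""
    p = Fraction(1)
    for y, x in zip(g, d):
        if y == "0":
            p *= alpha if x == "1" else 1 - alpha
        else:
            p *= beta if x == "0" else 1 - beta
    return p

def quartet_probs(p_model, alpha, beta):
    """Quartet probabilities under M + UE, given pattern probabilities under M."""
    return {
        q: sum(p_model.get(g, 0) * flip_prob(g, d, alpha, beta)
               for g in PATTERNS for d in pats)
        for q, pats in QUARTETS.items()
    }

# Hypothetical distribution under M: only q1-supporting and singleton patterns.
p_model = {"1100": Fraction(3, 10), "1000": Fraction(2, 10), "0100": Fraction(2, 10),
           "0010": Fraction(2, 10), "0001": Fraction(1, 10)}
alpha, beta = Fraction(1, 1000), Fraction(1, 5)

noisy = quartet_probs(p_model, alpha, beta)
lhs = noisy["q1"] - noisy["q2"]
rhs = (1 - (alpha + beta)) ** 2 * Fraction(3, 10)  # Eq. 5 with P_M(q2) = 0
print(lhs == rhs)
```

Exact rational arithmetic is used so that the two sides can be compared with equality rather than a floating-point tolerance.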

Table 1.

List of mutation patterns with four cells $\{A, B, C, D\}$ that can be generated by introducing false positives and false negatives to pattern #12 ($G_{*, j} = 1100$) and pattern #3 ($G_{*, j} = 0011$) as well as their probabilities under the $UE$ model


The red values indicate a false positive introduced to $G_{*, j}$ by flipping 0 to 1. The blue values indicate a false negative introduced to $G_{*, j}$ by flipping 1 to 0. Similar tables for the other 14 patterns are provided in Additional file 1

Note that the quantity $α + β$ is unlikely to equal 1 in practice, as both probabilities should be less than 0.5. We now extend the result above to address unbiased missing values, in addition to errors.

Corollary 1

If there are no anomalous quartets under model $M$ , then there are no anomalous quartets under the $M + UEM$ model, assuming that $α + β \neq 1$ .

Proof

If one or more of the values in a mutation pattern is missing, then no quartet is displayed. Thus, unbiased missingness can be accounted for in the proof of Lemma 3 simply by updating $P_{M} (x | σ)$ to ${(1 - γ)}^{4} \cdot P_{M} (x | σ)$ . In this case, Eq. 5 becomes

\begin{aligned} & P_{M + UEM} (q_{i} | α, β, γ, σ) - P_{M + UEM} (q_{j} | α, β, γ, σ) \\ & \quad = (1 - (α + β))^{2} \cdot (1 - γ)^{4} \cdot \left( P_{M} (q_{i} | σ) - P_{M} (q_{j} | σ) \right) \qquad (6) \end{aligned}

which does not change our argument. $□$

This concludes the derivation of the main result of this section (Theorem 1). Before moving on to triplets, we note that the difference in quartet probabilities (Eq. 6) depends on (1) the probability of false positives and negatives (specifically how close their sum is to one), (2) the probability of missing values, and (3) the probability of observing quartet-informative patterns under model $M$ given $σ$. In the simulations performed by Kizilkale et al. [24], the largest values of $α$, $β$, and $γ$ were 0.001, 0.2, and 0.05, respectively. In this scenario with $u (σ) = q_{i}$, we have $P_{IS + UEM} (q_{i} | α, β, γ, σ) - P_{IS + UEM} (q_{j} | α, β, γ, σ) = 0.52 \cdot P_{IS} (q_{i} | σ)$ for any $i, j \in \{1, 2, 3\}$ such that $i \neq j$. The magnitude of $P_{IS} (q_{i} | σ)$ depends on the lineage tree and the four cells sampled from it. We discuss sampling further in the context of triplets; for now, we note that, in practical settings, $P_{IS} (q_{i} | σ)$ may have a greater impact on the difference in quartet probabilities under the $IS + UEM$ model than $α$, $β$, or $γ$.
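The factor 0.52 quoted above is the multiplier in Eq. 6 evaluated at the stated parameter values. A small sketch reproduces it; the function name `quartet_gap_factor` is hypothetical.

```python
def quartet_gap_factor(alpha, beta, gamma):
    """Multiplier on P_M(q_i) - P_M(q_j) in Eq. 6."""
    return (1 - (alpha + beta)) ** 2 * (1 - gamma) ** 4

factor = quartet_gap_factor(alpha=0.001, beta=0.2, gamma=0.05)
print(round(factor, 2))  # 0.52
```

The first term contributes $0.799^{2} \approx 0.64$ and the second $0.95^{4} \approx 0.81$, so at these settings errors shrink the gap somewhat more than missing values do.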

Anomalous triplets under an unbiased error and missingness model

We now derive related results for triplets. To begin, we assume the rooted cell lineage tree $σ$ has three leaves; therefore, it must have one of two topologies: binary or non-binary (Additional File 1: Fig S1). If mutations are generated i.i.d. under some model $A$ given $σ$, there are 8 possible patterns on three cells, denoted $\{A, B, C\}$. A triplet is implied by two cells being in state 1 (i.e., the mutant/derived state) and one cell being in state 0 (i.e., the ancestral state) because the two cells harboring the mutation must have descended from a common ancestor cell also harboring the mutation. One pattern ($A B C = 110$) supports triplet $t_{C} = C | A, B$, one pattern (101) supports triplet $t_{B} = B | A, C$, one pattern (011) supports triplet $t_{A} = A | B, C$, and the other five patterns do not provide topological information. Henceforth, we denote the probability of triplets under model $A$ given $σ$ as $P_{A} (t_{A} | σ) = P_{A} (011 | σ)$, $P_{A} (t_{B} | σ) = P_{A} (101 | σ)$, and $P_{A} (t_{C} | σ) = P_{A} (110 | σ)$. Now we consider triplet-informative patterns generated from a model tree with more than three leaves.

Definition 3

(No anomalous triplets) We say that there are no anomalous triplets under model $A$ if the following inequalities hold for every subset S = {X, Y, Z} of three species in $σ$ . Let $t_{X}, t_{Y}, t_{Z}$ denote the three triplets on S, and let i index {X, Y, Z}.

If $σ|_{S} = t_{i}$, $P_{A} (t_{i} | σ) > P_{A} (t_{j} | σ)$ for all $j \in \{X, Y, Z\}$ such that $i \neq j$.
If $σ|_{S}$ is non-binary, $P_{A} (t_{X} | σ) = P_{A} (t_{Y} | σ) = P_{A} (t_{Z} | σ)$.

This brings us to the main result of this section.

Theorem 2

There are no anomalous triplets under the $IS + UEM$ model, assuming one of two conditions: (1) $α = 0$ or (2) $α + β \neq 1$ and $P_{IS} (100 | σ) = P_{IS} (010 | σ) = P_{IS} (001 | σ)$ . Otherwise, there can be anomalous triplets under the $IS + UEM$ model.

The statement above directly follows from Lemma 4 and Corollary 2.

Lemma 4

There are no anomalous triplets under the $IS$ model. Moreover, all triplet-informative patterns have zero probability except for the pattern corresponding to $σ$ when $σ$ is binary.

If $σ$ has more than three leaves, we can restrict $σ$ to any subset of three leaves and get a valid sub-model (i.e., a sub-model for which the probability of the mutation patterns on three cells is the same as under the larger model tree, as discussed for quartets). Thus, it suffices to verify that there are no anomalous triplets for $σ$ with three leaves. This can be done by considering a mutation occurring on each of the internal branches of all possible rooted tree shapes with three leaves (Additional File 1: Fig S1) and comparing the resulting pattern to the tree shape; see Additional File 1 for details.

Lemma 5

If there are no anomalous triplets under model $M$ , then there are no anomalous triplets under the $M + UE$ model, assuming one of two conditions: (1) $α = 0$ or (2) $α + β \neq 1$ and $P_{M} (100 | σ) = P_{M} (010 | σ) = P_{M} (001 | σ)$ . Otherwise, there can be anomalous triplets under the $M + UE$ model.

Proof

Taking any subset of three leaves, there are 8 possible mutation patterns that may occur under model $M$ . These are the two invariant patterns ( $A B C = 000$ and 111), the three variant but triplet-uninformative patterns (100, 010, 001), and the three triplet-informative patterns (110, 101, 011). For each pattern g listed above, we enumerate all possible ways of introducing errors (false positives and false negatives); this gives us the probability of mutation patterns under the UE model given $(α, β, g)$ ; see Additional file 1: Tables S7–S14. Putting everything together, we find the probability of triplet $t_{C}$ under the $M + UE$ model is

\begin{aligned} P_{M + UE} (t_{C} | α, β, σ) = {} & α^{2} (1 - α) \cdot P_{M} (000 | σ) + β (1 - β)^{2} \cdot P_{M} (111 | σ) \\ & + α^{2} β \cdot P_{M} (001 | σ) \\ & + α (1 - α) (1 - β) \cdot \left( P_{M} (100 | σ) + P_{M} (010 | σ) \right) \\ & + (1 - α) (1 - β)^{2} \cdot P_{M} (110 | σ) \\ & + α β (1 - β) \cdot \left( P_{M} (101 | σ) + P_{M} (011 | σ) \right) \qquad (7) \end{aligned}

Similar probabilities can be computed for $t_{B}$ and $t_{A}$ . To provide a general formula, we define

\begin{aligned} g (α, β, σ) = α^{2} (1 - α) \cdot P_{M} (000 | σ) + β (1 - β)^{2} \cdot P_{M} (111 | σ) \qquad (8) \end{aligned}

and set $x_{A} = 100$ , $x_{B} = 010$ , and $x_{C} = 001$ . This allows us to write the probability of any triplet as

\begin{aligned} P_{M + UE} (t_{i} | α, β, σ) = {} & g (α, β, σ) \\ & + α^{2} β \cdot P_{M} (x_{i} | σ) \\ & + α (1 - α) (1 - β) \cdot \left( P_{M} (x_{j} | σ) + P_{M} (x_{k} | σ) \right) \\ & + (1 - α) (1 - β)^{2} \cdot P_{M} (t_{i} | σ) \\ & + α β (1 - β) \cdot \left( P_{M} (t_{j} | σ) + P_{M} (t_{k} | σ) \right) \qquad (9) \end{aligned}

where $i, j, k \in \{A, B, C\}$ with $i \neq j \neq k$. Now we can compute the differences in probabilities between $t_{i}$ and $t_{j}$ under the $M + UE$ model for any $i, j \in \{A, B, C\}$ such that $i \neq j$. We find

\begin{aligned} & P_{M + UE} (t_{i} | α, β, σ) - P_{M + UE} (t_{j} | α, β, σ) \\ & \quad = α^{2} β \cdot (P_{M} (x_{i} | σ) - P_{M} (x_{j} | σ)) + α (1 - α) (1 - β) \cdot (P_{M} (x_{j} | σ) - P_{M} (x_{i} | σ)) \\ & \qquad + (1 - α) (1 - β)^{2} \cdot (P_{M} (t_{i} | σ) - P_{M} (t_{j} | σ)) + α β (1 - β) \cdot (P_{M} (t_{j} | σ) - P_{M} (t_{i} | σ)) \\ & \quad = (α^{2} β - α (1 - α) (1 - β)) \cdot (P_{M} (x_{i} | σ) - P_{M} (x_{j} | σ)) \\ & \qquad + ((1 - α) (1 - β)^{2} - α β (1 - β)) \cdot (P_{M} (t_{i} | σ) - P_{M} (t_{j} | σ)) \\ & \quad = α (1 - (α + β)) \cdot (P_{M} (x_{j} | σ) - P_{M} (x_{i} | σ)) \\ & \qquad + (1 - β) (1 - (α + β)) \cdot (P_{M} (t_{i} | σ) - P_{M} (t_{j} | σ)) \qquad (10) \end{aligned}

Assuming that either condition (1) $α = 0$ or condition (2) $α + β \neq 1$ and $P_{M} (x_{i} | σ) = P_{M} (x_{j} | σ)$, this quantity is zero if $P_{M} (t_{i} | σ) = P_{M} (t_{j} | σ)$ and is greater than zero if $P_{M} (t_{i} | σ) > P_{M} (t_{j} | σ)$. Because there are no anomalous triplets under $M$, the former will be the case if $σ$ is non-binary; the latter will be the case if $σ = t_{i}$. It follows there are no anomalous triplets under the $M + UE$ model, provided one of the two conditions holds. If these conditions do not hold and $σ \neq t_{j}$, triplet $t_{j}$ is anomalous when

\begin{aligned} \frac{α}{(1 - β)} \cdot (P_{M} (x_{i} | σ) - P_{M} (x_{j} | σ)) - (P_{M} (t_{i} | σ) - P_{M} (t_{j} | σ)) > 0 \qquad (11) \end{aligned}

for $i, j \in \{A, B, C\}$ such that $i \neq j$. $□$
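As with quartets, the closed-form difference in Eq. 10 can be checked against a brute-force enumeration of error events on three cells. The pattern distribution `p_model` and the error rates below are hypothetical values chosen only to exercise the formula.

```python
from fractions import Fraction
from itertools import product

PATTERNS3 = ["".join(bits) for bits in product("01", repeat=3)]
TRIPLET_PATTERN = {"A": "011", "B": "101", "C": "110"}  # patterns supporting t_A, t_B, t_C
SINGLETON = {"A": "100", "B": "010", "C": "001"}        # x_A, x_B, x_C

def flip_prob(g, d, alpha, beta):
    """Probability that true pattern g is observed as pattern d under UE (Eq. 1)."""
    p = Fraction(1)
    for y, x in zip(g, d):
        if y == "0":
            p *= alpha if x == "1" else 1 - alpha
        else:
            p *= beta if x == "0" else 1 - beta
    return p

def triplet_prob(i, p_model, alpha, beta):
    """P_{M+UE}(t_i) obtained by summing over all true patterns."""
    d = TRIPLET_PATTERN[i]
    return sum(p_model.get(g, 0) * flip_prob(g, d, alpha, beta) for g in PATTERNS3)

def eq10_rhs(i, j, p_model, alpha, beta):
    """Final line of Eq. 10."""
    s = 1 - (alpha + beta)
    dx = p_model.get(SINGLETON[j], 0) - p_model.get(SINGLETON[i], 0)
    dt = p_model.get(TRIPLET_PATTERN[i], 0) - p_model.get(TRIPLET_PATTERN[j], 0)
    return alpha * s * dx + (1 - beta) * s * dt

# Hypothetical distribution under M for a tree with topology t_C = C | A, B.
p_model = {"110": Fraction(1, 10), "100": Fraction(1, 10), "010": Fraction(1, 10),
           "001": Fraction(6, 10), "111": Fraction(1, 10)}
alpha, beta = Fraction(1, 10), Fraction(1, 5)

checks = {
    (i, j): triplet_prob(i, p_model, alpha, beta) - triplet_prob(j, p_model, alpha, beta)
            == eq10_rhs(i, j, p_model, alpha, beta)
    for i in "ABC" for j in "ABC" if i != j
}
print(all(checks.values()))
```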

The result above extends easily to the case with unbiased missing values, in addition to unbiased error (see Corollary 1).

Corollary 2

If there are no anomalous triplets under model $M$, then there are no anomalous triplets under the $M + UEM$ model, assuming that one of two conditions holds: (1) $α = 0$ or (2) $α + β \neq 1$ and $P_{M} (100 | σ) = P_{M} (010 | σ) = P_{M} (001 | σ)$. Otherwise, there can be anomalous triplets under the $M + UEM$ model.

This concludes the derivation of the main result of this section (Theorem 2). Before moving on to phylogeny estimation, we consider whether anomalous triplets should be expected in the context of tumor phylogenetics, setting $M$ to be the $IS$ model. As previously mentioned, in the simulations performed by Kizilkale et al. [24], the largest values of $α$ and $β$ were 0.001 and 0.2, respectively. This would put $α / (1 - β) = 0.00125$ in Eq. 11, so $P_{M} (x_{i} | σ) - P_{M} (x_{j} | σ)$ would need to be more than 800 times as large as $P_{M} (t_{i} | σ) - P_{M} (t_{j} | σ)$ for $t_{j}$ to be anomalous under the $M + UE$ model (note the terms for missing values will cancel out in Eq. 11). Although this seems drastic, it could occur when $σ$ is created by restricting a larger model tree to a subset of three leaves. Consider the cell lineage tree in Fig. 1 but with 1000 additional cells added between cell 9 and cell 10, and suppose that we sample cells $\{1, 4, 10\}$, so the resulting sub-model, $σ$, has rooted topology $t_{10} = 10 | 1, 4$. This sub-model defines the following probability distribution on mutations, assuming mutations occur on all non-fake edges with equal probability. For the variant but triplet-uninformative patterns, we have $P_{IS} (x_{1} | σ) = 0$, $P_{IS} (x_{4} | σ) = 1 / 1012$, and $P_{IS} (x_{10} | σ) = 1009 / 1012$, and for triplet-informative patterns, we have $P_{IS} (t_{1} | σ) = 0$, $P_{IS} (t_{4} | σ) = 0$, and $P_{IS} (t_{10} | σ) = 1 / 1012$. Now looking at Eq. 11, we find that triplets $t_{1}$ and $t_{4}$ are anomalous under the IS + UEM model. Based on this analysis, we conjecture that triplets are less robust to error than quartets when sampling cells from a larger cell lineage tree.
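The example can be checked by plugging the quoted probabilities into the left-hand side of Eq. 11 with $i = 10$. The sketch below uses exact fractions; the function name `eq11_lhs` is hypothetical, and the pattern probabilities are copied from the text rather than recomputed from the tree.

```python
from fractions import Fraction

def eq11_lhs(alpha, beta, x_i, x_j, t_i, t_j):
    """Left-hand side of Eq. 11; a positive value flags t_j as anomalous."""
    return alpha / (1 - beta) * (x_i - x_j) - (t_i - t_j)

alpha, beta = Fraction(1, 1000), Fraction(1, 5)

# Pattern probabilities quoted in the text for sampled cells {1, 4, 10}.
x = {1: Fraction(0), 4: Fraction(1, 1012), 10: Fraction(1009, 1012)}
t = {1: Fraction(0), 4: Fraction(0), 10: Fraction(1, 1012)}

results = {j: eq11_lhs(alpha, beta, x[10], x[j], t[10], t[j]) for j in (1, 4)}
for j, value in results.items():
    print(j, value > 0)
```

Both values are positive because the singleton difference is roughly 1000 times the triplet difference, which exceeds the threshold of 800.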

Phylogeny estimation from quartets

Because there are no anomalous quartets under the $IS + UEM$ model under reasonable assumptions (i.e., $α + β \neq 1$ ), we now consider the utility of quartet-based methods for estimating cell lineage trees from mutation data. By quartet-based methods, we mean heuristics for the Maximum Quartet Support Supertree (MQSS) problem ( [4]; also see Section 7.7 in [27]).
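The role of the condition $α + β \neq 1$ can be illustrated by exact enumeration on four cells. The sketch below assumes a hypothetical $IS$ pattern distribution for a model tree with unrooted topology $a, b | c, d$ (the pattern probabilities are invented for illustration), applies independent false positives and false negatives, and reports the probability gap between the model quartet and an alternative quartet. Missing values are left out because, if each entry goes missing independently with the same probability, all complete four-cell patterns are scaled by the same factor. All names in the code are ours.

```python
# Hypothetical IS probabilities for patterns on cells (a, b, c, d).
TRUE_PROBS = {"1000": 0.2, "0100": 0.2, "0010": 0.2, "0001": 0.2,
              "1100": 0.1, "0011": 0.1}
# Each quartet is implied by two complementary patterns.
QUARTETS = {"ab|cd": ("1100", "0011"),
            "ac|bd": ("1010", "0101"),
            "ad|bc": ("1001", "0110")}


def quartet_probs(true_probs, alpha, beta):
    """Probability that an error-prone mutation implies each quartet."""
    result = {}
    for name, patterns in QUARTETS.items():
        total = 0.0
        for obs in patterns:
            for true, prob in true_probs.items():
                weight = prob
                for t, o in zip(true, obs):
                    if t == "0":
                        weight *= alpha if o == "1" else 1.0 - alpha
                    else:
                        weight *= beta if o == "0" else 1.0 - beta
                total += weight
        result[name] = total
    return result


for alpha, beta in [(0.0, 0.0), (0.001, 0.2), (0.3, 0.4), (0.7, 0.6)]:
    probs = quartet_probs(TRUE_PROBS, alpha, beta)
    gap = probs["ab|cd"] - probs["ac|bd"]
    print(f"alpha={alpha}, beta={beta}: gap={gap:.6f}")
```

In this example the gap works out to $(1 - α - β)^{2}$ times the error-free gap of 0.2, so it shrinks as $α + β$ approaches 1 but stays positive otherwise, including for the last setting where $α + β > 1$ .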

Definition 4

(Maximum Quartet Support Supertrees) The MQSS problem is defined by

Input: A set of unrooted trees $P = \{T_{1}, T_{2}, \dots, T_{k}\}$ , with tree $T_{i}$ on leaf label set $X_{i}$

Output: An unrooted, binary tree B on leaf label set $\cup_{i = 1}^{k} X_{i}$ that maximizes $Q S_{P} (B) = \sum_{q \in Q (B)} w_{P} (q)$ , where $Q (B)$ is the set of quartets displayed by B and $w_{P} (q)$ is the number of input trees in $P$ that display q

When all input trees are on the same leaf label set, MQSS becomes weighted quartet consensus (WQC). An optimal solution to WQC is a consistent estimator of the unrooted species tree topology under the MSC model (Theorem 2 in [12]). Although MQSS and WQC are NP-hard [28, 29], fast and accurate heuristics have been developed, with the most well-known being ASTRAL [12]. Since version 2 [13, 14], ASTRAL allows the input trees to be incomplete and is statistically consistent under the MSC model, provided some assumptions on missing data (see [30] for details).

There are two notable differences when using quartet-based methods to reconstruct cell lineage trees, rather than species trees. The first difference is that the input consists of mutations rather than unrooted (gene) trees. This issue was addressed by Springer et al. [31], who treat mutations as unrooted trees with at most one internal branch (Fig. 3). Given this transformation of the input, it is possible to run ASTRAL and other quartet-based methods on mutation data. The second difference is that ASTRAL outputs a binary tree. Our next result suggests that MQSS/WQC are useful problem formulations, even when data are generated from non-binary trees.

Fig. 3 — This figure shows a mutation matrix with one false negative (in blue), one false positive (in red), and one missing entry. The first step indicates how each mutation (column in the matrix) corresponds to an unrooted tree with at most one internal branch. The second step is to estimate the phylogeny by applying a quartet-based method. This approach is motivated by there being no anomalous quartets under the $IS + UEM$ model, assuming $α + β \neq 1$ (Theorem 1)
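To make the transformation in Fig. 3 and the MQSS objective concrete, the following minimal sketch counts the quartets implied by each mutation in a small hypothetical mutation matrix and sums the support for two candidate trees. The matrix, the cell names, the representation of a tree as a list of bipartitions, and the function names are our assumptions for illustration; this brute-force count is not how ASTRAL or other heuristics compute the score.

```python
from itertools import combinations


def implied_quartet(column, four_cells):
    """Return the quartet implied by one mutation on four cells, or None."""
    present = frozenset(c for c in four_cells if column.get(c) == 1)
    absent = frozenset(c for c in four_cells if column.get(c) == 0)
    if len(present) == 2 and len(absent) == 2:
        return present, absent
    return None  # uninformative, or a missing value among the four cells


def displays(tree_bipartitions, pair1, pair2):
    """A tree displays pair1 | pair2 if some branch separates the two pairs."""
    return any(
        (pair1 <= left and pair2 <= right) or (pair1 <= right and pair2 <= left)
        for left, right in tree_bipartitions
    )


def quartet_support(tree_bipartitions, columns, cells):
    """Sum w_D(q) over the quartets q displayed by the candidate tree."""
    score = 0
    for four in combinations(cells, 4):
        for column in columns:
            quartet = implied_quartet(column, four)
            if quartet is not None and displays(tree_bipartitions, *quartet):
                score += 1
    return score


cells = ["a", "b", "c", "d", "e"]
columns = [  # one dict per mutation; "?" marks a missing entry
    {"a": 1, "b": 1, "c": 0, "d": 0, "e": 0},
    {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1},
    {"a": 1, "b": "?", "c": 1, "d": 0, "e": 0},
]
# Candidate binary trees, each given by its two internal branches.
tree_1 = [(frozenset("ab"), frozenset("cde")), (frozenset("abc"), frozenset("de"))]
tree_2 = [(frozenset("ac"), frozenset("bde")), (frozenset("abc"), frozenset("de"))]
print(quartet_support(tree_1, columns, cells), quartet_support(tree_2, columns, cells))
```

On this toy input the first candidate has quartet support 7 and the second has support 5, so the first would be preferred under the MQSS criterion.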

Theorem 3

Let $σ$ be a rooted tree with at least one internal branch when unrooted (so it can be non-binary), and let D be an $n \times k$ mutation matrix generated under the $IS + UEM$ model given $(α, β, γ, σ)$ with $α + β \neq 1$ and $0 \leq α, β, γ < 1$ . Then, an optimal solution to MQSS given D is a consistent estimator of $u (σ)$ under the $IS + UEM$ model, with tree error defined as the number of false negative branches. If in addition $0 < α, β < 1$ , ASTRAL given D is statistically consistent under the $IS + UEM$ model, with tree error defined as the number of false negative branches.

The first statement above follows from Theorem 1 and Lemma 6. The second statement follows from Theorem 1, Corollary 3, and the observation that every complete mutation pattern is possible under the $IS + UEM$ model when $0 < α, β < 1$ .

Lemma 6

Let $σ$ be a rooted tree with at least one internal branch when unrooted (so it can be non-binary), and let D be an $n \times k$ mutation matrix generated under model $A$ given $σ$ . If there are no anomalous quartets under $A$ , an optimal solution to MQSS given D is a consistent estimator of $u (σ)$ under model $A$ , with tree error defined as the number of false negative branches.

Proof

Let B be an unrooted, binary tree on the same label set as $u (σ)$ . The number of false negative branches between B and $u (σ)$ is zero if B is a refinement of $u (σ)$ , meaning that B can be obtained from $u (σ)$ in a sequence of refinement operations (this sequence has length zero if $σ$ is binary). Thus, to prove consistency with tree error defined as the number of false negative branches, we revise Definition 1 to say that for any $ϵ > 0$ , there exists a constant $K > 0$ such that when D contains at least K mutations, an optimal solution to MQSS given D is a refinement of $u (σ)$ with probability at least $1 - ϵ$ . The remainder of the proof follows from Lemma 7. $□$
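The error measure used in this proof can be illustrated with a short sketch. It relies on the standard fact that B is a refinement of $u (σ)$ exactly when every internal branch (bipartition) of $u (σ)$ also appears in B, so the number of false negative branches is the number of model bipartitions missing from the estimate. The tree encoding (one side of each internal branch, written as a string of single-letter leaf names) and the function names are our assumptions.

```python
def bipartition(side, leaves):
    """Represent an internal branch by the unordered split of the leaves it induces."""
    side = frozenset(side)
    return frozenset([side, frozenset(leaves) - side])


def false_negative_branches(model_sides, estimate_sides, leaves):
    """Count internal branches of the model tree that are absent from the estimate."""
    model = {bipartition(s, leaves) for s in model_sides}
    estimate = {bipartition(s, leaves) for s in estimate_sides}
    return len(model - estimate)


leaves = "abcde"
model = ["ab"]              # non-binary model tree with one internal branch: ab | cde
refinement = ["ab", "de"]   # binary tree that refines the model tree
other = ["ac", "de"]        # binary tree that does not refine it

print(false_negative_branches(model, refinement, leaves))  # refinement of the model
print(false_negative_branches(model, other, leaves))       # misses branch ab | cde
```

The first tree has zero false negative branches even though it contains a branch ( $d, e | a, b, c$ ) that is not in the model tree; that extra branch would count as a false positive, which this error measure ignores.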

Lemma 7

Suppose the conditions of Lemma 6 hold. Let $L (σ)$ be the label set of $σ$ , and let B and T be unrooted, binary trees on $L (σ)$ . Suppose that B is a refinement of $u (σ)$ and that T is NOT. Then, for any pair B and T and for any $ϵ > 0$ , there exists a constant $K > 0$ such that when D contains at least K mutations, $Q S_{D} (B) > Q S_{D} (T)$ with probability at least $1 - ϵ$ .

Proof

To begin, we restate the inequality as

\begin{matrix} Q S_{D} (B) - Q S_{D} (T) = \sum_{S \in X} w_{D} (B |_{S}) - \sum_{S \in X} w_{D} (T |_{S}) > 0 \end{matrix}

where $X$ is the set of all possible subsets of four elements from $L (σ)$ and $w_{D} (q)$ is the number of mutations in D that imply quartet q.

Claim 1: First, we claim that as $k \to \infty$ , $w_{D} / k$ converges to its expectation $F^{*}$ under model $A$ given $σ$ with probability 1. Claim 1 holds by the strong law of large numbers, as noted in the proofs of consistency for quartet and triplet-based methods under the $MSC$ model (see [20] for an example). Let $q_{1}^{S}$ , $q_{2}^{S}$ , $q_{3}^{S}$ denote the three possible quartets on S. Then, we can restate claim 1 as follows. As $k \to \infty$ , $w_{D} (q_{i}^{S}) / k$ converges to $F^{*} (q_{i}^{S}) = P_{A} (q_{i}^{S} | σ)$ with probability 1 for all $i \in \{1, 2, 3\}$ and for all $S \in X$ .

Claim 2: Second, we claim there exists a $δ$ such that whenever $‖ w_{D} / k - F^{*} ‖_{\infty} < δ$ , Eq. 12 holds. We show claim 2 is true for $δ = π / (2 | X |)$ , where

\begin{matrix} π = \min_{> 0, \, S \in X} \{ | F^{*} (q_{1}^{S}) - F^{*} (q_{2}^{S}) |, | F^{*} (q_{1}^{S}) - F^{*} (q_{3}^{S}) |, | F^{*} (q_{2}^{S}) - F^{*} (q_{3}^{S}) | \} . \end{matrix}

Here the minimum is taken over the strictly positive values only, so $π$ is the smallest non-zero gap between the expected frequencies of two quartets on the same four leaves. There are three cases to consider.

Case 1: Let $W \subset X$ include all $S \in X$ such that ${B |}_{S}$ and ${T |}_{S}$ are the same quartet. Then, $w_{D} (B |_{S}) = w_{D} (T |_{S})$ for all $S \in W$ , giving us

\begin{matrix} \sum_{S \in W} (\frac{w_{D} (B |_{S})}{k} - \frac{w_{D} (T |_{S})}{k}) = 0 . \end{matrix}

Now suppose that B and T restricted to S display different quartets, denoted $q_{1}^{S}$ and $q_{2}^{S}$ , respectively. If $q_{2}^{S} \in Q (u (σ))$ , it would also be in $Q (B)$ because $Q (u (σ)) \subseteq Q (B)$ , contradicting the fact that B restricted to S displays $q_{1}^{S} \neq q_{2}^{S}$ . Therefore, $q_{2}^{S} \notin Q (u (σ))$ . This gives us two additional cases to consider.

Case 2: Let $Y \subset X$ include all $S \in X$ such that ${B |}_{S}$ and ${T |}_{S}$ are different quartets, denoted $q_{1}^{S}$ and $q_{2}^{S}$ , respectively, with $q_{1}^{S} \in Q (u (σ))$ and $q_{2}^{S} \notin Q (u (σ))$ . Then, for all $S \in Y$ , $F^{*} (q_{1}^{S}) > F^{*} (q_{2}^{S})$ and $F^{*} (q_{1}^{S}) > F^{*} (q_{3}^{S})$ because there are no anomalous quartets under $A$ (Definition 2). Therefore, whenever $‖ w_{D} / k - F^{*} ‖_{\infty} < π / 2$ , $w_{D} (B |_{S}) > w_{D} (T |_{S})$ for all $S \in Y$ , giving us

\begin{matrix} \sum_{S \in Y} (\frac{w_{D} (B |_{S})}{k} - \frac{w_{D} (T |_{S})}{k}) > 0 . \end{matrix}
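For completeness, the step from the sup-norm bound to the strict inequality in Case 2 can be written out as follows. This is a sketch that uses only the definition of $π$ and the fact that, for $S \in Y$ , the difference $F^{*} (q_{1}^{S}) - F^{*} (q_{2}^{S})$ is positive and hence at least $π$ . When every entry of $w_{D} / k$ is within $π / 2$ of the corresponding entry of $F^{*}$ , we have

```latex
\frac{w_{D} (q_{1}^{S})}{k} - \frac{w_{D} (q_{2}^{S})}{k}
  > \left( F^{*} (q_{1}^{S}) - \frac{π}{2} \right) - \left( F^{*} (q_{2}^{S}) + \frac{π}{2} \right)
  = F^{*} (q_{1}^{S}) - F^{*} (q_{2}^{S}) - π
  \geq 0 .
```

Summing this strict inequality over all $S \in Y$ gives the displayed bound.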

Case 3 (only needed for non-binary $σ$ ): Let $Z \subset X$ include all $S \in X$ such that ${B |}_{S}$ and ${T |}_{S}$ are different quartets, denoted $q_{1}^{S}$ and $q_{2}^{S}$ , respectively, with $q_{1}^{S}, q_{2}^{S} \notin Q (u (σ))$ . By our assumptions on B, ${σ |}_{S}$ is a star. Then, because there are no anomalous quartets under $A$ (Definition 2), $F^{*} (q_{1}^{S}) = F^{*} (q_{2}^{S}) = F^{*} (q_{3}^{S})$ . However, even as $k \to \infty$ , we are not guaranteed to get an exact equality $w_{D} (q_{1}^{S}) = w_{D} (q_{2}^{S}) = w_{D} (q_{3}^{S})$ . Thus, we need to put an upper bound $δ$ on $‖ w_{D} / k - F^{*} ‖_{\infty}$ so that Eq. 12 holds even when $w_{D} (B |_{S}) < w_{D} (T |_{S})$ for all $S \in Z$ . This happens for $δ = π / (2 | X |)$ . When $‖ w_{D} / k - F^{*} ‖_{\infty} < π / (2 | X |)$ , we have

\begin{matrix} \sum_{S \in Z} | \frac{w_{D} (B |_{S})}{k} - \frac{w_{D} (T |_{S})}{k} | < | Z | \cdot \frac{π}{| X |} < | Y | \cdot (π - \frac{π}{| X |}) < \sum_{S \in Y} (\frac{w_{D} (B |_{S})}{k} - \frac{w_{D} (T |_{S})}{k}) \end{matrix}

because $(| Y | + | Z |) / | X | \leq 1$ and $1 \leq | Y |$ . If not, either T is a refinement of $u (σ)$ or $u (σ)$ has no internal branches, contradicting our assumptions.
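The comparisons in the chain above can be unpacked as follows. This is a sketch under the assumption that $‖ w_{D} / k - F^{*} ‖_{\infty} < δ$ with $δ = π / (2 | X |)$ ; the intermediate steps are ours. Note that the middle comparison is obtained here with $\leq$ rather than $<$ , which suffices because the comparison with the sum over $Y$ is strict whenever $| Y | \geq 1$ .

```latex
\begin{aligned}
S \in Z: \quad & \left| \frac{w_{D} (q_{1}^{S})}{k} - \frac{w_{D} (q_{2}^{S})}{k} \right|
  \leq \left| \frac{w_{D} (q_{1}^{S})}{k} - F^{*} (q_{1}^{S}) \right|
     + \left| F^{*} (q_{2}^{S}) - \frac{w_{D} (q_{2}^{S})}{k} \right|
  < 2 δ = \frac{π}{| X |}
  \quad \text{since } F^{*} (q_{1}^{S}) = F^{*} (q_{2}^{S}) , \\
S \in Y: \quad & \frac{w_{D} (q_{1}^{S})}{k} - \frac{w_{D} (q_{2}^{S})}{k}
  > F^{*} (q_{1}^{S}) - F^{*} (q_{2}^{S}) - 2 δ
  \geq π - \frac{π}{| X |} , \\
\text{middle step}: \quad & | Z | \leq | X | - | Y | \leq | X | - 1 \leq | Y | \, (| X | - 1)
  \;\Longrightarrow\;
  | Z | \cdot \frac{π}{| X |} \leq | Y | \cdot \left( π - \frac{π}{| X |} \right) .
\end{aligned}
```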

Putting the cases together: If $u (σ)$ is binary, then $W | Y$ is a partition of $X$ , so we can combine Eqs. 14 and 15 to get Eq. 12. If $u (σ)$ is non-binary, then $W | Y | Z$ is a partition of $X$ , so we can combine Eqs. 14 and 16 to get Eq. 12.

Wrap up. By claim 1, for any $ϵ > 0$ , there exists a constant $K > 0$ such that when D contains at least K mutations, $‖ w_{D} / k - F^{*} ‖_{\infty} < π / (2 | X |)$ with probability at least $1 - ϵ$ . Then, by claim 2, for any $ϵ > 0$ , when D contains at least K mutations, $Q S_{D} (B) > Q S_{D} (T)$ with probability at least $1 - ϵ$ . $□$

Corollary 3

Suppose the conditions of Lemma 6 hold. If there are no anomalous quartets under model $A$ and every complete mutation pattern occurs with non-zero probability under $A$ , then ASTRAL given D is statistically consistent under $A$ , with tree error defined as the number of false negative branches.

Proof

ASTRAL solves MQSS exactly within a constrained version of the solution space, denoted $Σ$ . Its algorithm has two main steps: first, form $Σ$ so that it contains all bipartitions induced by the input “trees” (i.e., mutations in D), and second, find a solution B to MQSS such that $Bip (B) \subseteq Σ$ . Because all complete mutation patterns have non-zero probability under the model $A$ , for any $ϵ_{1} > 0$ , there exists a constant $K_{1} > 0$ such that when $k \geq K_{1}$ , $Σ$ will contain all bipartitions induced by at least one refinement of $u (σ)$ with probability at least $1 - ϵ_{1}$ . By Lemma 7, for any $ϵ_{2} > 0$ , there exists a constant $K_{2} > 0$ such that when $k \geq K_{2}$ , with probability at least $1 - ϵ_{2}$ , $Q S_{D} (B) > Q S_{D} (T)$ for any pair B and T of unrooted binary trees on $L (σ)$ such that B is a refinement of $u (σ)$ and T is NOT. Now let $ϵ > 0$ and select $ϵ_{1}, ϵ_{2} > 0$ such that $ϵ_{1} + ϵ_{2} < ϵ$ . Then, when D contains at least $\max \{K_{1}, K_{2}\}$ mutations, ASTRAL given D returns a refinement of $u (σ)$ with probability at least $(1 - ϵ_{1}) (1 - ϵ_{2}) > (1 - ϵ)$ . It follows that the number of false negative branches is zero with probability at least $1 - ϵ$ . $□$
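The containment condition in this proof can be sketched in a few lines. The code below builds a simplified stand-in for $Σ$ from the complete columns of a hypothetical mutation matrix and tests whether every internal branch of a candidate tree lies in it. This is only an illustration of the condition $Bip (B) \subseteq Σ$ : ASTRAL's actual construction of the constraint set is more elaborate (for example, it also handles incomplete input trees), and the data, tree encoding, and function names are our assumptions.

```python
def bipartition(side, cells):
    """Represent a branch by the unordered split of the cells it induces."""
    side = frozenset(side)
    return frozenset([side, frozenset(cells) - side])


def constraint_set(columns, cells):
    """Collect non-trivial bipartitions induced by complete mutation columns."""
    sigma = set()
    for column in columns:
        if any(column.get(c) not in (0, 1) for c in cells):
            continue  # simplification: skip columns with missing entries
        present = [c for c in cells if column[c] == 1]
        if 2 <= len(present) <= len(cells) - 2:
            sigma.add(bipartition(present, cells))
    return sigma


def within_search_space(tree_sides, sigma, cells):
    """Check whether every internal branch of the tree is in the constraint set."""
    return all(bipartition(side, cells) in sigma for side in tree_sides)


cells = "abcde"
columns = [
    {"a": 1, "b": 1, "c": 0, "d": 0, "e": 0},
    {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1},
    {"a": 1, "b": 0, "c": 1, "d": 0, "e": "?"},  # incomplete, skipped here
]
sigma = constraint_set(columns, cells)
print(within_search_space(["ab", "de"], sigma, cells))
print(within_search_space(["ac", "de"], sigma, cells))
```

With few mutations, the stand-in for $Σ$ contains few bipartitions, so a refinement of the model tree may fall outside the constrained search space; this is the situation that the requirement on complete mutation patterns rules out in the limit.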

Lastly, we note that related results can be derived for triplets by viewing mutations as rooted trees with at most one internal branch (Additional file 1: Fig. S2).

Theorem 4

Suppose that the conditions of Theorem 3 hold but that $α = 0$ (instead of $α + β \neq 1$ ). Then, an optimal solution to the Maximum Triplet Support Supertree (MTSS) problem given D is a consistent estimator of $σ$ under the $IS + UEM$ model, with tree error defined as the number of false negative branches.

This result follows from Theorem 2 and Lemma 6 after replacing “quartet” with “triplet” and “unrooted” with “rooted”. Code for transforming ordered 2-state characters into rooted trees (as shown in Additional file 1: Fig. S2) is available as part of Dollo-CDP [32]; see the -k option on GitHub: https://github.com/molloy-lab/Dollo-CDP. These “rooted trees” can be given as input to quartet-based methods, like ASTRAL or ASTER [18], which will effectively ignore the root (note that any “tree” on fewer than four cells must be removed prior to running ASTRAL).
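As an illustration of this kind of transformation, the sketch below encodes a single mutation as a Newick string in which the cells carrying the mutation form a clade, drops cells with missing values, and returns nothing when fewer than four cells remain. It is not the Dollo-CDP code; the function name, the input encoding (a dict from cell name to 1, 0, or any other value for missing), and the choice to emit a star tree for uninformative mutations are our assumptions, and cell names are assumed to be valid Newick labels.

```python
def column_to_newick(column):
    """Encode one mutation as a Newick tree with at most one internal branch.

    Returns None when fewer than four cells have an observed state, since
    such a "tree" contains no quartet.
    """
    present = sorted(c for c, state in column.items() if state == 1)
    absent = sorted(c for c, state in column.items() if state == 0)
    if len(present) + len(absent) < 4:
        return None
    if len(present) >= 2 and absent:
        # Cells with the mutation form a clade below the root.
        return "((" + ",".join(present) + ")," + ",".join(absent) + ");"
    # Zero, one, or all cells mutated: no internal branch, so emit a star.
    return "(" + ",".join(sorted(present + absent)) + ");"


print(column_to_newick({"a": 1, "b": 1, "c": 0, "d": 0, "e": "?"}))
print(column_to_newick({"a": 1, "b": "?", "c": "?", "d": 0}))
print(column_to_newick({"a": 1, "b": 0, "c": 0, "d": 0}))
```

Filtering out the `None` results before writing the remaining strings to a file, one tree per line, corresponds to removing “trees” on fewer than four cells.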

Discussion

Quartet-based approaches have garnered much success for estimating species phylogenies under the Multi-Species Coalescent [12–14]. Here, we considered their application for estimating cell lineage trees, focusing on two differences between estimating cell lineage trees compared to species trees. First, errors and missing values can arise from single-cell sequencing and thus are typically modeled. Second, the model cell lineage tree may be highly unresolved because tumors evolve clonally. To address these issues, we first show that there are no anomalous quartets under the infinite sites ( $IS$ ) plus unbiased error and missingness ( $UEM$ ) model, which is widely used in tumor phylogenetics (this is an identifiability result). We then show that under the $IS + UEM$ model, an optimal solution to the Maximum Quartet Support Supertree (MQSS) problem is a refinement of the model cell lineage tree (this is a consistency result when tree error is defined as the number of false negative branches). Lastly, we consider the case of triplets, showing that there can be anomalous triplets when the probability of false positive errors is greater than zero. Our result suggests that quartets may be more robust to error than triplets when reconstructing cell lineage trees.

Overall, our results suggest the potential of quartet-based methods for reconstructing trees from noisy mutation data, provided that the tree can be rooted and that false positive branches in the output tree can be effectively handled. The former is often doable because the tree can be rooted on the edge incident to the healthy cell with no mutations. The latter is related to mapping mutations onto branches in the cell lineage tree (see [33]) as well as identifying which cells are members of the same clone or subclone (see [34]). These tasks are also relevant to likelihood-based methods designed for cell lineage tree reconstruction, as such methods also return binary trees. Examples include ScisTree [23], SiFit [35], and CellPhy [33] (note that of these methods ScisTree makes the $IS$ assumption but the other two do not).

In general, likelihood-based methods require explicitly estimating numeric parameters, like $α$ and $β$ , as well as exploring the space of cell lineage trees, which grows exponentially in the number of cells. In contrast, quartet-based approaches allow for error implicitly (without explicit estimation of $α$ and $β$ ) and are often based on algorithmic techniques, like divide-and-conquer, that are quite fast in practice. That being said, quartet-based methods have been designed for species phylogenetics, where the number of leaves (species) is typically much less than the number of gene trees or characters. In tumor phylogenetics, the number of leaves (cells) can be much greater than the number of characters. This will likely have consequences for runtime and accuracy (just consider that our consistency guarantee holds in the limit of infinite mutations). Corollary 3 sheds light on a potential issue when using ASTRAL, namely that the construction of the constrained solution space $Σ$ may not be very successful for mutation data, especially if the number of mutations is small. However, there are other high-quality heuristics for MQSS, including wQMC [36–38], wQFM [16, 39], and TREE-QMC [17]. Moreover, even when the number of characters is small compared to the number of cells, the underlying model tree is likely to be highly unresolved. In this case, sampling different cells around the same branch may be a means of providing more data for estimation (this observation has already been leveraged by Kizilkale et al. [24]).
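To give a sense of scale for the tree space mentioned above, the sketch below uses the standard count of $(2 n - 5)!!$ unrooted binary trees on $n$ labeled leaves and compares it with the number of four-cell subsets, $\binom{n}{4}$ , which is the number of quartets a tree on $n$ cells displays. The function name is ours, and the comparison is only meant as an illustration.

```python
from math import comb


def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled leaves: (2n - 5)!!."""
    count = 1
    for odd in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= odd
    return count


for n in (4, 5, 10, 20):
    print(n, comb(n, 4), num_unrooted_binary_trees(n))
```

The number of four-cell subsets grows polynomially in $n$ , whereas the number of trees quickly becomes too large to enumerate, which is one reason both likelihood-based and quartet-based methods rely on heuristic search.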

Although quartet-based methods, as presented here, may be robust to noise, they fail to address doublets and copy number aberrations (CNAs), which also challenge cell lineage tree reconstruction. A doublet is a sequencing artifact where data provided for a single cell is really a mixture of two cells. This “hybrid” cell challenges the notion of tree-like evolution, motivating the development of methods for correcting doublets [40]. If doublets can be effectively corrected, then their impact on quartet-based methods would be minimal. Alternatively, quartets may be useful for detecting doublets. CNAs include duplications and losses of large sections of chromosomes (see review on methods for detecting CNAs by Mallory et al. [41]). CNA losses, in particular, have motivated the development of many new methods for reconstructing tumor phylogenies [42–44]. Some of these methods view CNA losses as false negatives (although these false negatives will be biased towards particular cells and mutations). In contrast, SCARLET [44] reconstructs a CNA tree and then uses it to constrain phylogeny reconstruction with the mutation data. Constraints have also been leveraged in species phylogenetics, including with ASTRAL [45]. Thus, the output of quartet-based methods could similarly be forced to obey the constraints of a CNA tree. To summarize, there are practical limitations to quartet-based methods for tumor phylogenetics, several of which apply to existing methods that do not handle CNAs and doublets, for example.

Looking beyond cell lineage tree reconstruction, our results generalize beyond the $IS$ model to any model of 2-state character evolution for which there are no anomalous quartets or triplets. A consequence of our study is that quartet-based methods, like ASTRAL, are consistent under the $IS + nWF$ model, even when unbiased errors and missing values are introduced. This statement follows from combining Corollaries 1 and 3 with Theorem 1 in [48]. Thus, our work addresses an open question from [48] about the utility of such methods on imperfect data and gives a positive result for recent systematic studies leveraging quartet-based methods on retroelement insertion presence/absence patterns for placental mammals [51], bats [52], and birds [53]. Missing values, in particular, are prevalent in these data sets (e.g., the data set from [54] has 18% missing/ambiguous values). Future work should investigate this issue further, looking at error and missingness biased towards particular species or (orthologous) positions of the genome (related questions would also be of interest in the context of cell lineage tree estimation). Similarly, while unbiased errors (false positives and false negatives) may be appropriate for modeling sequencing error, it may not be appropriate in this other setting if error is biased towards particular species or genes when calling retroelement insertions. Lastly, species trees are typically assumed to be binary; however, there could be hard polytomies, in which case the model tree would be non-binary. Our results for consistency with error defined as the number of false negatives (Lemma 6 and Corollary 3) extend to the other models, like $MSC$ , suggesting the utility of quartet-based methods in the case of hard polytomies.

Supplementary Information

Additional file 1. Proofs of Lemmas 1, 3, and 5, supplemental figures S1–S2, and supplemental tables S1–S16.

Acknowledgements

This work was conducted in preparation for a talk given at the 2023 U.S. National Cancer Institute’s Spring School in Algorithmic Cancer Biology; EKM thanks the organizers for the invitation, the funders of the event, and the audience for their questions and engagement. YH and EKM thank Dr. Michael Nute for feedback on a preliminary version of our manuscript that greatly improved our paper, especially the clarity of our notation. YH and EKM also thank the PC chairs of WABI, Dr. Aïda Ouangraoua and Dr. Djamal Belazzougui, as well as the anonymous reviewers for constructive feedback.

Abbreviations

CNA: Copy number aberration
IS: Infinite sites
MQSS: Maximum quartet support supertree
MSC: Multi-species coalescent
MTSS: Maximum triplet support supertree
nWF: Neutral Wright-Fisher
UEM: Unbiased error and missingness

Author contributions

EKM and YH proved the theorems and wrote the paper. EKM conceptualized and supervised the project.

Funding

This work was financially supported by the State of Maryland.

Data availability

Not applicable.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1.Lim B, Lin Y, Navin N. Advancing cancer research and medicine with single-cell genomics. Cancer Cell. 2020;37(4):456–470. doi: 10.1016/j.ccell.2020.03.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Jahn K, Kuipers J, Beerenwinkel N. Tree inference for single-cell data. Genome Biol. 2016;17:86. doi: 10.1186/s13059-016-0936-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Schwartz R, Schäffer AA. The evolution of tumour phylogenetics: principles and practice. Nat Rev Genet. 2017;18(4):213–229. doi: 10.1038/nrg.2016.170. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Wilkinson M, Cotton JA, Creevey C, Eulenstein O, Harris SR, Lapointe F-J, Levasseur C, Mcinerney JO, Pisani D, Thorley JL. The shape of supertrees to come: tree shape related properties of fourteen supertree methods. Syst Biol. 2005;54(3):419–431. doi: 10.1080/10635150590949832. [DOI] [PubMed] [Google Scholar]
5.Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol. 1988;5(5):568–583. doi: 10.1093/oxfordjournals.molbev.a040517. [DOI] [PubMed] [Google Scholar]
6.Rannala B, Yang Z. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics. 2003;164(4):1645–1656. doi: 10.1093/genetics/164.4.1645. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Allman ES, Degnan JH, Rhodes JA. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J Math Biol. 2011;62(6):833–862. doi: 10.1007/s00285-010-0355-7. [DOI] [PubMed] [Google Scholar]
8.Degnan JH. Anomalous unrooted gene trees. Syst Biol. 2013;62(4):574–590. doi: 10.1093/sysbio/syt023. [DOI] [PubMed] [Google Scholar]
9.Kubatko LS, Degnan JH. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol. 2007;56(1):17–24. doi: 10.1080/10635150601146041. [DOI] [PubMed] [Google Scholar]
10.Roch S, Steel M. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol. 2015;100:56–62. doi: 10.1016/j.tpb.2014.12.005. [DOI] [PubMed] [Google Scholar]
11.Larget BR, Kotha SK, Dewey CN, Ané C. BUCKy: gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics. 2010;26(22):2910–2911. doi: 10.1093/bioinformatics/btq539. [DOI] [PubMed] [Google Scholar]
12.Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics. 2014;30(17):541–548. doi: 10.1093/bioinformatics/btu462. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Mirarab S, Warnow T. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015;31(12):44–52. doi: 10.1093/bioinformatics/btv234. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Zhang C, Rabiee M, Sayyari E, Mirarab S. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinf. 2018;19(6):153. doi: 10.1186/s12859-018-2129-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Dibaeinia P, Tabe-Bordbar S, Warnow T. FASTRAL: improving scalability of phylogenomic analysis. Bioinformatics. 2021;37(16):2317–2324. doi: 10.1093/bioinformatics/btab093. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Mahbub M, Wahab Z, Reaz R, Rahman MS, Bayzid MS. wQFM: highly accurate genome-scale species tree estimation from weighted quartets. Bioinformatics. 2021;37(21):3734–3743. doi: 10.1093/bioinformatics/btab428. [DOI] [PubMed] [Google Scholar]
17.Han Y, Molloy EK. Improving quartet graph construction for scalable and accurate species tree estimation from gene trees. Genome Res. 2023 doi: 10.1101/gr.277629.122. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Zhang C, Mirarab S. Weighting by gene tree uncertainty improves accuracy of quartet-based species trees. Mol Biol Evol. 2022;39(12):215. doi: 10.1093/molbev/msac215. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Degnan JH, Rosenberg NA. Discordance of species trees with their most likely gene trees. PLoS Genet. 2006;2(5):1–7. doi: 10.1371/journal.pgen.0020068. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Liu L, Yu L, Edwards SV. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol. 2010;10:302. doi: 10.1186/1471-2148-10-302. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Islam M, Sarker K, Das T, Reaz R, Bayzid MS. STELAR: a statistically consistent coalescent-based species tree estimation method by maximizing triplet consistency. BMC Genomics. 2020;21(1):136. doi: 10.1186/s12864-020-6519-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Ross EM, Markowetz F. OncoNEM: inferring tumor evolution from single-cell sequencing data. Genome Biol. 2016;17(1):69. doi: 10.1186/s13059-016-0929-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Wu Y. Accurate and efficient cell lineage tree inference from noisy single cell data: the maximum likelihood perfect phylogeny approach. Bioinformatics. 2019;36(3):742–750. doi: 10.1093/bioinformatics/btz676. [DOI] [PubMed] [Google Scholar]
24.Kizilkale C, Mehrabadi FR, Azer ES, Pérez-Guijarro E, Marie KL, Lee MP, Day C-P, Merlino G, Ergün F, Buluç A, Sahinalp SC, Malikić S. Fast intratumor heterogeneity inference from single-cell sequencing data. Nat Comput Sci. 2022;2:577–583. doi: 10.1038/s43588-022-00298-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Fisher RA. On the dominance ratio. Proc R Soc Edinb. 1923;42:321–341. doi: 10.1017/S0370164600023993. [DOI] [Google Scholar]
26.Wright S. Evolution in mendelian populations. Genetics. 1931;16(2):97–159. doi: 10.1093/genetics/16.2.97. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Warnow T. Computational phylogenetics: an introduction to designing methods for phylogeny estimation. Cambridge: Cambridge University Press; 2017. [Google Scholar]
28.Jiang T, Kearney P, Li M. A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application. SIAM J Comput. 2001;30(6):1942–1961. doi: 10.1137/S0097539799361683. [DOI] [Google Scholar]
29.Lafond M, Scornavacca C. On the weighted quartet consensus problem. Theor Comput Sci. 2019;769:1–17. doi: 10.1016/j.tcs.2018.10.005. [DOI] [Google Scholar]
30.Nute M, Chou J, Molloy EK, Warnow T. The performance of coalescent-based species tree estimation methods under models of missing data. BMC Genomics. 2018;19(Suppl 5):286. doi: 10.1186/s12864-018-4619-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Springer MS, Molloy EK, Sloan DB, Simmons MP, Gatesy J. ILS-aware analysis of low-homoplasy retroelement insertions: inference of species trees and introgression using quartets. J Hered. 2019;111(2):147–168. doi: 10.1093/jhered/esz076. [DOI] [PubMed] [Google Scholar]
32.Dai J, Rubel T, Han Y, Molloy EK. Leveraging Constraints plus dynamic programming for the large dollo parsimony problem. In: Belazzougui D, Ouangraoua A, editors. 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), vol. 273. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany. 2023. pp. 5–1523. 10.4230/LIPIcs.WABI.2023.5
33.Kozlov A, Alves JM, Stamatakis A, Posada D. Cell Phy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data. Genome Biol. 2022;23:37. doi: 10.1186/s13059-021-02583-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Zafar H, Navin N, Chen K, Nakhleh L. SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data. Genome Res. 2019;29(11):1847–1859. doi: 10.1101/gr.243121.118. [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Zafar H, Tzen A, Navin N, Chen K, Nakhleh L. SiFit: inferring tumor trees from single-cell sequencing data under finite-sites models. Genome Biol. 2017;18:178. doi: 10.1186/s13059-017-1311-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Snir S, Rao S. Quartets MaxCut: a divide and conquer quartets algorithm. IEEE/ACM Trans on Comput Biol Bioinf. 2010;7(4):704–718. doi: 10.1109/TCBB.2008.133. [DOI] [PubMed] [Google Scholar]
37.Snir S, Rao S. Quartet MaxCut: a fast algorithm for amalgamating quartet trees. Mol Phylogenet Evol. 2012;62(1):1–8. doi: 10.1016/j.ympev.2011.06.021. [DOI] [PubMed] [Google Scholar]
38.Avni E, Cohen R, Snir S. Weighted quartets phylogenetics. Syst Biol. 2014;64(2):233–242. doi: 10.1093/sysbio/syu087. [DOI] [PubMed] [Google Scholar]
39.Reaz R, Bayzid MS, Rahman MS. Accurate phylogenetic tree reconstruction from quartets: a heuristic approach. PLoS ONE. 2014;9(8):1–13. doi: 10.1371/journal.pone.0104008. [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Weber LL, Sashittal P, El-Kebir M. doubletD: detecting doublets in single-cell DNA sequencing data. Bioinformatics. 2021;37(Suppl–1):214–221. doi: 10.1093/bioinformatics/btab266. [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Mallory XF, Edrisi M, Navin N, Nakhleh L. Methods for copy number aberration detection from single-cell DNA-sequencing data. Genome Biol. 2020;21:208. doi: 10.1186/s13059-020-02119-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.El-Kebir M. SPhyR: tumor phylogeny estimation from single-cell sequencing data under loss and error. Bioinformatics. 2018;34(17):671–679. doi: 10.1093/bioinformatics/bty589. [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Malikic S, Mehrabadi FR, Ciccolella S, Rahman MK, Ricketts C, Haghshenas E, Seidman D, Hach F, Hajirasouliha I, Sahinalp SC. PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data. Genome Res. 2019;29(11):1860–1877. doi: 10.1101/gr.234435.118. [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Satas G, Zaccaria S, Mon G, Raphael BJ. SCARLET: single-cell tumor phylogeny inference with copy-number constrained mutation losses. Cell Syst. 2020;10(4):323–3328. doi: 10.1016/j.cels.2020.04.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
45.Rabiee M, Mirarab S. Forcing external constraints on tree inference using astral. BMC Genomics. 2020;21(Suppl 2):218. doi: 10.1186/s12864-020-6607-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18(2):337–338. doi: 10.1093/bioinformatics/18.2.337. [DOI] [PubMed] [Google Scholar]
47.Kuritzin A, Kischka T, Schmitz J, Churakov G. Incomplete lineage sorting and hybridization statistics for large-scale retroposon insertion data. PLOS Comput Biol. 2016;12(3):1–20. doi: 10.1371/journal.pcbi.1004812. [DOI] [PMC free article] [PubMed] [Google Scholar]
48.Molloy EK, Gatesy J, Springer MS. Theoretical and practical considerations when using retroelement insertions to estimate species trees in the anomaly zone. Syst Biol. 2021;71(3):721–740. doi: 10.1093/sysbio/syab086. [DOI] [PubMed] [Google Scholar]
49.Mendes FK, Hahn MW. Why concatenation fails near the anomaly zone. Syst Biol. 2017;67(1):158–169. doi: 10.1093/sysbio/syx063. [DOI] [PubMed] [Google Scholar]
50.Springer MS, Gatesy J. The gene tree delusion. Mol Phylogenet Evol. 2016;94:1–33. doi: 10.1016/j.ympev.2015.07.018. [DOI] [PubMed] [Google Scholar]
51. Doronina L, Hughes GM, Moreno-Santillan D, Lawless C, Lonergan T, Ryan L, Jebb D, Kirilenko BM, Korstian JM, Dávalos LM, Vernes SC, Myers EW, Teeling EC, Hiller M, Jermiin LS, Schmitz J, Springer MS, Ray DA. Contradictory phylogenetic signals in the Laurasiatheria anomaly zone. Genes. 2022;13(5):766. doi: 10.3390/genes13050766.
52. Korstian J, Paulat N, Platt R, II, Stevens R, Ray D. SINE-based phylogenomics reveal extensive introgression and incomplete lineage sorting in Myotis. Genes. 2022;13(3):399. doi: 10.3390/genes13030399.
53. Gatesy J, Springer MS. Phylogenomic coalescent analyses of avian retroelements infer zero-length branches at the base of Neoaves, emergent support for controversial clades, and ancient introgressive hybridization in Afroaves. Genes. 2022;13(7):1167. doi: 10.3390/genes13071167.
54. Cloutier A, Sackton TB, Grayson P, Clamp M, Baker AJ, Edwards SV. Whole-genome analyses resolve the phylogeny of flightless birds (Palaeognathae) in the presence of an empirical anomaly zone. Syst Biol. 2019;68(6):937–955. doi: 10.1093/sysbio/syz019.

Associated Data


Supplementary Materials

13015_2023_248_MOESM1_ESM.pdf (919.1 KB, PDF)

Additional file 1. Proofs of Lemmas 1, 3, and 5, supplemental figures S1–S2, and supplemental tables S1–S16.

Data Availability Statement

Not applicable.

