Abstract
Single-cell sequencing (SCS) data have great potential in reconstructing the evolutionary history of tumors. Rapid advances in SCS technology in the past decade were followed by the design of various computational methods for inferring trees of tumor evolution. Some of the earliest methods were based on a direct search in the space of trees with the goal of finding the maximum likelihood tree. However, it can be shown that instead of searching directly in the tree space, we can perform a search in the space of binary matrices and obtain the maximum likelihood tree directly from the maximum likelihood matrix. The potential of the latter tree search strategy has recently been recognized by different research groups and several related methods were published in the past 2 years. Here we provide a review of the theoretical background of these methods and a detailed discussion, largely missing from the available publications, of the relationship between the two tree search strategies. We also discuss each of the existing methods based on the search in the space of binary matrices and summarize the best-known single-cell DNA sequencing data sets, which can be used in the future for assessing the performance of newly developed methods on real data.
Keywords: tumor evolution, single-cell DNA sequencing, combinatorial optimization
1. INTRODUCTION
Cancer is an evolutionary disease characterized by successive rounds of mutation and selection. Deciphering the evolutionary history of a tumor represents an important problem in cancer studies and can help us in better understanding some of the key aspects of tumor initiation, progression, metastatic spread, and many others.
Rapid advances in sequencing technologies in the past decade were followed by the development of various computational methods for studying tumor evolution (Kuipers et al., 2017a; Schwartz and Schäffer, 2017). Most of the earliest methods were designed for traditional bulk sequencing data, which are obtained by sequencing a mixture of DNA originating from a large number of cells. Since bulk data do not contain information about the cell of origin of a particular read, studying tumor evolution with this data type is very challenging. For example, in many cases, it is impossible to unambiguously differentiate between several topologically very distinct trees of tumor evolution that describe a given bulk data set equally well. We refer to Kuipers et al. (2017a) and Malikic (2019) for a more detailed discussion of the limitations of the use of bulk data in this context.
The second main group of methods for inferring tumor evolutionary history consists of methods designed for single-cell sequencing (SCS) data. Due to its high potential in studying tumor evolution, the advent of SCS has attracted a lot of attention in the field of tumor phylogenetics (see Table 1 in the Appendix for a list of the best-known existing single-cell data sets and publications introducing computational methods in which these data sets were reanalyzed). In contrast to the bulk data, the high-resolution data generated by sequencing individual tumor cells enable researchers to infer tumor evolutionary history at unprecedented detail. However, due to the elevated noise rates present in a typical SCS data set, inferring trees of tumor evolution from SCS data is not straightforward and it necessitated the development of automated computational methods. In the past few years, we have been witnessing very high interest in the design of such methods and many novel and sophisticated methods were proposed. They are based on the use of various techniques, including heuristics, Markov Chain Monte Carlo search, integer linear programming (ILP), constraint satisfaction programming (CSP), and others.
Table 1.
Summary of the Existing Single-Cell DNA Sequencing Data Sets That Were Reanalyzed in One or Several Publications That Introduce Methods for Studying Tumor Evolution from Single-Cell Sequencing Data
| Reference to original study | Cancer type | Number of cells | Number of mutations | Publications where data set was reanalyzed |
|---|---|---|---|---|
| Hou et al. (2012) | Myeloproliferative neoplasm | 58 | 712 | Kim and Simon (2014); Jahn et al. (2016); Ross and Markowetz (2016); Kuipers et al. (2017b); Ciccolella et al. (2018) |
| Xu et al. (2012) | Kidney | 17 | 50 | Jahn et al. (2016); Kuipers et al. (2017b) |
| Li et al. (2012) | Bladder | 44 | 443 | Ross and Markowetz (2016) |
| Wang et al. (2014), ER+ | Breast | 47 | 40 | Jahn et al. (2016); Kuipers et al. (2017b); Ciccolella et al. (2018) |
| Wang et al. (2014), TNBC | Breast | 16 | 519 | Singer et al. (2018); Malikic et al. (2019a); Ramazzotti et al. (2019) |
| Gawad et al. (2014), Patient 1 | Leukemia | 111 | 20 | Kuipers et al. (2017b); Malikic et al. (2019a) |
| Gawad et al. (2014), Patient 2 | Leukemia | 115 | 16 | Kuipers et al. (2017b); Malikic et al. (2019a); Malikic et al. (2019b); Weber and El-Kebir (2020) |
| Gawad et al. (2014), Patient 3 | Leukemia | 150 | 49 | Kuipers et al. (2017b); Singer et al. (2018); Wu (2019) |
| Gawad et al. (2014), Patient 4 | Leukemia | 143 | 78 | Kuipers et al. (2017b); Ciccolella et al. (2020); Ciccolella et al. (2018) |
| Gawad et al. (2014), Patient 5 | Leukemia | 96 | 105 | Kuipers et al. (2017b); Ciccolella et al. (2020) |
| Gawad et al. (2014), Patient 6 | Leukemia | 146 | 10 | Kuipers et al. (2017b); Weber and El-Kebir (2020) |
| McPherson et al. (2016), Patient 2 | Ovarian | 588 | 37 | Kuipers et al. (2017b) |
| McPherson et al. (2016), Patient 3 | Ovarian | 672 | 60 | Kuipers et al. (2017b) |
| McPherson et al. (2016), Patient 9 | Ovarian | 420 | 37 | Kuipers et al. (2017b) |
| Wu et al. (2016), Patient CRC0827 | Colorectal | 48 | 77 | Zafar et al. (2017) |
| Leung et al. (2017), Patient CRC1 (CO5) | Colorectal | 178 | 16 | Zafar et al. (2017); El-Kebir (2018); Zafar et al. (2019); Malikic et al. (2019a) |
| Leung et al. (2017), Patient CRC2 (CO8) | Colorectal | 182 | 36 | Zafar et al. (2019); Malikic et al. (2019a); Malikic et al. (2019b); Satas et al. (2020) |
Note that, due to different filtering and mutation calling criteria, the number of single cells and mutations used in some publications might differ from the numbers shown here. We have not included data from Morita et al. (2020) as currently only limited information about mutation calls from this data set is available (see Weber and El-Kebir, 2020).
In this article, we provide a review of one methodology that is common to a group of methods for tree inference from single-cell DNA sequencing data. The review can be divided into three main conceptual parts.
In the first part, which consists of Section 2, we provide a brief background on cancer evolution and SCS, followed by a simple and intuitive overview of tree inference from SCS data.
The second part, consisting of Sections 3, 4, and 5, is more formal. In Section 3, we first provide a formal description of the mutation tree of tumor evolution and the binary genotype matrix obtained from SCS data. Next, we summarize the tree scoring scheme commonly used in the existing methods for finding the maximum likelihood tree of tumor evolution from SCS data. We then provide an overview of the maximum likelihood tree search strategy first proposed in Single Cell Inference of Tumor Evolution (SCITE; Jahn et al., 2016), one of the earliest of these methods. Our main goal here is to describe how the search in the space of trees performed in SCITE can be replaced by a search in the space of conflict-free binary matrices (see below for the definition). In addition, we describe how the search in the space of matrices can be expressed as an instance of ILP and routinely solved with the available ILP solvers. The potential of tree inference by performing a search in the matrix space has already been recognized by several research groups, and in Section 4, we summarize the existing methods based on this tree search strategy. In Section 5, we present a special use case of conflict-free matrices in developing time-efficient methods for identifying the presence of divergent subclones in a given tumor.
Lastly, in the third part, consisting of Sections 6 and 7, we provide several directions for future work, followed by a short conclusion.
Note that here we focus on somatic single-nucleotide variants (SNVs) as genetic footprints of cancer evolution and any use of the term “mutation” hereafter refers to this type of genetic mutation. In addition, unless otherwise stated, we assume that the commonly used infinite sites assumption (ISA) holds. This assumption states that each mutation is acquired exactly once during the course of tumor evolution and is never lost (i.e., it is inherited by all descendants of the cell where it occurred for the first time). In Sections 4.4 and 4.5, we discuss two methods based on the search in the space of conflict-free matrices that both relax this assumption. Namely, for each mutation, both methods allow it to be lost up to k times due to one or a combination of deletion, loss of heterozygosity, or reverse mutation events, where k is a given constant. By this relaxation, these methods allow mutational losses, which cause most of the ISA violations.
2. BACKGROUND
2.1. Cancer onset and evolution
Tumor growth is an evolutionary process where somatic mutations are constantly acquired in tumor cells over time. For a given sequenced tumor and a set of somatic mutations detected in it, looking back in time, there was a point where none of the cells in the tissue of tumor origin harbored any of these mutations (Fig. 1A). Then in some of the cells, the first mutation was acquired. By means of cell division, the number of cells harboring this mutation grew. Then at some other point in time, the second of the mutations was acquired in some of the cells and the process of tumor growth and mutation acquisition continued over time. Consequently, what we typically have at the time of clinical diagnosis is a heterogeneous tumor consisting of multiple genetically distinct populations of cells. Each population can be associated with a genotype, which is a binary vector whose length equals the number of detected somatic mutations and whose i-th coordinate is set to 1 if and only if the population harbors the i-th mutation.
FIG. 1.
(A) Illustration of the process of tumor growth and mutation acquisition. (B) Mutation tree of tumor evolution for the tumor shown in (A). Each node represents a genetically distinct population of cells. In this figure, we label nodes by the set of mutations harbored by the related cellular population. Note that in (A) we only show the growth of populations of cells harboring at least one somatic mutation. Contamination of normal cells not harboring any of these somatic mutations is assumed to be (almost always) present in the sampled tumor.
The whole process of tumor evolution can be depicted by a mutation tree (Jahn et al., 2016; Fig. 1B). The nodes of the mutation tree correspond to the populations of cells present in the tumor, and edges are labeled by mutations that occur between parent and child populations. We provide a more formal definition of this type of tree in Section 3.1.
2.2. Reconstructing tumor evolutionary history from SCS data
In an SCS experiment, a number of single cells are sampled from the tumor and their DNA is sequenced. The output of the SCS experiment, after the mutation calling step, can be given in the form of a binary single-cell genotype matrix, where rows and columns correspond to cells and mutations, respectively. Given the observed genotype matrix obtained by sequencing single cells from one or multiple tumor biopsies of the same patient, in this review, we focus on reconstructing the most likely mutation tree describing the evolutionary history of the sequenced cells.
In the (ideal) case where the observed genotype matrix is noise-free, by considering single cells as extant species (taxa) and mutations as characters, we can use some of the methods for species evolution to obtain an evolutionary history of the sequenced cells. Under the ISA, this is equivalent to solving the Binary Perfect Phylogeny Problem, for which a well-known linear-time algorithm exists (Gusfield, 1991). Relaxing the ISA so as to allow at most one loss of each mutation leads to the Persistent Phylogeny Problem (Bonizzoni et al., 2012), and a polynomial-time solution for the case where minimizing the number of lost mutations is not a part of the objective was presented in Bonizzoni et al. (2016). There also exist numerous other methods that offer polynomial-time solutions for various other problems, including those where the goal is to infer whether the phylogeny has a specific topology (Benham et al., 1995; Maňuch et al., 2009). For a broader summary of the related methods, we refer to Della Vedova et al. (2017).
In practice, as a single cell does not contain a sufficient amount of DNA required as input to modern sequencing technologies (Van Loo and Voet, 2014), several DNA amplification cycles are typically performed before sequencing. While there exist multiple amplification techniques, various types of noise are introduced during the DNA amplification step. For example, in amplification based on the use of PCR, nucleotide copying errors can give a signal for the presence of a mutation at a wild-type site (false positive). Similarly, amplification failures, where one or more homologous copies of a region do not get amplified, are common. Failures in producing a sufficient number of copies of the variant allele at heterozygous mutated loci can result in false-negative mutation calls due to a reduced signal for mutation presence. While the false-positive rate of SCS data is typically below 0.001, false negatives are more common, with rates between 0.1 and 0.3 in most of the available data sets.
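The effect of the two noise types can be illustrated with a small simulation. The following is a minimal sketch, not part of any published method: the function name `add_noise` and its parameters are hypothetical, and it assumes that each matrix entry is corrupted independently, with a 0 flipped to 1 at the false-positive rate and a 1 flipped to 0 at the false-negative rate.

```python
import random
from typing import List, Optional


def add_noise(true_matrix: List[List[int]],
              fp_rate: float,
              fn_rate: float,
              seed: Optional[int] = None) -> List[List[int]]:
    """Return a noisy copy of a binary genotype matrix (rows = cells, columns = mutations).

    Assumption of this sketch: entries are corrupted independently.
    A true 0 becomes 1 with probability fp_rate (false positive);
    a true 1 becomes 0 with probability fn_rate (false negative).
    """
    rng = random.Random(seed)
    observed = []
    for row in true_matrix:
        noisy_row = []
        for value in row:
            if value == 0:
                noisy_row.append(1 if rng.random() < fp_rate else 0)
            else:
                noisy_row.append(0 if rng.random() < fn_rate else 1)
        observed.append(noisy_row)
    return observed


# Illustrative rates chosen within the ranges quoted in the text.
true_genotypes = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
observed_genotypes = add_noise(true_genotypes, fp_rate=0.001, fn_rate=0.2, seed=42)
```

With rates in these ranges, almost all corrupted entries are expected to be false negatives, which is why the scoring schemes discussed later weight the two error types differently.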
Due to the presence of false-positive and false-negative mutation calls in the observed matrix (Fig. 2A), this matrix usually does not satisfy the Four Gamete Condition (Estabrook et al., 1976; Meacham, 1983; Gusfield, 1991; Lam et al., 2009). In other words, the observed matrix usually contains a triplet of cells and a pair of columns such that in one of the three cells both mutations are reported to be present, in another only the first mutation is reported to be present, and in the remaining cell only the second mutation is reported to be present.1 For simplicity, we say that such a triplet of cells and a pair of mutations form a conflict (Hujdurović et al., 2015). One example of a conflict is highlighted in Figure 2B.
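The conflict condition described above can be checked directly. The sketch below is illustrative only (the function name `find_conflict` is hypothetical): for every pair of columns it looks for one row with both mutations, one row with only the first, and one row with only the second.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple

Conflict = Tuple[Tuple[int, int, int], Tuple[int, int]]


def find_conflict(matrix: Sequence[Sequence[int]]) -> Optional[Conflict]:
    """Return ((row_11, row_10, row_01), (p, q)) for one conflict, or None.

    row_11 has both mutations p and q, row_10 has only p, row_01 has only q.
    """
    n_cols = len(matrix[0]) if matrix else 0
    for p, q in combinations(range(n_cols), 2):
        row_11 = row_10 = row_01 = None
        for r, row in enumerate(matrix):
            pair = (row[p], row[q])
            if pair == (1, 1) and row_11 is None:
                row_11 = r
            elif pair == (1, 0) and row_10 is None:
                row_10 = r
            elif pair == (0, 1) and row_01 is None:
                row_01 = r
        if row_11 is not None and row_10 is not None and row_01 is not None:
            return (row_11, row_10, row_01), (p, q)
    return None


print(find_conflict([[1, 1], [1, 0], [0, 1]]))  # -> ((0, 1, 2), (0, 1))
print(find_conflict([[1, 1], [1, 0], [0, 0]]))  # -> None
```

The check runs in time proportional to the number of rows times the number of column pairs; it is meant to make the definition concrete rather than to be the fastest possible test.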
FIG. 2.
(A) Example of a noisy single-cell genotype matrix obtained from SCS data. In this toy example, nine single cells were sampled from the tumor shown in Figure 1. An entry at the intersection of a cell and a mutation is colored in blue if the mutation is reported to be present in the cell after the mutation calling step. Otherwise, it is colored in light gray. Some of the mutation calls are false positives or false negatives and such entries of the matrix are marked by orange or red frames. (B) A triplet of cells (rows) and a pair of mutations (columns) that form a conflict and the matrix entries at their intersection are highlighted. (C) It is not possible to place the two mutations from the conflict in the mutation tree shown in (C) such that the observed genotype of each of the three cells involved in the conflict matches the genotype of some of the tree nodes. Namely, in this tree, the green mutation is an ancestor of the blue mutation, implying that any node harboring the blue mutation also harbors the green mutation. Therefore, the genotype of the first (topmost) of the three cells from the conflict does not match the genotype of any of the tree nodes. An analogous argument applies to any tree in which the green mutation is an ancestor of the blue mutation. Under the same requirement and using similar arguments, we can rule out trees in which the blue mutation is placed as an ancestor of the green mutation (due to the third cell from the conflict), as well as trees in which the two mutations are placed on different branches so that neither is an ancestor of the other (due to the second cell from the conflict). SCS, single-cell sequencing.
The presence of conflict in the observed matrix implies that there do not exist a mutation tree and an assignment of cells to the nodes of the tree such that the observed genotypes of cells match the genotypes of nodes to which the cells are assigned (Fig. 2B, C). This prevents us from reconstructing the tree of tumor evolution directly from the observed matrix and requires the development of automated computational methods for finding the trees that best match the observed data. Many such methods were developed in the past few years (see the last column in Table 1) and several of them are discussed in more detail in this review. In the next section, using one real data set as input example, we provide a simple illustration of the two predominant tree search strategies used in most of these methods.
2.3. Finding the most likely evolutionary history of a triple-negative breast cancer patient
We focus on inferring the mutation tree of tumor evolution of triple-negative breast cancer (TNBC) patient from Wang et al. (2014). In total, 16 cancerous single cells were sequenced in the original study and the mutation calls across 22 selected sites were recently published in a matrix format in Ramazzotti et al. (2019). We show this matrix in Figure 3A. The estimated false-positive and false-negative mutation rates for this data set are and , respectively.
FIG. 3.
(A) The observed single-cell genotype matrix for a triple-negative breast cancer patient, for which sequencing data were made available in Wang et al. (2014). The mutation calls in the matrix format shown here were recently published in Ramazzotti et al. (2019). Row and column labels represent labels of single cells and mutations, respectively. All labels are sorted alphabetically. (B) Mutation tree of tumor evolution inferred by SCITE using the matrix shown in (A) as the input. Nodes of the tree are labeled arbitrarily. Mutations colored with the same (nonblack) color can be permuted in any order without change in the likelihood score of the tree. (C) ML conflict-free matrix reported by PhISCS using the matrix shown in (A) as the input. The mutation tree implied by this matrix is also the tree shown in (B). Attached to the right of the matrix, we show the node of origin of each single cell according to this solution. The node of origin of a given cell is the node that has a genotype that matches the genotype of the cell defined by the matrix shown in (C). ML, maximum likelihood; PhISCS, Phylogeny of tumors using Integrated bulk and Single-Cell Sequencing data; SCITE, Single Cell Inference of Tumor Evolution.
To obtain the maximum likelihood mutation tree, we first run SCITE (Jahn et al., 2016), one of the earliest methods for inferring the tumor evolutionary history from SCS data. To find such a tree, SCITE performs a search directly in the space of mutation trees and computes the score of a given tree by comparing observed genotypes of cells and genotypes of tree nodes. For a given cell, it is assumed that its true genotype matches the genotype of one of the tree nodes. Due to false mutation calls, for many cells it is not possible to find a node having a genotype identical to their observed genotypes. When matching a cell and a node, the absence of a mutation in the genotype of the node and its presence in the observed genotype of the cell imply a false-positive mutation call. Similarly, the presence of a mutation in the genotype of the node and its absence in the observed genotype of the cell imply a false-negative mutation call. When searching for the tree node whose genotype is the most similar to the observed genotype of the cell, the goal is to minimize the weighted number of false-positive and false-negative mutation calls, where weights depend on the probabilities of false-positive and false-negative mutation calls. In Figure 3B, we show the mutation tree for this data set that was reported by SCITE.
As a next step, we run Phylogeny of tumors using Integrated bulk and Single-Cell Sequencing data (PhISCS)2 (Malikic et al., 2019b), a method based on search in the space of binary matrices that do not contain a triplet of rows (cells) and a pair of columns (mutations) that form a conflict. For any such matrix, also named conflict-free matrix, its rows are compared with rows of the observed noisy genotype matrix and the score is computed based on the false-positive and false-negative mutation calls that it implies (i.e., based on the difference between the entries of the two matrices). Once the best scoring (i.e., maximum likelihood) conflict-free matrix is found, the maximum likelihood mutation tree can be reconstructed directly from this matrix. In Figure 3C, we show the maximum likelihood conflict-free matrix reported by PhISCS. The tree implied by this matrix is also the tree shown in Figure 3B.
As can be observed from the above, the two methods produce the same mutation tree in this example. This is expected considering that this is an illustrative example with a small input size, which enabled both PhISCS and SCITE to converge to the optimal solution, and the fact that the sets of optimum trees for the two methods are always the same. (We provide a detailed proof of this in Section 3.) However, although the solutions on this example are identical, the running time of the approaches based on the search in the space of binary matrices, compared with those that perform a search in the space of trees, can be quite different on larger input sizes; the conceptual and computational simplicity of searching in the space of binary matrices, in comparison with trees, can in some cases lead to a few orders of magnitude difference between the running times, favoring the search in the matrix space [see Sadeqi Azer et al. (2020b) for some direct running time comparisons].
In addition, methods based on the search in the space of matrices typically provide some measure of quality of the reported solution, which is either a guarantee of optimality or upper/lower bound on the theoretically best possible solution. Note that in most cases the upper/lower bounds or optimality guarantee is provided directly by solvers for ILP and/or CSP that are used in implementing these methods. Due to this implementation detail, these methods are also expected to benefit from further developments and improvements in efficiency of these solvers. Lastly, from our own experience in implementing various methods [e.g., PhISCS (Malikic et al., 2019b)], we also believe that most of the methods based on the search in the space of binary matrices are relatively simple in terms of effort required for implementation. Considering all of these, it is not surprising that this approach for inferring trees of tumor evolution has recently attracted the attention of different research groups, and we devote most of the remaining part of this review to providing a simple description of the underlying methodology and review of the related methods.
3. INFERENCE OF TUMOR EVOLUTIONARY HISTORY FROM SCS DATA
In this section, we provide a formal overview of the two predominant tree search strategies used in most of the available methods for reconstructing tumor evolutionary history from SCS data. We describe the commonly used tree scoring and show that the search in the space of mutation trees can be replaced by search in the space of conflict-free matrices. As we aimed to make this section mathematically formal and precise, there will be a few overlaps in the definitions (e.g., definition of the mutation tree) with what has been introduced and presented in the related informal descriptions provided above.
3.1. Mutation tree of tumor evolution
One of the most convenient ways of depicting the evolutionary history of a given tumor is by the use of a mutation tree (Jahn et al., 2016). A mutation tree on m mutations $M_1, M_2, \dots, M_m$ is a rooted tree that consists of $m + 1$ nodes, and each of its edges is labeled by exactly one of these m mutations. Due to the ISA, each mutation is assigned to exactly one edge, so we have a one-to-one correspondence between the tree edges and the mutations. Nodes of a given mutation tree can be labeled arbitrarily. Here, we label them with the non-negative integers $0, 1, \dots, m$, with the root labeled 0. A simple example of a mutation tree is shown in Figure 4. Note that in this figure, one should focus only on the green nodes, the solid edges, and their labels as the main constituents of the mutation tree. In the figure we also show nine single cells (yellow nodes) sampled in some SCS experiment.
FIG. 4.

An example of mutation tree (green nodes, solid edges) from which nine single cells were sampled. Each single cell is connected to its node of origin in the mutation tree via a dashed edge.
Each node i of the mutation tree can be associated with a binary genotype vector $G_i = (G_{i1}, G_{i2}, \dots, G_{im})$. We assume that the root node is mutation-free and define $G_{0p} = 0$ for all p. For any other node i, $G_{ip} = 1$ if and only if mutation $M_p$ occurs on the path from the root to node i. Intuitively, $G_0, G_1, \dots, G_m$ represent all possible genotypes present during the course of tumor evolution, where genotypes are defined on the set of m mutated loci. The first cell with genotype $G_i$ is born when some cell with genotype $G_j$ acquires mutation $M_p$, where node j is the parent of node i and $M_p$ is the label of the edge connecting j and i (note that in this sentence we assume that each daughter cell differs from its parent cell by at most one mutation, but the assumption is made solely to simplify the intuitive description of $G_i$). Note also that since we expect each of the cells to originate from one of the nodes of the tree, the true genotype of each cell is expected to match one of the node genotypes $G_0, G_1, \dots, G_m$.
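The definition of node genotypes translates into a short computation. The following sketch assumes a particular encoding that is not prescribed by the text: the tree is given as a parent array over nodes 0 to m (with the root at index 0), and `edge_mutation[i]` holds the 0-based index of the mutation labeling the edge that enters node i. Both names are hypothetical.

```python
from typing import List


def node_genotypes(parent: List[int], edge_mutation: List[int]) -> List[List[int]]:
    """Compute the genotype of every node of a mutation tree.

    parent[i]        -- parent of node i (parent[0] is unused; node 0 is the root)
    edge_mutation[i] -- 0-based mutation index on the edge (parent[i], i); entry 0 is unused
    Returns a list G with G[i][p] == 1 iff mutation p lies on the root-to-i path.
    """
    m = len(parent) - 1
    genotypes = [[0] * m for _ in range(m + 1)]
    for i in range(1, m + 1):
        node = i
        while node != 0:  # walk up to the root, collecting edge labels
            genotypes[i][edge_mutation[node]] = 1
            node = parent[node]
    return genotypes


# Root 0 has children 1 and 3; node 1 has child 2.
parent = [-1, 0, 1, 0]
edge_mutation = [-1, 0, 1, 2]
print(node_genotypes(parent, edge_mutation))
# -> [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
```

Walking from each node to the root keeps the sketch short; a top-down traversal that copies the parent genotype would avoid repeated work on deep trees.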
For a given mutation tree and two distinct mutations Mp and Mq, we define that Mp is an ancestor of Mq if and only if the edge labeled by Mp is contained within the path starting at the root of the tree and ending at the edge labeled by Mq (including this edge). If neither Mp is an ancestor of Mq nor Mq is an ancestor of Mp, then we say that Mp and Mq belong to different branches of the tree.
3.2. Binary genotype matrix obtained from SCS data
Assume that we have performed an SCS experiment where n single cells, denoted as $C_1, C_2, \dots, C_n$, were sampled from the same patient. The true genotypes of these cells can be arranged in a single-cell binary genotype matrix A with the columns of A corresponding to mutations and the rows representing genotypes of single cells.
Due to the noise present in SCS data, it is usually very unlikely to observe the matrix A directly from the raw sequencing data. Assume that, after processing the raw sequencing data and based on some mutation calling criteria, we found evidence for m putative mutations that we denote as $M_1, M_2, \dots, M_m$. Assume for simplicity that for each single cell and each mutation we make a binary decision about the presence or absence of the mutation in the cell (later we also reference more general models where, instead of making a binary choice, variant and reference read counts at the mutation loci are considered in order to get a confidence score for mutational presence). Then the output of the above steps can be represented as a binary matrix D consisting of n rows and m columns, which, respectively, correspond to the sets of sequenced single cells and putative mutations (Fig. 2A). The value of $D_{ij}$ is set to 1 if our mutation calling procedure predicted that mutation $M_j$ is present in cell $C_i$. Otherwise, we set $D_{ij} = 0$.
3.3. Scoring an arbitrary mutation tree
Given the matrix D introduced above, using the linear-time algorithm introduced in Gusfield (1991), we can determine whether there exists a mutation tree such that each of the rows of D matches the binary genotype of some tree node. The same algorithm can be used to obtain a mutation tree in cases when it exists. If it is not possible to directly reconstruct a tree from D, which is usually the case for the existing single-cell data sets, then we search for a tree that best explains D. Searching for such a tree requires some notion of tree score, where the score is assigned to a given tree under the assumption that the observed mutation calls are given by D and that the probabilities of observing false-positive and false-negative mutation calls in D are provided. Below we describe a commonly used approach for scoring an arbitrary mutation tree. Note that we focus on the trees defined on the set of mutations $M_1, M_2, \dots, M_m$, as these are the only candidate solutions in our case.
As mentioned earlier, for a given mutation tree, we expect that each of the sequenced single cells originates from one of the tree nodes. The most commonly used way of defining the score of a particular tree T is the following. (i) For a given row i in D (i.e., cell $C_i$), compute the similarity score of the row vector $D_i$ to the genotype of each of the tree nodes. (ii) Among all of the computed scores, find the largest one and set it as the score of the cell $C_i$. Per this solution, the true genotype of the cell $C_i$, denoted as $E_i$, then equals the genotype associated with the node for which the largest score is attained. (iii) Repeat the previous two steps for all rows and compute the score of T by combining the scores of individual rows.
Two important questions that remain are: (i) a definition of the similarity score between the rows of D and the genotypes of the tree nodes and (ii) an algorithm for searching the space of all candidate trees.
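To define the similarity score, first note that in the above we are associating a binary vector $D_i = (D_{i1}, D_{i2}, \dots, D_{im})$ with a binary vector $G_v = (G_{v1}, G_{v2}, \dots, G_{vm})$, where $v \in \{0, 1, \dots, m\}$. Pairs $(D_{ij}, G_{vj})$ such that $(D_{ij}, G_{vj}) = (1, 0)$ imply a false-positive mutation call in D for cell $C_i$ and mutation $M_j$. Similarly, a pair $(D_{ij}, G_{vj}) = (0, 1)$ implies a false-negative mutation call. It is well known that false-positive and false-negative noise rates in SCS data are usually very different. While false positives are typically rare (at a rate below 0.001), for most of the available data sets, the false-negative rate is estimated to be between 0.1 and 0.3. Therefore, defining the similarity score for $D_i$ and an arbitrary $G_v$ via naive counting of the number of coordinates at which these two vectors are equal would not properly discriminate between the two noise types. Instead, what is commonly adopted is the use of the probability of observing genotype $D_i$ given that the true genotype is $G_v$. Assuming that the mutated sites evolve independently and denoting the false-positive and false-negative noise rates of SCS data as $\alpha$ and $\beta$, respectively, this probability equals

$$P(D_i \mid G_v) = \prod_{j=1}^{m} P(D_{ij} \mid G_{vj}),$$

where

$$P(D_{ij} \mid G_{vj}) = \begin{cases} 1 - \alpha, & D_{ij} = 0,\ G_{vj} = 0, \\ \alpha, & D_{ij} = 1,\ G_{vj} = 0, \\ \beta, & D_{ij} = 0,\ G_{vj} = 1, \\ 1 - \beta, & D_{ij} = 1,\ G_{vj} = 1. \end{cases} \tag{1}$$

The genotype $E_i$, defined above as the most similar to $D_i$ among all node genotypes, is then given by the following formula:

$$E_i = \operatorname*{arg\,max}_{G_v,\ v \in \{0, 1, \dots, m\}} P(D_i \mid G_v),$$

and the likelihood score of the tree T, denoted as $L(T)$, is given as follows:

$$L(T) = \prod_{i=1}^{n} P(D_i \mid E_i).$$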
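The scoring scheme can be sketched in a few lines. The code below is an illustrative sketch and not the implementation of any published tool: it assumes the node genotypes are already available as a list of binary vectors, it uses hypothetical names (`log_prob_row`, `score_tree`), and it works with log-probabilities so that products over many entries do not underflow. Both noise rates are assumed to lie strictly between 0 and 1.

```python
import math
from typing import List, Sequence, Tuple


def log_prob_row(observed: Sequence[int], genotype: Sequence[int],
                 alpha: float, beta: float) -> float:
    """Log-probability of an observed row given a candidate true genotype,
    assuming independent sites, false-positive rate alpha, false-negative rate beta."""
    total = 0.0
    for d, g in zip(observed, genotype):
        if g == 0:
            total += math.log(alpha if d == 1 else 1.0 - alpha)
        else:
            total += math.log(beta if d == 0 else 1.0 - beta)
    return total


def score_tree(D: Sequence[Sequence[int]], node_genotypes: Sequence[Sequence[int]],
               alpha: float, beta: float) -> Tuple[float, List[int]]:
    """Return (log-likelihood of the tree, best node index for each cell)."""
    total = 0.0
    attachment = []
    for row in D:
        scores = [log_prob_row(row, g, alpha, beta) for g in node_genotypes]
        best = max(range(len(scores)), key=scores.__getitem__)
        attachment.append(best)
        total += scores[best]
    return total, attachment


G = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]   # node genotypes of a toy tree
D = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]   # third cell has one dropout
log_likelihood, attachment = score_tree(D, G, alpha=0.001, beta=0.2)
print(attachment)  # -> [2, 1, 2, 3]
```

In this toy input the third cell is attached to node 2: explaining its genotype by a single false negative is far more probable than explaining it by a false positive, which is exactly the asymmetry that naive mismatch counting would miss.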
3.4. Tree inference by search in the space of mutation trees
The first approach for finding the maximum likelihood tree that we summarize here is a direct search in the space of mutation trees on mutations $M_1, M_2, \dots, M_m$, denoted below as $\mathcal{T}_m$. This approach was first used in Jahn et al. (2016), but was later adopted (with some modifications and extensions) in several other methods [e.g., SiFit (Zafar et al., 2017) or simulated annealing single-cell inference (SASC) (Ciccolella et al., 2020)]. Note that the space of mutation trees on m mutations consists of $(m + 1)^{m - 1}$ distinct trees (Jahn et al., 2016), prohibiting an exhaustive search for almost all values of m encountered in practice. Instead, the approach proposed in Jahn et al. (2016) starts with a random tree from $\mathcal{T}_m$, which is set as the current tree. Given the current tree T, a new tree $T'$ is proposed by the use of several types of moves in the tree space (e.g., swapping labels of two arbitrary edges of T). Based mostly on the likelihood scores $L(T)$ and $L(T')$ of the two trees, the proposed tree is either accepted and becomes the new current tree, or is rejected. The tree proposing step is repeated for a given number of iterations and the whole procedure can be restarted for a given number of repeats, both specified by a user. In the end, the best-found tree is returned as the most likely evolutionary history of the sequenced tumor.
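The propose-and-accept loop described above can be sketched as follows. This is a simplified illustration under stated assumptions and not the SCITE implementation: it uses only two move types (swapping two edge labels, and pruning a subtree and reattaching it elsewhere), a plain Metropolis acceptance rule, a parent-array encoding of the tree, and hypothetical function names throughout.

```python
import math
import random
from typing import List, Sequence, Tuple


def genotypes(parent: List[int], label: List[int]) -> List[List[int]]:
    """Node genotypes for a tree given as a parent array; label[i] is the mutation on the edge into i."""
    m = len(parent) - 1
    G = [[0] * m for _ in range(m + 1)]
    for i in range(1, m + 1):
        v = i
        while v != 0:
            G[i][label[v]] = 1
            v = parent[v]
    return G


def log_entry(d: int, g: int, alpha: float, beta: float) -> float:
    if g == 0:
        return math.log(alpha if d == 1 else 1.0 - alpha)
    return math.log(beta if d == 0 else 1.0 - beta)


def log_score(D: Sequence[Sequence[int]], parent: List[int], label: List[int],
              alpha: float, beta: float) -> float:
    """Log-likelihood of a tree: every cell is attached to its best-matching node."""
    G = genotypes(parent, label)
    return sum(max(sum(log_entry(d, g, alpha, beta) for d, g in zip(row, node))
                   for node in G)
               for row in D)


def propose(parent: List[int], label: List[int], rng: random.Random) -> Tuple[List[int], List[int]]:
    """Return a modified copy of the tree (assumed move set: label swap or prune-and-reattach)."""
    parent, label = parent[:], label[:]
    m = len(parent) - 1
    if m >= 2 and rng.random() < 0.5:
        a, b = rng.sample(range(1, m + 1), 2)
        label[a], label[b] = label[b], label[a]
    else:
        v = rng.randrange(1, m + 1)

        def in_subtree_of_v(u: int) -> bool:
            while u != 0:
                if u == v:
                    return True
                u = parent[u]
            return False

        # Reattaching below a non-descendant keeps the structure a tree.
        parent[v] = rng.choice([u for u in range(m + 1) if not in_subtree_of_v(u)])
    return parent, label


def search(D: Sequence[Sequence[int]], alpha: float, beta: float,
           iterations: int = 2000, seed: int = 0) -> Tuple[float, List[int], List[int]]:
    """Stochastic search over mutation trees; returns (best log-score, parent, label)."""
    rng = random.Random(seed)
    m = len(D[0])
    parent = [-1] + [rng.randrange(0, i) for i in range(1, m + 1)]  # random starting tree
    label = [-1] + rng.sample(range(m), m)                           # random edge labels
    current = log_score(D, parent, label, alpha, beta)
    best = (current, parent, label)
    for _ in range(iterations):
        new_parent, new_label = propose(parent, label, rng)
        new = log_score(D, new_parent, new_label, alpha, beta)
        if new >= current or rng.random() < math.exp(new - current):
            parent, label, current = new_parent, new_label, new
            if current > best[0]:
                best = (current, parent, label)
    return best


best_score, best_parent, best_label = search(
    [[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0]], alpha=0.001, beta=0.2)
```

The sketch recomputes the full score for every proposal, which is affordable only for small inputs; it also offers no optimality guarantee, since the search can stop before reaching the best tree.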
3.5. Tree inference by search in the space of conflict-free matrices
We now present an alternative approach for finding the most likely evolutionary history of a tumor that is based on the search in the space of conflict-free matrices. A conflict-free matrix is any binary matrix X such that for each triplet of rows $(a, b, c)$ and each pair of columns p and q

$$\big((X_{ap}, X_{aq}),\ (X_{bp}, X_{bq}),\ (X_{cp}, X_{cq})\big) \neq \big((1, 1),\ (1, 0),\ (0, 1)\big).$$

If there exists a triplet of rows and a pair of columns for which equality holds in the above, then we say that they form a conflict (Hujdurović et al., 2015). We denote the space of all conflict-free matrices of size $n \times m$ as $\mathcal{C}_{n \times m}$.
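The definition can be checked directly: two columns form a conflict exactly when the row patterns (1, 1), (1, 0), and (0, 1) all occur in them. The sketch below, with hypothetical function names, tests every pair of columns in this way.

```python
from itertools import combinations

CONFLICT_PATTERNS = {(1, 1), (1, 0), (0, 1)}

def find_conflict(X):
    """Return a pair of columns (p, q) that form a conflict, or None."""
    m = len(X[0]) if X else 0
    for p, q in combinations(range(m), 2):
        seen = {(row[p], row[q]) for row in X}
        if CONFLICT_PATTERNS.issubset(seen):
            return (p, q)
    return None

def is_conflict_free(X):
    return find_conflict(X) is None

print(is_conflict_free([[1, 1], [1, 0], [0, 0]]))
print(find_conflict([[1, 1], [1, 0], [0, 1]]))
```

The check runs in time proportional to $n m^2$ for a matrix with n rows and m columns, which is sufficient for illustration but slower than the linear-time algorithm of Gusfield (1991).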
We first show that the maximum likelihood mutation tree can be obtained from the maximum likelihood conflict-free matrix, and then provide an exact ILP formulation for finding such a matrix.
3.5.1. Search for the maximum likelihood tumor evolutionary history in the space of mutation trees versus the space of conflict-free matrices
Using the same notation as above, let T be an arbitrary mutation tree and let E denote the maximum likelihood matrix associated with it. It can be easily shown that matrix E is conflict-free. Namely, if Mp and Mq are two arbitrary mutations, then either one of them is an ancestor of the other or they belong to different branches of the tree. If Mp is the ancestor of Mq then, due to the ISA, any cell harboring Mq also harbors Mp, implying that $(E_{ip}, E_{iq}) \neq (0, 1)$ for each row i. Similarly, if Mq occurs before Mp, then $(E_{ip}, E_{iq}) \neq (1, 0)$ for each row i. On the other hand, if Mp and Mq belong to different branches of the tree, then obviously $(E_{ip}, E_{iq}) \neq (1, 1)$ for each row i. Hence, regardless of the relative placement of mutations Mp and Mq in T, at least one of the pairs $(0, 1)$, $(1, 0)$, and $(1, 1)$ is not present among the values $(E_{ip}, E_{iq})$, $i = 1, \dots, n$, and so there does not exist a conflict in E involving columns p and q. As this holds for an arbitrary pair of mutations Mp and Mq, we can conclude that E is a conflict-free matrix.
The observation that E is conflict-free directly implies that

$$P(D \mid T) = P(D \mid E) \le \max_{X \in \mathcal{C}_{n \times m}} P(D \mid X),$$

where $\mathcal{C}_{n \times m}$ is the space of conflict-free matrices of size $n \times m$ and $P(D \mid E)$ is computed using the scoring scheme presented in Equation 1 and the following formula:

$$P(D \mid E) = \prod_{i=1}^{n} \prod_{j=1}^{m} P(D_{ij} \mid E_{ij}).$$

Since T is an arbitrary tree, we have

$$\max_{T \in \mathcal{T}} P(D \mid T) \le \max_{X \in \mathcal{C}_{n \times m}} P(D \mid X), \tag{2}$$

where $\mathcal{T}$ is the space of mutation trees on mutations $M_1, \dots, M_m$.

Assume now that X is a conflict-free matrix. Define that column p contains column q if $X_{ip} \ge X_{iq}$ for all rows i. Also, define that columns p and q are disjoint if there does not exist a row i such that $X_{ip} = X_{iq} = 1$. It is straightforward to prove that for each two nonzero columns p and q of a conflict-free matrix, either one of them contains the other or they are disjoint. Consider the rows of X as single cells and its columns as mutations, with mutation Mp present in cell Ci if and only if $X_{ip} = 1$. Since X is a binary matrix such that for each two of its nonzero columns one of them contains the other or they are disjoint, there exists a phylogenetic tree such that: (i) the set of mutations assigned to the edges of the tree equals the set $\{M_1, \dots, M_m\}$; (ii) each of the single cells represents exactly one leaf of the tree; and (iii) the set of mutations occurring on the path from the root of the tree to the cell Ci equals the set of mutations present in Ci (this set is given by the vector $(X_{i1}, \dots, X_{im})$). The last statement represents one of the classic results in tumor phylogenetics and its proof can be found in Gusfield (1991).

By detaching single cells from the phylogenetic tree implied by X, we obtain a mutation tree from $\mathcal{T}$. In other words, for each conflict-free matrix X, we have a corresponding mutation tree T such that each row of the matrix is a genotype of some node of the tree. Since the genotype $E_i$ assigned to cell Ci in T is at least as likely as the node genotype given by the i-th row of X, we have $P(D \mid X) \le P(D \mid T)$, which implies

$$\max_{X \in \mathcal{C}_{n \times m}} P(D \mid X) \le \max_{T \in \mathcal{T}} P(D \mid T). \tag{3}$$

From Equations (2) and (3) we conclude that

$$\max_{T \in \mathcal{T}} P(D \mid T) = \max_{X \in \mathcal{C}_{n \times m}} P(D \mid X);$$

hence, we can perform a search for the conflict-free matrix X for which $P(D \mid X)$ is maximized and then obtain tree T from X. We also have a guarantee that if we find an optimal matrix X, then the related tree will be among the maximum likelihood trees (note that there might exist multiple such trees/matrices).
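The step from a conflict-free matrix to a mutation tree can be sketched as follows. This is a simple quadratic-time illustration based on column containment, not the linear-time algorithm of Gusfield (1991). It assumes that the input is conflict-free and has no all-zero columns; mutations are indexed from 0 and the root is denoted by -1 (both conventions, as well as the function name, are hypothetical).

```python
def mutation_tree_from_matrix(X):
    """Return a dict mapping each mutation (column index) to its parent
    mutation, with -1 denoting the root. Assumes X is conflict-free and
    has no all-zero columns."""
    m = len(X[0])
    # Represent each column by the set of cells (rows) harboring the mutation.
    cells = [frozenset(i for i, row in enumerate(X) if row[j]) for j in range(m)]
    # Process mutations from the largest cell set to the smallest, so every
    # ancestor of a mutation is processed before the mutation itself.
    order = sorted(range(m), key=lambda j: (-len(cells[j]), j))
    parent = {}
    for pos, q in enumerate(order):
        parent[q] = -1
        # Scanning the processed columns backwards finds the smallest column
        # containing q, which is its direct parent.
        for p in reversed(order[:pos]):
            if cells[q].issubset(cells[p]):
                parent[q] = p
                break
    return parent

X = [[1, 1, 0],
     [1, 0, 0],
     [1, 0, 1],
     [0, 0, 0]]
print(mutation_tree_from_matrix(X))
```

In the example, the first mutation is present in every cell that harbors either of the other two, and those two are disjoint, so the tree branches below the first mutation. Identical columns are placed on a chain in index order, which is one of several equally valid choices.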
An overview (with focus on input and output) of the process of finding the maximum likelihood tree for each of the two tree search strategies discussed here is shown in Figure 5. In this figure, we also illustrate how SCS can fail to detect all mutations present in a given tumor. In addition, the presented example also shows two limitations of the use of single-cell data in tree reconstruction, both of which are primarily a consequence of sampling biases and noise. First, the tree inferred from the single-cell data does not fully match the ground truth tree even on the set of detected mutations (the relative order of mutations M3 and M5 is swapped). Second, in the inferred most likely genotype matrix shown in the right part of the upper row in Figure 5, not all false positives and false negatives present in the input are corrected (because there exists a solution with a better score compared to the solution which corrects all false positives and false negatives present in the input).
FIG. 5.
Overview (with focus on input and output) of the methods for inferring the evolutionary history of tumors by (i) performing a direct search in the space of trees (upper row), and (ii) search in the space of conflict-free matrices (lower row). For simplicity, we assume that the input is given in a form of a binary genotype matrix, although some extensions are possible as discussed in Section 4. The input matrix is shown in the left part of each of the two rows. True mutation tree and sampling of single cells are assumed to match those shown in Figure 4. Entries 1 and 0 colored in red represent false positives and false negatives, respectively. Here we assume that during the sequencing process mutation M1 was not reported in any single cell and therefore it is not a part of the input/output. The output of the direct search in the tree space is a tree (upper row, middle) from which we can directly obtain the ML genotype matrix of the sequenced cells (upper row, right). This matrix can be obtained in time that is linear in the input size by comparing genotypes of the tree nodes with the observed genotypes of cells and taking the best matches (IDs of the best matching nodes are shown in the square brackets next to cell IDs). It can be shown that the ML genotype matrix is conflict-free and, according to the scoring presented in Sections 3.3 and 3.5, it has the same score as the inferred tree. On the contrary, by performing a search directly in the space of conflict-free matrices, we obtain the best scoring matrix (lower row, middle) from which a mutation tree (lower row, right) can be reconstructed in time linear in the matrix size [e.g., by use of the algorithm presented in Gusfield (1991)]. In Section 3.5 we provide a formal proof that the two tree search strategies, when they converge, yield trees that have the same score. Note that these trees are typically identical or have only some minor differences in cases where multiple equally likely optimal solutions exist. 
In this figure, we assume that mutation trees are used for depicting and searching for the most likely evolutionary history of a given tumor, although some other tree representations (e.g., cell lineage or clonal tree) can be used as well. In addition, we also assume that the false-positive rate ($\alpha$) and the false-negative rate ($\beta$) of single-cell data used for obtaining solutions are fixed and known. NP, non-deterministic polynomial time.
3.6. Exact ILP formulation for finding maximum likelihood conflict-free matrix
In this subsection, we provide an ILP formulation to search for the conflict-free matrix X for which $P(D \mid X)$ is maximized. Note that, to better distinguish the two tree search strategies, we opt to keep using X in our notation instead of using E.
Matrix X can be defined as a set of binary variables $X_{ij}$, where $1 \le i \le n$ and $1 \le j \le m$. First, we have to ensure that variables $X_{ij}$ form a conflict-free matrix. To achieve this, we use the formulation from Gusfield et al. (2007) and introduce a set of binary variables $B_{p,q,a,b}$, defined for each pair of columns p and q and each $(a, b) \in \{(0, 1), (1, 0), (1, 1)\}$. Our aim is that these variables satisfy the following: if there exists a row i such that $X_{ip} = a$ and $X_{iq} = b$, then $B_{p,q,a,b} = 1$. To enforce that this holds, we introduce the following constraints for each row i and each pair of columns p and q:

$$-X_{ip} + X_{iq} \le B_{p,q,0,1},$$
$$X_{ip} - X_{iq} \le B_{p,q,1,0},$$
$$X_{ip} + X_{iq} - 1 \le B_{p,q,1,1}.$$

Finally, to enforce that there does not exist any conflict involving columns p and q, it is now sufficient to add the following constraint:

$$B_{p,q,0,1} + B_{p,q,1,0} + B_{p,q,1,1} \le 2.$$

It is trivial to verify that for any conflict-free matrix X, it is possible to set values of variables B so that all of these constraints are satisfied; hence, by adding the constraints, no conflict-free matrix is excluded from the set of potential solutions.
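After we have enforced that X is a conflict-free matrix, observe that the scoring scheme presented in Equation (1) can be rewritten in its equivalent form

$$P(D_{ij} \mid X_{ij}) = (1 - \alpha)^{(1 - D_{ij})(1 - X_{ij})} \, \alpha^{D_{ij}(1 - X_{ij})} \, \beta^{(1 - D_{ij}) X_{ij}} \, (1 - \beta)^{D_{ij} X_{ij}},$$

where $\alpha$ and $\beta$ denote the false-positive and false-negative rates, respectively. Next, note that

$$\log P(D \mid X) = \sum_{i=1}^{n} \sum_{j=1}^{m} \Big[ (1 - D_{ij})(1 - X_{ij}) \log(1 - \alpha) + D_{ij}(1 - X_{ij}) \log \alpha + (1 - D_{ij}) X_{ij} \log \beta + D_{ij} X_{ij} \log(1 - \beta) \Big].$$

Since D is known, the last expression is a linear combination of unknown integer variables $X_{ij}$. As maximizing $P(D \mid X)$ is equivalent to maximizing $\log P(D \mid X)$, we set our objective to maximize $\log P(D \mid X)$. Clearly, we are now left with solving an instance of ILP presented above, which can be routinely done by the use of the existing ILP solvers. Note that this formulation involves $O(nm + m^2)$ binary variables and $O(nm^2)$ linear constraints.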
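The linearity of the objective can be made concrete with a small sketch. Collecting the terms of the log-likelihood that multiply each unknown entry gives a coefficient of $\log((1-\beta)/\alpha)$ when the observed entry is 1 and $\log(\beta/(1-\alpha))$ when it is 0, plus a constant that does not depend on X. The code below computes these coefficients and, purely for illustration, replaces the ILP solver with an exhaustive search over all binary matrices, which is feasible only for tiny inputs. All function names and the noise rates are hypothetical.

```python
import math
from itertools import combinations, product

def weights(D, alpha, beta):
    """Coefficient of each unknown entry in the linear log-likelihood objective."""
    w1 = math.log((1 - beta) / alpha)   # observed entry equals 1
    w0 = math.log(beta / (1 - alpha))   # observed entry equals 0
    return [[w1 if d else w0 for d in row] for row in D]

def constant_term(D, alpha, beta):
    """Value of the log-likelihood when every unknown entry equals 0."""
    return sum(math.log(alpha) if d else math.log(1 - alpha) for row in D for d in row)

def log_likelihood(D, X, alpha, beta):
    """Direct evaluation of log P(D | X) from the four cases of Equation (1)."""
    table = {(0, 0): 1 - alpha, (1, 0): alpha, (0, 1): beta, (1, 1): 1 - beta}
    return sum(math.log(table[(d, x)])
               for d_row, x_row in zip(D, X) for d, x in zip(d_row, x_row))

def is_conflict_free(X):
    for p, q in combinations(range(len(X[0])), 2):
        seen = {(row[p], row[q]) for row in X}
        if {(1, 1), (1, 0), (0, 1)}.issubset(seen):
            return False
    return True

def best_conflict_free(D, alpha, beta):
    """Exhaustive stand-in for the ILP solver; exponential in n * m."""
    n, m = len(D), len(D[0])
    w = weights(D, alpha, beta)
    best, best_score = None, -math.inf
    for bits in product((0, 1), repeat=n * m):
        X = [list(bits[i * m:(i + 1) * m]) for i in range(n)]
        if not is_conflict_free(X):
            continue
        score = sum(w[i][j] * X[i][j] for i in range(n) for j in range(m))
        if score > best_score:
            best, best_score = X, score
    return best, best_score + constant_term(D, alpha, beta)

D = [[1, 1], [1, 0], [0, 1]]           # the three rows form a conflict
X_best, score = best_conflict_free(D, alpha=0.001, beta=0.2)
print(X_best, score)
```

With a false-negative rate much higher than the false-positive rate, the conflict in this toy input is resolved by changing a single 0 to 1 rather than by removing an observed mutation.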
4. THE EXISTING METHODS FOR TREE INFERENCE BASED ON THE SEARCH IN THE SPACE OF BINARY MATRICES
The potential of the use of search in the space of binary matrices has gained a lot of attention in the past 2 years and several methods based on this tree-inference strategy have been proposed. In this section, we summarize five such methods that infer complete trees of tumor evolution.
4.1. PhISCS (Phylogeny of tumors using Integrated bulk and Single-Cell Sequencing data)
The ILP formulation for reconstructing tumor phylogenies from single-cell data described above was first introduced in PhISCS (Malikic et al., 2019b). In addition to the ILP, an equivalent CSP formulation was introduced in the same work. This formulation enables the use of various, typically noncommercial, CSP solvers for solving this problem. Furthermore, the experimental results suggest that the CSP implementation can be a running time-efficient alternative to the ILP implementation (without trade-off between the running time and tree reconstruction accuracy).
In PhISCS, the subperfect phylogeny problem is also introduced and solutions based on the ILP, as well as CSP, are proposed. In the subperfect phylogeny problem, in order to account for the possible presence of violations of the ISA, a user-specified number of mutations are allowed to be eliminated. The eliminated mutations do not contribute to the objective score and are not included in the reported tree, whereas the set of noneliminated mutations is required to satisfy the ISA.
4.2. PhISCS-BnB (Phylogeny Inference using Single-Cell Sequencing via Branch and Bound)
With the technological improvements and decrease in the cost of sequencing, we expect that many of the future SCS data sets will consist of sequencing data of hundreds or thousands of single cells with high sequencing breadth that will enable detection of a large number of somatic SNVs. Time-efficient analysis of such data sets will require a design of more efficient computational methods that scale well to the larger inputs. Recently developed Phylogeny Inference using Single-Cell Sequencing via Branch and Bound (PhISCS-BnB) (Sadeqi Azer et al., 2020b) is one of the methods designed to fill this gap. As demonstrated on the simulated data, PhISCS-BnB finds a solution with the optimality guarantee typically 10–100 times faster than PhISCS. Within 24 hours, it can successfully find the optimal solution on most inputs of sizes considerably larger than those on which PhISCS usually converges. PhISCS-BnB was also compared against SCITE in terms of tree reconstruction accuracy. The obtained results suggest that on larger inputs, even when SCITE is given a significant running time advantage, trees reported by PhISCS-BnB are more similar to the ground truth simulated trees.
PhISCS-BnB is based on the branch and bound algorithm and its implementation does not require any commercial solver. However, in comparison with the alternative methods, it uses a simpler model in which only the presence of false-negative mutation calls in the input matrix D is considered. Consequently, the objective function presented in Section 3 can be replaced by the objective where the goal is to minimize the sum of the variables $X_{ij}$ over the pairs $(i, j)$ for which $D_{ij} = 0$ (i.e., minimize the number of false negatives). Since in most of the real applications the existence of false-positive mutation calls is currently inevitable, the robustness of the method to the presence of this type of noise was also assessed. False positives were added to the simulated data, and the reported results suggest that their presence can in some cases be challenging for convergence to the optimal solution or proving optimality. However, the presence of false positives did not have a significant impact on tree reconstruction accuracy achieved by PhISCS-BnB.
4.3. scVILP (single-cell Variant calling via Integer Linear Program)
Due to nonuniform sequencing coverage and amplification biases inherent to SCS, the observed total coverage and variant allele frequencies across sites and cells usually take a wide range of values. Depending on these values, we might adjust our confidence in mutational presence/absence. For example, observing 10 variant reads out of 22 total reads spanning a given genomic locus gives higher confidence in mutational presence than observing 2 variant reads out of 10 total reads. These differences are not well reflected in the noisy genotype matrix D, which does not contain the exact information about read counts. In single-cell Variant calling via Integer Linear Program (scVILP; Edrisi et al., 2019), the scoring scheme presented in Equation (1) is extended so that the read counts are directly factored in the likelihood. The input to this method consists of a matrix H, where $H_{ij}$ represents the numbers of reference and variant reads observed at the putative mutation locus j in cell i. Similarly, as in PhISCS, the likelihood function is a linear combination of unknown states $X_{ij}$ with the coefficients depending on the read counts and a few other given constants. The search for the optimal matrix X is then performed in the space of conflict-free matrices analogously as described in Section 3.6.
In order to improve the running time performance, the authors also propose a divide-and-conquer approach where the matrix H is first split into smaller matrices $H_1, \dots, H_K$ formed by combining subsets of columns of H. Then, for each matrix $H_t$, an instance of ILP is solved taking $H_t$ as the input. Let $X_1, \dots, X_K$ denote the optimal genotype matrices reported by these ILPs. Matrices $X_1, \dots, X_K$ are combined into a matrix $\tilde{X}$, which does not necessarily represent a conflict-free matrix. To eliminate conflicts in $\tilde{X}$, a two-step approach is taken. First, the smallest set $Q$ of columns that need to be removed from $\tilde{X}$ to obtain a conflict-free matrix is identified. In the second step, corrections of entries of $\tilde{X}$ are performed with the aim of eliminating conflicts. Importantly, the only corrections allowed are those in the columns from $Q$. This usually largely reduces the size of the resulting conflict-elimination problem since the size of $Q$ is expected to be small. After these corrections in $\tilde{X}$ are made, it is returned as a solution. This solution does not necessarily have a likelihood as high as the likelihood of the optimal matrix X, but solving all these smaller ILP instances separately typically takes less time than solving a single instance that takes the entire H as the input and returns X.
4.4. SPhyR (Single-cell Phylogeny Reconstruction)
Single-cell Phylogeny Reconstruction (SPhyR; El-Kebir, 2018) is the first of the two methods discussed in this review that explicitly models violations of ISA. Before providing more details of this method, note that two of the commonly used models of evolution in the species phylogenetics are Camin–Sokal (Camin and Sokal, 1965) and Dollo (Dollo, 1893; Rogozin et al., 2005) parsimony. The Camin–Sokal model allows each mutation (character) to be gained multiple times, but does not allow losses of mutations. On the contrary, in the Dollo parsimony model, each mutation can be gained at most once, but can be lost multiple times. While the Camin–Sokal model is suited for modeling violations of ISA due to parallel mutations, Dollo parsimony is well suited for modeling ISA violations that are due to mutational losses. In tumor evolution, due to frequently observed chromosomal instability and deletions that affect large parts of the genome, mutational losses are typically more common than parallel mutations (in addition, note that losses can also be caused by reverse mutations). It is therefore not surprising that the first methods that model violations of ISA primarily focus on modeling losses of mutations.
SPhyR is based on the k-Dollo evolutionary model, which is a restricted version of Dollo parsimony in which any SNV is gained only once but can be lost up to k times, where k is a given non-negative integer constant. This implies that an arbitrary mutation j can be either present at a given node or is in one of the up to $k + 1$ distinct states of mutational absence. The first of these states corresponds to the mutational absence before the gain of the SNV, whereas each of the remaining states corresponds to mutational absence due to SNV loss (note that distinct losses can only occur on different branches of the tree).
To encode the unknown true state of a mutation j in a cell i and differentiate between distinct causes for mutational absence, binary variables are introduced, where . The variable is set to 1 if and only if cell i originates from a node that harbors mutation j, and the variable is set to 0 if and only if cell i originates from a node that does not harbor mutation j and the absence of the mutation in this node is not due to the loss of a previously gained mutation (i.e., the node is not among the descendants of the node where mutation j occurs). For , the variable is set to 1 if and only if mutation j is absent in cell i due to -st mutational loss. Combining variables for all pairs of cells and mutations results in an unknown binary matrix A, which has n rows and . Analogous to the rows of the matrix X introduced in Section 3, rows of A (indirectly) define genotypes of nodes of origin of the related cells.
However, as mutational losses are allowed in this model, to ensure that there exists an evolutionary tree with genotypes of its nodes matching genotypes implied by rows of A, there are now in total $(k + 1)^4 + 2k^2(k + 1)^2 + k^4$ matrices of size $3 \times 2$ that form “conflicts” and are not allowed as submatrices of A (here, submatrix is any matrix consisting of elements at the intersection of a pair of columns and triplet of rows of A). Ensuring that there does not exist any such conflict can also be achieved by the set of ILP constraints and we refer to El-Kebir (2018) for full details. Note that the implementation described in El-Kebir (2018) is based on a time-efficient heuristic. Motivated by the clonal theory of cancer evolution (Nowell, 1976; Kuipers et al., 2017a), the model also allows clustering of cells and mutations, which can further improve the running time. It is assumed that all cells clustered together originate from the same node of the evolutionary tree and therefore have equal genotypes (analogous applies to the mutations).
4.5. gpps (General Parsimony Phylogeny from Single cell)
The last combinatorial method that we discuss is General Parsimony Phylogeny from Single cell (gpps; Ciccolella et al., 2018), which is also based on the use of the k-Dollo evolutionary model and thus allows violations of ISA. From the methodological point of view, gpps is based on an elegant extension of the methodology described in Section 3. To allow k mutational losses, instead of $X_{ij}$, a set of binary variables $Y_{ij}^{(0)}, Y_{ij}^{(1)}, \dots, Y_{ij}^{(k)}$ is introduced. Intuitively, $Y_{ij}^{(0)}$ represents the existence of a gain of mutation j on the path from the tree root to the node of origin of cell i, whereas $Y_{ij}^{(l)}$, for $1 \le l \le k$, represents the existence of the l-th loss of mutation j on such path. To ensure that the number of losses does not exceed the number of gains, the following constraint is added:

$$\sum_{l=1}^{k} Y_{ij}^{(l)} \le Y_{ij}^{(0)}.$$

The true status (i.e., presence or absence) of mutation j in cell i, previously denoted as $X_{ij}$, is now given by the following formula:

$$X_{ij} = Y_{ij}^{(0)} - \sum_{l=1}^{k} Y_{ij}^{(l)}.$$

In addition to the inequality constraints presented above that limit the number of mutational losses, the matrix Y with i-th row equal to

$$\big(Y_{i1}^{(0)}, Y_{i1}^{(1)}, \dots, Y_{i1}^{(k)}, \ \dots, \ Y_{im}^{(0)}, Y_{im}^{(1)}, \dots, Y_{im}^{(k)}\big)$$

must be conflict-free. This can be achieved by introducing a set of binary variables B and adding linear constraints analogously as described in Section 3. It can be shown that the above constraints are sufficient to ensure that there exists a k-Dollo phylogeny consistent with X, that is, genotypes of leaves (single cells) of the phylogeny match genotypes given by the rows of the matrix X.
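The relation between the gain and loss indicators and the resulting mutation status can be sketched as follows. The data layout (one tuple per cell and mutation, holding the gain indicator followed by k loss indicators) and the function names are assumptions made for this illustration and are not taken from the gpps implementation.

```python
def decode_status(Y):
    """Convert gain/loss indicators into a binary genotype matrix.

    Y[i][j] is a tuple (gain, loss_1, ..., loss_k) of 0/1 values for
    cell i and mutation j. The status is the gain minus the sum of losses."""
    X = []
    for row in Y:
        x_row = []
        for gain, *losses in row:
            if sum(losses) > gain:
                raise ValueError("a mutation cannot be lost more often than it is gained")
            x_row.append(gain - sum(losses))
        X.append(x_row)
    return X

def expand_rows(Y):
    """Flatten the indicators so that each row has m * (k + 1) entries;
    this expanded matrix is the one required to be conflict-free."""
    return [[v for cell in row for v in cell] for row in Y]

# Two cells, two mutations, k = 2 allowed losses per mutation.
Y = [[(1, 0, 0), (0, 0, 0)],
     [(1, 1, 0), (1, 0, 0)]]
print(decode_status(Y))
print(expand_rows(Y))
```

In the example, the second cell gained the first mutation and then lost it, so its status for that mutation is 0 even though the gain lies on its path from the root.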
As running the ILP solver until the optimal solution is found can be prohibitively slow, gpps also offers an option of running the solver for a given user-specified period of time and then running an additional algorithm based on the Hill Climbing search. The main goal of this algorithm is to find a more likely phylogeny than the possibly suboptimal one reported by the ILP solver, which is used as the starting point of the Hill Climbing search.
5. SPECIAL USE CASE OF CONFLICT-FREE MATRICES: INFERENCE OF THE EXISTENCE OF DIVERGENT SUBCLONES IN A GIVEN TUMOR
In this section, we present one special use case of conflict-free matrices. Namely, we focus on the simple classification problem where the goal is to determine whether a given tumor contains divergent subclones evolving on separate tree branches (i.e., whether the most likely mutation tree has a nonlinear topology). This problem obviously differs from the problem discussed in the previous sections as it does not require inferring complete trees of tumor evolution. The problem has recently attracted attention in the field of tumor phylogenetics and two different and very fast methods for solving it were proposed in Sadeqi Azer et al. (2020a) and Weber and El-Kebir (2020). Both of these methods extensively rely on a subclass of binary matrices, named staircase matrices (Sadeqi Azer et al., 2020a). Here, we first provide a formal definition of staircase matrices, which is equivalent to the definition provided in Sadeqi Azer et al. (2020a). Next, we establish some important and, to the best of our knowledge, previously unpublished dependencies between staircase matrices and topology of the tree of the tumor evolution. In the end of this section, we provide a discussion of some noteworthy details of the methods from Sadeqi Azer et al. (2020a) and Weber and El-Kebir (2020) and their applications.
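Definition: A binary matrix $S$ with n rows and m columns satisfies the staircase property if (i) $S_{i,j} \le S_{i+1,j}$ for all pairs of integers $(i, j)$ such that $1 \le i \le n - 1$ and $1 \le j \le m$, and (ii) $S_{i,j} \le S_{i,j+1}$ for all pairs of integers $(i, j)$ such that $1 \le i \le n$ and $1 \le j \le m - 1$. Any matrix satisfying the staircase property is also called a staircase matrix.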
Lemma 1 If T is a mutation tree with linear topology and A is a noise-free single-cell genotype matrix, where single cells originate from the nodes of T, then the rows and the columns of A can be reordered so that the resulting matrix satisfies the staircase property. On the other hand, if the rows and the columns of a binary matrix A can be reordered so that the newly obtained matrix satisfies the staircase property, then the matrix A is conflict-free and implies a mutation tree with linear topology.
Proof: For simplicity of notation, for any genotype matrix mentioned in this proof, the terms “row i”/“cell i” and “column j”/“mutation j” stand for the cell corresponding to the row i and the mutation corresponding to the column j of the matrix, respectively.
First, we prove that the rows and the columns of a noise-free single-cell genotype matrix, where single cells were sampled from a tree with linear topology, can be reordered so that the resulting matrix satisfies the staircase property. We first illustrate this on a simple example shown in Figure 6. Now, consider a general case and assume that we have sampled cells C1, C2, …, Cn from a mutation tree having a linear topology and the set of mutations $\{M_1, M_2, \dots, M_m\}$. We sort rows of the single-cell genotype matrix such that the following is satisfied in the sorted matrix: for each $i \in \{1, \dots, n - 1\}$, if rows i and $i + 1$, respectively, represent genotypes of cells Ca and Cb, then the node of origin of Ca is either the same as the node of origin of Cb or is its ancestor. In other words, we sort rows in the genotype matrix from row 1 to row n in the same order as related cells are attached to the nodes of the mutation tree when the tree is traversed from the root toward the leaf. Let S denote the matrix obtained after this sorting. Observe that for any mutation j, due to the sorting of cells and ISA, if cell $i$ harbors mutation j, then the same holds for the cell $i + 1$ (recall that cell i corresponds to i-th row of S and is therefore not necessarily equal to Ci). In other words, $S_{i,j} = 1$ implies $S_{i+1,j} = 1$, which directly implies that $S_{i,j} \le S_{i+1,j}$. Once we have established this set of inequalities, to ensure that the inequalities $S_{i,j} \le S_{i,j+1}$ also hold true, it suffices to reorder columns of S so that they are sorted (from left to right) in increasing number of ones that they contain.
FIG. 6.
Mutation tree with linear topology from which 12 single cells were sampled (left). True single-cell genotype matrix A of the sampled single cells (middle). Matrix S obtained by permuting rows and columns of A such that all the inequalities $S_{i,j} \le S_{i+1,j}$ and $S_{i,j} \le S_{i,j+1}$ from the definition of staircase matrix are satisfied (right).
In the second part of the proof, assume that a binary matrix A can be transformed, by permuting its rows and columns, into a matrix S that satisfies the staircase property. Assume also for the moment that no column in A consists entirely of zeros. We prove that A is conflict-free and that the topology of the tree implied by A is linear. Note that changing the order of rows or columns in a matrix neither introduces nor resolves conflicts, so A is conflict-free if and only if S is conflict-free. Similarly, if A is conflict-free, then the topology of the tree implied by A is linear if and only if the topology of the tree implied by S is linear. Therefore, it suffices to prove that S is conflict-free and that the topology of the tree implied by S is linear. To prove that S is conflict-free, consider an arbitrary pair of columns $(p, q)$, where $p < q$. Observe that for each cell i we have $S_{i,p} \le S_{i,q}$, which implies that $(S_{i,p}, S_{i,q}) \neq (1, 0)$. Therefore, an arbitrary pair of mutations (columns) of S cannot form a conflict, implying that this matrix is conflict-free. Now, since S is conflict-free, based on the results presented in Section 3, there exists a mutation tree of tumor evolution implied by S. Assume that the topology of this tree is not linear. Then there exist mutations p and q such that $p < q$ and p and q belong to different branches of the tree. Recall our assumption that each column of A (equivalently of S) has at least one nonzero entry. This implies that each mutation is present in at least one single cell. Consequently, there is at least one single cell, denote it as i, sampled from the subtree below the edge labeled by p. As q labels neither an edge of this subtree nor an edge on the path from the root to it, we must have $S_{i,p} = 1$ and $S_{i,q} = 0$, which is impossible since $p < q$ implies that $S_{i,p} \le S_{i,q}$. The observed contradiction completes the proof that the tree implied by S is linear.
The above proof was done under the assumption that the matrix A does not contain a column consisting entirely of zeros. If such columns exist, then we recommend filtering them from A. Namely, according to A, such mutations are absent in all single cells, and therefore, they are noninformative in tree reconstruction. Furthermore, in practice, the presence of such columns is very likely due to the false-positive mutation calls in the noisy single-cell matrix from which A was obtained. However, if we still insist on the mutation tree that contains these mutations, we can first form a linear tree on the nonzero columns of A and then extend it, starting at the leaf, by adding mutations that correspond to all-zero columns. The resulting tree is still implied by matrix A and has a linear topology. However, note that in this case some other topologies might describe the tumor evolutionary history equally well as the placement of mutations associated with all-zero columns is uncertain.
Note that from this proof it follows that the first set of inequalities, that is, $S_{i,j} \le S_{i+1,j}$, is in principle already sufficient for deciding whether a given matrix has a staircase property. In addition, using the algorithm presented in Gusfield (1991), one can easily convert a given binary matrix into a staircase matrix or prove that such conversion is not possible. This can be done in $O(nm)$ time, which is linear in the input size and is the time required by the algorithm from Gusfield (1991).
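The staircase property and the reordering argument from the proof can be sketched in a few lines. The sketch below is a simple sort-based check rather than the algorithm of Gusfield (1991): it sorts rows and then columns by the number of ones they contain, and verifies both sets of inequalities from the definition. It assumes a non-empty input matrix; the function names are hypothetical.

```python
def has_staircase_property(S):
    """Check both sets of inequalities from the definition of a staircase matrix."""
    n, m = len(S), len(S[0])
    rows_ok = all(S[i + 1][j] >= S[i][j] for i in range(n - 1) for j in range(m))
    cols_ok = all(S[i][j + 1] >= S[i][j] for i in range(n) for j in range(m - 1))
    return rows_ok and cols_ok

def to_staircase(A):
    """Reorder rows and columns of A into a staircase matrix, or return None
    if no such reordering exists."""
    # In any staircase ordering the rows are nested, so the number of ones
    # per row is nondecreasing; the same holds for the columns.
    rows = sorted(A, key=sum)
    col_order = sorted(range(len(A[0])), key=lambda j: sum(r[j] for r in rows))
    S = [[r[j] for j in col_order] for r in rows]
    return S if has_staircase_property(S) else None

A = [[1, 1, 0],
     [0, 1, 0],
     [1, 1, 1],
     [0, 0, 0]]
print(to_staircase(A))
print(to_staircase([[1, 0], [0, 1]]))
```

The first matrix corresponds to a linear tree and can be reordered into a staircase matrix, whereas the second contains two disjoint mutations, so no reordering exists and the sketch returns None.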
5.1. Deep learning approach for fast inference of the existence of divergent subclones
In Sadeqi Azer et al. (2020a), it was demonstrated that deep learning can be successfully used in studying tumor evolution. One of the methods presented in this article, which is not given any particular name, is a fast approach for discriminating between linear and nonlinear evolution. This solution is based on the use of a two-layer fully connected artificial neural network. For a given observed single-cell genotype matrix and noise rates of sequencing data, it reports the probability that the most likely mutation tree has a linear topology. Interestingly, the proposed solution achieves a speed-up of 100 times or more over the previously published alternatives, which first have to reconstruct the entire mutation tree before testing its topology. This method was tested on a TNBC patient from Wang et al. (2014), Patient 6 from a leukemia data set published in Gawad et al. (2014), and patient CRC1 from Leung et al. (2017). In all cases, the reported probabilities were highly concordant with results from the previous detailed analyses where complete tumor phylogenies of these tumors were inferred.
5.2. Phyolin
The second method that we discuss in this section is Phyolin (Weber and El-Kebir, 2020). Similar to PhISCS-BnB (see Section 4.2), this method relies on the assumption that the rate of false-positive mutation calls in the observed single-cell data is very low and it only allows $0 \to 1$ flips (i.e., corrections of false-negative mutation calls). The objective is to find a staircase matrix that differs from a given observed genotype matrix by the smallest possible number of $0 \to 1$ flips.
The authors first show that the problem of finding this matrix is non-deterministic polynomial-time-hard (NP-hard) and then express it as an instance of constraint optimization. Once the best solution is found, its implied false-negative rate is computed. This rate, denote it as $\hat{\beta}$, is then compared with the estimated false-negative rate $\beta$ of SCS technology (which is given as a part of the input). In case that $\hat{\beta} \le \beta$, the linear topology is reported as the topology of the most likely mutation tree. Otherwise, it is reported that the most likely mutation tree has a nonlinear topology (i.e., contains at least one branching event). The main motivation for making a decision using this rule is based on the following two observations: (i) staircase matrices typically contain more entries equal to 1 than conflict-free matrices corresponding to branching tree topologies (as an extreme example of this, one might consider chain and star topologies and their corresponding matrices); and (ii) if the most likely mutation tree is not linear, then trying to convert the observed matrix into a staircase matrix by the use of $0 \to 1$ flips (which is exactly what is being done in the above optimization problem) will require more flips than what would be required if it was converted to the most likely conflict-free matrix. Consequently, in such cases, the estimated false-negative rate $\hat{\beta}$ would typically be higher than the expected rate $\beta$.
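The decision rule can be sketched as follows. The sketch takes the closest staircase matrix as given, since finding it is the NP-hard part, and assumes that the implied false-negative rate is computed as the number of flipped entries divided by the number of ones in the corrected matrix. This definition, like the function names, is an assumption made for illustration and may differ from the one used in Phyolin. The corrected matrix is expected in the same row and column order as the observed one.

```python
def implied_fn_rate(D, S):
    """Fraction of ones in the corrected matrix S that are observed as zeros in D."""
    flips = ones = 0
    for d_row, s_row in zip(D, S):
        for d, s in zip(d_row, s_row):
            if d == 1 and s == 0:
                raise ValueError("only flips from 0 to 1 are allowed")
            flips += 1 if (d == 0 and s == 1) else 0
            ones += s
    return flips / ones if ones else 0.0

def call_topology(D, S, beta):
    """Report 'linear' if the implied rate does not exceed the expected rate beta."""
    return "linear" if beta >= implied_fn_rate(D, S) else "nonlinear"

D = [[1, 0], [0, 1]]       # observed matrix with two disjoint mutations
S = [[1, 1], [0, 1]]       # a closest matrix that can be reordered into a staircase
print(implied_fn_rate(D, S))
print(call_topology(D, S, beta=0.2))
```

Here one of the three ones in the corrected matrix had to be introduced by a flip, so the implied rate exceeds the assumed expected rate of 0.2 and a nonlinear topology is reported.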
The main advantage of Phyolin over the deep learning approach discussed above is that it does not require training each time when there is a significant change in the input dimensions. The method was tested on 14 patients. The first two of these are Patients 2 and 6 from Gawad et al. (2014; Table 1). Data for the remaining 12 patients were obtained from Morita et al. (2020). Since only noise-corrected genotype matrices could be obtained from Morita et al. (2020), the authors generated the final input provided to Phyolin by adding false negatives to these matrices. For this data set, the number of mutations was very low (range 3–7 per patient), but a very high number of single cells were sequenced (range 4027–9279 per patient). Phyolin showed very good performance, both in terms of the running time and accuracy. Running time per instance was less than 9 minutes, with all of its results agreeing very well with the solutions reported in the original study.
6. FUTURE WORK
As the sizes of single-cell data sets increase with technological improvements and falling sequencing costs, one of the most important directions for future work will be improving the running times of the available methods. Some work in this direction has been done in PhISCS-BnB, which in some cases achieves significant running time improvements over the available alternatives. However, this method is currently applicable mostly to data sets characterized by very low levels of false positives. A natural next step is extending this method to explicitly handle this type of noise in order to obtain more reliable tumor phylogenies in cases where a number of false-positive mutation calls are present in the input single-cell genotype matrix [see Patient CRC2 from Leung et al. (2017) for an example of such a data set]. Notably, some reductions in the running times of the existing tools are expected to come with improvements in the available ILP and CSP solvers. However, as the problems that we discussed here are NP-hard (Weber and El-Kebir, 2020; El-Kebir, 2018), running time will very likely remain an important challenge for a long time and will require further improvements in algorithmic design and implementation.
While PhISCS-BnB and Phyolin do not model false positives and as such have the simple objective of minimizing the number of false negatives, the other methods presented in this work assume that estimates of the false-positive and false-negative rates of single-cell data are given as a part of the input. While we expect that these rates will stabilize in the future as SCS technology matures, some of the existing data sets lack reliable estimates of one or both of these rates. Extending the methods to enable a search for the best combination of noise rates would make them simpler to use and more reliable in cases where the input consists of such data. One simple solution, which comes at the cost of an increased overall running time, is performing a grid search over various combinations of false-positive and false-negative rates.
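A grid search of this kind might be sketched as follows. This is an illustration under stated assumptions rather than a feature of any of the reviewed tools: `solve` is a hypothetical user-supplied wrapper around an existing method that returns a conflict-free matrix for given rates, entries are assumed to be independent with no missing values, and selecting the rate pair with the highest log-likelihood is our assumption, since the text above does not fix a selection criterion.

```python
import math
from itertools import product
from typing import Callable, Iterable, List, Tuple

Matrix = List[List[int]]  # rows = cells, columns = mutations
Solver = Callable[[Matrix, float, float], Matrix]  # hypothetical wrapper


def log_likelihood(observed: Matrix, candidate: Matrix,
                   alpha: float, beta: float) -> float:
    """Log-likelihood of `observed` given the noise-free `candidate`, with
    false-positive rate alpha and false-negative rate beta, both strictly
    between 0 and 1 (otherwise math.log may receive 0)."""
    total = 0.0
    for obs_row, cand_row in zip(observed, candidate):
        for o, e in zip(obs_row, cand_row):
            if e == 1:
                p = 1.0 - beta if o == 1 else beta    # true 1: kept or dropped
            else:
                p = alpha if o == 1 else 1.0 - alpha  # true 0: spurious or kept
            total += math.log(p)
    return total


def grid_search(observed: Matrix, solve: Solver,
                alphas: Iterable[float], betas: Iterable[float]
                ) -> Tuple[float, float, float, Matrix]:
    """Run the solver once per (alpha, beta) pair and keep the best result.
    Returns (log-likelihood, alpha, beta, matrix)."""
    best = None
    for alpha, beta in product(list(alphas), list(betas)):
        candidate = solve(observed, alpha, beta)
        score = log_likelihood(observed, candidate, alpha, beta)
        if best is None or score > best[0]:
            best = (score, alpha, beta, candidate)
    if best is None:
        raise ValueError("the grid of noise rates is empty")
    return best
```

The running time grows with the product of the two grid sizes, since each pair triggers one full run of the underlying method; this is the overhead referred to above.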
Two of the presented methods, namely SPhyR and gpps, model violations of ISA due to mutational losses and reverse mutations. However, neither of them accounts for the possible existence of parallel mutations. Although violations of ISA due to parallel mutations are expected to be less frequent than those due to mutations affected by deletions or reverse mutations, there is some evidence of parallel mutations even in data sets consisting of a small number (e.g., 18) of mutations (Kuipers et al., 2017b). PhISCS is currently the only presented method that enables detection of such mutations, but it does not provide their placement in the reported phylogenetic tree. Such placement can be achieved in a postprocessing step on the output of PhISCS or by extending some of the other methods to explicitly model parallel mutations.
Extending the existing methods by integrating additional data types and/or genetic markers can help in refining the reported evolutionary histories of tumors. PhISCS is the only method discussed in this work that allows (optional) integration of both bulk and SCS data in a joint inference scheme. The results reported in the original study (Malikic et al., 2019b) suggest that bulk data have some complementary strengths to single-cell data and that the integration of the two data types can result in improved tree reconstruction accuracy. Integration of bulk and SCS data is one interesting direction for future extensions of methods SPhyR and gpps. Integration of bulk data can also help in improving mutation calling accuracy in scVILP and is worth future research.
Note that all of the methods discussed in this review perform tree reconstruction using only SNVs. One example of their extension to additional genetic markers is the addition of information obtained from copy number evolution, as was recently done in Satas et al. (2020), and this represents another important direction for future investigation. Information from copy number evolution can impose additional constraints and provide valuable prior information to the existing methods that allow violations of ISA. As a very simple illustrative example, consider the methods gpps or SPhyR. Violation of ISA under the k-Dollo model, which is used in these methods, can occur either due to a deletion of a mutated allele or due to a reverse mutation. However, the latter is typically less likely than the former. Consequently, for a mutation M from a region not affected by any deletion or loss of heterozygosity events, both of these methods can be extended so that they assign a lower probability to losses of mutation M.
As mentioned above, the information on copy number evolution has been used in Satas et al. (2020), where a method named single-cell algorithm for reconstructing loss-supported evolution of tumors (SCARLET) was introduced. SCARLET utilizes copy number trees, which describe the evolutionary history of a set of sequenced single cells using copy number aberrations (CNAs) as genetic markers of tumor evolution. However, SCARLET is still primarily based on the use of SNVs and uses the information on copy number evolution only to constrain the set of possible SNV-based phylogenies. It also either requires that the copy number tree be given as a part of the input or performs exhaustive enumeration of all copy number trees, which is feasible only in cases where the number of distinct copy number profiles of single cells is very small.
However, the inference of a copy number tree for cases with a nontrivial number of distinct copy number profiles is a very challenging task for several reasons. For example, these events cannot be described by the simple binary presence/absence state commonly used for heterozygous SNVs. Namely, CNAs have a magnitude, given by a non-negative integer that represents the total number of copies of the region affected by the CNA. In more refined models, instead of using only the total copy number, allele-specific copy numbers are considered, and their encoding requires at least two integers per event (one for each of the two homologous copies of the affected region). Consider a simple model of copy number evolution where (i) homologous copies of genomic regions are treated separately (i.e., as distinct regions); (ii) a copy number event either increases the number of copies of the affected region from 1 to 2 (gain) or decreases it from 1 to 0 (loss); and (iii) for each genomic position at most one gain or loss overlapping with it is allowed. For this model, some of our preliminary (unpublished) results suggest that the approach presented in this work can be extended to find the most likely copy number tree by performing a search in the space of binary (not necessarily conflict-free) matrices. However, as for SNVs, ISA can also be violated for CNAs, and we speculate that, to some extent, this can be modeled by borrowing the ideas used in gpps or SPhyR.
It is important to note that the additional level of complexity encountered when working with CNAs arises from the overlapping nature of these events, where an endpoint of one CNA belongs to a genomic region affected by some other CNA. While an independence assumption is typically made in methods based on the use of SNVs, the possible dependency between distinct CNAs can make this assumption misleading in some cases, and it is currently unclear to us how this could be modeled in an approach dependent on the use of binary matrices. We refer the reader to Mallory et al. (2020) for some additional challenges encountered when working with CNAs, as well as for an in-depth review of the existing methods for detection of CNAs from single-cell DNA sequencing data and their use in studying intratumor heterogeneity and evolution.
7. CONCLUSION
In this review article, we provided a detailed theoretical background of one methodology for searching for the most likely tree of tumor evolution, which has recently attracted attention in the field of tumor phylogenetics. This methodology is based on a search in the space of binary matrices and represents an alternative to the commonly used direct search in the space of trees. We presented a formal proof that the two tree search strategies produce the same set of optimal solutions. However, a search in the space of matrices is typically simpler to implement: it can be expressed as a simple instance of ILP or CSP and solved by using some of the available CSP/ILP solvers. Many of these solvers also provide a guarantee of optimality of the reported solutions, or some metrics of their quality in cases where they do not prove optimality within a given time limit. In contrast, such guarantees are typically not available for the alternative methods that perform a search in the very large space of mutation trees.
By exploiting some important dependencies between linear tree topology and values in the matching binary matrix, two fast algorithms for detecting the existence of divergent subclones in tumors were recently introduced. These methods were also discussed above and they suggest the potential applications of exploring the space of binary matrices beyond the problem of finding the complete most likely tree of tumor evolution.
Note that the primary focus of this review is the methodology, and thus, we paid most of our attention to the key methodological details of the reviewed methods. We leave benchmarking, analysis of running times, and comparison of tree reconstruction accuracy, among other topics, for future work.
Lastly, doublets (sometimes also referred to as multiplets), which we have not discussed above, represent another source of noise present in some of the available single-cell DNA sequencing data sets. They are introduced when two or more cells get sequenced together and treated as a single cell. Notably, none of the methods presented in this review (nor many other existing methods) comes with an option for handling this type of noise. While extending the methods by explicitly modeling the possible presence of doublets is one potential solution, it would very likely come at a high cost in terms of running time. However, it is important to note that doublet rates are dropping with improvements in single-cell isolation techniques, and nowadays, they represent only a small fraction (typically <) of sequenced cells. Therefore, using some of the available methods for doublet identification [e.g., Single Cell Genotyper (Roth et al., 2016)] and filtering the identified doublets from the input during the data preparation step will likely remain a commonly used solution, as it is time-efficient, largely resolves doublet noise, and requires filtering of only a small fraction of the input data.
Appendix
Appendix: Summary of the Best-Known Single-Cell DNA Sequencing Data Sets
Here we provide a summary of the best-known existing real single-cell DNA sequencing data sets that have been used in the publications introducing new computational methods for studying tumor evolution. In Table 1, for each data set, we provide its size in terms of the number of cells and mutations, the cancer type, and the original study where the data set was first made available. In addition, we list the existing publications introducing new computational methods for studying tumor evolution in which these data sets were reanalyzed and in which, in most cases, some novel insights about the evolutionary history of the sequenced tumors were suggested. It is beyond the scope of this review to provide details of the single-cell sequencing technologies used for generating these data sets.
AUTHOR DISCLOSURE STATEMENT
The authors declare they have no competing financial interests.
FUNDING INFORMATION
This work is supported in part by the Intramural Research Program of the National Institutes of Health, National Cancer Institute. F.R.M. and M.H.E. were supported in part by Indiana University Grand Challenges Precision Health Initiative.
If we strictly follow the definition of Four Gamete Condition (Lam et al., 2009), we observe that violation of this condition also requires the existence of cells in which both mutations are absent. However, we assume that all mutations considered in this work are somatic, which implies that they are already absent from normal cells that correspond to the root of the phylogenetic tree.
While PhISCS enables integration of bulk and single-cell data in a joint inference scheme, it can also be run in a mode where input consists only of single-cell data and this is the version of PhISCS that we focus on and discuss in this article.
See also Brown et al. (2020) for additional comparisons of ILP and CSP in solving hard problems in Computational and Systems Biology.
References
- Benham, C., Kannan, S., Paterson, M., et al. 1995. Hen's teeth and whale's feet: Generalized characters and their compatibility. J. Comput. Biol. 2, 515–525.
- Bonizzoni, P., Braghin, C., Dondi, R., et al. 2012. The binary perfect phylogeny with persistent characters. Theor. Comput. Sci. 454, 51–63.
- Bonizzoni, P., Carrieri, A.P., Della Vedova, G., et al. 2016. Solving the persistent phylogeny problem in polynomial time. arXiv preprint arXiv:1611.01017.
- Brown, H., Zuo, L., and Gusfield, D. 2020. Comparing integer linear programming to SAT-solving for hard problems in computational and systems biology, 63–76. In Martin-Vide, C., Vega-Rodriguez, M.A., and Wheeler, T., eds. Algorithms for Computational Biology. AlCoB 2020. Lecture Notes in Computer Science, vol 12099. Springer International Publishing, Cham, Switzerland.
- Camin, J.H., and Sokal, R.R. 1965. A method for deducing branching sequences in phylogeny. Evolution 19, 311–326.
- Ciccolella, S., Ricketts, C., Soto Gomez, M., et al. 2020. Inferring cancer progression from Single-Cell Sequencing while allowing mutation losses. Bioinformatics 37, 326–333.
- Ciccolella, S., Soto Gomez, M., Patterson, M.D., et al. 2020. gpps: an ILP-based approach for inferring cancer progression with mutation losses from single cell data. BMC Bioinformatics 21, 413.
- Della Vedova, G., Patterson, M., Rizzi, R., et al. 2017. Character-based phylogeny construction and its application to tumor evolution, 3–13. In Kari, J., Manea, F., and Petre, I., eds. Unveiling Dynamics and Complexity. CiE 2017. Lecture Notes in Computer Science, vol 10307. Springer, Cham, Switzerland.
- Dollo, L. 1893. The laws of evolution. Bull. Soc. Bel. Geol. Paleontol. 7, 164–166.
- Edrisi, M., Zafar, H., and Nakhleh, L. 2019. A combinatorial approach for single-cell variant detection via phylogenetic inference, 22:1–22:13. In Huber, K.T., and Gusfield, D., eds. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019), volume 143 of Leibniz International Proceedings in Informatics (LIPIcs). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany.
- El-Kebir, M. 2018. SPhyR: Tumor phylogeny estimation from single-cell sequencing data under loss and error. Bioinformatics 34, i671–i679.
- Estabrook, G.F., Johnson, C.S., Jr., and McMorris, F.R. 1976. A mathematical foundation for the analysis of cladistic character compatibility. Math. Biosci. 29, 181–187.
- Gawad, C., Koh, W., and Quake, S.R. 2014. Dissecting the clonal origins of childhood acute lymphoblastic leukemia by single-cell genomics. Proc. Natl. Acad. Sci. USA 111, 17947–17952.
- Gusfield, D. 1991. Efficient algorithms for inferring evolutionary trees. Networks 21, 19–28.
- Gusfield, D., Frid, Y., and Brown, D. 2007. Integer programming formulations and computations solving phylogenetic and population genetic problems with missing or genotypic data, 51–64. In Lin, G., ed. Computing and Combinatorics. COCOON 2007. Lecture Notes in Computer Science, vol 4598. Springer, Berlin, Heidelberg.
- Hou, Y., Song, L., Zhu, P., et al. 2012. Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell 148, 873–885.
- Hujdurovic, A., Kačar, U., Milanic, M., et al. 2015. Finding a perfect phylogeny from mixed tumor samples, 80–92. In Pop, M., and Touzet, H., eds. Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science, vol 9289. Springer, Berlin, Heidelberg.
- Jahn, K., Kuipers, J., and Beerenwinkel, N. 2016. Tree inference for single-cell data. Genome Biol. 17, 86.
- Kim, K.I., and Simon, R. 2014. Using single cell sequencing data to model the evolutionary history of a tumor. BMC Bioinform. 15, 27.
- Kuipers, J., Jahn, K., and Beerenwinkel, N. 2017a. Advances in understanding tumour evolution through single-cell sequencing. Biochim. Biophys. Acta 1867, 127–138.
- Kuipers, J., Jahn, K., Raphael, B.J., et al. 2017b. Single-cell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors. Genome Res. 27, 1885–1894.
- Lam, F., Gusfield, D., and Sridhar, S. 2009. Generalizing the four gamete condition and splits equivalence theorem: Perfect phylogeny on three state characters, 206–219. In Salzberg, S.L., and Warnow, T., eds. Algorithms in Bioinformatics. WABI 2009. Lecture Notes in Computer Science, vol 5724. Springer, Berlin, Heidelberg.
- Leung, M.L., Davis, A., Gao, R., et al. 2017. Single-cell DNA sequencing reveals a late-dissemination model in metastatic colorectal cancer. Genome Res. 27, 1287–1299.
- Li, Y., Xu, X., Song, L., et al. 2012. Single-cell sequencing analysis characterizes common and cell-lineage-specific mutations in a muscle-invasive bladder cancer. GigaScience 1, 12.
- Malikic, S. 2019. Inference of tumor subclonal composition and evolution by the use of single-cell and bulk DNA sequencing data. PhD thesis, Applied Sciences: School of Computing Science. Available at: https://summit.sfu.ca/item/19469. Last viewed July 26, 2020.
- Malikic, S., Jahn, K., Kuipers, J., et al. 2019a. Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data. Nat. Commun. 10, 2750.
- Malikic, S., Mehrabadi, F.R., Ciccolella, S., et al. 2019b. PhISCS: A combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data. Genome Res. 29, 1860–1877.
- Mallory, X.F., Edrisi, M., Navin, N., et al. 2020. Methods for copy number aberration detection from single-cell DNA-sequencing data. Genome Biol. 21, 1–22.
- Manuch, J., Patterson, M., and Gupta, A. 2009. On the generalised character compatibility problem for non-branching character trees, 268–276. In Ngo, H.Q., ed. Computing and Combinatorics. COCOON 2009. Lecture Notes in Computer Science, vol 5609. Springer, Berlin, Heidelberg.
- McPherson, A., Roth, A., Laks, E., et al. 2016. Divergent modes of clonal spread and intraperitoneal mixing in high-grade serous ovarian cancer. Nat. Genet. 48, 758–767.
- Meacham, C.A. 1983. Theoretical and computational considerations of the compatibility of qualitative taxonomic characters, 304–314. In Felsenstein, J., ed. Numerical Taxonomy. NATO ASI series (Series G: Ecological Sciences), vol 1. Springer, Berlin, Heidelberg.
- Morita, K., Wang, F., Jahn, K., et al. 2020. Clonal evolution of acute myeloid leukemia revealed by high-throughput single-cell genomics. Nat. Commun. 11, 5327.
- Nowell, P. 1976. The clonal evolution of tumor cell populations. Science 194, 23–28.
- Ramazzotti, D., Graudenzi, A., De Sano, L., et al. 2019. Learning mutational graphs of individual tumour evolution from single-cell and multi-region sequencing data. BMC Bioinform. 20, 210.
- Rogozin, I.B., Wolf, Y.I., Babenko, V.N., et al. 2005. Dollo parsimony and the reconstruction of genome evolution. Parsim. Phylog. Genom. 190, 200.
- Ross, E.M., and Markowetz, F. 2016. OncoNEM: Inferring tumor evolution from single-cell sequencing data. Genome Biol. 17, 69.
- Roth, A., McPherson, A., Laks, E., et al. 2016. Clonal genotype and population structure inference from single-cell tumor sequencing. Nat. Methods 13, 573–576.
- Sadeqi Azer, E., Haghir Ebrahimabadi, M., Malikić, S., et al. 2020a. Tumor phylogeny topology inference via deep learning. iScience 23, 101655.
- Sadeqi Azer, E., Rashidi Mehrabadi, F., Malikić, S., et al. 2020b. PhISCS-BnB: A fast branch and bound algorithm for the perfect tumor phylogeny reconstruction problem. Bioinformatics 36(Suppl. 1), i169–i176.
- Satas, G., Zaccaria, S., Mon, G., et al. 2020. SCARLET: Single-cell tumor phylogeny inference with copy-number constrained mutation losses. Cell Syst. 10, 323.e8–332.e8.
- Schwartz, R., and Schäffer, A.A. 2017. The evolution of tumour phylogenetics: Principles and practice. Nat. Rev. Genet. 18, 213–229.
- Singer, J., Kuipers, J., Jahn, K., et al. 2018. Single-cell mutation identification via phylogenetic inference. Nat. Commun. 9, 5144.
- Van Loo, P., and Voet, T. 2014. Single cell analysis of cancer genomes. Curr. Opin. Genet. Dev. 24, 82–91.
- Wang, Y., Waters, J., Leung, M.L., et al. 2014. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512, 155–160.
- Weber, L.L., and El-Kebir, M. 2020. Phyolin: Identifying a linear perfect phylogeny in single-cell DNA sequencing data of tumors. In Kingsford, C., and Pisanti, N., eds. 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, Dagstuhl, Germany.
- Wu, H., Zhang, X.-Y., Hu, Z., et al. 2016. Evolution and heterogeneity of non-hereditary colorectal cancer revealed by single-cell exome sequencing. Oncogene 36, 2857–2867.
- Wu, Y. 2019. Accurate and efficient cell lineage tree inference from noisy single cell data: The maximum likelihood perfect phylogeny approach. Bioinformatics 36, 742–750.
- Xu, X., Hou, Y., Yin, X., et al. 2012. Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor. Cell 148, 886–895.
- Zafar, H., Navin, N., Chen, K., et al. 2019. SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data. Genome Res. 29, 1847–1859.
- Zafar, H., Tzen, A., Navin, N., et al. 2017. SiFit: Inferring tumor trees from single-cell sequencing data under finite-sites models. Genome Biol. 18, 178.





