Abstract
Phylogenetic networks can represent evolutionary events that cannot be described by phylogenetic trees. These networks are able to incorporate reticulate evolutionary events such as hybridization, introgression, and lateral gene transfer. Recently, network-based Markov models of DNA sequence evolution have been introduced along with model-based methods for reconstructing phylogenetic networks. For these methods to be consistent, the network parameter needs to be identifiable from data generated under the model. Here, we show that the semi-directed network parameter of a triangle-free, level-1 network model with any fixed number of reticulation vertices is generically identifiable under the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints.
Keywords: Phylogenetic networks, Identifiability, Reticulation, Markov processes
Introduction
Typically, the goal of a phylogenetic analysis is to find a tree that describes the evolutionary relationships among a set of taxa. However, because trees, as directed graphs, have acyclic skeletons, they cannot represent reticulate evolutionary events, such as hybridization, introgression, and lateral gene transfer. Recognizing this limitation, it has become increasingly common to use phylogenetic networks in order to more accurately describe the history of some sets of taxa (Bapteste et al. 2013). This increasing attention to phylogenetic networks has led to many new results about the combinatorial properties of phylogenetic networks (Huson et al. 2010; Gusfield 2014), (Steel 2016, Chapter 10), as well as to new methods for inferring phylogenetic networks from biological data.
Many of these new methods for inferring phylogenetic networks are based on constructing networks from small sets of inferred trees (Baroni et al. 2005; Huber et al. 2011; Nakhleh et al. 2005; Yang et al. 2014) or adapting variants of maximum parsimony and neighbor joining (Bryant and Moulton 2004; Jin et al. 2007). Several others are model-based methods that are designed to infer various features of a species networks from data generated by a network multispecies coalescent model. These include, for example, the methods implemented in Phylonet (Than et al. 2008; Wen et al. 2018) as well as SNaQ (Solís-Lemus and Ané 2016; Solís-Lemus et al. 2017) and NANUQ (Allman et al. 2019). Now that network-based Markov models of DNA sequence evolution have been developed (see e.g. Nakhleh 2011, §3.3), it seems natural to use these models in order to add other model-based techniques to the set of tools for network inference. However, in order to consistently infer a parameter using a model-based approach, that parameter must be identifiable from some feature of the model. The question of parameter identifiability is significant and has been explored for several different phylogenetic models. For example, there are numerous identifiability results for tree-based Markov models (Allman et al. 2011; Allman and Rhodes 2006; Chang 1996; Rhodes and Sullivant 2012) and there are similar results for networks that provide the theoretical justification for methods such as SNaQ (Solis-Lemus et al. 2020) and NANUQ (Baños 2019) mentioned above. In this work, we explore the identifiability of the network parameter in network-based Markov models.
Formally, network-based Markov models are parameterized families of probability distributions on n-tuples of DNA bases. The parameterization is derived by modeling the process of DNA sequence evolution along an n-leaf leaf-labelled topological network, which we call the network parameter of the model. Given an n-taxa sequence alignment, a probability distribution in a network-based Markov model specifies the probability of observing each of the possible site-patterns at a particular site. Indeed, in a model-based approach, an n-taxa sequence alignment is usually regarded as an observation of n independent and identically distributed site-patterns. A sequence alignment can therefore be viewed as an approximation of a probability distribution, with the probability for each site-pattern being proportional to the number of times it appears in the alignment. Given a collection, or class, of network-based Markov models, the network parameter is identifiable if any expected site pattern probability distribution p in the model belongs to at most one model in the class. Identifiability, as just defined, is very strong and certainly not satisfied for any reasonable collection of models. Thus, in practice, one often aims at proving that a parameter is generically identifiable. If the network parameter of a class of models is generically identifiable then a probability distribution p from one of the models almost surely belongs to no other model in the class.
The generic identifiability of the tree and network parameters of several phylogenetic models has been shown by adopting techniques from algebraic geometry (Allman et al. 2011; Gross and Long 2017; Hollering and Sullivant 2020; Long and Kubatko 2018). These results apply to several types of mixture models, network models, and multispecies coalescent models. Even though tree-based Markov models of sequence evolution are naturally defined on rooted trees, in many of these works, the tree parameter is assumed to be an unrooted tree. The reason for this is that given an expected site pattern probability distribution from a tree-based Markov model, the location of the root of the tree is not identifiable [see, for example, Sect. 8.5 in Semple and Steel (2003) or Chapter 15 in Sullivant (2018)]. Similarly, with network-based Markov models, even though we define the models on rooted networks, we will only be able to establish generic identifiability when the network parameter is assumed to be a semi-directed network. Semi-directed networks are unrooted versions of rooted networks, which retain information about which vertices are reticulation vertices (and which edges are reticulation edges). In Gross and Long (2017), algebraic techniques were used to show that the network parameter is generically identifiable when the underlying Markov process is subject to the Jukes–Cantor (JC) transition matrix constraints and the network parameter is assumed to be a semi-directed network with exactly one cycle of length at least four. Recently, in Hollering and Sullivant (2020), this result was extended using an algebraic matroid approach to include the Kimura 2-parameter and Kimura 3-parameter constraints (K2P, K3P).
Theorem 1
(Gross and Long 2017; Hollering and Sullivant 2020) The network parameter of a network-based Markov model under the Jukes–Cantor (Gross and Long 2017), Kimura 2-parameter (Hollering and Sullivant 2020), or Kimura 3-parameter (Hollering and Sullivant 2020) constraints is generically identifiable with respect to the class of models where the network parameter is an n-leaf semi-directed network with exactly one undirected cycle of length of at least four.
Still, these identifiability results only apply for networks with a single reticulation vertex. In this paper, we prove the following, extending the results to triangle-free, level-1 semi-directed networks, that is, triangle-free semi-directed networks where every undirected cycle contains a single reticulation vertex.
Theorem 2
The network parameter of a network-based Markov model under the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints is generically identifiable with respect to the class of models where the network parameter is an n-leaf triangle-free, level-1 semi-directed network with reticulation vertices.
To illustrate the implications of Theorem 2, suppose that p is an expected site pattern probability distribution that belongs to a Markov model on a rooted phylogenetic network N. If it is known that N is level-1 with triangle-free skeleton and r reticulation vertices, then from p, it is possible (almost surely) to determine the unrooted skeleton of N as well as which vertices (edges) are hybrid vertices (edges).
Our proof is largely combinatorial, as we are able to use the algebraic results for small networks obtained in Gross and Long (2017) and Hollering and Sullivant (2020), in addition to a few new ones, as building blocks. We begin in Sect. 2 by describing more precisely the models we consider as well as the algebraic approach to establishing generic identifiability. In Sect. 3, we prove a few novel results about the algebra of 4-leaf level-1 networks and collect the other required algebraic results. In Sect. 4, we prove several combinatorial properties of level-1 phylogenetic semi-directed networks that we will need to prove the main result. Finally, with these results in place, in Sect. 5, we prove Theorem 2.
Preliminaries
We begin this section by defining the graph theoretic terminology that we will use throughout the paper. Then, in Sect. 2.2, we introduce network-based Markov models on rooted networks, and in Sect. 2.3, we show that we can also define a network-based Markov model on a semi-directed network. Finally, we describe the connection between network-based Markov models and algebraic varieties and formally define what it means for two networks to be distinguishable and precisely what it means for the network parameter of a class of models to be generically identifiable.
Graph theory terminology
A (rooted binary) phylogenetic network N on a set of leaves is a rooted acyclic directed graph with no edges in parallel such that the root has out-degree two, each vertex with out-degree zero has in-degree one, the set of vertices with out-degree zero is , and all other vertices either have in-degree one and out-degree two, or in-degree two and out-degree one. The skeleton of a phylogenetic network is the undirected graph that is obtained from the network by removing edge directions.
A vertex is a tree vertex if it has in-degree one and out-degree two. A vertex is a reticulation vertex if it has in-degree two and out-degree one, and the edges that are directed into a reticulation vertex are called reticulation edges. Let denote the number of reticulation vertices in network . Since is binary, it can be shown that it has exactly vertices and internal vertices. A rooted phylogenetic network with no reticulation vertices is a rooted phylogenetic tree.
The level of a phylogenetic network is the maximum number of reticulation vertices in a biconnected component of the network. Of particular interest in this paper are level-1 networks, which can also be characterized as phylogenetic networks where no vertex belongs to more than one cycle in the network’s skeleton (Rossello and Valiente 2009).
More specifically, we will be concerned with a particular kind of level-one network, in which only the reticulation edges are directed.
Definition 1
A semi-directed network is a mixed graph obtained from a phylogenetic network by undirecting all non-reticulation edges, suppressing all vertices of degree two, and identifying parallel edges.
Note that deciding whether a mixed graph, a graph with some edges directed and others undirected, is a semi-directed network can be done in quadratic time in the number of edges [Corollary 4 of Huber et al. (2019)]. The unrooted skeleton of a phylogenetic network is the skeleton of its associated semi-directed network (including leaf labels).
In a semi-directed network, the reticulation vertices are the vertices of indegree two and the level is defined the same as for a rooted phylogenetic network. A triangle-free level-1 semi-directed network is a level-1 semi-directed network where every cycle in the unrooted skeleton has length greater than three. We will also refer to level-1 semi-directed networks with exactly one reticulation vertex as k -cycle networks, where k is the length of the unique cycle in the unrooted skeleton.
We finish these preliminaries with one additional bit of graph theory terminology that will be useful throughout. Let be a partition of with A, B non-empty. An edge e in a network separates A and B if every path (not necessarily directed) between any and contains e. If e separates A and B then we call e a cut-edge and we say has an split.
Network based Markov models
We begin this section by describing a model of DNA sequence evolution along an n-leaf rooted binary phylogenetic network. For the description below, we assume that the network belongs to the set of tree-child networks (Cardona et al. 2007), which contains the set of level-one networks. In a tree-child network, every internal vertex has at least one child vertex that is either a tree vertex or a leaf.
Let be an n-leaf phylogenetic network and let be the root of the network. Let be the set of (row) stochastic matrices and let be the dth dimensional probability simplex, i.e. . We associate to each node v of N a random variable with state space , corresponding to the four DNA bases. The nodes of the network, including the interior nodes, represent taxa, and the random variable is meant to indicate the DNA base at the particular site being modeled in the taxon at v.
Now, let be the distribution at the root with , and associate to each edge of a transition matrix where the rows and columns are indexed by the elements of the state space. With u a parent of v, the matrix is equal to the conditional probability . When is a rooted tree, the probability of observing a particular n-tuple at the leaves of is straightforward to compute. Letting be the vertex set of , we first consider an assignment of states to the vertices of by where is the state of . Then, under the assumption of a tree based Markov model, the probability of observing the assignment can be computed using the distribution at the root and the transition matrices. Specifically, letting be the set of edges of , this probability is equal to
The probability of observing a particular assignment of states at the leaves can be obtained by marginalization, i.e. summing over all possible assignments of states to the internal nodes. In particular, if is an assignment of states to the leaves of and is the restriction of to the entries corresponding to the leaves of , the probability of observing is then
When the rooted network contains at least one cycle in its skeleton, there is no longer a unique path between each leaf and the root, and thus reticulation edge parameters are introduced. In this case, suppose has r reticulation vertices . Since each has in-degree two, there are two edges, and , directed into . Assign a parameter to and the value to . For , independently delete , keeping , with probability , otherwise, delete and keep . Intuitively, the parameter corresponds to the probability that a particular site was inherited along edge . Encode this set of choices with a binary vector where a 0 in the ith coordinate indicates that edge was deleted. Since is assumed to be a tree-child network, after deleting the r edges, the result is a rooted n-leaf tree . Since there are four DNA bases and n leaves of the network, there are possible site-patterns, or assignment of states, that could be observed at the leaves of . The probability of observing the site-pattern is
1 |
While seemingly complicated, the above expression is a polynomial in the numerical parameters of the model: the root distribution, the entries of the transition matrices, and the reticulation edge parameters. Thus the map defined by the network
from the numerical parameter space to the probability simplex is a polynomial map. The image of the map is called the model associated to , denoted . Note the model is the set of all possible probability distributions obtained by fixing the network and varying the numerical parameters. See Fig. 1 for an example of a network with its numerical parameters.
When we place no restrictions on the entries of the transition matrices (other than that they are stochastic) the underlying substitution process is known as the general Markov model. Network-based phylogenetic models with a general Markov substitution process are studied for example in Casanellas and Fernández-Sánchez (2020). However it is quite common in phylogenetics to consider models with additional constraints, effectively reducing the dimension of the parameter space . For example, in the Kimura 3-parameter DNA substitution model, the root distribution is uniform and each transition matrix is assumed to have the following form, where the rows and columns are indexed by the DNA bases A, G, C, T,
In the Kimura 2-parameter model (K2P), and Jukes–Cantor models, additional restrictions are placed on the entries of the transition matrices ( for K2P and for JC).
In order to not overload the word “model," we will refer to these restrictions on the transition matrices as constraints. For example, we will refer to the image of under the Jukes–Cantor DNA substitution model as the model associated to under the Jukes–Cantor constraints.
We end this section on network-based Markov models by noting that there exist other natural extensions of tree-based Markov models. For example, in Francis and Moulton (2018), the authors consider a network model adapted from Thatte (2013) and are able to establish identifiability for the entire class of tree-child networks. The stronger identifiability results come at the expense of some modeling flexibility, but the difference can illustrate the possible gains that can be made by considering different processes.
Semi-directed network models
In this section, we show how to associate a model to a phylogenetic semi-directed network N for the group-based models considered in this paper. We will see that for a given set of constraints (JC, K2P, K3P), if is a phylogenetic network and N is the semi-directed network attained from as in Definition 1, then . We start by showing that the model associated to a rooted network N does not depend on the location of the root. Then, we show that the associated model does not change if we suppress degree two vertices or remove parallel edges in the network. Thus, the phylogenetic semi-directed network N contains all of the information necessary to recover .
For a tree-based phylogenetic model under the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints, we may relocate the root and suppress vertices of degree two without changing the underlying model [see, for example, Sect. 8.5 in Semple and Steel (2003) or Chapter 15 in Sullivant (2018)]. That we can relocate the root is easily observed since each of the transition matrices is symmetric and the root distribution is uniform, so that To see that we may suppress vertices of degree two without changing the model, suppose the edges e and f are incident to a vertex of degree two and that the Markov transition matrices and satisfy the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints. Then the transition matrix will satisfy the same constraints, so we may suppress the vertex of degree two and assign this transition matrix to the newly created edge to obtain the same site pattern probability distribution from the model. These results imply that the location of the root of the rooted tree parameter in a tree-based Markov model cannot be identified from an expected site-pattern in the model. Or, viewed another way, these results mean that we can associate a tree-based Markov model to an unrooted tree and consider the tree parameter in a tree-based Markov model to be an unrooted tree.
A similar result holds for the network-based Markov models considered in this paper. For a fixed choice of parameters in a network model, the associated site pattern probability distribution is the weighted sum of site-pattern probability distributions from the constituent tree models. The weights are determined by the reticulation edge parameters. Since relocating the root in each of the trees does not affect the tree models, the network model will remain the same if we relocate the root of the network and redirect the edges in any way that preserves the direction of the reticulation edges. For example, in the rooted network in Fig. 1, we could suppress the existing root vertex, subdivide the edge directed into the leaf vertex labeled by z to create a new root, and then redirect edges away from the new root in a way that preserves the directions of the reticulation edges.
If a child of the root vertex is a reticulation vertex, then unrooting and suppressing the root will may result in a pair of parallel reticulation edges in the semi-directed network. However, under the JC, K2P, and K3P constraints, we may identify any pair of parallel edges without altering the model. The reason for this is that the sets of transition matrices under each of these constraints are closed under convex sums. So if a network contains a set of parallel reticulation edges with transition matrices and , we can replace these edges with a single edge with transition matrix and obtain the same site-pattern probability distribution in the model, where is the reticulation edge parameter for the edge e.
Together, these arguments give us the following proposition.
Proposition 1
Let and be two tree-child phylogenetic networks with associated phylogenetic semi-directed networks and . Under the JC, K2P, or K3P constraints, if then .
Thus, the model associated to a rooted phylogenetic network is entirely determined by the associated phylogenetic semi-directed network. Although we note that the arguments above are specific to the JC, K2P, and K3P constraints, and similar arguments might not work for other network-based Markov models.
Proposition 1 suggests that we may regard the network parameter of a network-based Markov model as a phylogenetic semi-directed network. Given a phylogenetic semi-directed network N, we can determine the model by choosing any rooted network for which N is the associated semi-directed network and defining . Therefore, for the rest of this paper, we will assume that the network parameter of each model is an n-leaf phylogenetic semi-directed network. Indeed, this is necessary to obtain any identifiability results, as the location of the root in a rooted network is not identifiable from an expected site pattern probability distribution in the model.
Markov models as algebraic varieties
In this paper, we prove generic identifiability using tools from combinatorics and computational algebraic geometry. In order to understand within an algebraic-geometric framework, we consider the complex extension of , which we denote as .
Let be the set of all polynomials on variables with coefficients in . The ideal associated to is the set of polynomials that vanish on the image of , i.e.
The elements of are called phylogenetic invariants. Each polynomial in vanishes on , that is, each polynomial yields zero when we substitute the entries of any probability distribution . Phylogenetic invariants are the defining polynomials of the variety associated to , which we will refer to as the network variety. Specifically,
Elements of are polynomial relationships among the entries of p that hold for all distributions . If we look back at equation (1), it is reasonable to assume that such relationships may be quite complicated since each probability coordinate is parameterized by a polynomial that is the sum of terms. Because of this, we perform a linear change of coordinates on both the parameter space and the image space called the Fourier-Hadamard transform (Evans and Speed 1993; Hendy and Penny 1996). After the transform, the invariants are expressed in the ring of q-coordinates,
As an example of how the Fourier-Hadamard simplifies the resulting algebra, for a tree-based phylogenetic model, the parameterization of each q-coordinate is a monomial in the Fourier parameters and the phylogenetic tree ideal is generated by binomials. Working in the transformed coordinates is common when working with group-based models and it is what enables us to compute the required network invariants. While the details of the Fourier-Hadamard transform are outside the scope of this paper, we give here a brief description of how to parametrize a phylogenetic network model under the Jukes–Cantor, Kimura 2-parameter, and Kimura 3-parameter constraints. More details can be found in Sturmfels and Sullivant (2005) and Chapter 15 of Sullivant (2018).
First, we will describe how to determine the Fourier parametrization of a phylogenetic tree, T. As in Sturmfels and Sullivant (2005) and Sullivant (2018), we begin by identifying the four DNA bases with elements of the group as follows , , and . Under the Kimura 3-parameter constraints, there are then four Fourier parameters associated to each edge i, denoted as , , , and (after transformation, the stochastic condition on the transition matrices forces ). Letting be the site pattern , the parametrization is then given by
where is the set of edges of T and is the split induced by e in T. All addition is in the group .
Notice that this is a monomial, in which there is one parameter associated to each edge of the tree T. In order to parametrize a phylogenetic network, we take the sum of the monomials corresponding to all trees created by removing reticulation edges from the network. The monomials are weighted by the corresponding reticulation edge parameters.
Generic identifiability
A model-based approach to network inference selects the model from a set of candidate models that best fits the observed data according to some criteria and returns the network parameter of this model. In our setting, the observed data are the aligned DNA sequences of the taxa under consideration, from which we construct the observed site pattern probability distribution. In the ideal setting, if we had access to infinite noiseless data generated by a network-based Markov model, then the observed site pattern distribution would be equal to an expected site pattern distribution in the model. Inferring the correct network parameter in this case would be as simple as determining which model from a set of candidate models the site pattern probability distribution belongs to. However, even in this idealized setting, it may be that the observed site pattern distribution belongs to the models corresponding to two distinct networks, making it impossible to determine which network produced the data. Thus, a desirable theoretical property for a class of network models is that each distribution in one of the models belongs to no other model, or that the network parameter be identifiable.
Let be a set of leaf-labelled networks. More formally, the condition that the network parameter is identifiable with respect to a collection of models is equivalent to the condition that for all distinct , , meaning the two models do not intersect. Since this notion of identifiability is rather strong, the more practical notion of generic identifiability is more commonly explored.
Definition 2
Let be a class of phylogenetic network models. The network parameter is generically identifiable with respect to the class if given any two distinct n-leaf networks , the set of numerical parameters in that maps into is a set of Lebesgue measure zero.
To establish generic identifiability, we can use algebraic geometry by considering the family of irreducible algebraic varieties , where is the network variety associated to N. Generic identifiability is then closely related to the concept of distinguishability.
Definition 3
(Gross and Long 2017) Two distinct n-leaf networks and are distinguishable if is a proper subvariety of and , that is, and . Otherwise, they are indistinguishable.
Proposition 2
(Gross and Long 2017, Proposition 3.3) Let be a class of phylogenetic network models. The network parameter of a phylogenetic network model is generically identifiable with respect to if given any two distinct n-leaf networks , the networks and are distinguishable.
The condition that the network parameter be generically identifiable then amounts to showing that for all the networks and are distinguishable, or equivalently, and . Proving that this condition is satisfied can then be done either by explicit computation of the ideals associated to and (as in Gross and Long (2017)), or by arguing that certain phylogenetic invariants must exist [as in Hollering and Sullivant (2020)].
Distinguishability of 4-leaf semi-directed networks
Our aim is to prove Theorem 2, by showing that any two distinct n-leaf r-reticulation triangle-free level-1 semi-directed networks are distinguishable. In order to show this, we will require a number of results concerning 4-leaf networks which we prove in Lemma 1 below.
Up to leaf relabeling, there are six different 4-leaf level-1 semi-directed networks which are depicted in Fig. 2. In Lemma 1, we assume that and are two distinct 4-leaf semi-directed networks. We then consider all cases where and are each either a quartet tree (Q), a single triangle network (), a double-triangle network (DT), or a 4-cycle network (4C), and compare the resulting varieties. We only need to consider four possibilities for each of and , because under the JC, K2P, and K3P constraints, the variety of a triangle or double-triangle semi-directed network is determined by the unrooted skeleton of the network. This can be shown by first observing that under the JC, K2P, and K3P models, the ideals of all of the 3-leaf semi-directed triangle networks are identical. The proof then follows by applying the same toric fiber product argument that is described in the remark following Proposition 4.5 in Gross and Long (2017).
The results of Lemma 1 are summarized in Table 1 and the caption of that table contains the key to the symbols. To give a couple of examples, part (ii) of the lemma corresponds to the (2, 2) entry of the table. The symbol indicates that the networks are distinguishable, but only if and have distinct unrooted skeletons. The results of part (iii) of the lemma are represented by the entries (4, 1) and (4, 2) (when and is a 4-cycle network) and by (2, 1) (when and is a 3-cycle, or triangle network). And of course, these results are also represented by the entries (1, 4), (2, 4), and (2, 1) when the roles of and are reversed.
Table 1.
Q | DT | 4C | |||
---|---|---|---|---|---|
Q | |||||
DT | |||||
4C |
Lemma 1
Let and be distinct 4-leaf level-1 semi-directed networks. Then under the JC, K2P, or K3P constraints:
-
(i)
If and are both trees, then and are distinguishable;
-
(ii)
If and are both single-triangle networks and have different (leaf-labelled) unrooted skeletons, then and are distinguishable;
-
(iii)
If is a -cycle network with and is a tree or a -cycle network with , then ;
-
(iv)
If and are both 4-cycle networks, then and are distinguishable;
-
(v)
If is a double-triangle network and a single-triangle network or a tree, then ;
-
(vi)
If is a double-triangle network and is a 4-cycle network, then and are distinguishable;
-
(vii)
If and are both double-triangle networks and have different (leaf-labelled) unrooted skeletons, then and are distinguishable.
See Table 1 for an overview.
The proof of Lemma 1 will be given below. We first outline the proof strategy. Some parts of the lemma will follow immediately from results in Gross and Long (2017) and Hollering and Sullivant (2020). In Gross and Long (2017), the proofs were obtained by computing Gröbner bases for all of the ideals involved and then comparing the ideals. However, this was only possible because the constraints considered were the Jukes–Cantor constraints, the most restrictive that we consider. In Hollering and Sullivant (2020), the authors extend the results to the K2P and K3P constraints using a method based on the theory of algebraic matroids. This method is preferable when there are fewer constraints since the Gröbner bases computations are difficult if not impossible to carry out. Here, we find the required invariants by modifying this method slightly. Specifically, we apply the random search strategy described in that paper to locate small subsets of variables that are likely to contain distinguishing invariants. We then perform our computations in a much smaller subring of the original variables. This greatly reduces the size of the required computations and allows us to generate specific invariants without computing Gröbner bases for the ideals.
In order to reduce the total number of invariants required to prove each part, we take advantage of the symmetry between networks. As an example, suppose that the statement in part (vii) is false. Then there must exist two double-triangle networks with distinct skeletons, and , that are not distinguishable. All of the network varieties are parameterized, and hence irreducible as algebraic varieties, which means we may assume that if two networks are not distinguishable then one is contained in the other. Thus, without loss of generality, we may assume that , which implies the reverse inclusion of ideals, . Up to relabeling, every double-triangle network has the same unrooted skeleton. Thus, we can obtain any arbitrary double-triangle network from by permuting leaf labels. If we apply the same permutation to the leaf labels of , we obtain another double-triangle network for which . Since our choice of is arbitrary, if we can show that there is a single double-triangle network with ideal not contained in the ideal of any other double-triangle network, then we arrive at a contradiction, and have thus proven part (vii). Therefore, in order to prove part (vii), it will suffice to produce a single invariant that vanishes on exactly one of the double-triangle network varieties. A similar argument applies in each of the other parts.
In order to prove some parts of the lemma, we require two or more invariants to distinguish all of the relevant networks, though all parts can be proven using some combination of just the following six polynomial invariants:
In the supplementary Macaulay2 (Grayson and Stillman 2002) files, available at
https://github.com/colbyelong/DistinguishingLevel1PhylogeneticNetworks,
we provide the code to verify that these polynomials vanish or do not vanish on the referenced varieties as claimed.
Proof
(Proof of Lemma 1) Statement (i) is a well-known result for the JC, K2P, and K3P constraints and can be verified using the Small trees catalog (Casanellas et al. 2005). For the Jukes–Cantor constraints, (ii)-(iv) follow from Proposition 4.6, Corollary 4.8, and Corollary 4.9 in Gross and Long (2017).
To prove (ii) for the K2P and K3P constraints we require a set of invariants that vanishes on exactly one of the single-triangle networks. The set is confirmed to be such a set for both constraints in the supplementary files. Statements (iii) and (iv) are proven for the K2P and K3P models by Lemmas 28 and 29 in Hollering and Sullivant (2020).
To prove (v), we require a set of invariants that vanishes on one of the tree varieties, but on none of the double-triangle network varieties, and a set of invariants that vanishes on one of the single-triangle networks varieties, but on none of the double-triangle network varieties. The set is shown to be the required set for both parts under K2P and K3P, and the set works for the JC constraints.
For (vi), we must first show that there is a set of invariants that vanishes on one of the double-triangle network varieties but on none of the 4-cycle network varieties. The set works for all constraints and thus establishes that if is a double-triangle and is a 4-cycle network, then . We prove that , and hence that the networks are distinguishable, by constructing a set of invariants that vanishes on one of the 4-cycle network varieties but on none of the double-triangle network varieties. For the JC constraints, this set is . For the K2P and K3P constraints, this set is .
The invariant also establishes (vii), since it vanishes on exactly one of the double-triangle networks under JC, K2P, and K3P.
We also need a result on 4-leaf networks that does not fit into Table 1. To state this result we first need some definitions concerning the type of splits in a network.
Definition 4
For networks and , we say is a common split if is a split in both and ; it is non-trivial if . Two splits in and in are conflicting if are all non-empty.
Lemma 2
Let and be distinct 4-leaf level-1 semi-directed networks. If have conflicting splits, then and are distinguishable under the JC, K2P, or K3P constraints.
Proof
Note that 4-cycles have no non-trivial splits, so we just need to compare trees, single-triangle networks, and double-triangle networks. Moreover, Table 1 shows that we only need to verify that in the following cases:
-
(i)
when is a tree or triangle network and is a double-triangle network with a conflicting split and
-
(ii)
when is a tree and is a triangle network with a conflicting split.
The invariant can be used to verify case (i) for all three constraints. The invariant can be used to verify case (ii) for JC, and can be used to verify case (ii) for K2P and K3P.
Finally we require Lemma 3, which allows us to use the above small networks as building blocks to prove the claim about larger networks. To state Lemma 3, we first define the restriction of a network to a subset of leaves.
Definition 5
Let N be an n-leaf semi-directed network on and let . The restriction of N to A is the semi-directed network obtained by taking the union of all directed paths between leaves in A (where undirected edges are treated as bidirected) and then suppressing all degree two vertices and removing parallel edges.
Lemma 3 is essentially a one-way version of Proposition 4.3 from Gross and Long (2017), and we use a piece of the proof of that proposition below.
Lemma 3
Let and be distinct n-leaf semi-directed networks on and let . If , then .
Proof
Let . Then . In the proof of Proposition 4.3 from Gross and Long (2017), it is shown that if , then there exists a polynomial invariant f contained in , which implies that , and so .
Lemma 3 implies that in order to prove Theorem 2 it will suffice to show that for any distinct triangle-free level-1 semi-directed networks and , there either exists a set with such that and are distinguishable, or sets with such that and .
Combinatorial properties of triangle-free level-1 semi-directed networks
If is a partition of such that contains an split, then denote by N/X the network , for an arbitrary . Observe that the unrooted skeleton of does not depend on the choice of x. Observe also that .
Observation 1
If and are distinct n-leaf semi-directed networks and is a common split, then either and are distinct or and are distinct.
The next lemma follows immediately from Lemma 3 and the definition of .
Lemma 4
Let and be distinct n-leaf semi-directed networks on . Suppose is a partition of such that and both contain an split. If then .
Let be an n-leaf triangle-free level-1 semi-directed network on and C a cycle in . Let be the cut-edges incident to C. Then the partition induced by C is the partition of such that if and only if x is separated from C by . We say is below the reticulation vertex if is the edge incident to the reticulation vertex in C. If is below the reticulation vertex in C then we also say that x is below the reticulation vertex for any .
We say a set of three or more leaves meet at a cycle C if each leaf in appears in a different set of the partition induced by C. We say that they induce a cycle in if is a t-cycle network. Note that if the set of leaves induce a cycle then they must meet at a cycle, but the converse does not hold unless one of is below the reticulation vertex. As an example consider the network in Fig. 4a: meet at the cycle but do not induce a cycle, whereas also induce a cycle.
Observe that if () meet at a cycle, then they meet in exactly one cycle in N, i.e., this cycle is unique in N. Denote this cycle by . (Note that is not well-defined if do not all meet at a cycle.)
Let and be distinct n-leaf triangle-free level-1 semi-directed networks on , and let be a cycle in that induces a partition with below the reticulation vertex. Let be a cycle in that induces a partition , with below the reticulation vertex. We say that refines if is a refinement of , i.e., if and every pair of leaves a, b that are contained in different sets in also appear in different sets in . See Fig. 3.
We recall a combinatorial result from Baños (2019) on four-leaf induced cycles. We state the result using notation and terms from this paper.
Lemma 5
(Lemmas 12 and 13 of Baños (2019)) Let N be an n-leaf triangle-free level-1 semi-directed network on . If two distinct subsets of four leaves induce a 4-cycle, where three leaves in the two sets are the same, then the five leaves (the union of the two sets) meet at the same cycle. In other words, let be leaves of N such that and are both 4-cycle networks. Then and meet at the same cycle.
Lemma 6
Let and be distinct n-leaf triangle-free level-1 semi-directed networks on . Suppose that for any , if is a 4-cycle, then . Then every cycle in is refined by a cycle in .
Proof
Let be a cycle in that induces a partition with below the reticulation vertex. Choose any . As is a 4-cycle, is the same 4-cycle. So let . We claim that is the desired cycle of that refines .
To see this, first consider any where . Then a, b, c, d all meet at and so is well-defined. Since , we can replace a with and have that the set of leaves also meet at . By similar arguments, we also have that meet at and meet at . Moreover each of these sets of 4 leaves induces a cycle in (as d is below the reticulation vertex in ), and so also induce a cycle in . Thus we have that , , , are all 4-cycles, and in particular , , , are all well-defined. (See Fig. 4.) By Lemma 5, we must have that .
We thus have that for with , the set of leaves all meet at .
Now consider any two leaves such that appear in different sets in . By choosing additional leaves from other sets, such that one of is in , we have that where , for some . Then by the above we have that . In particular, appear in different sets in the partition induced by . This implies that the partition induced by is a refinement of the partition induced by . Moreover, observe that is below the reticulation vertex in if and only if (since the only element of below the reticulation vertex in is the one from ). Thus, the partition induced by is with below the reticulation and a refinement of . Therefore, refines .
Lemma 7
Suppose that every cycle in is refined by a cycle in . If has a non-trivial split, then either share a non-trivial common split or they have conflicting splits.
Proof
Let be a non-trivial split in . Fix an arbitrary , and take the edge e in furthest from b such that e separates b from A. If e separates A from B, then is a non-trivial common split and we are done.
Otherwise, let u be the vertex in e nearer to A. If u is on a cycle, then denote this cycle by . Let be the partition induced by , noting by construction that for the set containing b (since is the set of leaves reachable from C via e). If for any j, then the corresponding edge leaving C is an edge that is further away from b than e and which separates A from b, contradicting the choice of e. So we may assume that the partition must subdivide A (that is, A has non-empty intersection with at least two sets ).
Furthermore must subdivide B, as otherwise the set (which contains b) contains all of B and also none of A, which would imply that is a common split. So is a cycle in whose induced partition subdivides both A and B. As every cycle in is refined by a cycle in , this implies that some cycle in also subdivides both A and B. But this contradicts the fact that contains an split. (See Fig. 5a.)
If u is not on a cycle, let f and g be the other edges incident to u. By choice of e, neither f nor g can separate A from b. Thus there is at least one element reachable from u via f, and at least one element reachable from u via g. As e does not separate A from B, there is at least one that is reachable from u via either f or g, say (without loss of generality) f. Then let be the split induced by f, with Y the set containing b. Observe that while . Thus we have that are all non-empty, and so and have conflicting splits. (See Fig. 5b.)
Distinguishability of triangle-free level-1 networks
Theorem 2 follows as a corollary of the next lemma.
Lemma 8
Let and be distinct n-leaf triangle-free level-1 semi-directed networks on and . Then under the JC, K2P, and K3P constraints.
Proof
We prove the claim by induction on , the number of leaves in and . For the base case, if then either or . If , then and are both trees. If and , then and are both 4-cycles. If and , then is a 4-cycle network and is a tree. For each of these cases, by Lemma 1, it follows that . Note that we must have and as these networks are triangle-free. Thus, this covers all cases for .
So now assume that and that the claim is true for all smaller values of n. We first show that we may assume that any set of 4 leaves that induces a 4-cycle in induces the same 4-cycle in .
Indeed, suppose this is not the case, and consider some arbitrary with such that is a 4-cycle but is not the same 4-cycle. If is a different 4-cycle or a double-triangle, then by Lemma 1, and are distinguishable (and in particular, ).
Otherwise, is either a tree or a 3-cycle network, and Lemma 1 implies that . In either case, and hence, by Lemma 3, we have .
So we may now assume that any set of 4 leaves that induces a 4-cycle in induces the same 4-cycle in . By Lemma 6, this implies that every cycle in is refined by a cycle in . By Lemma 7, and must have either a non-trivial common split or conflicting splits, or must have no non-trivial split. It remains to complete the proof in these three cases.
Firstly, if have conflicting splits, then by Lemma 2 we have , as required.
Secondly, suppose that is a non-trivial common split, and consider , , as defined in the beginning of Sect. 4. Since , each of these networks has fewer than n leaves. Thus by the induction hypothesis, if are distinct and , then , from which it follows that . A similar argument holds if are distinct and . But at least one of these cases must hold. Indeed, since , it must hold that , or and . If (or ) then those networks are clearly distinct. Otherwise we have and . We must have that are distinct or are distinct, since and are distinct. Thus we either have that are distinct and , or are distinct and . In either case we have , as required.
Finally, suppose that has no non-trivial split. Then is an n-cycle network, that is, has a single cycle and every leaf is incident to a vertex on the cycle. If , then and are both networks with exactly one cycle of length at least four. It then follows from Theorem 1, together with Proposition 2, that and are distinguishable (and, in particular, ). If on the other hand , then consider two cycles and in , with the subset of below the reticulation in , and the subset of below the reticulation in . Since and are different cycles, . But then this contradicts the fact that every cycle in is refined by a cycle in , as the single cycle in would have to have both and as the set of leaves below the reticulation. Thus in all cases we have either a contradiction or , which completes the proof of Lemma 8.
We are now ready to prove Theorem 2, which we restate for convenience.
Theorem 2 The network parameter of a network-based Markov model under the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints is generically identifiable with respect to the class of models where the network parameter is an n-leaf triangle-free, level-1 semi-directed network with reticulation vertices.
Proof
Let be a class of triangle-free, level-1 network models with a fixed number of reticulation vertices. Let be distinct n-leaf triangle-free level-1 semi-directed networks on with . By invoking Lemma 8 twice, we have and under the JC, K2P, and K3P constraints. By definition, and are distinguishable; as and were chosen arbitrarily from , it follows that the semi-directed network parameter of is generically identifiable under the JC, K2P, and K3P constraints.
Discussion
We have shown that triangle-free level-1 semi-directed networks are generically identifiable under the Jukes–Cantor, Kimura 2-parameter, and Kimura 3-parameter constraints. This means that, given a long enough multiple sequence alignment that evolved on a network of this class under one of these models, this network is, with high probability, the only network from the class that coincides with the given data. Roughly speaking, this means that the data provide sufficient information to reconstruct the network. To prove this result, we employed a blend of algebraic and combinatorial methods to show that any pair of networks are geometrically distinguishable.
Previously, it had been shown that networks cannot be identified from certain substructures. For example, networks cannot be inferred from their displayed trees since more than one network can display the same set of trees (Gambette and Huber 2012; Pardi and Scornavacca 2015). Similarly, a network cannot in general be reconstructed from its collection of proper subnetworks, since two distinct networks can have exactly the same set of proper subnetworks (Huber et al. 2015). Nevertheless, for certain restricted network classes it has been shown that those networks can be uniquely reconstructed from their subnetworks (Huber and Moulton 2013; Huebler et al. 2019; van Iersel and Moulton 2014; Nipius 2020). These proofs are related to our combinatorial results, in that our proof strategy for showing network distinguishability involved careful examination of induced 4-leaf subnetworks. However, there are some fundamental differences that prevent directly using known results on building networks from subnetworks. Firstly, the existing results focus either on directed (e.g. van Iersel and Moulton (2014)) or on undirected (e.g. van Iersel and Moulton (2018)) networks. Our results, as well as the ones in Allman et al. (2019), Baños (2019), Huebler et al. (2019), provide the first combinatorial results on semi-directed networks. The main obstacle, however, was that not all 4-leaf level-1 semi-directed networks are distinguishable under the considered models. Hence, two networks can be indistinguishable even if the sets of induced subnetworks are distinct. Consequently, we had a severely restricted set of building blocks available, requiring a combination of combinatorial and algebraic techniques.
On the algebraic front, the computations reveal differences between the relationships between the network ideals under the JC constraints and the relationships between the ideals under the K2P and K3P constraints that would be interesting directions for further exploration. In Hollering and Sullivant (2020), the authors remark that the phenomenon observed in Gross and Long (2017) under the JC constraints, where each triangle network variety is contained within several of the 4-cycle network varieties, does not occur under the K2P and K3P constraints. In other words, under the K2P and K3P constraints, 4-cycle networks and triangle networks are distinguishable. In our computations for this paper, we noticed another phenomenon that seems to only hold for JC constraints. In particular, under the JC constraints, the ideals for the double-triangle networks and the 4-cycle networks are of the same dimension and are all distinct. This is somewhat surprising as one might expect the additional reticulation vertex and associated reticulation parameters of the double-triangle network to increase the dimension of the model. Our numerical computations suggest that this is another unique feature of the JC constraints. However, establishing this result rigorously may require other methods, since we were unable to compute full generating sets for the vanishing ideals of the networks under the K2P and K3P constraints.
Additionally, from an algebraic perspective, we note that adapting the random search strategy described in Hollering and Sullivant (2020) is what allowed us to find candidate subsets of variables for locating the necessary invariants to establish our main result. Something similar will likely need to be employed if these results are to be extended to other families of networks. It would be interesting to understand the relative computational costs once a candidate subset of variables is found, of either computing invariants in a subring of the original variables as we did, or of computing the linear matroid of the Jacobian with symbolic parameters as was done in Hollering and Sullivant (2020).
Finally, a major open problem, which is the larger setting for this paper, is to determine whether generic identifiability results such as these can be extended to higher level networks. We expect finding the necessary invariants for the increased number of non-unique induced 4-leaf subnetworks will be challenging. Furthermore, the complexity of the combinatorial part of the proof will explode for higher levels. This question is open not only for the group-based models studied in this paper, but also for the general Markov model, which has just started to be studied in the context of networks Casanellas and Fernández-Sánchez (2020).
Footnotes
Leo van Iersel, Remie Janssen, Mark Jones and Yukihiro Murakami were partly supported by the Netherlands Organization for Scientific Research (NWO), Vidi Grant 639.072.602 and Mark Jones also by the gravitation Grant NETWORKS. Elizabeth Gross was supported by the National Science Foundation (NSF), DMS-1945584.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Elizabeth Gross, Email: egross@hawaii.edu.
Leo van Iersel, Email: L.J.J.vanIersel@tudelft.nl.
Remie Janssen, Email: R.Janssen-2@tudelft.nl.
Mark Jones, Email: M.E.L.Jones@tudelft.nl.
Colby Long, Email: clong@wooster.edu.
Yukihiro Murakami, Email: Y.Murakami@tudelft.nl, Email: yukimurakami07201994@gmail.com.
References
- Allman ES, Rhodes JA. The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J Comput Biol. 2006;13(5):1101–1113. doi: 10.1089/cmb.2006.13.1101. [DOI] [PubMed] [Google Scholar]
- Allman ES, Petrović S, Rhodes JA, Sullivant S. Identifiability of 2-tree mixtures for group-based models. IEEE/ACM Trans Comp Biol Bioinform. 2011;8(3):710–722. doi: 10.1109/TCBB.2010.79. [DOI] [PubMed] [Google Scholar]
- Allman ES, Baños H, Rhodes JA. Nanuq: a method for inferring species networks from gene trees under the coalescent model. Algorithms Mol Biol. 2019;14(1):24. doi: 10.1186/s13015-019-0159-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baños H. Identifying species network features from gene tree quartets under the coalescent model. Bull Math Biol. 2019;81(2):494–534. doi: 10.1007/s11538-018-0485-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bapteste E, van Iersel L, Janke A, Kelchner S, Kelk S, McInerney JO, Morrison DA, Nakhleh L, Steel M, Stougie L, et al. Networks: expanding evolutionary thinking. Trends Genet. 2013;29(8):439–441. doi: 10.1016/j.tig.2013.05.007. [DOI] [PubMed] [Google Scholar]
- Baroni M, Semple C, Steel M. A framework for representing reticulate evolution. Ann Comb. 2005;8(4):391–408. doi: 10.1007/s00026-004-0228-0. [DOI] [Google Scholar]
- Bryant D, Moulton V. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol Biol Evol. 2004;21(2):255–265. doi: 10.1093/molbev/msh018. [DOI] [PubMed] [Google Scholar]
- Cardona G, Rosseló F, Valiente G. Comparison of tree-child phylogenetic networks. IEEE/ACM Trans Comp Biol Bioinform. 2007;6:552–569. doi: 10.1109/TCBB.2007.70270. [DOI] [PubMed] [Google Scholar]
- Casanellas M, Fernández-Sánchez J (2020) Rank conditions on phylogenetic networks. arXiv preprint arXiv:2004.12988
- Casanellas M, Garcia LD, Sullivant S. Catalog of small trees. Cambridge: Cambridge University Press; 2005. pp. 291–304. [Google Scholar]
- Chang J. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math Biosci. 1996;137(1):51–73. doi: 10.1016/S0025-5564(96)00075-2. [DOI] [PubMed] [Google Scholar]
- Evans S, Speed T. Invariants of some probability models used in phylogenetic inference. Ann Stat. 1993;21(1):355–377. doi: 10.1214/aos/1176349030. [DOI] [Google Scholar]
- Francis A, Moulton V. Identifiability of tree-child phylogenetic networks under a probabilistic recombination–mutation model of evolution. J Theor Biol. 2018;446:160–167. doi: 10.1016/j.jtbi.2018.03.011. [DOI] [PubMed] [Google Scholar]
- Gambette P, Huber KT. On encodings of phylogenetic networks of bounded level. J Math Biol. 2012;65(1):157–180. doi: 10.1007/s00285-011-0456-y. [DOI] [PubMed] [Google Scholar]
- Grayson D, Stillman M (2002) Macaulay2, a software system for research in algebraic geoemetry. Available at http://www.math.uiuc.edu/Macaulay2/
- Gross EK, Long C. Distinguishing phylogenetic networks. SIAM J Appl Algebra Geom. 2017;2:72–93. doi: 10.1137/17M1134238. [DOI] [Google Scholar]
- Gusfield D. ReCombinatorics: the algorithmics of ancestral recombination graphs and explicit phylogenetic networks. Cambridge: MIT press; 2014. [Google Scholar]
- Hendy MD, Penny D. Complete families of linear invariants for some stochastic models of sequence evolution, with and without molecular clock assumptions. J Comput Biol. 1996;3:19–32. doi: 10.1089/cmb.1996.3.19. [DOI] [PubMed] [Google Scholar]
- Hollering B, Sullivant S. Identifiability in phylogenetics using algebraic matroids. J Symb Comput. 2020 doi: 10.1016/j.jsc.2020.04.012. [DOI] [Google Scholar]
- Huber KT, Moulton V. Encoding and constructing 1-nested phylogenetic networks with trinets. Algorithmica. 2013;66(3):714–738. doi: 10.1007/s00453-012-9659-x. [DOI] [Google Scholar]
- Huber KT, van Iersel L, Kelk S, Suchecki R. A practical algorithm for reconstructing level-1 phylogenetic networks. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 2011;8(3):635–649. doi: 10.1109/TCBB.2010.17. [DOI] [PubMed] [Google Scholar]
- Huber KT, Van Iersel L, Moulton V, Wu T. How much information is needed to infer reticulate evolutionary histories? Syst Biol. 2015;64(1):102–111. doi: 10.1093/sysbio/syu076. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huber KT, van Iersel L, Janssen R, Jones M, Moulton V, Murakami Y, Semple C (2019) Rooting for phylogenetic networks. ArXiv preprint arXiv:1906.07430
- Huebler S, Morris R, Rusinko J, Tao Y (2019) Constructing semi-directed level-1 phylogenetic networks from quarnets. ArXiv preprint arXiv:1910.00048
- Huson DH, Rupp R, Scornavacca C. Phylogenetic networks: concepts, algorithms and applications. Cambridge: Cambridge University Press; 2010. [Google Scholar]
- Jin G, Nakhleh L, Snir S, Tuller T. Efficient parsimony-based methods for phylogenetic network reconstruction. Bioinformatics. 2007;23(2):e123–e128. doi: 10.1093/bioinformatics/btl313. [DOI] [PubMed] [Google Scholar]
- Long C, Kubatko L. Identifiability and reconstructibility of a modified coalescent. Bull Math Biol. 2018;81:408. doi: 10.1007/s11538-018-0456-9. [DOI] [PubMed] [Google Scholar]
- Nakhleh L. Problem solving handbook in computational biology and bioinformatics. Berlin: Springer Science+Business Media, LLC; 2011. pp. 125–158. [Google Scholar]
- Nakhleh L, Ruths D, Wang LS (2005) Riata-HGT: a fast and accurate heuristic for reconstructing horizontal gene transfer. In: International computing and combinatorics conference. Springer, pp 84–93
- Nipius L (2020) Rooted binary level-3 phylogenetic networks are encoded by quarnets. Bachelor’s thesis, Delft University of Technology. http://resolver.tudelft.nl/uuid:a9c5a8d4-bc8b-4d15-bdbb-3ed35a9fb75d
- Pardi F, Scornavacca C. Reconstructible phylogenetic networks: do not distinguish the indistinguishable. PLoS Comput Biol. 2015;11(4):e1004135. doi: 10.1371/journal.pcbi.1004135. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rhodes JA, Sullivant S. Identifiability of large phylogenetic mixtures. Bull Math Biol. 2012;74(1):212–231. doi: 10.1007/s11538-011-9672-2. [DOI] [PubMed] [Google Scholar]
- Rossello F, Valiente G. All that glisters is not galled. Math Biosci. 2009;221:54–59. doi: 10.1016/j.mbs.2009.06.007. [DOI] [PubMed] [Google Scholar]
- Semple C, Steel M. Phylogenetics. Oxford: Oxford University Press on Demand; 2003. [Google Scholar]
- Solís-Lemus C, Ané C. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genet. 2016;12(3):e1005896. doi: 10.1371/journal.pgen.1005896. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Solís-Lemus C, Bastide P, Ané C. Phylonetworks: a package for phylogenetic networks. Mol Biol Evol. 2017;34(12):3292–3298. doi: 10.1093/molbev/msx235. [DOI] [PubMed] [Google Scholar]
- Solis-Lemus C, Coen A, Ane C (2020) On the identifiability of phylogenetic networks under a pseudolikelihood model. ArXiv preprint arXiv:2010.01758
- Steel M. Phylogeny: discrete and random processes in evolution. New Delhi: SIAM; 2016. [Google Scholar]
- Sturmfels B, Sullivant S. Toric ideals of phylogenetic invariants. J Comput Biol. 2005;12(2):204–228. doi: 10.1089/cmb.2005.12.204. [DOI] [PubMed] [Google Scholar]
- Sullivant S. Algebraic statistics. USA: American Mathematical Soc; 2018. [Google Scholar]
- Than C, Ruths D, Nakhleh L. Phylonet: a software package for analyzing and reconstructing reticulate evolutionary histories. BMC Bioinform. 2008;9:322. doi: 10.1186/1471-2105-9-322. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thatte BD. Reconstructing pedigrees: some identifiability questions for a recombination-mutation model. J Math Biol. 2013;66(1):37–74. doi: 10.1007/s00285-011-0503-8. [DOI] [PubMed] [Google Scholar]
- van Iersel L, Moulton V. Trinets encode tree-child and level-2 phylogenetic networks. J Math Biol. 2014;68(7):1707–1729. doi: 10.1007/s00285-013-0683-5. [DOI] [PubMed] [Google Scholar]
- van Iersel L, Moulton V. Leaf-reconstructibility of phylogenetic networks. SIAM J Discrete Math. 2018;32(3):2047–2066. doi: 10.1137/17M1111930. [DOI] [Google Scholar]
- Wen D, Yu Y, Zhu J, Nakhleh L. Inferring phylogenetic networks using phylonet. Syst Biol. 2018;67(4):735–740. doi: 10.1093/sysbio/syy015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang J, Grünewald S, Xu Y, Wan XF. Quartet-based methods to reconstruct phylogenetic networks. BMC Syst Biol. 2014;8(1):21. doi: 10.1186/1752-0509-8-21. [DOI] [PMC free article] [PubMed] [Google Scholar]